mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-17 01:30:14 +00:00
add step2
This commit is contained in:
@@ -0,0 +1,141 @@
|
||||
# Temporal Sampling Strategy Visualization
|
||||
|
||||
## How `--sample-interval` Works
|
||||
|
||||
### Example: 30 fps dataset, `--sample-interval 1.0` (1 second)
|
||||
|
||||
```
|
||||
Timeline (seconds): 0.0 0.5 1.0 1.5 2.0 2.5 3.0
|
||||
│ │ │ │ │ │ │
|
||||
Frames: 0───15───30───45───60───75───90───105──120──135──150
|
||||
│ │ │ │ │ │ │
|
||||
▼ ▼ ▼ ▼
|
||||
Sampled: YES NO YES NO YES NO YES
|
||||
│ │ │ │
|
||||
Task Index: [0]──────────────>[1]──────────────>[2]──────────────>[3]
|
||||
│ │ │ │
|
||||
VLM Called: ✓ Gen ✓ Gen ✓ Gen ✓ Gen
|
||||
dialogue dialogue dialogue dialogue
|
||||
│ │ │ │
|
||||
Frames 0-29 ─────┘ │ │ │
|
||||
get task 0 │ │ │
|
||||
│ │ │
|
||||
Frames 30-59 ────────────────────────┘ │ │
|
||||
get task 1 │ │
|
||||
│ │
|
||||
Frames 60-89 ──────────────────────────────────────────┘ │
|
||||
get task 2 │
|
||||
│
|
||||
Frames 90-119 ────────────────────────────────────────────────────────────┘
|
||||
get task 3
|
||||
```
|
||||
|
||||
## Comparison: Different Sampling Intervals
|
||||
|
||||
### `--sample-interval 2.0` (every 2 seconds)
|
||||
```
|
||||
Timeline: 0.0 1.0 2.0 3.0 4.0 5.0 6.0
|
||||
│ │ │ │ │ │ │
|
||||
Sampled: YES NO YES NO YES NO YES
|
||||
│ │ │ │
|
||||
Tasks: [0]───────────────>[1]───────────────>[2]───────────────>[3]
|
||||
|
||||
VLM Calls: 4 (fewer calls, faster but less granular)
|
||||
```
|
||||
|
||||
### `--sample-interval 1.0` (every 1 second) - **DEFAULT**
|
||||
```
|
||||
Timeline: 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
Sampled: YES NO YES NO YES NO YES NO YES NO YES NO YES
|
||||
│ │ │ │ │ │ │
|
||||
Tasks: [0]─────────>[1]─────────>[2]─────────>[3]─────────>[4]─────────>[5]─────>[6]
|
||||
|
||||
VLM Calls: 7 (balanced coverage and speed)
|
||||
```
|
||||
|
||||
### `--sample-interval 0.5` (every 0.5 seconds)
|
||||
```
|
||||
Timeline: 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
Sampled: YES YES YES YES YES YES YES YES YES YES YES YES YES
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
Tasks: [0]─>[1]─>[2]─>[3]─>[4]─>[5]─>[6]─>[7]─>[8]─>[9]─>[10]>[11]>[12]
|
||||
|
||||
VLM Calls: 13 (high granularity, slower but more detailed)
|
||||
```
|
||||
|
||||
## Episode Boundaries
|
||||
|
||||
The script always samples the **first frame** of each episode:
|
||||
|
||||
```
|
||||
Episode 0 Episode 1 Episode 2
|
||||
├─────────────────────────────────┤├─────────────────────────────────┤├──────...
|
||||
│ ││ ││
|
||||
Frame: 0 30 60 90 120 130 160 190 220 250 260 290 320
|
||||
Time: 0.0 1.0 2.0 3.0 4.0 0.0 1.0 2.0 3.0 4.0 0.0 1.0 2.0
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
|
||||
Sample:YES YES YES YES YES YES YES YES YES YES YES YES YES
|
||||
│ │ │ │ │ │ │ │ │ │ │ │ │
|
||||
Task: 0────1─────2─────3────4 5─────6─────7─────8────9 10────11───12
|
||||
|
||||
Note: Frames 0, 130, 260 are ALWAYS sampled (episode starts)
|
||||
Even if they're within the sample-interval window
|
||||
```
|
||||
|
||||
## Real-World Example: svla_so101_pickplace Dataset
|
||||
|
||||
Typical stats:
|
||||
- **Total episodes**: 50
|
||||
- **Avg episode length**: 300 frames (10 seconds at 30 fps)
|
||||
- **Total frames**: 15,000
|
||||
|
||||
### Without Sampling (every frame)
|
||||
```
|
||||
Frames processed: 15,000
|
||||
VLM calls: 15,000
|
||||
Time estimate: ~5 hours
|
||||
Unique tasks: ~12,000 (lots of duplicates)
|
||||
```
|
||||
|
||||
### With `--sample-interval 1.0` (every 1 second)
|
||||
```
|
||||
Frames processed: 15,000 ✓
|
||||
VLM calls: 500
|
||||
Time estimate: ~10 minutes
|
||||
Unique tasks: ~450 (meaningful variety)
|
||||
Efficiency gain: 30x faster
|
||||
```
|
||||
|
||||
### With `--sample-interval 2.0` (every 2 seconds)
|
||||
```
|
||||
Frames processed: 15,000 ✓
|
||||
VLM calls: 250
|
||||
Time estimate: ~5 minutes
|
||||
Unique tasks: ~220
|
||||
Efficiency gain: 60x faster
|
||||
```
|
||||
|
||||
## Key Points
|
||||
|
||||
1. **All frames get labeled**: Every frame gets a `task_index_high_level`
|
||||
2. **Only sampled frames call VLM**: Huge efficiency gain
|
||||
3. **Temporal coherence**: Nearby frames share the same task
|
||||
4. **Episode-aware**: Always samples episode starts
|
||||
5. **Configurable**: Adjust `--sample-interval` based on your needs
|
||||
|
||||
## Choosing Your Sampling Interval
|
||||
|
||||
| Use Case | Recommended Interval | Why |
|
||||
|----------|---------------------|-----|
|
||||
| Quick testing | 2.0s | Fastest iteration |
|
||||
| Standard training | 1.0s | Good balance |
|
||||
| High-quality dataset | 0.5s | Better coverage |
|
||||
| Fine-grained control | 0.33s | Very detailed |
|
||||
| Dense annotations | 0.1s | Nearly every frame |
|
||||
|
||||
**Rule of thumb**: Match your sampling interval to your typical skill duration.
|
||||
If skills last 1-3 seconds, sampling every 1 second captures each skill multiple times.
|
||||
|
||||
Reference in New Issue
Block a user