# Example: Synthetic Data Generation with Sampling

## Quick Start

### 1. Test with 100 frames and 1 second sampling

```bash
python examples/dataset/annotate_pgen.py \
  --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --num-samples 100 \
  --sample-interval 1.0 \
  --output-dir ./outputs/test_pgen
```

**Expected behavior** (assuming 30 fps):

- Total frames: 100
- Frames sampled: 4 (frames 0, 30, 60, and 90; one every 30 frames = 1 second)
- Efficiency: 96% fewer VLM calls
- Output: all 100 frames get a `task_index_high_level`, but only 4 unique dialogues are generated

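The arithmetic behind these numbers can be checked with a short sketch. It assumes sampling uses a fixed frame stride of `fps * sample_interval` starting at frame 0; the helper name `sampled_frame_indices` is hypothetical, and the actual script may sample by timestamp instead.

```python
def sampled_frame_indices(num_frames: int, fps: float, sample_interval: float) -> list[int]:
    """Return the frame indices that would trigger a VLM call (assumed fixed stride)."""
    # Number of frames between two samples; never less than 1.
    step = max(1, round(fps * sample_interval))
    return list(range(0, num_frames, step))


indices = sampled_frame_indices(num_frames=100, fps=30, sample_interval=1.0)
saved = 1 - len(indices) / 100

print(indices)                         # [0, 30, 60, 90]
print(f"{saved:.0%} fewer VLM calls")  # 96% fewer VLM calls
```
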
### 2. Process full dataset with different sampling rates

#### Conservative (every 2 seconds)

```bash
python examples/dataset/annotate_pgen.py \
  --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --sample-interval 2.0 \
  --output-dir ./outputs/pgen_2s
```

#### Standard (every 1 second) - **RECOMMENDED**

```bash
python examples/dataset/annotate_pgen.py \
  --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --sample-interval 1.0 \
  --output-dir ./outputs/pgen_1s
```

#### Fine-grained (every 0.5 seconds)

```bash
python examples/dataset/annotate_pgen.py \
  --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --sample-interval 0.5 \
  --output-dir ./outputs/pgen_0.5s
```

## Performance Estimates

For a dataset with:

- 100 episodes
- 10 seconds per episode (average)
- 30 fps
- Total frames: 30,000

| Sampling Interval | Frames Sampled | % Sampled | Speedup | Time Estimate |
|-------------------|----------------|-----------|---------|---------------|
| Every frame (0.033s) | 30,000 | 100% | 1x | ~10 hours |
| 0.5 seconds | 2,000 | 6.7% | 15x | ~40 min |
| **1.0 seconds** | **1,000** | **3.3%** | **30x** | **~20 min** |
| 2.0 seconds | 500 | 1.7% | 60x | ~10 min |

*Note: Times are approximate and depend on GPU, model size, and generation speed.*

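The table rows can be reproduced with a rough sketch. It assumes that total time scales linearly with the number of VLM calls, and it derives the per-call cost from the "~10 hours for every frame" baseline (about 1.2 s per call). The function name `estimate` and both constants are illustrative, not taken from the script.

```python
TOTAL_FRAMES = 100 * 10 * 30                 # episodes x seconds x fps = 30,000
SECONDS_PER_CALL = 10 * 3600 / TOTAL_FRAMES  # assumed: ~10 h for all frames -> 1.2 s per call


def estimate(sample_interval: float, fps: int = 30) -> dict:
    """Estimate sampled frames, share, speedup, and minutes for one interval."""
    frames_per_sample = max(1, round(sample_interval * fps))
    sampled = TOTAL_FRAMES // frames_per_sample
    return {
        "sampled": sampled,
        "percent": 100 * sampled / TOTAL_FRAMES,
        "speedup": TOTAL_FRAMES / sampled,
        "minutes": sampled * SECONDS_PER_CALL / 60,
    }


for interval in (0.5, 1.0, 2.0):
    row = estimate(interval)
    print(
        f"{interval}s: {row['sampled']} frames, {row['percent']:.1f}%, "
        f"{row['speedup']:.0f}x, ~{row['minutes']:.0f} min"
    )
```

On real hardware the per-call cost varies with prompt length and generation length, so treat the output as an order-of-magnitude guide only.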
## Understanding the Output

### Console Output Example

```
[cyan]Generating synthetic data for 30000 frames...[/cyan]
[cyan]Sampling interval: 1.0s (fps: 30)[/cyan]
Generating synthetic dialogue: 100%|████████| 30000/30000 [20:15<00:00, 24.68it/s]
[green]✓ Sampled 1000 frames out of 30000 (3.3%)[/green]
[green]✓ Generated 450 unique high-level tasks[/green]
```

### What happens:

1. **Frame 0 (t=0.0s)**: Generate dialogue → Task index 0
2. **Frames 1-29 (t=0.033s-0.967s)**: Reuse task index 0
3. **Frame 30 (t=1.0s)**: Generate new dialogue → Task index 1
4. **Frames 31-59 (t=1.033s-1.967s)**: Reuse task index 1
5. And so on...

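The steps above amount to a sample-and-hold assignment, which can be sketched as follows. This is a simplified illustration, not the script's implementation: a counter stands in for the VLM call, every sampled frame is assumed to produce a new task index, and episode boundaries are ignored. The console example above reports fewer unique tasks than sampled frames, so the real script presumably deduplicates identical tasks. The name `assign_task_indices` is hypothetical.

```python
def assign_task_indices(num_frames: int, fps: float, sample_interval: float) -> list[int]:
    """Give every frame the task index of the most recent sampled frame."""
    step = max(1, round(fps * sample_interval))
    task_indices = []
    current = -1
    for frame in range(num_frames):
        if frame % step == 0:
            # Sampled frame: this is where the VLM would generate a new dialogue.
            current += 1
        # Non-sampled frames reuse the index from the last sampled frame.
        task_indices.append(current)
    return task_indices


indices = assign_task_indices(num_frames=90, fps=30, sample_interval=1.0)
print(indices[0], indices[29], indices[30], indices[59], indices[60])  # 0 0 1 1 2
```
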
### Result:

- Every frame has a `task_index_high_level`
- Dialogues are generated only for sampled frames
- Intermediate frames inherit the task index of the most recent sample
- Task labels stay temporally coherent within each episode

## Checking Your Results

After running, verify the output:

```bash
# Check the generated tasks
python -c "
import pandas as pd

tasks = pd.read_parquet('outputs/test_pgen/meta/tasks_high_level.parquet')
print(f'Total unique tasks: {len(tasks)}')
print('Sample tasks:')
print(tasks[['user_prompt', 'robot_utterance', 'skill']].head())
"

# Check debug output
head outputs/test_pgen/meta/syn_annotations.jsonl

# Load and verify dataset
python -c "
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset(repo_id='local_with_high_level_tasks',
                    root='outputs/test_pgen')
print(f'Dataset has {len(ds)} frames')
print(f'Features: {list(ds.features.keys())}')
assert 'task_index_high_level' in ds.features
print('✓ task_index_high_level feature added successfully')
"
```

## Common Use Cases

### Development/Testing

```bash
--sample-interval 2.0  # Fast iteration
--num-samples 500      # Small subset
```

### Production Training

```bash
--sample-interval 1.0  # Good coverage
# Process all samples (no --num-samples)
```

### High-Quality Dataset

```bash
--sample-interval 0.5  # Fine-grained
--temperature 0.6      # More consistent
--model Qwen/Qwen3-VL-30B-A3B-Instruct  # Larger model
```

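If you switch between these configurations often, the flag sets can be kept in one place and expanded into a full command. The sketch below is a hypothetical convenience wrapper, not part of the repository: `PRESETS`, `build_command`, and the placeholder paths are illustrative, and it uses only the flags and script path shown in this document. The default model is the one from the Quick Start.

```python
import shlex

# Hypothetical preset table; values are copied from the use cases above.
PRESETS = {
    "dev": {"--sample-interval": "2.0", "--num-samples": "500"},
    "production": {"--sample-interval": "1.0"},
    "high-quality": {
        "--sample-interval": "0.5",
        "--temperature": "0.6",
        "--model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    },
}


def build_command(preset: str, data_dir: str, output_dir: str) -> list[str]:
    """Expand a preset into an argv list for annotate_pgen.py."""
    # Start from the Quick Start model; a preset may override it.
    args = {"--model": "Qwen/Qwen2-VL-7B-Instruct", **PRESETS[preset]}
    cmd = ["python", "examples/dataset/annotate_pgen.py", "--data-dir", data_dir]
    for flag, value in args.items():
        cmd += [flag, value]
    cmd += ["--output-dir", output_dir]
    return cmd


# Placeholder paths; replace with your own dataset and output locations.
print(shlex.join(build_command("dev", "/path/to/dataset", "./outputs/pgen_dev")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.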