lerobot/examples/dataset/example_pgen_usage.md
Jade Choghari c8eee4ea16 add step2
2025-12-09 12:28:46 +00:00

Example: Synthetic Data Generation with Sampling

Quick Start

1. Test with 100 frames and 1 second sampling

python examples/dataset/annotate_pgen.py \
    --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --num-samples 100 \
    --sample-interval 1.0 \
    --output-dir ./outputs/test_pgen

Expected behavior (assuming 30 fps):

  • Total frames: 100
  • Frames sampled: ~4 (every 30 frames = 1 second)
  • Efficiency: 96% fewer VLM calls
  • Output: all 100 frames receive a task_index_high_level, but only 4 dialogues are generated
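
The arithmetic behind these numbers can be sketched as follows. This is a minimal illustration, assuming sampling starts at frame 0 and then takes one frame every `round(fps * sample_interval)` frames. The helper name `count_sampled_frames` is hypothetical and is not part of `annotate_pgen.py`.

```python
import math


def count_sampled_frames(num_frames: int, fps: float, sample_interval: float) -> int:
    """Count frames that would trigger a VLM call.

    Assumption: frame 0 is sampled, then every `stride` frames after it.
    """
    stride = max(1, round(fps * sample_interval))  # frames between two samples
    return math.ceil(num_frames / stride)


num_frames, fps, interval = 100, 30, 1.0
sampled = count_sampled_frames(num_frames, fps, interval)
saved = 1 - sampled / num_frames  # fraction of VLM calls avoided
print(f"sampled={sampled}, fewer VLM calls: {saved:.0%}")
```

With 100 frames at 30 fps and a 1.0 second interval, the sampled frames are 0, 30, 60, and 90, which matches the ~4 calls listed above.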

2. Process full dataset with different sampling rates

Conservative (every 2 seconds)

python examples/dataset/annotate_pgen.py \
    --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --sample-interval 2.0 \
    --output-dir ./outputs/pgen_2s

Standard (every 1 second)

python examples/dataset/annotate_pgen.py \
    --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --sample-interval 1.0 \
    --output-dir ./outputs/pgen_1s

Fine-grained (every 0.5 seconds)

python examples/dataset/annotate_pgen.py \
    --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --sample-interval 0.5 \
    --output-dir ./outputs/pgen_0.5s

Performance Estimates

For a dataset with:

  • 100 episodes
  • 10 seconds per episode (average)
  • 30 fps
  • Total frames: 30,000

| Sampling Interval | Frames Sampled | % Sampled | Speedup | Time Estimate |
|---|---|---|---|---|
| Every frame (0.033s) | 30,000 | 100% | 1x | ~10 hours |
| 0.5 seconds | 2,000 | 6.7% | 15x | ~40 min |
| 1.0 seconds | 1,000 | 3.3% | 30x | ~20 min |
| 2.0 seconds | 500 | 1.7% | 60x | ~10 min |

Note: Times are approximate and depend on the GPU, model size, and generation speed.
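
To adapt these estimates to a different dataset, the figures can be recomputed with a short script. This is only a sketch: the per-call cost of 1.2 seconds is an assumption derived from the every-frame baseline above (10 hours for 30,000 calls), and `estimate_row` is a hypothetical helper, not something the script provides.

```python
import math


def estimate_row(episodes, seconds_per_episode, fps, sample_interval, seconds_per_call):
    """Estimate VLM calls and wall time for one sampling interval.

    Assumption: each episode is sampled independently, starting at its first frame.
    """
    frames_per_episode = round(seconds_per_episode * fps)
    stride = max(1, round(fps * sample_interval))  # frames between two samples
    sampled = episodes * math.ceil(frames_per_episode / stride)
    total = episodes * frames_per_episode
    return {
        "sampled": sampled,
        "percent": 100 * sampled / total,
        "speedup": total / sampled,
        "minutes": sampled * seconds_per_call / 60,
    }


SECONDS_PER_CALL = 1.2  # assumed: 10 hours / 30,000 calls; measure on your own hardware

for interval in (1 / 30, 0.5, 1.0, 2.0):
    row = estimate_row(100, 10, 30, interval, SECONDS_PER_CALL)
    print(
        f"{interval:5.3f}s  {row['sampled']:>6,} frames  "
        f"{row['percent']:5.1f}%  {row['speedup']:4.0f}x  ~{row['minutes']:.0f} min"
    )
```

Replacing `SECONDS_PER_CALL` with a value measured on your own GPU and model gives a more realistic estimate than the figures above.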

Understanding the Output

Console Output Example

[cyan]Generating synthetic data for 30000 frames...[/cyan]
[cyan]Sampling interval: 1.0s (fps: 30)[/cyan]
Generating synthetic dialogue: 100%|████████| 30000/30000 [20:15<00:00, 24.68it/s]
[green]✓ Sampled 1000 frames out of 30000 (3.3%)[/green]
[green]✓ Generated 450 unique high-level tasks[/green]

What happens:

  1. Frame 0 (t=0.0s): Generate dialogue → Task index 0
  2. Frames 1-29 (t=0.033s-0.967s): Reuse task index 0
  3. Frame 30 (t=1.0s): Generate new dialogue → Task index 1
  4. Frames 31-59 (t=1.033s-1.967s): Reuse task index 1
  5. And so on...

Result:

  • Every frame has a task_index_high_level
  • Dialogues are generated only for sampled frames
  • Intermediate frames inherit the task index of the most recent sample
  • Temporal coherence is maintained within each episode
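
The sample-and-reuse behavior described above can be sketched for a single episode as follows. This is an illustration under stated assumptions, not the code in `annotate_pgen.py`: `assign_task_indices` and the `generate` callback are hypothetical names, and the stub below returns a fresh index per sample, whereas the real script may map identical dialogues to the same index.

```python
from typing import Callable, List


def assign_task_indices(
    num_frames: int,
    fps: float,
    sample_interval: float,
    generate: Callable[[int], int],
) -> List[int]:
    """Give every frame a task index, calling `generate` only on sampled frames."""
    stride = max(1, round(fps * sample_interval))  # frames between two samples
    indices: List[int] = []
    current = -1  # overwritten at frame 0, which is always sampled
    for frame in range(num_frames):
        if frame % stride == 0:
            current = generate(frame)  # stands in for the expensive VLM call
        indices.append(current)  # other frames reuse the latest index
    return indices


calls: List[int] = []


def fake_generate(frame: int) -> int:
    """Stub for the VLM: record the frame and return a new task index."""
    calls.append(frame)
    return len(calls) - 1


indices = assign_task_indices(65, fps=30, sample_interval=1.0, generate=fake_generate)
print(calls)                                  # frames that triggered generation
print(indices[0], indices[29], indices[30])   # index before and after the boundary
```

For a multi-episode dataset, one would presumably run this per episode so that no frame inherits an index from a previous episode.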

Checking Your Results

After running, verify the output:

# Check the generated tasks
python -c "
import pandas as pd

tasks = pd.read_parquet('outputs/test_pgen/meta/tasks_high_level.parquet')
print(f'Total unique tasks: {len(tasks)}')
print('Sample tasks:')
print(tasks[['user_prompt', 'robot_utterance', 'skill']].head())
"

# Check debug output
head outputs/test_pgen/meta/syn_annotations.jsonl

# Load and verify dataset
python -c "
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset(repo_id='local_with_high_level_tasks',
                    root='outputs/test_pgen')
print(f'Dataset has {len(ds)} frames')
print(f'Features: {list(ds.features.keys())}')
assert 'task_index_high_level' in ds.features
print('✓ task_index_high_level feature added successfully!')
"

Common Use Cases

Development/Testing

--sample-interval 2.0  # Fast iteration
--num-samples 500      # Small subset

Production Training

--sample-interval 1.0  # Good coverage
# Process all samples (no --num-samples)

High-Quality Dataset

--sample-interval 0.5  # Fine-grained
--temperature 0.6      # More consistent
--model Qwen/Qwen3-VL-30B-A3B-Instruct  # Larger model
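
If you switch between these configurations often, the flag sets can be kept in one place and expanded into a full command. The sketch below only assembles and prints the command line. It uses the flags shown in this document, while the preset names, the `build_command` helper, and the `/path/to/dataset` placeholder are hypothetical.

```python
import shlex

# Flag sets taken from the use cases above; the preset names are illustrative.
PRESETS = {
    "dev": {"--sample-interval": "2.0", "--num-samples": "500"},
    "production": {"--sample-interval": "1.0"},
    "high_quality": {
        "--sample-interval": "0.5",
        "--temperature": "0.6",
        "--model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
    },
}


def build_command(preset, data_dir, output_dir, model="Qwen/Qwen2-VL-7B-Instruct"):
    """Return the argv list for annotate_pgen.py with the chosen preset applied."""
    args = {"--data-dir": data_dir, "--model": model, "--output-dir": output_dir}
    args.update(PRESETS[preset])  # preset values override the defaults
    cmd = ["python", "examples/dataset/annotate_pgen.py"]
    for flag, value in args.items():
        cmd += [flag, value]
    return cmd


# /path/to/dataset is a placeholder; substitute your local dataset directory.
print(shlex.join(build_command("high_quality", "/path/to/dataset", "./outputs/pgen_hq")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues.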