# Synthetic Data Generation for Hierarchical Robot Policies

This directory contains `annotate_pgen.py`, a script for generating synthetic user prompts and robot utterances for hierarchical policy training using Vision-Language Models (VLMs).

## Overview

The script implements the synthetic data generation pipeline described in the Hi-Robot paper:

- Load a LeRobot dataset with skill annotations (from `annotate.py`)
- Generate synthetic dialogue using a Qwen VLM:
  - **User prompts (ℓ_t)**: natural requests that lead to specific skills
  - **Robot utterances (u_t)**: acknowledgments and clarifications
- Save the results as a new dataset feature, `task_index_high_level`

## Prerequisites

First, annotate your dataset with skills using `annotate.py`:

```bash
python examples/dataset/annotate.py \
    --repo-id lerobot/svla_so101_pickplace \
    --video-key observation.images.base \
    --model Qwen/Qwen2-VL-7B-Instruct
```

This creates `meta/skills.json` with skill segmentation for each episode.
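
If you want to sanity-check the annotations before generating dialogue, a minimal sketch like the one below will do. It assumes `skills.json` is a JSON mapping from episodes to skill segments and that the dataset lives in the default local cache; adapt the path and the loop to whatever schema `annotate.py` actually wrote.

```python
import json
from pathlib import Path

# Adjust to wherever your annotated dataset lives locally (path is an assumption).
root = Path("~/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace").expanduser()

with open(root / "meta" / "skills.json") as f:
    skills = json.load(f)

# Peek at the first couple of episodes to confirm the segmentation looks sensible.
for episode_id, segments in list(skills.items())[:2]:
    print(episode_id, segments)
```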

## Usage

### Basic Usage

```bash
python examples/dataset/annotate_pgen.py \
    --repo-id lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --sample-interval 1.0 \
    --output-dir ./outputs/pgen_dataset
```

**Note:** the script processes all episodes in the dataset. Dialogue is generated once per second of video (`--sample-interval 1.0`) using temporal sampling, and frames between samples reuse the last generated dialogue. This keeps the run efficient while still covering the whole dataset.
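
For a rough sense of how many VLM calls a run will make, here is a back-of-envelope sketch; the frame rate, episode length, and episode count are illustrative assumptions, not properties of any particular dataset.

```python
import math

# Illustrative numbers only; plug in your own dataset statistics.
fps = 30                 # dataset frame rate
episode_len_s = 30       # average episode length in seconds
num_episodes = 100
sample_interval = 1.0    # value passed to --sample-interval

calls_per_episode = math.ceil(episode_len_s / sample_interval)
total_calls = num_episodes * calls_per_episode
total_frames = num_episodes * episode_len_s * fps
print(total_calls, f"VLM calls, covering {total_calls / total_frames:.1%} of frames")
# 3000 VLM calls, covering 3.3% of frames
```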

### Advanced Options

```bash
python examples/dataset/annotate_pgen.py \
    --repo-id lerobot/svla_so101_pickplace \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --temperature 0.8 \
    --sample-interval 0.5 \
    --num-image-views-per-sample 2 \
    --output-dir ./outputs/pgen_dataset \
    --push-to-hub
```

This example uses a more powerful model and samples every 0.5 seconds for finer granularity.

### Fast Testing (larger interval)

```bash
python examples/dataset/annotate_pgen.py \
    --repo-id lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --sample-interval 5.0 \
    --output-dir ./outputs/pgen_quick_test
```

Use a larger interval (5.0 seconds) for rapid iteration during development. All episodes are still processed.

### Using a Local Dataset

```bash
python examples/dataset/annotate_pgen.py \
    --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --output-dir ./outputs/pgen_dataset
```

## Output Files

The script produces several outputs:

- `meta/tasks_high_level.parquet`: high-level tasks with user prompts and robot utterances
  - Columns: `task_index`, `user_prompt`, `robot_utterance`, `skill`, `scenario_type`, `response_type`
- `meta/syn_annotations.jsonl`: debug file with all generated dialogues (a quick way to read it is sketched below)
  - One JSON object per line with full context for each frame
- Modified dataset: a new dataset with the `task_index_high_level` feature added to all parquet files
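
To spot-check the generated dialogue, here is a minimal sketch for reading the debug file. The path assumes the run used `--output-dir ./outputs/pgen_dataset`, and the keys match the example record shown later in this README.

```python
import json
from pathlib import Path

# Path assumes the run used --output-dir ./outputs/pgen_dataset.
jsonl_path = Path("./outputs/pgen_dataset/meta/syn_annotations.jsonl")

with open(jsonl_path) as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(f"[{record['scenario_type']}] {record['user_prompt']} -> {record['robot_utterance']}")
        if i >= 4:  # the first few entries are enough for a quick quality check
            break
```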

## Scenario and Response Types

The generator produces diverse interaction types:

### Scenario Types

- `specific_object`: direct specification of objects/actions
- `negative_task`: instructions about what NOT to do
- `situated_correction`: adjustments based on current state
- `implicit_request`: implied needs without direct commands
- `constraint_based`: specific constraints or preferences

### Response Types

- `confirmation`: simple acknowledgment ("OK, I'll do X")
- `clarification`: seeking confirmation ("Just to confirm...")
- `acknowledgment`: action acknowledgment ("Got it, doing X")
- `constraint_acknowledgment`: acknowledging constraints ("Sure, I'll X while Y")
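
To see how balanced the generated data is across these types, a quick sketch over the high-level task table (the path again assumes `--output-dir ./outputs/pgen_dataset`):

```python
import pandas as pd

# Count how often each scenario and response type appears in the generated tasks.
tasks_df = pd.read_parquet("./outputs/pgen_dataset/meta/tasks_high_level.parquet")
print(tasks_df["scenario_type"].value_counts())
print(tasks_df["response_type"].value_counts())
```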

## Example Generated Data

```json
{
  "episode_id": 0,
  "frame_index": 45,
  "timestamp": 2.5,
  "skill_current": "robot arm picks up pink lego brick",
  "skill_history": ["robot arm moves towards pink lego brick"],
  "task_description": "pink lego brick into the transparent box",
  "scenario_type": "specific_object",
  "response_type": "confirmation",
  "user_prompt": "Can you grab the pink brick?",
  "robot_utterance": "Sure, I'll pick up the pink lego brick."
}
```

## Accessing the Data

After running the script, access the synthetic data in your code:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
import pandas as pd

# Load modified dataset
dataset = LeRobotDataset(repo_id="lerobot/svla_so101_pickplace_with_high_level_tasks")

# Access a frame with a high-level task
frame = dataset[100]
high_level_task_idx = frame["task_index_high_level"].item()

# Load high-level tasks
tasks_df = pd.read_parquet(dataset.root / "meta" / "tasks_high_level.parquet")
task_info = tasks_df.iloc[high_level_task_idx]

print(f"User prompt: {task_info['user_prompt']}")
print(f"Robot utterance: {task_info['robot_utterance']}")
print(f"Skill: {task_info['skill']}")
```

## Architecture

The script is modular and extensible:

```python
# Core components (signatures only)

class QwenPgen:
    """VLM wrapper for generation"""

    def call_qwen(images, prompt) -> dict
    def construct_prompt(task, history, skill) -> str
        """Build prompt for VLM"""

def annotate_sample(pgen, images, ...) -> dict
    """Generate dialogue for one sample"""

def generate_synthetic_data(dataset, pgen, ...) -> tuple
    """Process entire dataset"""
```

## Parameters

| Parameter | Default | Description |
|---|---|---|
| `--repo-id` | - | HuggingFace dataset ID |
| `--data-dir` | - | Local dataset path |
| `--model` | `Qwen/Qwen2-VL-7B-Instruct` | VLM model name |
| `--device` | `cuda` | Device (`cuda`/`cpu`) |
| `--dtype` | `bfloat16` | Model precision |
| `--temperature` | `0.7` | Sampling temperature |
| `--sample-interval` | `1.0` | Generate dialogue every N seconds (all episodes processed) |
| `--num-image-views-per-sample` | `1` | Number of camera views |
| `--output-dir` | `None` | Output directory |
| `--push-to-hub` | `False` | Push to HuggingFace Hub |

## Sampling Strategy

The script uses temporal sampling to generate dialogue efficiently:

- **Default**: generate dialogue every 1 second (`--sample-interval 1.0`)
- **Efficiency**: if a dataset runs at 30 fps, this samples ~3% of frames
- **Propagation**: frames between samples reuse the last generated `task_index`
- **Episode-aware**: the first frame of each episode is always sampled
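
The bookkeeping behind this is simple; here is a minimal sketch (illustrative only, not the script's code) of how sampled frames mint a new task index and the frames in between reuse it:

```python
fps = 30
sample_interval = 1.0
episode_num_frames = 90  # a 3-second episode at 30 fps, purely illustrative

step = max(1, round(sample_interval * fps))
frame_to_task = {}
last_task_index = None
for frame_idx in range(episode_num_frames):
    if frame_idx % step == 0:
        # In the real script this is where the VLM generates new dialogue;
        # here we just mint a fresh index to show the propagation.
        last_task_index = frame_idx // step
    frame_to_task[frame_idx] = last_task_index  # in-between frames reuse the last index

print(len(set(frame_to_task.values())), "generations for", episode_num_frames, "frames")
# -> 3 generations for 90 frames
```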

Example with a 30 fps dataset:

```bash
# Sample every 1 second (every 30 frames)
--sample-interval 1.0   # ~3,000 generations for a 100-episode dataset (~30 s per episode)

# Sample every 0.5 seconds (every 15 frames)
--sample-interval 0.5   # ~6,000 generations (more granular)

# Sample every 2 seconds (every 60 frames)
--sample-interval 2.0   # ~1,500 generations (more efficient)
```

**Why sampling works:**
- Skills typically last 1-3 seconds
- Dialogue doesn't need to change every frame
- Reduces computational cost by 30-100x
- Still provides good coverage for training

## Tips

- **Quick testing**: use a larger `--sample-interval` (e.g., 5.0 or 10.0) for rapid iteration
- **Monitor GPU**: VLM inference is memory-intensive
- **Check outputs**: review `syn_annotations.jsonl` for quality
- **Adjust temperature**: higher = more diverse, lower = more consistent
- **Multiple views**: use `--num-image-views-per-sample 2` or higher for better context
- **Tune sampling**: start with 1.0 s; increase for speed (testing), decrease for granularity (production)

## Troubleshooting

### No `skills.json` found

Run `annotate.py` first to generate skill annotations.

### Out of memory

- Reduce batch size to 1
- Use a smaller model (Qwen2-VL-7B instead of Qwen3-VL-30B)
- Process fewer samples at a time

### Poor-quality generations

- Adjust temperature (try 0.6-0.9)
- Check that skills.json has good annotations
- Ensure images are loading correctly

## Citation

Based on the Hi-Robot paper's synthetic data generation approach:

```bibtex
@article{hirobot2024,
  title={Hi-Robot: Hierarchical Robot Learning with Vision-Language Models},
  year={2024}
}
```