# Synthetic Data Generation for Hierarchical Robot Policies

This directory contains `annotate_pgen.py`, a script that uses Vision-Language Models (VLMs) to generate synthetic user prompts and robot utterances for hierarchical policy training.

## Overview

The script implements the synthetic data generation pipeline described in the Hi-Robot paper:

1. **Load** a LeRobot dataset with skill annotations (from `annotate.py`)
2. **Generate** synthetic dialogue using a Qwen VLM:
   - User prompts (ℓ_t): natural requests that lead to specific skills
   - Robot utterances (u_t): acknowledgments and clarifications
3. **Save** the results as a new dataset feature, `task_index_high_level`

## Prerequisites

First, annotate your dataset with skills using `annotate.py`:

```bash
python examples/dataset/annotate.py \
    --repo-id lerobot/svla_so101_pickplace \
    --video-key observation.images.base \
    --model Qwen/Qwen2-VL-7B-Instruct
```

This creates `meta/skills.json` with skill segmentation for each episode.
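To sanity-check this step before generating dialogue, you can peek at the file. This is a minimal sketch: the exact schema is whatever `annotate.py` wrote, and the cache path below is just an example following the default LeRobot cache layout.

```python
import json
from pathlib import Path

# Example cache location; point this at your dataset's meta/ directory.
skills_path = Path(
    "~/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace/meta/skills.json"
).expanduser()

with open(skills_path) as f:
    skills = json.load(f)

# Only a structural peek: the exact schema is defined by annotate.py.
print(type(skills).__name__)
if isinstance(skills, dict):
    first_key = next(iter(skills))
    print(first_key, "->", skills[first_key])
```
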
## Usage

### Basic Usage

```bash
python examples/dataset/annotate_pgen.py \
    --repo-id lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --sample-interval 1.0 \
    --output-dir ./outputs/pgen_dataset
```

**Note**: The script processes **all episodes** in the dataset. It generates dialogue every 1 second (`--sample-interval 1.0`) using temporal sampling. Frames between samples reuse the last generated dialogue. This makes the process efficient while ensuring complete dataset coverage.

### Advanced Options

```bash
python examples/dataset/annotate_pgen.py \
    --repo-id lerobot/svla_so101_pickplace \
    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
    --temperature 0.8 \
    --sample-interval 0.5 \
    --num-image-views-per-sample 2 \
    --output-dir ./outputs/pgen_dataset \
    --push-to-hub
```

This example uses a more powerful model and samples every 0.5 seconds for finer granularity.

### Fast Testing (larger interval)

```bash
python examples/dataset/annotate_pgen.py \
    --repo-id lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --sample-interval 5.0 \
    --output-dir ./outputs/pgen_quick_test
```

Use a larger interval (5.0 seconds) for rapid iteration during development. All episodes are still processed.

### Using Local Dataset

```bash
python examples/dataset/annotate_pgen.py \
    --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --output-dir ./outputs/pgen_dataset
```

## Output Files

The script produces several outputs:

1. **`meta/tasks_high_level.parquet`**: High-level tasks with user prompts and robot utterances
   - Columns: `task_index`, `user_prompt`, `robot_utterance`, `skill`, `scenario_type`, `response_type`
2. **`meta/syn_annotations.jsonl`**: Debug file with all generated dialogues
   - One JSON object per line, with the full context for each sampled frame
3. **Modified dataset**: New dataset with the `task_index_high_level` feature added to all parquet files

## Scenario and Response Types

The generator produces diverse interaction types:

### Scenario Types

- **specific_object**: Direct specification of objects/actions
- **negative_task**: Instructions about what NOT to do
- **situated_correction**: Adjustments based on the current state
- **implicit_request**: Implied needs without direct commands
- **constraint_based**: Specific constraints or preferences

### Response Types

- **confirmation**: Simple acknowledgment ("OK, I'll do X")
- **clarification**: Seeking confirmation ("Just to confirm...")
- **acknowledgment**: Action acknowledgment ("Got it, doing X")
- **constraint_acknowledgment**: Acknowledging constraints ("Sure, I'll X while Y")

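To check how balanced the generated interaction types are, you can look at their distribution in the high-level tasks file. A minimal sketch, assuming the output was written to `./outputs/pgen_dataset`; the columns are the ones listed under "Output Files":

```python
import pandas as pd

# Assumes --output-dir ./outputs/pgen_dataset; columns as listed under "Output Files".
tasks_df = pd.read_parquet("./outputs/pgen_dataset/meta/tasks_high_level.parquet")

# Distribution of generated interaction types
print(tasks_df["scenario_type"].value_counts())
print(tasks_df["response_type"].value_counts())
```
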
## Example Generated Data

```json
{
  "episode_id": 0,
  "frame_index": 45,
  "timestamp": 2.5,
  "skill_current": "robot arm picks up pink lego brick",
  "skill_history": ["robot arm moves towards pink lego brick"],
  "task_description": "pink lego brick into the transparent box",
  "scenario_type": "specific_object",
  "response_type": "confirmation",
  "user_prompt": "Can you grab the pink brick?",
  "robot_utterance": "Sure, I'll pick up the pink lego brick."
}
```

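Records like this are written to `meta/syn_annotations.jsonl`, one JSON object per line. A minimal sketch for spot-checking them, assuming the same `./outputs/pgen_dataset` output directory:

```python
import json
from pathlib import Path

# Assumes --output-dir ./outputs/pgen_dataset; one JSON object per line.
annotations_path = Path("./outputs/pgen_dataset/meta/syn_annotations.jsonl")

with open(annotations_path) as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} generated dialogues")
for rec in records[:3]:  # spot-check a few generations
    print(rec["scenario_type"], "|", rec["user_prompt"], "->", rec["robot_utterance"])
```
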
## Accessing the Data

After running the script, access the synthetic data in your code:

```python
import pandas as pd

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Load the modified dataset
dataset = LeRobotDataset(repo_id="lerobot/svla_so101_pickplace_with_high_level_tasks")

# Access a frame and its high-level task index
frame = dataset[100]
high_level_task_idx = frame["task_index_high_level"].item()

# Load the high-level tasks
tasks_df = pd.read_parquet(dataset.root / "meta" / "tasks_high_level.parquet")
task_info = tasks_df.iloc[high_level_task_idx]

print(f"User prompt: {task_info['user_prompt']}")
print(f"Robot utterance: {task_info['robot_utterance']}")
print(f"Skill: {task_info['skill']}")
```

## Architecture

The script is modular and extensible:

```python
# Core components

class QwenPgen:
    """VLM wrapper for generation"""
    def call_qwen(images, prompt) -> dict

def construct_prompt(task, history, skill) -> str
    """Build prompt for VLM"""

def annotate_sample(pgen, images, ...) -> dict
    """Generate dialogue for one sample"""

def generate_synthetic_data(dataset, pgen, ...) -> tuple
    """Process entire dataset"""
```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--repo-id` | - | HuggingFace dataset ID |
| `--data-dir` | - | Local dataset path |
| `--model` | `Qwen/Qwen2-VL-7B-Instruct` | VLM model name |
| `--device` | `cuda` | Device (`cuda`/`cpu`) |
| `--dtype` | `bfloat16` | Model precision |
| `--temperature` | 0.7 | Sampling temperature |
| `--sample-interval` | 1.0 | Generate dialogue every N seconds (all episodes processed) |
| `--num-image-views-per-sample` | 1 | Number of camera views per sample |
| `--output-dir` | None | Output directory |
| `--push-to-hub` | False | Push to the HuggingFace Hub |

## Sampling Strategy

The script uses **temporal sampling** to generate dialogue efficiently:

- **Default**: Generate dialogue every 1 second (`--sample-interval 1.0`)
- **Efficiency**: If a dataset runs at 30 fps, this samples ~3% of frames
- **Propagation**: Frames between samples reuse the last generated `task_index`
- **Episode-aware**: The first frame of each episode is always sampled

### Example with a 30 fps dataset

```bash
# Sample every 1 second (every 30 frames)
--sample-interval 1.0   # ~3,000 generations for a 100-episode dataset (~30 sec/episode)

# Sample every 0.5 seconds (every 15 frames)
--sample-interval 0.5   # ~6,000 generations (more granular)

# Sample every 2 seconds (every 60 frames)
--sample-interval 2.0   # ~1,500 generations (more efficient)
```

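The number of generations per episode can be estimated with a small back-of-the-envelope sketch. This is illustrative only; the actual sampling logic lives in `annotate_pgen.py`, and the fps and episode length below are example values.

```python
# Illustrative sketch of temporal sampling (the real logic lives in annotate_pgen.py).
fps = 30
sample_interval = 1.0    # seconds between generations
episode_length = 900     # frames in one episode (30 s at 30 fps)

step = max(1, round(fps * sample_interval))            # frames between samples
sample_frames = list(range(0, episode_length, step))   # frame 0 is always included

print(f"{len(sample_frames)} generations for this episode")  # -> 30
# Every other frame reuses the task_index generated at the previous sampled frame.
```
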
### Why sampling works

- Skills typically last 1-3 seconds
- Dialogue doesn't need to change every frame
- Reduces computational cost by 30-100x
- Still provides good coverage for training

## Tips

1. **Quick testing**: Use a larger `--sample-interval` (e.g., 5.0 or 10.0) for rapid iteration
2. **Monitor GPU**: VLM inference is memory-intensive
3. **Check outputs**: Review `syn_annotations.jsonl` for quality
4. **Adjust temperature**: Higher = more diverse, lower = more consistent
5. **Multiple views**: Use `--num-image-views-per-sample 2` or more for better context
6. **Tune sampling**: Start with 1.0 s; increase for speed (testing), decrease for granularity (production)

## Troubleshooting

### No skills.json found

Run `annotate.py` first to generate skill annotations.

### Out of memory

- Reduce the batch size to 1
- Use a smaller model (Qwen2-VL-7B instead of Qwen3-VL-30B)
- Process fewer samples at a time

### Poor quality generations

- Adjust the temperature (try 0.6-0.9)
- Check that `skills.json` has good annotations
- Ensure images are loading correctly

## Citation

Based on the Hi-Robot paper's synthetic data generation approach:

```
@article{hirobot2024,
  title={Hi-Robot: Hierarchical Robot Learning with Vision-Language Models},
  year={2024}
}
```