add step2

2026-07-23 17:56:07 +00:00 · 2025-12-09 12:28:46 +00:00
parent 9091b68d86
commit c8eee4ea16
8 changed files with 1935 additions and 0 deletions
@@ -0,0 +1,243 @@
+# Synthetic Data Generation for Hierarchical Robot Policies
+
+This directory contains `annotate_pgen.py`, a script for generating synthetic user prompts and robot utterances for hierarchical policy training using Vision-Language Models (VLMs).
+
+## Overview
+
+The script implements the synthetic data generation pipeline described in the Hi-Robot paper:
+
+1. **Load** a LeRobot dataset with skill annotations (from `annotate.py`)
+2. **Generate** synthetic dialogue using Qwen VLM:
+   - User prompts (ℓ_t): Natural requests that lead to specific skills
+   - Robot utterances (u_t): Acknowledgments and clarifications
+3. **Save** results as a new dataset feature `task_index_high_level`
+
+## Prerequisites
+
+1. First, annotate your dataset with skills using `annotate.py`:
+
+```bash
+python examples/dataset/annotate.py \
+    --repo-id lerobot/svla_so101_pickplace \
+    --video-key observation.images.base \
+    --model Qwen/Qwen2-VL-7B-Instruct
+```
+
+This creates `meta/skills.json` with skill segmentation for each episode.
+
+## Usage
+
+### Basic Usage
+
+```bash
+python examples/dataset/annotate_pgen.py \
+    --repo-id lerobot/svla_so101_pickplace \
+    --model Qwen/Qwen2-VL-7B-Instruct \
+    --sample-interval 1.0 \
+    --output-dir ./outputs/pgen_dataset
+```
+
+**Note**: The script processes **all episodes** in the dataset. It generates dialogue every 1 second (`--sample-interval 1.0`) using temporal sampling. Frames between samples reuse the last generated dialogue. This makes the process efficient while ensuring complete dataset coverage.
+
+### Advanced Options
+
+```bash
+python examples/dataset/annotate_pgen.py \
+    --repo-id lerobot/svla_so101_pickplace \
+    --model Qwen/Qwen3-VL-30B-A3B-Instruct \
+    --temperature 0.8 \
+    --sample-interval 0.5 \
+    --num-image-views-per-sample 2 \
+    --output-dir ./outputs/pgen_dataset \
+    --push-to-hub
+```
+
+This example uses a more powerful model and samples every 0.5 seconds for finer granularity.
+
+### Fast Testing (larger interval)
+
+```bash
+python examples/dataset/annotate_pgen.py \
+    --repo-id lerobot/svla_so101_pickplace \
+    --model Qwen/Qwen2-VL-7B-Instruct \
+    --sample-interval 5.0 \
+    --output-dir ./outputs/pgen_quick_test
+```
+
+Use a larger interval (5.0 seconds) for rapid iteration during development. All episodes are still processed.
+
+### Using Local Dataset
+
+```bash
+python examples/dataset/annotate_pgen.py \
+    --data-dir /fsx/jade_choghari/.cache/huggingface/lerobot/lerobot/svla_so101_pickplace \
+    --model Qwen/Qwen2-VL-7B-Instruct \
+    --output-dir ./outputs/pgen_dataset
+```
+
+## Output Files
+
+The script produces several outputs:
+
+1. **`meta/tasks_high_level.parquet`**: High-level tasks with user prompts and robot utterances
+   - Columns: task_index, user_prompt, robot_utterance, skill, scenario_type, response_type
+
+2. **`meta/syn_annotations.jsonl`**: Debug file with all generated dialogues
+   - One JSON object per line with full context for each frame
+
+3. **Modified dataset**: New dataset with `task_index_high_level` feature added to all parquet files
+
+## Scenario and Response Types
+
+The generator produces diverse interaction types:
+
+### Scenario Types
+- **specific_object**: Direct specification of objects/actions
+- **negative_task**: Instructions about what NOT to do
+- **situated_correction**: Adjustments based on current state
+- **implicit_request**: Implied needs without direct commands
+- **constraint_based**: Specific constraints or preferences
+
+### Response Types
+- **confirmation**: Simple acknowledgment ("OK, I'll do X")
+- **clarification**: Seeking confirmation ("Just to confirm...")
+- **acknowledgment**: Action acknowledgment ("Got it, doing X")
+- **constraint_acknowledgment**: Acknowledging constraints ("Sure, I'll X while Y")
+
+## Example Generated Data
+
+```json
+{
+  "episode_id": 0,
+  "frame_index": 45,
+  "timestamp": 2.5,
+  "skill_current": "robot arm picks up pink lego brick",
+  "skill_history": ["robot arm moves towards pink lego brick"],
+  "task_description": "pink lego brick into the transparent box",
+  "scenario_type": "specific_object",
+  "response_type": "confirmation",
+  "user_prompt": "Can you grab the pink brick?",
+  "robot_utterance": "Sure, I'll pick up the pink lego brick."
+}
+```
+
+## Accessing the Data
+
+After running the script, access the synthetic data in your code:
+
+```python
+from lerobot.datasets.lerobot_dataset import LeRobotDataset
+import pandas as pd
+
+# Load modified dataset
+dataset = LeRobotDataset(repo_id="lerobot/svla_so101_pickplace_with_high_level_tasks")
+
+# Access frame with high-level task
+frame = dataset[100]
+high_level_task_idx = frame["task_index_high_level"].item()
+
+# Load high-level tasks
+tasks_df = pd.read_parquet(dataset.root / "meta" / "tasks_high_level.parquet")
+task_info = tasks_df.iloc[high_level_task_idx]
+
+print(f"User prompt: {task_info['user_prompt']}")
+print(f"Robot utterance: {task_info['robot_utterance']}")
+print(f"Skill: {task_info['skill']}")
+```
+
+## Architecture
+
+The script is modular and extensible:
+
+```python
+# Core components
+class QwenPgen:
+    """VLM wrapper for generation"""
+    def call_qwen(images, prompt) -> dict
+    
+def construct_prompt(task, history, skill) -> str
+    """Build prompt for VLM"""
+    
+def annotate_sample(pgen, images, ...) -> dict
+    """Generate dialogue for one sample"""
+    
+def generate_synthetic_data(dataset, pgen, ...) -> tuple
+    """Process entire dataset"""
+```
+
+## Parameters
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `--repo-id` | - | HuggingFace dataset ID |
+| `--data-dir` | - | Local dataset path |
+| `--model` | Qwen/Qwen2-VL-7B-Instruct | VLM model name |
+| `--device` | cuda | Device (cuda/cpu) |
+| `--dtype` | bfloat16 | Model precision |
+| `--temperature` | 0.7 | Sampling temperature |
+| `--sample-interval` | 1.0 | Generate dialogue every N seconds (all episodes processed) |
+| `--num-image-views-per-sample` | 1 | Number of cameras |
+| `--output-dir` | None | Output directory |
+| `--push-to-hub` | False | Push to HuggingFace Hub |
+
+## Sampling Strategy
+
+The script uses **temporal sampling** to efficiently generate dialogue:
+
+- **Default**: Generate dialogue every 1 second (`--sample-interval 1.0`)
+- **Efficiency**: If a dataset runs at 30fps, this samples ~3% of frames
+- **Propagation**: Frames between samples reuse the last generated task_index
+- **Episode-aware**: Always samples the first frame of each episode
+
+### Example with 30 fps dataset:
+```bash
+# Sample every 1 second (every 30 frames)
+--sample-interval 1.0  # ~3,000 generations for a 100 episode dataset (3 sec/episode)
+
+# Sample every 0.5 seconds (every 15 frames)
+--sample-interval 0.5  # ~6,000 generations (more granular)
+
+# Sample every 2 seconds (every 60 frames)
+--sample-interval 2.0  # ~1,500 generations (more efficient)
+```
+
+### Why sampling works:
+- Skills typically last 1-3 seconds
+- Dialogue doesn't need to change every frame
+- Reduces computational cost by 30-100x
+- Still provides good coverage for training
+
+## Tips
+
+1. **Quick testing**: Use larger `--sample-interval` (e.g., 5.0 or 10.0) for rapid iteration
+2. **Monitor GPU**: VLM inference is memory-intensive
+3. **Check outputs**: Review `syn_annotations.jsonl` for quality
+4. **Adjust temperature**: Higher = more diverse, lower = more consistent
+5. **Multiple views**: Use `--num-image-views-per-sample 2+` for better context
+6. **Tune sampling**: Start with 1.0s, increase for speed (testing), decrease for granularity (production)
+
+## Troubleshooting
+
+### No skills.json found
+Run `annotate.py` first to generate skill annotations.
+
+### Out of memory
+- Reduce batch size to 1
+- Use smaller model (Qwen2-VL-7B instead of Qwen3-VL-30B)
+- Process fewer samples at a time
+
+### Poor quality generations
+- Adjust temperature (try 0.6-0.9)
+- Check that skills.json has good annotations
+- Ensure images are loading correctly
+
+## Citation
+
+Based on the Hi-Robot paper's synthetic data generation approach:
+```
+@article{hirobot2024,
+  title={Hi-Robot: Hierarchical Robot Learning with Vision-Language Models},
+  year={2024}
+}
+```
+