# Dataset Annotation Tools

This guide explains how to use the automatic annotation tools to add skill labels and synthetic dialogue to your LeRobot datasets.

## Overview

The annotation pipeline consists of two main components:

1. **Subtask Annotation** (`subtask_annotate.py`): Automatically segments robot demonstrations into atomic skills using Vision-Language Models (VLMs)
2. **High-Level Annotation** (`high_level_annotate.py`): Generates synthetic user prompts and robot utterances for hierarchical policy training

These tools enable you to transform raw robot demonstration data into richly annotated datasets suitable for training hierarchical policies.

## Installation Requirements

Before using the annotation tools, ensure you have the required dependencies:

```bash
pip install transformers qwen-vl-utils opencv-python rich pandas pyarrow
```

You'll also need FFmpeg for video processing:

```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg
```

## Part 1: Subtask Annotation

### What It Does

The subtask annotator segments each episode into short atomic manipulation skills (1-3 seconds each). For example, a "pick and place" episode might be segmented into:

- "reach towards object" (0.0s - 1.2s)
- "grasp object" (1.2s - 2.1s)
- "lift object" (2.1s - 3.5s)
- "move to target" (3.5s - 5.0s)
- "release object" (5.0s - 6.2s)

### Usage

#### Basic Example

```bash
python src/lerobot/policies/pi05_full/annotate/subtask_annotate.py \
    --repo-id your-username/your-dataset \
    --video-key observation.images.base \
    --output-dir /path/to/output
```

#### With Local Dataset

```bash
python src/lerobot/policies/pi05_full/annotate/subtask_annotate.py \
    --data-dir /path/to/local/dataset \
    --video-key observation.images.base \
    --output-dir /path/to/output
```

#### Advanced Options

```bash
python src/lerobot/policies/pi05_full/annotate/subtask_annotate.py \
    --repo-id your-username/your-dataset \
    --video-key observation.images.base \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --batch-size 16 \
    --output-dir /path/to/output \
    --push-to-hub
```

### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--repo-id` | HuggingFace Hub dataset ID | Required (or use `--data-dir`) |
| `--data-dir` | Path to local dataset | Required (or use `--repo-id`) |
| `--video-key` | Video observation key | Required |
| `--model` | VLM model to use | `Qwen/Qwen2-VL-7B-Instruct` |
| `--device` | Device to run model on | `cuda` |
| `--dtype` | Model dtype | `bfloat16` |
| `--batch-size` | Episodes per batch | `8` |
| `--episodes` | Specific episodes to annotate | All episodes |
| `--output-dir` | Output directory | Auto-generated |
| `--push-to-hub` | Push to HuggingFace Hub | `False` |

### Supported Models

- **Qwen2-VL**: `Qwen/Qwen2-VL-2B-Instruct`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`
- **Qwen3-VL**: `Qwen/Qwen3-VL-30B-A3B-Instruct`

### Output Files

The subtask annotation creates the following files in your dataset:

1. **`meta/subtasks.parquet`**: DataFrame with unique subtask names

   ```python
   # Structure:
   # Index: subtask name (string)
   # Column: subtask_index (int64)
   ```

2. **`meta/skills.json`**: Raw skill annotations with timestamps

   ```json
   {
     "coarse_description": "Pick and place the object",
     "skill_to_subtask_index": {
       "reach towards object": 0,
       "grasp object": 1,
       ...
     },
     "episodes": {
       "0": {
         "episode_index": 0,
         "description": "Pick and place the object",
         "skills": [
           {"name": "reach towards object", "start": 0.0, "end": 1.2},
           {"name": "grasp object", "start": 1.2, "end": 2.1},
           ...
         ]
       }
     }
   }
   ```

3. **`subtask_index` feature**: Added to each frame in the dataset
   - Type: `int64`
   - Shape: `(1,)`
   - Maps each frame to its corresponding subtask

### Accessing Subtask Annotations

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Load annotated dataset
dataset = LeRobotDataset(repo_id="your/dataset_with_subtasks")

# Get a frame
frame = dataset[100]

# Get the subtask for this frame
subtask_idx = frame["subtask_index"].item()
subtask_name = dataset.meta.subtasks.iloc[subtask_idx].name
print(f"Frame 100 is performing: {subtask_name}")

# Load all subtasks
subtasks_df = dataset.meta.subtasks
print(subtasks_df)
```
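The raw timestamps in `meta/skills.json` are also useful on their own, for example to check how long each annotated skill lasts and roughly which frame range it covers. The sketch below is a minimal example assuming the `skills.json` layout shown above; converting seconds to frame indices via the dataset `fps` is an illustrative approximation, not necessarily the exact rule the annotator uses internally to assign `subtask_index`.

```python
import json

from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset(repo_id="your/dataset_with_subtasks")

# skills.json stores per-episode skill segments with start/end times in seconds
with open(dataset.root / "meta" / "skills.json") as f:
    skills = json.load(f)

fps = dataset.fps  # frames per second from the dataset metadata

for skill in skills["episodes"]["0"]["skills"]:
    start_frame = int(skill["start"] * fps)
    end_frame = int(skill["end"] * fps)
    duration = skill["end"] - skill["start"]
    print(f"{skill['name']}: {duration:.1f}s, frames ~{start_frame}-{end_frame}")
```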
## Part 2: High-Level Annotation

### What It Does

The high-level annotator generates synthetic dialogue for hierarchical policy training. For each skill, it creates:

- **User Prompt** (`ℓ_t`): A natural language request from the user
- **Robot Utterance** (`u_t`): A natural language response from the robot

This enables training policies that can understand and respond to human instructions in natural dialogue.

### Prerequisites

**Important**: You must run subtask annotation first! High-level annotation requires the `skills.json` file generated by subtask annotation.

### Usage

#### Image Mode (Default)

Samples frames at regular intervals and passes images to the VLM:

```bash
python src/lerobot/policies/pi05_full/annotate/high_level_annotate.py \
    --repo-id your/dataset_with_subtasks \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --image-key observation.images.base \
    --output-dir /path/to/output
```

#### Video Mode

Passes entire episode videos to the VLM for better temporal understanding:

```bash
python src/lerobot/policies/pi05_full/annotate/high_level_annotate.py \
    --repo-id your/dataset_with_subtasks \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --video-mode \
    --video-key observation.images.base \
    --video-batch-size 4 \
    --output-dir /path/to/output
```

### Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--repo-id` | HuggingFace Hub dataset ID | Required (or use `--data-dir`) |
| `--data-dir` | Path to local dataset | Required (or use `--repo-id`) |
| `--model` | VLM model to use | `Qwen/Qwen2-VL-7B-Instruct` |
| `--image-key` | Image observation key (image mode) | First camera key |
| `--video-mode` | Use video instead of images | `False` |
| `--video-key` | Video observation key (video mode) | Auto-detected |
| `--video-batch-size` | Episodes per batch (video mode) | `1` |
| `--sample-interval` | Sampling interval in seconds | `1.0` |
| `--temperature` | Sampling temperature | `0.7` |
| `--output-dir` | Output directory | Auto-generated |
| `--push-to-hub` | Push to HuggingFace Hub | `False` |

### Output Files

The high-level annotation creates:

1. **`meta/tasks_high_level.parquet`**: DataFrame with high-level tasks

   ```python
   # Structure:
   # Index: task string (concatenated user_prompt | robot_utterance)
   # Columns:
   #   - task_index: int64
   #   - user_prompt: string
   #   - robot_utterance: string
   #   - skill: string (associated subtask)
   #   - scenario_type: string
   #   - response_type: string
   ```

2. **`meta/syn_annotations.jsonl`**: Debug annotations (JSONL format)

   ```json
   {"episode_id": 0, "timestamp": 1.5, "skill_current": "grasp object", "user_prompt": "Can you pick that up?", "robot_utterance": "Sure, I'll grasp it now", ...}
   ```

3. **`task_index_high_level` feature**: Added to each frame
   - Type: `int64`
   - Shape: `(1,)`
   - Maps each frame to its high-level task

### Dialogue Types Generated

The system generates diverse interaction types:

**Scenario Types:**

- `specific_object`: "Pick up the red block"
- `negative_task`: "Don't touch the blue one"
- `situated_correction`: "Actually, move to the other box instead"
- `implicit_request`: "I need something red for the tower"
- `constraint_based`: "Make sure to handle it gently"

**Response Types:**

- `confirmation`: "OK, I'll pick it up"
- `clarification`: "Just to confirm, you want me to pick up the red block?"
- `acknowledgment`: "Got it, picking up the red block"
- `constraint_acknowledgment`: "Sure, I'll pick it up gently"
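To check how balanced the generated dialogue is across these scenario and response types, you can tally the columns of `meta/tasks_high_level.parquet` described above. A minimal sketch with pandas; the repo id is a placeholder:

```python
import pandas as pd

from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset(repo_id="your/dataset_with_high_level_tasks")

# One row per generated (user_prompt, robot_utterance) pair
tasks_df = pd.read_parquet(dataset.root / "meta" / "tasks_high_level.parquet")

# How many dialogues of each scenario and response type were generated?
print(tasks_df["scenario_type"].value_counts())
print(tasks_df["response_type"].value_counts())

# Which skills received the most dialogue?
print(tasks_df.groupby("skill").size().sort_values(ascending=False))
```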
### Accessing High-Level Annotations

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
import pandas as pd

# Load annotated dataset
dataset = LeRobotDataset(repo_id="your/dataset_with_high_level_tasks")

# Get a frame
frame = dataset[100]

# Get the high-level task
task_idx = frame["task_index_high_level"].item()

# Load tasks metadata
tasks_df = pd.read_parquet(dataset.root / "meta" / "tasks_high_level.parquet")
task_row = tasks_df[tasks_df["task_index"] == task_idx].iloc[0]

print(f"User: {task_row['user_prompt']}")
print(f"Robot: {task_row['robot_utterance']}")
print(f"Skill: {task_row['skill']}")

# Use in a DataLoader
import torch
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
batch = next(iter(dataloader))
print(f"Task indices: {batch['task_index_high_level']}")
print(f"User prompts: {batch['user_prompt'][0]}")
print(f"Robot utterances: {batch['robot_utterance'][0]}")
```

## Complete Pipeline Example

Here's how to run both annotation stages:

```bash
#!/bin/bash

REPO_ID="your-username/your-dataset"
MODEL="Qwen/Qwen2-VL-7B-Instruct"
OUTPUT_DIR="/path/to/output"

# Step 1: Subtask Annotation
python src/lerobot/policies/pi05_full/annotate/subtask_annotate.py \
    --repo-id "$REPO_ID" \
    --video-key observation.images.base \
    --model "$MODEL" \
    --batch-size 8 \
    --output-dir "${OUTPUT_DIR}/subtasks"

# Step 2: High-Level Annotation (Image Mode)
python src/lerobot/policies/pi05_full/annotate/high_level_annotate.py \
    --data-dir "${OUTPUT_DIR}/subtasks" \
    --model "$MODEL" \
    --image-key observation.images.base \
    --sample-interval 1.0 \
    --output-dir "${OUTPUT_DIR}/final"

# Or Step 2: High-Level Annotation (Video Mode - Recommended)
python src/lerobot/policies/pi05_full/annotate/high_level_annotate.py \
    --data-dir "${OUTPUT_DIR}/subtasks" \
    --model "$MODEL" \
    --video-mode \
    --video-key observation.images.base \
    --video-batch-size 4 \
    --output-dir "${OUTPUT_DIR}/final"
```
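After the pipeline finishes, it can be worth a quick sanity check that both stages wrote their outputs. The sketch below is a minimal example assuming the final dataset landed in `${OUTPUT_DIR}/final` as in the script above and that `LeRobotDataset` is pointed at that local root; the path and repo id are placeholders.

```python
from pathlib import Path

from lerobot.datasets.lerobot_dataset import LeRobotDataset

final_root = Path("/path/to/output/final")  # the --output-dir of the second stage

# Metadata files that the two annotation stages should have written
for name in ["subtasks.parquet", "skills.json", "tasks_high_level.parquet", "syn_annotations.jsonl"]:
    path = final_root / "meta" / name
    print(f"{name}: {'found' if path.exists() else 'MISSING'}")

# Per-frame features added by annotation
dataset = LeRobotDataset(repo_id="your-username/your-dataset", root=final_root)
frame = dataset[0]
print("subtask_index:", frame["subtask_index"].item())
print("task_index_high_level:", frame["task_index_high_level"].item())
```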
## Performance Tips

### For Faster Processing

1. **Increase batch size**: Use `--batch-size 16` or higher (subtask annotation)
2. **Increase video batch size**: Use `--video-batch-size 8` (high-level annotation in video mode)
3. **Larger sampling interval**: Use `--sample-interval 5.0` for testing (samples every 5 seconds instead of 1)
4. **Use smaller models**: `Qwen/Qwen2-VL-2B-Instruct` is faster than `Qwen/Qwen2-VL-7B-Instruct`
5. **Process specific episodes**: Use `--episodes 0 1 2 3` to annotate only a subset

### For Better Quality

1. **Use larger models**: `Qwen/Qwen3-VL-30B-A3B-Instruct` or `Qwen/Qwen2-VL-72B-Instruct`
2. **Use video mode**: Provides better temporal context
3. **Smaller sampling intervals**: `--sample-interval 0.5` for dense annotations
4. **Adjust temperature**: Use `--temperature 0.9` for more diverse dialogue

## Memory Requirements

| Model | GPU Memory | Recommended Batch Size |
|-------|------------|------------------------|
| Qwen2-VL-2B | ~8 GB | 16-32 |
| Qwen2-VL-7B | ~16 GB | 8-16 |
| Qwen2-VL-72B | ~80 GB | 1-2 |
| Qwen3-VL-30B | ~40 GB | 4-8 |

## Troubleshooting

### "FFmpeg not found"

```bash
# Install FFmpeg
sudo apt-get install ffmpeg   # Ubuntu/Debian
brew install ffmpeg           # macOS
```

### "CUDA out of memory"

- Reduce batch size: `--batch-size 1` or `--video-batch-size 1`
- Use smaller model: `Qwen/Qwen2-VL-2B-Instruct`
- Use CPU: `--device cpu` (much slower)

### "No skills.json found"

Run subtask annotation first before high-level annotation.

### "Video key not found"

List available keys:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset(repo_id="your/dataset")
print("Video keys:", dataset.meta.video_keys)
print("Camera keys:", dataset.meta.camera_keys)
```

## Dataset Structure After Annotation

```
your_dataset_with_high_level_tasks/
├── meta/
│   ├── info.json                  # Original metadata
│   ├── tasks.parquet              # Original tasks (preserved)
│   ├── subtasks.parquet           # NEW: Subtask names and indices
│   ├── skills.json                # NEW: Raw skill annotations with timestamps
│   ├── tasks_high_level.parquet   # NEW: High-level tasks with dialogue
│   └── syn_annotations.jsonl      # NEW: Debug annotations
├── data/
│   └── chunk-000/
│       ├── observation.images.base.mp4
│       ├── action.safetensors
│       ├── subtask_index.safetensors             # NEW: Subtask per frame
│       └── task_index_high_level.safetensors     # NEW: High-level task per frame
└── videos/
    └── ...
```

## Citation

If you use these annotation tools in your research, please cite:

```bibtex
@article{lerobot2024,
  title={LeRobot: State-of-the-art Machine Learning for Real-World Robotics},
  author={LeRobot Contributors},
  year={2024},
  url={https://github.com/huggingface/lerobot}
}
```

## Next Steps

After annotation, you can:

1. Train hierarchical policies using the subtask and high-level annotations
2. Use the synthetic dialogue for instruction-following policy training
3. Analyze skill distributions and dialogue patterns (see the sketch below)
4. Share your annotated dataset on HuggingFace Hub with `--push-to-hub`

For training examples, see the [training documentation](../training/).
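As a starting point for step 3 (skill distributions), here is a minimal sketch that counts frames per subtask using only the per-frame `subtask_index` feature and the `meta/subtasks.parquet` mapping documented earlier; it subsamples frames because indexing a frame also decodes its camera images.

```python
from collections import Counter

from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset(repo_id="your/dataset_with_high_level_tasks")

# Count frames per subtask, sampling every 10th frame to keep it quick
counts = Counter()
for i in range(0, len(dataset), 10):
    counts[dataset[i]["subtask_index"].item()] += 1

for subtask_idx, n in counts.most_common():
    subtask_name = dataset.meta.subtasks.iloc[subtask_idx].name
    print(f"{subtask_name}: {n} sampled frames")
```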