Files
lerobot/docs/source/annotation_tools.mdx
T
Jade Choghari b864c13dfb add docs
2026-01-19 10:36:25 +00:00

426 lines
13 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Dataset Annotation Tools
This guide explains how to use the automatic annotation tools to add skill labels and synthetic dialogue to your LeRobot datasets.
## Overview
The annotation pipeline consists of two main components:
1. **Subtask Annotation** (`subtask_annotate.py`): Automatically segments robot demonstrations into atomic skills using Vision-Language Models (VLMs)
2. **High-Level Annotation** (`high_level_annotate.py`): Generates synthetic user prompts and robot utterances for hierarchical policy training
These tools enable you to transform raw robot demonstration data into richly annotated datasets suitable for training hierarchical policies.
## Installation Requirements
Before using the annotation tools, ensure you have the required dependencies:
```bash
pip install transformers qwen-vl-utils opencv-python rich pandas pyarrow
```
You'll also need FFmpeg for video processing:
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg
# macOS
brew install ffmpeg
```
## Part 1: Subtask Annotation
### What It Does
The subtask annotator segments each episode into short atomic manipulation skills (1-3 seconds each). For example, a "pick and place" episode might be segmented into:
- "reach towards object" (0.0s - 1.2s)
- "grasp object" (1.2s - 2.1s)
- "lift object" (2.1s - 3.5s)
- "move to target" (3.5s - 5.0s)
- "release object" (5.0s - 6.2s)
### Usage
#### Basic Example
```bash
python src/lerobot/policies/pi05_full/annotate/subtask_annotate.py \
--repo-id your-username/your-dataset \
--video-key observation.images.base \
--output-dir /path/to/output
```
#### With Local Dataset
```bash
python src/lerobot/policies/pi05_full/annotate/subtask_annotate.py \
--data-dir /path/to/local/dataset \
--video-key observation.images.base \
--output-dir /path/to/output
```
#### Advanced Options
```bash
python src/lerobot/policies/pi05_full/annotate/subtask_annotate.py \
--repo-id your-username/your-dataset \
--video-key observation.images.base \
--model Qwen/Qwen2-VL-7B-Instruct \
--batch-size 16 \
--output-dir /path/to/output \
--push-to-hub
```
### Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--repo-id` | HuggingFace Hub dataset ID | Required (or use --data-dir) |
| `--data-dir` | Path to local dataset | Required (or use --repo-id) |
| `--video-key` | Video observation key | Required |
| `--model` | VLM model to use | `Qwen/Qwen2-VL-7B-Instruct` |
| `--device` | Device to run model on | `cuda` |
| `--dtype` | Model dtype | `bfloat16` |
| `--batch-size` | Episodes per batch | `8` |
| `--episodes` | Specific episodes to annotate | All episodes |
| `--output-dir` | Output directory | Auto-generated |
| `--push-to-hub` | Push to HuggingFace Hub | `False` |
### Supported Models
- **Qwen2-VL**: `Qwen/Qwen2-VL-2B-Instruct`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`
- **Qwen3-VL**: `Qwen/Qwen3-VL-30B-A3B-Instruct`
### Output Files
The subtask annotation creates the following files in your dataset:
1. **`meta/subtasks.parquet`**: DataFrame with unique subtask names
```python
# Structure:
# Index: subtask name (string)
# Column: subtask_index (int64)
```
2. **`meta/skills.json`**: Raw skill annotations with timestamps
```json
{
"coarse_description": "Pick and place the object",
"skill_to_subtask_index": {
"reach towards object": 0,
"grasp object": 1,
...
},
"episodes": {
"0": {
"episode_index": 0,
"description": "Pick and place the object",
"skills": [
{"name": "reach towards object", "start": 0.0, "end": 1.2},
{"name": "grasp object", "start": 1.2, "end": 2.1},
...
]
}
}
}
```
3. **`subtask_index` feature**: Added to each frame in the dataset
- Type: `int64`
- Shape: `(1,)`
- Maps each frame to its corresponding subtask
### Accessing Subtask Annotations
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
# Load annotated dataset
dataset = LeRobotDataset(repo_id="your/dataset_with_subtasks")
# Get a frame
frame = dataset[100]
# Get the subtask for this frame
subtask_idx = frame["subtask_index"].item()
subtask_name = dataset.meta.subtasks.iloc[subtask_idx].name
print(f"Frame 100 is performing: {subtask_name}")
# Load all subtasks
subtasks_df = dataset.meta.subtasks
print(subtasks_df)
```
## Part 2: High-Level Annotation
### What It Does
The high-level annotator generates synthetic dialogue for hierarchical policy training. For each skill, it creates:
- **User Prompt** (`_t`): A natural language request from the user
- **Robot Utterance** (`u_t`): A natural language response from the robot
This enables training policies that can understand and respond to human instructions in natural dialogue.
### Prerequisites
**Important**: You must run subtask annotation first! High-level annotation requires the `skills.json` file generated by subtask annotation.
### Usage
#### Image Mode (Default)
Samples frames at regular intervals and passes images to the VLM:
```bash
python src/lerobot/policies/pi05_full/annotate/high_level_annotate.py \
--repo-id your/dataset_with_subtasks \
--model Qwen/Qwen2-VL-7B-Instruct \
--image-key observation.images.base \
--output-dir /path/to/output
```
#### Video Mode
Passes entire episode videos to the VLM for better temporal understanding:
```bash
python src/lerobot/policies/pi05_full/annotate/high_level_annotate.py \
--repo-id your/dataset_with_subtasks \
--model Qwen/Qwen2-VL-7B-Instruct \
--video-mode \
--video-key observation.images.base \
--video-batch-size 4 \
--output-dir /path/to/output
```
### Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--repo-id` | HuggingFace Hub dataset ID | Required (or use --data-dir) |
| `--data-dir` | Path to local dataset | Required (or use --repo-id) |
| `--model` | VLM model to use | `Qwen/Qwen2-VL-7B-Instruct` |
| `--image-key` | Image observation key (image mode) | First camera key |
| `--video-mode` | Use video instead of images | `False` |
| `--video-key` | Video observation key (video mode) | Auto-detected |
| `--video-batch-size` | Episodes per batch (video mode) | `1` |
| `--sample-interval` | Sampling interval in seconds | `1.0` |
| `--temperature` | Sampling temperature | `0.7` |
| `--output-dir` | Output directory | Auto-generated |
| `--push-to-hub` | Push to HuggingFace Hub | `False` |
### Output Files
The high-level annotation creates:
1. **`meta/tasks_high_level.parquet`**: DataFrame with high-level tasks
```python
# Structure:
# Index: task string (concatenated user_prompt | robot_utterance)
# Columns:
# - task_index: int64
# - user_prompt: string
# - robot_utterance: string
# - skill: string (associated subtask)
# - scenario_type: string
# - response_type: string
```
2. **`meta/syn_annotations.jsonl`**: Debug annotations (JSONL format)
```json
{"episode_id": 0, "timestamp": 1.5, "skill_current": "grasp object", "user_prompt": "Can you pick that up?", "robot_utterance": "Sure, I'll grasp it now", ...}
```
3. **`task_index_high_level` feature**: Added to each frame
- Type: `int64`
- Shape: `(1,)`
- Maps each frame to its high-level task
### Dialogue Types Generated
The system generates diverse interaction types:
**Scenario Types:**
- `specific_object`: "Pick up the red block"
- `negative_task`: "Don't touch the blue one"
- `situated_correction`: "Actually, move to the other box instead"
- `implicit_request`: "I need something red for the tower"
- `constraint_based`: "Make sure to handle it gently"
**Response Types:**
- `confirmation`: "OK, I'll pick it up"
- `clarification`: "Just to confirm, you want me to pick up the red block?"
- `acknowledgment`: "Got it, picking up the red block"
- `constraint_acknowledgment`: "Sure, I'll pick it up gently"
### Accessing High-Level Annotations
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
import pandas as pd
# Load annotated dataset
dataset = LeRobotDataset(repo_id="your/dataset_with_high_level_tasks")
# Get a frame
frame = dataset[100]
# Get the high-level task
task_idx = frame["task_index_high_level"].item()
# Load tasks metadata
tasks_df = pd.read_parquet(dataset.root / "meta" / "tasks_high_level.parquet")
task_row = tasks_df[tasks_df["task_index"] == task_idx].iloc[0]
print(f"User: {task_row['user_prompt']}")
print(f"Robot: {task_row['robot_utterance']}")
print(f"Skill: {task_row['skill']}")
# Use in a DataLoader
import torch
from torch.utils.data import DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
batch = next(iter(dataloader))
print(f"Task indices: {batch['task_index_high_level']}")
print(f"User prompts: {batch['user_prompt'][0]}")
print(f"Robot utterances: {batch['robot_utterance'][0]}")
```
## Complete Pipeline Example
Here's how to run both annotation stages:
```bash
#!/bin/bash
REPO_ID="your-username/your-dataset"
MODEL="Qwen/Qwen2-VL-7B-Instruct"
OUTPUT_DIR="/path/to/output"
# Step 1: Subtask Annotation
python src/lerobot/policies/pi05_full/annotate/subtask_annotate.py \
--repo-id "$REPO_ID" \
--video-key observation.images.base \
--model "$MODEL" \
--batch-size 8 \
--output-dir "${OUTPUT_DIR}/subtasks"
# Step 2: High-Level Annotation (Image Mode)
python src/lerobot/policies/pi05_full/annotate/high_level_annotate.py \
--data-dir "${OUTPUT_DIR}/subtasks" \
--model "$MODEL" \
--image-key observation.images.base \
--sample-interval 1.0 \
--output-dir "${OUTPUT_DIR}/final"
# Or Step 2: High-Level Annotation (Video Mode - Recommended)
python src/lerobot/policies/pi05_full/annotate/high_level_annotate.py \
--data-dir "${OUTPUT_DIR}/subtasks" \
--model "$MODEL" \
--video-mode \
--video-key observation.images.base \
--video-batch-size 4 \
--output-dir "${OUTPUT_DIR}/final"
```
## Performance Tips
### For Faster Processing
1. **Increase batch size**: Use `--batch-size 16` or higher (subtask annotation)
2. **Increase video batch size**: Use `--video-batch-size 8` (high-level annotation in video mode)
3. **Larger sampling interval**: Use `--sample-interval 5.0` for testing (samples every 5 seconds instead of 1)
4. **Use smaller models**: `Qwen/Qwen2-VL-2B-Instruct` is faster than `Qwen2-VL-7B-Instruct`
5. **Process specific episodes**: Use `--episodes 0 1 2 3` to annotate only a subset
### For Better Quality
1. **Use larger models**: `Qwen/Qwen3-VL-30B-A3B-Instruct` or `Qwen/Qwen2-VL-72B-Instruct`
2. **Use video mode**: Provides better temporal context
3. **Smaller sampling intervals**: `--sample-interval 0.5` for dense annotations
4. **Adjust temperature**: Use `--temperature 0.9` for more diverse dialogue
## Memory Requirements
| Model | GPU Memory | Recommended Batch Size |
|-------|------------|------------------------|
| Qwen2-VL-2B | ~8 GB | 16-32 |
| Qwen2-VL-7B | ~16 GB | 8-16 |
| Qwen2-VL-72B | ~80 GB | 1-2 |
| Qwen3-VL-30B | ~40 GB | 4-8 |
## Troubleshooting
### "FFmpeg not found"
```bash
# Install FFmpeg
sudo apt-get install ffmpeg # Ubuntu/Debian
brew install ffmpeg # macOS
```
### "CUDA out of memory"
- Reduce batch size: `--batch-size 1` or `--video-batch-size 1`
- Use smaller model: `Qwen/Qwen2-VL-2B-Instruct`
- Use CPU: `--device cpu` (much slower)
### "No skills.json found"
Run subtask annotation first before high-level annotation.
### "Video key not found"
List available keys:
```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset
dataset = LeRobotDataset(repo_id="your/dataset")
print("Video keys:", dataset.meta.video_keys)
print("Camera keys:", dataset.meta.camera_keys)
```
## Dataset Structure After Annotation
```
your_dataset_with_high_level_tasks/
├── meta/
│ ├── info.json # Original metadata
│ ├── tasks.parquet # Original tasks (preserved)
│ ├── subtasks.parquet # NEW: Subtask names and indices
│ ├── skills.json # NEW: Raw skill annotations with timestamps
│ ├── tasks_high_level.parquet # NEW: High-level tasks with dialogue
│ └── syn_annotations.jsonl # NEW: Debug annotations
├── data/
│ └── chunk-000/
│ ├── observation.images.base.mp4
│ ├── action.safetensors
│ ├── subtask_index.safetensors # NEW: Subtask per frame
│ └── task_index_high_level.safetensors # NEW: High-level task per frame
└── videos/
└── ...
```
## Citation
If you use these annotation tools in your research, please cite:
```bibtex
@article{lerobot2024,
title={LeRobot: State-of-the-art Machine Learning for Real-World Robotics},
author={LeRobot Contributors},
year={2024},
url={https://github.com/huggingface/lerobot}
}
```
## Next Steps
After annotation, you can:
1. Train hierarchical policies using the subtask and high-level annotations
2. Use the synthetic dialogue for instruction-following policy training
3. Analyze skill distributions and dialogue patterns
4. Share your annotated dataset on HuggingFace Hub with `--push-to-hub`
For training examples, see the [training documentation](../training/).