# Annotation Pipeline

`lerobot-annotate` populates the two language columns introduced by the [Language Columns and Recipes](./language_and_recipes) page, `language_persistent` and `language_events`, directly into `data/chunk-*/file-*.parquet`.

## What the pipeline produces

Three modules write into a per-episode staging tree, then a single writer rewrites the data shards in place:

| Style / atom                                | Column                | Module          |
| ------------------------------------------- | --------------------- | --------------- |
| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | `plan`          |
| `plan` (initial + refresh on interjection)  | `language_persistent` | `plan`          |
| `memory` (MEM-style compression)            | `language_persistent` | `plan`          |
| `task_aug` (rephrasings of canonical task)  | `language_persistent` | `plan`          |
| `interjection`                              | `language_events`     | `interjections` |
| speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections` |
| `vqa` (user / assistant pair)               | `language_events`     | `vqa`           |

The writer does **not** add a `tools` column to the parquet; the tool catalog lives at `meta/info.json["tools"]` instead (see [Tools](./tools)). After every annotation run the pipeline ensures the canonical `say` schema is present in that list, preserving any tools the user pre-declared.

If you want to declare additional tools for a dataset before annotation runs, edit `meta/info.json["tools"]` directly; the pipeline preserves anything already there. Implementations of those tools live under `src/lerobot/tools/`, one file per tool, registered via `TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.

## Running locally

Install the extra and invoke the console script. Episode-level concurrency comes from `--executor.episode_parallelism` (default 16); that is the only knob the in-process executor exposes.

```bash
uv sync --extra annotations
uv run lerobot-annotate \
  --root=/path/to/dataset \
  --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
```

The pipeline attaches actual camera footage to every `plan` / `interjections` / `vqa` prompt by default, decoded from the dataset's first `observation.images.*` stream. Override with `--vlm.camera_key=observation.images.` to pin a specific viewpoint. Datasets with no video tracks fall back to text-only prompts automatically.

**The `plan` module sees the whole episode as one video block.** Subtask decomposition gets a `{"type":"video", "video":[]}` block covering the entire demonstration; Qwen-VL pools temporally on its own and decides where to cut. There is no keyframe stride or count knob: `--plan.max_video_frames` (default 128) only caps the frames packed into the video block as a model-capacity bound.

The `interjections` module attaches a short window of frames straddling the interjection timestamp.

The `vqa` module grounds each VQA pair on a single frame. Its `--vqa.K` knob sets how many consecutive frames each emission tick anchors, and every anchored frame gets its own VQA pair on that one frame (there is no per-pair frame window).

## Running on Hugging Face Jobs

Distributed annotation is delegated to [Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo ships a launcher script you copy and edit for your dataset:

```bash
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
```

[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py) spawns one `h200x2` job that:

1. installs the branch under test plus the annotation extras,
2. boots two vLLM servers (one per GPU) for the chosen model,
3. runs the `plan` / `interjections` / `vqa` modules across the dataset via `lerobot-annotate`,
4. uploads the annotated dataset to `--push_to_hub`.

To target a different dataset, model, or hub repo, edit the `CMD` block inside the script. Every flag in there maps directly onto a CLI flag of `lerobot-annotate` (see `lerobot-annotate --help` for the full list).

## Style-to-recipe consumer mapping

The pipeline's outputs are designed to be consumed by recipes (see [Language Columns and Recipes](./language_and_recipes)), typically:

- Low-level / high-level / memory-update branches consume `subtask`/`plan`/`memory` from `language_persistent`.
- An interjection-response branch consumes `interjection` events plus the paired speech atom (merged into one assistant target turn via `tool_calls_from`) and the same-timestamp `plan` refresh.
- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs from `language_events`.

## Why the design splits state from events

Two things drive the scope:

1. **Persistent state vs exact-event split.** Persistent rows (`subtask`, `plan`, `memory`) broadcast per episode and answer "what state is in force at this frame?". Event rows (`interjection`, `vqa`, speech) only appear on the exact frame whose timestamp matches the emission. The pipeline writes timestamps taken straight from the source parquet, with no floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client (vLLM if available, transformers fallback) so the cost is one model load per dataset, not three.

## Module independence and staged reruns

Each module writes its raw output to `/.annotate_staging/episode_{N:06d}/.jsonl`. That makes prompt iteration cheap: re-running one module overwrites only its own JSONL file before the writer composes the final parquet. Modules can be disabled via `--plan.enabled=false` (and likewise `--interjections.enabled` / `--vqa.enabled`) to test them in isolation.

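The rerun behaviour described above can be illustrated with a small sketch. It is not the pipeline's code. The helper names (`staging_file`, `write_module_rows`, `read_rows`) and the row fields are hypothetical. The example also assumes that the staging tree sits under a caller-supplied directory and that each file is named after its module; the path template above does not confirm either detail.

```python
import json
from pathlib import Path


def staging_file(base_dir: Path, episode_index: int, module: str) -> Path:
    # ASSUMPTION: base_dir as the parent directory and "<module>.jsonl" as the
    # file name are illustrative guesses, not the documented layout.
    return base_dir / ".annotate_staging" / f"episode_{episode_index:06d}" / f"{module}.jsonl"


def write_module_rows(base_dir: Path, episode_index: int, module: str, rows: list[dict]) -> Path:
    path = staging_file(base_dir, episode_index, module)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "w" truncates this module's file only; sibling files are untouched.
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return path


def read_rows(path: Path) -> list[dict]:
    # One JSON object per non-empty line.
    with path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        # Hypothetical row fields, chosen only for the demonstration.
        write_module_rows(base, 0, "plan", [{"style": "subtask", "content": "first draft"}])
        vqa_path = write_module_rows(base, 0, "vqa", [{"style": "vqa", "content": "unchanged"}])
        # Re-running only the plan module replaces plan rows and nothing else.
        plan_path = write_module_rows(base, 0, "plan", [{"style": "subtask", "content": "second draft"}])
        print(read_rows(plan_path))
        print(read_rows(vqa_path))
```

Keeping one file per module per episode means a prompt tweak to a single module invalidates only that module's staged output, which is the property the section relies on.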
## Validation/report checks before final write

Before the writer runs, `StagingValidator` checks:

- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.

Errors abort the writer (`--skip_validation=true` overrides for debugging).

## Paper inspirations per module

- **`plan` module: subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417)) atom granularity ("pick up one piece of lettuce", "place bowl to box"); Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not what" detail.
- **`plan` module: memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596)) compression directive: keep only minimal relevant information; functional outcomes preserved, specific attributes dropped.
- **`interjections` module.** Hi Robot scenario taxonomy: negative task, situated correction, specific constraint, preference. Speech is a tool-call-only atom (`tool_calls=[{type:function, function:{name:"say", arguments:{text:...}}}]`).
- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693)) grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`, keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626)) multi-abstraction grounding. Pi0.7 also grounds answers across multiple abstraction levels.

Future maintainers should adjust the prompt templates in `src/lerobot/annotations/steerable_pipeline/prompts/` against these references rather than rewriting from scratch.

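To make the speech atom and the "no orphan speech / interjection pairs" check concrete, here is a minimal sketch. It does not reproduce `StagingValidator`. The row layout (`timestamp`, `style`, `content`, `tool_calls` keys), the helper names, and the rule that a pair is matched by identical timestamps are assumptions for illustration; only the `say` tool-call shape and `style=null` come from the text above.

```python
from typing import Any


def make_say_atom(timestamp: float, text: str) -> dict[str, Any]:
    # Tool-call-only atom: no style, no text content, one `say` call.
    return {
        "timestamp": timestamp,
        "style": None,
        "content": None,
        "tool_calls": [
            {"type": "function", "function": {"name": "say", "arguments": {"text": text}}}
        ],
    }


def is_say_atom(row: dict[str, Any]) -> bool:
    calls = row.get("tool_calls") or []
    return row.get("style") is None and any(
        call.get("function", {}).get("name") == "say" for call in calls
    )


def find_orphans(events: list[dict[str, Any]]) -> list[float]:
    # ASSUMPTION: an interjection and its speech atom share one exact timestamp.
    # Exact float comparison is deliberate, mirroring timestamps that are copied
    # from the source data rather than recomputed.
    interjection_ts = {r["timestamp"] for r in events if r.get("style") == "interjection"}
    speech_ts = {r["timestamp"] for r in events if is_say_atom(r)}
    # Symmetric difference: timestamps that have one half of the pair only.
    return sorted(interjection_ts ^ speech_ts)


if __name__ == "__main__":
    events = [
        {"timestamp": 2.0, "style": "interjection", "content": "not that bowl", "tool_calls": None},
        make_say_atom(2.0, "Okay, switching bowls."),
        {"timestamp": 5.0, "style": "interjection", "content": "slower please", "tool_calls": None},
    ]
    print(find_orphans(events))  # the interjection at 5.0 has no speech atom
```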
## Compute and list-size estimates

Per episode, the pipeline issues O(`max_steps`) `plan`-module calls, O(`max_interjections_per_episode`) `interjections`-module calls, and O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults (8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that is ~50 VLM calls per episode.

`language_persistent` per episode is tens of KB at most (parquet dictionary-encodes one entry per episode); `language_events` is empty on most frames and is bounded by the number of emissions, not `num_frames × num_emissions`.

## Reproducibility via seed and prompt hashes

`--seed` (default 1729) feeds the per-episode RNGs that select interjection timestamps and VQA question types. Combined with the deterministic prompt templates checked into `prompts/`, two runs at the same seed against the same dataset and the same model checkpoint produce byte-identical staging artifacts. Prompt edits are recorded by file hash; future tooling can pin expected `(seed, prompt_hash)` pairs into the dataset card.
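
The seeding idea can be sketched as below. This is an illustration under stated assumptions, not the pipeline's implementation. The derivation of a per-episode seed from `(seed, episode_index)`, the use of SHA-256 for the prompt hash, and the names `episode_rng`, `prompt_hash`, and `sample_episode` are all hypothetical. The question-type names are taken from the validation list earlier on this page.

```python
import hashlib
import random

QUESTION_TYPES = ["bbox", "keypoint", "count", "attribute", "spatial"]


def episode_rng(seed: int, episode_index: int) -> random.Random:
    # ASSUMPTION: hash the (seed, episode) pair so each episode gets its own
    # independent, reproducible stream regardless of processing order.
    digest = hashlib.sha256(f"{seed}:{episode_index}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def prompt_hash(template_bytes: bytes) -> str:
    # Any edit to the template bytes changes the recorded hash.
    return hashlib.sha256(template_bytes).hexdigest()


def sample_episode(seed: int, episode_index: int, frame_timestamps: list[float]) -> dict:
    rng = episode_rng(seed, episode_index)
    return {
        # Pick an existing frame timestamp instead of computing a new float.
        "interjection_timestamp": rng.choice(frame_timestamps),
        "question_type": rng.choice(QUESTION_TYPES),
    }


if __name__ == "__main__":
    timestamps = [i / 30 for i in range(900)]  # hypothetical 30 s episode at 30 fps
    first = sample_episode(1729, 0, timestamps)
    second = sample_episode(1729, 0, timestamps)
    print(first == second)  # same seed and episode give the same choices
    print(prompt_hash(b"Describe the next subtask."))
```

Deriving the stream per episode, rather than drawing from one shared generator, keeps the choices stable under parallel execution, where episodes may finish in any order.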