# Annotation Pipeline

`lerobot-annotate` populates the two language columns introduced on the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — directly into
`data/chunk-*/file-*.parquet`. There is no flavor namespace and no sidecar
file tree: multiple revisions of a dataset mean multiple dataset copies.
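
To see what that means on disk, a minimal post-run check might look like
this (a sketch: the concrete shard name is an example, and `pandas` is just
one convenient reader):

```python
import pandas as pd

# The language columns live in the data shards themselves; there is no
# sidecar file tree to look up. Shards follow data/chunk-*/file-*.parquet;
# the name below is one example.
df = pd.read_parquet("data/chunk-000/file-000.parquet")
assert "language_persistent" in df.columns
assert "language_events" in df.columns
```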

## What the pipeline produces

Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:

| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection) | `language_persistent` | Module 1 |
| `memory` (MEM-style compression) | `language_persistent` | Module 1 |
| `interjection` | `language_events` | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events` | Module 2 |
| `vqa` (user / assistant pair) | `language_events` | Module 3 |

The writer drops the legacy `subtask_index` column. It does **not** add a
`tools` column to the parquet — the tool catalog lives at
`meta/info.json["tools"]` instead (see [Tools](./tools)). After every
annotation run the pipeline ensures the canonical `say` schema is present
in that list, preserving any tools the user pre-declared. Chat-template
consumers read the catalog through `LeRobotDatasetMetadata.tools` and pass
it to `apply_chat_template(messages, tools=meta.tools, ...)`.

If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
anything already there, as sketched below. Implementations of those tools
live under `src/lerobot/tools/`, one file per tool, registered via
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
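
Both halves in one sketch: pre-declaring a tool, then consuming the catalog.
The `beep` tool is purely hypothetical, and the `LeRobotDatasetMetadata`
import path and tokenizer choice are assumptions:

```python
import json
from pathlib import Path

# 1) Pre-declare a custom tool before annotation. The pipeline merges by
#    function.name, so this entry survives and the canonical `say` schema
#    is added alongside it. `beep` is purely hypothetical.
info_path = Path("meta/info.json")
info = json.loads(info_path.read_text())
info.setdefault("tools", []).append(
    {
        "type": "function",
        "function": {
            "name": "beep",
            "description": "Emit a short beep.",
            "parameters": {"type": "object", "properties": {}},
        },
    }
)
info_path.write_text(json.dumps(info, indent=4))

# 2) Consume the catalog from a chat-template consumer.
from transformers import AutoTokenizer
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata  # import path assumed

meta = LeRobotDatasetMetadata("imstevenpmwork/super_poulain_draft")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # any chat model
messages = [{"role": "user", "content": "Pick up the lettuce."}]
prompt = tokenizer.apply_chat_template(messages, tools=meta.tools, tokenize=False)
```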

## How to run it locally or on SLURM

Install the extra and invoke the console script:

```bash
uv sync --extra annotations
uv run lerobot-annotate \
    --repo_id=imstevenpmwork/super_poulain_draft \
    --vlm.backend=vllm \
    --vlm.model_id=Qwen/Qwen3.6-27B-FP8 \
    --vlm.tensor_parallel_size=2
```

The pipeline attaches actual camera footage to every Module 1/2/3 prompt
by default, decoded from the dataset's first `observation.images.*`
stream. Override with `--vlm.camera_key=observation.images.<name>` to
pin a specific viewpoint. Datasets with no video tracks fall back to
text-only prompts automatically.

**Module 1 sees the whole episode as one video block.** Subtask
decomposition gets a `{"type": "video", "video": [<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--module_1.max_video_frames` (default 32) only caps the frames packed
into the video block as a model-capacity bound. Module 2 attaches a
single still frame at the interjection timestamp; Module 3 attaches the
exact emission frame to each VQA pair.
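
For concreteness, a sketch of that Module 1 payload in the Qwen-VL message
convention. The frame-decoding helper and the prompt constant are
hypothetical stand-ins:

```python
SUBTASK_PROMPT = "..."  # stands in for the real template under prompts/

def decode_episode_frames(camera_key: str, max_frames: int) -> list:
    """Hypothetical stand-in for decoding frames from the dataset's video stream."""
    return []

frames = decode_episode_frames("observation.images.top", max_frames=32)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frames},        # whole episode as one block
            {"type": "text", "text": SUBTASK_PROMPT},  # Qwen-VL decides where to cut
        ],
    }
]
```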
The executor picks `LocalPipelineExecutor` for small datasets and
`SlurmPipelineExecutor` for large ones based on
`--executor.auto_threshold` (default 32 episodes). Force local with
`--executor.force_local=true`. SLURM jobs honour `--executor.slurm_partition`,
`--executor.slurm_gpus`, and `--executor.slurm_time`.
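
The selection rule reduces to something like the following sketch (whether
the threshold itself counts as small is an assumption here):

```python
from dataclasses import dataclass

@dataclass
class ExecutorConfig:
    auto_threshold: int = 32   # --executor.auto_threshold
    force_local: bool = False  # --executor.force_local

def pick_executor(num_episodes: int, cfg: ExecutorConfig) -> str:
    # Small datasets run locally; large ones go to SLURM.
    if cfg.force_local or num_episodes < cfg.auto_threshold:
        return "LocalPipelineExecutor"
    return "SlurmPipelineExecutor"

assert pick_executor(8, ExecutorConfig()) == "LocalPipelineExecutor"
assert pick_executor(500, ExecutorConfig()) == "SlurmPipelineExecutor"
```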

## Style-to-recipe consumer mapping

The pipeline produces exactly the styles consumed by
`src/lerobot/configs/recipes/pi05_hirobot.yaml`:

- `low_level_execution`, `high_level_subtask`, and `memory_update` consume
  `subtask`/`plan`/`memory` from `language_persistent`.
- `user_interjection_response` consumes `interjection` events plus the
  paired speech atom (merged into one assistant target turn via
  `tool_calls_from`; see the sketch after this list) and the
  same-timestamp `plan` refresh.
- `ask_vqa` consumes the `(vqa, user)` and `(vqa, assistant)` pairs from
  `language_events`.
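
A sketch of that merged assistant target turn; the utterance text is
invented, while the tool-call shape follows the speech atom described on
this page:

```python
# The speech atom contributes no text content, only the `say` tool call,
# so the merged training target is a tool-call-only assistant turn.
assistant_turn = {
    "role": "assistant",
    "content": None,  # style=null speech atom carries no text
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "say",
                "arguments": {"text": "Okay, putting it back."},  # invented example
            },
        }
    ],
}
```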

## Why the design is scoped to the canonical recipe

Two things drive the scope:

1. **Persistent state vs exact-event split.** Persistent rows (`subtask`,
   `plan`, `memory`) broadcast per episode and answer "what state is in
   force at this frame?". Event rows (`interjection`, `vqa`, speech) only
   appear on the exact frame whose timestamp matches the emission. The
   pipeline writes timestamps taken straight from the source parquet — no
   floating-point recomputation. (See the sketch after this list.)
2. **One Qwen-VL pass.** All three modules share a single VLM client
   (vLLM if available, transformers fallback) so the cost is one model
   load per dataset, not three.
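
In pandas terms, the split plays out roughly as follows, assuming event-free
frames are stored as nulls. The shard name is an example, and `episode_index`
is the standard LeRobot frame column:

```python
import pandas as pd

# Persistent rows broadcast: every frame of an episode carries the same
# language_persistent entry. Event rows exist only on the exact frames
# whose timestamps match an emission.
df = pd.read_parquet("data/chunk-000/file-000.parquet")
ep = df[df["episode_index"] == 0]
event_frames = ep[ep["language_events"].notna()]
print(f"{len(event_frames)} event frames out of {len(ep)} total")
```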

## Module independence and staged reruns

Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
test them in isolation.
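
For example, the staging file for episode 3 of Module 2 resolves as below.
The dataset root and the exact per-module filename are assumptions; only
the `.annotate_staging/episode_{N:06d}` pattern is documented here:

```python
from pathlib import Path

root = Path("/data/lerobot/super_poulain_draft")  # dataset root, assumed
staging_file = root / ".annotate_staging" / f"episode_{3:06d}" / "module_2.jsonl"
print(staging_file)
# /data/lerobot/super_poulain_draft/.annotate_staging/episode_000003/module_2.jsonl
```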

## Validation/report checks before final write

Before the writer runs, `StagingValidator` checks:

- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
  bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`
  (see the sketch after this list).

Errors abort the writer (`--skip_validation=true` overrides for debugging).
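
The routing contract of that last check can be reconstructed from the style
table at the top of this page. This is a sketch that mirrors the documented
behaviour, not the shipped helper:

```python
PERSISTENT_STYLES = {"subtask", "plan", "memory"}
EVENT_STYLES = {"interjection", "vqa", None}  # None = speech tool-call atom

def column_for_style(style: str | None) -> str:
    if style in PERSISTENT_STYLES:
        return "language_persistent"
    if style in EVENT_STYLES:
        return "language_events"
    raise ValueError(f"unknown style: {style!r}")

assert column_for_style("plan") == "language_persistent"
assert column_for_style(None) == "language_events"
```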

## Paper inspirations per module

- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
  atom granularity ("pick up one piece of lettuce", "place bowl to box");
  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
  what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
  compression directive: keep only minimal relevant information; functional
  outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
  situated correction, specific constraint, preference. Speech is a
  tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
  arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`
  order, keypoints) and Steerable Policies' multi-abstraction grounding.

Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.

## Compute and list-size estimates

Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` costs tens of KB per
episode at most (parquet dictionary-encodes the one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.

## Reproducibility via seed and prompt hashes

`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
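
A sketch of the bookkeeping that pinning implies. The hash algorithm and
the template file extension are assumptions; only the `prompts/` location
is documented above:

```python
import hashlib
from pathlib import Path

# One content hash per prompt template; a (seed, prompt_hash) pair is then
# enough to assert that two runs used identical inputs.
prompt_dir = Path("src/lerobot/annotations/steerable_pipeline/prompts")
prompt_hashes = {
    p.name: hashlib.sha256(p.read_bytes()).hexdigest()
    for p in sorted(prompt_dir.glob("*.txt"))  # extension assumed
}
pin = {"seed": 1729, "prompt_hashes": prompt_hashes}
```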