# Annotation Pipeline

`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — directly into
`data/chunk-*/file-*.parquet`. There is no flavor namespace and no sidecar
file tree: multiple revisions of a dataset mean multiple dataset copies.
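
After a run, the quickest sanity check is to read a frame back and inspect
the two columns. A minimal sketch, assuming the standard `LeRobotDataset`
access pattern and the current package layout for the import:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("imstevenpmwork/super_poulain_draft")
frame = ds[0]
print(frame["language_persistent"])  # broadcast state: subtask / plan / memory
print(frame["language_events"])      # empty on most frames, filled at emissions
```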

## What the pipeline produces

Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:

| Style / atom                                | Column                | Module   |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection)  | `language_persistent` | Module 1 |
| `memory` (MEM-style compression)            | `language_persistent` | Module 1 |
| `interjection`                              | `language_events`     | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events`     | Module 2 |
| `vqa` (user / assistant pair)               | `language_events`     | Module 3 |

The writer drops the legacy `subtask_index` column. It does **not** add a
`tools` column to the parquet — the tool catalog lives at
`meta/info.json["tools"]` instead (see [Tools](./tools)). After every
annotation run the pipeline ensures the canonical `say` schema is present
in that list, preserving any tools the user pre-declared. Chat-template
consumers read the catalog through `LeRobotDatasetMetadata.tools` and pass
it to `apply_chat_template(messages, tools=meta.tools, ...)`.
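
A consumer might look like the following sketch; the processor checkpoint
is a placeholder, `messages` stands in for whatever turn list the recipe
builds, and the import path is an assumption about the current layout:

```python
from transformers import AutoProcessor
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata

meta = LeRobotDatasetMetadata("imstevenpmwork/super_poulain_draft")
processor = AutoProcessor.from_pretrained("<your-chat-template-model>")  # placeholder

messages = [{"role": "user", "content": "Pick up one piece of lettuce."}]
prompt = processor.apply_chat_template(
    messages,
    tools=meta.tools,  # the meta/info.json["tools"] catalog
    add_generation_prompt=True,
    tokenize=False,
)
```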

If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
anything already there. Implementations of those tools live under
`src/lerobot/tools/`; one file per tool, registered via
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
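
A sketch of such an edit, assuming the catalog uses the OpenAI-style
function schema that chat templates expect (the `grasp` tool and the
dataset root are made-up examples):

```python
import json
from pathlib import Path

dataset_root = Path("/path/to/dataset")  # local copy of the dataset
info_path = dataset_root / "meta" / "info.json"

info = json.loads(info_path.read_text())
info.setdefault("tools", []).append({
    "type": "function",
    "function": {
        "name": "grasp",  # hypothetical custom tool
        "description": "Close the gripper on the named object.",
        "parameters": {
            "type": "object",
            "properties": {"object": {"type": "string"}},
            "required": ["object"],
        },
    },
})
info_path.write_text(json.dumps(info, indent=4))
```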

## How to run it locally or on SLURM

Install the extra and invoke the console script:

```bash
uv sync --extra annotations
uv run lerobot-annotate \
    --repo_id=imstevenpmwork/super_poulain_draft \
    --vlm.backend=vllm \
    --vlm.model_id=Qwen/Qwen3.6-27B-FP8 \
    --vlm.tensor_parallel_size=2
```

The pipeline attaches actual camera footage to every Module 1/2/3 prompt
by default, decoded from the dataset's first `observation.images.*`
stream. Override with `--vlm.camera_key=observation.images.<name>` to
pin a specific viewpoint. Datasets with no video tracks fall back to
text-only prompts automatically.
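
The selection logic amounts to something like this sketch (the helper name
and the features-dict argument are illustrative, not the actual internals):

```python
def pick_camera_key(features: dict, override: str | None = None) -> str | None:
    """First observation.images.* stream unless --vlm.camera_key pins one."""
    if override is not None:
        return override
    keys = [k for k in features if k.startswith("observation.images.")]
    return keys[0] if keys else None  # None -> text-only prompts
```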

**Module 1 sees the whole episode as one video block.** Subtask
decomposition gets a `{"type":"video", "video":[<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--module_1.max_video_frames` (default 32) only caps the frames packed
into the video block as a model-capacity bound. Module 2 attaches a
single still frame at the interjection timestamp; Module 3 attaches the
exact emission frame to each VQA pair.
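
In message form, a minimal sketch of the Module 1 payload, following the
Qwen-VL multimodal message convention (the uniform frame thinning shown
here is an assumption about how the cap is applied):

```python
def build_module_1_messages(frames: list, prompt: str, max_video_frames: int = 32):
    # Capacity cap, not a stride knob: thin uniformly only when over budget.
    if len(frames) > max_video_frames:
        step = len(frames) / max_video_frames
        frames = [frames[int(i * step)] for i in range(max_video_frames)]
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": frames},  # whole episode, one block
            {"type": "text", "text": prompt},
        ],
    }]
```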

The executor picks `LocalPipelineExecutor` for small datasets and
`SlurmPipelineExecutor` for large ones based on
`--executor.auto_threshold` (default 32 episodes). Force local with
`--executor.force_local=true`. SLURM jobs honour `--executor.slurm_partition`,
`--executor.slurm_gpus`, and `--executor.slurm_time`.
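
The dispatch reduces to a comparison against the threshold. A sketch,
assuming datatrove's executors and a config object whose attribute names
mirror the CLI flags (the exact constructor arguments are illustrative):

```python
from datatrove.executor import LocalPipelineExecutor, SlurmPipelineExecutor

def pick_executor(num_episodes: int, cfg, pipeline: list):
    if cfg.force_local or num_episodes <= cfg.auto_threshold:  # default 32
        return LocalPipelineExecutor(pipeline=pipeline)
    return SlurmPipelineExecutor(
        pipeline=pipeline,
        partition=cfg.slurm_partition,  # --executor.slurm_partition
        time=cfg.slurm_time,            # --executor.slurm_time
        tasks=num_episodes,             # one task per episode, illustrative
    )
```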

## Style-to-recipe consumer mapping

The pipeline produces exactly the styles consumed by
`src/lerobot/configs/recipes/pi05_hirobot.yaml`:

- `low_level_execution`, `high_level_subtask`, `memory_update` consume
  `subtask`/`plan`/`memory` from `language_persistent`.
- `user_interjection_response` consumes `interjection` events plus the
  paired speech atom (merged into one assistant target turn via
  `tool_calls_from`) and the same-timestamp `plan` refresh.
- `ask_vqa` consumes the `(vqa, user)` and `(vqa, assistant)` pairs from
  `language_events`.

## Why the design is scoped to the canonical recipe

Two things drive the scope:

1. **Persistent state vs exact-event split.** Persistent rows (`subtask`,
   `plan`, `memory`) broadcast per episode and answer "what state is in
   force at this frame?". Event rows (`interjection`, `vqa`, speech) only
   appear on the exact frame whose timestamp matches the emission. The
   pipeline writes timestamps taken straight from the source parquet — no
   floating-point recomputation (see the sketch after this list).
2. **One Qwen-VL pass.** All three modules share a single VLM client
   (vLLM if available, transformers fallback) so the cost is one model
   load per dataset, not three.
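
The timestamp rule from point 1 in code: copy the frame timestamp out of
the source shard rather than recomputing `index / fps` in floating point.
A sketch (the shard path and frame index are examples):

```python
import pyarrow.parquet as pq

table = pq.read_table("data/chunk-000/file-000.parquet", columns=["timestamp"])
timestamps = table.column("timestamp").to_pylist()

emission_frame_index = 87                    # frame chosen by Module 2/3
event_ts = timestamps[emission_frame_index]  # copied verbatim, never recomputed
```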

## Module independence and staged reruns

Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
test them in isolation.
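
The layout in code, as a sketch (the helper name is illustrative):

```python
from pathlib import Path

def staging_path(root: Path, episode_index: int, module: str) -> Path:
    return root / ".annotate_staging" / f"episode_{episode_index:06d}" / f"{module}.jsonl"

staging_path(Path("/path/to/dataset"), 12, "module_2")
# -> /path/to/dataset/.annotate_staging/episode_000012/module_2.jsonl
```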

## Validation/report checks before final write

Before the writer runs, `StagingValidator` checks:

- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
  bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.

Errors abort the writer (`--skip_validation=true` overrides for debugging).
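
The VQA shape check might look like this sketch; the discriminating keys
are assumptions about the five JSON shapes, not the validator's actual
schema:

```python
import json

# One required key per shape; bbox values are pixel [x_min, y_min, x_max, y_max].
SHAPE_KEYS = ("bbox", "keypoints", "count", "attribute", "relation")

def vqa_content_is_valid(content: str) -> bool:
    try:
        payload = json.loads(content)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and any(k in payload for k in SHAPE_KEYS)
```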

## Paper inspirations per module

- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
  atom granularity ("pick up one piece of lettuce", "place bowl to box");
  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
  what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
  compression directive: keep only minimal relevant information; functional
  outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
  situated correction, specific constraint, preference. Speech is a
  tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
  arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
  keypoints) and Steerable Policies' multi-abstraction grounding.

Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.

## Compute and list-size estimates

Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` per episode is tens of
KB at most (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.

## Reproducibility via seed and prompt hashes

`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
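
The hashes are straightforward to reproduce externally. A sketch, assuming
SHA-256 over the raw template bytes (the digest algorithm and the record
shape are assumptions, not documented contracts):

```python
import hashlib
from pathlib import Path

def prompt_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

prompts_dir = Path("src/lerobot/annotations/steerable_pipeline/prompts")
record = {
    p.name: prompt_hash(p) for p in sorted(prompts_dir.glob("*")) if p.is_file()
}

# Pin (seed, prompt_hash) pairs, e.g. into the dataset card:
pinned = {"seed": 1729, "prompt_hashes": record}
```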
|