Files
lerobot/docs/source/annotation_pipeline.mdx
Pepijn dad2cf1178 refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch
The executor previously claimed it would "optionally hand off" to
datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it
already runs phases inline in every code path, and HF Jobs (see
``examples/annotation/run_hf_job.py``) is the actual distribution
strategy. Stop pretending we have an executor selector.

* `executor.py`: drop `select_executor_class`, the "kind" log line, and
  the references to LocalPipelineExecutor / SlurmPipelineExecutor.
  Module docstring now says distribution is delegated to HF Jobs.
* `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`,
  `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only
  `episode_parallelism`. While here, prune the longer "why" docstrings
  on every field down to the load-bearing bits — full story moves to
  `docs/source/annotation_pipeline.mdx`.
* `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the
  `[annotations]` extra; the dep was only there for the (never used)
  cluster executors. Comment block notes the new HF-Jobs delegation.
* `reader.py`, `lerobot_annotate.py`: drop their own datatrove /
  flavor-namespace mentions.
* `docs/source/annotation_pipeline.mdx`:
  - remove the flavor-namespace / sidecar paragraph (out of scope —
    "multiple revisions = multiple copies" is dataset-level policy);
  - remove the "writer drops the legacy `subtask_index` column" note
    (already covered by PR 1's intentional-break call-out);
  - remove the chat-template + `apply_chat_template(messages, tools=...)`
    line (covered by Tools doc);
  - replace the "executor picks Local vs Slurm" paragraph with
    `--executor.episode_parallelism` and a pointer to HF Jobs;
  - rewrite the style→recipe section to talk about "recipes" generically
    instead of pinning a specific YAML;
  - add a "Running on Hugging Face Jobs" section pointing at
    `examples/annotation/run_hf_job.py`;
  - add a "Running locally" example matching the CLI's docstring
    (`uv run lerobot-annotate --root=... --vlm.model_id=...`);
  - extend the paper-inspirations list with Pi0.7 and Steerable VLA
    Policies (Zhao 2025) for Module 3.

Tests: same 3 pre-existing failures as before this commit (2 module
assertions still in flight; 1 carryover from PR 1). 41/44 pass.
Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00

# Annotation Pipeline
`lerobot-annotate` populates the two language columns introduced on the
[Language Columns and Recipes](./language_and_recipes) page,
`language_persistent` and `language_events`, directly into
`data/chunk-*/file-*.parquet`.
## What the pipeline produces
Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection) | `language_persistent` | Module 1 |
| `memory` (MEM-style compression) | `language_persistent` | Module 1 |
| `task_aug` (rephrasings of canonical task) | `language_persistent` | Module 1 |
| `interjection` | `language_events` | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events` | Module 2 |
| `vqa` (user / assistant pair) | `language_events` | Module 3 |
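
For orientation, here is a minimal sketch of inspecting the two columns in one
annotated shard. It assumes the columns read back as ordinary parquet columns
via pandas; the exact per-entry encoding (struct vs. serialized JSON) is decided
by the writer and not shown here, and the shard path is hypothetical.

```python
# Sketch only: peek at the two language columns in one annotated data shard.
# The shard path and the per-entry encoding are assumptions; the writer defines
# the authoritative schema.
import pandas as pd

shard = "/path/to/dataset/data/chunk-000/file-000.parquet"  # hypothetical shard
df = pd.read_parquet(shard, columns=["timestamp", "language_persistent", "language_events"])

# Persistent state broadcasts per episode, so most frames repeat one value.
print(df["language_persistent"].dropna().iloc[0])

# Event rows are sparse: only frames whose timestamp matches an emission carry one.
events = df[df["language_events"].notna()]
print(f"{len(events)} event-bearing frames out of {len(df)}")
```
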
The writer does **not** add a `tools` column to the parquet — the tool
catalog lives at `meta/info.json["tools"]` instead (see
[Tools](./tools)). After every annotation run the pipeline ensures the
canonical `say` schema is present in that list, preserving any tools the
user pre-declared.
If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
anything already there. Implementations of those tools live under
`src/lerobot/tools/`; one file per tool, registered via
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
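
If you script that pre-declaration, a minimal sketch might look like the
following. The OpenAI-style function schema and the `point_at` tool name are
assumptions for illustration only; the canonical entry format is defined in the
[Tools](./tools) doc.

```python
# Sketch only: pre-declare an extra tool before running the annotation pipeline.
# The entry shape below (OpenAI-style function schema) and the tool name are
# assumptions; check the Tools doc for the format the pipeline expects.
import json
from pathlib import Path

info_path = Path("/path/to/dataset/meta/info.json")
info = json.loads(info_path.read_text())

info.setdefault("tools", []).append(
    {
        "type": "function",
        "function": {
            "name": "point_at",  # hypothetical tool name
            "description": "Point at an object in the scene.",
            "parameters": {
                "type": "object",
                "properties": {"object": {"type": "string"}},
                "required": ["object"],
            },
        },
    }
)
info_path.write_text(json.dumps(info, indent=2))
```
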
## Running locally
Install the extra and invoke the console script. Episode-level
concurrency comes from `--executor.episode_parallelism` (default 16);
that is the only knob the in-process executor exposes.
```bash
uv sync --extra annotations
uv run lerobot-annotate \
--root=/path/to/dataset \
--vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
```
The pipeline attaches actual camera footage to every Module 1/2/3 prompt
by default, decoded from the dataset's first `observation.images.*`
stream. Override with `--vlm.camera_key=observation.images.<name>` to
pin a specific viewpoint. Datasets with no video tracks fall back to
text-only prompts automatically.
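
A sketch of how that default stream could be resolved from dataset metadata,
assuming `meta/info.json` exposes a `features` mapping keyed by
`observation.images.*` (the pipeline's own selection code is authoritative):

```python
# Sketch only: pick a default camera stream when --vlm.camera_key is not set.
# Mirrors the documented behaviour (first observation.images.* stream); the
# "features" lookup is an assumption about the dataset metadata layout.
import json
from pathlib import Path

info = json.loads(Path("/path/to/dataset/meta/info.json").read_text())
camera_keys = sorted(k for k in info["features"] if k.startswith("observation.images."))
camera_key = camera_keys[0] if camera_keys else None  # None -> text-only prompts
print(camera_key)
```
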
**Module 1 sees the whole episode as one video block.** Subtask
decomposition gets a `{"type":"video", "video":[<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--module_1.max_video_frames` (default 128) only caps the frames packed
into the video block as a model-capacity bound. Module 2 attaches a
short window of frames around the interjection timestamp; Module 3
attaches the exact emission frame to each VQA pair.
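
As a rough sketch of that Module 1 prompt shape (the real prompt text lives in
`prompts/`, and how frames are reduced to the cap, truncation versus uniform
subsampling, is an implementation detail assumed here):

```python
# Sketch only: whole-episode video block for Module 1, capped at
# --module_1.max_video_frames. The uniform subsampling below is an assumption;
# the doc only guarantees a cap, not a selection strategy.
MAX_VIDEO_FRAMES = 128  # mirrors the documented default

def build_module_1_messages(episode_frames: list, task: str) -> list[dict]:
    # Cap the frame count as a model-capacity bound.
    if len(episode_frames) > MAX_VIDEO_FRAMES:
        stride = len(episode_frames) / MAX_VIDEO_FRAMES
        episode_frames = [episode_frames[int(i * stride)] for i in range(MAX_VIDEO_FRAMES)]
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": episode_frames},
                {"type": "text", "text": f"Decompose the demonstration of '{task}' into subtasks."},
            ],
        }
    ]
```
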
## Running on Hugging Face Jobs
Distributed annotation is delegated to
[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
ships a launcher script you copy and edit for your dataset:
```bash
HF_TOKEN=hf_... uv run python examples/annotation/run_hf_job.py
```
[`examples/annotation/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotation/run_hf_job.py)
spawns one `h200x2` job that:
1. installs the branch under test plus the annotation extras,
2. boots two vLLM servers (one per GPU) for the chosen model,
3. runs Modules 1 / 2 / 3 across the dataset via `lerobot-annotate`,
4. uploads the annotated dataset to `--push_to_hub`.
To target a different dataset, model, or hub repo, edit the `CMD` block
inside the script — every flag in there maps directly onto a CLI flag of
`lerobot-annotate` (see `lerobot-annotate --help` for the full list).
## Style-to-recipe consumer mapping
The pipeline's outputs are designed to be consumed by recipes (see
[Language Columns and Recipes](./language_and_recipes)) — typically:
- Low-level / high-level / memory-update branches consume
`subtask`/`plan`/`memory` from `language_persistent`.
- An interjection-response branch consumes `interjection` events plus
the paired speech atom (merged into one assistant target turn via
`tool_calls_from`) and the same-timestamp `plan` refresh.
- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
from `language_events`.
## Why the design splits state from events
Two things drive the scope:
1. **Persistent state vs exact-event split.** Persistent rows
(`subtask`, `plan`, `memory`) broadcast per episode and answer "what
state is in force at this frame?". Event rows (`interjection`, `vqa`,
speech) only appear on the exact frame whose timestamp matches the
emission. The pipeline writes timestamps taken straight from the
source parquet — no floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client
(vLLM if available, transformers fallback) so the cost is one model
load per dataset, not three.
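
The shared-client fallback in point 2 amounts to a try-import, roughly like the
sketch below; the class and constructor choices here are illustrative, not the
pipeline's actual client API.

```python
# Sketch only: "vLLM if available, transformers fallback" client selection.
# Names are illustrative; the pipeline wraps its own client around whichever
# backend is importable.
def make_vlm_client(model_id: str):
    try:
        from vllm import LLM  # fast path: one shared vLLM engine for all modules
        return LLM(model=model_id)
    except ImportError:
        from transformers import AutoModelForVision2Seq, AutoProcessor
        processor = AutoProcessor.from_pretrained(model_id)
        model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")
        return model, processor
```
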
## Module independence and staged reruns
Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
test them in isolation.
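
A minimal sketch of peeking at one episode's staged output before the writer
runs; the per-module filename (`module_1.jsonl`) and the record fields are
guesses based on the layout described above.

```python
# Sketch only: inspect one episode's staged module output. The filename and
# record fields are assumptions about the staging layout described above.
import json
from pathlib import Path

staging = Path("/path/to/dataset/.annotate_staging/episode_000000")
for line in (staging / "module_1.jsonl").read_text().splitlines():
    record = json.loads(line)
    print(record.get("style"), record.get("timestamp"))
```
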
## Validation/report checks before final write
Before the writer runs, `StagingValidator` checks:
- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.
Errors abort the writer (`--skip_validation=true` overrides for debugging).
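
The first bullet, exact frame-timestamp alignment, is the cheapest to reason
about; a sketch of its spirit follows (field names are placeholders, the real
checks live in `StagingValidator`):

```python
# Sketch only: the spirit of the exact frame-timestamp alignment check.
# Field names in `events` are placeholders, not the validator's actual schema.
import pandas as pd

def check_event_alignment(shard_path: str, events: list[dict]) -> list[dict]:
    frame_timestamps = set(pd.read_parquet(shard_path, columns=["timestamp"])["timestamp"])
    # Timestamps are copied verbatim from the source parquet, so exact set
    # membership is the right test; no float tolerance needed.
    return [e for e in events if e["timestamp"] not in frame_timestamps]

# Any returned events would abort the writer (unless --skip_validation=true).
```
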
## Paper inspirations per module
- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
atom granularity ("pick up one piece of lettuce", "place bowl to box");
Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
compression directive: keep only minimal relevant information; functional
outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
situated correction, specific constraint, preference. Speech is a
tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
multi-abstraction grounding. Pi0.7 also grounds answers across
multiple abstraction levels.
Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.
## Compute and list-size estimates
Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` per episode is tens of
KB at most (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.
## Reproducibility via seed and prompt hashes
`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
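
A sketch of what such a pin could look like; the sha256-per-template-file scheme
is an assumption, only the idea of recording `(seed, prompt_hash)` pairs comes
from the pipeline.

```python
# Sketch only: record a (seed, prompt_hash) pin for a run. The hashing scheme
# (sha256 over each template file) is an assumption, not the pipeline's own.
import hashlib
from pathlib import Path

SEED = 1729  # the documented default --seed

def prompt_hashes(prompts_dir: str) -> dict[str, str]:
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(prompts_dir).glob("*"))
        if p.is_file()
    }

pin = {"seed": SEED, "prompt_hashes": prompt_hashes("src/lerobot/annotations/steerable_pipeline/prompts")}
```
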