feat: language annotation pipeline (PR 2/3)

Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.

Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
  refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
  mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
  configurable cadence with one-retry JSON validation.

Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.

Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.

Branched from PR 1 (`feat/language-columns`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: Pepijn
Date: 2026-04-27 16:22:51 +02:00
parent 0b06790da0
commit a635a32290
33 changed files with 3409 additions and 0 deletions
@@ -33,6 +33,8 @@
       title: Using the Dataset Tools
     - local: language_and_recipes
       title: Language Columns and Recipes
+    - local: annotation_pipeline
+      title: Annotation Pipeline
     - local: streaming_video_encoding
       title: Streaming Video Encoding
   title: "Datasets"
@@ -0,0 +1,133 @@
# Annotation Pipeline
`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — directly into
`data/chunk-*/file-*.parquet`. There is no flavor namespace and no sidecar
file tree: multiple revisions of a dataset mean multiple dataset copies.
## What the pipeline produces
Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection) | `language_persistent` | Module 1 |
| `memory` (MEM-style compression) | `language_persistent` | Module 1 |
| `interjection` | `language_events` | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events` | Module 2 |
| `vqa` (user / assistant pair) | `language_events` | Module 3 |
The writer also adds a dataset-level `tools` column carrying the JSON schema
for the `say` tool call, and drops the legacy `subtask_index` column.
## How to run it locally or on SLURM
Install the extra and invoke the console script:
```bash
uv sync --extra annotations
uv run lerobot-annotate \
--root=/path/to/dataset \
--vlm.backend=transformers \
--vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
```
The executor picks `LocalPipelineExecutor` for small datasets and
`SlurmPipelineExecutor` for large ones based on
`--executor.auto_threshold` (default 32 episodes). Force local with
`--executor.force_local=true`. SLURM jobs honour `--executor.slurm_partition`,
`--executor.slurm_gpus`, and `--executor.slurm_time`.
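The selection rule reduces to a small predicate. A minimal sketch, assuming the threshold is inclusive (the real executor construction passes many more datatrove arguments than shown here):

```python
# Sketch of the auto-selection rule only; whether the threshold is
# inclusive, and the full executor kwargs, are assumptions.
def pick_executor(num_episodes: int,
                  auto_threshold: int = 32,
                  force_local: bool = False) -> str:
    """Return the datatrove executor class name the pipeline would use."""
    if force_local or num_episodes <= auto_threshold:
        return "LocalPipelineExecutor"
    return "SlurmPipelineExecutor"
```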
## Style-to-recipe consumer mapping
The pipeline produces exactly the styles consumed by
`src/lerobot/configs/recipes/pi05_hirobot.yaml`:
- `low_level_execution`, `high_level_subtask`, `memory_update` consume
`subtask`/`plan`/`memory` from `language_persistent`.
- `user_interjection_response` consumes `interjection` events plus the
paired speech atom (merged into one assistant target turn via
`tool_calls_from`) and the same-timestamp `plan` refresh.
- `ask_vqa` consumes the `(vqa, user)` and `(vqa, assistant)` pairs from
`language_events`.
## Why the design is scoped to the canonical recipe
Two things drive the scope:
1. **Persistent state vs exact-event split.** Persistent rows (`subtask`,
`plan`, `memory`) broadcast per episode and answer "what state is in
force at this frame?". Event rows (`interjection`, `vqa`, speech) only
appear on the exact frame whose timestamp matches the emission. The
pipeline writes timestamps taken straight from the source parquet — no
floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client
(vLLM if available, transformers fallback) so the cost is one model
load per dataset, not three.
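The backend split can be sketched as a try-import fallback. This is an assumption about how the shared client picks its backend; the real client class and its constructor are not shown in this page.

```python
# Sketch of the vLLM-preferred / transformers-fallback choice described
# above; the real shared-client implementation may select differently.
def make_vlm_backend() -> str:
    """Prefer vLLM when importable, otherwise fall back to transformers."""
    try:
        import vllm  # noqa: F401  (optional dependency from the extra)
        return "vllm"
    except ImportError:
        return "transformers"
```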
## Module independence and staged reruns
Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
test them in isolation.
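The staging layout above maps to a one-line path helper — a sketch of the documented convention, not the pipeline's actual code:

```python
from pathlib import Path

# Path convention from the staging-tree description above
# (<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl).
def staging_path(root: Path, episode_index: int, module: str) -> Path:
    return (root / ".annotate_staging"
                 / f"episode_{episode_index:06d}"
                 / f"{module}.jsonl")
```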
## Validation and report checks before the final write
Before the writer runs, `StagingValidator` checks:
- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.
Errors abort the writer (`--skip_validation=true` overrides for debugging).
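The first check — exact frame-timestamp alignment — is deliberately an equality test, since the pipeline copies timestamps from the source parquet rather than recomputing them. A minimal sketch, assuming staged events expose a flat list of timestamps:

```python
# Sketch of the exact-alignment check: == comparison, no float tolerance,
# per the no-recomputation rule. Input shapes are assumed.
def check_event_timestamps(event_ts: list[float],
                           frame_ts: list[float]) -> list[float]:
    """Return event timestamps that match no frame exactly (violations)."""
    frames = set(frame_ts)
    return [t for t in event_ts if t not in frames]
```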
## Paper inspirations per module
- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
atom granularity ("pick up one piece of lettuce", "place bowl to box");
Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
compression directive: keep only minimal relevant information; functional
outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
situated correction, specific constraint, preference. Speech is a
tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
keypoints) and Steerable Policies' multi-abstraction grounding.
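The tool-call shorthand in the interjection bullet expands to a structure like the one below. The example text is invented, and whether `arguments` is stored as a dict or a serialized JSON string is an assumption here:

```python
# Expanded form of the speech tool-call shorthand above. The utterance is
# an invented example; the arguments encoding (dict vs JSON string) is
# an assumption.
speech_atom = {
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "say",
                "arguments": {"text": "On it, grabbing one piece of lettuce."},
            },
        }
    ]
}
```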
Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.
## Compute and list-size estimates
Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is roughly 50 VLM calls per episode. `language_persistent` adds at most a
few tens of kB per episode (parquet dictionary-encodes the single entry
per episode); `language_events` is empty on most frames, so its size is
bounded by the number of emissions, not `num_frames × num_emissions`.
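The ~50-call figure can be checked back-of-envelope. The per-module breakdown below is an assumption; only the defaults (8 subtasks, 1 interjection, 1 Hz, 30 s) come from the text above:

```python
# Back-of-envelope reconstruction of the per-episode call estimate.
subtasks = 8           # Module 1: one call per subtask (default max_steps)
plan_and_memory = 10   # Module 1: plan init/refresh + memory updates (assumed)
interjections = 2      # Module 2: t=0 acknowledgement + one paired interjection
vqa = 1 * 30           # Module 3: vqa_emission_hz * episode_seconds
total = subtasks + plan_and_memory + interjections + vqa  # ~50
```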
## Reproducibility via seed and prompt hashes
`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
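A hashing sketch for the prompt-pinning idea. The file glob, hash algorithm, and digest length are all assumptions; the real pipeline may record hashes differently.

```python
import hashlib
from pathlib import Path

# Sketch of "prompt edits are recorded by file hash": a deterministic
# digest over the sorted prompt files. Glob pattern and truncation are
# assumptions.
def prompt_hash(prompts_dir: Path) -> str:
    h = hashlib.sha256()
    for path in sorted(prompts_dir.rglob("*.txt")):
        h.update(path.name.encode())   # bind content to file name
        h.update(path.read_bytes())
    return h.hexdigest()[:12]
```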