Merge branch 'feat/language-annotation-pipeline' into feat/smolvla-on-steerable

Resolves conflicts from 66 commits on the base branch:

* pyproject.toml: keep base's transformers>=5.4.0,<5.6.0; add the sentencepiece-dep entry pi052 (FAST action tokenizer) needs.
* policies/__init__.py: keep pi052 export; drop the RewardClassifierConfig export that base removed.
* policies/factory.py: docstring list resolution (keep pi052; drop reward_classifier, removed by base).
* annotations/steerable_pipeline/executor.py: adopt base's renamed _ensure_annotation_metadata_in_info (it already advertises the say tool); drop pi052's older _ensure_tools_in_info call.
* configs/train.py: keep pi052's vqa_target_fraction; adopt base's SampleWeightingConfig (legacy RA-BC inline params already covered by the migration shim base added).
* scripts/lerobot_train.py: merge pi052's per-policy processor rebuild + dataset_repo_id pass-through with base's active_cfg / is_reward_model_training tightening, and re-route vqa-weighted sampler to active_cfg.drop_n_last_frames.
* datasets/language_render.py: adopt base's _select_one + timestamp tolerance (drops pi052's stale _select_latest / per-style sort_key).
* tests: adopt base's parametrized per-camera blend + tolerance test; drop pi052 tests that overlap with base's tighter rewrites; keep pi052's flow-only / VQA-blend coverage; add a test_canonical_recipe_loads check on subtask_mem_vqa_speech.yaml.
* policies/pi052/processor_pi052.py: import RenderMessagesStep directly from render_messages_processor (base intentionally dropped it from lerobot.processor's re-exports).
* uv.lock: regenerated cleanly from base + pi052's pocket-tts / beartype.

All 67 touched tests pass (30 pi052 + 37 recipe / language-render / pipeline / render-messages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-07-23 01:41:54 +00:00 · 2026-05-25 14:47:09 +02:00
parent 67bdf4690e 471b2b1b1d
commit 1ff10b935c
210 changed files with 14334 additions and 5728 deletions
@@ -3,31 +3,44 @@
 `lerobot-annotate` populates the two language columns introduced by the
 [Language Columns and Recipes](./language_and_recipes) page —
 `language_persistent` and `language_events` — directly into
-`data/chunk-*/file-*.parquet`. There is no flavor namespace and no sidecar
-file tree: multiple revisions of a dataset mean multiple dataset copies.
+`data/chunk-*/file-*.parquet`.

 ## What the pipeline produces

-Three modules write into a per-episode staging tree, then a single writer
+A vocabulary-discovery phase first derives a small canonical vocabulary; three
+modules then write into a per-episode staging tree, and a single writer
 rewrites the data shards in place:

-| Style / atom                                | Column                | Module   |
-| ------------------------------------------- | --------------------- | -------- |
-| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | Module 1 |
-| `plan` (initial + refresh on interjection)  | `language_persistent` | Module 1 |
-| `memory` (MEM-style compression)            | `language_persistent` | Module 1 |
-| `interjection`                              | `language_events`     | Module 2 |
-| speech tool-call atom (`style=null`, `say`) | `language_events`     | Module 2 |
-| `vqa` (user / assistant pair)               | `language_events`     | Module 3 |
+| Style / atom                                | Column                | Module          |
+| ------------------------------------------- | --------------------- | --------------- |
+| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | `plan`          |
+| `plan` (initial + refresh on interjection)  | `language_persistent` | `plan`          |
+| `memory` (MEM-style compression)            | `language_persistent` | `plan`          |
+| `task_aug` (rephrasings of canonical task)  | `language_persistent` | `plan`          |
+| `interjection`                              | `language_events`     | `interjections` |
+| speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections` |
+| `vqa` (user / assistant pair)               | `language_events`     | `vqa`           |

-The writer drops the legacy `subtask_index` column. It does **not** add a
-`tools` column to the parquet — the tool catalog lives at
-`meta/info.json["tools"]` instead (see [Tools](./tools)). After every
-annotation run the pipeline ensures the canonical `say` schema is
-present in that list, preserving any tools the user pre-declared. Chat-
-template consumers read the catalog through
-`LeRobotDatasetMetadata.tools` and pass it to
-`apply_chat_template(messages, tools=meta.tools, ...)`.
+The `plan` module is constrained to a **canonical vocabulary** discovered
+once per dataset by the `vocabulary` module (phase 0). It watches a few
+sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
+asks the VLM to derive a small set of imperative subtask labels and
+first-person memory milestones that recur across the demos. The VLM
+picks the right number of entries itself based on what it sees in the
+clips — short pick-and-place demos get ~6 subtask labels, longer
+multi-step recipes get more. The result lands at
+`meta/canonical_vocabulary.json` (human-readable / hand-editable) and
+is reused on every subsequent run. The `plan` module then constrains
+both subtask and memory generation to those exact strings, so the
+downstream low-level policy sees a small, repeatable target
+distribution instead of thousands of LLM paraphrases. Disable with
+`--vocabulary.enabled=False` to fall back to free-form generation.
+
+The writer does **not** add a `tools` column to the parquet — the tool
+catalog lives at `meta/info.json["tools"]` instead (see
+[Tools](./tools)). After every annotation run the pipeline ensures the
+canonical `say` schema is present in that list, preserving any tools the
+user pre-declared.

 If you want to declare additional tools for a dataset before annotation
 runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
@@ -35,63 +48,89 @@ anything already there. Implementations of those tools live under
 `src/lerobot/tools/`; one file per tool, registered via
 `TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
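
The "ensure `say` is present, preserve everything else" behaviour described above can be pictured as an idempotent merge. The sketch below is illustrative only: `SAY_TOOL` and `ensure_say_tool` are hypothetical names, and the schema body is an assumed OpenAI-style function schema, not the canonical one shipped in the repo.

```python
import copy

# Hypothetical stand-in for the canonical `say` schema (assumption, not the repo's).
SAY_TOOL = {
    "type": "function",
    "function": {
        "name": "say",
        "description": "Speak a short utterance to the user.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}


def ensure_say_tool(info: dict) -> dict:
    """Return a copy of `info` whose "tools" list contains a `say` entry.

    Tools that are already declared are kept untouched and in order.
    """
    merged = copy.deepcopy(info)
    tools = merged.setdefault("tools", [])
    declared = {tool.get("function", {}).get("name") for tool in tools}
    if "say" not in declared:
        tools.append(copy.deepcopy(SAY_TOOL))
    return merged


# A user-declared tool survives, and `say` is appended once.
info = {"tools": [{"type": "function", "function": {"name": "wave"}}]}
print([t["function"]["name"] for t in ensure_say_tool(info)["tools"]])
```

Running the merge a second time on its own output changes nothing, which is the property that makes it safe to apply after every annotation run.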

-## How to run it locally or on SLURM
+## Running locally

-Install the extra and invoke the console script:
+Install the extra and invoke the console script. Episode-level
+concurrency comes from `--executor.episode_parallelism` (default 16);
+that is the only knob the in-process executor exposes.

 ```bash
 uv sync --extra annotations
 uv run lerobot-annotate \
-  --repo_id=imstevenpmwork/super_poulain_draft \
-  --vlm.backend=vllm \
-  --vlm.model_id=Qwen/Qwen3.6-27B-FP8 \
-  --vlm.tensor_parallel_size=2
+  --root=/path/to/dataset \
+  --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
 ```

-The pipeline attaches actual camera footage to every Module 1/2/3 prompt
-by default, decoded from the dataset's first `observation.images.*`
-stream. Override with `--vlm.camera_key=observation.images.<name>` to
-pin a specific viewpoint. Datasets with no video tracks fall back to
-text-only prompts automatically.
+The pipeline attaches actual camera footage to every `plan` /
+`interjections` / `vqa` prompt by default, decoded from the dataset's
+first `observation.images.*` stream. Override with
+`--vlm.camera_key=observation.images.<name>` to pin a specific
+viewpoint. Datasets with no video tracks fall back to text-only prompts
+automatically.

-**Module 1 sees the whole episode as one video block.** Subtask
+**The `plan` module sees the whole episode as one video block.** Subtask
 decomposition gets a `{"type":"video", "video":[<frames>]}` block
 covering the entire demonstration; Qwen-VL pools temporally on its own
 and decides where to cut. There is no keyframe stride or count knob —
-`--module_1.max_video_frames` (default 32) only caps the frames packed
-into the video block as a model-capacity bound. Module 2 attaches a
-single still frame at the interjection timestamp; Module 3 attaches the
-exact emission frame to each VQA pair.
+`--plan.max_video_frames` (default 128) only caps the frames packed
+into the video block as a model-capacity bound. The `interjections`
+module attaches a short window of frames straddling the interjection
+timestamp. The `vqa` module grounds each VQA pair on a single frame:
+`--vqa.K` sets how many consecutive frames each emission tick anchors,
+and each anchored frame gets its own VQA pair on that one frame (there
+is no per-pair frame window).

-The executor picks `LocalPipelineExecutor` for small datasets and
-`SlurmPipelineExecutor` for large ones based on
-`--executor.auto_threshold` (default 32 episodes). Force local with
-`--executor.force_local=true`. SLURM jobs honour `--executor.slurm_partition`,
-`--executor.slurm_gpus`, and `--executor.slurm_time`.
+## Running on Hugging Face Jobs
+
+Distributed annotation is delegated to
+[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
+ships a launcher script you copy and edit for your dataset:
+
+```bash
+HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
+```
+
+[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
+spawns one `h200x2` job that:
+
+1. installs the branch under test plus the annotation extras,
+2. boots two vllm servers (one per GPU) for the chosen model,
+3. runs the `plan` / `interjections` / `vqa` modules across the dataset
+   via `lerobot-annotate`,
+4. uploads the annotated dataset to `--push_to_hub`.
+
+To target a different dataset, model, or hub repo, edit the `CMD` block
+inside the script — every flag in there maps directly onto a CLI flag of
+`lerobot-annotate` (see `lerobot-annotate --help` for the full list).

 ## Style-to-recipe consumer mapping

-The pipeline produces exactly the styles consumed by
-`src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml`:
+The pipeline's outputs are designed to be consumed by recipes (see
+[Language Columns and Recipes](./language_and_recipes)), typically:

- `low_level_execution`, `high_level_subtask`, `memory_update` consume
+- low-level / high-level / memory-update branches consume
  `subtask`/`plan`/`memory` from `language_persistent`.
- `user_interjection_response` consumes `interjection` events plus the
-  paired speech atom (merged into one assistant target turn via
+- An interjection-response branch consumes `interjection` events plus
+  the paired speech atom (merged into one assistant target turn via
  `tool_calls_from`) and the same-timestamp `plan` refresh.
- `ask_vqa` consumes the `(vqa, user)` and `(vqa, assistant)` pairs from
-  `language_events`.
+- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
+  from `language_events`.

-## Why the design is scoped to the canonical recipe
+## Why the design splits state from events

 Two things drive the scope:

-1. **Persistent state vs exact-event split.** Persistent rows (`subtask`,
-   `plan`, `memory`) broadcast per episode and answer "what state is in
-   force at this frame?". Event rows (`interjection`, `vqa`, speech) only
-   appear on the exact frame whose timestamp matches the emission. The
-   pipeline writes timestamps taken straight from the source parquet — no
-   floating-point recomputation.
+1. **Persistent state vs exact-event split.** Persistent rows
+   (`subtask`, `plan`, `memory`) broadcast per episode and answer "what
+   state is in force at this frame?". Event rows (`interjection`, `vqa`,
+   speech) only appear on the exact frame whose timestamp matches the
+   emission. The pipeline writes timestamps taken straight from the
+   source parquet — no floating-point recomputation.
 2. **One Qwen-VL pass.** All three modules share a single VLM client
   (vLLM if available, transformers fallback) so the cost is one model
   load per dataset, not three.
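
As a reading aid for the persistent-versus-event split in item 1, here is a minimal sketch of how a consumer might resolve both kinds of row at one frame. The row layout (`style`, `timestamp`, `content` keys) and the helper names are assumptions made for illustration, not the dataset's actual schema or loader API. Exact timestamp equality is used for events because the text says timestamps are copied straight from the source parquet.

```python
def state_in_force(persistent_rows, t):
    """Latest persistent row per style whose timestamp is <= t."""
    latest = {}
    for row in sorted(persistent_rows, key=lambda r: r["timestamp"]):
        if row["timestamp"] <= t:
            latest[row["style"]] = row["content"]
    return latest


def events_at(event_rows, t):
    """Event rows that were emitted exactly on the frame at timestamp t."""
    return [row for row in event_rows if row["timestamp"] == t]


# Hypothetical rows for one episode.
persistent = [
    {"style": "plan", "timestamp": 0.0, "content": "1. pick cup 2. place cup"},
    {"style": "subtask", "timestamp": 0.0, "content": "reach for the cup"},
    {"style": "subtask", "timestamp": 1.5, "content": "lift the cup"},
]
events = [
    {"style": "interjection", "timestamp": 1.5, "content": "use the red cup"},
]

print(state_in_force(persistent, 2.0))  # persistent state carries forward
print(events_at(events, 2.0))           # events do not: nothing at this frame
```

The persistent lookup answers "what is in force now" at any frame, while the event lookup returns something only on the emitting frame.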
@@ -102,7 +141,8 @@ Each module writes its raw output to
 `<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
 prompt iteration cheap — re-running one module overwrites only its own
 JSONL file before the writer composes the final parquet. Modules can be
-disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
+disabled via `--plan.enabled=false` (and likewise `--interjections.enabled`
+/ `--vqa.enabled`) to
 test them in isolation.
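
To make the staging layout concrete, the path template above can be expanded as follows. This is a sketch under assumptions: `staging_path` is a hypothetical helper, not a function from the pipeline, and the module names are taken from the table earlier in this page.

```python
from pathlib import Path


def staging_path(root, episode_index, module):
    """Expand <root>/.annotate_staging/episode_{N:06d}/<module>.jsonl."""
    return (
        Path(root)
        / ".annotate_staging"
        / f"episode_{episode_index:06d}"
        / f"{module}.jsonl"
    )


# One JSONL file per module per episode, so re-running a single module
# only replaces its own file.
for module in ("plan", "interjections", "vqa"):
    print(staging_path("/path/to/dataset", 42, module).as_posix())
```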

 ## Validation/report checks before final write
@@ -121,20 +161,22 @@ Errors abort the writer (`--skip_validation=true` overrides for debugging).

 ## Paper inspirations per module

- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
+- **`plan` module — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
  atom granularity ("pick up one piece of lettuce", "place bowl to box");
  Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
  what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
+- **`plan` module — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
  compression directive: keep only minimal relevant information; functional
  outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
+- **`interjections` module.** Hi Robot scenario taxonomy: negative task,
  situated correction, specific constraint, preference. Speech is a
  tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
 arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
+- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
  grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
-  keypoints) and Steerable Policies' multi-abstraction grounding.
+  keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
+  multi-abstraction grounding. Pi0.7 also grounds answers across
+  multiple abstraction levels.

 Future maintainers should adjust the prompt templates in
 `src/lerobot/annotations/steerable_pipeline/prompts/` against these
@@ -142,9 +184,9 @@ references rather than rewriting from scratch.

 ## Compute and list-size estimates

-Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
-O(`max_interjections_per_episode`) Module 2 calls, and
-O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
+Per episode, the pipeline issues O(`max_steps`) `plan`-module calls,
+O(`max_interjections_per_episode`) `interjections`-module calls, and
+O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults
 (8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
 is ~50 VLM calls per episode. `language_persistent` per episode is ~10s of
 KB at most (parquet dictionary-encodes one entry per episode);