Merge branch 'feat/language-annotation-pipeline' into feat/smolvla-on-steerable

Resolves conflicts from 66 commits on the base branch:

* pyproject.toml — keep base's transformers>=5.4.0,<5.6.0; add the
  sentencepiece-dep entry pi052 (FAST action tokenizer) needs.
* policies/__init__.py — keep pi052 export; drop the
  RewardClassifierConfig export that base removed.
* policies/factory.py — docstring list resolution (keep pi052; drop
  reward_classifier, removed by base).
* annotations/steerable_pipeline/executor.py — adopt base's renamed
  _ensure_annotation_metadata_in_info (it already advertises the say
  tool); drop pi052's older _ensure_tools_in_info call.
* configs/train.py — keep pi052's vqa_target_fraction; adopt base's
  SampleWeightingConfig (legacy RA-BC inline params already covered
  by the migration shim base added).
* scripts/lerobot_train.py — merge pi052's per-policy processor
  rebuild + dataset_repo_id pass-through with base's active_cfg /
  is_reward_model_training tightening, and re-route vqa-weighted
  sampler to active_cfg.drop_n_last_frames.
* datasets/language_render.py — adopt base's _select_one + timestamp
  tolerance (drops pi052's stale _select_latest / per-style sort_key).
* tests — adopt base's parametrized per-camera blend + tolerance
  test; drop pi052 tests that overlap with base's tighter rewrites;
  keep pi052's flow-only / VQA-blend coverage; add a
  test_canonical_recipe_loads check on subtask_mem_vqa_speech.yaml.
* policies/pi052/processor_pi052.py — import RenderMessagesStep
  directly from render_messages_processor (base intentionally
  dropped it from lerobot.processor's re-exports).
* uv.lock — regenerated cleanly from base + pi052's pocket-tts /
  beartype.

All 67 touched tests pass (30 pi052 + 37 recipe / language-render /
pipeline / render-messages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Pepijn
2026-05-25 14:47:09 +02:00
210 changed files with 14334 additions and 5728 deletions
+103 -61
View File
@@ -3,31 +3,44 @@
`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — directly into
`data/chunk-*/file-*.parquet`. There is no flavor namespace and no sidecar
file tree: multiple revisions of a dataset mean multiple dataset copies.
`data/chunk-*/file-*.parquet`.
## What the pipeline produces
Three modules write into a per-episode staging tree, then a single writer
A vocabulary-discovery phase derives a small canonical wording, then three
modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection) | `language_persistent` | Module 1 |
| `memory` (MEM-style compression) | `language_persistent` | Module 1 |
| `interjection` | `language_events` | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events` | Module 2 |
| `vqa` (user / assistant pair) | `language_events` | Module 3 |
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | -------------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | `plan` |
| `plan` (initial + refresh on interjection) | `language_persistent` | `plan` |
| `memory` (MEM-style compression) | `language_persistent` | `plan` |
| `task_aug` (rephrasings of canonical task) | `language_persistent` | `plan` |
| `interjection` | `language_events` | `interjections`|
| speech tool-call atom (`style=null`, `say`) | `language_events` | `interjections`|
| `vqa` (user / assistant pair) | `language_events` | `vqa` |
The writer drops the legacy `subtask_index` column. It does **not** add a
`tools` column to the parquet the tool catalog lives at
`meta/info.json["tools"]` instead (see [Tools](./tools)). After every
annotation run the pipeline ensures the canonical `say` schema is
present in that list, preserving any tools the user pre-declared. Chat-
template consumers read the catalog through
`LeRobotDatasetMetadata.tools` and pass it to
`apply_chat_template(messages, tools=meta.tools, ...)`.
The `plan` module is constrained to a **canonical vocabulary** discovered
once per dataset by the `vocabulary` module (phase 0). It watches a few
sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
asks the VLM to derive a small set of imperative subtask labels and
first-person memory milestones that recur across the demos. The VLM
picks the right number of entries itself based on what it sees in the
clips — short pick-and-place demos get ~6 subtask labels, longer
multi-step recipes get more. The result lands at
`meta/canonical_vocabulary.json` (human-readable / hand-editable) and
is reused on every subsequent run. The `plan` module then constrains
both subtask + memory generation to those exact strings — the
downstream low-level policy sees a small, repeatable target
distribution instead of thousands of LLM paraphrases. Disable with
`--vocabulary.enabled=False` to fall back to free-form generation.
The writer does **not** add a `tools` column to the parquet — the tool
catalog lives at `meta/info.json["tools"]` instead (see
[Tools](./tools)). After every annotation run the pipeline ensures the
canonical `say` schema is present in that list, preserving any tools the
user pre-declared.
If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
@@ -35,63 +48,89 @@ anything already there. Implementations of those tools live under
`src/lerobot/tools/`; one file per tool, registered via
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
## How to run it locally or on SLURM
## Running locally
Install the extra and invoke the console script:
Install the extra and invoke the console script. Episode-level
concurrency comes from `--executor.episode_parallelism` (default 16);
that is the only knob the in-process executor exposes.
```bash
uv sync --extra annotations
uv run lerobot-annotate \
--repo_id=imstevenpmwork/super_poulain_draft \
--vlm.backend=vllm \
--vlm.model_id=Qwen/Qwen3.6-27B-FP8 \
--vlm.tensor_parallel_size=2
--root=/path/to/dataset \
--vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
```
The pipeline attaches actual camera footage to every Module 1/2/3 prompt
by default, decoded from the dataset's first `observation.images.*`
stream. Override with `--vlm.camera_key=observation.images.<name>` to
pin a specific viewpoint. Datasets with no video tracks fall back to
text-only prompts automatically.
The pipeline attaches actual camera footage to every `plan` /
`interjections` / `vqa` prompt by default, decoded from the dataset's
first `observation.images.*` stream. Override with
`--vlm.camera_key=observation.images.<name>` to pin a specific
viewpoint. Datasets with no video tracks fall back to text-only prompts
automatically.
**Module 1 sees the whole episode as one video block.** Subtask
**The `plan` module sees the whole episode as one video block.** Subtask
decomposition gets a `{"type":"video", "video":[<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--module_1.max_video_frames` (default 32) only caps the frames packed
into the video block as a model-capacity bound. Module 2 attaches a
single still frame at the interjection timestamp; Module 3 attaches the
exact emission frame to each VQA pair.
`--plan.max_video_frames` (default 128) only caps the frames packed
into the video block as a model-capacity bound. The `interjections`
module attaches a short window of frames straddling the interjection
timestamp. The `vqa` module grounds each VQA pair on a single frame —
its `--vqa.K` knob sets how many consecutive frames each emission tick
anchors, and every anchored frame gets its own VQA pair on that one
frame (there is no per-pair frame window).
The executor picks `LocalPipelineExecutor` for small datasets and
`SlurmPipelineExecutor` for large ones based on
`--executor.auto_threshold` (default 32 episodes). Force local with
`--executor.force_local=true`. SLURM jobs honour `--executor.slurm_partition`,
`--executor.slurm_gpus`, and `--executor.slurm_time`.
## Running on Hugging Face Jobs
Distributed annotation is delegated to
[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
ships a launcher script you copy and edit for your dataset:
```bash
HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
```
[`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
spawns one `h200x2` job that:
1. installs the branch under test plus the annotation extras,
2. boots two vllm servers (one per GPU) for the chosen model,
3. runs the `plan` / `interjections` / `vqa` modules across the dataset
via `lerobot-annotate`,
4. uploads the annotated dataset to `--push_to_hub`.
To target a different dataset, model, or hub repo, edit the `CMD` block
inside the script — every flag in there maps directly onto a CLI flag of
`lerobot-annotate` (see `lerobot-annotate --help` for the full list).
## Style-to-recipe consumer mapping
<<<<<<< HEAD
The pipeline produces exactly the styles consumed by
`src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml`:
=======
The pipeline's outputs are designed to be consumed by recipes (see
[Language Columns and Recipes](./language_and_recipes)) — typically:
>>>>>>> origin/feat/language-annotation-pipeline
- `low_level_execution`, `high_level_subtask`, `memory_update` consume
- low-level / high-level / memory-update branches consume
`subtask`/`plan`/`memory` from `language_persistent`.
- `user_interjection_response` consumes `interjection` events plus the
paired speech atom (merged into one assistant target turn via
- An interjection-response branch consumes `interjection` events plus
the paired speech atom (merged into one assistant target turn via
`tool_calls_from`) and the same-timestamp `plan` refresh.
- `ask_vqa` consumes the `(vqa, user)` and `(vqa, assistant)` pairs from
`language_events`.
- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
from `language_events`.
## Why the design is scoped to the canonical recipe
## Why the design splits state from events
Two things drive the scope:
1. **Persistent state vs exact-event split.** Persistent rows (`subtask`,
`plan`, `memory`) broadcast per episode and answer "what state is in
force at this frame?". Event rows (`interjection`, `vqa`, speech) only
appear on the exact frame whose timestamp matches the emission. The
pipeline writes timestamps taken straight from the source parquet — no
floating-point recomputation.
1. **Persistent state vs exact-event split.** Persistent rows
(`subtask`, `plan`, `memory`) broadcast per episode and answer "what
state is in force at this frame?". Event rows (`interjection`, `vqa`,
speech) only appear on the exact frame whose timestamp matches the
emission. The pipeline writes timestamps taken straight from the
source parquet — no floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client
(vLLM if available, transformers fallback) so the cost is one model
load per dataset, not three.
@@ -102,7 +141,8 @@ Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
disabled via `--plan.enabled=false` (and likewise `--interjections.enabled`
/ `--vqa.enabled`) to
test them in isolation.
## Validation/report checks before final write
@@ -121,20 +161,22 @@ Errors abort the writer (`--skip_validation=true` overrides for debugging).
## Paper inspirations per module
- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
- **`plan` module — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
atom granularity ("pick up one piece of lettuce", "place bowl to box");
Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
- **`plan` module — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
compression directive: keep only minimal relevant information; functional
outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
- **`interjections` module.** Hi Robot scenario taxonomy: negative task,
situated correction, specific constraint, preference. Speech is a
tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
- **`vqa` module.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
keypoints) and Steerable Policies' multi-abstraction grounding.
keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
multi-abstraction grounding. Pi0.7 also grounds answers across
multiple abstraction levels.
Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
@@ -142,9 +184,9 @@ references rather than rewriting from scratch.
## Compute and list-size estimates
Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
Per episode, the pipeline issues O(`max_steps`) `plan`-module calls,
O(`max_interjections_per_episode`) `interjections`-module calls, and
O(`vqa_emission_hz × episode_seconds`) `vqa`-module calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` per episode is ~10s of
KB at most (parquet dictionary-encodes one entry per episode);