Merge remote-tracking branch 'origin/main' into feat/language-columns

2026-05-23 20:50:02 +00:00 · 2026-05-06 12:09:13 +02:00
parent e3e9374e2c ce24063efd
commit 5c30b14929
146 changed files with 9361 additions and 4180 deletions
@@ -1,7 +1,34 @@
 # Language columns and recipes

-LeRobot stores reusable language annotations directly next to frame data in `data/chunk-*/file-*.parquet`.
-The two optional columns are:
+Most LeRobot datasets ship with a single `task` string per episode — fine for
+short, single-instruction skills, but not enough for the longer-horizon,
+multi-modal robot policies the field is moving toward (high-level planning,
+memory, interjections, VQA, tool use). To support those policies without
+forking the dataset format, LeRobot extends `LeRobotDataset` with two optional
+language columns and a small recipe layer that turns those rows into
+chat-style training samples on the fly.
+
+The design splits cleanly into three layers:
+
+1. **Data in the dataset** — language annotations stored next to frames in
+   `data/chunk-*/file-*.parquet` as two optional columns (`language_persistent`
+   and `language_events`). Datasets without these columns keep their existing
+   behavior.
+2. **Recipe** — a YAML file that declares which annotation rows to bind and
+   how to lay them out as chat turns (`role`, `content`, optional images,
+   optional tool calls). Recipes are pure config; no Python required to add a
+   new one.
+3. **Training format** — at sample time, `RenderMessagesStep` resolves the
+   recipe against the per-frame annotations and emits HF-style `messages` plus
+   LeRobot-specific sidecars (`message_streams`, `target_message_indices`)
+   that policy processors consume.
+
+This page describes each layer in turn.
+
+## Layer 1 — language columns in the dataset
+
+The two optional columns live next to frame data in
+`data/chunk-*/file-*.parquet`:

 - `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`.
 - `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls.
@@ -26,9 +53,9 @@ the validator enforce this via `validate_camera_field(style, camera)`.

 `meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations.

-## Architecture
+### Architecture

-The language stack has three layers:
+The language stack itself has three internal modules backing layer 1:

 1. `lerobot.datasets.language` defines the schema, style registry, and `column_for_style`.
 2. `lerobot.datasets.language_render` resolves rows and renders messages.
@@ -36,7 +63,7 @@ The language stack has three layers:

 `LeRobotDataset` stays recipe-agnostic. It passes `language_persistent` and `language_events` through when present, and unannotated datasets keep their existing behavior.

-## Temporal semantics
+### Temporal semantics

 Persistent styles are active after emission until replaced:

@@ -52,7 +79,7 @@ Event styles only exist on their exact timestamp:

 Exact event matching has no tolerance window, so writers must stamp event rows with frame timestamps from the parquet data.

-## View-dependent resolution
+### View-dependent resolution

 For view-dependent styles (`vqa`, `motion`, `trace`), the resolver gains a
 `camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple
@@ -78,9 +105,11 @@ ask_vqa_top:

 Add one such sub-recipe per camera the dataset records.

-## Recipe anatomy
+## Layer 2 — recipe anatomy

-Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`.
+Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`. They
+declare which annotation rows to pull (via `bindings`) and how to compose them
+into chat turns (`messages`).

 ```yaml
 messages:
@@ -88,6 +117,13 @@ messages:
  - { role: assistant, content: "${subtask}", stream: low_level, target: true }
 ```

+A recipe can also branch into a weighted **blend** of sub-recipes. At sample
+time, exactly one branch is selected deterministically from the sample index,
+so different frames train different objectives (e.g. memory updates vs.
+low-level execution vs. VQA) without any Python wiring.
+
+## Layer 3 — training format
+
 Rendered samples use HF-style chat messages plus LeRobot sidecars:

 ```python
@@ -96,12 +132,7 @@ sample["message_streams"]
 sample["target_message_indices"]
 ```

-The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone.
-
-## Blends
-
-Blend recipes select one weighted sub-recipe deterministically from the sample index.
-The canonical `recipes/pi05_hirobot.yaml` combines memory updates, interjection responses, high-level subtask prediction, low-level execution, and VQA.
+The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages.

 ## Graceful absence