mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-19 10:40:04 +00:00
0f5f0e4091
- hirobot.yaml -> subtasks_vqa.yaml - hirobot_memory.yaml -> subtask_mem_vqa_speech.yaml - pi05_hirobot.yaml -> deleted (stale: uses plan, top-camera names; superseded by the two recipes above) - smolvla2_hirobot.yaml -> deleted (was untracked stale junk) Updated the smolvla2 / pi052 `recipe_path` config defaults, all docstring / comment references, the annotation-pipeline + recipe docs, and the three tests that loaded pi05_hirobot.yaml (repointed to the renamed recipes; the low-level-branch and pipeline-render assertions now accept a flow-only `low_level` stream as valid supervision, since the new recipes' low_level_execution has no text-CE target). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
110 lines
4.5 KiB
Plaintext
110 lines
4.5 KiB
Plaintext
# Language columns and recipes
|
|
|
|
LeRobot stores reusable language annotations directly next to frame data in `data/chunk-*/file-*.parquet`.
|
|
The two optional columns are:
|
|
|
|
- `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`.
|
|
- `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls.
|
|
|
|
Both columns share the same row shape (event rows omit `timestamp` because the
|
|
frame the row sits on already provides it):
|
|
|
|
```text
|
|
role: string
|
|
content: string | null
|
|
style: string | null
|
|
timestamp: float64 # persistent rows only
|
|
camera: string | null # observation.images.* feature key, view-dependent rows only
|
|
tool_calls: list[Json] | null
|
|
```
|
|
|
|
The `camera` field tags rows whose `content` is grounded in a specific camera
|
|
view. Rows of view-dependent styles (`vqa`, and the reserved `motion` /
|
|
`trace`) MUST set `camera` to the matching `observation.images.*` feature key.
|
|
Rows of every other style MUST leave `camera` as `null`. Pipeline writers and
|
|
the validator enforce this via `validate_camera_field(style, camera)`.
|
|
|
|
`meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations.
|
|
|
|
## Architecture
|
|
|
|
The language stack has three layers:
|
|
|
|
1. `lerobot.datasets.language` defines the schema, style registry, and `column_for_style`.
|
|
2. `lerobot.datasets.language_render` resolves rows and renders messages.
|
|
3. `RenderMessagesStep` turns dataset samples into `messages`, `message_streams`, and `target_message_indices`.
|
|
|
|
`LeRobotDataset` stays recipe-agnostic. It passes `language_persistent` and `language_events` through when present, and unannotated datasets keep their existing behavior.
|
|
|
|
## Temporal semantics
|
|
|
|
Persistent styles are active after emission until replaced:
|
|
|
|
- `active_at(t, style=subtask)`
|
|
- `nth_prev(style=memory, offset=1)`
|
|
- `nth_next(style=subtask, offset=1)`
|
|
|
|
Event styles only exist on their exact timestamp:
|
|
|
|
- `emitted_at(t, style=interjection)`
|
|
- `emitted_at(t, style=vqa, role=user, camera=observation.images.top)`
|
|
- `emitted_at(t, role=assistant, tool_name=say)`
|
|
|
|
Exact event matching has no tolerance window, so writers must stamp event rows with frame timestamps from the parquet data.
|
|
|
|
## View-dependent resolution
|
|
|
|
For view-dependent styles (`vqa`, `motion`, `trace`), the resolver gains a
|
|
`camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple
|
|
cameras typically emit one (`vqa`, `user`) + (`vqa`, `assistant`) pair per
|
|
camera at the same timestamp; without `camera=`, those resolvers see two
|
|
matches and raise an ambiguity error. Recipes consume each camera through its
|
|
own binding plus a matching image block, e.g.
|
|
|
|
```yaml
|
|
ask_vqa_top:
|
|
bindings:
|
|
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
|
|
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
|
|
messages:
|
|
- role: user
|
|
stream: high_level
|
|
if_present: vqa_query
|
|
content:
|
|
- { type: image, feature: observation.images.top }
|
|
- { type: text, text: "${vqa_query}" }
|
|
- { role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa }
|
|
```
|
|
|
|
Add one such sub-recipe per camera the dataset records.
|
|
|
|
## Recipe anatomy
|
|
|
|
Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`.
|
|
|
|
```yaml
|
|
messages:
|
|
- { role: user, content: "${task}", stream: high_level }
|
|
- { role: assistant, content: "${subtask}", stream: low_level, target: true }
|
|
```
|
|
|
|
Rendered samples use HF-style chat messages plus LeRobot sidecars:
|
|
|
|
```python
|
|
sample["messages"]
|
|
sample["message_streams"]
|
|
sample["target_message_indices"]
|
|
```
|
|
|
|
The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone.
|
|
|
|
## Blends
|
|
|
|
Blend recipes select one weighted sub-recipe deterministically from the sample index.
|
|
`recipes/subtasks_vqa.yaml` trains the core blend — high-level subtask prediction, low-level execution, and VQA. `recipes/subtask_mem_vqa_speech.yaml` is the fuller variant that also adds memory updates and spoken interjection responses.
|
|
|
|
## Graceful absence
|
|
|
|
If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
|
|
If an event-scoped branch is selected on a frame without the required event row, rendering returns `None`, allowing a loader to retry another sample.
|