diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 470319c48..412386e2d 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -39,8 +39,10 @@ title: Porting Large Datasets - local: using_dataset_tools title: Using the Dataset Tools - - local: dataset_subtask - title: Using Subtasks in the Dataset + - local: language_and_recipes + title: Language Columns and Recipes + - local: tools + title: Tools - local: video_encoding_parameters title: Video encoding parameters - local: streaming_video_encoding diff --git a/docs/source/dataset_subtask.mdx b/docs/source/dataset_subtask.mdx deleted file mode 100644 index 6264aca22..000000000 --- a/docs/source/dataset_subtask.mdx +++ /dev/null @@ -1,277 +0,0 @@ -# Using Subtasks in LeRobot Datasets - -Subtask support in robotics datasets has proven effective in improving robot reasoning and understanding. Subtasks are particularly useful for: - -- **Hierarchical policies**: Building policies that include subtask predictions to visualize robot reasoning in real time -- **Reward modeling**: Helping reward models understand task progression (e.g., SARM-style stage-aware reward models) -- **Task decomposition**: Breaking down complex manipulation tasks into atomic, interpretable steps - -LeRobotDataset now supports subtasks as part of its dataset structure, alongside tasks. - -## What are Subtasks? - -While a **task** describes the overall goal (e.g., "Pick up the apple and place it in the basket"), **subtasks** break down the execution into finer-grained steps: - -1. "Approach the apple" -2. "Grasp the apple" -3. "Lift the apple" -4. "Move to basket" -5. "Release the apple" - -Each frame in the dataset can be annotated with its corresponding subtask, enabling models to learn and predict these intermediate stages. - -An overview of subtask annotation showing how frames are labeled with intermediate subtask stages - -

- Figure: Overview of subtask annotation. -

- -**Reference:** _Subtask-learning based for robot self-assembly in flexible collaborative assembly in manufacturing_, Original Article, Published: 19 April 2022. - -## Dataset Structure - -Subtask information is stored in the dataset metadata: - -``` -my-dataset/ -├── data/ -│ └── ... -├── meta/ -│ ├── info.json -│ ├── stats.json -│ ├── tasks.parquet -│ ├── subtasks.parquet # Subtask index → subtask string mapping -│ └── episodes/ -│ └── ... -└── videos/ - └── ... -``` - -### Subtasks Parquet File - -The `meta/subtasks.parquet` file maps subtask indices to their natural language descriptions: - -| subtask_index | subtask (index column) | -| ------------- | ---------------------- | -| 0 | "Approach the apple" | -| 1 | "Grasp the apple" | -| 2 | "Lift the apple" | -| ... | ... | - -### Frame-Level Annotations - -Each frame in the dataset can include a `subtask_index` field that references the subtasks parquet file: - -```python -# Example frame data in the parquet file -{ - "index": 42, - "timestamp": 1.4, - "episode_index": 0, - "task_index": 0, - "subtask_index": 2, # References "Lift the apple" - "observation.state": [...], - "action": [...], -} -``` - -## Annotating Datasets with Subtasks - -We provide a HuggingFace Space for easily annotating any LeRobotDataset with subtasks: - -**[https://huggingface.co/spaces/lerobot/annotate](https://huggingface.co/spaces/lerobot/annotate)** - -After completing your annotation: - -1. Click "Push to Hub" to upload your annotated dataset -2. 
You can also run the annotation space locally by following the instructions at [github.com/huggingface/lerobot-annotate](https://github.com/huggingface/lerobot-annotate) - -## Loading Datasets with Subtasks - -When you load a dataset with subtask annotations, the subtask information is automatically available: - -```python -from lerobot.datasets import LeRobotDataset - -# Load a dataset with subtask annotations -dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated") - -# Access a sample -sample = dataset[100] - -# The sample includes both task and subtask information -print(sample["task"]) # "Collect the fruit" -print(sample["subtask"]) # "Grasp the apple" -print(sample["task_index"]) # tensor(0) -print(sample["subtask_index"]) # tensor(2) -``` - -### Checking for Subtask Support - -You can check if a dataset has subtask annotations: - -```python -# Check if subtasks are available -has_subtasks = ( - "subtask_index" in dataset.features - and dataset.meta.subtasks is not None -) - -if has_subtasks: - print(f"Dataset has {len(dataset.meta.subtasks)} unique subtasks") - print("Subtasks:", list(dataset.meta.subtasks.index)) -``` - -## Using Subtasks for Training - -### With the Tokenizer Processor - -The `TokenizerProcessor` automatically handles subtask tokenization for Vision-Language Action (VLA) models: - -```python -from lerobot.processor import TokenizerProcessorStep - -# Create a tokenizer processor step -tokenizer_processor = TokenizerProcessorStep( - tokenizer_name_or_path="google/paligemma-3b-pt-224", - padding="max_length", - max_length=64, -) - -# The processor will automatically tokenize subtasks if present in the batch -# and add them to the observation under: -# - "observation.subtask.tokens" -# - "observation.subtask.attention_mask" -``` - -When subtasks are available in the batch, the tokenizer processor adds: - -- `observation.subtask.tokens`: Tokenized subtask text -- `observation.subtask.attention_mask`: Attention mask for the subtask 
tokens - -### DataLoader with Subtasks - -```python -import torch -from lerobot.datasets import LeRobotDataset - -dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated") - -dataloader = torch.utils.data.DataLoader( - dataset, - batch_size=16, - shuffle=True, -) - -for batch in dataloader: - # Access subtask information in the batch - subtasks = batch["subtask"] # List of subtask strings - subtask_indices = batch["subtask_index"] # Tensor of subtask indices - - # Use for training hierarchical policies or reward models - print(f"Batch subtasks: {set(subtasks)}") -``` - -## Example Datasets with Subtask Annotations - -Try loading a dataset with subtask annotations: - -```python -from lerobot.datasets import LeRobotDataset - -# Example dataset with subtask annotations -dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated") - -# Explore the subtasks -print("Available subtasks:") -for subtask_name in dataset.meta.subtasks.index: - print(f" - {subtask_name}") - -# Get subtask distribution -subtask_counts = {} -for i in range(len(dataset)): - sample = dataset[i] - subtask = sample["subtask"] - subtask_counts[subtask] = subtask_counts.get(subtask, 0) + 1 - -print("\nSubtask distribution:") -for subtask, count in sorted(subtask_counts.items(), key=lambda x: -x[1]): - print(f" {subtask}: {count} frames") -``` - -## Use Cases - -### 1. Hierarchical Policy Training - -Train policies that predict both actions and current subtask: - -```python -class HierarchicalPolicy(nn.Module): - def __init__(self, num_subtasks): - super().__init__() - self.action_head = nn.Linear(hidden_dim, action_dim) - self.subtask_head = nn.Linear(hidden_dim, num_subtasks) - - def forward(self, observations): - features = self.encoder(observations) - actions = self.action_head(features) - subtask_logits = self.subtask_head(features) - return actions, subtask_logits -``` - -### 2. 
Stage-Aware Reward Modeling (SARM) - -Build reward models that understand task progression: - -```python -# SARM predicts: -# - Stage: Which subtask is being executed (discrete) -# - Progress: How far along the subtask (continuous 0-1) - -class SARMRewardModel(nn.Module): - def forward(self, observations): - features = self.encoder(observations) - stage_logits = self.stage_classifier(features) - progress = self.progress_regressor(features) - return stage_logits, progress -``` - -### 3. Progress Visualization - -Monitor robot execution by tracking subtask progression: - -```python -def visualize_execution(model, observations): - for t, obs in enumerate(observations): - action, subtask_logits = model(obs) - predicted_subtask = subtask_names[subtask_logits.argmax()] - print(f"t={t}: Executing '{predicted_subtask}'") -``` - -## API Reference - -### LeRobotDataset Properties - -| Property | Type | Description | -| --------------------------- | ---------------------- | ------------------------------------------ | -| `meta.subtasks` | `pd.DataFrame \| None` | DataFrame mapping subtask names to indices | -| `features["subtask_index"]` | `dict` | Feature spec for subtask_index if present | - -### Sample Keys - -When subtasks are available, each sample includes: - -| Key | Type | Description | -| --------------- | -------------- | ------------------------------------ | -| `subtask_index` | `torch.Tensor` | Integer index of the current subtask | -| `subtask` | `str` | Natural language subtask description | - -## Related Resources - -- [SARM Paper](https://arxiv.org/pdf/2509.25358) - Stage-Aware Reward Modeling for Long Horizon Robot Manipulation -- [LeRobot Annotate Space](https://huggingface.co/spaces/lerobot/annotate) - Interactive annotation tool -- [LeRobotDataset v3.0](./lerobot-dataset-v3) - Dataset format documentation diff --git a/docs/source/language_and_recipes.mdx b/docs/source/language_and_recipes.mdx new file mode 100644 index 000000000..4181dbe34 --- /dev/null 
+++ b/docs/source/language_and_recipes.mdx @@ -0,0 +1,147 @@ +# Language columns and recipes + +Most LeRobot datasets ship with a single `task` string per episode — fine for +short, single-instruction skills, but not enough for the longer-horizon, +multi-modal robot policies the field is moving toward (high-level planning, +memory, interjections, VQA, tool use). To support those policies without +forking the dataset format, LeRobot extends `LeRobotDataset` with two optional +language columns and a small recipe layer that turns those rows into +chat-style training samples on the fly. + +The design splits cleanly into three layers: + +1. **Data in the dataset** — language annotations stored next to frames in + `data/chunk-*/file-*.parquet` as two optional columns (`language_persistent` + and `language_events`). Datasets without these columns keep their existing + behavior. +2. **Recipe** — a YAML file that declares which annotation rows to bind and + how to lay them out as chat turns (`role`, `content`, optional images, + optional tool calls). Recipes are pure config; no Python required to add a + new one. +3. **Training format** — at sample time, `RenderMessagesStep` resolves the + recipe against the per-frame annotations and emits HF-style `messages` plus + LeRobot-specific sidecars (`message_streams`, `target_message_indices`) + that policy processors consume. + +This page describes each layer in turn. + +## Layer 1 — language columns in the dataset + +The two optional columns live next to frame data in +`data/chunk-*/file-*.parquet`: + +- `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`. +- `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls. 
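As a rough illustration of the difference between the two columns, the sketch below attaches rows to the frames of one episode in plain Python. The helper `build_language_columns`, its `events_by_timestamp` argument, and the dict-based rows are hypothetical and are not LeRobot's writer API; only the column names and the `role` / `content` / `style` / `timestamp` fields come from this page.

```python
from __future__ import annotations

from typing import Any

Row = dict[str, Any]


def build_language_columns(
    frame_timestamps: list[float],
    persistent_rows: list[Row],
    events_by_timestamp: dict[float, list[Row]],
) -> list[dict[str, list[Row]]]:
    """Hypothetical helper: attach language rows to each frame of one episode."""
    frames = []
    for ts in frame_timestamps:
        frames.append(
            {
                # Persistent rows are broadcast: every frame carries the full list.
                "language_persistent": list(persistent_rows),
                # Event rows appear only on the frame whose timestamp matches exactly.
                "language_events": list(events_by_timestamp.get(ts, [])),
            }
        )
    return frames


# Illustrative data: one persistent subtask row, one interjection at t=0.1.
frames = build_language_columns(
    frame_timestamps=[0.0, 0.1, 0.2],
    persistent_rows=[
        {"role": "assistant", "content": "grasp the apple", "style": "subtask", "timestamp": 0.0}
    ],
    events_by_timestamp={
        0.1: [{"role": "user", "content": "use the left arm", "style": "interjection"}]
    },
)
```

In this sketch every frame ends up with the same `language_persistent` list, while only the frame at `0.1` has a non-empty `language_events` list.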
+ +Both columns share the same row shape (event rows omit `timestamp` because the +frame the row sits on already provides it): + +```text +role: string +content: string | null +style: string | null +timestamp: float32 # persistent rows only +camera: string | null # observation.images.* feature key, view-dependent rows only +tool_calls: list[Json] | null +``` + +The `camera` field tags rows whose `content` is grounded in a specific camera +view. Rows of view-dependent styles (`vqa` and `trace`) MUST set `camera` to +the matching `observation.images.*` feature key. Rows of every other style — +including `motion`, which describes robot-frame primitives in joint / Cartesian +terms — MUST leave `camera` as `null`. Pipeline writers and the validator +enforce this via `validate_camera_field(style, camera)`. + +`meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations. + +### Architecture + +The language stack itself has three internal modules backing layer 1: + +1. `lerobot.datasets.language` defines the schema, style registry, and `column_for_style`. +2. `lerobot.datasets.language_render` resolves rows and renders messages. +3. `RenderMessagesStep` turns dataset samples into `messages`, `message_streams`, and `target_message_indices`. + +`LeRobotDataset` stays recipe-agnostic. It passes `language_persistent` and `language_events` through when present, and unannotated datasets keep their existing behavior. + +## Layer 2 — recipe anatomy + +Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`. They +declare which annotation rows to pull (via `bindings`) and how to compose them +into chat turns (`messages`). + +```yaml +messages: + - { role: user, content: "${task}", stream: high_level } + - { role: assistant, content: "${subtask}", stream: low_level, target: true } +``` + +A recipe can also branch into a weighted **blend** of sub-recipes. 
At sample +time, exactly one branch is selected deterministically from the sample index, +so different frames train different objectives (e.g. memory updates vs. +low-level execution vs. VQA) without any Python wiring. + +### Temporal semantics + +Persistent styles are active after emission until replaced: + +- `active_at(t, style=subtask)` +- `nth_prev(style=memory, offset=1)` +- `nth_next(style=subtask, offset=1)` + +Event styles only exist on their exact timestamp: + +- `emitted_at(t, style=interjection)` +- `emitted_at(t, style=vqa, role=user, camera=observation.images.top)` +- `emitted_at(t, role=assistant, tool_name=say)` + +Exact event matching has no tolerance window, so writers must stamp event rows with frame timestamps from the parquet data. + +### View-dependent resolution + +For view-dependent styles (`vqa` and `trace`), the resolver gains a +`camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple +cameras typically emit one (`vqa`, `user`) + (`vqa`, `assistant`) pair per +camera at the same timestamp; without `camera=`, those resolvers see two +matches and raise an ambiguity error. Recipes consume each camera through its +own binding plus a matching image block, e.g. + +```yaml +ask_vqa_top: + bindings: + vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)" + vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)" + messages: + - role: user + stream: high_level + if_present: vqa_query + content: + - { type: image, feature: observation.images.top } + - { type: text, text: "${vqa_query}" } + - { + role: assistant, + content: "${vqa}", + stream: high_level, + target: true, + if_present: vqa, + } +``` + +Add one such sub-recipe per camera the dataset records. 
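The temporal and view-dependent rules above can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the code in `lerobot.datasets.language_render`: the function bodies, the `ValueError` used for ambiguity, and the `observation.images.wrist` camera key are hypothetical, and the `tool_name=` filter is left out.

```python
from __future__ import annotations

from typing import Any

Row = dict[str, Any]


def active_at(persistent_rows: list[Row], t: float, style: str) -> Row | None:
    """Latest row of `style` emitted at or before t (active until replaced)."""
    candidates = [
        r for r in persistent_rows if r["style"] == style and r["timestamp"] <= t
    ]
    return max(candidates, key=lambda r: r["timestamp"], default=None)


def emitted_at(event_rows: list[Row], **filters: Any) -> Row | None:
    """The single event row on this frame matching every filter, else None.

    `event_rows` is the frame's own `language_events` list, so the exact
    timestamp match is implicit in this sketch.
    """
    matches = [
        r for r in event_rows if all(r.get(k) == v for k, v in filters.items())
    ]
    if len(matches) > 1:
        # Assumed error type; the text only says an ambiguity error is raised.
        raise ValueError(f"ambiguous match for {filters}: {len(matches)} rows")
    return matches[0] if matches else None


persistent = [
    {"style": "subtask", "timestamp": 0.0, "content": "approach the apple"},
    {"style": "subtask", "timestamp": 2.0, "content": "grasp the apple"},
]
events = [
    {"style": "vqa", "role": "user", "camera": "observation.images.top",
     "content": "Is the gripper open?"},
    {"style": "vqa", "role": "user", "camera": "observation.images.wrist",
     "content": "Is the apple visible?"},
]

current = active_at(persistent, t=1.5, style="subtask")
top_query = emitted_at(
    events, style="vqa", role="user", camera="observation.images.top"
)
```

Calling `emitted_at(events, style="vqa", role="user")` without `camera=` matches both rows and raises in this sketch, which mirrors why each camera needs its own binding.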
+ +## Layer 3 — training format + +Rendered samples use HF-style chat messages plus LeRobot sidecars: + +```python +sample["messages"] +sample["message_streams"] +sample["target_message_indices"] +``` + +The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages. + +## Graceful absence + +If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op. +If an event-scoped branch is selected on a frame without the required event row, rendering returns `None`, allowing a loader to retry another sample. diff --git a/docs/source/tools.mdx b/docs/source/tools.mdx new file mode 100644 index 000000000..d88881184 --- /dev/null +++ b/docs/source/tools.mdx @@ -0,0 +1,210 @@ +# Tools + +LeRobot v3.1 supports **tool calls** in policies — assistant messages can +emit structured invocations like `say(text="OK, starting now")` that the +runtime dispatches to a real implementation (TTS, controller, logger, …). + +This page covers: + +1. Where the tool catalog lives. +2. How the annotation pipeline produces tool-call atoms. +3. How to add your own tool. + +## Where tools are declared + +Two layers. + +**The catalog** — a list of OpenAI-style function schemas — lives at +`meta/info.json["tools"]` on each dataset. Example: + +```json +{ + "features": { "...": "..." }, + "tools": [ + { + "type": "function", + "function": { + "name": "say", + "description": "Speak a short utterance to the user via the TTS executor.", + "parameters": { + "type": "object", + "properties": { + "text": { + "type": "string", + "description": "The verbatim text to speak." 
+ } + }, + "required": ["text"] + } + } + } + ] +} +``` + +Read it via the dataset metadata accessor: + +```python +from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata + +meta = LeRobotDatasetMetadata(repo_id="pepijn/super_poulain_final_annotations") +tools = meta.tools # list[dict] — OpenAI tool schemas +``` + +If the dataset's `info.json` doesn't declare any tools, `meta.tools` +returns `DEFAULT_TOOLS` from `lerobot.datasets.language` — currently a +single-entry list with the canonical `say` schema. So unannotated +datasets and chat-template consumers keep working without any +configuration: + +```python +prompt_str = tokenizer.apply_chat_template( + sample["messages"], + tools=meta.tools, # works either way + add_generation_prompt=False, + tokenize=False, +) +``` + +**The implementations** — runnable Python — will live under +`src/lerobot/tools/`, one file per tool. The runtime dispatcher and +the canonical `say` implementation (wrapping Kyutai's pocket-tts) are +not part of the catalog layer described here; today this layer ships +only the schema storage and the `DEFAULT_TOOLS` fallback constant. + +## Per-row tool _invocations_ + +The catalog above describes _what can be called_. The actual _call_ — the +function name plus the argument values — is stored per-row, on the +assistant atoms in `language_events`: + +```python +{ + "role": "assistant", + "content": null, + "style": null, + "timestamp": 12.4, + "camera": null, + "tool_calls": [ + { "type": "function", + "function": { "name": "say", "arguments": { "text": "On it." 
} } } + ] +} +``` + +Recipes splice these into rendered messages via `tool_calls_from`: + +```yaml +user_interjection_response: + bindings: + speech: "emitted_at(t, role=assistant, tool_name=say)" + messages: + - { role: user, content: "${task}", stream: high_level } + - { + role: assistant, + content: "${current_plan}", + stream: high_level, + target: true, + tool_calls_from: speech, + } +``` + +The model's training target is one assistant turn that carries both the +plan text _and_ the `say` tool call. At inference, the runtime parses +the generated text back into structured `tool_calls` and dispatches to +the matching implementation. + +## How to add your own tool + +> **Note:** Steps 2 and 3 below describe the runtime layer +> (`src/lerobot/tools/`, the `Tool` protocol, `TOOL_REGISTRY`, +> `get_tools(meta)`) which is not part of the catalog layer shipped +> today — those modules don't yet exist in the tree. Step 1 alone is +> enough to make the tool visible to the chat template via +> `meta.tools` so the model can learn to _generate_ the call; +> executing the call at inference requires the runtime layer. + +Three steps. Concrete example: a `record_observation` tool the policy +can call to capture an extra observation outside the regular control +loop. + +### Step 1 — declare the schema + +Add an entry under `meta/info.json["tools"]`. Either edit the file +directly on disk _before_ running the annotation pipeline (it'll be +preserved) or hand it to `lerobot-annotate` via a config flag. + +```json +{ + "tools": [ + { "type": "function", "function": { "name": "say", "...": "..." } }, + { + "type": "function", + "function": { + "name": "record_observation", + "description": "Capture a high-resolution still image for the user.", + "parameters": { + "type": "object", + "properties": { + "label": { + "type": "string", + "description": "Short label for the saved image." 
+ } + }, + "required": ["label"] + } + } + } + ] +} +``` + +The schema follows OpenAI's function-calling convention exactly, so the +chat template can render it natively. + +### Step 2 — implement the call + +Create `src/lerobot/tools/record_observation.py`: + +```python +from .base import Tool +from typing import Any + +RECORD_OBSERVATION_SCHEMA: dict[str, Any] = { "...": "..." } # mirrors the JSON above + + +class RecordObservationTool: + name = "record_observation" + schema = RECORD_OBSERVATION_SCHEMA + + def __init__(self, schema: dict | None = None, output_dir: str = "."): + self.output_dir = output_dir + + def call(self, arguments: dict) -> str: + label = arguments["label"] + # ... save the latest camera frame to /