Compare commits


93 Commits

Author SHA1 Message Date
Pepijn 965d42825f review: skip-count fix, atomic writes, dedupe span reconstruction, role guards
**#1 Plan-update phase reports correct skip count.**
``_run_plan_update_phase`` only ran ``run_plan_updates`` for episodes
with at least one interjection but hardcoded ``episodes_skipped=0``.
The summary undercounted skipped episodes. Now returns
``len(records) - processed`` so processed + skipped == total.

**#2 ``run_hf_job.py`` installs ``openai``.**
The ``CMD`` block does ``pip install --no-deps lerobot[branch]`` then
explicitly lists transitive deps. ``openai`` was missing — and since
``VlmConfig.backend`` defaults to ``"openai"``, the job would have
``ImportError``'d when ``vlm_client._make_openai_client`` ran.

**#3 Dedupe subtask-span reconstruction.**
Module 1's ``_reconstruct_subtasks_from_rows`` (no ``and spans`` guard)
and Module 2's ``_read_subtask_spans`` (with the guard) had near-
identical logic. Promoted to ``reconstruct_subtask_spans`` in
``reader.py`` using the safer guarded form. Both modules now import
the single helper.

**#5 Atomic staging.py JSONL writes.**
Mirroring the parquet-writer fix from an earlier review round:
``EpisodeStaging.write`` now writes to a sibling ``.tmp`` file and
swaps it in atomically via ``Path.replace``. A crash mid-write can no
longer leave a half-written JSONL that ``read()`` would then fail to
parse.
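
A minimal sketch of the pattern, assuming a list-of-dicts JSONL payload
(helper name and signature are illustrative, not the actual
``EpisodeStaging`` API):

    import json
    from pathlib import Path

    def write_jsonl_atomic(path: Path, rows: list[dict]) -> None:
        # write everything to a sibling .tmp first ...
        tmp = path.with_name(path.name + ".tmp")
        with tmp.open("w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        # ... then swap it in atomically; readers see either the old or the new file
        tmp.replace(path)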

**#6 Atomic ``info.json`` write.**
Same pattern in ``executor._ensure_annotation_metadata_in_info`` —
``info.json`` is load-bearing for dataset metadata, so partial writes
brick the dataset.

**#7 Writer's role-key guard.**
``_normalize_persistent_row`` and ``_normalize_event_row`` accessed
``row["role"]`` directly while every other field used ``.get()``.
Pre-validate ``"role" in row`` and raise a friendly ``ValueError``
naming the row, so a future module that accidentally drops ``role``
fails with a triageable message instead of a bare KeyError deep in the
writer.
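
A sketch of the guard, assuming dict-shaped rows (the exact error
wording is illustrative):

    def _require_role(row: dict) -> str:
        # pre-validate instead of letting row["role"] raise a bare KeyError
        if "role" not in row:
            raise ValueError(f"annotation row is missing required 'role' field: {row!r}")
        return row["role"]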

**#8 Last subtask span's ``end`` extends to episode end.**
``reconstruct_subtask_spans`` (the new shared helper) takes an optional
``episode_end_t``. When provided, the final span's ``end`` is closed
to that timestamp instead of equalling its own ``start`` (zero
duration). Both Module 1's plan-update pass and Module 2's
interjection-anchoring pass now supply ``record.frame_timestamps[-1]``,
so downstream "current subtask at refresh_t" lookups no longer miss
refreshes that land inside the final span.
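
A hedged sketch of the closing rule, assuming each subtask row carries
a start ``timestamp`` (row shape and return type are assumptions; only
the ``episode_end_t`` behaviour mirrors the description above):

    def reconstruct_subtask_spans(rows: list[dict], episode_end_t: float | None = None) -> list[tuple[float, float]]:
        starts = sorted(r["timestamp"] for r in rows if r.get("style") == "subtask")
        spans = []
        for i, start in enumerate(starts):
            if i + 1 < len(starts):
                end = starts[i + 1]          # next subtask's start closes this span
            elif episode_end_t is not None:
                end = episode_end_t          # final span extends to the episode end
            else:
                end = start                  # old behaviour: zero-duration final span
            spans.append((start, end))
        return spans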

Sweep: 66 passed, 0 failed. Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:18:09 +02:00
Pepijn 1238a0cd47 test(annotate): unstale the two failing module tests
Both tests were stale relative to design changes that landed earlier on
this branch. Update the tests to match the current production contract.

**``test_module1_attaches_video_block_to_subtask_prompt``**

The test took ``captured[0]`` and asserted on its content blocks, but
Module 1 issues several sub-prompts and the rephrasings call (which is
text-only, no video block) usually lands first. Two fixes:

* The test's intent is "the subtask prompt carries the video block" —
  not "the first prompt carries it". Pick the call by content
  (``"atomic subtasks"`` keyword in the text block) so the test is
  resilient to future reordering of unrelated sub-prompts.
* Set ``n_task_rephrasings=0`` so the rephrasings call is skipped
  entirely — keeps the test focused on ``_generate_subtasks``.

**``test_module2_mid_episode_emits_paired_interjection_and_speech``**

Two issues both rooted in design changes on the branch:

1. ``InterjectionsAndSpeechModule._mid_episode_interjections`` now
   anchors interjections on subtask boundaries from Module 1's staging
   tree, bailing out with zero rows when no spans exist. The production
   executor runs Module 1 first; the test ran Module 2 in isolation.
   Reproduce the contract by seeding two ``style=subtask`` rows in the
   staging before calling Module 2 — gives it the single ``0 → 1``
   boundary it needs.
2. The test's stub responder used the marker ``"ONE realistic
   interruption"`` to match the interjection prompt, but that string is
   from a previous prompt version. The current
   ``module_2_interjection.txt`` says ``"Write ONE interjection..."`` —
   the old prompt asked for counterfactual interjections (e.g. "skip the
   wipe"), the new one anchors on the upcoming subtask. Marker updated
   to ``"Write ONE interjection"``; canned response wording aligned to
   the new design.

Sweep on the language stack: 66 passed, 0 failed (was 64 passed, 2
failed). Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:59:27 +02:00
Pepijn 53c7641885 review: fix dead-code bug, add thread safety, atomic writes, smaller cleanups
**Critical: video_for_episode was unreachable dead code.**
``video_for_episode`` was indented inside ``_decode_pyav_direct``, after
its ``return`` statement — Python parsed it as a nested function that
never executed. Module 1's ``_episode_video_block`` calls
``self.frame_provider.video_for_episode(record, target_count)`` on the
``use_video_url=False`` path, which would have AttributeError'd on any
real dataset. Tests passed only because they used ``_StubFrameProvider``
/ ``_NullProvider`` which have the method. Moved it to be a proper
method of ``VideoFrameProvider`` (right after ``frames_at``).

**Thread safety on VideoFrameProvider.**
The executor runs Module 1/2/3 phases under a ``ThreadPoolExecutor``, so
the per-instance ``_cache`` dict and the one-shot ``_warned_decode_fail``
flag were exposed to concurrent reads/writes. Added a ``threading.Lock``
field, wrapped cache reads/writes and the warn-flag check-and-set in
``with self._lock:``. Stub fixtures unaffected.
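
A reduced sketch of the locking discipline (class name and the decode
callable are stand-ins; only the attribute names mirror the
description above):

    import threading

    class _LockedFrameCache:
        def __init__(self) -> None:
            self._cache: dict = {}
            self._warned_decode_fail = False
            self._lock = threading.Lock()

        def get_or_decode(self, key, decode):
            with self._lock:                      # guarded cache read
                if key in self._cache:
                    return self._cache[key]
            frames = decode(key)                  # decode outside the lock
            with self._lock:                      # guarded cache write
                self._cache[key] = frames
            return frames

        def should_warn_once(self) -> bool:
            with self._lock:                      # atomic check-and-set of the warn flag
                if self._warned_decode_fail:
                    return False
                self._warned_decode_fail = True
                return True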

**episode_clip_path is now a method of VideoFrameProvider.**
Used to be a free function reaching into ``provider._meta.episodes`` and
``provider._meta.get_video_file_path`` from outside the class. As a
method it just uses ``self._meta``. The only caller (Module 1) updated;
no external callers.

**Atomic write in LanguageColumnsWriter.**
``pq.write_table(new_table, path)`` was overwriting the parquet shard
in place — a crash mid-write would corrupt the file. Now writes to a
sibling ``.tmp`` and swaps it in atomically via ``Path.replace``.

**Smaller items:**
* ``executor.py`` docstring opened with "four phases" but listed six.
  Now says "six phases" to match.
* ``[annotations]`` extra in ``pyproject.toml`` now includes
  ``openai>=1.40,<2.0``. Default ``VlmConfig.backend`` is ``"openai"``,
  so without it ``_make_openai_client`` would ImportError on a fresh
  ``uv sync --extra annotations``.
* ``_snap_to_frame`` was duplicated identically in
  ``plan_subtasks_memory.py`` and ``interjections_and_speech.py``.
  Promoted to ``snap_to_frame`` in ``reader.py`` (next to
  ``EpisodeRecord``); both modules now import it. Backwards-compat alias
  not needed — no external callers.
* ``EpisodeRecord.frames_df()`` was re-reading the full parquet on every
  call. Now memoizes via a private dataclass field so repeat calls from
  different modules pay the cost once. Method signature unchanged.
* ``_extract_first_json_object`` had a redundant ``and not escape`` guard
  that was dead because the prior block already handled and reset
  ``escape``. Replaced with a comment explaining the invariant.

**Pre-existing lint cleanups surfaced once these files entered
pre-commit's scope:**
* dead local ``client = clients[0]`` in ``_make_openai_client`` (the
  real round-robin uses ``clients[rr_counter[...]]``).
* ``cmd = ... if "{port}" in cmd else f"...{port}"`` ternary collapse in
  ``_spawn_parallel_inference_servers``.
* ``seek_pts = 0 if stream.time_base is None else int(...)`` ternary
  collapse in ``_decode_pyav_direct``.
* ``# nosec B310`` on the localhost ``urllib.request.urlopen`` probe in
  ``_server_is_up`` — the URL is the user-configured local-server endpoint
  the CLI itself spawned, not arbitrary user input.

**Test added.**
``tests/annotations/test_frames.py`` pins the regression on
``VideoFrameProvider``: asserts ``video_for_episode`` and
``episode_clip_path`` are callable methods (not nested dead code or
free functions), and that the ``_lock`` field is a real
``threading.Lock``.

Sweep: 64 passed, 2 failed (same pre-existing module-impl bugs as
before this commit). Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:53:43 +02:00
Pepijn 088c8371df refactor(annotate): consolidate Module 1's prompt → VLM → JSON-extract pattern
Five Module 1 sub-prompts (`_derive_task_from_video`,
`_generate_task_rephrasings`, `_generate_subtasks`, `_generate_plan`,
`_generate_memory`) all repeated the same shape:

    result = self.vlm.generate_json([messages])[0]
    if isinstance(result, dict) and isinstance(result.get(<field>), <type>):
        ...

…each spelled with slightly different field names + post-processing.

Three small helpers replace it:

* `_vlm_field(messages, field)` — single VLM call, returns
  ``result[field]`` or ``None``. Centralizes the
  ``generate_json([m])[0]`` + ``isinstance(dict)`` dance.
* `_text_message(text)` — wraps a string in the canonical user-message
  shape every text-only prompt builds inline.
* `_video_message(record, prompt)` — combines the episode video block
  with a prompt; replaces the duplicated video-block construction
  inside `_generate_subtasks` (which previously inlined the same
  ``use_video_url``/``frames_per_second``/``max_video_frames`` branches
  that `_episode_video_block` already implements).
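
A sketch of the three helpers, assuming OpenAI-style content blocks
(exact message shape and signatures are assumptions based on the
description above):

    def _vlm_field(self, messages: list[dict], field: str):
        # single VLM call; return result[field] or None
        result = self.vlm.generate_json([messages])[0]
        return result.get(field) if isinstance(result, dict) else None

    def _text_message(self, text: str) -> dict:
        # canonical text-only user message
        return {"role": "user", "content": [{"type": "text", "text": text}]}

    def _video_message(self, record, prompt: str) -> dict:
        # episode video block + text prompt in one user message
        video_block = self._episode_video_block(record)
        return {"role": "user", "content": [video_block, {"type": "text", "text": prompt}]}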

Net -35 LOC. Each call site is now 3-5 lines instead of 10-20. The
public method signatures are unchanged so tests don't move.

Drive-by: `_task_seems_bad` collapsed via SIM103 fix; `zip` in
`run_plan_updates` annotated `strict=True` per ruff B905.

Tests: same 2 pre-existing module-impl failures
(`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`) —
they were failing on `origin/feat/language-annotation-pipeline` before
this commit and continue to do so for the same reasons. 61/63 in the
language stack pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:29:45 +02:00
Pepijn 3a52a18b0e Merge branch 'feat/language-columns' into feat/language-annotation-pipeline
Resolve conflicts and pull in the latest PR 1 fixes.

Conflicts:
- pyproject.toml: PR 1 added `lerobot-rollout` and PR 2 added
  `lerobot-annotate` to the same `[project.scripts]` block. Kept both.
- uv.lock: dropped both sides and regenerated against the merged
  `pyproject.toml` (PR 2 dropped the `datatrove` dep when distribution
  moved to HF Jobs; PR 1's lock didn't have it).

Test follow-up:
- `tests/annotations/test_pipeline_recipe_render.py` — PR 1 deleted
  `src/lerobot/configs/recipes/pi05_hirobot.yaml` (review feedback:
  remove the canonical-recipe file; recipes are user-supplied). The
  cross-PR contract this test guards is "the recipe DSL renders
  non-empty messages from pipeline output", which doesn't depend on
  any specific YAML, so the test now builds an inline blend recipe
  with the same coverage. Passes.

Sweep: 82 passed, 2 failed (pre-existing module-impl bugs:
`test_module1_attaches_video_block_to_subtask_prompt`,
`test_module2_mid_episode_emits_paired_interjection_and_speech`).
The PR 1 carryover (`test_emitted_at_raises_on_ambiguous_per_camera_vqa`)
is now passing — the merge brought in PR 1's tightened `_select_one`
ambiguity check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:13:11 +02:00
Pepijn dad2cf1178 refactor(annotate): delegate distribution to HF Jobs; drop SLURM/local switch
The executor previously claimed it would "optionally hand off" to
datatrove's LocalPipelineExecutor or SlurmPipelineExecutor — but it
already runs phases inline in every code path, and HF Jobs (see
``examples/annotation/run_hf_job.py``) is the actual distribution
strategy. Stop pretending we have an executor selector.

* `executor.py`: drop `select_executor_class`, the "kind" log line, and
  the references to LocalPipelineExecutor / SlurmPipelineExecutor.
  Module docstring now says distribution is delegated to HF Jobs.
* `config.py`: drop `auto_threshold`, `force_local`, `slurm_partition`,
  `slurm_gpus`, `slurm_time`, `workers`. `ExecutorConfig` keeps only
  `episode_parallelism`. While here, prune the longer "why" docstrings
  on every field down to the load-bearing bits — full story moves to
  `docs/source/annotation_pipeline.mdx`.
* `pyproject.toml`: drop `datatrove>=0.4.0,<2.0.0` from the
  `[annotations]` extra; the dep was only there for the (never used)
  cluster executors. Comment block notes the new HF-Jobs delegation.
* `reader.py`, `lerobot_annotate.py`: drop their own datatrove /
  flavor-namespace mentions.
* `docs/source/annotation_pipeline.mdx`:
  - remove the flavor-namespace / sidecar paragraph (out of scope —
    "multiple revisions = multiple copies" is dataset-level policy);
  - remove the "writer drops the legacy `subtask_index` column" note
    (already covered by PR 1's intentional-break call-out);
  - remove the chat-template + `apply_chat_template(messages, tools=...)`
    line (covered by Tools doc);
  - replace the "executor picks Local vs Slurm" paragraph with
    `--executor.episode_parallelism` and a pointer to HF Jobs;
  - rewrite the style→recipe section to talk about "recipes" generically
    instead of pinning a specific YAML;
  - add a "Running on Hugging Face Jobs" section pointing at
    `examples/annotation/run_hf_job.py`;
  - add a "Running locally" example matching the CLI's docstring
    (`uv run lerobot-annotate --root=... --vlm.model_id=...`);
  - extend the paper-inspirations list with Pi0.7 and Steerable VLA
    Policies (Zhao 2025) for Module 3.

Tests: same 3 pre-existing failures as before this commit (2 module
assertions still in flight; 1 carryover from PR 1). 41/44 pass.
Pre-commit clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:09:22 +02:00
Pepijn bce5387e04 Merge branch 'main' into feat/language-columns 2026-05-08 10:29:49 +02:00
Steven Palma c8ce413d73 fix(robots): allign lekiwi default with so100 use_degrees (#3531) 2026-05-07 17:52:34 +02:00
Pepijn 82dffde7fa fix(ci): speed up multi-task benchmark evals (parallelize + cap VLABench steps) (#3529)
* fix(ci): run multi-task benchmark evals 5-at-a-time in parallel

The eval script supports running tasks concurrently via a
ThreadPoolExecutor (env.max_parallel_tasks). Apply it to the four
multi-task benchmark CI jobs (RoboTwin, RoboCasa, RoboMME, LIBERO-plus
— 8-10 tasks/task_ids each) so they finish in ~2 waves of 5 instead of
running sequentially. Single-task jobs (Libero, MetaWorld, RoboCerebra)
are unchanged.

* fix(ci): cap VLABench smoke eval at 50 steps per task

VLABench's default episode_length is 500 steps; with 10 tasks at ~1 it/s
the smoke eval took ~80 minutes of rollouts on top of the image build.
The eval is a pipeline smoke test (running_success_rate stays at 0% on
this short rollout anyway), so we don't need full episodes — cap each
task at 50 steps to bring total rollout time down ~10x.

* fix(ci): run VLABench tasks 5-at-a-time in parallel

The eval script already supports running multiple tasks concurrently via
a ThreadPoolExecutor (env.max_parallel_tasks). Set it to 5 so the 10
VLABench tasks finish in ~2 waves instead of running sequentially.
2026-05-07 13:37:16 +02:00
Ville Kuosmanen eaf0218bc8 feat(policy): use pretrained vision encoder weights by default for diffusion and vqbet (#3202)
* feat: add pretrained vision encoder weights for diffusion and vqbet

* fix test by re-generating artifacts

---------

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-05-07 12:10:38 +02:00
Pepijn a0e52d52fe fix(ci): bump robotwin benchmark image to CUDA 12.6 (#3525)
The robotwin benchmark Dockerfile still installed cuda-nvcc-12-4 and
cuda-cudart-dev-12-4 after #3505 upgraded the base image to CUDA 12.6.3
on Ubuntu 24.04. Those packages aren't available in the ubuntu2404 CUDA
repo, so the build failed at apt-get install. Bumping both to -12-6 to
match the base image.
2026-05-07 11:11:12 +02:00
Pepijn 85576acc29 docs(tools): drop follow-up-PR references
Reword the two callouts in `tools.mdx` to describe the runtime layer
in present tense ("not part of the catalog layer shipped today",
"those modules don't yet exist in the tree") instead of pointing at a
specific follow-up PR. Keeps the doc honest about what works now
without coupling it to a particular release order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 20:29:42 +02:00
Pepijn e7e5fca5de review: emitted_at uses 0.1s tolerance; MessageTurn requires stream at construction
* **Float tolerance in `emitted_at` for persistent styles.** The
  ``_timestamp(row) == t`` exact-equality check silently missed any
  caller that derived ``t`` arithmetically (e.g. ``frame_idx / fps``)
  even though the parquet timestamp would only differ by ULPs. Added
  ``EMITTED_AT_TOLERANCE_S = 0.1`` and check ``abs(...) <= tolerance``
  instead, with a docstring explaining why exact equality wasn't
  enough and why 0.1 s is safe at typical 30–100 Hz control rates.
  Test asserts the new behavior at half-window (matches) and
  double-window (no match) using the constant so it stays in sync.

* **`MessageTurn.stream` is required at construction.** It was typed
  ``MessageStream | None = None`` so YAML could omit ``stream:`` and
  pass the dataclass invariant — but ``_validate_rendered`` rejected
  ``None`` streams later, surfacing the error at the first sample
  instead of at recipe load. Now ``__post_init__`` raises
  ``ValueError`` if ``stream`` is ``None``, with the list of valid
  streams in the message. The redundant late-stage check in
  ``_validate_rendered`` is replaced with a one-line comment that
  cites the upstream invariant. Test pins the new construction-time
  rejection.
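
A minimal sketch of the construction-time invariant (the
``MessageStream`` members and the error wording are assumptions):

    from dataclasses import dataclass
    from enum import Enum

    class MessageStream(Enum):          # placeholder; the real enum lives in the recipe module
        PERSISTENT = "language_persistent"
        EVENTS = "language_events"

    @dataclass
    class MessageTurn:
        stream: MessageStream           # no longer "MessageStream | None = None"

        def __post_init__(self) -> None:
            if self.stream is None:     # YAML loaders may still pass None explicitly
                valid = ", ".join(m.value for m in MessageStream)
                raise ValueError(f"MessageTurn.stream is required; valid streams: {valid}")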

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:55:08 +02:00
Pepijn beb22afd81 review: dedupe regex, centralize column names, harden collate, more tests
* **#2 — dedupe `_PLACEHOLDER_RE`.** The same regex was compiled in
  `recipe.py` and `language_render.py`. Promote to module-level
  `PLACEHOLDER_RE` in `recipe.py` (its primary owner — declares
  template syntax) and import from `language_render.py`.
* **#3 — centralize language column names.** `io_utils.py` had
  hardcoded `{"language_persistent", "language_events"}` literals at
  two sites. Replace with `LANGUAGE_COLUMNS` import so a future column
  rename can't silently desync.
* **#4 — defensive collate preserved-keys.** `lerobot_collate_fn`
  silently filtered language fields from samples that didn't have
  them, which would hand downstream consumers a preserved list
  shorter than the tensor batch. Now: if any sample carries a key,
  every sample in the batch must carry it; otherwise raise a
  `ValueError` so the upstream rendering bug surfaces at the boundary.
* **#5 — `_scalar` rejects non-singleton lists.** Previously a zero-
  or multi-element list fell through and triggered confusing
  `float([])` errors downstream. Now raises `ValueError` with the
  actual length.
* **#6 — refactor `_extract_complementary_data`.** Replace 11 lines
  of `key = {... if ... else {}}` plus an 11-line splat dict with a
  single `_COMPLEMENTARY_KEYS` tuple iterated once.
* **#7 — document `EXTENDED_STYLES`.** Was an empty `set()` with no
  comment. Add a docstring explaining it's an intentional extension
  point: downstream modules append project-local styles before
  `column_for_style` is called.
* **#9 — `tools.mdx` notes the runtime layer is future work.** The
  page referenced `src/lerobot/tools/`, `registry.py`, and
  `get_tools(meta)` — none exist in this PR. Added a callout at the
  start of "How to add your own tool" plus a note on the
  implementations paragraph.
* **#10 — tests for YAML round-trip, malformed rows, blend
  validation.** `test_recipe.py` grew from 1 case to 12 covering:
  blend-or-messages exclusivity, target-turn requirement, blend
  emptiness, weight presence/positivity, nested-blend rejection,
  `from_dict` with nested blends, `from_yaml` / `load_recipe`
  agreement, top-level non-mapping rejection. Added a malformed-row
  test for `_normalize_rows` that asserts non-dict entries raise
  `TypeError`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 19:06:38 +02:00
Haoming Song e99c55af4b feat(policies): add EO-1 model (#3403)
* feat(policies): add EO-1 model

* chore(eo1): adjust policy_eo1_README.md to avoid duplicate with eo1.mdx

* chore(eo1): remove policy_eo1_README.md, link eo1.mdx in policy folder

---------

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-05-06 18:01:16 +02:00
Steven Palma 408e0ca763 fix(robots): openarm features with openarmmini (#3524) 2026-05-06 17:03:09 +02:00
Pepijn d55b581ca1 fix(language): address review — tools accessor, motion docs, conditional collate
* **`meta.tools` actually reads `info.json["tools"]`.** `DatasetInfo`
  had no `tools` field, so `from_dict` silently dropped the key (it
  warned about unknown fields then discarded them) and the property
  always returned `DEFAULT_TOOLS`. Added `tools: list[dict] | None`
  to the dataclass; `to_dict()` drops it when unset so existing
  datasets keep a clean `info.json`. Fixed the accessor to read
  `self.info.tools` (the previous `.get(...)` would have raised
  AttributeError on the dataclass anyway). Added regression tests:
  fallback when absent, round-trip from disk, and round-trip
  through `DatasetInfo.from_dict` / `to_dict`.

* **`motion` is not view-dependent — fix the docs.** The mdx claimed
  rows of style `motion` must carry `camera`, but `VIEW_DEPENDENT_STYLES
  = {"vqa", "trace"}` and the validator agrees: motion primitives are
  joint/Cartesian-frame, not pixel-space. Updated both call-out
  paragraphs in `language_and_recipes.mdx`.

* **Conditional `collate_fn` swap.** Added `meta.has_language_columns`
  and gate the `lerobot_collate_fn` swap in `lerobot_train.py` on it,
  so non-language datasets keep PyTorch's `default_collate`. Also
  added a pass-through test in `test_collate.py` that asserts on a
  plain tensor batch the custom collate matches `default_collate`
  key-for-key, plus a test for the `None`-sample drop path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:51:06 +02:00
Pepijn 24d2ffe3c6 fix(language): keep base install green — drop processor re-export, gate dataset-extra tests
`lerobot.processor` re-exported `RenderMessagesStep` at the package
level, so importing anything from `lerobot.processor` pulled in
`lerobot.datasets.language` → `lerobot.datasets/__init__.py` →
`require_package("datasets")`, which fails in the Tier 1 base install
that intentionally omits the `[dataset]` extra. The chain bricked
collection for unrelated suites (`tests/policies/pi0_pi05/...`,
`tests/envs/...`, etc.).

* Stop re-exporting `RenderMessagesStep` from `lerobot.processor`. The
  only consumer (the test) already imports from the submodule.
  Document the deliberate omission in the module docstring.
* Add `pytest.importorskip("datasets", ...)` (and `pandas` where
  needed) at the top of the four PR-added tests that exercise the
  language stack:
  - tests/datasets/test_language.py
  - tests/datasets/test_language_render.py
  - tests/processor/test_render_messages_processor.py
  - tests/utils/test_collate.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:12:54 +02:00
Pepijn 789f29aa56 chore: fix CI — collapse short ValueError to one line, refresh uv.lock
* `ruff format` on CI (newer version) wants the short `camera=None`
  ValueError on a single line.
* `uv.lock` was stale relative to `pyproject.toml`'s `datasets>=4.7.0`
  pin (and picked up upstream `s390x` marker fixes for cuda packages).
  CI runs `uv sync --locked` which rejected the divergence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:05:42 +02:00
Pepijn a356b12c41 fix(language): always raise on ambiguous resolver matches
`_select_one` previously skipped its ambiguity check whenever any of
`role`/`tool_name`/`camera` was set, on the assumption that the caller
had already pinned down a unique row. That left a real ambiguity hole
for VQA: with two cameras emitting `(vqa, assistant)` at the same
frame, `emitted_at(..., role="assistant")` silently picked the first
sorted row instead of telling the recipe to add `camera=...`. The
existing `test_emitted_at_raises_on_ambiguous_per_camera_vqa` test
already encoded the desired behavior.

Tighten the check: any time `len(rows) > 1` we now raise with the
selectors echoed back, so users see exactly which fields they passed
and that more is needed to disambiguate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:00:45 +02:00
Pepijn e8327b8e62 refactor(language): unify resolver dispatch and prune redundant test scaffolding
* Drop the unused `events` kwarg from `active_at`/`nth_prev`/`nth_next`;
  only `emitted_at` actually consults events. The dispatcher in
  `_resolve_spec` now passes events conditionally.
* Replace the dual `_persistent_sort_key`/`_event_sort_key` pair with a
  single `_row_sort_key` and drop the `sort_key` parameter from
  `_select_one`. Event rows lack `timestamp` (it is implicit in the
  frame) and now default to `0.0` for sort purposes — the
  `(style, role)` tiebreaker is unchanged.
* Inline `_select_latest` into `active_at` (its only caller).
* Collapse `emitted_at`'s dual-branch into one `_select_one` call.
* Tighten `_validate_persistent_resolver` to a single
  `column_for_style(style) != LANGUAGE_PERSISTENT` check.
* Parameterize `test_per_camera_blend_renders_both_views` over the two
  cameras and factor the sub-recipe builder into `_vqa_subrecipe` so
  the test no longer hand-rolls two near-identical recipe blocks.

Net -98 LOC; behavior, public resolver names, and test expectations
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 13:15:45 +02:00
Pepijn c450298147 Apply ruff and prettier formatting after merge
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 12:10:41 +02:00
Pepijn 5c30b14929 Merge remote-tracking branch 'origin/main' into feat/language-columns 2026-05-06 12:09:13 +02:00
Maxime Ellerbach ce24063efd feat(dagger): adding smooth handover (#3506)
* feat(dagger): adding smooth handover


* update docstring


* small phase fix and documenting potential issues


* cleaning up
2026-05-05 14:44:32 +02:00
Steven Palma 82934719db chore(dep): bump transformers to 5.4.0 (#3374)
* fix(deps): breaking change from transformers 5.4.0

* Update src/lerobot/policies/xvla/modeling_florence2.py

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* Update src/lerobot/policies/wall_x/qwen_model/qwen2_5_vl_moe.py

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>

* removing dataclass

* bumping transformers 5.4.0

* weird i can't even pass the test on main

* oops, typo

* chore(style): fix pre-commit run

* chore: update uv.lock

* seems like a weird numerical precision issue, lets check in runners

* chore: update uv.lock

* chore(dependecies): adjust transformers version

* chore: update uv.lock

---------

Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: Maximellerbach <maxime.ellerbach@huggingface.co>
Co-authored-by: raushan <raushan@huggingface.co>
2026-05-05 14:19:09 +02:00
Steven Palma 401a217597 chore(ci): increase time stale (#3507) 2026-05-04 22:35:16 +02:00
Steven Palma 40094b0464 chore(ci): upgrade docker internal (#3505) 2026-05-04 21:28:52 +02:00
pepijn 8fa8323c91 fix(annotate): sync language metadata after parquet rewrite
Ensure annotated datasets advertise language columns in meta/info.json so non-streaming dataset loads cast against the rewritten parquet schema.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-04 15:17:15 +00:00
Jash Shah fdbfc015a2 fix(peft): fix LoRA resume from Hub (PosixPath + double wrap) (#3485) 2026-05-04 10:52:37 +02:00
Pepijn 73740ecf4b feat(annotate): write tool catalog to meta/info.json after annotation
After every ``lerobot-annotate`` run, the executor ensures
``meta/info.json["tools"]`` contains at minimum the canonical ``say``
schema, while preserving any tools the user pre-declared on the
dataset. Chat-template consumers (PR 3 SmolVLA2 / Pi0.5 / dataset
visualizer) read the catalog through
``LeRobotDatasetMetadata.tools`` and pass it to
``apply_chat_template(messages, tools=meta.tools, ...)``.

- ``executor.py``: new ``_ensure_tools_in_info`` helper called
  after the parquet rewrite. Idempotent and additive — merges by
  ``function.name``, only writes back if the list changed.
- ``writer.py``: drops the duplicated ``SAY_TOOL_SCHEMA`` /
  ``DEFAULT_TOOLS`` constants in favour of importing from
  ``lerobot.datasets.language`` (PR 1's single source of truth).
  Re-exported so existing imports keep working.
- ``annotation_pipeline.mdx``: replace the "code constant only" note
  with a pointer to the new Tools doc and a description of the
  meta/info.json behaviour, including how to pre-declare custom
  tools before annotation runs.

This is the storage half of the tools work; PR 3 ships the runnable
implementations under ``src/lerobot/tools/`` (one file per tool,
first up: ``say.py`` wired to Kyutai's pocket-tts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:51:38 +02:00
Pepijn 1b81e49214 feat(annotate): task rephrasings + video-derived task fallback
Module 1 now produces ``task_aug`` rows (registered in PR 1) so the
PR-1 ``${task}`` resolver can rotate phrasings deterministically per
``sample_idx``. Plus an opt-in video-derived task that bypasses the
canonical ``meta/tasks.parquet`` task when it's empty, low-quality, or
explicitly disabled — every downstream Module-1 prompt then uses the
derived task as its grounding.

- ``Module1Config``: adds ``n_task_rephrasings`` (default 10) and
  ``derive_task_from_video`` ∈ ``{off, if_short, always}`` (default
  ``if_short``: triggers when canonical is empty, < 3 words, or matches
  a placeholder string like ``debug`` / ``unnamed`` / ``tbd``).
- ``plan_subtasks_memory.py``: ``run_episode`` now resolves an
  ``effective_task`` (canonical OR video-derived) and threads it
  through ``_generate_subtasks`` / ``_generate_plan`` /
  ``_generate_memory`` so subtasks, plans, and memory are all grounded
  in the same task string. Then generates ``n`` rephrasings of the
  effective task and writes them as ``task_aug`` rows at ``t=0`` with
  ``role=user``. The effective task itself is included as the first
  variant so the rotation is guaranteed to cover the source-of-truth
  phrasing.
- New prompts: ``module_1_video_task.txt`` (one-shot video → task),
  ``module_1_task_rephrasings.txt`` (text-only paraphraser, ``n`` per
  call).
- ``meta/tasks.parquet`` is NOT modified — derived tasks live only in
  ``language_persistent``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn d813c75b76 fix(annotate): align interjections with the actual demo trajectory
qwen36moe-11 surfaced a deeper semantic problem with mid-episode
interjections: they were generated as *counterfactual* user requests
("actually skip the wipe", "use the blue one instead") but teleop data
is frozen — the robot in the video already executed everything,
including the steps the user "asked to skip". The training signal was
therefore self-contradictory: interjection text said one thing, the
robot's subsequent action stream did the opposite.

Flip the framing. Anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask. The
robot's visible next behavior IS the interjection's effect, so:

  interjection text → plan refresh → action stream

are all consistent with the same observed video.

Concretely:

- ``interjections_and_speech.py``: instead of sampling random
  timestamps from ``frame_timestamps``, walk Module 1's subtask spans
  and sample from the (subtask N → subtask N+1) transitions. Pass both
  the just-finished and the upcoming subtask texts into the prompt.

- ``_window_timestamps``: re-center the multi-frame video window on
  the boundary itself (half the frames cover the end of the previous
  subtask, half cover the start of the next one) so the VLM has the
  same visual conditioning the policy will see at training time.

- ``module_2_interjection.txt``: rewritten. The prompt now states
  explicitly that this is offline data, the robot already committed to
  the next subtask, and the interjection must be a natural request
  that aligns with — not contradicts — the next subtask. Removes the
  "negative task / situated correction" Hi Robot framing because those
  scenarios require online execution to be coherent.

Plan-refresh logic from the previous commit (forwarding interjection
text into the refresh prompt) is unchanged and now reinforces the same
direction: the refreshed plan emphasizes the upcoming subtask the
interjection just asked for.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 3434d2ef22 fix(annotate): ground interjections in video + propagate text to plan refresh
qwen36moe-10 showed three Module-2 / plan-refresh quality issues that
are not architecture problems — they're prompt-grounding bugs:

1. Interjection prompt passed ``current_subtask = record.episode_task``
   (the WHOLE-episode task), not the actual subtask in force at the
   chosen timestamp. The VLM had no signal about what was visible at
   that moment, so its interjections were generic ("actually skip X"
   where X had nothing to do with the visible activity).

2. Interjection prompt only attached a single frame
   (``frames_at(record, [t_snap])``). With one frozen image the VLM
   couldn't read the ongoing motion. Module 1 already gets the whole
   episode video for subtask decomposition, which is why subtasks are
   well-grounded; Module 2 was the outlier.

3. The plan-refresh prompt told the model "a plan refresh after a user
   interjection at t=X.YZs" but never showed it the interjection
   *text*. So the refreshed plan couldn't actually reflect the user's
   correction — at best it recombined the same step list.

Fix:

- ``interjections_and_speech.py``: Module 2 reads Module 1's subtask
  rows from the same staging tree (executor orders module_1 → module_2
  so they're already there) and resolves the actual ``current_subtask``
  at each chosen timestamp. Pulls a small clip
  (``interjection_window_seconds`` × ``interjection_window_frames``,
  defaulting to 4 frames over the leading 2 s) instead of one frame.
  Drops the silently-zeroing ``len(candidate_ts) // 4`` cap on the
  interjection count.

- ``module_2_interjection.txt``: prompt is rewritten to reference the
  multi-frame visual context and require the interjection to mention
  something visible OR named in the current subtask, not invented.

- ``plan_subtasks_memory.py``: ``run_plan_updates`` now accepts and
  threads through interjection texts. ``_generate_plan(refresh_t,
  interjection)`` injects both the current subtask AND the interjection
  text into the prompt so the refreshed plan can drop / reorder /
  constrain steps to match the user's correction. (Plan still refreshes
  ONLY at user interjections — subtask generation runs ~1 Hz at
  inference, plan re-emission is event-driven.)

- ``executor.py``: forwards ``interjection_texts`` alongside
  ``interjection_times`` to ``run_plan_updates``.

- ``Module2Config``: bumps ``max_interjections_per_episode`` default
  from 1 to 3 and exposes ``interjection_window_seconds`` /
  ``interjection_window_frames``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn b71e10da6b refactor(annotate): drop dataset-level `tools` parquet column
PR 2 used to write a top-level ``tools`` column on every parquet shard
holding the JSON schema for the ``say`` tool, broadcast identically
across every row. That extends PR 1's schema for no real information
gain — the schema is a fixed code constant, parquet's RLE/dict encoding
collapses it on disk anyway, and HF/TRL chat-template consumers can
just import the constant directly.

PR 2 should fill in PR 1's existing schema, not add to it. So:

- ``writer.py``: stop emitting the ``tools`` column. Strip any legacy
  ``tools`` column from older shards on rerun so the schema converges to
  v3.1. ``SAY_TOOL_SCHEMA`` stays as a public constant (now joined by
  ``DEFAULT_TOOLS = [SAY_TOOL_SCHEMA]``); chat-template policies and the
  visualizer import them directly.
- ``test_writer.py``: replace the "tools column present" assertion with
  one that explicitly checks the column is absent, plus a new test
  asserting the constant's shape.
- ``test_pipeline_recipe_render.py``: drop the tools-column read; assert
  it's not present in the rewritten parquet.
- ``annotation_pipeline.mdx``: update the writer description to note the
  parquet stays small and the schema lives as a code constant.

If multi-tool-set support ever becomes real (datasets with different
tool inventories), the right home is ``meta/info.json["tools"]`` —
adding it later is non-breaking; ripping out a parquet column already
shipped is not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 0f6e3230df fix(annotate): decode video frames with PyAV directly
``lerobot.datasets.video_utils.decode_video_frames`` routes
``backend="pyav"`` through ``decode_video_frames_torchvision`` →
``torchvision.io.VideoReader``, but ``VideoReader`` was removed in
torchvision >= 0.22 (the vllm/vllm-openai:latest container ships with
torchvision 0.25). That made every Module 3 frame decode raise
``AttributeError: module 'torchvision.io' has no attribute 'VideoReader'``,
which the previous catch-all silently turned into an empty image list,
which then made every Module 3 prompt skip via the
``not _has_image_block(messages)`` branch and produce zero VQA rows.

Bypass ``video_utils`` entirely. The annotation pipeline only needs
a handful of PIL frames per (episode, ts), so a direct PyAV decode is
both simpler and insulated from torchvision API churn. ``av`` is already
in the install set, no new dependency.
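
A hedged sketch of the direct decode, not the exact
``_decode_pyav_direct`` implementation (seek/stop logic simplified):

    import av
    from PIL import Image

    def decode_frames(video_path: str, timestamps: list[float]) -> list[Image.Image]:
        frames: list[Image.Image] = []
        with av.open(video_path) as container:
            stream = container.streams.video[0]
            for ts in sorted(timestamps):
                # seek to the keyframe at/before ts, then decode forward to the first frame >= ts
                offset = int(ts / stream.time_base) if stream.time_base else 0
                container.seek(offset, stream=stream)
                for frame in container.decode(stream):
                    if frame.time is not None and frame.time >= ts:
                        frames.append(frame.to_image())
                        break
        return frames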

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 2f2e42c4aa log(annotate): warn loudly on first video decode failure
VideoFrameProvider._decode used to swallow every exception silently and
return []. That made Module 3 (VQA) produce zero rows whenever local
video decoding broke (codec, backend, missing file, ...) because every
prompt got skipped via the ``not _has_image_block(messages)`` branch in
general_vqa.py — without any signal in the job log.

Log the first failure with full exception info (subsequent failures
stay quiet to avoid log spam) so this fast-path is debuggable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 5ee0104739 log(annotate): surface resolved frame-provider cameras at startup
Print the default and full camera list once at the top of every run so a
silent Module-3-no-op (cam_keys=[]) is visible in the job log instead of
only being discoverable by counting parquet rows after upload.

Also warn loudly when Module 3 is enabled but no cameras resolved, with
a hint about the --vlm.camera_key fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn e064cfcb04 fix(annotate): seed Module 3 cameras from camera_keys + camera_key fallback
Module 3 fast-pathed out (50 episodes in 0.6s) when
``frame_provider.camera_keys`` came back empty even though Module 1/2
worked, because they use ``frame_provider.camera_key`` (singular) and
were happy with the explicit ``--vlm.camera_key=...`` override.

Two fixes:

- ``frames.py``: read ``meta.camera_keys`` (covers both video- and
  image-stored cameras) instead of ``meta.video_keys`` (video-only),
  matching :class:`LeRobotDatasetMetadata`'s canonical accessor. If
  metadata still surfaces nothing but the caller explicitly passed
  ``--vlm.camera_key=<key>``, fall back to ``[<key>]`` — the key is by
  definition known to exist on the dataset.
- ``general_vqa.py``: emit a one-time WARNING log when Module 3 sees
  zero cameras so this never silently produces zero VQA again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn b3d9494831 docs(annotate): add HF Jobs runner example for lerobot-annotate
A ready-to-run example of launching the annotation pipeline on a
Hugging Face job (h200x2) with two vllm replicas serving
Qwen3.6-35B-A3B-FP8. Lives next to other end-to-end recipes under
examples/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 1217fdb6f0 feat(annotate): emit VQA per-camera and propagate camera field
Module 3 now produces one (vqa, user) + (vqa, assistant) pair per
emission tick *per camera* rather than only against the dataset's first
camera. Each emitted row carries the `camera` field added in PR 1
(language-columns), so the resolver can disambiguate per-camera VQA via
`emitted_at(t, style=vqa, role=assistant, camera=...)` without ambiguity.

- `frames.py`: `FrameProvider` Protocol gains a `camera_keys` property
  and a `camera_key=` argument on `frames_at` / `video_for_episode`.
  `VideoFrameProvider` exposes every `observation.images.*` key the
  dataset declares (not just the first) and keys its decode cache on
  `(episode, camera, timestamp)` so per-camera reads don't collide.
  Module 1 / 2 keep their old single-camera behaviour by leaving
  `camera_key=None` (falls back to the default camera).
- `modules/general_vqa.py`: `run_episode` iterates `frame_provider
  .camera_keys` for each emission tick, builds one prompt per camera,
  batches all of them through the VLM, and stamps the resulting rows
  with `camera=<that key>`. Empty `camera_keys` (null provider) makes
  the module a no-op rather than silently emitting untagged rows.
- `writer.py`: `_normalize_persistent_row` / `_normalize_event_row`
  carry `camera` through and call `validate_camera_field` so the
  invariant is enforced at the writer boundary. Event sort key now
  includes `camera` for deterministic ordering when several cameras
  share `(timestamp, style, role)`. `speech_atom` sets `camera=None`.
- `validator.py`: `StagingValidator` gains a `dataset_camera_keys`
  field; `_check_camera_field` enforces the invariant and cross-checks
  every view-dependent row's `camera` against the dataset's known video
  keys. New `_check_vqa_uniqueness_per_frame_camera` flags duplicate
  `(vqa, role)` pairs at the same `(t, camera)`.
- `lerobot_annotate.py`: passes the live frame provider's
  `camera_keys` into the validator so the cross-check uses the actual
  dataset camera set.
- Tests: `_StubFrameProvider` exposes `camera_keys` and accepts the new
  `camera_key=` kwarg. `test_module3_vqa_unique_per_frame_and_camera`
  configures two cameras and asserts both are represented, that every
  emitted row has a `camera` tag, and that uniqueness holds per
  `(timestamp, camera, role)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn d0388e1142 fix(annotate): transcode subclips to H.264 instead of stream-copy
Modern LeRobot datasets store videos in AV1, which vllm's libav build
cannot decode (the video processor returns 0 frames and downstream
chokes with ZeroDivisionError). Re-encode each per-episode subclip
with libx264 (preset ultrafast, crf 23) so the resulting mp4 is
universally decodable. Strip audio with -an for a smaller payload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:36 +02:00
Pepijn 524aa59faa feat(annotate): pack multiple vllm replicas per GPU via num_gpus
Adds VlmConfig.num_gpus so parallel_servers can exceed the physical
GPU count. Replicas are round-robin-assigned to GPUs (e.g.
parallel_servers=4 + num_gpus=2 → replicas pinned to GPUs 0,1,0,1).
Backward-compatible: num_gpus=0 keeps the existing 1-replica-per-GPU
behavior.
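
A sketch of the round-robin pinning (function name hypothetical; the
real code sets CUDA_VISIBLE_DEVICES per spawned replica):

    def gpu_for_replica(replica_idx: int, num_gpus: int) -> int:
        # num_gpus=0 keeps the old 1-replica-per-GPU behavior (replica i -> GPU i)
        return replica_idx if num_gpus <= 0 else replica_idx % num_gpus

    # parallel_servers=4, num_gpus=2 -> replicas pinned to GPUs 0, 1, 0, 1
    assignment = [gpu_for_replica(i, 2) for i in range(4)]   # [0, 1, 0, 1]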

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 27f7829b09 feat(annotate): forward chat_template_kwargs to OpenAI extra_body
Lets callers pass per-request template flags such as
{"enable_thinking": false} for Qwen3.5/Qwen3.6 models, where the
default thinking preamble otherwise consumes the entire max_new_tokens
budget before any JSON is emitted.
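
Roughly how the flag reaches the server through the OpenAI client
(client, model_id and messages assumed already in scope):

    resp = client.chat.completions.create(
        model=model_id,
        messages=messages,
        # the server reads chat_template_kwargs from the request body and
        # forwards it to the chat template (e.g. to disable Qwen's thinking preamble)
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    )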

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 7f8bf108e8 fix(annotate): include prompt .txt files in wheel
The setuptools package-data declaration only listed envs/*.json, so
pip-installed wheels (including HF Jobs runs) were missing the
module_1_subtasks/plan/memory and module_2/3 prompt templates,
causing FileNotFoundError at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 855ff027f8 refactor(annotate): drop HF Inference Providers code path
Default backend is now a local OpenAI-compatible server (vllm /
transformers) which auto_serve spawns. Removes the
use_hf_inference_providers config flag and the router.huggingface.co
routing branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 3b797bb118 feat(annotate): --vlm.push_to_hub uploads the annotated dataset
After the pipeline completes, optionally create/locate a dataset repo
and upload the dataset root (excluding .annotate_staging/). Add
push_private and push_commit_message knobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn aea04721ae feat(annotate): parallelize episodes within each module phase
Saturates parallel_servers + client_concurrency. Previously the
executor processed one episode at a time, so each Module 1 episode's
3-5 dependent VLM calls hit a single server with the others idle. Now
defaults to 16 episodes in flight; configurable via
ExecutorConfig.episode_parallelism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn ab5479129a fix(annotate): probe /v1/models for spawn-helper readiness
vllm with --uvicorn-log-level warning suppresses the "Uvicorn running"
banner that the readiness watcher waited for, so the spawn helper hung
forever even after the API was live. Add an HTTP probe in parallel with
the log watcher and broaden the log markers to include vllm's own
"Starting vLLM API server" / "Available routes are" lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn e6d4ac6f02 fix(annotate): lock-protect per-line writes for parallel server streams
8 server-streaming threads writing chars unsynchronized cause UTF-8
sequences from different servers to interleave mid-byte, garbling the
terminal output. Switch to line-buffered reads with a single shared
print lock — output stays readable, ready-marker detection still works
on the line containing 'Uvicorn running' / 'Application startup
complete'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 5722d365c5 feat(annotate): client_concurrency for parallel in-flight requests
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.

Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.
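
A reduced sketch of the client-side fan-out (names illustrative; the
server batches the concurrent requests internally):

    from concurrent.futures import ThreadPoolExecutor

    def generate_batch(client, model_id: str, prompts: list[list[dict]], client_concurrency: int = 16) -> list[str]:
        def _one(messages: list[dict]) -> str:
            resp = client.chat.completions.create(model=model_id, messages=messages)
            return resp.choices[0].message.content

        with ThreadPoolExecutor(max_workers=client_concurrency) as pool:
            return list(pool.map(_one, prompts))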

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 3d7e60cee4 feat(annotate): parallel_servers spawns N independent vllm replicas
Adds --vlm.parallel_servers=N. Spawns N independent vllm processes
(each pinned to GPU i via CUDA_VISIBLE_DEVICES, listening on
serve_port+i) and round-robins requests across them. Sidesteps DP/TP
NCCL setup failures on nodes with restricted P2P/SHM.

Default serve_command for parallel mode: vllm serve <model_id>
--tensor-parallel-size 1 --max-model-len 32768 --uvicorn-log-level
warning. Override via --vlm.serve_command (use {port} placeholder).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 7b767d4d60 feat(annotate): per-episode progress logs in executor 2026-04-30 18:48:35 +02:00
Pepijn f1e3ab7794 fix(annotate): don't crash pipeline on persistent JSON parse failure
Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 585341ba9f fix(annotate): robust JSON extraction (think tags + first balanced object)
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).
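
A hedged sketch of that recovery order (not the exact
_extract_first_json_object, but the same three steps):

    import json
    import re

    def extract_first_json_object(text: str):
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)   # 1. strip think blocks
        try:
            return json.loads(text)                                       # 2. direct parse
        except json.JSONDecodeError:
            pass
        depth, start, in_string, escape = 0, None, False, False
        for i, ch in enumerate(text):                                     # 3. first balanced {...}
            if in_string:
                if escape:
                    escape = False
                elif ch == "\\":
                    escape = True
                elif ch == '"':
                    in_string = False
                continue
            if ch == '"':
                in_string = True
            elif ch == "{":
                if depth == 0:
                    start = i
                depth += 1
            elif ch == "}" and depth:
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        return None
        return None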

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn 23ff346027 fix(annotate): stream child stdout char-by-char so tqdm \\r progress flushes 2026-04-30 18:48:35 +02:00
Pepijn 3c5cbe7af4 test(annotate): adjust video-block test for fps-based frame sampling 2026-04-30 18:48:35 +02:00
Pepijn f2cbd97635 feat(annotate): Module 1 samples image frames at fps rate
Replace the fixed max_video_frames count with a rate (default 1 fps).
A 30 s episode now sends 30 frames; a 5 s episode sends 5; capped at
max_video_frames (default 128) to avoid blowing up the payload on long
episodes.

Override with --module_1.frames_per_second=2.0 for denser sampling, or
--module_1.frames_per_second=0.5 for sparser.
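
The sampling rule in one line (helper name hypothetical):

    def n_frames(duration_s: float, frames_per_second: float = 1.0, max_video_frames: int = 128) -> int:
        # 30 s at 1 fps -> 30 frames; 5 s -> 5; long episodes are capped at max_video_frames
        return max(1, min(max_video_frames, int(round(duration_s * frames_per_second))))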

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn c06c8d594a feat(annotate): use cached HF token from huggingface-cli login
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.
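
A sketch of the resolution order (function name hypothetical;
get_token is huggingface_hub's public accessor for the cached login
token):

    import os
    from huggingface_hub import get_token

    def resolve_api_key() -> str | None:
        # explicit env vars win; otherwise fall back to the token cached by `huggingface-cli login`
        return os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_API_KEY") or get_token()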

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:35 +02:00
Pepijn cd495a3a9d feat(annotate): default to HF Inference Providers, no local GPU needed
Flip the default backend to 'openai' with use_hf_inference_providers=True
and a Qwen3-VL-30B-A3B-Instruct:novita default model_id. The CLI now
runs end-to-end without a local model load — annotations are produced
by sending video_url + prompt to https://router.huggingface.co/v1.

Switch back to local inference with --vlm.backend=vllm or
--vlm.use_hf_inference_providers=false.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn c99ac45cd1 feat(annotate): one-flag HF Inference Providers backend
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 13aaafeae0 fix(annotate): omit mm_processor_kwargs by default; transformers serve rejects it
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 2129648bf4 fix(annotate): mm_processor_kwargs in extra_body; inline file URLs as data URLs
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
  block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
  and inline it as a base64 data:video/mp4 URL so the server sees the
  bytes directly.

Both surface as std::bad_alloc on the server side when wrong, which is
unhelpful but explains what we hit.
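
A sketch of the inlining (helper name hypothetical):

    import base64
    from pathlib import Path

    def video_data_url(clip_path: Path) -> str:
        # read the local mp4 and embed it as a base64 data URL the server can consume directly
        b64 = base64.b64encode(clip_path.read_bytes()).decode("ascii")
        return f"data:video/mp4;base64,{b64}"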

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn f5cd3f6e4e fix(annotate): detect server ready via stdout banner, not /v1/models polls
transformers serve rescans the HF cache on every /v1/models request
which exceeds the 2s urllib timeout, leaving the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ecf5766301 fix(annotate): visible auto_serve via stdout prints + live server log stream
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 11597d4f71 fix(annotate): auto_serve defaults to True; probe before spawning
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 8b9c598cf4 feat(annotate): auto_serve mode spawns and tears down inference server
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.

Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn b325475b38 feat(annotate): video_url block for openai backend
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).

A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.

Enable with --module_1.use_video_url=true (and --vlm.backend=openai).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ef137ff86a feat(annotate): openai-compatible backend for transformers/ktransformers serve
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:

- launch the server out-of-process (transformers serve, vllm serve,
  ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
  --vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...

Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn c5df821a96 fix(annotate): use vllm.chat() API for multimodal prompts
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 7ec3d7999c fix(annotate): drop guided_decoding=dict (api differs across vllm)
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 712d63abbd fix(annotate): tolerate decoder returning fewer frames than requested
pyav (and sometimes torchcodec) can decode fewer frames than the number of
requested timestamps when some of them fall outside the video file's
content range. Drop strict=True from the zip and rely on the
None-filter to discard missing frames.
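
In effect (inputs illustrative):

```python
requested_ts = [0.0, 0.5, 1.0, 1.5]
decoded = ["f0", "f1", "f2"]  # decoder came back one frame short

frames = [
    (ts, frame)
    for ts, frame in zip(requested_ts, decoded)  # no strict=True: the unmatched timestamp drops off
    if frame is not None                         # None-filter still discards any explicit gaps
]
```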

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 6653999983 fix(annotate): default video decode backend to pyav
torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some
environments (Lustre/conda combos). The annotation pipeline calls
decode_video_frames many times per episode, so this is a hard blocker.
Default to pyav (always available via the av package) and let users
opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec.
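
Opting back in on a stack where torchcodec works (command illustrative):

```bash
LEROBOT_VIDEO_BACKEND=torchcodec lerobot-annotate --root=/path/to/dataset
```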

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn 4bdbedc9a0 fix(annotate): default trust_remote_code=False for HF loaders
Setting trust_remote_code=True unconditionally pulled custom loader
code that triggers std::bad_alloc post-load on Qwen3-VL — the official
transformers class is sufficient. Flip the default to False; keep the
config field so users can opt in for models that actually need it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn e240305e8e fix(annotate): default transformers backend to manual GPU placement
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.

LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.
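
A sketch of the bypass, using the transformers auto-class discussed elsewhere
on this branch (checkpoint id illustrative):

```python
import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B",      # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    # no device_map="auto": keep accelerate's post-load dispatch out of the picture
)
model = model.to("cuda")          # manual placement after the CPU-side load
```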

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ccd189b264 fix(annotate): LEROBOT_DISABLE_CUDNN escape hatch for conv3d crash
cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in
Qwen-VL vision tower patch embedders fails with
CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size
and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv
for video patch embedding.

Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d
kernels (slower but functional) so the pipeline can run while the
torch/cuDNN stack is still on the broken combo.
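
What the escape hatch presumably amounts to in torch terms (an assumption,
not the repo's exact code):

```python
import os

import torch

if os.environ.get("LEROBOT_DISABLE_CUDNN") == "1":
    # Disable cuDNN globally so conv3d falls back to native PyTorch kernels
    torch.backends.cudnn.enabled = False
```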

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:34 +02:00
Pepijn ef1242bbd4 fix(annotate): expose gpu_memory_utilization and max_model_len for vllm
Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100,
leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's
3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN
can't grab a workspace large enough.

- vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision
  encoder needs more cuDNN workspace.
- vlm.max_model_len — cap context to free KV cache memory; the 262k
  default for Qwen3 is wildly more than annotation prompts need.
- vlm.trust_remote_code — already plumbed; now also passed to LLM().
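
For example, trading context and KV cache for cuDNN workspace (values illustrative):

```bash
lerobot-annotate \
  --vlm.backend=vllm \
  --vlm.model_id=Qwen/Qwen3-VL-30B-A3B \
  --vlm.gpu_memory_utilization=0.7 \
  --vlm.max_model_len=32768
```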

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn ebf4a04d41 fix(annotate): pass trust_remote_code=True to HF auto-classes
Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that
ship custom loader code in their repo. Without it, the FP8
weight_scale_inv parameters never bind to FP8Linear modules and the
post-load dispatch path bad-allocs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 4419b4ef1b fix(annotate): low_cpu_mem_usage=True on transformers load path
The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM —
it triggers in the post-load tensor-placement path even on hosts with
2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate
staging buffer and is the standard accelerate workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn ff06ca82d2 fix(annotate): use device_map='auto' for transformers backend
Without device_map, transformers stages the full FP8 checkpoint in CPU
RAM before any GPU placement, OOMing the host on 27B+ models even when
the GPU has enough VRAM. device_map='auto' streams shards directly to
GPU memory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn fcb01e73eb fix(annotate): try AutoModelForImageTextToText first, fall back to AutoModelForVision2Seq
Newer transformers versions renamed/removed AutoModelForVision2Seq in
favour of AutoModelForImageTextToText for VL models. Try the new name
first and fall back gracefully so the transformers backend works on
both transformers 4.45-4.5x and 5.x.
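
The fallback is roughly this shape:

```python
try:
    from transformers import AutoModelForImageTextToText as AutoVLModel  # newer transformers
except ImportError:
    from transformers import AutoModelForVision2Seq as AutoVLModel       # older releases
```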

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 268f8d1f53 fix(annotate): replace Literal types with str for older draccus
Older draccus versions (e.g. 0.10.x bundled in some envs) lack a decoder
for typing.Literal and raise:
  No decoding function for type typing.Literal['vllm', 'transformers', 'stub']

Switching VlmConfig.backend from Literal to str works under every
draccus version. The runtime branch in vlm_client.make_vlm_client
already validates the value and raises ValueError on unknown backends,
so the constraint stays enforced.
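
Roughly (field and guard as described above; the actual dataclass and default may differ):

```python
from dataclasses import dataclass


@dataclass
class VlmConfig:
    backend: str = "vllm"  # plain str instead of Literal["vllm", "transformers", "stub"]


def make_vlm_client(cfg: VlmConfig):
    if cfg.backend not in {"vllm", "transformers", "stub"}:
        raise ValueError(f"Unknown VLM backend: {cfg.backend!r}")
    ...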

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 663fff0ae2 feat(annotate): Module 1 sees the whole episode as one video block
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model pools temporally itself and chooses
where to cut subtasks — no stride, no count, no keyframe count knob to
tune.

- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
  max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
  uniformly across the episode duration; ``_NullProvider`` returns []
  for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
  ``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
  the prompt template asks the model to "watch the whole clip, then
  segment it" with cut points decided from gripper/contact/regrasp
  events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
  ``max_video_frames: int = 32`` (model-capacity bound, not annotation
  logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
  the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
  of the old "Decompose the demonstration" phrase that no longer
  appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
  the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn 9d6af804bf feat(annotate): attach camera keyframes to module prompts; default to Qwen3.6-27B-FP8
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.

- New `frames.py`:
  - `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
    dataset's first `observation.images.*` stream via
    `LeRobotDatasetMetadata.get_video_file_path` and
    `decode_video_frames`, with the same `from_timestamp` shift the main
    dataset uses.
  - Per-process LRU cache so co-timestamped Module 1 plan-update + Module
    2 calls share decode work.
  - `make_frame_provider` falls back to a null provider when the dataset
    has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
  prepend image blocks before the text block.
  - Module 1 attaches `keyframes_per_episode` keyframes to the subtask
    decomposition prompt.
  - Module 2 attaches the frame at the interjection timestamp.
  - Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
  `Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
  `--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
  on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
  emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
  Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
  attachment behaviour and the no-video fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn f763f85213 feat: language annotation pipeline (PR 2/3)
Adds the steerable annotation pipeline (`lerobot-annotate`) that populates
the `language_persistent` and `language_events` columns introduced in
PR 1 directly into `data/chunk-*/file-*.parquet`. No flavor namespace,
no sidecar tree.

Modules produced:
- Module 1 (plan_subtasks_memory): Pi0.7-style subtasks, plan (init +
  refresh on interjection), MEM-style memory at subtask boundaries.
- Module 2 (interjections_and_speech): t=0 speech-only acknowledgement,
  mid-episode paired interjection + speech tool-call atom.
- Module 3 (general_vqa): bbox/keypoint/count/attribute/spatial pairs at
  configurable cadence with one-retry JSON validation.

Writer enforces: per-episode persistent identity, exact-frame event
timestamps, column routing per `column_for_style`, dataset-level `tools`
column with the `say` schema, drops legacy `subtask_index`. Validator
runs against staged JSONL artifacts before the writer rewrites parquet.

Adds `lerobot-annotate` console script, `annotations` extra (datatrove +
optional vllm), `make annotation-e2e` opt-in smoke target, and
`docs/source/annotation_pipeline.mdx`.

Branched from PR 1 (`feat/language-columns`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:48:33 +02:00
Pepijn e3e9374e2c feat(language): tool catalog in meta/info.json + LeRobotDatasetMetadata.tools
Stores OpenAI-style function schemas at ``meta/info.json["tools"]`` so
datasets can declare which tools are available (today: just ``say``;
tomorrow: per-dataset extensions). The ``DEFAULT_TOOLS`` constant
fills in for unannotated datasets so chat-template consumers don't
have to special-case anything.

Three pieces:

- ``language.py``: ``SAY_TOOL_SCHEMA`` and ``DEFAULT_TOOLS``
  constants. Single source of truth — PR 2's writer and PR 3's
  runtime tool registry will both import from here instead of
  duplicating the dict.
- ``dataset_metadata.py``: ``LeRobotDatasetMetadata.tools`` property
  reads ``info.json["tools"]`` and falls back to ``DEFAULT_TOOLS``.
  Returns deep-copied dicts so callers can mutate the result safely.
- ``docs/source/tools.mdx``: spec page covering the catalog, per-row
  invocations, and the three-step "how to add a new tool" workflow
  (declare schema, implement, register). Linked from the docs
  toctree under the Datasets section.

This lays the groundwork for PR 2's pipeline writing the catalog out
during annotation, and PR 3's ``src/lerobot/tools/`` package shipping
runnable implementations (one file per tool — first up:
``say.py`` wrapping Kyutai's pocket-tts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 18:44:58 +02:00
Pepijn c1a0c601e2 feat(language): task_aug style + automatic ${task} rephrasing rotation
Adds task-prompt diversity (Xiao 2022 / CAST) without touching
``meta/tasks.parquet`` or forcing recipes to opt in. The plan reserved
``task_aug`` as a future style; this lands it now.

- ``language.py``: add ``task_aug`` to ``CORE_STYLES`` and
  ``PERSISTENT_STYLES``. ``column_for_style("task_aug")`` returns
  ``language_persistent`` so PR 2 writers route it correctly.

- ``language_render.py``: ``_resolve_task`` now consults the persistent
  slice for rows of ``style="task_aug", role="user"``. When any exist
  it picks one deterministically by ``sample_idx`` (blake2b-keyed, not
  Python's randomized hash) so an epoch sees every rephrasing of every
  episode while the same sample still resolves identically across
  reruns. Falls back to the canonical ``meta/tasks.parquet`` task when
  no rephrasings are present, so existing datasets and unannotated runs
  keep their behaviour. Explicit ``task=`` overrides still win.

- Tests: rephrasing coverage across samples, determinism on repeat
  ``sample_idx``, fallback when persistent has no ``task_aug`` rows,
  and explicit override priority.

Recipes get this for free: any ``${task}`` placeholder rotates through
the available rephrasings. Recipes that want the literal canonical task
can override the binding.
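
A sketch of the deterministic pick (helper name and exact keying are illustrative):

```python
import hashlib


def pick_rephrasing(rephrasings: list[str], sample_idx: int) -> str:
    """Blake2b-keyed choice: stable across reruns, varied across samples."""
    digest = hashlib.blake2b(str(sample_idx).encode(), digest_size=8).digest()
    return rephrasings[int.from_bytes(digest, "big") % len(rephrasings)]
```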

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:45:39 +02:00
Haoming Song d656da8ccc fix(pi): keep training sampling outside compiled forwards (#3487)
Move PI0 and PI0.5 noise/time sampling into the policy wrappers so the compiled PyTorch cores receive them as tensor inputs.

This keeps Beta sampling out of torch.compile on MPS, avoiding aten::_sample_dirichlet compilation errors while preserving the CUDA training path.

Validation: .venv/bin/python -m pre_commit run --files src/lerobot/policies/pi0/modeling_pi0.py src/lerobot/policies/pi05/modeling_pi05.py; .venv/bin/python -m pytest -sv -rs tests/policies/pi0_pi05/test_pi0.py tests/policies/pi0_pi05/test_pi05.py tests/policies/pi0_pi05/test_pi0_rtc.py tests/policies/pi0_pi05/test_pi05_rtc.py

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-04-30 13:21:17 +02:00
Pepijn 1ca38d9748 fix(language): drop motion from VIEW_DEPENDENT_STYLES
Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.

The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.
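
In code terms, the constant shrinks to:

```python
VIEW_DEPENDENT_STYLES = {"vqa", "trace"}  # motion dropped: robot-frame, hence camera-agnostic
```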

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:54:12 +02:00
Pepijn 5a6aa64570 feat(language): per-camera tagging on view-dependent styles
Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.

- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
  to both Arrow struct types and the HF datasets feature mappings;
  introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
  `is_view_dependent_style` and `validate_camera_field` helpers (camera
  required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
  resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
  `_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
  VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
  Without a `camera` filter, multi-row matches keep raising the existing
  ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
  `ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
  the matching image block), keeping the original 0.20 budget and
  documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
  `is_view_dependent_style`, `validate_camera_field` (both required and
  forbidden directions), per-camera `emitted_at` filtering, and the
  ambiguity error when two cameras emit `(vqa, assistant)` at the same
  timestamp without a `camera=` filter. RenderMessagesStep + dataset
  passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
  the per-camera resolver pattern, and the canonical recipe convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:48:17 +02:00
Pepijn 0b06790da0 feat(language): add motion (persistent) and trace (event-only) styles
Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:21:49 +02:00
Pepijn b43dc39ba4 Add docstrings to all new helpers; revert uv.lock
Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 14:15:03 +02:00
Pepijn 2b71221194 Address review: split persistent/event schemas, drop event timestamps
- recipe.py: derive _VALID_ROLES/_VALID_STREAMS from MessageRole/MessageStream Literals
- dataset_metadata.py: keep CODEBASE_VERSION at v3.0
- language.py: remove RESERVED_STYLES; split arrow/feature schemas into
  persistent (with timestamp) and event (without timestamp); add docstrings
- language_render.py: events use frame-row timestamp implicitly; no
  per-event timestamp filtering or sorting
- converters.py: drop unused subtask_key passthrough
- add docstrings to new public APIs (recipe, render_messages_processor, collate)
- update tests for split schemas; revert uv.lock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 13:38:23 +02:00
Pepijn 8833d735a1 Add extensive language support 2026-04-27 10:56:32 +02:00
137 changed files with 11936 additions and 2385 deletions
+6
View File
@@ -382,6 +382,7 @@ jobs:
--policy.path=\"\$ROBOTWIN_POLICY\" \
--env.type=robotwin \
--env.task=\"\$ROBOTWIN_TASKS\" \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
@@ -482,6 +483,7 @@ jobs:
--policy.path=lerobot/smolvla_robocasa \
--env.type=robocasa \
--env.task=CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove,CloseToasterOvenDoor,SlideDishwasherRack,TurnOnSinkFaucet,NavigateKitchen,TurnOnElectricKettle \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
@@ -693,6 +695,7 @@ jobs:
--env.task=\"\$ROBOMME_TASKS\" \
--env.dataset_split=test \
--env.task_ids=[0] \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
@@ -800,6 +803,7 @@ jobs:
--env.type=libero_plus \
--env.task=\"\$LIBERO_PLUS_SUITE\" \
--env.task_ids=\"\$LIBERO_PLUS_TASK_IDS\" \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
@@ -900,6 +904,8 @@ jobs:
--policy.path=lerobot/smolvla_vlabench \
--env.type=vlabench \
--env.task=select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
--env.episode_length=50 \
--env.max_parallel_tasks=5 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
+6 -6
View File
@@ -24,14 +24,14 @@ on:
env:
CLOSE_ISSUE_MESSAGE: >
This issue was closed because it has been stalled for 14 days with no activity.
This issue was closed because it has been stalled for 30 days with no activity.
Feel free to reopen if it is still relevant, or to ping a collaborator if you have any questions.
CLOSE_PR_MESSAGE: >
This PR was closed because it has been stalled for 21 days with no activity.
This PR was closed because it has been stalled for 30 days with no activity.
Feel free to reopen if it is still relevant, or to ping a collaborator if you have any questions.
WARN_ISSUE_MESSAGE: >
This issue has been automatically marked as stale because it has not had
recent activity (6 months). It will be closed if no further activity occurs.
recent activity (1 year). It will be closed if no further activity occurs.
Any change, comment or update to this issue will reset this count.
Thank you for your contributions.
WARN_PR_MESSAGE: >
@@ -59,10 +59,10 @@ jobs:
stale-pr-label: stale
exempt-issue-labels: never-stale
exempt-pr-labels: never-stale
days-before-issue-stale: 180
days-before-issue-close: 14
days-before-issue-stale: 365
days-before-issue-close: 30
days-before-pr-stale: 365
days-before-pr-close: 21
days-before-pr-close: 30
delete-branch: true
close-issue-message: ${{ env.CLOSE_ISSUE_MESSAGE }}
close-pr-message: ${{ env.CLOSE_PR_MESSAGE }}
+6
View File
@@ -178,3 +178,9 @@ test-smolvla-ete-eval:
--env.episode_length=5 \
--eval.n_episodes=1 \
--eval.batch_size=1
# E2E annotation pipeline smoke test against a tiny in-memory fixture
# dataset. Opt-in (not part of `make test-end-to-end`) and uses a stub VLM
# backend, so it does not require a real model checkpoint or GPU.
annotation-e2e:
uv run python -m tests.annotations.run_e2e_smoke
+4 -8
View File
@@ -39,7 +39,6 @@ from tqdm import tqdm
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.video_utils import (
VideoEncoderConfig,
decode_video_frames,
encode_video_frames,
)
@@ -252,13 +251,10 @@ def benchmark_encoding_decoding(
imgs_dir=imgs_dir,
video_path=video_path,
fps=fps,
camera_encoder_config=VideoEncoderConfig(
vcodec=encoding_cfg["vcodec"],
pix_fmt=encoding_cfg["pix_fmt"],
g=encoding_cfg.get("g"),
crf=encoding_cfg.get("crf"),
preset=encoding_cfg.get("preset"),
),
vcodec=encoding_cfg["vcodec"],
pix_fmt=encoding_cfg["pix_fmt"],
g=encoding_cfg.get("g"),
crf=encoding_cfg.get("crf"),
# fast_decode=encoding_cfg.get("fastdecode"),
overwrite=True,
)
+1 -1
View File
@@ -35,7 +35,7 @@ USER root
ARG ROBOTWIN_SHA=0aeea2d669c0f8516f4d5785f0aa33ba812c14b4
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
cuda-nvcc-12-4 cuda-cudart-dev-12-4 \
cuda-nvcc-12-6 cuda-cudart-dev-12-6 \
libvulkan1 vulkan-tools \
&& mkdir -p /usr/share/vulkan/icd.d \
&& echo '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.3.0"}}' \
+7 -11
View File
@@ -18,9 +18,8 @@
# docker build -f docker/Dockerfile.internal -t lerobot-internal .
# Configure the base image for CI with GPU access
# TODO(Steven): Bump these versions
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
ARG CUDA_VERSION=12.6.3
ARG OS_VERSION=24.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
# Define Python version argument
@@ -36,16 +35,13 @@ ENV DEBIAN_FRONTEND=noninteractive \
# Install Python, system dependencies, and uv (as root)
RUN apt-get update && apt-get install -y --no-install-recommends \
software-properties-common build-essential git curl \
libglib2.0-0 libgl1-mesa-glx libegl1-mesa ffmpeg \
build-essential git curl \
libglib2.0-0 libgl1 libegl1 ffmpeg \
libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
cmake pkg-config ninja-build \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-dev \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& mv /root/.local/bin/uv /usr/local/bin/uv \
&& useradd --create-home --shell /bin/bash user_lerobot \
+8 -4
View File
@@ -31,10 +31,12 @@
title: Porting Large Datasets
- local: using_dataset_tools
title: Using the Dataset Tools
- local: dataset_subtask
title: Using Subtasks in the Dataset
- local: video_encoding_parameters
title: Video encoding parameters
- local: language_and_recipes
title: Language Columns and Recipes
- local: tools
title: Tools
- local: annotation_pipeline
title: Annotation Pipeline
- local: streaming_video_encoding
title: Streaming Video Encoding
title: "Datasets"
@@ -49,6 +51,8 @@
title: π₀-FAST (Pi0Fast)
- local: pi05
title: π₀.₅ (Pi05)
- local: eo1
title: EO-1
- local: groot
title: NVIDIA GR00T N1.5
- local: xvla
+1 -1
View File
@@ -90,6 +90,6 @@ lerobot-record \
--dataset.single_task="Your task description" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--policy.path=${HF_USER}/act_policy
```
+176
View File
@@ -0,0 +1,176 @@
# Annotation Pipeline
`lerobot-annotate` populates the two language columns introduced by the
[Language Columns and Recipes](./language_and_recipes) page —
`language_persistent` and `language_events` — directly into
`data/chunk-*/file-*.parquet`.
## What the pipeline produces
Three modules write into a per-episode staging tree, then a single writer
rewrites the data shards in place:
| Style / atom | Column | Module |
| ------------------------------------------- | --------------------- | -------- |
| `subtask` (Pi0.7-style "how, not what") | `language_persistent` | Module 1 |
| `plan` (initial + refresh on interjection) | `language_persistent` | Module 1 |
| `memory` (MEM-style compression) | `language_persistent` | Module 1 |
| `task_aug` (rephrasings of canonical task) | `language_persistent` | Module 1 |
| `interjection` | `language_events` | Module 2 |
| speech tool-call atom (`style=null`, `say`) | `language_events` | Module 2 |
| `vqa` (user / assistant pair) | `language_events` | Module 3 |
The writer does **not** add a `tools` column to the parquet — the tool
catalog lives at `meta/info.json["tools"]` instead (see
[Tools](./tools)). After every annotation run the pipeline ensures the
canonical `say` schema is present in that list, preserving any tools the
user pre-declared.
If you want to declare additional tools for a dataset before annotation
runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
anything already there. Implementations of those tools live under
`src/lerobot/tools/`; one file per tool, registered via
`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
## Running locally
Install the extra and invoke the console script. Episode-level
concurrency comes from `--executor.episode_parallelism` (default 16);
that is the only knob the in-process executor exposes.
```bash
uv sync --extra annotations
uv run lerobot-annotate \
--root=/path/to/dataset \
--vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
```
The pipeline attaches actual camera footage to every Module 1/2/3 prompt
by default, decoded from the dataset's first `observation.images.*`
stream. Override with `--vlm.camera_key=observation.images.<name>` to
pin a specific viewpoint. Datasets with no video tracks fall back to
text-only prompts automatically.
**Module 1 sees the whole episode as one video block.** Subtask
decomposition gets a `{"type":"video", "video":[<frames>]}` block
covering the entire demonstration; Qwen-VL pools temporally on its own
and decides where to cut. There is no keyframe stride or count knob —
`--module_1.max_video_frames` (default 128) only caps the frames packed
into the video block as a model-capacity bound. Module 2 attaches a
short window of frames around the interjection timestamp; Module 3
attaches the exact emission frame to each VQA pair.
## Running on Hugging Face Jobs
Distributed annotation is delegated to
[Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs). The repo
ships a launcher script you copy and edit for your dataset:
```bash
HF_TOKEN=hf_... uv run python examples/annotation/run_hf_job.py
```
[`examples/annotation/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotation/run_hf_job.py)
spawns one `h200x2` job that:
1. installs the branch under test plus the annotation extras,
2. boots two vllm servers (one per GPU) for the chosen model,
3. runs Modules 1 / 2 / 3 across the dataset via `lerobot-annotate`,
4. uploads the annotated dataset to `--push_to_hub`.
To target a different dataset, model, or hub repo, edit the `CMD` block
inside the script — every flag in there maps directly onto a CLI flag of
`lerobot-annotate` (see `lerobot-annotate --help` for the full list).
## Style-to-recipe consumer mapping
The pipeline's outputs are designed to be consumed by recipes (see
[Language Columns and Recipes](./language_and_recipes)) — typically:
- low-level / high-level / memory-update branches consume
`subtask`/`plan`/`memory` from `language_persistent`.
- An interjection-response branch consumes `interjection` events plus
the paired speech atom (merged into one assistant target turn via
`tool_calls_from`) and the same-timestamp `plan` refresh.
- A VQA branch consumes the `(vqa, user)` and `(vqa, assistant)` pairs
from `language_events`.
## Why the design splits state from events
Two things drive the scope:
1. **Persistent state vs exact-event split.** Persistent rows
(`subtask`, `plan`, `memory`) broadcast per episode and answer "what
state is in force at this frame?". Event rows (`interjection`, `vqa`,
speech) only appear on the exact frame whose timestamp matches the
emission. The pipeline writes timestamps taken straight from the
source parquet — no floating-point recomputation.
2. **One Qwen-VL pass.** All three modules share a single VLM client
(vLLM if available, transformers fallback) so the cost is one model
load per dataset, not three.
## Module independence and staged reruns
Each module writes its raw output to
`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. That makes
prompt iteration cheap — re-running one module overwrites only its own
JSONL file before the writer composes the final parquet. Modules can be
disabled via `--module_1.enabled=false` (and similarly for 2 and 3) to
test them in isolation.
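For example, re-running only Module 2 against an already-staged dataset (flag
names follow the pattern above):
```bash
uv run lerobot-annotate \
  --root=/path/to/dataset \
  --module_1.enabled=false \
  --module_3.enabled=false
```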
## Validation/report checks before final write
Before the writer runs, `StagingValidator` checks:
- exact frame-timestamp alignment for every event row;
- no orphan speech / interjection pairs;
- `plan` is refreshed at every interjection timestamp;
- `memory` rows fall on subtask boundaries (warning, not error);
- VQA assistant `content` parses as JSON in one of the
bbox / keypoint / count / attribute / spatial shapes;
- every row routes to the column dictated by `column_for_style(style)`.
Errors abort the writer (`--skip_validation=true` overrides for debugging).
## Paper inspirations per module
- **Module 1 — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
atom granularity ("pick up one piece of lettuce", "place bowl to box");
Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07)) "how, not
what" detail.
- **Module 1 — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596))
compression directive: keep only minimal relevant information; functional
outcomes preserved, specific attributes dropped.
- **Module 2 — interjections.** Hi Robot scenario taxonomy: negative task,
situated correction, specific constraint, preference. Speech is a
tool-call-only atom (`tool_calls=[{type:function, function:{name:"say",
arguments:{text:...}}}]`).
- **Module 3 — VQA.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693))
grounded features (bounding boxes in pixel `[x_min, y_min, x_max, y_max]`,
keypoints) and Steerable VLA Policies ([Zhao 2025](https://arxiv.org/abs/2509.07626))
multi-abstraction grounding. Pi0.7 also grounds answers across
multiple abstraction levels.
Future maintainers should adjust the prompt templates in
`src/lerobot/annotations/steerable_pipeline/prompts/` against these
references rather than rewriting from scratch.
## Compute and list-size estimates
Per episode, the pipeline issues O(`max_steps`) Module 1 calls,
O(`max_interjections_per_episode`) Module 2 calls, and
O(`vqa_emission_hz × episode_seconds`) Module 3 calls. With defaults
(8 subtasks, 1 interjection, 1 Hz × 3 pairs) and 30-second episodes, that
is ~50 VLM calls per episode. `language_persistent` is at most a few tens of
KB per episode (parquet dictionary-encodes one entry per episode);
`language_events` is empty on most frames and is bounded by the number of
emissions, not `num_frames × num_emissions`.
## Reproducibility via seed and prompt hashes
`--seed` (default 1729) feeds the per-episode RNGs that select interjection
timestamps and VQA question types. Combined with the deterministic prompt
templates checked into `prompts/`, two runs at the same seed against the
same dataset and the same model checkpoint produce byte-identical staging
artifacts. Prompt edits are recorded by file hash; future tooling can pin
expected `(seed, prompt_hash)` pairs into the dataset card.
-277
View File
@@ -1,277 +0,0 @@
# Using Subtasks in LeRobot Datasets
Subtask support in robotics datasets has proven effective in improving robot reasoning and understanding. Subtasks are particularly useful for:
- **Hierarchical policies**: Building policies that include subtask predictions to visualize robot reasoning in real time
- **Reward modeling**: Helping reward models understand task progression (e.g., SARM-style stage-aware reward models)
- **Task decomposition**: Breaking down complex manipulation tasks into atomic, interpretable steps
LeRobotDataset now supports subtasks as part of its dataset structure, alongside tasks.
## What are Subtasks?
While a **task** describes the overall goal (e.g., "Pick up the apple and place it in the basket"), **subtasks** break down the execution into finer-grained steps:
1. "Approach the apple"
2. "Grasp the apple"
3. "Lift the apple"
4. "Move to basket"
5. "Release the apple"
Each frame in the dataset can be annotated with its corresponding subtask, enabling models to learn and predict these intermediate stages.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/subtask-asset.png"
alt="An overview of subtask annotation showing how frames are labeled with intermediate subtask stages"
width="80%"
/>
<p>
<em>Figure: Overview of subtask annotation.</em>
</p>
**Reference:** _Subtask-learning based for robot self-assembly in flexible collaborative assembly in manufacturing_, Original Article, Published: 19 April 2022.
## Dataset Structure
Subtask information is stored in the dataset metadata:
```
my-dataset/
├── data/
│ └── ...
├── meta/
│ ├── info.json
│ ├── stats.json
│ ├── tasks.parquet
│ ├── subtasks.parquet # Subtask index → subtask string mapping
│ └── episodes/
│ └── ...
└── videos/
└── ...
```
### Subtasks Parquet File
The `meta/subtasks.parquet` file maps subtask indices to their natural language descriptions:
| subtask_index | subtask (index column) |
| ------------- | ---------------------- |
| 0 | "Approach the apple" |
| 1 | "Grasp the apple" |
| 2 | "Lift the apple" |
| ... | ... |
### Frame-Level Annotations
Each frame in the dataset can include a `subtask_index` field that references the subtasks parquet file:
```python
# Example frame data in the parquet file
{
"index": 42,
"timestamp": 1.4,
"episode_index": 0,
"task_index": 0,
"subtask_index": 2, # References "Lift the apple"
"observation.state": [...],
"action": [...],
}
```
## Annotating Datasets with Subtasks
We provide a HuggingFace Space for easily annotating any LeRobotDataset with subtasks:
**[https://huggingface.co/spaces/lerobot/annotate](https://huggingface.co/spaces/lerobot/annotate)**
After completing your annotation:
1. Click "Push to Hub" to upload your annotated dataset
2. You can also run the annotation space locally by following the instructions at [github.com/huggingface/lerobot-annotate](https://github.com/huggingface/lerobot-annotate)
## Loading Datasets with Subtasks
When you load a dataset with subtask annotations, the subtask information is automatically available:
```python
from lerobot.datasets import LeRobotDataset
# Load a dataset with subtask annotations
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
# Access a sample
sample = dataset[100]
# The sample includes both task and subtask information
print(sample["task"]) # "Collect the fruit"
print(sample["subtask"]) # "Grasp the apple"
print(sample["task_index"]) # tensor(0)
print(sample["subtask_index"]) # tensor(2)
```
### Checking for Subtask Support
You can check if a dataset has subtask annotations:
```python
# Check if subtasks are available
has_subtasks = (
"subtask_index" in dataset.features
and dataset.meta.subtasks is not None
)
if has_subtasks:
print(f"Dataset has {len(dataset.meta.subtasks)} unique subtasks")
print("Subtasks:", list(dataset.meta.subtasks.index))
```
## Using Subtasks for Training
### With the Tokenizer Processor
The `TokenizerProcessor` automatically handles subtask tokenization for Vision-Language Action (VLA) models:
```python
from lerobot.processor import TokenizerProcessorStep
# Create a tokenizer processor step
tokenizer_processor = TokenizerProcessorStep(
tokenizer_name_or_path="google/paligemma-3b-pt-224",
padding="max_length",
max_length=64,
)
# The processor will automatically tokenize subtasks if present in the batch
# and add them to the observation under:
# - "observation.subtask.tokens"
# - "observation.subtask.attention_mask"
```
When subtasks are available in the batch, the tokenizer processor adds:
- `observation.subtask.tokens`: Tokenized subtask text
- `observation.subtask.attention_mask`: Attention mask for the subtask tokens
### DataLoader with Subtasks
```python
import torch
from lerobot.datasets import LeRobotDataset
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
dataloader = torch.utils.data.DataLoader(
dataset,
batch_size=16,
shuffle=True,
)
for batch in dataloader:
# Access subtask information in the batch
subtasks = batch["subtask"] # List of subtask strings
subtask_indices = batch["subtask_index"] # Tensor of subtask indices
# Use for training hierarchical policies or reward models
print(f"Batch subtasks: {set(subtasks)}")
```
## Example Datasets with Subtask Annotations
Try loading a dataset with subtask annotations:
```python
from lerobot.datasets import LeRobotDataset
# Example dataset with subtask annotations
dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
# Explore the subtasks
print("Available subtasks:")
for subtask_name in dataset.meta.subtasks.index:
print(f" - {subtask_name}")
# Get subtask distribution
subtask_counts = {}
for i in range(len(dataset)):
sample = dataset[i]
subtask = sample["subtask"]
subtask_counts[subtask] = subtask_counts.get(subtask, 0) + 1
print("\nSubtask distribution:")
for subtask, count in sorted(subtask_counts.items(), key=lambda x: -x[1]):
print(f" {subtask}: {count} frames")
```
## Use Cases
### 1. Hierarchical Policy Training
Train policies that predict both actions and current subtask:
```python
class HierarchicalPolicy(nn.Module):
def __init__(self, num_subtasks):
super().__init__()
self.action_head = nn.Linear(hidden_dim, action_dim)
self.subtask_head = nn.Linear(hidden_dim, num_subtasks)
def forward(self, observations):
features = self.encoder(observations)
actions = self.action_head(features)
subtask_logits = self.subtask_head(features)
return actions, subtask_logits
```
### 2. Stage-Aware Reward Modeling (SARM)
Build reward models that understand task progression:
```python
# SARM predicts:
# - Stage: Which subtask is being executed (discrete)
# - Progress: How far along the subtask (continuous 0-1)
class SARMRewardModel(nn.Module):
def forward(self, observations):
features = self.encoder(observations)
stage_logits = self.stage_classifier(features)
progress = self.progress_regressor(features)
return stage_logits, progress
```
### 3. Progress Visualization
Monitor robot execution by tracking subtask progression:
```python
def visualize_execution(model, observations):
for t, obs in enumerate(observations):
action, subtask_logits = model(obs)
predicted_subtask = subtask_names[subtask_logits.argmax()]
print(f"t={t}: Executing '{predicted_subtask}'")
```
## API Reference
### LeRobotDataset Properties
| Property | Type | Description |
| --------------------------- | ---------------------- | ------------------------------------------ |
| `meta.subtasks` | `pd.DataFrame \| None` | DataFrame mapping subtask names to indices |
| `features["subtask_index"]` | `dict` | Feature spec for subtask_index if present |
### Sample Keys
When subtasks are available, each sample includes:
| Key | Type | Description |
| --------------- | -------------- | ------------------------------------ |
| `subtask_index` | `torch.Tensor` | Integer index of the current subtask |
| `subtask` | `str` | Natural language subtask description |
## Related Resources
- [SARM Paper](https://arxiv.org/pdf/2509.25358) - Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
- [LeRobot Annotate Space](https://huggingface.co/spaces/lerobot/annotate) - Interactive annotation tool
- [LeRobotDataset v3.0](./lerobot-dataset-v3) - Dataset format documentation
+1 -1
View File
@@ -194,7 +194,7 @@ lerobot-record \
--dataset.single_task="Navigate around obstacles" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--display_data=true
```
+168
View File
@@ -0,0 +1,168 @@
# EO-1
EO-1 is a **Vision-Language-Action policy for robot control**. The LeRobot implementation integrates EO-1 with the standard LeRobot training, evaluation, and processor interfaces.
## Model Overview
EO-1 uses a Qwen2.5-VL backbone for vision-language understanding and adds a continuous flow-matching action head for robot control. The policy formats each robot-control sample as a multimodal conversation: camera images are passed to Qwen2.5-VL, the robot state is represented with EO-1 state tokens, and the future action chunk is represented with EO-1 action tokens.
<img
src="https://huggingface.co/datasets/HaomingSong/lerobot-documentation-images/resolve/main/lerobot/eo_pipeline.png"
alt="An overview of EO-1"
width="85%"
/>
During training, EO-1 learns to denoise continuous action chunks at the action-token positions. During inference, it samples an action chunk, returns continuous actions, and executes `n_action_steps` from the chunk before sampling again.
### What the LeRobot Integration Covers
- Standard `policy.type=eo1` configuration through LeRobot
- Qwen2.5-VL image and text preprocessing through policy processors
- Continuous flow-matching action prediction
- Checkpoint save/load through LeRobot policy APIs
- Training with `lerobot-train` and evaluation with `lerobot-eval`
The broader EO-1 project also includes interleaved vision-text-action pretraining and multimodal reasoning workflows. This page focuses on the LeRobot robot-control policy path.
## Installation Requirements
1. Install LeRobot by following the [Installation Guide](./installation).
2. Install EO-1 dependencies by running:
```bash
pip install -e ".[eo1]"
```
3. If you want to train or evaluate on LIBERO, install the LIBERO dependencies too:
```bash
pip install -e ".[eo1,libero]"
```
EO-1 can use the standard PyTorch scaled-dot-product attention backend through `policy.attn_implementation=sdpa`. If your environment has a compatible `flash_attn` installation, you can request `policy.attn_implementation=flash_attention_2`.
## Data Requirements
EO-1 expects a LeRobot dataset with:
- At least one visual observation, for example `observation.images.image`
- `observation.state`
- `action`
- A language task instruction through the dataset `task` field
If your dataset uses different observation names, use `rename_map` to align them with the names expected by your training or evaluation setup.
## Usage
To use EO-1 in a LeRobot configuration, specify the policy type as:
```python
policy.type=eo1
```
By default, a new EO-1 policy initializes its backbone from:
```python
policy.vlm_base=Qwen/Qwen2.5-VL-3B-Instruct
```
Once a LeRobot-format EO-1 checkpoint is available, load it with:
```python
policy.path=your-org/your-eo1-checkpoint
```
## Training
### Training Command Example
```bash
lerobot-train \
--dataset.repo_id=your_org/your_dataset \
--policy.type=eo1 \
--policy.vlm_base=Qwen/Qwen2.5-VL-3B-Instruct \
--policy.dtype=bfloat16 \
--policy.attn_implementation=sdpa \
--policy.gradient_checkpointing=false \
--output_dir=./outputs/eo1_training \
--job_name=eo1_training \
--steps=300000 \
--batch_size=16 \
--policy.device=cuda
```
### Key Training Parameters
| Parameter | Default | Description |
| -------------------------------------- | ----------------------------- | ----------------------------------------------------------------------- |
| `policy.vlm_base` | `Qwen/Qwen2.5-VL-3B-Instruct` | Qwen2.5-VL checkpoint used to initialize a new policy |
| `policy.dtype` | `auto` | Backbone dtype request: `auto`, `bfloat16`, or `float32` |
| `policy.attn_implementation` | `None` | Optional Qwen attention backend, such as `sdpa` |
| `policy.gradient_checkpointing` | `false` | Reduces memory usage during training |
| `policy.chunk_size` | `8` | Number of future actions predicted per chunk |
| `policy.n_action_steps` | `8` | Number of actions consumed from a sampled chunk |
| `policy.num_denoise_steps` | `10` | Number of flow-matching denoising steps used during sampling |
| `policy.max_state_dim` | `32` | State padding dimension |
| `policy.max_action_dim` | `32` | Action padding dimension |
| `policy.force_fp32_autocast` | `true` | Keeps the flow head in fp32 even when the backbone uses mixed precision |
| `policy.supervise_padding_action_dims` | `true` | Controls whether padded action dimensions are supervised |
| `policy.supervise_padding_actions` | `true` | Controls whether padded future action rows are supervised |
## Evaluation
EO-1 can be evaluated through `lerobot-eval` once you have a LeRobot-format checkpoint:
```bash
lerobot-eval \
--policy.path=your-org/your-eo1-checkpoint \
--env.type=libero \
--env.task=libero_object \
--eval.batch_size=1 \
--eval.n_episodes=20
```
For datasets or environments whose camera names differ from the checkpoint configuration, pass a `rename_map`:
```bash
lerobot-eval \
--policy.path=your-org/your-eo1-checkpoint \
--env.type=libero \
--env.task=libero_object \
--rename_map='{"observation.images.image2":"observation.images.wrist_image"}'
```
## Configuration Notes
### Image Processing
EO-1 uses the Qwen2.5-VL processor. The `policy.image_min_pixels` and `policy.image_max_pixels` settings control the image resizing bounds before the visual tokens are passed into the backbone.
### State and Action Dimensions
The policy pads state and action vectors to `policy.max_state_dim` and `policy.max_action_dim` before the EO-1 flow head. Predictions are cropped back to the original action dimension before being returned by the policy.
### Attention Backend
Use `policy.attn_implementation=sdpa` for a portable setup. Use `flash_attention_2` only when `flash_attn` is installed and compatible with your environment.
## References
- [EO-1 project](https://github.com/EO-Robotics/EO1)
- [EO-1 paper](https://arxiv.org/abs/2508.21112)
- [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
## Citation
```bibtex
@article{eo1,
title={EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control},
author={Delin Qu and Haoming Song and Qizhi Chen and Zhaoqing Chen and Xianqiang Gao and Xinyi Ye and Qi Lv and Modi Shi and Guanghui Ren and Cheng Ruan and Maoqing Yao and Haoran Yang and Jiacheng Bao and Bin Zhao and Dong Wang},
journal={arXiv preprint},
year={2025},
url={https://arxiv.org/abs/2508.21112}
}
```
## License
This LeRobot integration follows the **Apache 2.0 License** used by LeRobot. Check the upstream EO-1 model and dataset pages for the licenses of released EO-1 checkpoints and data.
+1 -1
View File
@@ -123,7 +123,7 @@ lerobot-record \
--dataset.single_task="Grab and handover the red cube to the other arm" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--policy.path=<user>/groot-bimanual \ # your trained model
--dataset.episode_time_s=30 \
--dataset.reset_time_s=10
+2 -2
View File
@@ -232,7 +232,7 @@ lerobot-record \
--dataset.private=true \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--display_data=true
```
@@ -278,6 +278,6 @@ lerobot-record \
--dataset.num_episodes=10 \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--policy.path=outputs/train/hopejr_hand/checkpoints/last/pretrained_model
```
+1 -1
View File
@@ -193,7 +193,7 @@ lerobot-record \
--dataset.num_episodes=5 \
--dataset.single_task="Grab the black cube" \
--dataset.streaming_encoding=true \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--dataset.encoder_threads=2
```
</hfoption>
+147
View File
@@ -0,0 +1,147 @@
# Language columns and recipes
Most LeRobot datasets ship with a single `task` string per episode — fine for
short, single-instruction skills, but not enough for the longer-horizon,
multi-modal robot policies the field is moving toward (high-level planning,
memory, interjections, VQA, tool use). To support those policies without
forking the dataset format, LeRobot extends `LeRobotDataset` with two optional
language columns and a small recipe layer that turns those rows into
chat-style training samples on the fly.
The design splits cleanly into three layers:
1. **Data in the dataset** — language annotations stored next to frames in
`data/chunk-*/file-*.parquet` as two optional columns (`language_persistent`
and `language_events`). Datasets without these columns keep their existing
behavior.
2. **Recipe** — a YAML file that declares which annotation rows to bind and
how to lay them out as chat turns (`role`, `content`, optional images,
optional tool calls). Recipes are pure config; no Python required to add a
new one.
3. **Training format** — at sample time, `RenderMessagesStep` resolves the
recipe against the per-frame annotations and emits HF-style `messages` plus
LeRobot-specific sidecars (`message_streams`, `target_message_indices`)
that policy processors consume.
This page describes each layer in turn.
## Layer 1 — language columns in the dataset
The two optional columns live next to frame data in
`data/chunk-*/file-*.parquet`:
- `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`.
- `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls.
Both columns share the same row shape (event rows omit `timestamp` because the
frame the row sits on already provides it):
```text
role: string
content: string | null
style: string | null
timestamp: float64 # persistent rows only
camera: string | null # observation.images.* feature key, view-dependent rows only
tool_calls: list[Json] | null
```
The `camera` field tags rows whose `content` is grounded in a specific camera
view. Rows of view-dependent styles (`vqa` and `trace`) MUST set `camera` to
the matching `observation.images.*` feature key. Rows of every other style —
including `motion`, which describes robot-frame primitives in joint / Cartesian
terms — MUST leave `camera` as `null`. Pipeline writers and the validator
enforce this via `validate_camera_field(style, camera)`.
`meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations.
### Architecture
The language stack itself has three internal modules backing layer 1:
1. `lerobot.datasets.language` defines the schema, style registry, and `column_for_style`.
2. `lerobot.datasets.language_render` resolves rows and renders messages.
3. `RenderMessagesStep` turns dataset samples into `messages`, `message_streams`, and `target_message_indices`.
`LeRobotDataset` stays recipe-agnostic. It passes `language_persistent` and `language_events` through when present, and unannotated datasets keep their existing behavior.
### Temporal semantics
Persistent styles are active after emission until replaced:
- `active_at(t, style=subtask)`
- `nth_prev(style=memory, offset=1)`
- `nth_next(style=subtask, offset=1)`
Event styles only exist on their exact timestamp:
- `emitted_at(t, style=interjection)`
- `emitted_at(t, style=vqa, role=user, camera=observation.images.top)`
- `emitted_at(t, role=assistant, tool_name=say)`
Exact event matching has no tolerance window, so writers must stamp event rows with frame timestamps from the parquet data.
### View-dependent resolution
For view-dependent styles (`vqa` and `trace`), the resolver gains a
`camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple
cameras typically emit one (`vqa`, `user`) + (`vqa`, `assistant`) pair per
camera at the same timestamp; without `camera=`, those resolvers see two
matches and raise an ambiguity error. Recipes consume each camera through its
own binding plus a matching image block, e.g.
```yaml
ask_vqa_top:
bindings:
vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
messages:
- role: user
stream: high_level
if_present: vqa_query
content:
- { type: image, feature: observation.images.top }
- { type: text, text: "${vqa_query}" }
- {
role: assistant,
content: "${vqa}",
stream: high_level,
target: true,
if_present: vqa,
}
```
Add one such sub-recipe per camera the dataset records.
## Layer 2 — recipe anatomy
Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`. They
declare which annotation rows to pull (via `bindings`) and how to compose them
into chat turns (`messages`).
```yaml
messages:
- { role: user, content: "${task}", stream: high_level }
- { role: assistant, content: "${subtask}", stream: low_level, target: true }
```
A recipe can also branch into a weighted **blend** of sub-recipes. At sample
time, exactly one branch is selected deterministically from the sample index,
so different frames train different objectives (e.g. memory updates vs.
low-level execution vs. VQA) without any Python wiring.
## Layer 3 — training format
Rendered samples use HF-style chat messages plus LeRobot sidecars:
```python
sample["messages"]
sample["message_streams"]
sample["target_message_indices"]
```
The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages.
## Graceful absence
If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
If an event-scoped branch is selected on a frame without the required event row, rendering returns `None`, allowing a loader to retry another sample.
+1 -1
View File
@@ -43,7 +43,7 @@ lerobot-record \
--dataset.num_episodes=5 \
--dataset.single_task="Grab the black cube" \
--dataset.streaming_encoding=true \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--dataset.encoder_threads=2
```
+2 -2
View File
@@ -161,7 +161,7 @@ lerobot-record \
--dataset.private=true \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--display_data=true
```
@@ -203,7 +203,7 @@ lerobot-record \
--dataset.private=true \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
--display_data=true
```
+1 -1
View File
@@ -108,7 +108,7 @@ lerobot-record \
--dataset.num_episodes=10 \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
# --dataset.camera_encoder_config.vcodec=auto \
# --dataset.vcodec=auto \
# <- Teleop optional if you want to teleoperate in between episodes \
# --teleop.type=so100_leader \
# --teleop.port=/dev/ttyACM0 \
+26 -26
View File
@@ -14,12 +14,12 @@ This makes `save_episode()` near-instant (the video is already encoded by the ti
## 2. Tuning Parameters
| Parameter | CLI Flag | Type | Default | Description |
| ----------------------- | ---------------------------------------- | ------------- | ------------- | ----------------------------------------------------------------- |
| `streaming_encoding` | `--dataset.streaming_encoding` | `bool` | `True` | Enable real-time encoding during capture |
| `vcodec` | `--dataset.camera_encoder_config.vcodec` | `str` | `"libsvtav1"` | Video codec. `"auto"` detects best HW encoder |
| `encoder_threads` | `--dataset.encoder_threads` | `int \| None` | `None` (auto) | Threads per encoder instance. `None` lets the codec decide |
| `encoder_queue_maxsize` | `--dataset.encoder_queue_maxsize` | `int` | `60` | Max buffered frames per camera (~2s at 30fps). Consumes RAM |
| Parameter | CLI Flag | Type | Default | Description |
| ----------------------- | --------------------------------- | ------------- | ------------- | ----------------------------------------------------------------- |
| `streaming_encoding` | `--dataset.streaming_encoding` | `bool` | `True` | Enable real-time encoding during capture |
| `vcodec` | `--dataset.vcodec` | `str` | `"libsvtav1"` | Video codec. `"auto"` detects best HW encoder |
| `encoder_threads` | `--dataset.encoder_threads` | `int \| None` | `None` (auto) | Threads per encoder instance. `None` lets the codec decide |
| `encoder_queue_maxsize` | `--dataset.encoder_queue_maxsize` | `int` | `60` | Max buffered frames per camera (~2s at 30fps). Consumes RAM |
## 3. Performance Considerations
@@ -48,7 +48,7 @@ This parameter controls how many threads each encoder instance uses internally:
### Backpressure and Frame Dropping
Each camera has a bounded queue (`encoder_queue_maxsize`, default 30 frames). When the encoder can't keep up:
Each camera has a bounded queue (`encoder_queue_maxsize`, default 60 frames). When the encoder can't keep up:
1. The queue fills up (consuming RAM)
2. New frames are **dropped** (not blocked) — the capture loop continues uninterrupted
@@ -82,15 +82,15 @@ Use HW encoding when:
### Available HW Encoders
| Encoder | Platform | Hardware | CLI Value |
| ------------------- | ------------- | ------------------------------------------------------------------------------------------------ | ---------------------------------------------------------- |
| `h264_videotoolbox` | macOS | Apple Silicon / Intel | `--dataset.camera_encoder_config.vcodec=h264_videotoolbox` |
| `hevc_videotoolbox` | macOS | Apple Silicon / Intel | `--dataset.camera_encoder_config.vcodec=hevc_videotoolbox` |
| `h264_nvenc` | Linux/Windows | NVIDIA GPU | `--dataset.camera_encoder_config.vcodec=h264_nvenc` |
| `hevc_nvenc` | Linux/Windows | NVIDIA GPU | `--dataset.camera_encoder_config.vcodec=hevc_nvenc` |
| `h264_vaapi` | Linux | Intel/AMD GPU | `--dataset.camera_encoder_config.vcodec=h264_vaapi` |
| `h264_qsv` | Linux/Windows | Intel Quick Sync | `--dataset.camera_encoder_config.vcodec=h264_qsv` |
| `auto` | Any | Probes the system for available HW encoders. Falls back to `libsvtav1` if no HW encoder is found | `--dataset.camera_encoder_config.vcodec=auto` |
| Encoder | Platform | Hardware | CLI Value |
| ------------------- | ------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------ |
| `h264_videotoolbox` | macOS | Apple Silicon / Intel | `--dataset.vcodec=h264_videotoolbox` |
| `hevc_videotoolbox` | macOS | Apple Silicon / Intel | `--dataset.vcodec=hevc_videotoolbox` |
| `h264_nvenc` | Linux/Windows | NVIDIA GPU | `--dataset.vcodec=h264_nvenc` |
| `hevc_nvenc` | Linux/Windows | NVIDIA GPU | `--dataset.vcodec=hevc_nvenc` |
| `h264_vaapi` | Linux | Intel/AMD GPU | `--dataset.vcodec=h264_vaapi` |
| `h264_qsv` | Linux/Windows | Intel Quick Sync | `--dataset.vcodec=h264_qsv` |
| `auto` | Any | Probes the system for available HW encoders. Falls back to `libsvtav1` if no HW encoder is found | `--dataset.vcodec=auto` |
> [!NOTE]
> In order to use the HW accelerated encoders you might need to upgrade your GPU drivers.
@@ -100,15 +100,15 @@ Use HW encoding when:
## 5. Troubleshooting
| Symptom | Likely Cause | Fix |
| ------------------------------------------------------------------ | -------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| System freezes or choppy robot movement or Rerun visualization lag | CPU starved (100% load usage) | Close other apps, reduce encoding throughput, lower `encoder_threads`, use `h264`, use `display_data=False`. If the CPU continues to be at 100% then it might be insufficient for your setup, consider `--dataset.streaming_encoding=false` or HW encoding (`--dataset.camera_encoder_config.vcodec=auto`) |
| "Encoder queue full" warnings or dropped frames in dataset | Encoder can't keep up (Queue overflow) | If CPU is not at 100%: Increase `encoder_threads`, increase `encoder_queue_maxsize` or use HW encoding (`--dataset.camera_encoder_config.vcodec=auto`). |
| High RAM usage | Queue filling faster than encoding | `encoder_threads` too low or CPU insufficient. Reduce `encoder_queue_maxsize` or use HW encoding |
| Large video files | Using HW encoder or H.264 | Expected trade-off. Switch to `libsvtav1` if CPU allows |
| `save_episode()` still slow | `streaming_encoding` is `False` | Set `--dataset.streaming_encoding=true` |
| Encoder thread crash | Codec not available or invalid settings | Check `vcodec` is installed, try `--dataset.camera_encoder_config.vcodec=auto` |
| Recorded dataset is missing frames | CPU/GPU starvation or occasional load spikes | If ~5% of frames are missing, your system is likely overloaded — follow the recommendations above. If fewer frames are missing (~2%), they are probably due to occasional transient load spikes (often at startup) and can be considered expected. |
| Symptom | Likely Cause | Fix |
| ------------------------------------------------------------------ | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| System freezes or choppy robot movement or Rerun visualization lag | CPU starved (100% load usage) | Close other apps, reduce encoding throughput, lower `encoder_threads`, use `h264`, use `display_data=False`. If the CPU continues to be at 100% then it might be insufficient for your setup, consider `--dataset.streaming_encoding=false` or HW encoding (`--dataset.vcodec=auto`) |
| "Encoder queue full" warnings or dropped frames in dataset | Encoder can't keep up (Queue overflow) | If CPU is not at 100%: Increase `encoder_threads`, increase `encoder_queue_maxsize` or use HW encoding (`--dataset.vcodec=auto`). |
| High RAM usage | Queue filling faster than encoding | `encoder_threads` too low or CPU insufficient. Reduce `encoder_queue_maxsize` or use HW encoding |
| Large video files | Using HW encoder or H.264 | Expected trade-off. Switch to `libsvtav1` if CPU allows |
| `save_episode()` still slow | `streaming_encoding` is `False` | Set `--dataset.streaming_encoding=true` |
| Encoder thread crash | Codec not available or invalid settings | Check `vcodec` is installed, try `--dataset.vcodec=auto` |
| Recorded dataset is missing frames | CPU/GPU starvation or occasional load spikes | If ~5% of frames are missing, your system is likely overloaded — follow the recommendations above. If fewer frames are missing (~2%), they are probably due to occasional transient load spikes (often at startup) and can be considered expected. |
## 6. Recommended Configurations
@@ -146,7 +146,7 @@ On very constrained systems, streaming encoding may compete too heavily with the
# 2 cams x 640x480x3 @30fps: Requires some tuning.
# Use H.264, disable streaming, consider batching encoding
lerobot-record --dataset.camera_encoder_config.vcodec=h264 --dataset.streaming_encoding=false ...
lerobot-record --dataset.vcodec=h264 --dataset.streaming_encoding=false ...
```
## 7. Closing note
+210
View File
@@ -0,0 +1,210 @@
# Tools
LeRobot v3.1 supports **tool calls** in policies — assistant messages can
emit structured invocations like `say(text="OK, starting now")` that the
runtime dispatches to a real implementation (TTS, controller, logger, …).
This page covers:
1. Where the tool catalog lives.
2. How the annotation pipeline produces tool-call atoms.
3. How to add your own tool.
## Where tools are declared
Two layers.
**The catalog** — a list of OpenAI-style function schemas — lives at
`meta/info.json["tools"]` on each dataset. Example:
```json
{
"features": { "...": "..." },
"tools": [
{
"type": "function",
"function": {
"name": "say",
"description": "Speak a short utterance to the user via the TTS executor.",
"parameters": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "The verbatim text to speak."
}
},
"required": ["text"]
}
}
}
]
}
```
Read it via the dataset metadata accessor:
```python
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
meta = LeRobotDatasetMetadata(repo_id="pepijn/super_poulain_final_annotations")
tools = meta.tools # list[dict] — OpenAI tool schemas
```
If the dataset's `info.json` doesn't declare any tools, `meta.tools`
returns `DEFAULT_TOOLS` from `lerobot.datasets.language` — currently a
single-entry list with the canonical `say` schema. So unannotated
datasets and chat-template consumers keep working without any
configuration:
```python
prompt_str = tokenizer.apply_chat_template(
sample["messages"],
tools=meta.tools, # works either way
add_generation_prompt=False,
tokenize=False,
)
```
**The implementations** — runnable Python — will live under
`src/lerobot/tools/`, one file per tool. The runtime dispatcher and
the canonical `say` implementation (wrapping Kyutai's pocket-tts) are
not part of the catalog layer described here; today this layer ships
only the schema storage and the `DEFAULT_TOOLS` fallback constant.
## Per-row tool _invocations_
The catalog above describes _what can be called_. The actual _call_ — the
function name plus the argument values — is stored per-row, on the
assistant atoms in `language_events`:
```python
{
"role": "assistant",
"content": null,
"style": null,
"timestamp": 12.4,
"camera": null,
"tool_calls": [
{ "type": "function",
"function": { "name": "say", "arguments": { "text": "On it." } } }
]
}
```
Recipes splice these into rendered messages via `tool_calls_from`:
```yaml
user_interjection_response:
bindings:
speech: "emitted_at(t, role=assistant, tool_name=say)"
messages:
- { role: user, content: "${task}", stream: high_level }
- {
role: assistant,
content: "${current_plan}",
stream: high_level,
target: true,
tool_calls_from: speech,
}
```
The model's training target is one assistant turn that carries both the
plan text _and_ the `say` tool call. At inference, the runtime parses
the generated text back into structured `tool_calls` and dispatches to
the matching implementation.
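Concretely, the rendered training target for the recipe above is a single assistant turn of roughly this shape (the plan text is a made-up example):
```python
assistant_turn = {
    "role": "assistant",
    "content": "Pick up the red cube, then place it in the bin.",  # ${current_plan}
    "tool_calls": [
        {
            "type": "function",
            "function": {"name": "say", "arguments": {"text": "On it."}},
        }
    ],
}
```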
## How to add your own tool
> **Note:** Steps 2 and 3 below describe the runtime layer
> (`src/lerobot/tools/`, the `Tool` protocol, `TOOL_REGISTRY`,
> `get_tools(meta)`) which is not part of the catalog layer shipped
> today — those modules don't yet exist in the tree. Step 1 alone is
> enough to make the tool visible to the chat template via
> `meta.tools` so the model can learn to _generate_ the call;
> executing the call at inference requires the runtime layer.
Three steps. Concrete example: a `record_observation` tool the policy
can call to capture an extra observation outside the regular control
loop.
### Step 1 — declare the schema
Add an entry under `meta/info.json["tools"]`. Either edit the file
directly on disk _before_ running the annotation pipeline (it'll be
preserved) or hand it to `lerobot-annotate` via a config flag.
```json
{
"tools": [
{ "type": "function", "function": { "name": "say", "...": "..." } },
{
"type": "function",
"function": {
"name": "record_observation",
"description": "Capture a high-resolution still image for the user.",
"parameters": {
"type": "object",
"properties": {
"label": {
"type": "string",
"description": "Short label for the saved image."
}
},
"required": ["label"]
}
}
}
]
}
```
The schema follows OpenAI's function-calling convention exactly, so the
chat template can render it natively.
### Step 2 — implement the call
Create `src/lerobot/tools/record_observation.py`:
```python
from .base import Tool
from typing import Any
RECORD_OBSERVATION_SCHEMA: dict[str, Any] = { "...": "..." } # mirrors the JSON above
class RecordObservationTool:
name = "record_observation"
schema = RECORD_OBSERVATION_SCHEMA
    def __init__(self, schema: dict | None = None, output_dir: str = "."):
        if schema is not None:
            self.schema = schema  # let the dataset's copy of the schema override the module default
        self.output_dir = output_dir
def call(self, arguments: dict) -> str:
label = arguments["label"]
# ... save the latest camera frame to <output_dir>/<label>.png ...
return f"saved {label}.png"
```
One file per tool keeps dependencies isolated — `record_observation`
might pull `pillow`, while `say` pulls `pocket-tts`. Users installing
only the tools they need avoid heavy transitive deps.
### Step 3 — register it
Add to `src/lerobot/tools/registry.py`:
```python
from .record_observation import RecordObservationTool
TOOL_REGISTRY["record_observation"] = RecordObservationTool
```
That's it. At runtime `get_tools(meta)` looks up each schema in
`meta.tools`, instantiates the matching registered class, and returns
a name → instance dict the dispatcher can route into.
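Since that runtime layer is not in the tree yet, the lookup above is only a contract. A rough sketch of what it could look like, with names mirroring Steps 2 and 3 (all assumptions):
```python
# Assumed shape of the future runtime lookup; these modules do not exist yet.
TOOL_REGISTRY: dict[str, type] = {}  # filled by src/lerobot/tools/registry.py (Step 3)

def get_tools(meta) -> dict[str, object]:
    tools: dict[str, object] = {}
    for schema in meta.tools:  # OpenAI-style schemas from meta/info.json
        name = schema["function"]["name"]
        tool_cls = TOOL_REGISTRY.get(name)
        if tool_cls is None:
            continue  # schema-only tool: the model can emit the call, nothing executes it
        tools[name] = tool_cls(schema=schema)
    return tools
```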
If you want to use a tool _without_ writing an implementation (e.g. for
training-time chat-template formatting only), step 1 alone is enough —
the model still learns to _generate_ the call. Steps 2 and 3 are only
needed to actually _execute_ it at inference.
+9 -5
View File
@@ -117,10 +117,10 @@ lerobot-edit-dataset \
--repo_id lerobot/pusht_image \
--operation.type convert_image_to_video \
--operation.output_dir outputs/pusht_video \
--operation.camera_encoder_config.vcodec libsvtav1 \
--operation.camera_encoder_config.pix_fmt yuv420p \
--operation.camera_encoder_config.g 2 \
--operation.camera_encoder_config.crf 30
--operation.vcodec libsvtav1 \
--operation.pix_fmt yuv420p \
--operation.g 2 \
--operation.crf 30
# Convert only specific episodes
lerobot-edit-dataset \
@@ -147,7 +147,11 @@ lerobot-edit-dataset \
**Parameters:**
- `output_dir`: Custom output directory (optional - by default uses `new_repo_id` or `{repo_id}_video`)
- `camera_encoder_config`: Video encoder settings — all sub-fields accessible via `--operation.camera_encoder_config.<field>`. See [Video Encoding Parameters](./video_encoding_parameters) for more details.
- `vcodec`: Video codec to use - options: `h264`, `hevc`, `libsvtav1` (default: `libsvtav1`)
- `pix_fmt`: Pixel format - options: `yuv420p`, `yuv444p` (default: `yuv420p`)
- `g`: Group of pictures (GOP) size - lower values give better quality but larger files (default: 2)
- `crf`: Constant rate factor - lower values give better quality but larger files, 0 is lossless (default: 30)
- `fast_decode`: Fast decode tuning option (default: 0)
- `episode_indices`: List of specific episodes to convert (default: all episodes)
- `num_workers`: Number of parallel workers for processing (default: 4)
-81
View File
@@ -1,81 +0,0 @@
# Video encoding parameters
When **video storage** is on, LeRobot stores each camera stream as an **MP4** file rather than saving **every timestep as its own image file**. **Video encoding compresses across time**, which usually cuts **dataset size and I/O** compared to heaps of PNGs, and MP4 stays a **familiar format** for players and loaders. Encoding frames into an MP4 file is a full FFmpeg pipeline: choice of encoder, pixel format, GOP/keyframes, quality vs speed, and
optional extra encoder flags. **Many of those knobs are user-tunable** and are exposed on the dataset config as
**`dataset.camera_encoder_config`** — a nested **`VideoEncoderConfig`**
(`lerobot.datasets.video_utils.VideoEncoderConfig`) passed through **PyAV**.
You can set these parameters from the CLI with **`--dataset.camera_encoder_config.<field>`** (e.g. `lerobot-record`, `lerobot-rollout`). The same block applies to **every** camera video stream in that run. **Video storage must be on** — **`use_videos=True`** in Python APIs or **`--dataset.video=true`** (recording default); with video off, inputs stay as images and **`camera_encoder_config` is ignored.**
For **when** frames are written vs encoded (streaming vs post-episode), queues, and other top-level **`--dataset.*`** switches, see [Streaming Video Encoding](./streaming_video_encoding). For codec/size/speed experiments, see the [video-benchmark Space](https://huggingface.co/spaces/lerobot/video-benchmark).
---
## Tuning Parameters
| Parameter | CLI flag | Type | Default | Description |
| --------------- | ----------------------------------------------- | -------------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vcodec` | `--dataset.camera_encoder_config.vcodec` | `str` | `"libsvtav1"` | Video codec name. `"auto"` picks the first available hardware encoder from a fixed preference list, else `libsvtav1`. |
| `pix_fmt` | `--dataset.camera_encoder_config.pix_fmt` | `str` | `"yuv420p"` | Output pixel format; must be supported by the specified codec in your FFmpeg build. |
| `g` | `--dataset.camera_encoder_config.g` | `int \| None` | `2` | GOP size (keyframes every `g` frames). Emitted as FFmpeg option `g`. |
| `crf` | `--dataset.camera_encoder_config.crf` | `int \| None` | `30` | Abstract **quality**; mapped per codec in the table below (CRF, QP, `q:v`, etc.). Lower → higher quality / larger output where the mapping is monotone. |
| `preset` | `--dataset.camera_encoder_config.preset` | `int \| str \| None` | `12`\* | Video encoding speed preset; meaning depends on the specified codec. \*Unset + `libsvtav1` → LeRobot sets `12`. |
| `fast_decode` | `--dataset.camera_encoder_config.fast_decode` | `int` | `0` | `libsvtav1`: `0–2` passed in `svtav1-params`; `h264` / `hevc` (software): if `>0`, sets `tune=fastdecode`; other codecs: often unused. |
| `video_backend` | `--dataset.camera_encoder_config.video_backend` | `str` | `"pyav"` | Only `"pyav"` is implemented for video encoding today. |
| `extra_options` | (nested config / non-scalar) | `dict` | `{}` | Extra FFmpeg options merged after the built-in mapping; **cannot** override keys already set from structured fields above. |
---
## Validation
| What | Behavior |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Video codec presence | `vcodec` must exist as a video encoder in the local FFmpeg build (after resolving `"auto"`). |
| Pixel format | `pix_fmt` is checked against the encoder's reported pixel formats when available. |
| Options | `get_codec_options()` output (including values originating from `extra_options`) is checked against PyAV/FFmpeg option metadata (ranges, integer constraints, string choices) where applicable. |
---
## Mapping: `VideoEncoderConfig` → FFmpeg options
From **`get_codec_options()`** after `vcodec` resolution. Only fields on `camera_encoder_config` are listed here (no global thread / queue flags).
| Resolved `vcodec` | `g` | Quality from `crf` | `preset` | `fast_decode` |
| ---------------------------------------- | --- | --------------------------- | -------- | ------------------------------------------ |
| `libsvtav1` | `g` | `crf` | `preset` | `svtav1-params` includes `fast-decode=0…2` |
| `h264`, `hevc` (software) | `g` | `crf` | `preset` | `tune=fastdecode` if `fast_decode > 0` |
| `h264_videotoolbox`, `hevc_videotoolbox` | `g` | `q:v` (derived from `crf`) | — | — |
| `h264_nvenc`, `hevc_nvenc` | `g` | `rc=constqp` + `qp` ← `crf` | `preset` | — |
| `h264_vaapi` | `g` | `qp` ← `crf` | — | — |
| `h264_qsv` | `g` | `global_quality` ← `crf` | `preset` | — |
---
## `extra_options`
- Merged **after** structured options; keys **already** set by `g`, `crf`, `preset`, etc. are **not** replaced by `extra_options`.
- Values are strings or numbers as FFmpeg expects; numeric values are validated when the codec exposes option metadata.
---
## Example
```bash
lerobot-record \
--robot.type=so100_follower \
--robot.port=/dev/tty.usbmodem58760431541 \
--robot.cameras="{laptop: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--robot.id=black \
--teleop.type=so100_leader \
--teleop.port=/dev/tty.usbmodem58760431551 \
--teleop.id=blue \
--dataset.repo_id=<my_username>/<my_dataset_name> \
--dataset.num_episodes=2 \
--dataset.single_task="Grab the cube" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
--dataset.camera_encoder_config.vcodec=h264 \
--dataset.camera_encoder_config.preset=fast \
--dataset.camera_encoder_config.extra_options={"tune": "film", "profile:v": "high", "bf": 2} \
--display_data=true
```
+68
View File
@@ -0,0 +1,68 @@
#!/usr/bin/env python
"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6 MoE).
Spawns one ``h200x2`` job that:
1. installs this branch of ``lerobot`` plus the annotation extras,
2. boots two vllm servers (one per GPU) with Qwen3.6-35B-A3B-FP8,
3. runs Module 1/2/3 across the dataset (per-camera VQA via PR 3471),
4. uploads the annotated dataset to ``--push_to_hub``.
Usage:
HF_TOKEN=hf_... uv run python examples/annotation/run_hf_job.py
Adjust ``CMD`` below to point at your own dataset / target hub repo.
"""
import os
from huggingface_hub import get_token, run_job
token = os.environ.get("HF_TOKEN") or get_token()
if not token:
raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
CMD = (
"apt-get update -qq && apt-get install -y -qq git ffmpeg && "
"pip install --no-deps "
"'lerobot @ git+https://github.com/huggingface/lerobot.git@feat/language-annotation-pipeline' && "
"pip install --upgrade-strategy only-if-needed "
# Mirror lerobot's [annotations] runtime deps. ``openai`` is required
# because ``VlmConfig.backend`` defaults to ``"openai"`` (which talks
# to a vllm/transformers/ktransformers OpenAI-compatible server).
"datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include "
"toml typing-inspect openai && "
"export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 && "
"export VLLM_VIDEO_BACKEND=pyav && "
"lerobot-annotate "
"--repo_id=imstevenpmwork/super_poulain_draft "
"--vlm.backend=openai "
"--vlm.model_id=Qwen/Qwen3.6-35B-A3B-FP8 "
"--vlm.parallel_servers=2 "
"--vlm.num_gpus=2 "
'--vlm.serve_command="vllm serve Qwen/Qwen3.6-35B-A3B-FP8 '
"--tensor-parallel-size 1 --max-model-len 32768 "
'--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
"--vlm.serve_ready_timeout_s=1800 "
"--vlm.client_concurrency=256 "
"--vlm.max_new_tokens=512 "
"--executor.episode_parallelism=32 "
"--vlm.chat_template_kwargs='{enable_thinking: false}' "
"--vlm.camera_key=observation.images.wrist "
"--module_1.frames_per_second=1.0 "
"--module_1.use_video_url=true "
"--module_1.use_video_url_fps=1.0 "
"--module_3.K=1 --module_3.vqa_emission_hz=0.2 "
"--push_to_hub=pepijn223/super_poulain_qwen36moe-3"
)
job = run_job(
image="vllm/vllm-openai:latest",
command=["bash", "-c", CMD],
flavor="h200x2",
secrets={"HF_TOKEN": token},
timeout="2h",
)
print(f"Job URL: {job.url}")
print(f"Job ID: {job.id}")
+17 -3
View File
@@ -95,7 +95,7 @@ dependencies = [
# ── Feature-scoped extras ──────────────────────────────────
dataset = [
"datasets>=4.0.0,<5.0.0",
"datasets>=4.7.0,<5.0.0",
"pandas>=2.0.0,<3.0.0", # NOTE: Transitive dependency of datasets
"pyarrow>=21.0.0,<30.0.0", # NOTE: Transitive dependency of datasets
"lerobot[av-dep]",
@@ -128,7 +128,7 @@ dataset_viz = ["lerobot[dataset]", "lerobot[viz]"]
av-dep = ["av>=15.0.0,<16.0.0"]
pygame-dep = ["pygame>=2.5.1,<2.7.0"]
placo-dep = ["placo>=0.9.6,<0.9.17"]
transformers-dep = ["transformers==5.3.0"] # TODO(Steven): https://github.com/huggingface/lerobot/pull/3249
transformers-dep = ["transformers>=5.4.0,<5.6.0"]
grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
can-dep = ["python-can>=4.2.0,<5.0.0"]
peft-dep = ["peft>=0.18.0,<1.0.0"]
@@ -194,12 +194,25 @@ groot = [
]
sarm = ["lerobot[transformers-dep]", "pydantic>=2.0.0,<3.0.0", "faker>=33.0.0,<35.0.0", "lerobot[matplotlib-dep]", "lerobot[qwen-vl-utils-dep]"]
xvla = ["lerobot[transformers-dep]"]
eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
hilserl = ["lerobot[transformers-dep]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
# Features
async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
peft = ["lerobot[transformers-dep]", "lerobot[peft-dep]"]
# Annotation pipeline (lerobot-annotate). vllm is the preferred backend
# on Linux, with a transformers fallback elsewhere; openai is the default
# backend and talks to any OpenAI-compatible server (``vllm serve`` /
# ``transformers serve`` / hosted endpoints). Distributed execution is
# delegated to Hugging Face Jobs (see examples/annotation/run_hf_job.py).
annotations = [
"lerobot[dataset]",
"lerobot[transformers-dep]",
"openai>=1.40,<2.0",
"vllm>=0.6.0,<1.0.0; sys_platform == 'linux'",
]
# Development
dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
notebook = ["jupyter>=1.0.0,<2.0.0", "ipykernel>=6.0.0,<7.0.0"]
@@ -289,11 +302,12 @@ lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
lerobot-annotate="lerobot.scripts.lerobot_annotate:main"
lerobot-rollout="lerobot.scripts.lerobot_rollout:main"
# ---------------- Tool Configurations ----------------
[tool.setuptools.package-data]
lerobot = ["envs/*.json"]
lerobot = ["envs/*.json", "annotations/steerable_pipeline/prompts/*.txt"]
[tool.setuptools.packages.find]
where = ["src"]
+15
View File
@@ -0,0 +1,15 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,36 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Steerable annotation pipeline producing ``language_persistent`` and
``language_events`` columns for LeRobot datasets.
The pipeline is decomposed into three independently runnable modules whose
outputs are staged per-episode before a final parquet rewrite:
- :mod:`.modules.plan_subtasks_memory` (Module 1) — persistent styles
- :mod:`.modules.interjections_and_speech` (Module 2) — event styles + speech
- :mod:`.modules.general_vqa` (Module 3) — event-style VQA pairs
"""
from .config import AnnotationPipelineConfig
from .validator import StagingValidator, ValidationReport
from .writer import LanguageColumnsWriter
__all__ = [
"AnnotationPipelineConfig",
"LanguageColumnsWriter",
"StagingValidator",
"ValidationReport",
]
@@ -0,0 +1,188 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@dataclass
class Module1Config:
"""Module 1: plan + subtasks + memory + task augmentation.
Module 1 attaches the whole episode as one Qwen-VL video block;
``max_video_frames`` only caps the frames packed in (a model-capacity
bound, not an annotation-logic knob).
"""
enabled: bool = True
# Number of ``task_aug`` rephrasings emitted at ``t=0``. The renderer's
# ``${task}`` binding rotates among them per ``sample_idx``. ``0`` disables.
n_task_rephrasings: int = 10
# When to derive the task from the video instead of using
# ``record.episode_task``: ``off``, ``if_short`` (short / placeholder /
# missing canonical task), or ``always``. The derived task replaces the
# canonical one for every Module-1 prompt; ``meta/tasks.parquet`` is
# never modified.
derive_task_from_video: str = "if_short"
derive_task_min_words: int = 3
# Frame sampling for the subtask-decomposition prompt.
frames_per_second: float = 1.0
max_video_frames: int = 128
min_subtask_seconds: float = 1.5
plan_max_steps: int = 8
# When True (and backend supports it, e.g. ``openai``), Module 1 sends a
# ``video_url`` block pointing at a per-episode mp4 subclip and lets the
# server sample frames at ``use_video_url_fps``.
use_video_url: bool = False
use_video_url_fps: float = 1.0
@dataclass
class Module2Config:
"""Module 2: interjections + paired speech."""
enabled: bool = True
# Each interjection emits a paired ``(interjection, speech)`` event row
# and triggers a ``plan`` refresh at the same timestamp via Module 1.
max_interjections_per_episode: int = 3
interjection_min_t: float = 2.0
# Visual context attached to the interjection prompt: a short window
# of frames centered on the chosen timestamp so the VLM sees the
# ongoing motion rather than a single frozen frame.
interjection_window_seconds: float = 2.0
interjection_window_frames: int = 4
@dataclass
class Module3Config:
"""Module 3: general VQA."""
enabled: bool = True
vqa_emission_hz: float = 1.0
K: int = 3
question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
@dataclass
class VlmConfig:
"""Shared Qwen-VL client configuration."""
# One of ``vllm``, ``transformers``, ``openai``, or ``stub`` (tests).
# ``openai`` talks to a local OpenAI-compatible server; the CLI
# auto-spawns one when ``auto_serve=True``.
backend: str = "openai"
model_id: str = "Qwen/Qwen2.5-VL-7B-Instruct"
# OpenAI-compatible server endpoint; ``EMPTY`` works for local servers.
api_base: str = "http://localhost:8000/v1"
api_key: str = "EMPTY"
# When True with ``backend=openai``, the CLI probes ``api_base`` and
# spawns a server if none answers (default: ``transformers serve``).
# Set to False to fail fast when pointing at a remote endpoint.
auto_serve: bool = True
serve_port: int = 8000
# Override the auto-serve command. ``{port}`` is substituted per replica
# when ``parallel_servers > 1``.
serve_command: str | None = None
# Run multiple independent inference servers for round-robin client
# routing (each pinned to a GPU via ``CUDA_VISIBLE_DEVICES`` and bound
# to ``serve_port + i``). ``num_gpus=0`` means one GPU per replica.
parallel_servers: int = 1
num_gpus: int = 0
client_concurrency: int = 16
serve_ready_timeout_s: float = 600.0
max_new_tokens: int = 512
temperature: float = 0.2
json_mode: bool = True
batch_size: int = 4
tensor_parallel_size: int = 1
# Fraction of GPU memory vllm allocates for weights + KV cache.
gpu_memory_utilization: float = 0.9
# Cap context length (None = model default). On 80 GB H100 a 30B BF16
# model often needs <= 8192 to leave KV-cache headroom.
max_model_len: int | None = None
trust_remote_code: bool = False
# Override the camera stream used for keyframe attachment. None picks
# the first ``observation.images.*`` key the dataset declares.
camera_key: str | None = None
# Forwarded as ``extra_body.chat_template_kwargs`` on every chat call;
# use to pass model-specific flags such as ``{"enable_thinking": false}``.
chat_template_kwargs: dict[str, Any] | None = None
@dataclass
class ExecutorConfig:
"""Executor settings.
Distributed execution is provided by Hugging Face Jobs (see
``examples/annotation/run_hf_job.py``); this config only controls
intra-process episode concurrency.
"""
# Episodes processed concurrently within each module phase. Each
# in-flight episode dispatches 3-5 dependent VLM calls, so this is the
# main knob for saturating ``parallel_servers`` and ``client_concurrency``.
episode_parallelism: int = 16
@dataclass
class AnnotationPipelineConfig:
"""Top-level config for ``lerobot-annotate``.
The writer rewrites ``data/chunk-*/file-*.parquet`` in place. Multiple
revisions of the same dataset live in separate copies.
"""
repo_id: str | None = None
root: Path | None = None
# Defaults to ``<root>/.annotate_staging/`` when unset.
staging_dir: Path | None = None
seed: int = 1729
module_1: Module1Config = field(default_factory=Module1Config)
module_2: Module2Config = field(default_factory=Module2Config)
module_3: Module3Config = field(default_factory=Module3Config)
vlm: VlmConfig = field(default_factory=VlmConfig)
executor: ExecutorConfig = field(default_factory=ExecutorConfig)
skip_validation: bool = False
only_episodes: tuple[int, ...] | None = None
# Upload the annotated dataset to the Hugging Face Hub when set.
push_to_hub: str | None = None
push_private: bool = False
push_commit_message: str | None = None
def resolved_staging_dir(self, root: Path) -> Path:
return self.staging_dir if self.staging_dir is not None else root / ".annotate_staging"
@@ -0,0 +1,267 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""In-process executor that runs the annotation phases.
The executor plans **six phases** in the dependency order from the plan:
phase 1: Module 1 (plan + subtasks + memory)
phase 2: Module 2 (interjections + speech)
phase 3: Module 1 plan-update pass — re-runs plan emission at every
interjection timestamp produced by phase 2
phase 4: Module 3 (VQA)
phase 5: validator
phase 6: writer
Phase 3 is why Module 1 must be re-entered after Module 2 — to refresh
``plan`` rows at interjection timestamps.
Distributed execution is provided by Hugging Face Jobs (see
``examples/annotation/run_hf_job.py``); the runner inside the job
invokes ``lerobot-annotate`` which uses this in-process executor.
Episode-level concurrency is controlled by
``ExecutorConfig.episode_parallelism``.
"""
from __future__ import annotations
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Any
from .config import AnnotationPipelineConfig
from .reader import EpisodeRecord, iter_episodes
from .staging import EpisodeStaging
from .validator import StagingValidator
from .writer import LanguageColumnsWriter
logger = logging.getLogger(__name__)
@dataclass
class PhaseResult:
"""Summary of one pipeline phase across all episodes."""
name: str
episodes_processed: int
episodes_skipped: int
@dataclass
class PipelineRunSummary:
"""Aggregated result returned by :meth:`Executor.run`."""
phases: list[PhaseResult]
written_paths: list[Path]
validation_report: Any # ValidationReport, kept Any to avoid import cycle
@dataclass
class Executor:
"""Run all four phases over a dataset root in-process.
Episode-level concurrency comes from ``ExecutorConfig.episode_parallelism``
(a thread pool); cluster-level concurrency comes from running this
executor inside a Hugging Face Job. Tests construct the executor
directly with stub modules.
"""
config: AnnotationPipelineConfig
module_1: Any # PlanSubtasksMemoryModule
module_2: Any # InterjectionsAndSpeechModule
module_3: Any # GeneralVqaModule
writer: LanguageColumnsWriter
validator: StagingValidator
def run(self, root: Path) -> PipelineRunSummary:
records = list(iter_episodes(root, only_episodes=self.config.only_episodes))
n = len(records)
if n == 0:
raise ValueError(f"No episodes found under {root}/data/")
print(f"[annotate] {n} episodes total", flush=True)
staging_dir = self.config.resolved_staging_dir(root)
staging_dir.mkdir(parents=True, exist_ok=True)
phases: list[PhaseResult] = []
# Phase 1: Module 1 (plan + subtasks + memory)
phases.append(self._run_module_phase("module_1", records, staging_dir, self.module_1))
# Phase 2: Module 2 (interjections + speech). Module 2 reads
# Module 1's subtask rows from the same staging tree to ground
# the interjection prompt in the correct local subtask.
phases.append(self._run_module_phase("module_2", records, staging_dir, self.module_2))
# Phase 3: Module 1 plan-update pass at interjection timestamps.
phases.append(self._run_plan_update_phase(records, staging_dir))
# Phase 4: Module 3 (VQA)
phases.append(self._run_module_phase("module_3", records, staging_dir, self.module_3))
print("[annotate] running validator...", flush=True)
report = self.validator.validate(records, staging_dir)
if not report.ok and not self.config.skip_validation:
raise RuntimeError(f"Staging validation failed: {report.summary()}")
print(f"[annotate] validator: {report.summary()}", flush=True)
print(f"[annotate] writing parquet shards into {root}/data/...", flush=True)
written = self.writer.write_all(records, staging_dir, root)
print(f"[annotate] wrote {len(written)} shard(s); pipeline complete", flush=True)
# Keep meta/info.json aligned with the parquet schema we just wrote.
# Idempotent and additive: existing user metadata is preserved.
self._ensure_annotation_metadata_in_info(root)
return PipelineRunSummary(phases=phases, written_paths=written, validation_report=report)
@staticmethod
def _ensure_annotation_metadata_in_info(root: Path) -> None:
"""Write language features and canonical tools to ``meta/info.json``.
``LanguageColumnsWriter`` adds ``language_persistent`` and
``language_events`` to parquet shards. The metadata must advertise
those columns too, otherwise non-streaming ``LeRobotDataset`` loads
cast against the old schema and fail on the extra parquet columns.
"""
import json # noqa: PLC0415
from lerobot.datasets.language import SAY_TOOL_SCHEMA, language_feature_info # noqa: PLC0415
info_path = root / "meta" / "info.json"
if not info_path.exists():
return
try:
info = json.loads(info_path.read_text())
except Exception as exc: # noqa: BLE001
print(f"[annotate] could not read {info_path}: {exc}", flush=True)
return
changed = False
features = info.get("features")
if not isinstance(features, dict):
features = {}
merged_features = {**features, **language_feature_info()}
if merged_features != features:
info["features"] = merged_features
changed = True
existing = info.get("tools")
if not isinstance(existing, list):
existing = []
names = {(t.get("function") or {}).get("name") for t in existing if isinstance(t, dict)}
merged = list(existing)
if SAY_TOOL_SCHEMA["function"]["name"] not in names:
merged.append(SAY_TOOL_SCHEMA)
if merged != existing:
info["tools"] = merged
changed = True
if changed:
# Atomic replace — info.json is load-bearing for dataset
# metadata, so a crash mid-write would brick the dataset.
tmp_info = info_path.with_suffix(info_path.suffix + ".tmp")
tmp_info.write_text(json.dumps(info, indent=2))
tmp_info.replace(info_path)
print(
"[annotate] meta/info.json: "
f"language_features={list(language_feature_info())}, "
f"tools={[t['function']['name'] for t in merged]}",
flush=True,
)
def _run_module_phase(
self,
name: str,
records: list[EpisodeRecord],
staging_dir: Path,
module: Any,
) -> PhaseResult:
import time as _time # noqa: PLC0415
from concurrent.futures import ThreadPoolExecutor, as_completed # noqa: PLC0415
if not module.enabled:
print(f"[annotate] phase={name} skipped (module disabled)", flush=True)
return PhaseResult(name=name, episodes_processed=0, episodes_skipped=len(records))
n = len(records)
parallelism = max(1, min(self.config.executor.episode_parallelism, n))
print(
f"[annotate] phase={name} starting on {n} episode(s) (parallelism={parallelism})",
flush=True,
)
t0 = _time.time()
def _do(idx_record: tuple[int, EpisodeRecord]) -> tuple[int, int, float]:
i, record = idx_record
ep_start = _time.time()
staging = EpisodeStaging(staging_dir, record.episode_index)
module.run_episode(record, staging)
return i, record.episode_index, _time.time() - ep_start
processed = 0
if parallelism == 1:
for i, record in enumerate(records, 1):
_, ep_idx, elapsed = _do((i, record))
processed += 1
print(
f"[annotate] {name} episode {i}/{n} (idx={ep_idx}) done in {elapsed:.1f}s",
flush=True,
)
else:
with ThreadPoolExecutor(max_workers=parallelism) as pool:
futures = [pool.submit(_do, (i, r)) for i, r in enumerate(records, 1)]
for fut in as_completed(futures):
i, ep_idx, elapsed = fut.result()
processed += 1
print(
f"[annotate] {name} episode {processed}/{n} "
f"(idx={ep_idx}, submit_order={i}) done in {elapsed:.1f}s",
flush=True,
)
total = _time.time() - t0
print(f"[annotate] phase={name} complete: {processed}/{n} in {total:.1f}s", flush=True)
return PhaseResult(name=name, episodes_processed=processed, episodes_skipped=0)
def _run_plan_update_phase( # noqa: PLR0915
self, records: list[EpisodeRecord], staging_dir: Path
) -> PhaseResult:
"""Re-emit ``plan`` rows at each interjection timestamp from Module 2.
Module 1 owns the prompt; Module 2 produced the timestamps. This phase
therefore calls back into Module 1 with the interjection timestamps so
Module 1's existing prompt path is reused.
"""
if not self.module_1.enabled or not self.module_2.enabled:
return PhaseResult(
name="module_1_plan_update", episodes_processed=0, episodes_skipped=len(records)
)
processed = 0
for record in records:
staging = EpisodeStaging(staging_dir, record.episode_index)
interjection_rows = [
row for row in staging.read("module_2") if row.get("style") == "interjection"
]
interjection_times = [float(row["timestamp"]) for row in interjection_rows]
interjection_texts = [str(row.get("content") or "") for row in interjection_rows]
if interjection_times:
self.module_1.run_plan_updates(record, staging, interjection_times, interjection_texts)
processed += 1
# Episodes without any interjections are skipped (no plan refresh
# needed); count them so the summary's processed+skipped == total.
return PhaseResult(
name="module_1_plan_update",
episodes_processed=processed,
episodes_skipped=len(records) - processed,
)
@@ -0,0 +1,394 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Keyframe extraction for the annotation pipeline.
Modules attach decoded camera frames to their VLM prompts so the model can
ground subtask decomposition, interjection scenarios, and VQA in actual
visual content. The pipeline shares one provider across modules and one
episode at a time, with a small per-episode cache so multiple modules
querying the same timestamp pay decode cost once.
"""
from __future__ import annotations
import threading
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Protocol
from .reader import EpisodeRecord
class FrameProvider(Protocol):
"""Decodes camera frames at episode-relative timestamps."""
@property
def camera_keys(self) -> list[str]:
"""All ``observation.images.*`` feature keys this provider can decode."""
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
"""Return one PIL.Image per timestamp from ``camera_key`` (or default).
Empty list if the camera is unavailable. ``camera_key=None`` falls back
to the provider's default camera so existing single-camera callers
(Module 1, Module 2) keep working unchanged.
"""
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
"""Return up to ``max_frames`` PIL images covering the whole episode.
Sampling is uniform across the episode duration. The returned list is
intended to be passed as one ``{"type":"video", "video":<list>}``
block to a Qwen-VL-compatible model that pools temporally itself.
Empty list if no camera available.
"""
@dataclass
class _NullProvider:
"""No-op provider used when the dataset has no video keys or in tests."""
@property
def camera_keys(self) -> list[str]:
return []
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
return []
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
return []
def null_provider() -> FrameProvider:
return _NullProvider()
@dataclass
class VideoFrameProvider:
"""Decodes frames from the dataset's ``observation.images.*`` streams.
By default the *first* camera key is used for Module 1 (subtask
decomposition) and Module 2 (interjection scenarios) — those prompts care
about *what is happening*, not which angle. Module 3 (VQA) instead
iterates over every camera in :attr:`camera_keys` so each frame's
grounded answer (bbox/keypoint/...) is tagged with the camera it was
grounded against.
``camera_key`` overrides the default-camera choice but does not restrict
:attr:`camera_keys`. Pass ``camera_key`` explicitly to ``frames_at`` /
``video_for_episode`` to read a non-default stream.
Caches up to ``cache_size`` decoded frames per process to keep
co-timestamped Module 2 + Module 1 plan-update calls cheap.
"""
root: Path
camera_key: str | None = None
tolerance_s: float = 1e-2
cache_size: int = 256
_meta: Any = field(default=None, init=False, repr=False)
_cache: dict = field(default_factory=dict, init=False, repr=False)
_camera_keys: list[str] = field(default_factory=list, init=False, repr=False)
# Pipeline runs Module 1/2/3 phases under a ThreadPoolExecutor (see
# ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
# one-shot warn flag against concurrent updates from worker threads.
_lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
def __post_init__(self) -> None:
from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata # noqa: PLC0415
self._meta = LeRobotDatasetMetadata(repo_id="local", root=self.root)
# ``camera_keys`` covers both image- and video-stored cameras
# (``video_keys`` is video-only). Some datasets declare cameras with
# ``dtype=image``, which would otherwise look empty here and silently
# disable Module 3 even though the videos are there.
keys = list(getattr(self._meta, "camera_keys", None) or self._meta.video_keys or [])
# Last-resort fallback: if metadata didn't surface anything but the
# caller explicitly named a camera (``--vlm.camera_key=...``), trust
# them — the key is by definition known to exist on the dataset.
if not keys and self.camera_key:
keys = [self.camera_key]
self._camera_keys = keys
if self.camera_key is None:
self.camera_key = keys[0] if keys else None
@property
def camera_keys(self) -> list[str]:
"""All ``observation.images.*`` keys available on this dataset."""
return list(self._camera_keys)
def frames_at(
self,
record: EpisodeRecord,
timestamps: list[float],
camera_key: str | None = None,
) -> list[Any]:
target = camera_key if camera_key is not None else self.camera_key
if not timestamps or target is None:
return []
out: list[Any] = []
misses: list[float] = []
miss_indices: list[int] = []
with self._lock:
for i, ts in enumerate(timestamps):
key = (record.episode_index, target, round(float(ts), 6))
cached = self._cache.get(key)
if cached is not None:
out.append(cached)
else:
out.append(None)
misses.append(float(ts))
miss_indices.append(i)
if misses:
decoded = self._decode(record.episode_index, misses, target)
# decoder may return fewer frames than requested when some
# timestamps fall outside the video; pair what we have and
# leave the rest as None to be filtered below.
with self._lock:
for i, img in zip(miss_indices, decoded, strict=False):
out[i] = img
key = (record.episode_index, target, round(float(timestamps[i]), 6))
if len(self._cache) >= self.cache_size:
self._cache.pop(next(iter(self._cache)))
self._cache[key] = img
# filter out any None left over from decode failures
return [img for img in out if img is not None]
def video_for_episode(
self,
record: EpisodeRecord,
max_frames: int,
camera_key: str | None = None,
) -> list[Any]:
"""Return up to ``max_frames`` images uniformly sampled across the episode.
The whole episode duration is covered; the model picks subtask
boundaries from the temporal pooling it does internally.
"""
target = camera_key if camera_key is not None else self.camera_key
if max_frames <= 0 or target is None or not record.frame_timestamps:
return []
n_frames = min(max_frames, len(record.frame_timestamps))
if n_frames == len(record.frame_timestamps):
timestamps = list(record.frame_timestamps)
else:
t0 = record.frame_timestamps[0]
t_last = record.frame_timestamps[-1]
if t_last <= t0:
timestamps = [float(t0)] * n_frames
else:
step = (t_last - t0) / (n_frames - 1) if n_frames > 1 else 0.0
timestamps = [float(t0 + i * step) for i in range(n_frames)]
return self.frames_at(record, timestamps, camera_key=target)
def episode_clip_path(self, record: EpisodeRecord, cache_dir: Path) -> Path | None:
"""Extract the episode's subclip to ``cache_dir/ep_{idx:06d}.mp4``.
Returns ``None`` if the dataset has no video tracks. Skips
re-extract when the cached clip already exists. Re-encodes to
H.264 (libx264) so the resulting mp4 is decodable by every
downstream video processor — stream-copy would inherit the
source codec (often AV1 in modern LeRobot datasets), which
vllm's libav build cannot decode.
"""
import subprocess # noqa: PLC0415
if self.camera_key is None:
return None
cache_dir.mkdir(parents=True, exist_ok=True)
out_path = cache_dir / f"ep_{record.episode_index:06d}.mp4"
if out_path.exists() and out_path.stat().st_size > 0:
return out_path
ep = self._meta.episodes[record.episode_index]
from_timestamp = float(ep[f"videos/{self.camera_key}/from_timestamp"])
to_timestamp = float(ep[f"videos/{self.camera_key}/to_timestamp"])
src = self.root / self._meta.get_video_file_path(record.episode_index, self.camera_key)
cmd = [
"ffmpeg",
"-y",
"-loglevel",
"error",
"-ss",
f"{from_timestamp:.3f}",
"-to",
f"{to_timestamp:.3f}",
"-i",
str(src),
"-c:v",
"libx264",
"-preset",
"ultrafast",
"-crf",
"23",
"-pix_fmt",
"yuv420p",
"-an",
str(out_path),
]
try:
subprocess.run(cmd, check=True, timeout=300)
except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
return None
return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
def _decode(self, episode_index: int, timestamps: list[float], camera_key: str) -> list[Any]:
ep = self._meta.episodes[episode_index]
from_timestamp = ep[f"videos/{camera_key}/from_timestamp"]
shifted = [from_timestamp + ts for ts in timestamps]
video_path = self.root / self._meta.get_video_file_path(episode_index, camera_key)
try:
return _decode_pyav_direct(video_path, shifted, self.tolerance_s)
except Exception as exc:
# Log loudly the first time decoding fails so silent
# Module-3-no-op (every prompt skipped because frames_at returned
# []) is debuggable from the job log instead of post-hoc parquet
# inspection. Subsequent failures stay quiet.
with self._lock:
already_warned = getattr(self, "_warned_decode_fail", False)
if not already_warned:
self._warned_decode_fail = True
if not already_warned:
import logging # noqa: PLC0415
logging.getLogger(__name__).warning(
"VideoFrameProvider._decode failed for episode=%s camera=%s video_path=%s: %s",
episode_index,
camera_key,
video_path,
exc,
exc_info=True,
)
return []
def _decode_pyav_direct(video_path: Any, timestamps: list[float], tolerance_s: float) -> list[Any]:
"""Decode the requested timestamps from ``video_path`` using PyAV directly.
Bypasses ``lerobot.datasets.video_utils.decode_video_frames`` entirely
because its "pyav" path actually goes through
``decode_video_frames_torchvision`` → ``torchvision.io.VideoReader``,
which was removed in torchvision >= 0.22 (the vllm/vllm-openai:latest
container ships with torchvision 0.25). The annotation pipeline only
needs a handful of PIL images per (episode, ts), so we can decode them
with PyAV without any torch dependency at all.
Returns one ``PIL.Image`` per requested timestamp, in the same order.
Any timestamp the decoder couldn't reach is silently dropped (mirrors
the previous behaviour); callers filter ``None``/missing entries.
"""
import av # noqa: PLC0415
if not timestamps:
return []
targets = sorted(set(timestamps))
seek_to = max(0.0, min(targets) - max(0.5, tolerance_s))
container = av.open(str(video_path))
try:
stream = container.streams.video[0]
# PyAV needs the seek target in stream timebase ticks.
seek_pts = 0 if stream.time_base is None else int(seek_to / float(stream.time_base))
try:
container.seek(seek_pts, any_frame=False, backward=True, stream=stream)
except av.AVError:
# Some streams reject the explicit seek; fall back to decoding from start.
container.seek(0)
results: dict[float, Any] = {}
target_iter = iter(targets)
next_target = next(target_iter, None)
for frame in container.decode(stream):
if next_target is None:
break
ts = float(frame.pts * frame.time_base) if frame.pts is not None else None
if ts is None:
continue
# Walk past targets we've already overshot — we keep the closest
# frame within tolerance.
while next_target is not None and ts >= next_target - tolerance_s:
if abs(ts - next_target) <= tolerance_s or ts >= next_target:
img = frame.to_image() # PIL.Image.Image (RGB)
results.setdefault(next_target, img)
next_target = next(target_iter, None)
else:
break
finally:
container.close()
return [results[ts] for ts in timestamps if ts in results]
def make_frame_provider(root: Path, camera_key: str | None = None) -> FrameProvider:
"""Build a :class:`VideoFrameProvider` if videos are present, else null."""
try:
provider = VideoFrameProvider(root=root, camera_key=camera_key)
except Exception:
return null_provider()
if provider.camera_key is None:
return null_provider()
return provider
def to_image_blocks(images: list[Any]) -> list[dict[str, Any]]:
"""Convert PIL images to Qwen-VL-compatible content blocks."""
return [{"type": "image", "image": img} for img in images]
def to_video_block(images: list[Any]) -> list[dict[str, Any]]:
"""Wrap a list of PIL images as one Qwen-VL video block.
Returns ``[]`` when the list is empty, so the caller can splat the result
into a content array without a separate emptiness check.
"""
if not images:
return []
return [{"type": "video", "video": list(images)}]
def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
"""Wrap a video file URL as one ``video_url`` block.
Used by the ``openai`` backend (transformers serve / vllm serve /
ktransformers serve), where the server handles frame sampling.
Returns ``[]`` when ``url`` is ``None`` so the caller can splat.
"""
if not url:
return []
return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]
@@ -0,0 +1,25 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .general_vqa import GeneralVqaModule
from .interjections_and_speech import InterjectionsAndSpeechModule
from .plan_subtasks_memory import PlanSubtasksMemoryModule
__all__ = [
"GeneralVqaModule",
"InterjectionsAndSpeechModule",
"PlanSubtasksMemoryModule",
]
@@ -0,0 +1,238 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Module 3: general VQA at a timed cadence.
Anchors ``K`` (question, answer) pairs to ``K`` consecutive frames per
emission. For datasets with multiple cameras, every emission tick produces
one ``(vqa, user)`` + ``(vqa, assistant)`` pair *per camera*: each pair is
generated against that camera's frame and stamped with the matching
``camera`` field on the emitted rows. The resolver disambiguates via
``camera=...``; recipes that consume VQA do so through one sub-recipe
per camera (see ``recipes/pi05_hirobot.yaml``).
Within a single (frame, camera) we still emit at most one ``(vqa, user)``
and one ``(vqa, assistant)`` row, so the resolver contract stays scalar.
Question types covered (per the plan's Module 3 table): bbox, keypoint,
count, attribute, spatial. The assistant's ``content`` is a JSON string
whose schema depends on the question type. Malformed JSON triggers one
retry inside :meth:`VlmClient.generate_json`.
"""
from __future__ import annotations
import json
import random
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from ..config import Module3Config
from ..frames import FrameProvider, null_provider, to_image_blocks
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord
from ..staging import EpisodeStaging
from ..validator import classify_vqa_answer
from ..vlm_client import VlmClient
def _emission_anchor_indices(frame_timestamps: Sequence[float], hz: float, k: int) -> list[int]:
"""Return the relative frame indices to anchor VQA emissions to.
For each emission tick (every ``1/hz`` seconds), we anchor ``k``
consecutive frames starting at the tick. Ticks fall on the nearest
available source frame timestamp.
"""
if hz <= 0 or k <= 0 or not frame_timestamps:
return []
t0 = frame_timestamps[0]
t_last = frame_timestamps[-1]
period = 1.0 / hz
indices: list[int] = []
t = t0
while t <= t_last + 1e-9:
# find the index of the nearest frame to t
nearest_i = min(range(len(frame_timestamps)), key=lambda i: abs(frame_timestamps[i] - t))
for offset in range(k):
j = nearest_i + offset
if j >= len(frame_timestamps):
break
if not indices or indices[-1] != j:
indices.append(j)
t += period
# dedupe while preserving order
seen: set[int] = set()
deduped: list[int] = []
for i in indices:
if i in seen:
continue
seen.add(i)
deduped.append(i)
return deduped
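# Worked example (illustrative only; the 10 Hz frame grid is an assumption):
# with frame_timestamps = [i / 10 for i in range(21)] (0.0 .. 2.0 s), hz=1.0
# and k=2, the anchors are [0, 1, 10, 11, 20]: two consecutive frames per
# 1 Hz tick, truncated where the window runs past the last frame.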
@dataclass
class GeneralVqaModule:
"""Emit grounded VQA pairs at a timed cadence."""
vlm: VlmClient
config: Module3Config
seed: int = 1729
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
if not record.frame_timestamps:
staging.write("module_3", [])
return
rng = random.Random(f"{self.seed}:{record.episode_index}:vqa")
anchor_idx = _emission_anchor_indices(
record.frame_timestamps, self.config.vqa_emission_hz, self.config.K
)
cameras = self._target_cameras()
if not cameras:
# No camera available — emit nothing rather than producing
# untagged rows that would fail validation. Surface a loud one-
# time warning so this is never silently a no-op.
if not getattr(self, "_warned_no_camera", False):
import logging # noqa: PLC0415
logging.getLogger(__name__).warning(
"Module 3 (VQA) found no cameras on the frame provider — "
"every episode will emit zero VQA rows. Check that the "
"dataset declares observation.images.* features in "
"meta/info.json; passing --vlm.camera_key=<key> at the "
"CLI now also seeds the cameras list as a fallback."
)
self._warned_no_camera = True
staging.write("module_3", [])
return
# Build all messages first (one per (frame, camera)), then issue them
# as a single batched generate_json call so the client can fan them
# out concurrently.
per_call: list[tuple[float, str, str, list[dict[str, Any]]]] = []
for idx in anchor_idx:
ts = float(record.frame_timestamps[idx])
qtype = rng.choice(self.config.question_types)
for camera in cameras:
messages = self._build_messages(record, qtype, ts, camera)
# Skip cameras that decoded to zero frames at this ts: no point
# asking the VLM to ground a bbox without an image.
if not _has_image_block(messages):
continue
per_call.append((ts, camera, qtype, messages))
if not per_call:
staging.write("module_3", [])
return
results = self.vlm.generate_json([m for _, _, _, m in per_call])
rows: list[dict[str, Any]] = []
for (ts, camera, _qtype, _messages), result in zip(per_call, results):
qa = self._postprocess(result)
if qa is None:
continue
question, answer = qa
rows.append(
{
"role": "user",
"content": question,
"style": "vqa",
"timestamp": ts,
"camera": camera,
"tool_calls": None,
}
)
rows.append(
{
"role": "assistant",
"content": json.dumps(answer, sort_keys=True),
"style": "vqa",
"timestamp": ts,
"camera": camera,
"tool_calls": None,
}
)
staging.write("module_3", rows)
def _target_cameras(self) -> list[str]:
"""Return the cameras Module 3 should iterate per emission tick.
Defaults to every camera the provider exposes. Datasets with no
cameras (or test/null providers) yield an empty list, which makes
``run_episode`` a no-op.
"""
return list(getattr(self.frame_provider, "camera_keys", []) or [])
def _build_messages(
self,
record: EpisodeRecord,
question_type: str,
frame_timestamp: float,
camera_key: str,
) -> list[dict[str, Any]]:
prompt = load_prompt("module_3_vqa").format(
episode_task=record.episode_task,
question_type=question_type,
)
images = self.frame_provider.frames_at(
record, [frame_timestamp], camera_key=camera_key
)
content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
return [{"role": "user", "content": content}]
def _postprocess(self, result: Any) -> tuple[str, dict[str, Any]] | None:
if not isinstance(result, dict):
return None
question = result.get("question")
answer = result.get("answer")
if not isinstance(question, str) or not question.strip():
return None
if not isinstance(answer, dict):
return None
# The validator will enforce shape; here we just sanity-check that the
# answer matches *some* known shape so we can drop garbage early.
if classify_vqa_answer(answer) is None:
return None
return question.strip(), answer
def _generate_one(
self,
record: EpisodeRecord,
question_type: str,
frame_timestamp: float,
camera_key: str,
) -> tuple[str, dict[str, Any]] | None:
messages = self._build_messages(record, question_type, frame_timestamp, camera_key)
result = self.vlm.generate_json([messages])[0]
return self._postprocess(result)
def _has_image_block(messages: list[dict[str, Any]]) -> bool:
"""Return True if any user content block is a populated image block."""
for msg in messages:
content = msg.get("content")
if not isinstance(content, list):
continue
for block in content:
if isinstance(block, dict) and block.get("type") == "image":
return True
return False
@@ -0,0 +1,210 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Module 2: interjections + paired speech (EVENT styles + speech atoms).
Two sub-passes:
1. At ``t=0``, emit ONLY a speech tool-call atom (acknowledgement of the
canonical task). No interjection row — the canonical task is already the
user utterance from ``meta/tasks.parquet``.
2. For mid-episode interruptions, emit a co-timestamped pair:
{role:user, style:interjection, content:<text>}
speech atom (role:assistant, style:None, tool_calls=[say(...)])
Both rows go in ``language_events`` at the same timestamp.
Module 1's :meth:`run_plan_updates` reuses Module 2's interjection
timestamps to refresh the ``plan`` row at the same instant.
"""
from __future__ import annotations
import random
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any
from ..config import Module2Config
from ..frames import FrameProvider, null_provider, to_image_blocks
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
from ..staging import EpisodeStaging
from ..vlm_client import VlmClient
from ..writer import speech_atom
@dataclass
class InterjectionsAndSpeechModule:
"""Generate task-start speech and mid-episode interjection/speech pairs."""
vlm: VlmClient
config: Module2Config
seed: int = 1729
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
rows: list[dict[str, Any]] = []
if record.frame_timestamps:
t0 = float(record.frame_timestamps[0])
initial = self._initial_speech(record)
if initial:
rows.append(speech_atom(t0, initial))
# Pull Module 1's subtask spans for this episode so the
# interjection prompt can ground itself in the actual current
# subtask at each chosen timestamp. Module 1 ran first.
episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
subtask_spans = reconstruct_subtask_spans(staging.read("module_1"), episode_end_t=episode_end_t)
rows.extend(self._mid_episode_interjections(record, subtask_spans))
staging.write("module_2", rows)
@staticmethod
def _subtask_at(spans: Sequence[dict[str, Any]], t: float) -> str | None:
current: str | None = None
for span in spans:
if float(span["start"]) <= t:
current = span.get("text")
else:
break
return current
def _initial_speech(self, record: EpisodeRecord) -> str | None:
prompt = load_prompt("module_2_initial_speech").format(
episode_task=record.episode_task,
)
messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict) and isinstance(result.get("text"), str):
text = result["text"].strip()
if text:
return text
return None
def _mid_episode_interjections(
self,
record: EpisodeRecord,
subtask_spans: Sequence[dict[str, Any]],
) -> list[dict[str, Any]]:
"""Generate interjections aligned with the actual demo trajectory.
Teleop data is frozen: the robot already executed every step in
the video. A *counterfactual* interjection like "actually skip
the wipe" therefore contradicts what then happens in the video,
which is exactly the failure mode the qwen36moe-10/11 runs
surfaced as low-quality interjections.
Instead, anchor every interjection at a subtask boundary and
write it as a natural user request for the *upcoming* subtask.
The robot's visible next behavior IS the interjection's effect,
so the training signal stays consistent: interjection text →
plan refresh → action stream all line up.
"""
if self.config.max_interjections_per_episode <= 0:
return []
if len(subtask_spans) < 2:
# Need at least one transition (subtask 0 → subtask 1).
return []
# Deterministic per-episode RNG so reruns are stable across SLURM jobs.
rng = random.Random(f"{self.seed}:{record.episode_index}:interjection")
# Boundaries: the start time of every subtask except the first
# (which is just t0 and is covered by the initial-task speech atom).
boundaries: list[tuple[float, str, str]] = []
for i in range(1, len(subtask_spans)):
ts = float(subtask_spans[i]["start"])
if ts < self.config.interjection_min_t:
continue
prev_text = (subtask_spans[i - 1].get("text") or "").strip()
next_text = (subtask_spans[i].get("text") or "").strip()
if not next_text:
continue
boundaries.append((ts, prev_text, next_text))
if not boundaries:
return []
n = min(self.config.max_interjections_per_episode, len(boundaries))
chosen = sorted(rng.sample(boundaries, n), key=lambda b: b[0])
out: list[dict[str, Any]] = []
for t, prev_subtask, next_subtask in chosen:
t_snap = snap_to_frame(t, record.frame_timestamps)
# Window straddles the boundary so the VLM sees the end of the
# previous subtask and the start of the next one — same
# conditioning the policy will see at training time.
window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
prompt = load_prompt("module_2_interjection").format(
episode_task=record.episode_task,
prev_subtask=prev_subtask or "(starting from initial state)",
next_subtask=next_subtask,
timestamp=t_snap,
window_seconds=self.config.interjection_window_seconds,
)
images = self.frame_provider.frames_at(record, window_ts)
content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
messages = [{"role": "user", "content": content}]
result = self.vlm.generate_json([messages])[0]
if not isinstance(result, dict):
continue
interjection_text = result.get("interjection")
speech_text = result.get("speech")
if not isinstance(interjection_text, str) or not interjection_text.strip():
continue
if not isinstance(speech_text, str) or not speech_text.strip():
continue
out.append(
{
"role": "user",
"content": interjection_text.strip(),
"style": "interjection",
"timestamp": t_snap,
"tool_calls": None,
}
)
out.append(speech_atom(t_snap, speech_text.strip()))
return out
def _window_timestamps(self, t_anchor: float, frame_timestamps: Sequence[float]) -> list[float]:
"""Return a small set of frame timestamps centered on ``t_anchor``.
The window straddles the subtask boundary the interjection sits
on: roughly half the frames cover the end of the previous
subtask, half cover the start of the next one. The VLM therefore
sees BOTH what just finished AND what's about to start, which is
the conditioning we need to write a natural "now please do X"
request that matches the visible upcoming behavior.
"""
if not frame_timestamps:
return [t_anchor]
n = max(1, int(self.config.interjection_window_frames))
if n == 1:
return [t_anchor]
window = float(self.config.interjection_window_seconds)
step = window / max(1, n - 1)
# Center the window on the anchor so half lands before, half after.
start_offset = -window / 2.0
targets = [t_anchor + start_offset + step * i for i in range(n)]
last_ts = float(frame_timestamps[-1])
snapped: list[float] = []
seen: set[float] = set()
for tgt in targets:
clamped = min(last_ts, max(0.0, tgt))
t = snap_to_frame(clamped, frame_timestamps)
if t not in seen:
seen.add(t)
snapped.append(t)
return snapped or [t_anchor]
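# Worked example (illustrative; assumes interjection_window_frames=5,
# interjection_window_seconds=2.0, and a 10 Hz frame grid): an anchor at
# t=10.0 yields targets [9.0, 9.5, 10.0, 10.5, 11.0], i.e. two frames before
# the boundary, the boundary itself, and two frames after, each snapped to an
# exact source frame timestamp.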
@@ -0,0 +1,389 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Module 1: subtask decomposition + plan + memory (PERSISTENT styles)."""
from __future__ import annotations
from collections.abc import Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from ..config import Module1Config
from ..frames import (
FrameProvider,
VideoFrameProvider,
null_provider,
to_video_block,
to_video_url_block,
)
from ..prompts import load as load_prompt
from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
from ..staging import EpisodeStaging
from ..vlm_client import VlmClient
@dataclass
class PlanSubtasksMemoryModule:
"""Generate subtask spans, plan, and memory rows.
All output is persistent (lives in ``language_persistent``):
- ``subtask`` rows: one per span, stamped at the span's *start* timestamp
(snapped to an exact frame).
- ``plan`` rows: emitted at ``t=0``; refreshed at every interjection
timestamp via :meth:`run_plan_updates` (called by the executor after
Module 2 completes).
- ``memory`` rows: emitted at each subtask boundary (= subtask start
timestamp from the second subtask onward).
"""
vlm: VlmClient
config: Module1Config
frame_provider: FrameProvider = field(default_factory=null_provider)
@property
def enabled(self) -> bool:
return self.config.enabled
def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
rows: list[dict[str, Any]] = []
# Resolve the task that drives every other Module-1 prompt. May be
# the canonical ``record.episode_task`` (default), or a fresh
# description derived from the video when the canonical task is
# empty / placeholder / forced-off (see Module1Config.derive_task_*).
effective_task = self._resolve_effective_task(record)
# ``task_aug`` rows at t=0 (role=user), one per rephrasing — the
# PR 1 renderer rotates ``${task}`` deterministically through them
# so the policy sees diverse phrasings during training.
t0 = float(record.frame_timestamps[0]) if record.frame_timestamps else 0.0
if self.config.n_task_rephrasings > 0 and effective_task:
rephrasings = self._generate_task_rephrasings(effective_task, n=self.config.n_task_rephrasings)
# Always include the effective task itself as the first variant
# so the rotation is guaranteed to cover the source-of-truth
# phrasing, not just synthetic alternatives.
seen: set[str] = set()
ordered = [effective_task, *rephrasings]
for phrasing in ordered:
key = phrasing.strip()
if not key or key in seen:
continue
seen.add(key)
rows.append(
{
"role": "user",
"content": key,
"style": "task_aug",
"timestamp": t0,
"tool_calls": None,
}
)
subtask_spans = self._generate_subtasks(record, task=effective_task)
# subtask rows
for span in subtask_spans:
rows.append(
{
"role": "assistant",
"content": span["text"],
"style": "subtask",
"timestamp": snap_to_frame(span["start"], record.frame_timestamps),
"tool_calls": None,
}
)
# plan row at t=0
plan_text = self._generate_plan(record, subtask_spans, task=effective_task)
if plan_text is not None:
rows.append(
{
"role": "assistant",
"content": plan_text,
"style": "plan",
"timestamp": float(t0),
"tool_calls": None,
}
)
# memory rows at every subtask boundary except the very first start
prior_memory = ""
for i, span in enumerate(subtask_spans[1:], start=1):
completed = subtask_spans[i - 1]["text"]
remaining = [s["text"] for s in subtask_spans[i:]]
mem_text = self._generate_memory(record, prior_memory, completed, remaining, task=effective_task)
if mem_text:
ts = snap_to_frame(span["start"], record.frame_timestamps)
rows.append(
{
"role": "assistant",
"content": mem_text,
"style": "memory",
"timestamp": ts,
"tool_calls": None,
}
)
prior_memory = mem_text
staging.write("module_1", rows)
# ------------------------------------------------------------------
# Task derivation + rephrasings
# ------------------------------------------------------------------
_PLACEHOLDER_TASKS: frozenset[str] = frozenset(
{
"debug",
"test",
"tbd",
"todo",
"n/a",
"na",
"untitled",
"unnamed",
"default",
"placeholder",
}
)
def _resolve_effective_task(self, record: EpisodeRecord) -> str:
"""Decide which task string drives Module 1 for this episode.
Returns the user-supplied ``record.episode_task`` unless
``derive_task_from_video`` says otherwise (see config docstring).
Falls back gracefully to the canonical task if video derivation
fails.
"""
canonical = (record.episode_task or "").strip()
mode = (self.config.derive_task_from_video or "off").strip().lower()
if mode == "always":
derived = self._derive_task_from_video(record)
return derived or canonical
if mode == "if_short" and self._task_seems_bad(canonical):
derived = self._derive_task_from_video(record)
if derived:
return derived
return canonical
def _task_seems_bad(self, task: str) -> bool:
if not task:
return True
if len(task.split()) < int(self.config.derive_task_min_words):
return True
return task.lower() in self._PLACEHOLDER_TASKS
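# Illustrative examples (assuming derive_task_min_words=3): "debug" and
# "pick cube" are flagged as bad tasks; "place the cube in the bin" is not.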
# ------------------------------------------------------------------
# VLM call helpers (factored out: every Module-1 prompt below follows
# the same "build messages → single VLM call → pull a named field"
# shape, only differing in field name + post-processing).
# ------------------------------------------------------------------
def _vlm_field(self, messages: list[dict[str, Any]], field: str) -> Any:
"""Run a single VLM call and return ``result[field]`` or ``None``.
Centralizes the ``vlm.generate_json([m])[0]`` + ``isinstance(dict)``
dance every prompt-call site needs.
"""
result = self.vlm.generate_json([messages])[0]
if isinstance(result, dict):
return result.get(field)
return None
@staticmethod
def _text_message(text: str) -> list[dict[str, Any]]:
"""One-shot text-only user message wrapped for ``generate_json``."""
return [{"role": "user", "content": [{"type": "text", "text": text}]}]
def _video_message(self, record: EpisodeRecord, prompt: str) -> list[dict[str, Any]]:
"""User message combining the episode video block with ``prompt``."""
content = [*self._episode_video_block(record), {"type": "text", "text": prompt}]
return [{"role": "user", "content": content}]
def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
"""Ask the VLM "what is this video about" with no task hint at all."""
text = self._vlm_field(self._video_message(record, load_prompt("module_1_video_task")), "task")
return text.strip() if isinstance(text, str) and text.strip() else None
def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
"""Generate ``n`` text-only paraphrases of ``base_task``."""
if n <= 0 or not base_task:
return []
prompt = load_prompt("module_1_task_rephrasings").format(base_task=base_task, n=n)
raw = self._vlm_field(self._text_message(prompt), "rephrasings")
if not isinstance(raw, list):
return []
out = [item.strip().strip('"').strip("'") for item in raw if isinstance(item, str)]
return [s for s in out if s][:n]
def _episode_video_block(self, record: EpisodeRecord) -> list[dict[str, Any]]:
"""Same video block ``_generate_subtasks`` builds — extracted helper."""
if not record.frame_timestamps:
return []
if self.config.use_video_url and isinstance(self.frame_provider, VideoFrameProvider):
cache_dir = Path(self.frame_provider.root) / ".annotate_staging" / ".video_clips"
clip = self.frame_provider.episode_clip_path(record, cache_dir)
return (
to_video_url_block(f"file://{clip}", fps=self.config.use_video_url_fps)
if clip is not None
else []
)
episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
target_count = max(1, int(round(episode_duration * self.config.frames_per_second)))
target_count = min(target_count, self.config.max_video_frames)
video_frames = self.frame_provider.video_for_episode(record, target_count)
return to_video_block(video_frames)
def run_plan_updates(
self,
record: EpisodeRecord,
staging: EpisodeStaging,
interjection_times: Sequence[float],
interjection_texts: Sequence[str] | None = None,
) -> None:
"""Append additional ``plan`` rows at every interjection timestamp.
Plans refresh ONLY on user interjections — subtask generation
runs ~1 Hz at inference, but plan re-emission is event-driven.
Now also forwards the interjection's own text into the prompt so
the refreshed plan can actually reflect the user's correction
(the previous version told the model "an interjection happened"
without telling it what the user said).
"""
existing = staging.read("module_1")
# Pass the episode's last frame timestamp so the final subtask
# span is closed (otherwise its ``end`` equals its ``start``,
# zero duration, and the "current subtask at refresh_t" lookup
# in ``_generate_plan`` misses any refresh that lands inside it).
episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
spans = reconstruct_subtask_spans(existing, episode_end_t=episode_end_t)
already_planned: set[float] = {float(r["timestamp"]) for r in existing if r.get("style") == "plan"}
new_rows = list(existing)
texts: list[str | None] = (
[None] * len(interjection_times)
if interjection_texts is None
else [str(t) if t else None for t in interjection_texts]
)
for raw_t, inter_text in zip(interjection_times, texts, strict=True):
t = snap_to_frame(raw_t, record.frame_timestamps)
if t in already_planned:
continue
already_planned.add(t)
plan_text = self._generate_plan(record, spans, refresh_t=t, interjection=inter_text)
if plan_text is not None:
new_rows.append(
{
"role": "assistant",
"content": plan_text,
"style": "plan",
"timestamp": t,
"tool_calls": None,
}
)
staging.write("module_1", new_rows)
def _generate_subtasks(self, record: EpisodeRecord, *, task: str | None = None) -> list[dict[str, Any]]:
if record.row_count == 0 or not record.frame_timestamps:
return []
episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
prompt = load_prompt("module_1_subtasks").format(
episode_task=(task if task is not None else record.episode_task),
min_subtask_seconds=self.config.min_subtask_seconds,
max_steps=self.config.plan_max_steps,
episode_duration=f"{episode_duration:.3f}",
)
spans = self._vlm_field(self._video_message(record, prompt), "subtasks")
if not spans:
return []
# clamp to [t0, t_last] and sort
t0 = record.frame_timestamps[0]
t_last = record.frame_timestamps[-1]
cleaned: list[dict[str, Any]] = []
for span in spans:
try:
start = float(span["start"])
end = float(span["end"])
text = str(span["text"]).strip()
except (KeyError, ValueError, TypeError):
continue
start = max(t0, min(start, t_last))
end = max(t0, min(end, t_last))
if end < start:
start, end = end, start
if not text:
continue
cleaned.append({"text": text, "start": start, "end": end})
cleaned.sort(key=lambda s: s["start"])
return cleaned
def _generate_plan(
self,
record: EpisodeRecord,
subtask_spans: Sequence[dict[str, Any]],
*,
refresh_t: float | None = None,
interjection: str | None = None,
task: str | None = None,
) -> str | None:
if not subtask_spans:
return None
subtasks_text = "\n".join(f"- {s['text']}" for s in subtask_spans)
prompt = load_prompt("module_1_plan").format(
episode_task=(task if task is not None else record.episode_task),
subtasks_text=subtasks_text,
plan_max_steps=self.config.plan_max_steps,
)
if refresh_t is not None:
# ``current_subtask`` is the span the refresh time falls into,
# so the model knows where in the demonstration the planner is
# standing when it re-emits.
current_subtask = ""
for span in subtask_spans:
if float(span["start"]) <= refresh_t and (
"end" not in span or float(span["end"]) > refresh_t
):
current_subtask = span.get("text", "")
break
if interjection:
prompt += (
f"\n\n(Plan refresh at t={refresh_t:.2f}s after a user "
f"interjection: {interjection!r}. Current subtask just "
f"before the interjection: {current_subtask!r}. Update "
f"the plan so it reflects the interjection — drop or "
f"reorder steps as needed; do not just restate.)\n"
)
else:
# Refresh without an interjection text: still tell the model
# where in the episode the plan stands so the re-emission
# is grounded. Should be rare — plan refreshes are
# interjection-driven by design.
prompt += f"\n\n(Plan refresh at t={refresh_t:.2f}s. Current subtask: {current_subtask!r}.)\n"
plan = self._vlm_field(self._text_message(prompt), "plan")
return plan.strip() if isinstance(plan, str) else None
def _generate_memory(
self,
record: EpisodeRecord,
prior_memory: str,
completed: str,
remaining: Sequence[str],
*,
task: str | None = None,
) -> str:
prompt = load_prompt("module_1_memory").format(
episode_task=(task if task is not None else record.episode_task),
prior_memory=prior_memory or "(none)",
completed_subtask=completed,
remaining_subtasks=", ".join(remaining) if remaining else "(none)",
)
memory = self._vlm_field(self._text_message(prompt), "memory")
return memory.strip() if isinstance(memory, str) else ""
@@ -0,0 +1,33 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Prompt templates loaded as plain text.
One file per use site. Templates use ``str.format(**vars)`` substitution; we
intentionally avoid jinja2 here so the templates remain inspectable in
plain editors and roundtrip cleanly through ``ruff format``.
"""
from __future__ import annotations
from pathlib import Path
_DIR = Path(__file__).parent
def load(name: str) -> str:
"""Read prompt template ``name.txt`` from the ``prompts/`` directory."""
path = _DIR / f"{name}.txt"
return path.read_text(encoding="utf-8")
@@ -0,0 +1,25 @@
You are updating the robot's compressed semantic memory at the boundary of
a completed subtask.
Reference (verbatim from MEM, Torne 2026):
"Remove or compress information in the language memory whenever
appropriate. Keep ONLY the minimal set of relevant information for future
task execution. Specific object attributes (colors, precise quantities of
each item) get discarded when their details won't affect subsequent
actions. Functional outcomes (where items went, how many) are preserved."
Concrete example from MEM:
Before: "I put a light green bowl, a dark blue bowl and a bright yellow
bowl into the top right cabinet"
After: "I placed three bowls in the top right cabinet"
Episode task: "{episode_task}"
Previous memory: {prior_memory}
Just-completed subtask: "{completed_subtask}"
Remaining subtasks (for relevance judgement only): {remaining_subtasks}
Update the memory. Drop irrelevant detail. Compress completed steps.
Keep WHAT happened, drop HOW. Shorter is better.
Output strictly valid JSON:
{{ "memory": "<one or two short sentences>" }}
@@ -0,0 +1,18 @@
You are the high-level planner for a robot demonstrating: "{episode_task}".
Given the subtask decomposition below, write a concise hierarchical PLAN
the robot should follow. Format the plan as a numbered list, one line per
high-level step. The plan describes the full task; subtasks are the atomic
skills used to execute it.
Subtasks for context:
{subtasks_text}
Authoring rules:
- 3 to {plan_max_steps} steps.
- Each step describes one logical chunk of the task, not one motion.
- Steps must be in execution order.
- Plain prose, no JSON, no markdown headers.
Output strictly valid JSON:
{{ "plan": "1. ...\n2. ...\n3. ..." }}
@@ -0,0 +1,33 @@
You are labeling a teleoperated robot demonstration.
The user originally asked: "{episode_task}"
You are shown the entire demonstration as a single video. Watch the
whole clip, then segment it into a list of consecutive atomic subtasks
the robot performs.
Authoring rules — based on Hi Robot (Shi 2025) atom granularity and
Pi0.7 (Physical Intelligence 2025) "how, not what" detail:
- Each subtask is one atomic skill the low-level policy can execute,
e.g. "pick up one piece of lettuce", "place the bowl into the box",
"move the right arm to the left".
- Capture HOW the subtask is performed, not only WHAT — e.g. prefer
"grasp the handle of the sponge with the left hand" to "pick up the
sponge".
- Subtasks are non-overlapping and cover the full episode in order.
Choose the cut points yourself based on what you see in the video
(gripper open/close events, contact, regrasps, transitions).
- Each subtask spans at least {min_subtask_seconds} seconds.
- Do not exceed {max_steps} subtasks total.
- Every subtask's [start, end] interval must lie within
[0.0, {episode_duration}] seconds.
Output strictly valid JSON of shape:
{{
"subtasks": [
{{"text": "<how-not-what>", "start": <float>, "end": <float>}},
...
]
}}
@@ -0,0 +1,32 @@
You are generating training data for a Hi Robot-style policy. We need
{n} alternative phrasings of the same robot task so the policy sees
diverse user prompts during training instead of the same canonical
string repeated every frame.
Original task:
"{base_task}"
Generate exactly {n} alternative phrasings of the same task. Vary:
- formality (casual / polite / curt)
- verbosity (short imperative vs longer polite request)
- word choice (synonyms, different verbs)
- sentence structure (imperative / question / suggestion)
Hard rules:
- Each phrasing MUST preserve the exact meaning of the original task.
Do not change which object is involved, the destination, or the
action. Do not add extra steps. Do not invent new objects.
- Each phrasing must be a single short sentence, plain prose, no
markdown, no quotes, no list numbers.
- Phrasings must be distinct — no near-duplicates.
- Output exactly {n} entries.
Output strictly valid JSON:
{{
"rephrasings": [
"<phrasing 1>",
"<phrasing 2>",
...
]
}}
@@ -0,0 +1,17 @@
The video above shows a robot manipulation episode in full. Look at
the entire video and describe in ONE concise sentence what the robot
is doing.
Rules:
- One sentence, in natural English, like a user instruction.
- Capture the goal of the demonstration, not low-level motions.
Example: "place the yellow cube into the red bin" — not "move the
end-effector down 5cm and close the gripper".
- 4 to 15 words. Plain prose, no markdown, no bullets, no quotes.
- Do not invent objects or actions that aren't visible.
- Do not output anything other than the JSON object below.
Output strictly valid JSON:
{{
"task": "<single concise sentence describing what the robot does in this video>"
}}
@@ -0,0 +1,10 @@
The user just asked the robot: "{episode_task}".
Generate a short verbal acknowledgement the robot would speak back before
beginning the task. Style: confident, friendly, single short sentence.
Examples (Hi Robot, Shi 2025): "Sure, I won't put cheese on it.",
"OK, starting with the sponge.", "Got it.".
Output strictly valid JSON:
{{ "text": "<the spoken acknowledgement>" }}
@@ -0,0 +1,46 @@
You are generating training data for a Hi Robot-style hierarchical
robot policy. The robot in this demonstration has ALREADY executed
every step shown in the video — we cannot retroactively change the
action stream. To keep training data consistent with the video, the
"interjection" must align with what the robot is *about to do next* in
the demonstration, framed as a natural mid-task user request.
The episode's overall task: "{episode_task}".
The images above show roughly {window_seconds:.1f} seconds straddling a
subtask boundary in the demonstration:
- Subtask the robot just finished: "{prev_subtask}"
- Subtask the robot is about to start: "{next_subtask}"
- Time into episode: {timestamp:.2f}s
Write ONE interjection the user would naturally say at this moment to
prompt / confirm / encourage the robot to do "{next_subtask}". Phrase it
like a real human mid-task remark — conversational, varied, sometimes
just a nudge, sometimes a clarification, sometimes a small constraint
that the upcoming motion happens to satisfy. Plus the robot's verbal
acknowledgement.
Hard rules:
- The interjection MUST be consistent with the next subtask. The user
cannot ask for something different from what the robot then does in
the video. If you're tempted to say "actually skip X" or "do Y
instead", DO NOT — those would contradict the demonstration.
- The interjection must reference an object, location, or action that
is plausible given the visible scene and the next subtask text.
- One sentence each. Conversational, not robotic.
Style examples (vary the phrasing — don't reuse these verbatim):
- "Now go ahead and {next_subtask}."
- "Great, can you {next_subtask} next?"
- "{next_subtask}, please."
- "Before you continue, please {next_subtask}."
- "Looking good — {next_subtask} now."
- "Okay, {next_subtask}."
Output strictly valid JSON:
{{
"interjection": "<single sentence the user says, asking for the next subtask>",
"speech": "<single sentence the robot speaks back, confirming and starting>"
}}
@@ -0,0 +1,32 @@
You are generating a frame-grounded visual question/answer pair for
chain-of-thought training. Reference: ECoT (Zawalski 2024) and Steerable
Policies — both train policies on grounded features such as bounding box
pixel coordinates, keypoints, counts, attributes, and spatial relations.
The frame shows a robot working on: "{episode_task}".
Question types and the EXACT answer JSON shape required for each:
bbox => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
"bbox": [x1, y1, x2, y2]}}, ...]}}
bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
ECoT example: "a white cup [124, 25, 176, 113]".
keypoint => {{"label": "<point>", "point_format": "xy",
"point": [x, y]}}
count => {{"label": "<obj>", "count": <int>,
"note": "<optional short note>"}}
attribute => {{"label": "<obj>", "attribute": "<color|shape|state|...>",
"value": "<observed value>"}}
spatial => {{"subject": "<obj>", "relation": "<left_of|right_of|on|in|above|below|near>",
"object": "<obj>"}}
Generate a question of type "{question_type}". Output strictly valid JSON:
{{
"question": "<short, frame-grounded question>",
"answer": <object whose shape matches the schema above>
}}
@@ -0,0 +1,268 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Datatrove-shaped reader.
The reader walks ``data/chunk-*/file-*.parquet`` and yields one record per
episode containing:
- ``episode_index``: int
- ``frame_timestamps``: tuple[float, ...]
- ``frame_indices``: tuple[int, ...]
- ``episode_task``: str (canonical task from ``meta/tasks.parquet``)
- ``data_path``: pathlib.Path of the source parquet shard
- ``frames_df``: pandas.DataFrame slice for the episode (only loaded on demand)
This shape lets each module operate per-episode without loading all parquet
rows into memory at once.
"""
from __future__ import annotations
from collections.abc import Iterator, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
import pyarrow.parquet as pq
from lerobot.datasets.utils import DEFAULT_TASKS_PATH
@dataclass
class EpisodeRecord:
"""Per-episode record yielded by the reader."""
episode_index: int
episode_task: str
frame_timestamps: tuple[float, ...]
frame_indices: tuple[int, ...]
data_path: Path
row_offset: int # row offset within the parquet file where this episode starts
row_count: int # number of rows for this episode
# Memoized parquet slice — populated on first ``frames_df()`` call so
# repeat queries from different modules don't re-read the whole shard.
_frames_df_cache: Any = field(default=None, init=False, repr=False, compare=False)
def frames_df(self): # type: ignore[no-untyped-def]
"""Lazy-load the pandas slice for this episode (memoized)."""
if self._frames_df_cache is None:
import pandas as pd # noqa: PLC0415 - deferred for optional dataset extra
table = pq.read_table(self.data_path)
df: pd.DataFrame = table.to_pandas()
self._frames_df_cache = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(
drop=True
)
return self._frames_df_cache
def reconstruct_subtask_spans(
rows: Sequence[dict[str, Any]],
*,
episode_end_t: float | None = None,
) -> list[dict[str, Any]]:
"""Turn ``style="subtask"`` rows into ``{text, start, end}`` spans.
Each span's ``end`` is the next span's ``start``. The final span's
``end`` defaults to its own ``start`` (zero-duration) — pass
``episode_end_t`` to extend it to the episode's last frame instead,
which is what downstream consumers (memory, interjection boundary
selection) expect.
Used by Module 1 (plan-update pass) and Module 2 (interjection
anchoring), which both need the same span shape.
"""
sorted_rows = sorted(
(r for r in rows if r.get("style") == "subtask"),
key=lambda r: float(r["timestamp"]),
)
spans: list[dict[str, Any]] = []
for r in sorted_rows:
t = float(r["timestamp"])
if spans:
spans[-1]["end"] = t
spans.append({"text": r.get("content") or "", "start": t, "end": t})
if spans and episode_end_t is not None and float(episode_end_t) > spans[-1]["start"]:
spans[-1]["end"] = float(episode_end_t)
return spans
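# Worked example (illustrative; the subtask texts are assumptions): two staged
# subtask rows at t=0.0 ("pick up the sponge") and t=4.2 ("wipe the table")
# with episode_end_t=9.6 reconstruct to
#   [{"text": "pick up the sponge", "start": 0.0, "end": 4.2},
#    {"text": "wipe the table",     "start": 4.2, "end": 9.6}]
# Without episode_end_t the second span would end at its own start (4.2).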
def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
"""Snap an arbitrary float to the nearest exact source frame timestamp.
Modules use this when emitting event-style rows so the row's
timestamp matches a real parquet frame (event rows must land on an
exact frame, see PR 1's "exact event matching" rule).
"""
if not frame_timestamps:
return float(t)
nearest = min(frame_timestamps, key=lambda f: abs(f - t))
return float(nearest)
def _load_tasks_lookup(root: Path) -> dict[int, str]:
tasks_path = root / DEFAULT_TASKS_PATH
if not tasks_path.exists():
return {}
table = pq.read_table(tasks_path)
cols = {name: table.column(name).to_pylist() for name in table.column_names}
if "task_index" in cols and "task" in cols:
return dict(zip(cols["task_index"], cols["task"], strict=True))
raise ValueError(f"meta/tasks.parquet at {tasks_path} missing 'task_index' or 'task'")
def iter_episodes(root: Path, *, only_episodes: tuple[int, ...] | None = None) -> Iterator[EpisodeRecord]:
"""Yield :class:`EpisodeRecord` for every episode under ``root/data/``.
Episodes are yielded in ascending ``episode_index`` order. The reader does
not assume a specific chunk/file layout: it scans every ``*.parquet``
under ``data/`` and groups by ``episode_index``.
"""
tasks = _load_tasks_lookup(root)
data_dir = root / "data"
parquet_files = sorted(data_dir.rglob("*.parquet"))
only_set = set(only_episodes) if only_episodes is not None else None
for path in parquet_files:
yield from _iter_one_path(path, tasks, only_set)
def _iter_one_path(path: Path, tasks: dict[int, str], only_set: set[int] | None) -> Iterator[EpisodeRecord]:
table = pq.read_table(path)
names = table.column_names
if "episode_index" not in names:
return
episode_col = table.column("episode_index").to_pylist()
timestamp_col = (
table.column("timestamp").to_pylist() if "timestamp" in names else [0.0] * len(episode_col)
)
frame_col = (
table.column("frame_index").to_pylist() if "frame_index" in names else list(range(len(episode_col)))
)
task_col = table.column("task_index").to_pylist() if "task_index" in names else None
def _build(
ep: int,
start: int,
end: int,
task_idx: int | None,
ts_buf: list[float],
fi_buf: list[int],
) -> EpisodeRecord | None:
if only_set is not None and ep not in only_set:
return None
task = tasks.get(task_idx, "") if task_idx is not None else ""
return EpisodeRecord(
episode_index=ep,
episode_task=task,
frame_timestamps=tuple(ts_buf),
frame_indices=tuple(fi_buf),
data_path=path,
row_offset=start,
row_count=end - start,
)
cur_ep: int | None = None
start_offset = 0
ts_buf: list[float] = []
fi_buf: list[int] = []
cur_task_idx: int | None = None
for i, ep in enumerate(episode_col):
if cur_ep is None:
cur_ep = ep
start_offset = i
ts_buf = [timestamp_col[i]]
fi_buf = [frame_col[i]]
cur_task_idx = task_col[i] if task_col is not None else None
continue
if ep != cur_ep:
rec = _build(cur_ep, start_offset, i, cur_task_idx, ts_buf, fi_buf)
if rec is not None:
yield rec
cur_ep = ep
start_offset = i
ts_buf = [timestamp_col[i]]
fi_buf = [frame_col[i]]
cur_task_idx = task_col[i] if task_col is not None else None
else:
ts_buf.append(timestamp_col[i])
fi_buf.append(frame_col[i])
if cur_ep is not None:
rec = _build(cur_ep, start_offset, len(episode_col), cur_task_idx, ts_buf, fi_buf)
if rec is not None:
yield rec
def gather_data_paths(root: Path) -> list[Path]:
"""Return every ``data/chunk-*/file-*.parquet`` path under ``root``."""
return sorted((root / "data").rglob("*.parquet"))
def episode_offsets_per_path(path: Path) -> dict[int, tuple[int, int]]:
"""Return ``{episode_index: (row_offset, row_count)}`` for one parquet."""
table = pq.read_table(path, columns=["episode_index"])
episode_col = table.column("episode_index").to_pylist()
out: dict[int, tuple[int, int]] = {}
cur_ep: int | None = None
start = 0
for i, ep in enumerate(episode_col):
if cur_ep is None:
cur_ep = ep
start = i
continue
if ep != cur_ep:
out[cur_ep] = (start, i - start)
cur_ep = ep
start = i
if cur_ep is not None:
out[cur_ep] = (start, len(episode_col) - start)
return out
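# Worked example (illustrative): a shard whose episode_index column reads
# [3, 3, 3, 7, 7] maps to {3: (0, 3), 7: (3, 2)}; episode 3 starts at row 0
# with 3 rows, episode 7 starts at row 3 with 2 rows.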
def keyframe_indices(record: EpisodeRecord, k: int) -> list[int]:
"""Return ``k`` evenly spaced row indices into the episode (relative)."""
n = record.row_count
if k <= 0 or n == 0:
return []
if k >= n:
return list(range(n))
if k == 1:
return [n // 2]
step = (n - 1) / (k - 1)
return [int(round(i * step)) for i in range(k)]
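# Worked example (illustrative): for an episode with row_count=10,
# keyframe_indices(record, 4) == [0, 3, 6, 9] and
# keyframe_indices(record, 1) == [5]; asking for k >= 10 returns every row index.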
def lookup_data_path(root: Path, episode_index: int) -> tuple[Path, int, int] | None:
"""Find the parquet file containing ``episode_index`` and its slice bounds."""
for path in gather_data_paths(root):
offsets = episode_offsets_per_path(path)
if episode_index in offsets:
start, count = offsets[episode_index]
return path, start, count
return None
def episode_frame_timestamps(root: Path, episode_index: int) -> tuple[Path, list[float]]:
"""Return the parquet path and per-frame timestamps for ``episode_index``."""
found = lookup_data_path(root, episode_index)
if found is None:
raise ValueError(f"Episode {episode_index} not found under {root}/data/")
path, start, count = found
table = pq.read_table(path, columns=["timestamp"])
timestamps = table.column("timestamp").to_pylist()[start : start + count]
return path, [float(t) for t in timestamps]
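# Minimal usage sketch (the dataset root is an assumption): iterate a local
# dataset and print one line per episode without materializing any frames_df.
#
#     for rec in iter_episodes(Path("/data/my_dataset"), only_episodes=(0, 1)):
#         print(rec.episode_index, rec.row_count, rec.episode_task)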
@@ -0,0 +1,104 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Per-episode staging.
Each module writes its raw output as a JSONL file under
``<staging_dir>/episode_{ep:06d}/<module>.jsonl``. The writer reads back this
staging tree and partitions rows into the two language columns.
JSONL is preferred over parquet here because the staging artifact is meant to
be human-inspectable, easy to diff between prompt iterations, and trivially
appended to. The final dataset format is parquet; staging is just an
intermediate.
"""
from __future__ import annotations
import json
from collections.abc import Iterable, Iterator
from dataclasses import dataclass
from pathlib import Path
from typing import Any
ModuleName = str
_MODULES: tuple[ModuleName, ...] = (
"module_1",
"module_2",
"module_3",
)
@dataclass
class EpisodeStaging:
"""Filesystem layout for a single episode's staged module outputs."""
root: Path
episode_index: int
@property
def episode_dir(self) -> Path:
return self.root / f"episode_{self.episode_index:06d}"
def path_for(self, module: ModuleName) -> Path:
if module not in _MODULES:
raise ValueError(f"Unknown module {module!r}; expected one of {_MODULES}")
return self.episode_dir / f"{module}.jsonl"
def write(self, module: ModuleName, rows: Iterable[dict[str, Any]]) -> Path:
path = self.path_for(module)
path.parent.mkdir(parents=True, exist_ok=True)
# Atomic replace: a crash mid-write would otherwise leave a
# half-written JSONL file that ``read()`` would then fail to
# parse. Write to a sibling .tmp and rename so the target path
# only ever points at a complete file.
tmp_path = path.with_suffix(path.suffix + ".tmp")
with tmp_path.open("w", encoding="utf-8") as f:
for row in rows:
f.write(json.dumps(row, ensure_ascii=False, sort_keys=True))
f.write("\n")
tmp_path.replace(path)
return path
def read(self, module: ModuleName) -> list[dict[str, Any]]:
path = self.path_for(module)
if not path.exists():
return []
out: list[dict[str, Any]] = []
with path.open(encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
out.append(json.loads(line))
return out
def read_all(self) -> dict[ModuleName, list[dict[str, Any]]]:
return {m: self.read(m) for m in _MODULES}
def has(self, module: ModuleName) -> bool:
return self.path_for(module).exists()
def iter_staged_episodes(root: Path) -> Iterator[int]:
"""Yield episode indices for which any staging artifact exists."""
if not root.exists():
return
for child in sorted(root.iterdir()):
if child.is_dir() and child.name.startswith("episode_"):
try:
yield int(child.name.removeprefix("episode_"))
except ValueError:
continue
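# Minimal usage sketch (the staging root and the row's content are assumptions):
#
#     staging = EpisodeStaging(Path(".annotate_staging"), episode_index=12)
#     staging.write("module_1", [{"role": "assistant", "style": "plan", "timestamp": 0.0,
#                                 "content": "1. ...", "tool_calls": None}])
#     # -> .annotate_staging/episode_000012/module_1.jsonl (written atomically)
#     rows = staging.read("module_1")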
@@ -0,0 +1,334 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Pre-write validation against staged outputs.
Runs after Modules 1-3 have all written their per-episode artifacts but
*before* the writer rewrites parquet shards. The validator never touches
parquet; it only inspects the staging tree and the source frame timestamps
exposed by :class:`EpisodeRecord`.
Checks (per the plan's "Intermediate staging and validation" section):
- exact timestamp alignment against source frame timestamps
- no orphan speech / interjection pairs
- plan / memory emission consistency (events have a paired persistent row)
- VQA assistant ``content`` is valid JSON (one of bbox / keypoint / count /
attribute / spatial)
- every row maps to its correct column under :func:`column_for_style`
"""
from __future__ import annotations
import json
import logging
from collections.abc import Iterable, Sequence
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from lerobot.datasets.language import (
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
column_for_style,
is_view_dependent_style,
validate_camera_field,
)
from .reader import EpisodeRecord
from .staging import EpisodeStaging
logger = logging.getLogger(__name__)
@dataclass
class ValidationReport:
"""Outcome of one validation pass across all episodes."""
errors: list[str] = field(default_factory=list)
warnings: list[str] = field(default_factory=list)
episodes_checked: int = 0
@property
def ok(self) -> bool:
return not self.errors
def add_error(self, message: str) -> None:
self.errors.append(message)
def add_warning(self, message: str) -> None:
self.warnings.append(message)
def summary(self) -> str:
return f"checked={self.episodes_checked} errors={len(self.errors)} warnings={len(self.warnings)}"
VQA_ANSWER_SHAPES: dict[str, set[str]] = {
"bbox": {"detections"},
"keypoint": {"label", "point_format", "point"},
"count": {"label", "count"},
"attribute": {"label", "attribute", "value"},
"spatial": {"subject", "relation", "object"},
}
def classify_vqa_answer(payload: Any) -> str | None:
"""Best-effort classification of a VQA answer payload to a question type."""
if not isinstance(payload, dict):
return None
keys = set(payload.keys())
for kind, required in VQA_ANSWER_SHAPES.items():
if required.issubset(keys):
return kind
return None
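# Worked example (illustrative): classify_vqa_answer({"label": "bowl", "count": 2})
# returns "count"; classify_vqa_answer({"label": "bowl"}) returns None because no
# required key set is fully present.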
@dataclass
class StagingValidator:
"""Walks the staging tree and produces a :class:`ValidationReport`."""
timestamp_atol: float = 0.0 # exact-match by default
dataset_camera_keys: tuple[str, ...] | None = None
"""Known ``observation.images.*`` keys on the dataset. When set, the
validator additionally enforces that every view-dependent row's
``camera`` field references one of these keys. Pass ``None`` (default)
to skip that cross-check (e.g. in unit tests with no real dataset)."""
def validate(
self,
records: Sequence[EpisodeRecord],
staging_dir: Path,
) -> ValidationReport:
report = ValidationReport()
for record in records:
self._validate_episode(record, staging_dir, report)
report.episodes_checked += 1
return report
def _validate_episode(
self,
record: EpisodeRecord,
staging_dir: Path,
report: ValidationReport,
) -> None:
staging = EpisodeStaging(staging_dir, record.episode_index)
staged = staging.read_all()
all_rows: list[dict[str, Any]] = []
for module_name, rows in staged.items():
for row in rows:
row = {**row, "_module": module_name}
all_rows.append(row)
frame_ts = set(record.frame_timestamps)
events: list[dict[str, Any]] = []
persistent: list[dict[str, Any]] = []
for row in all_rows:
self._check_column_routing(row, report, record.episode_index)
self._check_camera_field(
row, report, record.episode_index, self.dataset_camera_keys
)
if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
persistent.append(row)
else:
events.append(row)
for row in events:
self._check_event_timestamp_alignment(row, frame_ts, report, record.episode_index)
self._check_speech_interjection_pairs(events, report, record.episode_index)
self._check_plan_memory_consistency(persistent, events, report, record.episode_index)
self._check_vqa_json(events, report, record.episode_index)
self._check_vqa_uniqueness_per_frame_camera(events, report, record.episode_index)
def _check_camera_field(
self,
row: dict[str, Any],
report: ValidationReport,
episode_index: int,
dataset_camera_keys: Sequence[str] | None,
) -> None:
"""Enforce the camera invariant + that the key matches the dataset's cameras."""
style = row.get("style")
camera = row.get("camera")
try:
validate_camera_field(style, camera)
except ValueError as exc:
report.add_error(
f"ep={episode_index} module={row.get('_module')}: {exc}"
)
return
if (
is_view_dependent_style(style)
and dataset_camera_keys
and camera not in dataset_camera_keys
):
report.add_error(
f"ep={episode_index} module={row.get('_module')}: camera {camera!r} on style "
f"{style!r} is not one of the dataset's video keys {sorted(dataset_camera_keys)!r}"
)
def _check_vqa_uniqueness_per_frame_camera(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
"""Ensure at most one (vqa, user) and one (vqa, assistant) per (t, camera)."""
counts: dict[tuple[float, str, str], int] = {}
for row in events:
if row.get("style") != "vqa":
continue
ts = row.get("timestamp")
camera = row.get("camera")
role = row.get("role")
if ts is None or camera is None or role is None:
continue # other validators flag these
key = (float(ts), str(camera), str(role))
counts[key] = counts.get(key, 0) + 1
for (ts, camera, role), n in counts.items():
if n > 1:
report.add_error(
f"ep={episode_index}: {n} duplicate vqa rows at t={ts} "
f"camera={camera!r} role={role!r}; expected at most one per (t, camera, role)"
)
def _check_column_routing(
self,
row: dict[str, Any],
report: ValidationReport,
episode_index: int,
) -> None:
style = row.get("style")
module = row.get("_module")
try:
target_col = column_for_style(style)
except ValueError:
report.add_error(f"ep={episode_index} module={module}: unknown style {style!r}")
return
if module == "module_1" and target_col != LANGUAGE_PERSISTENT:
report.add_error(
f"ep={episode_index} module=module_1 emitted style {style!r} that routes to {target_col} (must be persistent)"
)
if module in {"module_2", "module_3"} and target_col != LANGUAGE_EVENTS:
report.add_error(
f"ep={episode_index} module={module} emitted style {style!r} that routes to {target_col} (must be events)"
)
def _check_event_timestamp_alignment(
self,
row: dict[str, Any],
frame_ts: set[float],
report: ValidationReport,
episode_index: int,
) -> None:
ts = row.get("timestamp")
if ts is None:
report.add_error(f"ep={episode_index}: event row missing timestamp: {row!r}")
return
if self.timestamp_atol == 0.0:
if float(ts) not in frame_ts:
report.add_error(
f"ep={episode_index}: event row timestamp {ts!r} does not match any source frame timestamp"
)
else:
if not any(abs(float(ts) - f) <= self.timestamp_atol for f in frame_ts):
report.add_error(
f"ep={episode_index}: event row timestamp {ts!r} not within {self.timestamp_atol}s of any frame"
)
def _check_speech_interjection_pairs(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
speech_ts: dict[float, int] = {}
interjection_ts: dict[float, int] = {}
for row in events:
ts = row.get("timestamp")
if ts is None:
continue
ts_f = float(ts)
if row.get("style") is None and row.get("role") == "assistant":
speech_ts[ts_f] = speech_ts.get(ts_f, 0) + 1
if row.get("style") == "interjection":
interjection_ts[ts_f] = interjection_ts.get(ts_f, 0) + 1
for ts in interjection_ts:
if ts not in speech_ts:
report.add_error(f"ep={episode_index}: interjection at t={ts} has no paired speech atom")
def _check_plan_memory_consistency(
self,
persistent: Sequence[dict[str, Any]],
events: Sequence[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
plan_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "plan"})
memory_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "memory"})
subtask_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "subtask"})
interjection_ts = sorted(
{
float(r["timestamp"])
for r in events
if r.get("style") == "interjection" and r.get("timestamp") is not None
}
)
if persistent and not plan_ts:
report.add_warning(f"ep={episode_index}: persistent rows present but no plan emitted")
# every interjection should have a same-timestamp plan refresh
for ts in interjection_ts:
if ts not in set(plan_ts):
report.add_error(
f"ep={episode_index}: interjection at t={ts} has no co-timestamped plan update"
)
# memory should be emitted at subtask boundaries (subset relation)
if memory_ts and subtask_ts:
mem_set = set(memory_ts)
sub_set = set(subtask_ts)
stray = sorted(mem_set - sub_set)
if stray:
report.add_warning(f"ep={episode_index}: memory rows at {stray} not at any subtask boundary")
def _check_vqa_json(
self,
events: Iterable[dict[str, Any]],
report: ValidationReport,
episode_index: int,
) -> None:
for row in events:
if row.get("style") != "vqa" or row.get("role") != "assistant":
continue
content = row.get("content")
if content is None:
report.add_error(
f"ep={episode_index}: VQA assistant row at t={row.get('timestamp')} has null content"
)
continue
try:
payload = json.loads(content)
except (TypeError, ValueError) as exc:
report.add_error(
f"ep={episode_index}: VQA assistant content not valid JSON at t={row.get('timestamp')}: {exc}"
)
continue
shape = classify_vqa_answer(payload)
if shape is None:
report.add_error(
f"ep={episode_index}: VQA assistant payload at t={row.get('timestamp')} does not match any known shape: keys={list(payload) if isinstance(payload, dict) else type(payload).__name__}"
)
@@ -0,0 +1,723 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Shared Qwen-VL client.
The pipeline uses a single shared VLM across modules. vLLM is preferred when
available (high throughput); transformers is the fallback. A ``stub`` backend
is used for unit tests so fixtures never call into a real model.
The client speaks one method, :meth:`VlmClient.generate_json`, which:
- accepts a list of OpenAI/HF-style multimodal messages,
- parses each reply as JSON (``<think>`` blocks and code fences are stripped),
- batches requests transparently,
- and reprompts once on a JSON parse failure with an inline correction
message; if the retry still fails to parse, that request yields ``None``.
"""
from __future__ import annotations
import json
import os
import threading
from collections.abc import Callable, Sequence
from dataclasses import dataclass
from typing import Any, Protocol
from .config import VlmConfig
class VlmClient(Protocol):
"""Protocol every backend must implement."""
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
"""Generate one JSON-decoded response per messages list."""
@dataclass
class StubVlmClient:
"""Deterministic stub used in unit tests.
A test passes a callable that receives the full message list for a single
request and returns a JSON-serializable response.
"""
responder: Callable[[Sequence[dict[str, Any]]], Any]
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
return [self.responder(list(messages)) for messages in messages_batch]
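# Hedged usage sketch (test-fixture style): one decoded payload comes back
# per messages list, straight from the responder.
#
#     >>> stub = StubVlmClient(lambda messages: {"subtasks": ["reach", "grasp"]})
#     >>> stub.generate_json([[{"role": "user", "content": "segment this"}]])
#     [{'subtasks': ['reach', 'grasp']}]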
def _strip_to_json(text: str) -> Any:
text = text.strip()
# Strip <think>...</think> blocks (Qwen3 Thinking style)
while "<think>" in text and "</think>" in text:
start = text.find("<think>")
end = text.find("</think>", start) + len("</think>")
text = (text[:start] + text[end:]).strip()
# Strip ```json ... ``` fences from chat-tuned backbones
if text.startswith("```"):
first = text.find("\n")
last = text.rfind("```")
if first != -1 and last != -1 and last > first:
text = text[first + 1 : last].strip()
try:
return json.loads(text)
except (ValueError, json.JSONDecodeError):
pass
# Fall back to extracting the first balanced {...} block.
obj_text = _extract_first_json_object(text)
if obj_text is None:
raise json.JSONDecodeError("No JSON object found", text, 0)
return json.loads(obj_text)
def _extract_first_json_object(text: str) -> str | None:
"""Return the first balanced ``{...}`` substring, ignoring braces in
string literals. Returns ``None`` if no balanced block is found."""
start = text.find("{")
if start < 0:
return None
depth = 0
in_string = False
escape = False
for i in range(start, len(text)):
ch = text[i]
if escape:
escape = False
continue
if ch == "\\":
escape = True
continue
# Note: ``escape`` is always False here — the ``if escape`` branch
# above already handled and reset it.
if ch == '"':
in_string = not in_string
continue
if in_string:
continue
if ch == "{":
depth += 1
elif ch == "}":
depth -= 1
if depth == 0:
return text[start : i + 1]
return None
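# Illustrative recovery paths (hedged; mirrors the stripping order above):
#
#     >>> _strip_to_json('<think>plan...</think>```json\n{"a": 1}\n```')
#     {'a': 1}
#     >>> _strip_to_json('Sure, here it is: {"a": {"b": 2}} hope that helps')
#     {'a': {'b': 2}}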
@dataclass
class _GenericTextClient:
"""Wraps any text-generation callable in JSON-mode + one-retry semantics."""
generate_text: Callable[[Sequence[Sequence[dict[str, Any]]], int, float], list[str]]
config: VlmConfig
def generate_json(
self,
messages_batch: Sequence[Sequence[dict[str, Any]]],
*,
max_new_tokens: int | None = None,
temperature: float | None = None,
) -> list[Any]:
max_tok = max_new_tokens if max_new_tokens is not None else self.config.max_new_tokens
temp = temperature if temperature is not None else self.config.temperature
raw = self.generate_text(messages_batch, max_tok, temp)
out: list[Any] = []
for messages, text in zip(messages_batch, raw, strict=True):
try:
out.append(_strip_to_json(text))
continue
except (ValueError, json.JSONDecodeError):
pass
retry = list(messages) + [
{"role": "assistant", "content": text},
{
"role": "user",
"content": (
"Your previous reply was not valid JSON. "
"Reply with strictly valid JSON, no prose, no fences."
),
},
]
retry_text = self.generate_text([retry], max_tok, temp)[0]
try:
out.append(_strip_to_json(retry_text))
except (ValueError, json.JSONDecodeError):
# After retry: log preview and return None instead of crashing
# the whole pipeline. Modules treat None as "skip".
preview = retry_text.strip().replace("\n", " ")[:200]
print(
f"[vlm] WARNING: failed to parse JSON after retry; preview: {preview!r}",
flush=True,
)
out.append(None)
return out
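# Minimal retry sketch (hedged: assumes ``VlmConfig()`` constructs with usable
# defaults). The toy generator returns junk on the first pass and valid JSON
# on the correction turn, so the single-retry path produces a parsed result.
#
#     >>> calls = {"n": 0}
#     >>> def toy_gen(batch, max_tok, temp):
#     ...     calls["n"] += 1
#     ...     return ["not json" if calls["n"] == 1 else '{"ok": true}' for _ in batch]
#     >>> client = _GenericTextClient(toy_gen, VlmConfig())
#     >>> client.generate_json([[{"role": "user", "content": "hello"}]])
#     [{'ok': True}]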
def make_vlm_client(config: VlmConfig) -> VlmClient:
"""Build the shared VLM client per the configured backend.
For ``stub``, callers should construct :class:`StubVlmClient` directly with
a responder callable. ``stub`` here is rejected to make accidental misuse
obvious.
"""
if config.backend == "stub":
raise ValueError(
"Use StubVlmClient(...) directly for the stub backend; make_vlm_client builds real clients."
)
if config.backend == "vllm":
return _make_vllm_client(config)
if config.backend == "transformers":
return _make_transformers_client(config)
if config.backend == "openai":
return _make_openai_client(config)
raise ValueError(f"Unknown VLM backend: {config.backend!r}")
def _make_vllm_client(config: VlmConfig) -> VlmClient:
try:
from vllm import LLM, SamplingParams # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError(
"vllm is required for backend='vllm'. Install with `pip install lerobot[annotations]`."
) from exc
# Workaround for cuDNN 9.x + torch 2.8 conv3d regression that surfaces
# as CUDNN_STATUS_NOT_INITIALIZED in Qwen-VL vision-tower patch
# embedders. Setting LEROBOT_DISABLE_CUDNN=1 forces native PyTorch
# convolution kernels — slower but functional.
import os as _os # noqa: PLC0415
if _os.environ.get("LEROBOT_DISABLE_CUDNN", "").lower() in {"1", "true", "yes"}:
import torch as _torch # noqa: PLC0415
_torch.backends.cudnn.enabled = False
llm_kwargs: dict[str, Any] = {
"model": config.model_id,
"tensor_parallel_size": config.tensor_parallel_size,
"gpu_memory_utilization": config.gpu_memory_utilization,
"trust_remote_code": config.trust_remote_code,
}
if config.max_model_len is not None:
llm_kwargs["max_model_len"] = config.max_model_len
llm = LLM(**llm_kwargs)
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
# ``guided_decoding`` would speed up parsing but its API differs across
# vllm releases (dict vs GuidedDecodingParams). The _GenericTextClient
# wrapper already has a one-retry JSON-recovery path, so we skip it.
params = SamplingParams(max_tokens=max_tok, temperature=temp)
# ``llm.chat`` handles chat-template application + multimodal input
# extraction (image/video blocks) internally, which ``llm.generate``
# does not.
outputs = llm.chat([list(m) for m in batch], params)
return [o.outputs[0].text for o in outputs]
return _GenericTextClient(_gen, config)
def _make_transformers_client(config: VlmConfig) -> VlmClient:
try:
import torch # type: ignore[import-not-found]
import transformers # type: ignore[import-not-found]
from transformers import AutoProcessor # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError("transformers + torch are required for backend='transformers'.") from exc
auto_cls = getattr(transformers, "AutoModelForImageTextToText", None) or getattr(
transformers, "AutoModelForVision2Seq", None
)
if auto_cls is None:
raise ImportError(
"Neither AutoModelForImageTextToText nor AutoModelForVision2Seq is available in this "
"transformers version. Install transformers>=4.45 (which has AutoModelForImageTextToText) "
"for VL models."
)
processor = AutoProcessor.from_pretrained(config.model_id, trust_remote_code=config.trust_remote_code)
import os as _os # noqa: PLC0415
use_accelerate = _os.environ.get("LEROBOT_TRANSFORMERS_DEVICE_MAP", "manual") != "manual"
# ``device_map='auto'`` triggers a known std::bad_alloc on the Qwen3-VL
# post-load dispatch path (the alloc fails in accelerate's hook setup
# even with TBs of host RAM). Default to manual: load on CPU with
# ``low_cpu_mem_usage=True``, then ``.to("cuda")``. Set
# ``LEROBOT_TRANSFORMERS_DEVICE_MAP=auto`` to opt back into the old path.
if use_accelerate:
model = auto_cls.from_pretrained(
config.model_id,
torch_dtype="auto",
device_map="auto",
low_cpu_mem_usage=True,
trust_remote_code=config.trust_remote_code,
)
else:
import torch as _torch # noqa: PLC0415
model = auto_cls.from_pretrained(
config.model_id,
torch_dtype=_torch.bfloat16,
low_cpu_mem_usage=True,
trust_remote_code=config.trust_remote_code,
)
model = model.to("cuda")
model.eval()
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
outs: list[str] = []
for messages in batch:
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], return_tensors="pt").to(model.device)
with torch.no_grad():
gen = model.generate(
**inputs,
max_new_tokens=max_tok,
temperature=temp,
do_sample=temp > 0.0,
)
decoded = processor.batch_decode(
gen[:, inputs["input_ids"].shape[-1] :], skip_special_tokens=True
)[0]
outs.append(decoded)
return outs
return _GenericTextClient(_gen, config)
def _make_openai_client(config: VlmConfig) -> VlmClient:
"""Backend that talks to any OpenAI-compatible server.
Compatible with ``vllm serve``, ``transformers serve``,
``ktransformers serve``, and hosted endpoints. By default the server
is expected to be already running. Set ``auto_serve=True`` to have
this client spawn one (default: ``transformers serve``), wait until
it's ready, and tear it down on process exit.
Image blocks ``{"type":"image", "image":<PIL.Image>}`` are
auto-converted to ``image_url`` data-URLs. Video blocks
``{"type":"video", "video":[<PIL>...]}`` are forwarded as
multi-frame ``video_url`` items where supported.
"""
try:
from openai import OpenAI # type: ignore[import-not-found]
except ImportError as exc:
raise ImportError(
"openai package is required for backend='openai'. Install with `pip install openai`."
) from exc
api_base = config.api_base
api_key = config.api_key
auto_serve = config.auto_serve
api_bases: list[str] = [api_base]
print(
f"[lerobot-annotate] backend=openai model={config.model_id} "
f"api_base={api_base} auto_serve={auto_serve}",
flush=True,
)
if auto_serve:
if config.parallel_servers > 1:
print(
f"[lerobot-annotate] spawning {config.parallel_servers} parallel servers",
flush=True,
)
api_bases = _spawn_parallel_inference_servers(config)
elif _server_is_up(api_base):
print(f"[lerobot-annotate] reusing server already up at {api_base}", flush=True)
else:
print("[lerobot-annotate] no server reachable; spawning one", flush=True)
api_base = _spawn_inference_server(config)
api_bases = [api_base]
print(f"[lerobot-annotate] server ready at {api_base}", flush=True)
clients = [OpenAI(base_url=base, api_key=api_key) for base in api_bases]
# round-robin counter for parallel mode
rr_counter = {"i": 0}
# ``mm_processor_kwargs`` is a vllm-specific extra; transformers serve
# rejects it with HTTP 422. Send it only when explicitly opted in via
# an env var (e.g. ``LEROBOT_OPENAI_SEND_MM_KWARGS=1`` for vllm).
send_mm_kwargs = os.environ.get("LEROBOT_OPENAI_SEND_MM_KWARGS", "").lower() in {"1", "true", "yes"}
rr_lock = threading.Lock()
def _one_call(messages: Sequence[dict[str, Any]], max_tok: int, temp: float) -> str:
api_messages, mm_kwargs = _to_openai_messages(messages)
kwargs: dict[str, Any] = {
"model": config.model_id,
"messages": api_messages,
"max_tokens": max_tok,
"temperature": temp,
}
extra_body: dict[str, Any] = {}
if send_mm_kwargs and mm_kwargs:
extra_body["mm_processor_kwargs"] = {**mm_kwargs, "do_sample_frames": True}
if config.chat_template_kwargs:
extra_body["chat_template_kwargs"] = config.chat_template_kwargs
if extra_body:
kwargs["extra_body"] = extra_body
with rr_lock:
chosen = clients[rr_counter["i"] % len(clients)]
rr_counter["i"] += 1
response = chosen.chat.completions.create(**kwargs)
return response.choices[0].message.content or ""
def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
if len(batch) <= 1 or config.client_concurrency <= 1:
return [_one_call(messages, max_tok, temp) for messages in batch]
# Parallel fan-out — vllm batches these on the server side.
from concurrent.futures import ThreadPoolExecutor # noqa: PLC0415
max_workers = min(config.client_concurrency, len(batch))
with ThreadPoolExecutor(max_workers=max_workers) as pool:
futures = [pool.submit(_one_call, messages, max_tok, temp) for messages in batch]
return [f.result() for f in futures]
return _GenericTextClient(_gen, config)
def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
"""Spawn ``config.parallel_servers`` independent vllm replicas.
Each replica:
- is pinned to a single GPU via ``CUDA_VISIBLE_DEVICES``
- listens on ``serve_port + i``
- is shut down via the same atexit hook as the single-server path
Returns the list of ``api_base`` URLs the client should round-robin
across.
"""
import atexit # noqa: PLC0415
import os as _os # noqa: PLC0415
import shlex # noqa: PLC0415
import signal # noqa: PLC0415
import subprocess # noqa: PLC0415
import sys # noqa: PLC0415
import threading # noqa: PLC0415
import time # noqa: PLC0415
n = config.parallel_servers
api_bases: list[str] = []
procs: list[subprocess.Popen] = []
ready_events: list[threading.Event] = []
# Multiple readiness signals — uvicorn's own banner is suppressed at
# ``--uvicorn-log-level warning``, so we also accept vllm's own
# "Starting vLLM API server" line and the route-listing line. The
# HTTP probe below is the ultimate fallback.
ready_markers = (
"Uvicorn running",
"Application startup complete",
"Starting vLLM API server",
"Available routes are",
)
# Single lock for all server-stream threads so multibyte chars from
# different servers don't interleave and tear UTF-8 sequences.
print_lock = threading.Lock()
base_cmd = config.serve_command or (
f"vllm serve {shlex.quote(config.model_id)} "
f"--tensor-parallel-size 1 "
f"--max-model-len {config.max_model_len or 32768} "
f"--uvicorn-log-level warning"
)
num_gpus = config.num_gpus if config.num_gpus > 0 else n
for i in range(n):
port = config.serve_port + i
gpu = i % num_gpus
env = _os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = str(gpu)
cmd = base_cmd.replace("{port}", str(port)) if "{port}" in base_cmd else f"{base_cmd} --port {port}"
api_base = f"http://localhost:{port}/v1"
api_bases.append(api_base)
print(f"[server-{i}] launching on GPU {gpu} port {port}: {cmd}", flush=True)
proc = subprocess.Popen(
shlex.split(cmd),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
env=env,
)
procs.append(proc)
ready = threading.Event()
ready_events.append(ready)
def _stream(idx: int, p: subprocess.Popen, ev: threading.Event) -> None:
# Read whole lines and emit each line atomically under the
# shared print_lock so output from N servers stays readable.
assert p.stdout is not None
for line in iter(p.stdout.readline, ""):
with print_lock:
sys.stdout.write(f"[server-{idx}] {line}")
if not line.endswith(("\n", "\r")):
sys.stdout.write("\n")
sys.stdout.flush()
if any(m in line for m in ready_markers):
ev.set()
threading.Thread(target=_stream, args=(i, proc, ready), daemon=True).start()
def _probe(idx: int, base: str, ev: threading.Event, p: subprocess.Popen) -> None:
while not ev.is_set() and p.poll() is None:
if _server_is_up(base):
print(f"[server-{idx}] ready (http probe)", flush=True)
ev.set()
return
time.sleep(2)
threading.Thread(target=_probe, args=(i, api_base, ready, proc), daemon=True).start()
def _shutdown() -> None:
for i, p in enumerate(procs):
if p.poll() is None:
print(f"[server-{i}] stopping pid={p.pid}", flush=True)
p.send_signal(signal.SIGINT)
for p in procs:
try:
p.wait(timeout=15)
except subprocess.TimeoutExpired:
p.kill()
p.wait(timeout=5)
atexit.register(_shutdown)
deadline = time.monotonic() + config.serve_ready_timeout_s
while any(not ev.is_set() for ev in ready_events) and time.monotonic() < deadline:
for i, p in enumerate(procs):
if p.poll() is not None:
raise RuntimeError(
f"[server-{i}] inference server exited unexpectedly with rc={p.returncode}"
)
time.sleep(2)
if any(not ev.is_set() for ev in ready_events):
raise RuntimeError(f"[server] not all replicas became ready within {config.serve_ready_timeout_s}s")
print(f"[lerobot-annotate] all {n} servers ready: {api_bases}", flush=True)
return api_bases
def _server_is_up(api_base: str) -> bool:
"""Return True if ``api_base/models`` answers 200 within 2 seconds."""
import urllib.request # noqa: PLC0415
url = api_base.rstrip("/") + "/models"
# ``api_base`` is the user-configured local-server URL we just spawned
# or the user passed in via ``--vlm.api_base``; the bandit B310 warning
# is for arbitrary user-controlled URLs with file:/ schemes which
# cannot reach this code path.
try:
with urllib.request.urlopen(url, timeout=2) as resp: # noqa: S310 # nosec B310
return resp.status == 200
except Exception: # noqa: BLE001
return False
def _spawn_inference_server(config: VlmConfig) -> str:
"""Spawn ``transformers serve`` (or ``serve_command``), wait until it
accepts ``/v1/models``, and register a shutdown hook.
Streams the server's stdout/stderr to the parent terminal in
real-time on a background thread so users can see model-load
progress and errors as they happen.
Returns the full ``api_base`` URL the OpenAI client should use.
"""
import atexit # noqa: PLC0415
import shlex # noqa: PLC0415
import signal # noqa: PLC0415
import subprocess # noqa: PLC0415
import sys # noqa: PLC0415
import threading # noqa: PLC0415
import time # noqa: PLC0415
cmd = config.serve_command
if not cmd:
cmd = (
f"transformers serve {shlex.quote(config.model_id)} "
f"--port {config.serve_port} --continuous-batching"
)
api_base = f"http://localhost:{config.serve_port}/v1"
print(f"[server] launching: {cmd}", flush=True)
proc = subprocess.Popen(
shlex.split(cmd),
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
bufsize=1,
)
# Watch the server output for the uvicorn readiness banner. This is
# more reliable than polling /v1/models because transformers serve
# rescans its cache on every model-list request, which can exceed
# the urllib timeout and trigger an infinite probe loop.
ready_event = threading.Event()
# See _spawn_parallel_inference_servers for why we accept these.
ready_markers = (
"Uvicorn running",
"Application startup complete",
"Starting vLLM API server",
"Available routes are",
)
def _probe() -> None:
while not ready_event.is_set() and proc.poll() is None:
if _server_is_up(api_base):
print("[server] ready (http probe)", flush=True)
ready_event.set()
return
time.sleep(2)
threading.Thread(target=_probe, daemon=True).start()
def _stream_output() -> None:
# Read raw chunks instead of iterating lines so tqdm progress
# bars (which overwrite using \r) flush in real time.
assert proc.stdout is not None
buf = ""
prefix_started = False
while True:
ch = proc.stdout.read(1)
if ch == "":
# process exited; flush any tail
if buf:
sys.stdout.write(buf)
sys.stdout.flush()
return
if not prefix_started:
sys.stdout.write("[server] ")
prefix_started = True
sys.stdout.write(ch)
sys.stdout.flush()
buf += ch
if ch in ("\n", "\r"):
if any(marker in buf for marker in ready_markers):
ready_event.set()
buf = ""
prefix_started = False
threading.Thread(target=_stream_output, daemon=True).start()
def _shutdown() -> None:
if proc.poll() is None:
print(f"[server] stopping pid={proc.pid}", flush=True)
proc.send_signal(signal.SIGINT)
try:
proc.wait(timeout=15)
except subprocess.TimeoutExpired:
proc.kill()
proc.wait(timeout=5)
atexit.register(_shutdown)
deadline = time.monotonic() + config.serve_ready_timeout_s
while time.monotonic() < deadline:
if proc.poll() is not None:
raise RuntimeError(
f"[server] inference server exited unexpectedly with rc={proc.returncode}. "
f"See [server] log lines above for the cause."
)
if ready_event.wait(timeout=2):
return api_base
proc.terminate()
raise RuntimeError(f"[server] did not become ready within {config.serve_ready_timeout_s}s")
def _to_openai_messages(
messages: Sequence[dict[str, Any]],
) -> tuple[list[dict[str, Any]], dict[str, Any]]:
"""Convert internal messages to OpenAI chat format.
Returns ``(api_messages, mm_kwargs)``. Multimodal-processor kwargs
(``fps`` from ``video_url`` blocks) are extracted out so the caller
can pass them via ``extra_body.mm_processor_kwargs`` rather than
inside the content blocks (which transformers serve rejects).
File-URL video blocks are inlined as base64 data URLs.
"""
out_messages: list[dict[str, Any]] = []
mm_kwargs: dict[str, Any] = {}
for message in messages:
content = message.get("content")
if not isinstance(content, list):
out_messages.append({"role": message["role"], "content": content})
continue
out_blocks: list[dict[str, Any]] = []
for block in content:
block_type = block.get("type") if isinstance(block, dict) else None
if block_type == "text":
out_blocks.append({"type": "text", "text": block.get("text", "")})
elif block_type == "image":
out_blocks.append(
{"type": "image_url", "image_url": {"url": _pil_to_data_url(block["image"])}}
)
elif block_type == "video":
frames = block.get("video", [])
for img in frames:
out_blocks.append({"type": "image_url", "image_url": {"url": _pil_to_data_url(img)}})
elif block_type == "video_url":
video_url = dict(block["video_url"])
url = video_url.get("url", "")
if url.startswith("file://"):
video_url["url"] = _file_to_data_url(url[len("file://") :])
out_blocks.append({"type": "video_url", "video_url": video_url})
fps = block.get("fps")
if fps is not None:
mm_kwargs["fps"] = fps
else:
out_blocks.append(block)
out_messages.append({"role": message["role"], "content": out_blocks})
return out_messages, mm_kwargs
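# Illustrative conversion (hedged): the ``fps`` hint is lifted out of the
# video block into ``mm_kwargs`` and block types map onto the OpenAI schema.
#
#     >>> msgs = [{"role": "user", "content": [
#     ...     {"type": "text", "text": "Describe the scene."},
#     ...     {"type": "video_url", "video_url": {"url": "http://host/ep0.mp4"}, "fps": 2},
#     ... ]}]
#     >>> api_messages, mm_kwargs = _to_openai_messages(msgs)
#     >>> mm_kwargs
#     {'fps': 2}
#     >>> [block["type"] for block in api_messages[0]["content"]]
#     ['text', 'video_url']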
def _file_to_data_url(path: str) -> str:
"""Read a local video file and return a base64 ``data:video/mp4`` URL."""
import base64 # noqa: PLC0415
with open(path, "rb") as f:
b64 = base64.b64encode(f.read()).decode("ascii")
return f"data:video/mp4;base64,{b64}"
def _pil_to_data_url(image: Any) -> str:
"""Encode a PIL.Image as a base64 data URL."""
import base64 # noqa: PLC0415
import io # noqa: PLC0415
buf = io.BytesIO()
image.save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode("ascii")
return f"data:image/png;base64,{b64}"
def _messages_to_prompt(messages: Sequence[dict[str, Any]]) -> Any:
"""Pass-through hook used by the vllm backend.
vllm exposes its own multimodal entry points that vary by version; for the
base flow we simply forward the raw message list and let the caller's
custom backend handle templating. Real deployments override this.
"""
return list(messages)
@@ -0,0 +1,356 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Final parquet rewrite.
For every episode the writer:
1. reads the staged module outputs,
2. partitions them into a persistent slice (PERSISTENT_STYLES) and an event
slice (EVENT_ONLY_STYLES + style=None tool-call atoms),
3. sorts each slice deterministically,
4. broadcasts the persistent slice across every frame in the episode,
5. for each frame, materializes the sublist of event rows whose timestamp
exactly equals that frame's timestamp,
6. drops the legacy ``subtask_index`` column,
7. writes the parquet shard back in place.
The writer does NOT add a dataset-level ``tools`` column. Tool *calls* are
emitted per-row via the existing ``tool_calls`` field on the v3.1 row
struct (PR 1) for every speech atom. The tool *schema* (the description
of the ``say`` function and its parameters) is a fixed code constant —
``SAY_TOOL_SCHEMA`` below — and downstream chat-template consumers import
it directly rather than reading a redundant per-row column.
Invariants enforced here (and re-checked by the validator):
- per-episode persistent slice is byte-identical across every frame;
- ``language_events`` rows on a frame all have ``timestamp == frame_ts``
(timestamps come straight from the source parquet — never recomputed);
- every row passes ``column_for_style(style)``.
"""
from __future__ import annotations
import logging
from collections import defaultdict
from collections.abc import Iterable, Sequence
from dataclasses import dataclass
from pathlib import Path
from typing import Any
import pyarrow as pa
import pyarrow.parquet as pq
from lerobot.datasets.language import (
EVENT_ONLY_STYLES,
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
PERSISTENT_STYLES,
column_for_style,
validate_camera_field,
)
from .reader import EpisodeRecord
from .staging import EpisodeStaging
logger = logging.getLogger(__name__)
# Tool schema constants moved to lerobot.datasets.language in PR 1 — single
# source of truth. Re-exported here so existing imports
# (``from lerobot.annotations.steerable_pipeline.writer import SAY_TOOL_SCHEMA``)
# keep working.
from lerobot.datasets.language import DEFAULT_TOOLS, SAY_TOOL_SCHEMA # noqa: F401, E402
def _row_persistent_sort_key(row: dict[str, Any]) -> tuple:
return (float(row["timestamp"]), row.get("style") or "", row.get("role") or "")
def _row_event_sort_key(row: dict[str, Any]) -> tuple:
# events are bucketed per-frame, but within a frame we still want determinism
return (
row.get("style") or "",
row.get("role") or "",
row.get("camera") or "",
)
def _normalize_persistent_row(row: dict[str, Any]) -> dict[str, Any]:
"""Coerce a staged row into the persistent column's struct shape."""
style = row.get("style")
if style not in PERSISTENT_STYLES:
raise ValueError(
f"persistent slice contains row with non-persistent style {style!r}; "
"row would be misrouted under column_for_style()"
)
if "timestamp" not in row:
raise ValueError(f"persistent row missing timestamp: {row!r}")
if "role" not in row:
# Surface a friendly error from the writer rather than letting
# the raw KeyError bubble out of the dict access below — modules
# are expected to always emit ``role``, but the validator
# currently doesn't check this so a future bug would otherwise
# be hard to triage.
raise ValueError(f"persistent row missing role: {row!r}")
camera = row.get("camera")
validate_camera_field(style, camera)
return {
"role": str(row["role"]),
"content": None if row.get("content") is None else str(row["content"]),
"style": style,
"timestamp": float(row["timestamp"]),
"camera": None if camera is None else str(camera),
"tool_calls": _normalize_tool_calls(row.get("tool_calls")),
}
def _normalize_event_row(row: dict[str, Any]) -> dict[str, Any]:
"""Coerce a staged row into the event column's struct shape (no timestamp)."""
style = row.get("style")
if style is not None and style not in EVENT_ONLY_STYLES:
raise ValueError(
f"event slice contains row with style {style!r}; expected None or one of {EVENT_ONLY_STYLES}"
)
if column_for_style(style) != LANGUAGE_EVENTS:
raise ValueError(f"event row with style {style!r} would not route to language_events")
if "role" not in row:
raise ValueError(f"event row missing role: {row!r}")
camera = row.get("camera")
validate_camera_field(style, camera)
return {
"role": str(row["role"]),
"content": None if row.get("content") is None else str(row["content"]),
"style": style,
"camera": None if camera is None else str(camera),
"tool_calls": _normalize_tool_calls(row.get("tool_calls")),
}
def _normalize_tool_calls(value: Any) -> list[Any] | None:
if value is None:
return None
if not isinstance(value, list):
raise ValueError(f"tool_calls must be a list or None, got {type(value).__name__}")
return list(value)
def _validate_atom_invariants(row: dict[str, Any]) -> None:
"""At-least-one of content/tool_calls; style=None implies tool_calls."""
has_content = row.get("content") is not None
has_tools = row.get("tool_calls") is not None
if not (has_content or has_tools):
raise ValueError(f"row has neither content nor tool_calls: {row!r}")
if row.get("style") is None and not has_tools:
raise ValueError(f"style=None requires tool_calls: {row!r}")
def _validate_speech_atom(row: dict[str, Any]) -> None:
"""Speech atoms: role=assistant, style=None, content=None, say tool call."""
if row.get("style") is not None:
return # not a speech atom
if row.get("role") != "assistant":
raise ValueError(f"speech atom must have role=assistant: {row!r}")
if row.get("content") is not None:
raise ValueError(f"speech atom must have content=null: {row!r}")
tool_calls = row.get("tool_calls")
if not tool_calls or not isinstance(tool_calls, list):
raise ValueError(f"speech atom must have non-empty tool_calls list: {row!r}")
first = tool_calls[0]
if not isinstance(first, dict):
raise ValueError(f"speech atom tool_calls[0] must be a dict: {row!r}")
if first.get("type") != "function":
raise ValueError(f"speech atom tool_calls[0].type must be 'function': {row!r}")
fn = first.get("function") or {}
if fn.get("name") != "say":
raise ValueError(f"speech atom tool_calls[0].function.name must be 'say': {row!r}")
args = fn.get("arguments") or {}
if not isinstance(args, dict) or "text" not in args or not isinstance(args["text"], str):
raise ValueError(f"speech atom must carry 'text' string in arguments: {row!r}")
@dataclass
class LanguageColumnsWriter:
"""Rewrite ``data/chunk-*/file-*.parquet`` with the two language columns."""
drop_existing_subtask_index: bool = True
def write_all(
self,
records: Sequence[EpisodeRecord],
staging_dir: Path,
root: Path,
) -> list[Path]:
episodes_by_path: dict[Path, list[EpisodeRecord]] = defaultdict(list)
for record in records:
episodes_by_path[record.data_path].append(record)
written: list[Path] = []
for path, eps in episodes_by_path.items():
self._rewrite_one(path, eps, staging_dir, root)
written.append(path)
return written
def _rewrite_one(
self,
path: Path,
episodes: Sequence[EpisodeRecord],
staging_dir: Path,
root: Path,
) -> None:
table = pq.read_table(path)
n_rows = table.num_rows
# Ensure we cover every episode in the file. Episodes that don't have
# staging artifacts are passed through with empty annotation lists —
# this keeps the writer idempotent and safe for partial reruns.
staged_per_ep: dict[int, dict[str, list[dict[str, Any]]]] = {}
for record in episodes:
staging = EpisodeStaging(staging_dir, record.episode_index)
staged_per_ep[record.episode_index] = staging.read_all()
persistent_by_ep: dict[int, list[dict[str, Any]]] = {}
events_by_ep_ts: dict[int, dict[float, list[dict[str, Any]]]] = {}
for ep_index, ep_staged in staged_per_ep.items():
persistent_rows: list[dict[str, Any]] = []
event_rows: list[dict[str, Any]] = [] # carry timestamp until bucketed
for _module_name, rows in ep_staged.items():
for row in rows:
style = row.get("style")
if column_for_style(style) == LANGUAGE_PERSISTENT:
persistent_rows.append(row)
else:
event_rows.append(row)
persistent_rows.sort(key=_row_persistent_sort_key)
normalized_persistent = []
for r in persistent_rows:
_validate_atom_invariants(r)
_validate_speech_atom(r)
normalized_persistent.append(_normalize_persistent_row(r))
persistent_by_ep[ep_index] = normalized_persistent
buckets: dict[float, list[dict[str, Any]]] = defaultdict(list)
for r in event_rows:
_validate_atom_invariants(r)
_validate_speech_atom(r)
ts = float(r["timestamp"])
buckets[ts].append(_normalize_event_row(r))
for ts in list(buckets.keys()):
buckets[ts].sort(key=_row_event_sort_key)
events_by_ep_ts[ep_index] = buckets
episode_col = (
table.column("episode_index").to_pylist() if "episode_index" in table.column_names else None
)
ts_col = table.column("timestamp").to_pylist() if "timestamp" in table.column_names else None
if episode_col is None or ts_col is None:
raise ValueError(f"{path} is missing 'episode_index' or 'timestamp' — required by the writer.")
per_row_persistent: list[list[dict[str, Any]]] = []
per_row_events: list[list[dict[str, Any]]] = []
for i in range(n_rows):
ep = episode_col[i]
ts = float(ts_col[i])
per_row_persistent.append(persistent_by_ep.get(ep, []))
buckets = events_by_ep_ts.get(ep, {})
per_row_events.append(buckets.get(ts, []))
new_table = self._materialize_table(
table, per_row_persistent, per_row_events, drop_old=self.drop_existing_subtask_index
)
# Atomic replace: write to a sibling tmp path and rename so a crash
# mid-write can't leave a half-written shard that ``pq.read_table``
# would then fail to open. ``Path.replace`` is atomic on POSIX +
# Windows when source and target sit on the same filesystem.
tmp_path = path.with_suffix(path.suffix + ".tmp")
pq.write_table(new_table, tmp_path)
tmp_path.replace(path)
def _materialize_table(
self,
table: pa.Table,
persistent: list[list[dict[str, Any]]],
events: list[list[dict[str, Any]]],
*,
drop_old: bool,
) -> pa.Table:
cols = []
names = []
for name in table.column_names:
if drop_old and name == "subtask_index":
continue
if name in (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS):
continue # we'll re-add canonical versions
# Strip any legacy ``tools`` column previously emitted by older
# writers — the schema no longer uses it (constant lives in
# SAY_TOOL_SCHEMA / DEFAULT_TOOLS).
if name == "tools":
continue
cols.append(table.column(name))
names.append(name)
# We let pyarrow infer struct/list schema rather than passing the
# canonical type from `lerobot.datasets.language` directly: that type
# uses `pa.json_()` for the `tool_calls` element type, which
# `pa.array(..., type=...)` cannot materialize from Python lists on
# current pyarrow versions. The inferred schema round-trips through
# parquet and `LeRobotDataset` correctly — see PR 1's
# `tests/datasets/test_language.py` which exercises the same flow.
persistent_arr = pa.array(persistent)
events_arr = pa.array(events)
cols.extend([persistent_arr, events_arr])
names.extend([LANGUAGE_PERSISTENT, LANGUAGE_EVENTS])
return pa.Table.from_arrays(cols, names=names)
def speech_atom(timestamp: float, text: str) -> dict[str, Any]:
"""Build a canonical speech tool-call atom for the events column."""
return {
"role": "assistant",
"content": None,
"style": None,
"timestamp": float(timestamp),
"camera": None,
"tool_calls": [
{
"type": "function",
"function": {
"name": "say",
"arguments": {"text": text},
},
}
],
}
def normalize_rows_for_writer(
rows: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
"""Helper used by tests/validators to partition a flat row list into
(persistent_rows, event_rows) using ``column_for_style``.
"""
persistent: list[dict[str, Any]] = []
events: list[dict[str, Any]] = []
for row in rows:
if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
persistent.append(row)
else:
events.append(row)
return persistent, events
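# Hedged usage sketch: partition a mixed staged-row list and inspect the
# speech atom's tool call (assumes "subtask" routes to the persistent column,
# per PERSISTENT_STYLES).
#
#     >>> rows = [
#     ...     {"role": "assistant", "content": "pick up the mug", "style": "subtask",
#     ...      "timestamp": 0.0, "camera": None},
#     ...     speech_atom(1.5, "Picking up the mug now."),
#     ... ]
#     >>> persistent, events = normalize_rows_for_writer(rows)
#     >>> (len(persistent), len(events))
#     (1, 1)
#     >>> events[0]["tool_calls"][0]["function"]["name"]
#     'say'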
+4 -12
@@ -24,6 +24,7 @@ Import them directly: ``from lerobot.configs.train import TrainPipelineConfig``
from .dataset import DatasetRecordConfig
from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
from .policies import PreTrainedConfig
from .recipe import MessageTurn, TrainingRecipe, load_recipe
from .types import (
FeatureType,
NormalizationMode,
@@ -31,12 +32,6 @@ from .types import (
PolicyFeature,
RTCAttentionSchedule,
)
from .video import (
VALID_VIDEO_CODECS,
VIDEO_ENCODER_INFO_KEYS,
VideoEncoderConfig,
camera_encoder_defaults,
)
__all__ = [
# Types
@@ -49,13 +44,10 @@ __all__ = [
"DatasetRecordConfig",
"DatasetConfig",
"EvalConfig",
"MessageTurn",
"PeftConfig",
"PreTrainedConfig",
"TrainingRecipe",
"WandBConfig",
"VideoEncoderConfig",
# Defaults
"camera_encoder_defaults",
# Constants
"VALID_VIDEO_CODECS",
"VIDEO_ENCODER_INFO_KEYS",
"load_recipe",
]
+5 -6
@@ -14,12 +14,10 @@
"""Shared dataset recording configuration used by both ``lerobot-record`` and ``lerobot-rollout``."""
from dataclasses import dataclass, field
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from .video import VideoEncoderConfig, camera_encoder_defaults
@dataclass
class DatasetRecordConfig:
@@ -57,9 +55,10 @@ class DatasetRecordConfig:
# Number of episodes to record before batch encoding videos
# Set to 1 for immediate encoding (default behavior), or higher for batched encoding
video_encoding_batch_size: int = 1
# Video encoder settings for camera MP4s (codec, quality, GOP, etc.). Tuned via CLI nested keys,
# e.g. ``--dataset.camera_encoder_config.vcodec=h264`` (see ``VideoEncoderConfig``).
camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
# Video codec for encoding videos. Options: 'h264', 'hevc', 'libsvtav1', 'auto',
# or hardware-specific: 'h264_videotoolbox', 'h264_nvenc', 'h264_vaapi', 'h264_qsv'.
# Use 'auto' to auto-detect the best available hardware encoder.
vcodec: str = "libsvtav1"
# Enable streaming video encoding: encode frames in real-time during capture instead
# of writing PNG images first. Makes save_episode() near-instant. More info in the documentation: https://huggingface.co/docs/lerobot/streaming_video_encoding
streaming_encoding: bool = False
+2 -2
@@ -17,7 +17,7 @@
from dataclasses import dataclass, field
from lerobot.transforms import ImageTransformsConfig
from lerobot.utils.import_utils import get_safe_default_video_backend
from lerobot.utils.import_utils import get_safe_default_codec
@dataclass
@@ -34,7 +34,7 @@ class DatasetConfig:
image_transforms: ImageTransformsConfig = field(default_factory=ImageTransformsConfig)
revision: str | None = None
use_imagenet_stats: bool = True
video_backend: str = field(default_factory=get_safe_default_video_backend)
video_backend: str = field(default_factory=get_safe_default_codec)
# When True, video frames are returned as uint8 tensors (0-255) instead of float32 (0.0-1.0).
# This reduces memory and speeds up DataLoader IPC. The training pipeline handles the conversion.
return_uint8: bool = False
+206
@@ -0,0 +1,206 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Literal, get_args
MessageRole = Literal["user", "assistant", "system", "tool"]
MessageStream = Literal["high_level", "low_level"]
DEFAULT_BINDINGS = {
"subtask": "active_at(t, style=subtask)",
"memory": "active_at(t, style=memory)",
"plan": "active_at(t, style=plan)",
"speech": "emitted_at(t, role=assistant, tool_name=say)",
"interjection": "emitted_at(t, style=interjection)",
"vqa": "emitted_at(t, style=vqa, role=assistant)",
"vqa_query": "emitted_at(t, style=vqa, role=user)",
}
PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
"""``${name}`` placeholder pattern used by both recipe binding-reference
discovery (here) and rendered-message substitution (in ``language_render``)."""
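# Illustrative match (hedged): only the binding names inside ``${...}`` are captured.
#
#     >>> PLACEHOLDER_RE.findall("Task: ${task}. Next: ${subtask}")
#     ['task', 'subtask']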
_VALID_ROLES = frozenset(get_args(MessageRole))
_VALID_STREAMS = frozenset(get_args(MessageStream))
@dataclass
class MessageTurn:
"""A single chat-style turn in a recipe template.
``content`` may be a plain string, a list of HF-style multimodal blocks, or
``None`` when ``tool_calls_from`` supplies tool-call payloads instead.
``stream`` tags the turn for downstream filtering, ``target`` flags it as a
training target, and ``if_present`` skips the turn when the named binding
resolves to ``None``.
"""
role: MessageRole
content: str | list[dict[str, Any]] | None = None
stream: MessageStream | None = None
target: bool = False
if_present: str | None = None
tool_calls_from: str | None = None
def __post_init__(self) -> None:
"""Validate role, stream, and content after dataclass construction."""
if self.role not in _VALID_ROLES:
raise ValueError(f"Unsupported message role: {self.role!r}")
# ``stream`` is typed Optional only so the dataclass can keep its
# field ordering, but recipes must always tag every turn with a
# stream — the renderer's ``_validate_rendered`` would reject
# ``None`` later on. Fail at construction so the bad recipe is
# caught at YAML load time rather than at the first sample.
if self.stream is None:
raise ValueError(
f"MessageTurn(role={self.role!r}) is missing a stream — "
f"every turn must declare one of {sorted(_VALID_STREAMS)}."
)
if self.stream not in _VALID_STREAMS:
raise ValueError(f"Unsupported message stream: {self.stream!r}")
if self.content is None and self.tool_calls_from is None:
raise ValueError("MessageTurn.content is required unless tool_calls_from is set.")
if self.content is not None and not isinstance(self.content, (str, list)):
raise TypeError("MessageTurn.content must be a string, a list of HF-style blocks, or None.")
if isinstance(self.content, list):
for block in self.content:
if not isinstance(block, dict) or "type" not in block:
raise ValueError(
"Multimodal content blocks must be HF-style dictionaries with a type key."
)
@classmethod
def from_dict(cls, data: dict[str, Any]) -> MessageTurn:
"""Construct a :class:`MessageTurn` from a plain dictionary."""
return cls(**data)
@dataclass
class TrainingRecipe:
"""A recipe describing how to render training samples from language rows.
A recipe is either a *message recipe* (``messages`` plus optional
``bindings``) or a *blend recipe* (``blend`` mapping names to weighted
sub-recipes). ``weight`` is only meaningful inside a blend.
"""
messages: list[MessageTurn] | None = None
bindings: dict[str, str] | None = None
blend: dict[str, TrainingRecipe] | None = None
weight: float | None = None
def __post_init__(self) -> None:
"""Validate that exactly one of ``messages`` or ``blend`` is set."""
if self.messages is not None and self.blend is not None:
raise ValueError("TrainingRecipe must set only one of messages or blend.")
if self.messages is None and self.blend is None:
raise ValueError("TrainingRecipe must set one of messages or blend.")
if self.messages is not None:
self._validate_message_recipe()
if self.blend is not None:
self._validate_blend_recipe()
@classmethod
def from_dict(cls, data: dict[str, Any]) -> TrainingRecipe:
"""Construct a :class:`TrainingRecipe` from a nested dictionary."""
data = dict(data)
if data.get("messages") is not None:
data["messages"] = [
turn if isinstance(turn, MessageTurn) else MessageTurn.from_dict(turn)
for turn in data["messages"]
]
if data.get("blend") is not None:
data["blend"] = {
name: recipe if isinstance(recipe, TrainingRecipe) else cls.from_dict(recipe)
for name, recipe in data["blend"].items()
}
return cls(**data)
@classmethod
def from_yaml(cls, path: str | Path) -> TrainingRecipe:
"""Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
import yaml # type: ignore[import-untyped]
with open(path) as f:
data = yaml.safe_load(f)
if not isinstance(data, dict):
raise ValueError(f"Recipe YAML must contain a mapping at the top level: {path}")
return cls.from_dict(data)
def _validate_message_recipe(self) -> None:
"""Ensure every templated binding is known and at least one turn is a target."""
assert self.messages is not None
known_bindings = set(DEFAULT_BINDINGS) | set(self.bindings or {}) | {"task"}
for turn in self.messages:
missing = self._referenced_bindings(turn) - known_bindings
if missing:
raise ValueError(f"MessageTurn references unknown binding(s): {sorted(missing)}")
if not any(turn.target for turn in self.messages):
raise ValueError("Message recipes must contain at least one target turn.")
def _validate_blend_recipe(self) -> None:
"""Ensure each blend component is a non-empty, weighted message recipe."""
assert self.blend is not None
if not self.blend:
raise ValueError("Blend recipes must contain at least one component.")
for name, recipe in self.blend.items():
if recipe.blend is not None:
raise ValueError(f"Blend component {name!r} cannot itself define a blend.")
if recipe.messages is None:
raise ValueError(f"Blend component {name!r} must define messages.")
if recipe.weight is None:
raise ValueError(f"Blend component {name!r} must define weight.")
if recipe.weight <= 0:
raise ValueError(f"Blend component {name!r} must have a positive weight.")
def _referenced_bindings(self, turn: MessageTurn) -> set[str]:
"""Return the binding names that ``turn`` references via placeholders or attributes."""
names: set[str] = set()
if turn.if_present is not None:
names.add(turn.if_present)
if turn.tool_calls_from is not None:
names.add(turn.tool_calls_from)
names.update(_placeholders_in_content(turn.content))
return names
def _placeholders_in_content(content: str | list[dict[str, Any]] | None) -> set[str]:
"""Return the set of ``${name}`` placeholders found anywhere in ``content``."""
if content is None:
return set()
if isinstance(content, str):
return set(PLACEHOLDER_RE.findall(content))
names: set[str] = set()
for block in content:
for value in block.values():
if isinstance(value, str):
names.update(PLACEHOLDER_RE.findall(value))
return names
def load_recipe(path: str | Path) -> TrainingRecipe:
"""Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
return TrainingRecipe.from_yaml(path)
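# Minimal message-recipe sketch (hedged). Placeholder resolution happens in
# the renderer; this module only checks that every ``${...}`` name is a known
# binding and that at least one turn is a target.
#
#     >>> recipe = TrainingRecipe.from_dict({
#     ...     "messages": [
#     ...         {"role": "user", "stream": "high_level",
#     ...          "content": "Task: ${task}. Current subtask: ${subtask}"},
#     ...         {"role": "assistant", "stream": "high_level", "target": True,
#     ...          "content": "${plan}"},
#     ...     ]
#     ... })
#     >>> [turn.role for turn in recipe.messages]
#     ['user', 'assistant']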
-201
@@ -1,201 +0,0 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Note: We subclass str so that serialization is straightforward
# https://stackoverflow.com/questions/24481852/serialising-an-enum-member-to-json
"""Video encoder configurations."""
import logging
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Any
from lerobot.utils.import_utils import _av_available, require_package
if TYPE_CHECKING or _av_available:
from lerobot.datasets.pyav_utils import check_video_encoder_config_pyav, detect_available_encoders_pyav
logger = logging.getLogger(__name__)
# List of hardware encoders to probe for auto-selection. Availability depends on the platform and the chosen video backend.
# Determines the order of preference for auto-selection when vcodec="auto" is used.
HW_VIDEO_CODECS = [
"h264_videotoolbox", # macOS
"hevc_videotoolbox", # macOS
"h264_nvenc", # NVIDIA GPU
"hevc_nvenc", # NVIDIA GPU
"h264_vaapi", # Linux Intel/AMD
"h264_qsv", # Intel Quick Sync
]
VALID_VIDEO_CODECS: frozenset[str] = frozenset({"h264", "hevc", "libsvtav1", "auto", *HW_VIDEO_CODECS})
LIBSVTAV1_DEFAULT_PRESET: int = 12
# Keys persisted under ``features[*]["info"]`` as ``video.<name>`` (from :class:`VideoEncoderConfig`).
# ``vcodec`` and ``pix_fmt`` are derived from the video stream directly.
VIDEO_ENCODER_INFO_FIELD_NAMES: frozenset[str] = frozenset(
{"g", "crf", "preset", "fast_decode", "extra_options", "video_backend"}
)
VIDEO_ENCODER_INFO_KEYS: frozenset[str] = frozenset(
f"video.{name}" for name in VIDEO_ENCODER_INFO_FIELD_NAMES
)
@dataclass
class VideoEncoderConfig:
"""Video encoder configuration.
Attributes:
vcodec: Video encoder name. ``"auto"`` is resolved during
construction (HW encoder if available, else ``libsvtav1``).
pix_fmt: Pixel format (e.g. ``"yuv420p"``).
g: GOP size (keyframe interval).
crf: Quality level — mapped to the native quality parameter of the
codec (``crf`` for software, ``qp`` for NVENC/VAAPI,
``q:v`` for VideoToolbox, ``global_quality`` for QSV).
preset: Speed/quality preset. Accepted type is per-codec.
fast_decode: Fast-decode tuning. For ``libsvtav1`` this is a level (0-2)
embedded in ``svtav1-params``. For ``h264`` and ``hevc`` non-zero values
set ``tune=fastdecode``. Ignored for other codecs.
video_backend: Python video backend to be used for encoding. Only ``"pyav"``
is currently supported.
extra_options: Free-form dictionary of additional video encoder options
(e.g. ``{"tune": "film", "profile:v": "high", "bf": 2}``).
"""
vcodec: str = "libsvtav1" # TODO(CarolinePascal): rename to codec ?
pix_fmt: str = "yuv420p"
g: int | None = 2
crf: int | None = 30
preset: int | str | None = None
fast_decode: int = 0
# TODO(CarolinePascal): add torchcodec support + find a way to unify the
# two backends (encoding and decoding).
video_backend: str = "pyav"
extra_options: dict[str, Any] = field(default_factory=dict)
def __post_init__(self) -> None:
self.resolve_vcodec()
# Empty-constructor ergonomics: ``VideoEncoderConfig()`` must "just work".
if self.preset is None and self.vcodec == "libsvtav1":
self.preset = LIBSVTAV1_DEFAULT_PRESET
self.validate()
def detect_available_encoders(self, encoders: list[str] | str) -> list[str]:
"""Return the subset of available encoders based on the specified video backend.
Args:
encoders: List of encoder names to detect. If a string, it is converted to a list.
Returns:
List of available encoder names. If the video backend is not "pyav", returns an empty list.
"""
if self.video_backend == "pyav":
require_package("av", extra="dataset")
return detect_available_encoders_pyav(encoders)
return []
def validate(self) -> None:
"""Validate the video encoder configuration."""
if self.video_backend == "pyav":
require_package("av", extra="dataset")
check_video_encoder_config_pyav(self)
def resolve_vcodec(self) -> None:
"""Check ``vcodec`` and, when it is ``"auto"``, pick a concrete encoder.
For ``"auto"``, the first hardware encoder in the preference list that is available is chosen; if none are available, ``libsvtav1`` is used. If the
resolved codec (explicit or after auto-selection) is not available, raises ``ValueError``.
"""
if self.vcodec not in VALID_VIDEO_CODECS:
raise ValueError(f"Invalid vcodec '{self.vcodec}'. Must be one of: {sorted(VALID_VIDEO_CODECS)}")
if self.vcodec == "auto":
available = self.detect_available_encoders(HW_VIDEO_CODECS)
for encoder in HW_VIDEO_CODECS:
if encoder in available:
logger.info(f"Auto-selected video codec: {encoder}")
self.vcodec = encoder
return
logger.warning("No hardware encoder available, falling back to software encoder 'libsvtav1'")
self.vcodec = "libsvtav1"
if self.detect_available_encoders(self.vcodec):
logger.info(f"Using video codec: {self.vcodec}")
return
raise ValueError(f"Unsupported video codec: {self.vcodec} with video backend {self.video_backend}")
def get_codec_options(
self, encoder_threads: int | None = None, as_strings: bool = False
) -> dict[str, Any]:
"""Translate the tuning fields to codec-specific options.
``VideoEncoderConfig.extra_options`` are merged last but never override a structured field.
Args:
encoder_threads: Number of encoder threads set globally for all VideoEncoderConfigs.
For libsvtav1, this is mapped to ``lp`` via ``svtav1-params``.
For h264/hevc, this is mapped to ``threads``.
Hardware encoders ignore this parameter.
as_strings: If ``True``, casts values to strings.
"""
opts: dict[str, Any] = {}
def set_if(key: str, value: Any) -> None:
if value is not None:
opts[key] = value if not as_strings else str(value)
# GOP size is not a codec-specific option, so it is always set.
set_if("g", self.g)
if self.vcodec == "libsvtav1":
set_if("crf", self.crf)
set_if("preset", self.preset)
svtav1_parts: list[str] = []
if self.fast_decode is not None:
svtav1_parts.append(f"fast-decode={max(0, min(2, self.fast_decode))}")
if encoder_threads is not None:
svtav1_parts.append(f"lp={encoder_threads}")
if svtav1_parts:
opts["svtav1-params"] = ":".join(svtav1_parts)
elif self.vcodec in ("h264", "hevc"):
set_if("crf", self.crf)
set_if("preset", self.preset)
if self.fast_decode:
opts["tune"] = "fastdecode"
set_if("threads", encoder_threads)
elif self.vcodec in ("h264_videotoolbox", "hevc_videotoolbox"):
if self.crf is not None:
opts["q:v"] = max(1, min(100, 100 - self.crf * 2))
elif self.vcodec in ("h264_nvenc", "hevc_nvenc"):
opts["rc"] = "constqp"
set_if("qp", self.crf)
set_if("preset", self.preset)
elif self.vcodec == "h264_vaapi":
set_if("qp", self.crf)
elif self.vcodec == "h264_qsv":
set_if("global_quality", self.crf)
set_if("preset", self.preset)
else:
set_if("crf", self.crf)
set_if("preset", self.preset)
# Extra options are merged last but never override structured fields (values are kept as given).
for k, v in self.extra_options.items():
if k not in opts:
set_if(k, v)
return opts
def camera_encoder_defaults() -> VideoEncoderConfig:
"""Return a :class:`VideoEncoderConfig` with RGB-camera defaults."""
return VideoEncoderConfig()
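A minimal usage sketch of the new config, for orientation only: the import path is assumed, PyAV must be installed so ``validate()`` passes, and the NVENC case only resolves on a machine where that encoder is actually detected.

```python
# Hedged sketch (not part of the diff): how structured fields map to per-codec options.
from lerobot.datasets.video_utils import VideoEncoderConfig  # assumed import path

cfg = VideoEncoderConfig(vcodec="libsvtav1", crf=30, g=2, fast_decode=1)
print(cfg.get_codec_options(encoder_threads=4, as_strings=True))
# -> {"g": "2", "crf": "30", "preset": str(LIBSVTAV1_DEFAULT_PRESET),
#     "svtav1-params": "fast-decode=1:lp=4"}

# crf is remapped to the codec's native quality knob, e.g. constant-QP mode for NVENC.
nvenc = VideoEncoderConfig(vcodec="h264_nvenc", crf=30)  # raises ValueError if NVENC is unavailable
print(nvenc.get_codec_options())
# -> {"g": 2, "rc": "constqp", "qp": 30}
```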
+15 -15
View File
@@ -37,22 +37,21 @@ from .dataset_tools import (
from .factory import make_dataset, resolve_delta_timestamps
from .image_writer import safe_stop_image_writer
from .io_utils import load_episodes, write_stats
from .language import (
EVENT_ONLY_STYLES,
LANGUAGE_EVENTS,
LANGUAGE_PERSISTENT,
PERSISTENT_STYLES,
STYLE_REGISTRY,
column_for_style,
)
from .lerobot_dataset import LeRobotDataset
from .multi_dataset import MultiLeRobotDataset
from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
from .pyav_utils import (
check_video_encoder_config_pyav,
detect_available_encoders_pyav,
get_codec,
)
from .sampler import EpisodeAwareSampler
from .streaming_dataset import StreamingLeRobotDataset
from .utils import DEFAULT_EPISODES_PATH, create_lerobot_dataset_card
from .video_utils import (
VideoEncoderConfig,
VideoEncodingManager,
camera_encoder_defaults,
)
from .video_utils import VideoEncodingManager
# NOTE: Low-level I/O functions (cast_stats_to_numpy, get_parquet_file_size_in_mb, etc.)
# and legacy migration constants are intentionally NOT re-exported here.
@@ -62,25 +61,26 @@ __all__ = [
"CODEBASE_VERSION",
"DEFAULT_EPISODES_PATH",
"DEFAULT_QUANTILES",
"EVENT_ONLY_STYLES",
"EpisodeAwareSampler",
"LANGUAGE_EVENTS",
"LANGUAGE_PERSISTENT",
"LeRobotDataset",
"LeRobotDatasetMetadata",
"MultiLeRobotDataset",
"PERSISTENT_STYLES",
"STYLE_REGISTRY",
"StreamingLeRobotDataset",
"VideoEncoderConfig",
"VideoEncodingManager",
"camera_encoder_defaults",
"add_features",
"aggregate_datasets",
"aggregate_pipeline_dataset_features",
"aggregate_stats",
"check_video_encoder_config_pyav",
"convert_image_to_video_dataset",
"create_initial_features",
"create_lerobot_dataset_card",
"column_for_style",
"delete_episodes",
"detect_available_encoders_pyav",
"get_codec",
"get_feature_stats",
"load_episodes",
"make_dataset",
+1 -1
View File
@@ -332,6 +332,7 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
videos_idx: Dictionary tracking video chunk and file indices.
video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
Returns:
dict: Updated videos_idx with current chunk and file indices.
"""
@@ -416,7 +417,6 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
concatenate_video_files(
[dst_path, src_path],
dst_path,
compatibility_check=True,
)
# Update duration of this destination file
dst_file_durations[dst_key] = current_dst_duration + src_duration
+1 -1
View File
@@ -512,7 +512,7 @@ def compute_episode_stats(
ep_stats = {}
for key, data in episode_data.items():
if features[key]["dtype"] == "string":
if features[key]["dtype"] in {"string", "language"}:
continue
if features[key]["dtype"] in ["image", "video"]:
+37 -22
View File
@@ -34,7 +34,6 @@ from .io_utils import (
load_episodes,
load_info,
load_stats,
load_subtasks,
load_tasks,
write_info,
write_stats,
@@ -48,7 +47,7 @@ from .utils import (
is_valid_version,
update_chunk_file_indices,
)
from .video_utils import VideoEncoderConfig, get_video_info
from .video_utils import get_video_info
CODEBASE_VERSION = "v3.0"
@@ -175,7 +174,6 @@ class LeRobotDatasetMetadata:
self.info = load_info(self.root)
check_version_compatibility(self.repo_id, self._version, CODEBASE_VERSION)
self.tasks = load_tasks(self.root)
self.subtasks = load_subtasks(self.root)
self.episodes = load_episodes(self.root)
self.stats = load_stats(self.root)
@@ -318,6 +316,39 @@ class LeRobotDatasetMetadata:
"""Keys to access visual modalities (regardless of their storage method)."""
return [key for key, ft in self.features.items() if ft["dtype"] in ["video", "image"]]
@property
def has_language_columns(self) -> bool:
"""Return ``True`` if the dataset declares any language column.
Used to gate language-aware code paths (collate, render step) so
unannotated datasets keep PyTorch's default collate behavior.
"""
from .language import LANGUAGE_COLUMNS # noqa: PLC0415 (avoid circular import)
return any(col in self.features for col in LANGUAGE_COLUMNS)
@property
def tools(self) -> list[dict]:
"""OpenAI-style tool schemas declared by this dataset.
Read from ``meta/info.json["tools"]``. Returns a copy, so callers
can mutate the result safely. Falls back to
:data:`lerobot.datasets.language.DEFAULT_TOOLS` (the canonical
``say`` schema) when the dataset doesn't declare any — that way
unannotated datasets and chat-template consumers
(``apply_chat_template(messages, tools=meta.tools)``) keep
working out of the box.
Implementations live under :mod:`lerobot.tools` (one file per
tool); see ``docs/source/tools.mdx`` for the authoring guide.
"""
from .language import DEFAULT_TOOLS # noqa: PLC0415 (avoid circular import)
declared = self.info.tools
if declared:
return [dict(t) for t in declared]
return [dict(t) for t in DEFAULT_TOOLS]
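A hedged sketch of the intended consumer of this property; the repo id and tokenizer below are placeholders, and the package-level import is assumed from the ``__all__`` re-exports.

```python
from transformers import AutoTokenizer  # any chat-template-capable tokenizer
from lerobot.datasets import LeRobotDatasetMetadata  # assumed re-export

meta = LeRobotDatasetMetadata("some-user/some-dataset")  # placeholder repo id
tools = meta.tools  # info.json["tools"] when declared, else DEFAULT_TOOLS (the `say` schema)

tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")  # placeholder model
messages = [{"role": "user", "content": "Pick up the red block."}]
prompt = tokenizer.apply_chat_template(messages, tools=tools, tokenize=False)
```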
@property
def names(self) -> dict[str, list | dict]:
"""Names of the various dimensions of vector modalities."""
@@ -510,23 +541,10 @@ class LeRobotDatasetMetadata:
self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats is not None else episode_stats
write_stats(self.stats, self.root)
def update_video_info(
self,
video_key: str | None = None,
camera_encoder_config: VideoEncoderConfig | None = None,
) -> None:
"""Populate per-feature video info in ``info.json``.
def update_video_info(self, video_key: str | None = None) -> None:
"""
Warning: this function writes info from first episode videos, implicitly assuming that all videos have
been encoded the same way. Also, this means it assumes the first episode exists.
Args:
video_key: If provided, only update this video key. Otherwise update
all video keys in the dataset.
camera_encoder_config: Encoder configuration used to produce the
videos. When provided, its fields are recorded as
``video.<field>`` entries alongside the stream-derived
``video.*`` entries (see :func:`get_video_info`).
"""
if video_key is not None and video_key not in self.video_keys:
raise ValueError(f"Video key {video_key} not found in dataset")
@@ -535,9 +553,7 @@ class LeRobotDatasetMetadata:
for key in video_keys:
if not self.features[key].get("info", None):
video_path = self.root / self.video_path.format(video_key=key, chunk_index=0, file_index=0)
self.info.features[key]["info"] = get_video_info(
video_path, camera_encoder_config=camera_encoder_config
)
self.info.features[key]["info"] = get_video_info(video_path)
def update_chunk_settings(
self,
@@ -648,7 +664,6 @@ class LeRobotDatasetMetadata:
_validate_feature_names(features)
obj.tasks = None
obj.subtasks = None
obj.episodes = None
obj.stats = None
obj.info = create_empty_dataset_info(
-5
View File
@@ -295,9 +295,4 @@ class DatasetReader:
task_idx = item["task_index"].item()
item["task"] = self._meta.tasks.iloc[task_idx].name
# add subtask information if available
if "subtask_index" in self._meta.features and self._meta.subtasks is not None:
subtask_idx = item["subtask_index"].item()
item["subtask"] = self._meta.subtasks.iloc[subtask_idx].name
return item
+50 -47
View File
@@ -62,12 +62,7 @@ from .utils import (
DEFAULT_EPISODES_PATH,
update_chunk_file_indices,
)
from .video_utils import (
VideoEncoderConfig,
camera_encoder_defaults,
encode_video_frames,
get_video_info,
)
from .video_utils import encode_video_frames, get_video_info
def _load_episode_with_stats(src_dataset: LeRobotDataset, episode_idx: int) -> dict:
@@ -97,7 +92,6 @@ def delete_episodes(
episode_indices: list[int],
output_dir: str | Path | None = None,
repo_id: str | None = None,
camera_encoder_config: VideoEncoderConfig | None = None,
) -> LeRobotDataset:
"""Delete episodes from a LeRobotDataset and create a new dataset.
@@ -106,8 +100,6 @@ def delete_episodes(
episode_indices: List of episode indices to delete.
output_dir: Root directory where the edited dataset will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id. Equivalent to new_root in EditDatasetConfig.
repo_id: Edited dataset identifier. Equivalent to new_repo_id in EditDatasetConfig.
camera_encoder_config: Video encoder settings used when re-encoding video segments
(``None`` uses :func:`~lerobot.datasets.video_utils.camera_encoder_defaults`).
"""
if not episode_indices:
raise ValueError("No episodes to delete")
@@ -140,7 +132,7 @@ def delete_episodes(
video_metadata = None
if dataset.meta.video_keys:
video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping, camera_encoder_config)
video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping)
data_metadata = _copy_and_reindex_data(dataset, new_meta, episode_mapping)
@@ -162,7 +154,6 @@ def split_dataset(
dataset: LeRobotDataset,
splits: dict[str, float | list[int]],
output_dir: str | Path | None = None,
camera_encoder_config: VideoEncoderConfig | None = None,
) -> dict[str, LeRobotDataset]:
"""Split a LeRobotDataset into multiple smaller datasets.
@@ -171,8 +162,6 @@ def split_dataset(
splits: Either a dict mapping split names to episode indices, or a dict mapping
split names to fractions (must sum to <= 1.0).
output_dir: Root directory where the split datasets will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id.
camera_encoder_config: Video encoder settings used when re-encoding video segments
(``None`` uses :func:`~lerobot.datasets.video_utils.camera_encoder_defaults`).
Examples:
Split by specific episodes
@@ -233,9 +222,7 @@ def split_dataset(
video_metadata = None
if dataset.meta.video_keys:
video_metadata = _copy_and_reindex_videos(
dataset, new_meta, episode_mapping, camera_encoder_config
)
video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping)
data_metadata = _copy_and_reindex_data(dataset, new_meta, episode_mapping)
@@ -591,7 +578,8 @@ def _keep_episodes_from_video_with_av(
output_path: Path,
episodes_to_keep: list[tuple[int, int]],
fps: float,
camera_encoder_config: VideoEncoderConfig | None = None,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
) -> None:
"""Keep only specified episodes from a video file using PyAV.
@@ -605,11 +593,9 @@ def _keep_episodes_from_video_with_av(
Ranges are half-open intervals: [start_frame, end_frame), where start_frame
is inclusive and end_frame is exclusive.
fps: Frame rate of the video.
camera_encoder_config: Video encoder settings
(``None`` uses :func:`~lerobot.datasets.video_utils.camera_encoder_defaults`).
vcodec: Video codec to use for encoding.
pix_fmt: Pixel format for output video.
"""
if camera_encoder_config is None:
camera_encoder_config = camera_encoder_defaults()
from fractions import Fraction
import av
@@ -633,12 +619,12 @@ def _keep_episodes_from_video_with_av(
# Convert fps to Fraction for PyAV compatibility.
fps_fraction = Fraction(fps).limit_denominator(1000)
v_out = out.add_stream(camera_encoder_config.vcodec, rate=fps_fraction)
v_out = out.add_stream(vcodec, rate=fps_fraction)
# PyAV type stubs don't distinguish video streams from audio/subtitle streams.
v_out.width = v_in.codec_context.width
v_out.height = v_in.codec_context.height
v_out.pix_fmt = camera_encoder_config.pix_fmt
v_out.pix_fmt = pix_fmt
# Set time_base to match the frame rate for proper timestamp handling.
v_out.time_base = Fraction(1, int(fps))
@@ -701,7 +687,8 @@ def _copy_and_reindex_videos(
src_dataset: LeRobotDataset,
dst_meta: LeRobotDatasetMetadata,
episode_mapping: dict[int, int],
camera_encoder_config: VideoEncoderConfig | None = None,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
) -> dict[int, dict]:
"""Copy and filter video files, only re-encoding files with deleted episodes.
@@ -713,14 +700,10 @@ def _copy_and_reindex_videos(
src_dataset: Source dataset to copy from
dst_meta: Destination metadata object
episode_mapping: Mapping from old episode indices to new indices
camera_encoder_config: Video encoder settings used when re-encoding segments
(``None`` uses :func:`~lerobot.datasets.video_utils.camera_encoder_defaults`).
Returns:
dict mapping episode index to its video metadata (chunk_index, file_index, timestamps)
"""
if camera_encoder_config is None:
camera_encoder_config = camera_encoder_defaults()
if src_dataset.meta.episodes is None:
src_dataset.meta.episodes = load_episodes(src_dataset.meta.root)
@@ -809,7 +792,8 @@ def _copy_and_reindex_videos(
dst_video_path,
episodes_to_keep_ranges,
src_dataset.meta.fps,
camera_encoder_config,
vcodec,
pix_fmt,
)
cumulative_ts = 0.0
@@ -1280,7 +1264,11 @@ def _estimate_frame_size_via_calibration(
episode_indices: list[int],
temp_dir: Path,
fps: int,
camera_encoder_config: VideoEncoderConfig,
vcodec: str,
pix_fmt: str,
g: int,
crf: int,
fast_decode: int,
num_calibration_frames: int = 30,
) -> float:
"""Estimate MB per frame by encoding a small calibration sample.
@@ -1294,7 +1282,11 @@ def _estimate_frame_size_via_calibration(
episode_indices: List of episode indices being processed.
temp_dir: Temporary directory for calibration files.
fps: Frames per second for video encoding.
camera_encoder_config: Video encoder settings used for calibration encoding.
vcodec: Video codec (libsvtav1, h264, hevc).
pix_fmt: Pixel format (yuv420p, etc.).
g: GOP size (group of pictures).
crf: Constant Rate Factor (quality).
fast_decode: Fast decode tuning parameter.
num_calibration_frames: Number of frames to use for calibration (default: 30).
Returns:
@@ -1330,7 +1322,11 @@ def _estimate_frame_size_via_calibration(
imgs_dir=calibration_dir,
video_path=calibration_video_path,
fps=fps,
camera_encoder_config=camera_encoder_config,
vcodec=vcodec,
pix_fmt=pix_fmt,
g=g,
crf=crf,
fast_decode=fast_decode,
overwrite=True,
)
@@ -1648,7 +1644,11 @@ def convert_image_to_video_dataset(
dataset: LeRobotDataset,
output_dir: Path | None = None,
repo_id: str | None = None,
camera_encoder_config: VideoEncoderConfig | None = None,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
g: int = 2,
crf: int = 30,
fast_decode: int = 0,
episode_indices: list[int] | None = None,
num_workers: int = 4,
max_episodes_per_batch: int | None = None,
@@ -1663,8 +1663,11 @@ def convert_image_to_video_dataset(
dataset: The source LeRobot dataset with images
output_dir: Root directory where the edited dataset will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id. Equivalent to new_root in EditDatasetConfig.
repo_id: Edited dataset identifier. Equivalent to new_repo_id in EditDatasetConfig.
camera_encoder_config: Video encoder settings
(``None`` uses :func:`~lerobot.datasets.video_utils.camera_encoder_defaults`).
vcodec: Video codec (default: libsvtav1)
pix_fmt: Pixel format (default: yuv420p)
g: Group of pictures size (default: 2)
crf: Constant rate factor (default: 30)
fast_decode: Fast decode tuning (default: 0)
episode_indices: List of episode indices to convert (None = all episodes)
num_workers: Number of threads for parallel processing (default: 4)
max_episodes_per_batch: Maximum episodes per video batch to avoid memory issues (None = no limit)
@@ -1673,9 +1676,6 @@ def convert_image_to_video_dataset(
Returns:
New LeRobotDataset with images encoded as videos
"""
if camera_encoder_config is None:
camera_encoder_config = camera_encoder_defaults()
# Check that it's an image dataset
if len(dataset.meta.video_keys) > 0:
raise ValueError(
@@ -1699,10 +1699,7 @@ def convert_image_to_video_dataset(
logging.info(
f"Converting {len(episode_indices)} episodes with {len(img_keys)} cameras from {dataset.repo_id}"
)
logging.info(
f"Video codec: {camera_encoder_config.vcodec}, pixel format: {camera_encoder_config.pix_fmt}, "
f"GOP: {camera_encoder_config.g}, CRF: {camera_encoder_config.crf}"
)
logging.info(f"Video codec: {vcodec}, pixel format: {pix_fmt}, GOP: {g}, CRF: {crf}")
# Create new features dict, converting image features to video features
new_features = {}
@@ -1772,7 +1769,11 @@ def convert_image_to_video_dataset(
episode_indices=episode_indices,
temp_dir=temp_dir,
fps=fps,
camera_encoder_config=camera_encoder_config,
vcodec=vcodec,
pix_fmt=pix_fmt,
g=g,
crf=crf,
fast_decode=fast_decode,
)
logging.info(f"Processing camera: {img_key}")
@@ -1814,7 +1815,11 @@ def convert_image_to_video_dataset(
imgs_dir=imgs_dir,
video_path=video_path,
fps=fps,
camera_encoder_config=camera_encoder_config,
vcodec=vcodec,
pix_fmt=pix_fmt,
g=g,
crf=crf,
fast_decode=fast_decode,
overwrite=True,
)
@@ -1860,9 +1865,7 @@ def convert_image_to_video_dataset(
video_path = new_meta.root / new_meta.video_path.format(
video_key=img_key, chunk_index=0, file_index=0
)
new_meta.info.features[img_key]["info"] = get_video_info(
video_path, camera_encoder_config=camera_encoder_config
)
new_meta.info.features[img_key]["info"] = get_video_info(video_path)
write_info(new_meta.info, new_meta.root)
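A hedged sketch of calling the converter with the per-parameter API this diff switches to; the dataset id is a placeholder and the imports assume the package-level re-exports.

```python
from lerobot.datasets import LeRobotDataset, convert_image_to_video_dataset  # assumed re-exports

src = LeRobotDataset("some-user/image-only-dataset")  # placeholder repo id
video_ds = convert_image_to_video_dataset(
    src,
    vcodec="libsvtav1",  # default software encoder
    pix_fmt="yuv420p",
    g=2,
    crf=30,
    fast_decode=0,
    num_workers=4,
)
```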
+10 -24
View File
@@ -52,8 +52,6 @@ from .utils import (
)
from .video_utils import (
StreamingVideoEncoder,
VideoEncoderConfig,
camera_encoder_defaults,
concatenate_video_files,
encode_video_frames,
get_video_duration_in_s,
@@ -67,19 +65,14 @@ def _encode_video_worker(
episode_index: int,
root: Path,
fps: int,
camera_encoder_config: VideoEncoderConfig | None = None,
vcodec: str = "libsvtav1",
encoder_threads: int | None = None,
) -> Path:
temp_path = Path(tempfile.mkdtemp(dir=root)) / f"{video_key}_{episode_index:03d}.mp4"
fpath = DEFAULT_IMAGE_PATH.format(image_key=video_key, episode_index=episode_index, frame_index=0)
img_dir = (root / fpath).parent
encode_video_frames(
img_dir,
temp_path,
fps,
camera_encoder_config=camera_encoder_config,
encoder_threads=encoder_threads,
overwrite=True,
img_dir, temp_path, fps, vcodec=vcodec, overwrite=True, encoder_threads=encoder_threads
)
shutil.rmtree(img_dir)
return temp_path
@@ -96,22 +89,20 @@ class DatasetWriter:
self,
meta: LeRobotDatasetMetadata,
root: Path,
camera_encoder_config: VideoEncoderConfig | None,
vcodec: str,
encoder_threads: int | None,
batch_encoding_size: int,
streaming_encoder: StreamingVideoEncoder | None = None,
initial_frames: int = 0,
):
"""Initialize the writer with metadata, codec, and encoder config.
"""Initialize the writer with metadata, codec, and encoding config.
Args:
meta: Dataset metadata instance (used for feature schema, chunk
settings, and episode persistence).
root: Local dataset root directory.
camera_encoder_config: Video encoder settings applied to all cameras.
``None`` uses :func:`~lerobot.datasets.video_utils.camera_encoder_defaults`.
encoder_threads: Number of encoder threads (global). ``None``
lets the codec decide.
vcodec: Video codec for encoding (e.g. ``'libsvtav1'``, ``'h264'``).
encoder_threads: Threads per encoder instance. ``None`` for auto.
batch_encoding_size: Number of episodes to accumulate before
batch-encoding videos.
streaming_encoder: Optional pre-built :class:`StreamingVideoEncoder`
@@ -120,7 +111,7 @@ class DatasetWriter:
"""
self._meta = meta
self._root = root
self._camera_encoder_config = camera_encoder_config or camera_encoder_defaults()
self._vcodec = vcodec
self._encoder_threads = encoder_threads
self._batch_encoding_size = batch_encoding_size
self._streaming_encoder = streaming_encoder
@@ -293,7 +284,7 @@ class DatasetWriter:
episode_index,
self._root,
self._meta.fps,
self._camera_encoder_config,
self._vcodec,
self._encoder_threads,
): video_key
for video_key in self._meta.video_keys
@@ -504,7 +495,7 @@ class DatasetWriter:
# Update video info (only needed when first episode is encoded)
if episode_index == 0:
self._meta.update_video_info(video_key, camera_encoder_config=self._camera_encoder_config)
self._meta.update_video_info(video_key)
write_info(self._meta.info, self._meta.root)
metadata = {
@@ -573,12 +564,7 @@ class DatasetWriter:
def _encode_temporary_episode_video(self, video_key: str, episode_index: int) -> Path:
"""Use ffmpeg to convert frames stored as png into mp4 videos."""
return _encode_video_worker(
video_key,
episode_index,
self._root,
self._meta.fps,
self._camera_encoder_config,
self._encoder_threads,
video_key, episode_index, self._root, self._meta.fps, self._vcodec, self._encoder_threads
)
def close_writer(self) -> None:
+15 -1
View File
@@ -22,6 +22,12 @@ from PIL import Image as PILImage
from lerobot.utils.constants import DEFAULT_FEATURES
from lerobot.utils.utils import is_valid_numpy_dtype_string
from .language import (
LANGUAGE_PERSISTENT,
is_language_column,
language_events_column_feature,
language_persistent_column_feature,
)
from .utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
@@ -46,7 +52,13 @@ def get_hf_features_from_features(features: dict) -> datasets.Features:
"""
hf_features = {}
for key, ft in features.items():
if ft["dtype"] == "video":
if is_language_column(key):
hf_features[key] = (
language_persistent_column_feature()
if key == LANGUAGE_PERSISTENT
else language_events_column_feature()
)
elif ft["dtype"] == "video":
continue
elif ft["dtype"] == "image":
hf_features[key] = datasets.Image()
@@ -242,6 +254,8 @@ def validate_feature_dtype_and_shape(
return validate_feature_image_or_video(name, expected_shape, value)
elif expected_dtype == "string":
return validate_feature_string(name, value)
elif expected_dtype == "language":
return ""
else:
raise NotImplementedError(f"The feature dtype '{expected_dtype}' is not implemented yet.")
+6 -11
View File
@@ -31,10 +31,10 @@ from torchvision import transforms
from lerobot.utils.io_utils import load_json, write_json
from lerobot.utils.utils import SuppressProgressBars, flatten_dict, unflatten_dict
from .language import LANGUAGE_COLUMNS
from .utils import (
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_EPISODES_PATH,
DEFAULT_SUBTASKS_PATH,
DEFAULT_TASKS_PATH,
EPISODES_DIR,
INFO_PATH,
@@ -186,14 +186,6 @@ def load_tasks(local_dir: Path) -> pandas.DataFrame:
return tasks
def load_subtasks(local_dir: Path) -> pandas.DataFrame | None:
"""Load subtasks from subtasks.parquet if it exists."""
subtasks_path = local_dir / DEFAULT_SUBTASKS_PATH
if subtasks_path.exists():
return pd.read_parquet(subtasks_path)
return None
def write_episodes(episodes: Dataset, local_dir: Path) -> None:
"""Write episode metadata to a parquet file in the LeRobot v3.0 format.
This function writes episode-level metadata to a single parquet file.
@@ -265,11 +257,13 @@ def hf_transform_to_torch(items_dict: dict[str, list[Any]]) -> dict[str, list[to
dict: The batch with items converted to torch tensors.
"""
for key in items_dict:
if key in LANGUAGE_COLUMNS:
continue
first_item = items_dict[key][0]
if isinstance(first_item, PILImage.Image):
to_tensor = transforms.ToTensor()
items_dict[key] = [to_tensor(img) for img in items_dict[key]]
elif first_item is None:
elif first_item is None or isinstance(first_item, dict):
pass
else:
items_dict[key] = [x if isinstance(x, str) else torch.tensor(x) for x in items_dict[key]]
@@ -304,8 +298,9 @@ def item_to_torch(item: dict) -> dict:
Returns:
dict: Dictionary with all tensor-like items converted to torch.Tensor.
"""
skip_keys = {"task", *LANGUAGE_COLUMNS}
for key, val in item.items():
if isinstance(val, (np.ndarray | list)) and key not in ["task"]:
if isinstance(val, (np.ndarray | list)) and key not in skip_keys:
# Convert numpy arrays and lists to torch tensors
item[key] = torch.tensor(val)
return item
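A hedged sketch of the new skip behavior on a hand-written batch; the module path in the import is assumed.

```python
import torch

from lerobot.datasets.io_utils import hf_transform_to_torch  # assumed module path

batch = {
    "observation.state": [[0.1, 0.2], [0.3, 0.4]],
    "language_events": [
        [{"role": "user", "content": "stop", "style": "interjection",
          "camera": None, "tool_calls": None}],
        [],
    ],
}
out = hf_transform_to_torch(batch)
assert isinstance(out["observation.state"][0], torch.Tensor)
# Language columns are skipped: rows stay as plain lists of dicts so the
# language-aware collate introduced on this branch can handle them itself.
assert out["language_events"][0][0]["content"] == "stop"
```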
+240
View File
@@ -0,0 +1,240 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from typing import Literal
import datasets
import pyarrow as pa
LANGUAGE_PERSISTENT = "language_persistent"
LANGUAGE_EVENTS = "language_events"
LANGUAGE_COLUMNS = (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS)
PERSISTENT_ROW_FIELDS = ("role", "content", "style", "timestamp", "camera", "tool_calls")
EVENT_ROW_FIELDS = ("role", "content", "style", "camera", "tool_calls")
CORE_STYLES = {
"subtask",
"plan",
"memory",
"motion",
"interjection",
"vqa",
"trace",
"task_aug",
}
# Project-local styles can be registered at import time by appending to
# ``EXTENDED_STYLES`` before ``column_for_style`` is called. Anything added
# here is treated as a known style alongside ``CORE_STYLES`` for resolver
# validation. Empty by default — populate from a downstream module that
# also extends ``PERSISTENT_STYLES`` or ``EVENT_ONLY_STYLES`` to declare
# the new style's column.
EXTENDED_STYLES: set[str] = set()
STYLE_REGISTRY = CORE_STYLES | EXTENDED_STYLES
PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug"}
EVENT_ONLY_STYLES = {"interjection", "vqa", "trace"}
# Styles whose ``content`` is grounded in a specific camera view. Rows of these
# styles MUST carry a non-null ``camera`` referencing an ``observation.images.*``
# feature key. Rows of every other style MUST have ``camera=None``. ``motion``
# is intentionally NOT in this set: motion primitives are described in
# robot-frame (joint / Cartesian) terms, not pixel space, so they are
# camera-agnostic. ``trace`` is the pixel-trajectory event style and IS
# view-dependent. The ``camera`` field nevertheless lives on
# ``PERSISTENT_ROW_FIELDS`` too so the schema, validator, and resolver
# behave symmetrically across the two columns; persistent rows simply
# always have ``camera=None`` in practice today.
VIEW_DEPENDENT_STYLES = {"vqa", "trace"}
LanguageColumn = Literal["language_persistent", "language_events"]
def _json_arrow_type() -> pa.DataType:
"""Return the Arrow JSON type, falling back to ``string`` on older pyarrow."""
return pa.json_() if hasattr(pa, "json_") else pa.string()
def _json_feature() -> object:
"""Return the HF ``datasets`` JSON feature, falling back to a string value."""
return datasets.Json() if hasattr(datasets, "Json") else datasets.Value("string")
def language_persistent_row_arrow_type() -> pa.StructType:
"""Return the Arrow struct type for a single persistent language row.
Persistent rows carry their own ``timestamp`` because they represent a state
that became active at a specific moment and remains active until superseded.
"""
return pa.struct(
[
pa.field("role", pa.string(), nullable=False),
pa.field("content", pa.string(), nullable=True),
pa.field("style", pa.string(), nullable=True),
pa.field("timestamp", pa.float64(), nullable=False),
pa.field("camera", pa.string(), nullable=True),
pa.field("tool_calls", pa.list_(_json_arrow_type()), nullable=True),
]
)
def language_event_row_arrow_type() -> pa.StructType:
"""Return the Arrow struct type for a single event language row.
Event rows have no ``timestamp`` field: each event is stored on the dataset
row whose frame timestamp is the event's firing time.
"""
return pa.struct(
[
pa.field("role", pa.string(), nullable=False),
pa.field("content", pa.string(), nullable=True),
pa.field("style", pa.string(), nullable=True),
pa.field("camera", pa.string(), nullable=True),
pa.field("tool_calls", pa.list_(_json_arrow_type()), nullable=True),
]
)
def language_persistent_arrow_type() -> pa.ListType:
"""Return the Arrow list type for the ``language_persistent`` column."""
return pa.list_(language_persistent_row_arrow_type())
def language_events_arrow_type() -> pa.ListType:
"""Return the Arrow list type for the ``language_events`` column."""
return pa.list_(language_event_row_arrow_type())
def language_persistent_row_feature() -> dict[str, object]:
"""Return the HF ``datasets`` feature mapping for a persistent language row."""
return {
"role": datasets.Value("string"),
"content": datasets.Value("string"),
"style": datasets.Value("string"),
"timestamp": datasets.Value("float64"),
"camera": datasets.Value("string"),
"tool_calls": datasets.List(_json_feature()),
}
def language_event_row_feature() -> dict[str, object]:
"""Return the HF ``datasets`` feature mapping for an event language row."""
return {
"role": datasets.Value("string"),
"content": datasets.Value("string"),
"style": datasets.Value("string"),
"camera": datasets.Value("string"),
"tool_calls": datasets.List(_json_feature()),
}
def language_persistent_column_feature() -> datasets.List:
"""Return the HF ``datasets`` feature for the ``language_persistent`` column."""
return datasets.List(language_persistent_row_feature())
def language_events_column_feature() -> datasets.List:
"""Return the HF ``datasets`` feature for the ``language_events`` column."""
return datasets.List(language_event_row_feature())
def language_feature_info() -> dict[str, dict]:
"""Return the ``info["features"]`` entries for both language columns."""
return {
LANGUAGE_PERSISTENT: {"dtype": "language", "shape": (1,), "names": None},
LANGUAGE_EVENTS: {"dtype": "language", "shape": (1,), "names": None},
}
def is_language_column(key: str) -> bool:
"""Return ``True`` if ``key`` is one of the dataset's language column names."""
return key in LANGUAGE_COLUMNS
def is_view_dependent_style(style: str | None) -> bool:
"""Return ``True`` if rows of ``style`` must be tagged with a ``camera`` key."""
return style in VIEW_DEPENDENT_STYLES
def validate_camera_field(style: str | None, camera: str | None) -> None:
"""Enforce the ``camera`` invariant: required iff ``style`` is view-dependent.
Raises ``ValueError`` if a view-dependent style is missing ``camera`` or if
a non-view-dependent style carries one. Pipeline writers and the validator
should call this on every emitted row.
"""
if is_view_dependent_style(style):
if not camera:
raise ValueError(
f"Rows of view-dependent style {style!r} require a non-empty 'camera' "
f"field referencing an 'observation.images.*' feature key."
)
elif camera is not None:
raise ValueError(f"Rows of style {style!r} must have camera=None; got camera={camera!r}.")
# --- Tool registry --------------------------------------------------------
# Tools declared on a dataset live in ``meta/info.json["tools"]`` as a list
# of OpenAI-style function schemas. The runtime / training stack reads them
# through :class:`LeRobotDatasetMetadata.tools` (with these constants as
# fallback when the dataset doesn't declare any). Implementations live
# under :mod:`lerobot.tools` (one file per tool); see
# ``docs/source/tools.mdx`` for the authoring guide.
SAY_TOOL_SCHEMA: dict = {
"type": "function",
"function": {
"name": "say",
"description": "Speak a short utterance to the user via the TTS executor.",
"parameters": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "The verbatim text to speak.",
}
},
"required": ["text"],
},
},
}
"""Canonical schema for the ``say`` tool emitted by the steerable
annotation pipeline (PR 2 Module 2). Single source of truth: PR 2's
writer, PR 3's runtime tool registry, and the dataset visualizer all
import this constant rather than duplicating the dict."""
DEFAULT_TOOLS: list[dict] = [SAY_TOOL_SCHEMA]
"""Fallback tools list. Returned by ``LeRobotDatasetMetadata.tools``
when ``meta/info.json["tools"]`` is unset, so unannotated datasets and
chat-template consumers (``apply_chat_template(messages, tools=...)``)
keep working out of the box."""
def column_for_style(style: str | None) -> LanguageColumn:
"""Map a language style to the column where rows of that style are stored.
Styles in :data:`PERSISTENT_STYLES` route to :data:`LANGUAGE_PERSISTENT`.
Styles in :data:`EVENT_ONLY_STYLES` and the implicit ``None`` style route
to :data:`LANGUAGE_EVENTS`.
"""
if style is None:
return LANGUAGE_EVENTS
if style in PERSISTENT_STYLES:
return LANGUAGE_PERSISTENT
if style in EVENT_ONLY_STYLES:
return LANGUAGE_EVENTS
raise ValueError(f"Unknown language style: {style!r}")
+543
View File
@@ -0,0 +1,543 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import copy
import hashlib
import re
from collections.abc import Sequence
from typing import Any
from lerobot.configs.recipe import DEFAULT_BINDINGS, PLACEHOLDER_RE, TrainingRecipe
from .language import LANGUAGE_PERSISTENT, column_for_style
LanguageRow = dict[str, Any]
RenderedMessages = dict[str, list[Any]]
_RESOLVER_RE = re.compile(r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*)\((?P<args>.*)\)$")
def active_at(
t: float,
*,
persistent: Sequence[LanguageRow],
style: str | None = None,
role: str | None = None,
tool_name: str | None = None,
camera: str | None = None,
) -> LanguageRow | None:
"""Return the persistent row of ``style`` that is active at time ``t``.
A persistent row is "active" at ``t`` when its own ``timestamp`` is the
most recent one ``<= t`` for the given ``style``/``role``/``tool_name``/
``camera`` selector. Only valid for persistent styles.
"""
_validate_persistent_resolver("active_at", style)
matches = [
row
for row in _matching_rows(persistent, style=style, role=role, tool_name=tool_name, camera=camera)
if _timestamp(row) <= t
]
if not matches:
return None
latest_ts = max(_timestamp(row) for row in matches)
return _select_one(
[row for row in matches if _timestamp(row) == latest_ts],
style=style,
role=role,
tool_name=tool_name,
camera=camera,
)
EMITTED_AT_TOLERANCE_S = 0.1
"""Half-window for matching persistent rows to a frame timestamp in
``emitted_at``. Persistent timestamps come from parquet (float64) and ``t``
is also a float64 from parquet, so in the ideal hot path an exact match
would suffice, but any caller that derives ``t`` arithmetically (e.g.
``frame_idx / fps``) breaks bit-equality. A 0.1 s tolerance covers
common arithmetic drift without admitting frames that are visibly far
apart at typical control rates (30-100 Hz)."""
def emitted_at(
t: float,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow],
style: str | None = None,
role: str | None = None,
tool_name: str | None = None,
camera: str | None = None,
) -> LanguageRow | None:
"""Return the row of ``style`` emitted at exactly time ``t``.
For persistent styles, this matches persistent rows whose own ``timestamp``
is within ``EMITTED_AT_TOLERANCE_S`` of ``t`` (see that constant for why
we use a tolerance instead of bit-equality). For event styles, the
``events`` list is assumed to come from the dataset row at frame ``t``
(event rows carry no timestamp of their own), so all matching event rows
are considered emitted at ``t``. ``camera`` filters by the row's
``camera`` field, which is required to disambiguate when multiple view-dependent
rows share ``(t, role)`` across cameras.
"""
if column_for_style(style) == LANGUAGE_PERSISTENT:
matches = [
row
for row in _matching_rows(persistent, style=style, role=role, tool_name=tool_name, camera=camera)
if abs(_timestamp(row) - t) <= EMITTED_AT_TOLERANCE_S
]
else:
matches = _matching_rows(events, style=style, role=role, tool_name=tool_name, camera=camera)
return _select_one(matches, style=style, role=role, tool_name=tool_name, camera=camera)
def nth_prev(
t: float,
*,
persistent: Sequence[LanguageRow],
style: str | None = None,
offset: int = 1,
role: str | None = None,
tool_name: str | None = None,
camera: str | None = None,
) -> LanguageRow | None:
"""Return the persistent row that was active ``offset`` steps before ``t``.
Walks back through chronologically sorted persistent rows of ``style``
(filtered by optional ``role``/``tool_name``/``camera``) and returns the
one ``offset`` positions before the row active at ``t``. Only valid for
persistent styles.
"""
return _nth_relative("nth_prev", t, persistent, style, -offset, role, tool_name, camera)
def nth_next(
t: float,
*,
persistent: Sequence[LanguageRow],
style: str | None = None,
offset: int = 1,
role: str | None = None,
tool_name: str | None = None,
camera: str | None = None,
) -> LanguageRow | None:
"""Return the persistent row that becomes active ``offset`` steps after ``t``.
Walks forward through chronologically sorted persistent rows of ``style``
(filtered by optional ``role``/``tool_name``/``camera``) and returns the
one ``offset`` positions after the row active at ``t``. Only valid for
persistent styles.
"""
return _nth_relative("nth_next", t, persistent, style, offset, role, tool_name, camera)
def render_sample(
*,
recipe: TrainingRecipe,
persistent: Sequence[LanguageRow] | None,
events: Sequence[LanguageRow] | None,
t: float,
sample_idx: int,
task: str | None = None,
dataset_ctx: Any | None = None,
) -> RenderedMessages | None:
"""Render the chat-style messages for a single dataset sample.
Resolves the recipe's bindings against ``persistent`` and ``events`` rows
at frame timestamp ``t``, then expands the recipe's message templates.
Returns ``None`` if the resolved sample contains no target message.
"""
persistent_rows = _normalize_rows(persistent or [])
event_rows = _normalize_rows(events or [])
selected_recipe = _select_recipe(recipe, sample_idx)
bindings = _resolve_bindings(
selected_recipe,
persistent=persistent_rows,
events=event_rows,
t=t,
sample_idx=sample_idx,
task=task,
dataset_ctx=dataset_ctx,
)
return _render_message_recipe(selected_recipe, bindings)
def _select_recipe(recipe: TrainingRecipe, sample_idx: int) -> TrainingRecipe:
"""Pick a deterministic blend component for ``sample_idx`` (or return ``recipe``)."""
if recipe.blend is None:
return recipe
total_weight = sum(component.weight or 0.0 for component in recipe.blend.values())
if total_weight <= 0:
raise ValueError("Blend weights must sum to a positive value.")
digest = hashlib.blake2b(str(sample_idx).encode(), digest_size=8).digest()
draw = int.from_bytes(digest, "big") / 2**64 * total_weight
cumulative = 0.0
last_component: TrainingRecipe | None = None
for component in recipe.blend.values():
last_component = component
cumulative += component.weight or 0.0
if draw < cumulative:
return component
assert last_component is not None
return last_component
def _resolve_bindings(
recipe: TrainingRecipe,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow],
t: float,
sample_idx: int,
task: str | None,
dataset_ctx: Any | None,
) -> dict[str, LanguageRow | str | None]:
"""Resolve every binding in ``recipe`` (plus ``task``) at time ``t``."""
bindings: dict[str, LanguageRow | str | None] = {
"task": _resolve_task(task, dataset_ctx, persistent=persistent, sample_idx=sample_idx),
}
specs = {**DEFAULT_BINDINGS, **(recipe.bindings or {})}
for name, spec in specs.items():
bindings[name] = _resolve_spec(spec, persistent=persistent, events=events, t=t)
return bindings
def _resolve_task(
task: str | None,
dataset_ctx: Any | None,
*,
persistent: Sequence[LanguageRow] = (),
sample_idx: int = 0,
) -> str | None:
"""Return the task string for ``sample_idx``.
Resolution order:
1. Explicit ``task`` override (caller-supplied) wins.
2. If ``persistent`` contains rows of style ``task_aug`` (role=user),
deterministically pick one by ``sample_idx`` so each frame of an
episode rotates through the available rephrasings across an epoch.
This realizes Xiao 2022 / CAST-style task-prompt diversity without
changing ``meta/tasks.parquet`` and without forcing recipes to opt
in: ``${task}`` automatically picks a rephrasing when one exists,
and falls back to the canonical task otherwise. Recipes that want
the literal canonical task can override the binding.
3. Otherwise read the canonical task from ``dataset_ctx`` (which is
backed by ``meta/tasks.parquet``).
"""
if task is not None:
return task
aug_rows = [r for r in persistent if r.get("style") == "task_aug" and r.get("role") == "user"]
if aug_rows:
# Deterministic, blake2b-based pick keyed on sample_idx so the
# rotation is reproducible across runs (Python's built-in ``hash``
# is process-randomized).
digest = hashlib.blake2b(f"task_aug:{sample_idx}".encode(), digest_size=8).digest()
idx = int.from_bytes(digest, "big") % len(aug_rows)
chosen = aug_rows[idx].get("content")
if chosen:
return str(chosen)
if dataset_ctx is None:
return None
if isinstance(dataset_ctx, dict):
return dataset_ctx.get("task")
return getattr(dataset_ctx, "task", None)
def _resolve_spec(
spec: str,
*,
persistent: Sequence[LanguageRow],
events: Sequence[LanguageRow],
t: float,
) -> LanguageRow | None:
"""Parse a single binding's resolver expression and dispatch to its function."""
match = _RESOLVER_RE.match(spec.strip())
if match is None:
raise ValueError(f"Invalid resolver expression: {spec!r}")
name = match.group("name")
kwargs = _parse_resolver_args(match.group("args"))
kwargs.pop("t_arg", None)
if name == "emitted_at":
return emitted_at(t, persistent=persistent, events=events, **kwargs)
if name == "active_at":
return active_at(t, persistent=persistent, **kwargs)
if name == "nth_prev":
return nth_prev(t, persistent=persistent, **kwargs)
if name == "nth_next":
return nth_next(t, persistent=persistent, **kwargs)
raise ValueError(f"Unknown language resolver: {name!r}")
def _parse_resolver_args(args: str) -> dict[str, Any]:
"""Parse a comma-separated resolver argument list into a kwargs dict."""
kwargs: dict[str, Any] = {}
if not args.strip():
return kwargs
parts = [part.strip() for part in args.split(",") if part.strip()]
for part in parts:
if part == "t":
kwargs["t_arg"] = True
continue
if "=" not in part:
raise ValueError(f"Invalid resolver argument: {part!r}")
key, value = (item.strip() for item in part.split("=", 1))
if key == "offset":
kwargs[key] = int(value)
else:
kwargs[key] = value.strip("\"'")
return kwargs
def _render_message_recipe(
recipe: TrainingRecipe,
bindings: dict[str, LanguageRow | str | None],
) -> RenderedMessages | None:
"""Expand ``recipe.messages`` into rendered chat messages using ``bindings``."""
assert recipe.messages is not None
messages: list[dict[str, Any]] = []
streams: list[str | None] = []
target_indices: list[int] = []
for turn in recipe.messages:
if turn.if_present is not None and bindings.get(turn.if_present) is None:
continue
message = {"role": turn.role}
if turn.content is not None:
message["content"] = _render_content(turn.content, bindings)
if turn.tool_calls_from is not None:
row = bindings.get(turn.tool_calls_from)
tool_calls = row.get("tool_calls") if isinstance(row, dict) else None
if tool_calls:
message["tool_calls"] = copy.deepcopy(tool_calls)
message_idx = len(messages)
messages.append(message)
streams.append(turn.stream)
if turn.target:
target_indices.append(message_idx)
if not target_indices:
return None
rendered = {
"messages": messages,
"message_streams": streams,
"target_message_indices": target_indices,
}
_validate_rendered(rendered)
return rendered
def _render_content(
content: str | list[dict[str, Any]],
bindings: dict[str, LanguageRow | str | None],
) -> str | list[dict[str, Any]]:
"""Substitute bindings into a string or each string field of multimodal blocks."""
if isinstance(content, str):
return _substitute(content, bindings)
rendered_blocks = []
for block in content:
rendered_block = copy.deepcopy(block)
for key, value in rendered_block.items():
if isinstance(value, str):
rendered_block[key] = _substitute(value, bindings)
rendered_blocks.append(rendered_block)
return rendered_blocks
def _substitute(template: str, bindings: dict[str, LanguageRow | str | None]) -> str:
"""Replace ``${name}`` placeholders in ``template`` with their bound values."""
def replace(match: re.Match[str]) -> str:
"""Resolve a single ``${name}`` match to its bound string value."""
name = match.group(1)
if name not in bindings:
raise ValueError(f"Unknown template binding: {name!r}")
value = bindings[name]
if value is None:
return ""
if isinstance(value, dict):
content = value.get("content")
return "" if content is None else str(content)
return str(value)
return PLACEHOLDER_RE.sub(replace, template)
def _validate_rendered(rendered: RenderedMessages) -> None:
"""Sanity-check the rendered output for stream/target alignment."""
messages = rendered["messages"]
streams = rendered["message_streams"]
target_indices = rendered["target_message_indices"]
if len(streams) != len(messages):
raise ValueError("message_streams must be aligned with messages.")
if not target_indices:
raise ValueError("Rendered samples must contain at least one target message.")
for idx in target_indices:
if idx < 0 or idx >= len(messages):
raise ValueError(f"Target message index {idx} is out of bounds.")
# ``stream`` is enforced non-None at MessageTurn construction time
# (see ``MessageTurn.__post_init__``), so a missing stream here would
# mean the dataclass invariant was bypassed; no need to re-check.
def _nth_relative(
name: str,
t: float,
persistent: Sequence[LanguageRow],
style: str | None,
offset: int,
role: str | None,
tool_name: str | None,
camera: str | None,
) -> LanguageRow | None:
"""Shared body for ``nth_prev`` / ``nth_next`` with signed ``offset``."""
_validate_persistent_resolver(name, style)
if abs(offset) < 1:
raise ValueError(f"{name} offset must be non-zero.")
rows = sorted(
_matching_rows(persistent, style=style, role=role, tool_name=tool_name, camera=camera),
key=_row_sort_key,
)
if not rows:
return None
anchor_idx = None
for idx, row in enumerate(rows):
if _timestamp(row) <= t:
anchor_idx = idx
else:
break
target_idx = (offset - 1 if offset > 0 else None) if anchor_idx is None else anchor_idx + offset
if target_idx is None or target_idx < 0 or target_idx >= len(rows):
return None
return rows[target_idx]
def _validate_persistent_resolver(name: str, style: str | None) -> None:
"""Reject calls with missing or event-only ``style`` for persistent resolvers."""
if style is None:
raise ValueError(f"{name} requires a persistent style.")
if column_for_style(style) != LANGUAGE_PERSISTENT:
raise ValueError(f"{name} cannot be used with event-only style {style!r}.")
def _matching_rows(
rows: Sequence[LanguageRow],
*,
style: str | None,
role: str | None,
tool_name: str | None,
camera: str | None,
) -> list[LanguageRow]:
"""Return ``rows`` filtered by optional ``style``/``role``/``tool_name``/``camera`` selectors."""
return [
row
for row in rows
if (style is None or row.get("style") == style)
and (role is None or row.get("role") == role)
and (tool_name is None or _row_has_tool_name(row, tool_name))
and (camera is None or row.get("camera") == camera)
]
def _select_one(
rows: Sequence[LanguageRow],
*,
style: str | None,
role: str | None,
tool_name: str | None,
camera: str | None,
) -> LanguageRow | None:
"""Return the single matching row, or raise if the resolver is ambiguous.
Multiple matches always raise, even when the caller already passed
some selectors, because remaining ambiguity means the data has
several rows that look identical to the resolver and the caller
needs to pin down a specific one (e.g. add ``camera=...`` for VQA
rows shared across cameras).
"""
if not rows:
return None
if len(rows) > 1:
raise ValueError(
f"Ambiguous resolver for style={style!r} role={role!r} "
f"tool_name={tool_name!r} camera={camera!r}: {len(rows)} matching rows. "
f"Add a selector that distinguishes them."
)
return rows[0]
def _row_sort_key(row: LanguageRow) -> tuple[float, str, str]:
"""Stable sort key for both persistent and event rows.
Event rows lack ``timestamp`` (it is implicit in the frame), so default
to ``0.0``; within a single frame all event rows share the same sort
bucket and are tiebroken by ``(style, role)``.
"""
timestamp = row.get("timestamp")
ts = (
float(timestamp.item() if hasattr(timestamp, "item") else timestamp) if timestamp is not None else 0.0
)
return (ts, row.get("style") or "", row.get("role") or "")
def _timestamp(row: LanguageRow) -> float:
"""Extract a row's ``timestamp`` as a Python float (unwrapping numpy scalars)."""
value = row["timestamp"]
return float(value.item() if hasattr(value, "item") else value)
def _row_has_tool_name(row: LanguageRow, tool_name: str) -> bool:
"""Return ``True`` if any of the row's tool calls invokes ``tool_name``."""
for tool_call in row.get("tool_calls") or []:
if isinstance(tool_call, str):
continue
function = tool_call.get("function") if isinstance(tool_call, dict) else None
if isinstance(function, dict) and function.get("name") == tool_name:
return True
return False
def _normalize_rows(rows: Sequence[Any]) -> list[LanguageRow]:
"""Convert pyarrow scalars / mappings into a fresh list of plain dict rows."""
normalized = []
for row in rows:
if row is None:
continue
if hasattr(row, "as_py"):
row = row.as_py()
if not isinstance(row, dict):
raise TypeError(f"Language rows must be dictionaries, got {type(row).__name__}.")
normalized.append(dict(row))
return normalized
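A hedged, self-contained sketch of the resolver semantics on hand-written rows; these rows are not dataset output, and the module name in the import is assumed since the file path is not shown in the diff.

```python
from lerobot.datasets.render import active_at, emitted_at, nth_prev  # module name assumed

persistent = [
    {"role": "assistant", "content": "reach for the mug", "style": "subtask",
     "timestamp": 0.0, "camera": None, "tool_calls": None},
    {"role": "assistant", "content": "pour the coffee", "style": "subtask",
     "timestamp": 4.0, "camera": None, "tool_calls": None},
]
events = [
    {"role": "user", "content": "careful, it is hot", "style": "interjection",
     "camera": None, "tool_calls": None},
]

active_at(5.0, persistent=persistent, style="subtask")   # -> the "pour the coffee" row
nth_prev(5.0, persistent=persistent, style="subtask")    # -> the "reach for the mug" row
emitted_at(4.02, persistent=persistent, events=[], style="subtask")   # within the 0.1 s tolerance
emitted_at(5.0, persistent=[], events=events, style="interjection")   # event rows are per-frame
```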
+41 -38
View File
@@ -36,8 +36,8 @@ from .utils import (
)
from .video_utils import (
StreamingVideoEncoder,
VideoEncoderConfig,
get_safe_default_video_backend,
get_safe_default_codec,
resolve_vcodec,
)
logger = logging.getLogger(__name__)
@@ -58,10 +58,10 @@ class LeRobotDataset(torch.utils.data.Dataset):
video_backend: str | None = None,
return_uint8: bool = False,
batch_encoding_size: int = 1,
camera_encoder_config: VideoEncoderConfig | None = None,
encoder_threads: int | None = None,
vcodec: str = "libsvtav1",
streaming_encoding: bool = False,
encoder_queue_maxsize: int = 30,
encoder_threads: int | None = None,
):
"""
2 modes are available for instantiating this class, depending on 2 different use cases:
@@ -177,15 +177,16 @@ class LeRobotDataset(torch.utils.data.Dataset):
You can also use the 'pyav' decoder used by Torchvision, which used to be the default option, or 'video_reader' which is another decoder of Torchvision.
batch_encoding_size (int, optional): Number of episodes to accumulate before batch encoding videos.
Set to 1 for immediate encoding (default), or higher for batched encoding. Defaults to 1.
camera_encoder_config (VideoEncoderConfig | None, optional): Video encoder settings for cameras
(codec, quality, etc.). When ``None``, :func:`~lerobot.datasets.video_utils.camera_encoder_defaults`
is used by the writer.
encoder_threads (int | None, optional): Number of encoder threads (global). ``None`` lets the
codec decide.
vcodec (str, optional): Video codec for encoding videos during recording. Options: 'h264', 'hevc',
'libsvtav1', 'auto', or hardware-specific codecs like 'h264_videotoolbox', 'h264_nvenc'.
Defaults to 'libsvtav1'. Use 'auto' to auto-detect the best available hardware encoder.
streaming_encoding (bool, optional): If True, encode video frames in real-time during capture
instead of writing PNG images first. This makes save_episode() near-instant. Defaults to False.
encoder_queue_maxsize (int, optional): Maximum number of frames to buffer per camera when using
streaming encoding. Defaults to 30 (~1s at 30fps).
encoder_threads (int | None, optional): Number of threads per encoder instance. None lets the
codec auto-detect (default). Lower values reduce CPU usage per encoder. Maps to 'lp' (via svtav1-params) for
libsvtav1 and 'threads' for h264/hevc.
Note:
Write-mode parameters (``streaming_encoding``, ``batch_encoding_size``) passed to
@@ -201,9 +202,10 @@ class LeRobotDataset(torch.utils.data.Dataset):
self.episodes = episodes
self.tolerance_s = tolerance_s
self.revision = revision if revision else CODEBASE_VERSION
self._video_backend = video_backend if video_backend else get_safe_default_video_backend()
self._video_backend = video_backend if video_backend else get_safe_default_codec()
self._return_uint8 = return_uint8
self._batch_encoding_size = batch_encoding_size
self._vcodec = resolve_vcodec(vcodec)
self._encoder_threads = encoder_threads
if self._requested_root is not None:
@@ -249,15 +251,12 @@ class LeRobotDataset(torch.utils.data.Dataset):
streaming_enc = None
if streaming_encoding and len(self.meta.video_keys) > 0:
streaming_enc = self._build_streaming_encoder(
self.meta.fps,
camera_encoder_config,
encoder_queue_maxsize,
encoder_threads,
self.meta.fps, self._vcodec, encoder_queue_maxsize, encoder_threads
)
self.writer = DatasetWriter(
meta=self.meta,
root=self.root,
camera_encoder_config=camera_encoder_config,
vcodec=self._vcodec,
encoder_threads=encoder_threads,
batch_encoding_size=batch_encoding_size,
streaming_encoder=streaming_enc,
@@ -299,13 +298,17 @@ class LeRobotDataset(torch.utils.data.Dataset):
@staticmethod
def _build_streaming_encoder(
fps: int,
camera_encoder_config: VideoEncoderConfig | None,
vcodec: str,
encoder_queue_maxsize: int,
encoder_threads: int | None,
) -> StreamingVideoEncoder:
return StreamingVideoEncoder(
fps=fps,
camera_encoder_config=camera_encoder_config,
vcodec=vcodec,
pix_fmt="yuv420p",
g=2,
crf=30,
preset=None,
queue_maxsize=encoder_queue_maxsize,
encoder_threads=encoder_threads,
)
@@ -622,7 +625,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
image_writer_threads: int = 0,
video_backend: str | None = None,
batch_encoding_size: int = 1,
camera_encoder_config: VideoEncoderConfig | None = None,
vcodec: str = "libsvtav1",
metadata_buffer_size: int = 10,
streaming_encoding: bool = False,
encoder_queue_maxsize: int = 30,
@@ -653,20 +656,20 @@ class LeRobotDataset(torch.utils.data.Dataset):
video_backend: Video decoding backend (used when reading back).
batch_encoding_size: Number of episodes to accumulate before
batch-encoding videos. ``1`` means encode immediately.
camera_encoder_config: Video encoder settings for cameras (codec, quality, etc.).
When ``None``, :func:`~lerobot.datasets.video_utils.camera_encoder_defaults` is used.
encoder_threads: Number of encoder threads (global). ``None``
lets the codec decide.
vcodec: Video codec for encoding. Options include ``'libsvtav1'``,
``'h264'``, ``'hevc'``, ``'auto'``.
metadata_buffer_size: Number of episode metadata records to buffer
before flushing to parquet.
streaming_encoding: If ``True``, encode video frames in real-time
during capture instead of writing images first.
encoder_queue_maxsize: Max buffered frames per camera when using
streaming encoding.
encoder_threads: Threads per encoder instance. ``None`` for auto.
Returns:
A new :class:`LeRobotDataset` in write mode.
"""
vcodec = resolve_vcodec(vcodec)
obj = cls.__new__(cls)
obj.meta = LeRobotDatasetMetadata.create(
repo_id=repo_id,
@@ -687,23 +690,23 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.image_transforms = None
obj.delta_timestamps = None
obj.episodes = None
obj._video_backend = video_backend if video_backend is not None else get_safe_default_video_backend()
obj._video_backend = video_backend if video_backend is not None else get_safe_default_codec()
obj._return_uint8 = False
obj._batch_encoding_size = batch_encoding_size
obj._vcodec = vcodec
obj._encoder_threads = encoder_threads
# Reader is lazily created on first access (write-only mode)
obj.reader = None
# Create writer
streaming_enc = None
if streaming_encoding and len(obj.meta.video_keys) > 0:
streaming_enc = cls._build_streaming_encoder(
fps, camera_encoder_config, encoder_queue_maxsize, encoder_threads
)
streaming_enc = cls._build_streaming_encoder(fps, vcodec, encoder_queue_maxsize, encoder_threads)
obj.writer = DatasetWriter(
meta=obj.meta,
root=obj.root,
camera_encoder_config=camera_encoder_config,
vcodec=vcodec,
encoder_threads=encoder_threads,
batch_encoding_size=batch_encoding_size,
streaming_encoder=streaming_enc,
@@ -726,12 +729,12 @@ class LeRobotDataset(torch.utils.data.Dataset):
force_cache_sync: bool = False,
video_backend: str | None = None,
batch_encoding_size: int = 1,
camera_encoder_config: VideoEncoderConfig | None = None,
encoder_threads: int | None = None,
vcodec: str = "libsvtav1",
image_writer_processes: int = 0,
image_writer_threads: int = 0,
streaming_encoding: bool = False,
encoder_queue_maxsize: int = 30,
encoder_threads: int | None = None,
) -> "LeRobotDataset":
"""Resume recording on an existing dataset.
@@ -754,15 +757,13 @@ class LeRobotDataset(torch.utils.data.Dataset):
video_backend: Video decoding backend for reading back data.
batch_encoding_size: Number of episodes to accumulate before
batch-encoding videos.
camera_encoder_config: Video encoder settings for cameras (codec, quality, etc.).
When ``None``, :func:`~lerobot.datasets.video_utils.camera_encoder_defaults` is used.
encoder_threads: Number of encoder threads (global). ``None``
lets the codec decide.
vcodec: Video codec for encoding.
image_writer_processes: Subprocesses for async image writing.
image_writer_threads: Threads for async image writing.
streaming_encoding: If ``True``, encode video in real-time during
capture.
encoder_queue_maxsize: Max buffered frames per camera for streaming.
encoder_threads: Threads per encoder instance. ``None`` for auto.
Returns:
A :class:`LeRobotDataset` in write mode, ready to append episodes.
@@ -773,6 +774,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
"Writing into the revision-safe Hub snapshot cache (used when root=None) would corrupt "
"the shared cache. Please provide a local directory path."
)
vcodec = resolve_vcodec(vcodec)
obj = cls.__new__(cls)
obj.repo_id = repo_id
obj._requested_root = Path(root)
@@ -781,9 +783,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.image_transforms = None
obj.delta_timestamps = None
obj.episodes = None
obj._video_backend = video_backend if video_backend else get_safe_default_video_backend()
obj._video_backend = video_backend if video_backend else get_safe_default_codec()
obj._return_uint8 = False
obj._batch_encoding_size = batch_encoding_size
obj._vcodec = vcodec
obj._encoder_threads = encoder_threads
if obj._requested_root is not None:
obj._requested_root.mkdir(exist_ok=True, parents=True)
@@ -792,22 +796,21 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.meta = LeRobotDatasetMetadata(
obj.repo_id, obj._requested_root, obj.revision, force_cache_sync=force_cache_sync
)
obj._encoder_threads = encoder_threads
obj.root = obj.meta.root
# Reader is lazily created on first access (write-only mode)
obj.reader = None
# Create writer for appending
streaming_enc = None
if streaming_encoding and len(obj.meta.video_keys) > 0:
streaming_enc = cls._build_streaming_encoder(
obj.meta.fps, camera_encoder_config, encoder_queue_maxsize, encoder_threads
obj.meta.fps, vcodec, encoder_queue_maxsize, encoder_threads
)
obj.writer = DatasetWriter(
meta=obj.meta,
root=obj.root,
camera_encoder_config=camera_encoder_config,
vcodec=vcodec,
encoder_threads=encoder_threads,
batch_encoding_size=batch_encoding_size,
streaming_encoder=streaming_enc,
-181
@@ -1,181 +0,0 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyAV-based compatibility checks for :class:`VideoEncoderConfig`.
Centralises all :mod:`av` introspection of the bundled FFmpeg build.
Checks degrade to a no-op when the target codec isn't available locally.
"""
from __future__ import annotations
import functools
import logging
from typing import Any
import av
from lerobot.configs.video import VideoEncoderConfig
logger = logging.getLogger(__name__)
FFMPEG_NUMERIC_OPTION_TYPES = ("INT", "INT64", "UINT64", "FLOAT", "DOUBLE")
FFMPEG_INTEGER_OPTION_TYPES = ("INT", "INT64", "UINT64")
@functools.cache
def get_codec(vcodec: str) -> av.codec.Codec | None:
"""PyAV write-mode ``Codec`` for *vcodec*, or ``None`` if unavailable."""
try:
return av.codec.Codec(vcodec, "w")
except Exception:
return None
@functools.cache
def _get_codec_options_by_name(vcodec: str) -> dict[str, av.option.Option]:
"""Private-option name → PyAV ``Option`` for *vcodec* (empty if unavailable)."""
codec = get_codec(vcodec)
if codec is None:
return {}
return {opt.name: opt for opt in codec.descriptor.options}
@functools.cache
def _get_codec_video_formats(vcodec: str) -> tuple[str, ...]:
"""Pixel formats accepted by *vcodec* in PyAV's preferred order (empty if unknown)."""
codec = get_codec(vcodec)
if codec is None:
return ()
return tuple(fmt.name for fmt in (codec.video_formats or []))
def detect_available_encoders_pyav(encoders: list[str] | str) -> list[str]:
"""Return the subset of *encoders* available as video encoders in the local FFmpeg build.
Each name is probed directly via :func:`get_codec`; input order is preserved.
"""
if isinstance(encoders, str):
encoders = [encoders]
available: list[str] = []
for name in encoders:
codec = get_codec(name)
if codec is not None and codec.type == "video":
available.append(name)
else:
logger.debug("encoder '%s' not available as video encoder", name)
return available
def _check_option_value(vcodec: str, label: str, value: Any, opt: av.option.Option) -> None:
"""Range-check numeric *value* and choice-check string *value* against *opt*."""
type_name = opt.type.name
if type_name in FFMPEG_NUMERIC_OPTION_TYPES:
if isinstance(value, bool):
raise ValueError(
f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
)
elif isinstance(value, str):
try:
num_val = float(value)
except ValueError as e:
raise ValueError(
f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
) from e
elif isinstance(value, (float, int)):
num_val = value
else:
raise ValueError(
f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
)
# Check integer type compatibility
if type_name in FFMPEG_INTEGER_OPTION_TYPES and not num_val.is_integer():
raise ValueError(
f"{label}={num_val!r} must be an integer for codec {vcodec!r} "
f"(FFmpeg option {opt.name!r} is {type_name}); float values are not allowed."
)
# Check numeric range compatibility
lo, hi = float(opt.min), float(opt.max)
if lo < hi and not (lo <= num_val <= hi):
raise ValueError(
f"{label}={num_val} is out of range for codec {vcodec!r}; must be in [{lo}, {hi}]"
)
elif type_name == "STRING":
if isinstance(value, bool):
raise ValueError(f"{label}={value!r} is not a valid string value for codec {vcodec!r}.")
if isinstance(value, str):
str_val = value
elif isinstance(value, (int, float)):
str_val = str(value)
else:
raise ValueError(f"{label}={value!r} has unsupported type for STRING option on codec {vcodec!r}")
# Check string choice compatibility
choices = [c.name for c in (opt.choices or [])]
if choices and str_val not in choices:
raise ValueError(
f"{label}={str_val!r} is not a supported choice for codec "
f"{vcodec!r}; valid choices: {choices}"
)
else:
return
def _check_pixel_format(vcodec: str, pix_fmt: str) -> None:
formats = _get_codec_video_formats(vcodec)
if formats and pix_fmt not in formats:
raise ValueError(
f"pix_fmt={pix_fmt!r} is not supported by codec {vcodec!r}; "
f"supported pixel formats: {list(formats)}"
)
def _check_codec_options(vcodec: str, codec_options: dict[str, Any], config: VideoEncoderConfig) -> None:
"""Validate merged encoder options (typed) against the codec's published AVOptions."""
supported_options = _get_codec_options_by_name(vcodec)
for key, value in codec_options.items():
# GOP size is not a codec-specific option, it has to be validated separately.
if key == "g":
if isinstance(value, bool) or not isinstance(value, int) or value < 1:
raise ValueError(f"g={value!r} must be a positive integer for codec {vcodec!r}")
continue
if key not in supported_options:
continue
opt = supported_options[key]
label = f"extra_options[{key!r}]" if key in config.extra_options else key
_check_option_value(vcodec, label, value, opt)
def check_video_encoder_config_pyav(config: VideoEncoderConfig) -> None:
"""Verify *config* is compatible with the bundled FFmpeg build.
Checks pixel format, abstract tuning-field compatibility, and each merged
encoder option from :meth:`~lerobot.configs.video.VideoEncoderConfig.get_codec_options`
against PyAV (including numeric ``extra_options`` present in that dict).
No-op when ``config.vcodec`` isn't in the local FFmpeg build.
Raises:
ValueError: on the first incompatibility encountered.
"""
vcodec = config.vcodec
options = _get_codec_options_by_name(vcodec)
if not options:
raise ValueError(f"Codec {vcodec!r} is not available in the bundled FFmpeg build")
_check_pixel_format(config.vcodec, config.pix_fmt)
_check_codec_options(config.vcodec, config.get_codec_options(), config)
+7 -1
@@ -88,7 +88,6 @@ VIDEO_DIR = "videos"
CHUNK_FILE_PATTERN = "chunk-{chunk_index:03d}/file-{file_index:03d}"
DEFAULT_TASKS_PATH = "meta/tasks.parquet"
DEFAULT_SUBTASKS_PATH = "meta/subtasks.parquet"
DEFAULT_EPISODES_PATH = EPISODES_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_DATA_PATH = DATA_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_VIDEO_PATH = VIDEO_DIR + "/{video_key}/" + CHUNK_FILE_PATTERN + ".mp4"
@@ -130,6 +129,9 @@ class DatasetInfo:
# Optional metadata
robot_type: str | None = None
splits: dict[str, str] = field(default_factory=dict)
# OpenAI-style tool schemas declared by the dataset. ``None`` means the
# dataset doesn't declare any — readers fall back to ``DEFAULT_TOOLS``.
tools: list[dict] | None = None
def __post_init__(self) -> None:
# Coerce feature shapes from list to tuple — JSON deserialisation
@@ -151,11 +153,15 @@ class DatasetInfo:
"""Return a JSON-serialisable dict.
Converts tuple shapes back to lists so ``json.dump`` can handle them.
Drops ``tools`` when unset so existing datasets keep a clean
``info.json``.
"""
d = dataclasses.asdict(self)
for ft in d["features"].values():
if isinstance(ft.get("shape"), tuple):
ft["shape"] = list(ft["shape"])
if d.get("tools") is None:
d.pop("tools", None)
return d
@classmethod
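For illustration, a minimal sketch of the ``tools`` serialisation behaviour added above, using a stand-in dataclass (field names other than ``tools`` are placeholders, not the full ``DatasetInfo`` signature):

import dataclasses
import json
from dataclasses import dataclass, field

@dataclass
class _InfoSketch:  # hypothetical stand-in for DatasetInfo
    robot_type: str | None = None
    splits: dict[str, str] = field(default_factory=dict)
    tools: list[dict] | None = None  # None -> key omitted on serialisation

    def to_dict(self) -> dict:
        d = dataclasses.asdict(self)
        if d.get("tools") is None:
            d.pop("tools", None)  # existing datasets keep a clean info.json
        return d

print(json.dumps(_InfoSketch(robot_type="so100").to_dict()))
# {"robot_type": "so100", "splits": {}}   <- no "tools" key
print(json.dumps(_InfoSketch(tools=[{"type": "function"}]).to_dict()))
# {"robot_type": null, "splits": {}, "tools": [{"type": "function"}]}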
+151 -79
@@ -22,7 +22,7 @@ import shutil
import tempfile
import threading
import warnings
from dataclasses import asdict, dataclass, field
from dataclasses import dataclass, field
from fractions import Fraction
from pathlib import Path
from threading import Lock
@@ -37,14 +37,86 @@ import torchvision
from datasets.features.features import register_feature
from PIL import Image
from lerobot.configs.video import (
VideoEncoderConfig,
camera_encoder_defaults,
)
from lerobot.utils.import_utils import get_safe_default_video_backend
from lerobot.utils.import_utils import get_safe_default_codec
logger = logging.getLogger(__name__)
# List of hardware encoders to probe for auto-selection. Availability depends on the platform and FFmpeg build.
# Determines the order of preference for auto-selection when vcodec="auto" is used.
HW_ENCODERS = [
"h264_videotoolbox", # macOS
"hevc_videotoolbox", # macOS
"h264_nvenc", # NVIDIA GPU
"hevc_nvenc", # NVIDIA GPU
"h264_vaapi", # Linux Intel/AMD
"h264_qsv", # Intel Quick Sync
]
VALID_VIDEO_CODECS = {"h264", "hevc", "libsvtav1", "auto"} | set(HW_ENCODERS)
def _get_codec_options(
vcodec: str,
g: int | None = 2,
crf: int | None = 30,
preset: int | None = None,
) -> dict:
"""Build codec-specific options dict for video encoding."""
options = {}
# GOP size (keyframe interval) - supported by VideoToolbox and software encoders
if g is not None and (vcodec in ("h264_videotoolbox", "hevc_videotoolbox") or vcodec not in HW_ENCODERS):
options["g"] = str(g)
# Quality control (codec-specific parameter names)
if crf is not None:
if vcodec in ("h264", "hevc", "libsvtav1"):
options["crf"] = str(crf)
elif vcodec in ("h264_videotoolbox", "hevc_videotoolbox"):
quality = max(1, min(100, int(100 - crf * 2)))
options["q:v"] = str(quality)
elif vcodec in ("h264_nvenc", "hevc_nvenc"):
options["rc"] = "constqp"
options["qp"] = str(crf)
elif vcodec in ("h264_vaapi",):
options["qp"] = str(crf)
elif vcodec in ("h264_qsv",):
options["global_quality"] = str(crf)
# Preset (only for libsvtav1)
if vcodec == "libsvtav1":
options["preset"] = str(preset) if preset is not None else "12"
return options
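# Example (illustrative): outputs of _get_codec_options as defined above.
_get_codec_options("libsvtav1", g=2, crf=30)
# -> {"g": "2", "crf": "30", "preset": "12"}
_get_codec_options("h264_videotoolbox", g=2, crf=30)
# -> {"g": "2", "q:v": "40"}   (videotoolbox quality = 100 - crf * 2, clamped to [1, 100])
_get_codec_options("h264_nvenc", g=2, crf=30)
# -> {"rc": "constqp", "qp": "30"}   (no "g": GOP size is only set for videotoolbox and software encoders)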
def detect_available_hw_encoders() -> list[str]:
"""Probe PyAV/FFmpeg for available hardware video encoders."""
available = []
for codec_name in HW_ENCODERS:
try:
av.codec.Codec(codec_name, "w")
available.append(codec_name)
except Exception: # nosec B110
logger.debug("HW encoder '%s' not available", codec_name) # nosec B110
return available
def resolve_vcodec(vcodec: str) -> str:
"""Validate vcodec and resolve 'auto' to best available HW encoder, fallback to libsvtav1."""
if vcodec not in VALID_VIDEO_CODECS:
raise ValueError(f"Invalid vcodec '{vcodec}'. Must be one of: {sorted(VALID_VIDEO_CODECS)}")
if vcodec != "auto":
logger.info(f"Using video codec: {vcodec}")
return vcodec
available = detect_available_hw_encoders()
for encoder in HW_ENCODERS:
if encoder in available:
logger.info(f"Auto-selected video codec: {encoder}")
return encoder
logger.info("No hardware encoder available, falling back to software encoder 'libsvtav1'")
return "libsvtav1"
def decode_video_frames(
video_path: Path | str,
@@ -70,7 +142,7 @@ def decode_video_frames(
Currently supports torchcodec on cpu and pyav.
"""
if backend is None:
backend = get_safe_default_video_backend()
backend = get_safe_default_codec()
if backend == "torchcodec":
return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s, return_uint8=return_uint8)
elif backend in ["pyav", "video_reader"]:
@@ -328,17 +400,18 @@ def encode_video_frames(
imgs_dir: Path | str,
video_path: Path | str,
fps: int,
camera_encoder_config: VideoEncoderConfig | None = None,
encoder_threads: int | None = None,
*,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
g: int | None = 2,
crf: int | None = 30,
fast_decode: int = 0,
log_level: int | None = av.logging.WARNING,
overwrite: bool = False,
preset: int | None = None,
encoder_threads: int | None = None,
) -> None:
"""More info on ffmpeg arguments tuning on `benchmark/video/README.md`"""
if camera_encoder_config is None:
camera_encoder_config = camera_encoder_defaults()
vcodec = camera_encoder_config.vcodec
pix_fmt = camera_encoder_config.pix_fmt
vcodec = resolve_vcodec(vcodec)
video_path = Path(video_path)
imgs_dir = Path(imgs_dir)
@@ -349,18 +422,42 @@ def encode_video_frames(
video_path.parent.mkdir(parents=True, exist_ok=True)
# Encoders/pixel formats incompatibility check
if (vcodec == "libsvtav1" or vcodec == "hevc") and pix_fmt == "yuv444p":
logger.warning(
f"Incompatible pixel format 'yuv444p' for codec {vcodec}, auto-selecting format 'yuv420p'"
)
pix_fmt = "yuv420p"
# Get input frames
template = "frame-" + ("[0-9]" * 6) + ".png"
input_list = sorted(
glob.glob(str(imgs_dir / template)), key=lambda x: int(x.split("-")[-1].split(".")[0])
)
# Define video output frame size (assuming all input frames are the same size)
if len(input_list) == 0:
raise FileNotFoundError(f"No images found in {imgs_dir}.")
with Image.open(input_list[0]) as dummy_image:
width, height = dummy_image.size
video_options = camera_encoder_config.get_codec_options(encoder_threads, as_strings=True)
# Define video codec options
video_options = _get_codec_options(vcodec, g, crf, preset)
if fast_decode:
key = "svtav1-params" if vcodec == "libsvtav1" else "tune"
value = f"fast-decode={fast_decode}" if vcodec == "libsvtav1" else "fastdecode"
video_options[key] = value
if encoder_threads is not None:
if vcodec == "libsvtav1":
lp_param = f"lp={encoder_threads}"
if "svtav1-params" in video_options:
video_options["svtav1-params"] += f":{lp_param}"
else:
video_options["svtav1-params"] = lp_param
else:
video_options["threads"] = str(encoder_threads)
# Set logging level
if log_level is not None:
@@ -397,10 +494,7 @@ def encode_video_frames(
def concatenate_video_files(
input_video_paths: list[Path | str],
output_video_path: Path,
overwrite: bool = True,
compatibility_check: bool = False,
input_video_paths: list[Path | str], output_video_path: Path, overwrite: bool = True
):
"""
Concatenate multiple video files into a single video file using pyav.
@@ -413,7 +507,6 @@ def concatenate_video_files(
input_video_paths: Ordered list of input video file paths to concatenate.
output_video_path: Path to the output video file.
overwrite: Whether to overwrite the output video file if it already exists. Default is True.
compatibility_check: Whether to check if the input videos are compatible. Default is False.
Note:
- Creates a temporary directory for intermediate files that is cleaned up after use.
@@ -432,22 +525,6 @@ def concatenate_video_files(
if len(input_video_paths) == 0:
raise FileNotFoundError("No input video paths provided.")
# This check may be skipped at recording time as videos are encoded with the same encoder config.
if compatibility_check:
reference_video_info = get_video_info(input_video_paths[0])
for input_path in input_video_paths[1:]:
video_info = get_video_info(input_path)
if (
video_info["video.height"] != reference_video_info["video.height"]
or video_info["video.width"] != reference_video_info["video.width"]
or video_info["video.fps"] != reference_video_info["video.fps"]
or video_info["video.codec"] != reference_video_info["video.codec"]
or video_info["video.pix_fmt"] != reference_video_info["video.pix_fmt"]
):
raise ValueError(
f"Input video {input_path} is not compatible with the reference video {input_video_paths[0]}."
)
# Create a temporary .ffconcat file to list the input video paths
with tempfile.NamedTemporaryFile(mode="w", suffix=".ffconcat", delete=False) as tmp_concatenate_file:
tmp_concatenate_file.write("ffconcat version 1.0\n")
@@ -514,20 +591,26 @@ class _CameraEncoderThread(threading.Thread):
fps: int,
vcodec: str,
pix_fmt: str,
codec_options: dict[str, str],
g: int | None,
crf: int | None,
preset: int | None,
frame_queue: queue.Queue,
result_queue: queue.Queue,
stop_event: threading.Event,
encoder_threads: int | None = None,
):
super().__init__(daemon=True)
self.video_path = video_path
self.fps = fps
self.vcodec = vcodec
self.pix_fmt = pix_fmt
self.codec_options = codec_options
self.g = g
self.crf = crf
self.preset = preset
self.frame_queue = frame_queue
self.result_queue = result_queue
self.stop_event = stop_event
self.encoder_threads = encoder_threads
def run(self) -> None:
from .compute_stats import RunningQuantileStats, auto_downsample_height_width
@@ -563,9 +646,19 @@ class _CameraEncoderThread(threading.Thread):
# Open container on first frame (to get width/height)
if container is None:
height, width = frame_data.shape[:2]
video_options = _get_codec_options(self.vcodec, self.g, self.crf, self.preset)
if self.encoder_threads is not None:
if self.vcodec == "libsvtav1":
lp_param = f"lp={self.encoder_threads}"
if "svtav1-params" in video_options:
video_options["svtav1-params"] += f":{lp_param}"
else:
video_options["svtav1-params"] = lp_param
else:
video_options["threads"] = str(self.encoder_threads)
Path(self.video_path).parent.mkdir(parents=True, exist_ok=True)
container = av.open(str(self.video_path), "w")
output_stream = container.add_stream(self.vcodec, self.fps, options=self.codec_options)
output_stream = container.add_stream(self.vcodec, self.fps, options=video_options)
output_stream.pix_fmt = self.pix_fmt
output_stream.width = width
output_stream.height = height
@@ -631,24 +724,22 @@ class StreamingVideoEncoder:
def __init__(
self,
fps: int,
camera_encoder_config: VideoEncoderConfig | None = None,
vcodec: str = "libsvtav1",
pix_fmt: str = "yuv420p",
g: int | None = 2,
crf: int | None = 30,
preset: int | None = None,
queue_maxsize: int = 30,
encoder_threads: int | None = None,
):
"""
Args:
fps: Frames per second for the output videos.
camera_encoder_config: Video encoder settings applied to all cameras.
When ``None``, :func:`camera_encoder_defaults` is used.
encoder_threads: Number of encoder threads (global setting).
``None`` lets the codec decide.
queue_maxsize: Max frames to buffer per camera before
back-pressure drops frames.
"""
self.fps = fps
self._camera_encoder_config = camera_encoder_config or camera_encoder_defaults()
self._encoder_threads = encoder_threads
self.vcodec = resolve_vcodec(vcodec)
self.pix_fmt = pix_fmt
self.g = g
self.crf = crf
self.preset = preset
self.queue_maxsize = queue_maxsize
self.encoder_threads = encoder_threads
self._frame_queues: dict[str, queue.Queue] = {}
self._result_queues: dict[str, queue.Queue] = {}
@@ -679,19 +770,18 @@ class StreamingVideoEncoder:
temp_video_dir = Path(tempfile.mkdtemp(dir=temp_dir))
video_path = temp_video_dir / f"{video_key.replace('/', '_')}_streaming.mp4"
vcodec = self._camera_encoder_config.vcodec
codec_options = self._camera_encoder_config.get_codec_options(
self._encoder_threads, as_strings=True
)
encoder_thread = _CameraEncoderThread(
video_path=video_path,
fps=self.fps,
vcodec=vcodec,
pix_fmt=self._camera_encoder_config.pix_fmt,
codec_options=codec_options,
vcodec=self.vcodec,
pix_fmt=self.pix_fmt,
g=self.g,
crf=self.crf,
preset=self.preset,
frame_queue=frame_queue,
result_queue=result_queue,
stop_event=stop_event,
encoder_threads=self.encoder_threads,
)
encoder_thread.start()
@@ -896,18 +986,8 @@ def get_audio_info(video_path: Path | str) -> dict:
return audio_info
def get_video_info(
video_path: Path | str,
camera_encoder_config: "VideoEncoderConfig | None" = None,
) -> dict:
"""Build the ``video.*`` / ``audio.*`` info dict persisted in ``info.json``.
Args:
video_path: Path to the encoded video file to probe.
camera_encoder_config: If provided, record the exact encoder settings used to encode this
video. Stream-derived values take precedence; encoder fields are only written for keys
not already populated from the video file itself.
"""
def get_video_info(video_path: Path | str) -> dict:
# Set logging level
logging.getLogger("libav").setLevel(av.logging.WARNING)
# Getting video stream information
@@ -938,14 +1018,6 @@ def get_video_info(
# Adding audio stream information
video_info.update(**get_audio_info(video_path))
# Add additional encoder configuration if provided
if camera_encoder_config is not None:
for field_name, field_value in asdict(camera_encoder_config).items():
# vcodec is already populated from the video stream
if field_name == "vcodec":
continue
video_info.setdefault(f"video.{field_name}", field_value)
return video_info
+2
@@ -16,6 +16,7 @@ from lerobot.utils.action_interpolator import ActionInterpolator as ActionInterp
from .act.configuration_act import ACTConfig as ACTConfig
from .diffusion.configuration_diffusion import DiffusionConfig as DiffusionConfig
from .eo1.configuration_eo1 import EO1Config as EO1Config
from .factory import get_policy_class, make_policy, make_policy_config, make_pre_post_processors
from .groot.configuration_groot import GrootConfig as GrootConfig
from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
@@ -41,6 +42,7 @@ __all__ = [
"DiffusionConfig",
"GrootConfig",
"MultiTaskDiTConfig",
"EO1Config",
"PI0Config",
"PI0FastConfig",
"PI05Config",
@@ -100,8 +100,8 @@ class DiffusionConfig(PreTrainedConfig):
# Inputs / output structure.
n_obs_steps: int = 2
horizon: int = 16
n_action_steps: int = 8
horizon: int = 64
n_action_steps: int = 32
normalization_mapping: dict[str, NormalizationMode] = field(
default_factory=lambda: {
@@ -122,10 +122,10 @@ class DiffusionConfig(PreTrainedConfig):
crop_ratio: float = 1.0
crop_shape: tuple[int, int] | None = None
crop_is_random: bool = True
pretrained_backbone_weights: str | None = None
use_group_norm: bool = True
pretrained_backbone_weights: str | None = "ResNet18_Weights.IMAGENET1K_V1"
use_group_norm: bool = False
spatial_softmax_num_keypoints: int = 32
use_separate_rgb_encoder_per_camera: bool = False
use_separate_rgb_encoder_per_camera: bool = True
# Unet.
down_dims: tuple[int, ...] = (512, 1024, 2048)
kernel_size: int = 5
+1
@@ -0,0 +1 @@
../../../../docs/source/eo1.mdx
+7
@@ -0,0 +1,7 @@
#!/usr/bin/env python
from .configuration_eo1 import EO1Config
from .modeling_eo1 import EO1Policy
from .processor_eo1 import make_eo1_pre_post_processors
__all__ = ["EO1Config", "EO1Policy", "make_eo1_pre_post_processors"]
@@ -0,0 +1,193 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from copy import deepcopy
from dataclasses import dataclass, field
from typing import TYPE_CHECKING
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import FeatureType, NormalizationMode, PolicyFeature
from lerobot.optim.optimizers import AdamWConfig
from lerobot.optim.schedulers import CosineDecayWithWarmupSchedulerConfig
from lerobot.utils.constants import ACTION, OBS_STATE
from lerobot.utils.import_utils import _transformers_available, require_package
if TYPE_CHECKING or _transformers_available:
from transformers.models.qwen2_5_vl.configuration_qwen2_5_vl import (
Qwen2_5_VLConfig,
Qwen2_5_VLTextConfig,
Qwen2_5_VLVisionConfig,
)
else:
Qwen2_5_VLConfig = None
Qwen2_5_VLTextConfig = None
Qwen2_5_VLVisionConfig = None
@PreTrainedConfig.register_subclass("eo1")
@dataclass
class EO1Config(PreTrainedConfig):
"""Configuration for native EO1 policy integration in LeRobot."""
vlm_base: str = "Qwen/Qwen2.5-VL-3B-Instruct"
vlm_config: dict | None = None
# Vision processor settings.
image_min_pixels: int | None = 64 * 28 * 28
image_max_pixels: int | None = 128 * 28 * 28
use_fast_processor: bool = False
# Execution and action horizon.
n_obs_steps: int = 1
chunk_size: int = 8
n_action_steps: int = 8
# State/action padding to match EO1 flow head dimensionality.
max_state_dim: int = 32
max_action_dim: int = 32
# Flow matching sampling.
num_denoise_steps: int = 10
num_action_layers: int = 2
action_act: str = "linear"
time_sampling_beta_alpha: float = 1.5
time_sampling_beta_beta: float = 1.0
time_sampling_scale: float = 0.999
time_sampling_offset: float = 0.001
min_period: float = 4e-3
max_period: float = 4.0
supervise_padding_action_dims: bool = True
supervise_padding_actions: bool = True
# Policy-level dtype request for the Qwen backbone.
# - "auto": follow the backbone config/checkpoint default dtype. For Qwen2.5-VL this resolves to bf16.
# The EO1 flow-matching head still keeps its own parameters in fp32.
# - "bfloat16": force the backbone to initialize/load in bf16 regardless of the saved config default.
# - "float32": force the backbone to initialize/load in fp32 for maximum numerical conservatism.
dtype: str = "auto" # Options: "auto", "bfloat16", "float32"
force_fp32_autocast: bool = True
# Optional attention backend request passed through to the Qwen backbone.
# Common values: None, "eager", "sdpa", "flash_attention_2".
attn_implementation: str | None = None
# Training settings.
gradient_checkpointing: bool = False # Enable gradient checkpointing for memory optimization
normalization_mapping: dict[str, NormalizationMode] = field(
default_factory=lambda: {
"VISUAL": NormalizationMode.IDENTITY,
"STATE": NormalizationMode.MEAN_STD,
"ACTION": NormalizationMode.MEAN_STD,
}
)
# Optimizer settings aligned with EO1/experiments/2_libero/train.sh and EO1 TrainPipelineConfig defaults.
optimizer_lr: float = 1e-4
optimizer_betas: tuple[float, float] = (0.9, 0.999)
optimizer_eps: float = 1e-8
optimizer_weight_decay: float = 0.1
optimizer_grad_clip_norm: float = 1.0
# Scheduler settings aligned with EO1 train.sh: cosine schedule with warmup_ratio=0.03.
# Note: These will auto-scale if --steps < scheduler_decay_steps
# For example, --steps=3000 will scale warmup to 100 and decay to 3000
scheduler_warmup_steps: int = 900 # 0.03 * 30_000 long-run steps
scheduler_decay_steps: int = 30_000
scheduler_decay_lr: float = 0.0
def __post_init__(self):
super().__post_init__()
if self.n_action_steps > self.chunk_size:
raise ValueError(
f"n_action_steps ({self.n_action_steps}) cannot be greater than chunk_size ({self.chunk_size})"
)
# Populate the serialized backbone config only when the caller did not provide one.
if self.vlm_config is None:
require_package("transformers", extra="eo1")
self.vlm_config = Qwen2_5_VLConfig.from_pretrained(self.vlm_base).to_dict()
@property
def vlm_backbone_config(self) -> Qwen2_5_VLConfig:
require_package("transformers", extra="eo1")
config_dict = deepcopy(self.vlm_config)
if self.attn_implementation is not None:
config_dict["attn_implementation"] = self.attn_implementation
return Qwen2_5_VLConfig(**config_dict)
@property
def text_config(self) -> Qwen2_5_VLTextConfig:
return self.vlm_backbone_config.text_config
@property
def vision_config(self) -> Qwen2_5_VLVisionConfig:
return self.vlm_backbone_config.vision_config
def validate_features(self) -> None:
"""Validate and set up EO1 input and output features."""
image_features = [key for key, feat in self.input_features.items() if feat.type == FeatureType.VISUAL]
if not image_features:
raise ValueError(
"EO1 policy requires at least one visual input feature. "
"No features of type FeatureType.VISUAL found in input_features."
)
if OBS_STATE not in self.input_features:
state_feature = PolicyFeature(
type=FeatureType.STATE,
shape=(self.max_state_dim,),
)
self.input_features[OBS_STATE] = state_feature
if ACTION not in self.output_features:
action_feature = PolicyFeature(
type=FeatureType.ACTION,
shape=(self.max_action_dim,),
)
self.output_features[ACTION] = action_feature
def get_optimizer_preset(self) -> AdamWConfig:
return AdamWConfig(
lr=self.optimizer_lr,
betas=self.optimizer_betas,
eps=self.optimizer_eps,
weight_decay=self.optimizer_weight_decay,
grad_clip_norm=self.optimizer_grad_clip_norm,
)
def get_scheduler_preset(self):
return CosineDecayWithWarmupSchedulerConfig(
peak_lr=self.optimizer_lr,
decay_lr=self.scheduler_decay_lr,
num_warmup_steps=self.scheduler_warmup_steps,
num_decay_steps=self.scheduler_decay_steps,
)
@property
def observation_delta_indices(self) -> None:
return None
@property
def action_delta_indices(self) -> list[int]:
return list(range(self.chunk_size))
@property
def reward_delta_indices(self) -> None:
return None
+620
@@ -0,0 +1,620 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import contextlib
import logging
import math
from collections import deque
from typing import TYPE_CHECKING, Any
import torch
import torch.nn as nn
import torch.nn.functional as F # noqa: N812
import torch.utils.checkpoint
from torch import Tensor
from lerobot.policies.eo1.configuration_eo1 import EO1Config
from lerobot.policies.pretrained import PreTrainedPolicy
from lerobot.utils.constants import ACTION, OBS_STATE
from lerobot.utils.import_utils import _transformers_available, require_package
if TYPE_CHECKING or _transformers_available:
from transformers.activations import ACT2FN
from transformers.models.qwen2_5_vl import Qwen2_5_VLForConditionalGeneration
from transformers.utils import torch_compilable_check
else:
ACT2FN = None
Qwen2_5_VLForConditionalGeneration = None
torch_compilable_check = None
logger = logging.getLogger(__name__)
def pad_vector(vector, new_dim):
"""Pad the last dimension of a vector to new_dim with zeros.
Can be (batch_size x sequence_length x features_dimension)
or (batch_size x features_dimension)
"""
if vector.shape[-1] >= new_dim:
return vector
return F.pad(vector, (0, new_dim - vector.shape[-1]))
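# Example (illustrative): zero-padding a 7-dim state to the 32-dim flow-head width.
import torch
state = torch.ones(2, 7)               # (batch_size, features_dimension)
padded = pad_vector(state, 32)
assert padded.shape == (2, 32)
assert padded[:, 7:].abs().sum() == 0  # padding is zeros; original values are untouched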
class EO1Policy(PreTrainedPolicy):
"""EO1 policy wrapper for LeRobot robot-only training/evaluation."""
config_class = EO1Config
name = "eo1"
def __init__(self, config: EO1Config, **kwargs):
require_package("transformers", extra="eo1")
super().__init__(config)
config.validate_features()
self.config = config
if config.pretrained_path is None:
# Initialize from pretrained VLM
vlm_backbone = Qwen2_5_VLForConditionalGeneration.from_pretrained(
config.vlm_base,
dtype=config.dtype,
attn_implementation=config.attn_implementation,
)
else:
vlm_backbone = Qwen2_5_VLForConditionalGeneration._from_config(
config.vlm_backbone_config,
dtype=config.vlm_backbone_config.dtype if config.dtype == "auto" else config.dtype,
)
self.model = EO1VisionFlowMatchingModel(config, vlm_backbone)
if config.gradient_checkpointing:
self.model.gradient_checkpointing_enable()
self.model.to(config.device)
self.reset()
def reset(self):
self._action_queue = deque(maxlen=self.config.n_action_steps)
@staticmethod
def _get_model_inputs(batch: dict[str, Tensor], excluded_keys: set[str]) -> dict[str, Tensor]:
return {key: value for key, value in batch.items() if key not in excluded_keys}
def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, dict]:
state = self.prepare_state(batch[OBS_STATE])
actions = self.prepare_action(batch[ACTION])
model_inputs = self._get_model_inputs(batch, {OBS_STATE, ACTION})
loss = self.model(states=state, action=actions, **model_inputs)
loss_dict = {"loss": loss.item()}
return loss, loss_dict
@torch.no_grad()
def predict_action_chunk(self, batch: dict[str, Tensor], **kwargs) -> Tensor:
self.eval()
states = self.prepare_state(batch[OBS_STATE])
model_inputs = self._get_model_inputs(batch, {OBS_STATE})
actions = self.model.sample_actions(states=states, **model_inputs).to(torch.float32)
original_action_dim = self.config.output_features[ACTION].shape[0]
return actions[:, :, :original_action_dim]
def prepare_state(self, state: Tensor) -> Tensor:
return pad_vector(state, self.config.max_state_dim)
def prepare_action(self, action: Tensor) -> Tensor:
return pad_vector(action, self.config.max_action_dim)
@torch.no_grad()
def select_action(self, batch: dict[str, Tensor]) -> Tensor:
self.eval()
if len(self._action_queue) == 0:
actions = self.predict_action_chunk(batch)[:, : self.config.n_action_steps]
self._action_queue.extend(actions.transpose(0, 1))
return self._action_queue.popleft()
def get_optim_params(self) -> dict:
return self.parameters()
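# Illustrative sketch of the chunked-execution pattern behind select_action above:
# one backbone call fills the queue with n_action_steps actions and later control-loop
# ticks pop from it. fake_chunk is a stand-in for predict_action_chunk.
import torch
from collections import deque

n_action_steps = 8
queue: deque = deque(maxlen=n_action_steps)

def select_action_sketch(predict_chunk) -> torch.Tensor:
    if len(queue) == 0:
        chunk = predict_chunk()[:, :n_action_steps]  # (batch, n_action_steps, action_dim)
        queue.extend(chunk.transpose(0, 1))          # entries are (batch, action_dim) slices
    return queue.popleft()

def fake_chunk() -> torch.Tensor:
    return torch.arange(8, dtype=torch.float32).reshape(1, 8, 1)

first = select_action_sketch(fake_chunk)   # triggers one "model call", returns action 0
second = select_action_sketch(fake_chunk)  # served from the queue, returns action 1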
def get_safe_dtype(target_dtype, device_type):
"""Get a safe dtype for the given device type."""
if device_type == "mps" and target_dtype == torch.float64:
return torch.float32
if device_type == "cpu":
# CPU doesn't support bfloat16, use float32 instead
if target_dtype == torch.bfloat16:
return torch.float32
if target_dtype == torch.float64:
return torch.float64
return target_dtype
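# Example (illustrative): dtype resolutions produced by get_safe_dtype above.
get_safe_dtype(torch.float64, "mps")    # -> torch.float32 (MPS has no float64)
get_safe_dtype(torch.bfloat16, "cpu")   # -> torch.float32
get_safe_dtype(torch.bfloat16, "cuda")  # -> torch.bfloat16 (returned unchanged)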
def create_sinusoidal_pos_embedding( # see openpi `create_sinusoidal_pos_embedding` (exact copy)
time: torch.Tensor, dimension: int, min_period: float, max_period: float, device="cpu"
) -> Tensor:
"""Computes sine-cosine positional embedding vectors for scalar positions."""
if dimension % 2 != 0:
raise ValueError(f"dimension ({dimension}) must be divisible by 2")
if time.ndim != 1:
raise ValueError("The time tensor is expected to be of shape `(batch_size, )`.")
dtype = get_safe_dtype(torch.float64, device.type)
fraction = torch.linspace(0.0, 1.0, dimension // 2, dtype=dtype, device=device)
period = min_period * (max_period / min_period) ** fraction
# Compute the outer product
scaling_factor = 1.0 / period * 2 * math.pi
sin_input = scaling_factor[None, :] * time[:, None]
return torch.cat([torch.sin(sin_input), torch.cos(sin_input)], dim=1)
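# Example (illustrative): shape check with arbitrary values; the [sin | cos] halves are
# concatenated along the feature dimension.
t = torch.tensor([0.25, 0.50])  # one scalar time per batch element
emb = create_sinusoidal_pos_embedding(
    t, dimension=2048, min_period=4e-3, max_period=4.0, device=torch.device("cpu")
)
assert emb.shape == (2, 2048)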
def sample_beta(alpha, beta, bsize, device): # see openpi `sample_beta` (exact copy)
# Beta sampling uses _sample_dirichlet which isn't implemented for MPS, so sample on CPU
alpha_t = torch.tensor(alpha, dtype=torch.float32)
beta_t = torch.tensor(beta, dtype=torch.float32)
dist = torch.distributions.Beta(alpha_t, beta_t)
return dist.sample((bsize,)).to(device)
class EO1VisionActionProjector(torch.nn.Sequential):
"""This block implements the multi-layer perceptron (MLP) module."""
def __init__(
self,
in_channels: int,
out_channels: int,
num_layers: int = 2,
activation_layer: str = "linear",
bias: bool = True,
device: Any = None,
dtype: torch.dtype = torch.float32,
):
layers = []
in_dim = in_channels
hidden_channels = [in_dim] * (num_layers - 1) + [out_channels]
for hidden_dim in hidden_channels[:-1]:
layers.append(torch.nn.Linear(in_dim, hidden_dim, bias=bias, dtype=dtype, device=device))
layers.append(ACT2FN[activation_layer])
in_dim = hidden_dim
layers.append(torch.nn.Linear(in_dim, hidden_channels[-1], bias=bias, dtype=dtype, device=device))
super().__init__(*layers)
@property
def dtype(self):
return self[0].weight.dtype
class EO1VisionFlowMatchingModel(nn.Module):
def __init__(
self,
config: EO1Config,
vlm_backbone: Qwen2_5_VLForConditionalGeneration | None = None,
):
require_package("transformers", extra="eo1")
super().__init__()
self.config = config
# Preserve the backbone dtype selected at construction time so Qwen's fp32 rotary buffers stay intact.
self.vlm_backbone = vlm_backbone
self.hidden_size = self.vlm_backbone.config.text_config.hidden_size
max_state_dim = config.max_state_dim
max_action_dim = config.max_action_dim
self.state_proj = nn.Linear(max_state_dim, self.hidden_size, dtype=torch.float32)
self.action_in_proj = nn.Linear(max_action_dim, self.hidden_size, dtype=torch.float32)
self.action_out_proj = EO1VisionActionProjector(
self.hidden_size,
max_action_dim,
config.num_action_layers,
config.action_act,
dtype=torch.float32,
)
self.action_time_mlp_in = nn.Linear(self.hidden_size * 2, self.hidden_size, dtype=torch.float32)
self.action_time_mlp_out = nn.Linear(self.hidden_size, self.hidden_size, dtype=torch.float32)
self.gradient_checkpointing_enabled = False
def get_input_embeddings(self):
return self.vlm_backbone.get_input_embeddings()
def flow_head_autocast_context(self):
if self.config.force_fp32_autocast:
return torch.autocast(
device_type=self.state_proj.weight.device.type,
enabled=False,
)
return contextlib.nullcontext()
def gradient_checkpointing_enable(self):
"""Enable gradient checkpointing for the Qwen2.5-VL backbone."""
self.gradient_checkpointing_enabled = True
self.vlm_backbone.gradient_checkpointing_enable(
gradient_checkpointing_kwargs={"use_reentrant": False}
)
logger.info("Enabled gradient checkpointing for EO1VisionFlowMatchingModel")
def gradient_checkpointing_disable(self):
"""Disable gradient checkpointing for the Qwen2.5-VL backbone."""
self.gradient_checkpointing_enabled = False
self.vlm_backbone.gradient_checkpointing_disable()
logger.info("Disabled gradient checkpointing for EO1VisionFlowMatchingModel")
def _apply_checkpoint(self, func, *args, **kwargs):
"""Apply manual gradient checkpointing to EO1 flow-head computations when training."""
if self.gradient_checkpointing_enabled and self.training and torch.is_grad_enabled():
return torch.utils.checkpoint.checkpoint(
func, *args, use_reentrant=False, preserve_rng_state=False, **kwargs
)
return func(*args, **kwargs)
def sample_noise(self, shape, device):
noise = torch.normal(
mean=0.0,
std=1.0,
size=shape,
dtype=torch.float32,
device=device,
)
return noise
def sample_time(self, bsize, device):
time_beta = sample_beta(
self.config.time_sampling_beta_alpha, self.config.time_sampling_beta_beta, bsize, device
)
time = time_beta * self.config.time_sampling_scale + self.config.time_sampling_offset
return time.to(dtype=torch.float32, device=device)
def get_placeholder_mask(
self,
input_ids: torch.LongTensor | None,
inputs_embeds: torch.FloatTensor | None,
state_features: torch.FloatTensor | None = None,
action_features: torch.FloatTensor | None = None,
*,
state_token_id: int,
action_token_id: int,
) -> tuple[torch.BoolTensor, torch.BoolTensor]:
"""Return EO1 state/action placeholder masks, following Qwen's multimodal mask style."""
if input_ids is None:
special_state_mask = inputs_embeds == self.get_input_embeddings()(
torch.tensor(state_token_id, dtype=torch.long, device=inputs_embeds.device)
)
special_state_mask = special_state_mask.all(-1)
special_action_mask = inputs_embeds == self.get_input_embeddings()(
torch.tensor(action_token_id, dtype=torch.long, device=inputs_embeds.device)
)
special_action_mask = special_action_mask.all(-1)
else:
special_state_mask = input_ids == state_token_id
special_action_mask = input_ids == action_token_id
n_state_tokens = special_state_mask.sum()
special_state_mask = (
special_state_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
)
if state_features is not None:
torch_compilable_check(
inputs_embeds[special_state_mask].numel() == state_features.numel(),
f"State features and state tokens do not match, tokens: {n_state_tokens}, features: {state_features.shape[0]}",
)
n_action_tokens = special_action_mask.sum()
special_action_mask = (
special_action_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
)
if action_features is not None:
torch_compilable_check(
inputs_embeds[special_action_mask].numel() == action_features.numel(),
f"Action features and action tokens do not match, tokens: {n_action_tokens}, features: {action_features.shape[0]}",
)
return special_state_mask, special_action_mask
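# Illustrative sketch of the input_ids branch above: the boolean placeholder mask is
# broadcast over the hidden dim so masked_scatter can later write the projected
# state/action embeddings into exactly those token positions (token ids are made up).
import torch
state_token_id, action_token_id = 151665, 151666
input_ids = torch.tensor([[11, state_token_id, 22, action_token_id, action_token_id]])
inputs_embeds = torch.zeros(1, 5, 4)  # (batch, seq_len, hidden)
state_mask = (input_ids == state_token_id).unsqueeze(-1).expand_as(inputs_embeds)
action_mask = (input_ids == action_token_id).unsqueeze(-1).expand_as(inputs_embeds)
assert state_mask[0, 1].all() and action_mask[0, 3:].all()  # only placeholder rows selected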
def embed_prefix(
self,
input_ids: torch.LongTensor,
states: torch.Tensor,
*,
state_token_id: int,
action_token_id: int,
) -> torch.FloatTensor:
"""Embed the EO1 prefix tokens before native Qwen injects multimodal features."""
# Get the input embeddings for the input IDs
def input_embed_func(input_ids: torch.LongTensor) -> torch.FloatTensor:
return self.get_input_embeddings()(input_ids)
inputs_embeds = self._apply_checkpoint(input_embed_func, input_ids)
# Project the states to the hidden size
def state_proj_func(states: torch.Tensor) -> torch.FloatTensor:
with self.flow_head_autocast_context():
states = states.to(dtype=self.state_proj.weight.dtype)
return self.state_proj(states)
state_embs = self._apply_checkpoint(state_proj_func, states)
state_mask, _ = self.get_placeholder_mask(
input_ids,
inputs_embeds,
state_features=state_embs,
state_token_id=state_token_id,
action_token_id=action_token_id,
)
state_embs = state_embs.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(state_mask, state_embs)
return inputs_embeds
def embed_suffix(
self,
timestep: torch.Tensor,
noisy_actions: torch.Tensor,
) -> torch.FloatTensor:
"""Embed the suffix"""
def action_proj_func(noisy_actions: torch.Tensor) -> torch.FloatTensor:
with self.flow_head_autocast_context():
noisy_actions = noisy_actions.to(dtype=self.action_in_proj.weight.dtype)
return self.action_in_proj(noisy_actions)
action_embs = self._apply_checkpoint(action_proj_func, noisy_actions)
time_embs = create_sinusoidal_pos_embedding(
timestep,
self.hidden_size,
min_period=self.config.min_period,
max_period=self.config.max_period,
device=action_embs.device,
)
time_embs = time_embs.to(dtype=action_embs.dtype)
time_embs = time_embs[:, None, :].expand_as(action_embs)
action_time_embs = torch.cat([action_embs, time_embs], dim=2)
def mlp_func(action_time_embs: torch.Tensor) -> torch.FloatTensor:
with self.flow_head_autocast_context():
action_time_embs = action_time_embs.to(dtype=self.action_time_mlp_in.weight.dtype)
action_time_embs = self.action_time_mlp_in(action_time_embs)
action_time_embs = F.silu(action_time_embs)
return self.action_time_mlp_out(action_time_embs)
action_time_embs = self._apply_checkpoint(mlp_func, action_time_embs)
return action_time_embs
def forward(
self,
input_ids: torch.LongTensor | None = None,
attention_mask: torch.LongTensor | None = None,
pixel_values: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
mm_token_type_ids: torch.IntTensor | None = None,
states: torch.FloatTensor | None = None,
action: torch.FloatTensor | None = None,
action_is_pad: torch.BoolTensor | None = None,
*,
state_token_id: int,
action_token_id: int,
**kwargs,
) -> Tensor:
"""Run the EO1 training forward pass and compute the flow-matching loss."""
# 1. Build the EO1 prefix with state placeholders resolved.
inputs_embeds = self.embed_prefix(
input_ids,
states=states,
state_token_id=state_token_id,
action_token_id=action_token_id,
)
# 2. Sample the diffusion target and replace the action placeholders.
time = self.sample_time(action.shape[0], inputs_embeds.device)
noise = self.sample_noise(action.shape, inputs_embeds.device)
time_expanded = time[:, None, None]
x_t = time_expanded * noise + (1 - time_expanded) * action
u_t = noise - action
action_time_embs = self.embed_suffix(time, x_t)
_, action_mask = self.get_placeholder_mask(
input_ids,
inputs_embeds,
action_features=action_time_embs,
state_token_id=state_token_id,
action_token_id=action_token_id,
)
action_time_embs = action_time_embs.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(action_mask, action_time_embs)
# 3. Optionally drop padded action tokens from backbone attention.
if attention_mask is not None:
attention_mask = attention_mask.to(inputs_embeds.device)
if not self.config.supervise_padding_actions:
action_is_pad = action_is_pad.to(device=inputs_embeds.device, dtype=torch.bool)
action_token_mask = action_mask[..., 0]
action_padding_mask = torch.zeros_like(action_token_mask)
action_padding_mask = action_padding_mask.masked_scatter(
action_token_mask,
action_is_pad.reshape(-1),
)
attention_mask = attention_mask.masked_fill(action_padding_mask, 0)
# 4. Run the Qwen backbone on the fused EO1 sequence.
def vlm_forward_func(
input_ids: torch.LongTensor,
attention_mask: torch.Tensor | None,
inputs_embeds: torch.FloatTensor,
pixel_values: torch.Tensor | None,
image_grid_thw: torch.LongTensor | None,
mm_token_type_ids: torch.IntTensor | None,
) -> torch.FloatTensor:
outputs = self.vlm_backbone.model(
input_ids=input_ids,
attention_mask=attention_mask,
inputs_embeds=inputs_embeds,
pixel_values=pixel_values,
image_grid_thw=image_grid_thw,
mm_token_type_ids=mm_token_type_ids,
use_cache=False,
output_hidden_states=False,
return_dict=True,
)
return outputs.last_hidden_state
hidden_states = self._apply_checkpoint(
vlm_forward_func,
input_ids,
attention_mask,
inputs_embeds,
pixel_values,
image_grid_thw,
mm_token_type_ids,
)
action_hidden_states = hidden_states[action_mask[..., 0]]
# 5. Project the action-token hidden states back to the flow target space.
def action_out_proj_func(action_hidden_states: torch.FloatTensor) -> torch.FloatTensor:
with self.flow_head_autocast_context():
action_hidden_states = action_hidden_states.to(dtype=self.action_out_proj.dtype)
return self.action_out_proj(action_hidden_states)
v_t = self._apply_checkpoint(action_out_proj_func, action_hidden_states)
v_t = v_t.reshape(u_t.shape).to(dtype=u_t.dtype)
losses = F.mse_loss(u_t, v_t, reduction="none")
# 6. Apply the configured supervision mask and reduce the loss.
if not self.config.supervise_padding_action_dims:
original_action_dim = self.config.output_features[ACTION].shape[0]
losses = losses[..., :original_action_dim]
if not self.config.supervise_padding_actions:
losses = losses[~action_is_pad]
return losses.mean()
@torch.no_grad()
def sample_actions(
self,
input_ids: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
pixel_values: torch.Tensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
mm_token_type_ids: torch.IntTensor | None = None,
states: torch.Tensor | None = None,
*,
state_token_id: int,
action_token_id: int,
**kwargs,
) -> Tensor:
"""Sample actions from the model."""
if states is None:
raise ValueError("states are required for EO1 action sampling.")
if mm_token_type_ids is None:
raise ValueError("mm_token_type_ids are required for EO1 action sampling.")
# 1. Resolve the left-padded rollout prompt and locate the action span.
chunk_size = self.config.chunk_size
inputs_embeds = self.embed_prefix(
input_ids,
states=states,
state_token_id=state_token_id,
action_token_id=action_token_id,
).clone()
_, action_placeholder_mask = self.get_placeholder_mask(
input_ids,
inputs_embeds,
state_token_id=state_token_id,
action_token_id=action_token_id,
)
action_mask = action_placeholder_mask[..., 0]
token_counts = action_mask.sum(dim=1)
if not torch.all(token_counts == chunk_size):
raise ValueError(
f"Each sample must contain exactly {chunk_size} action tokens, got {token_counts.tolist()}."
)
if action_mask.ne(action_mask[:1]).any():
raise ValueError(
"Batch inference expects all samples to share the same action token mask after left padding."
)
act_start = int(action_mask[0].to(torch.int64).argmax().item())
act_end = act_start + self.config.chunk_size
if not torch.all(action_mask[:, act_start:act_end]):
raise ValueError("Action tokens must form a contiguous chunk of length chunk_size.")
act_slice = slice(act_start, act_end)
# 2. Encode the fixed prefix once and cache its KV state.
batch_size = input_ids.shape[0]
device = inputs_embeds.device
attention_mask = attention_mask.to(device)
mm_token_type_ids = mm_token_type_ids.to(device)
position_ids, _ = self.vlm_backbone.model.get_rope_index(
input_ids,
image_grid_thw=image_grid_thw,
attention_mask=attention_mask,
mm_token_type_ids=mm_token_type_ids,
)
position_ids = position_ids.to(device)
outputs = self.vlm_backbone.model(
input_ids=input_ids[:, :act_start],
attention_mask=attention_mask[:, :act_start],
position_ids=position_ids[..., :act_start],
inputs_embeds=inputs_embeds[:, :act_start],
pixel_values=pixel_values,
image_grid_thw=image_grid_thw,
mm_token_type_ids=mm_token_type_ids[:, :act_start],
use_cache=True,
return_dict=True,
)
x_t = self.sample_noise(
(batch_size, chunk_size, self.config.max_action_dim),
device,
).to(dtype=self.action_in_proj.weight.dtype)
dt = -1.0 / self.config.num_denoise_steps
past_key_values = outputs.past_key_values
# 3. Denoise only the action chunk while keeping the prefix cache invariant.
for step in range(self.config.num_denoise_steps):
time = torch.full(
(batch_size,),
1.0 + step * dt,
device=device,
dtype=torch.float32,
)
action_time_embs = self.embed_suffix(time, x_t)
inputs_embeds[:, act_slice] = action_time_embs.to(inputs_embeds.dtype)
# Keep the prefix KV cache invariant across denoising steps.
past_key_values.crop(act_start)
outputs = self.vlm_backbone.model(
attention_mask=attention_mask[:, :act_end],
past_key_values=past_key_values,
inputs_embeds=inputs_embeds[:, act_slice],
position_ids=position_ids[..., act_slice],
use_cache=True,
return_dict=True,
)
with self.flow_head_autocast_context():
hidden_states = outputs.last_hidden_state[:, :chunk_size]
hidden_states = hidden_states.to(dtype=self.action_out_proj.dtype)
v_t = self.action_out_proj(hidden_states)
x_t += dt * v_t.reshape(x_t.shape)
return x_t
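A self-contained toy check of the flow-matching parameterisation used in ``forward`` and ``sample_actions`` above: with the interpolation x_t = t * noise + (1 - t) * action and target velocity u_t = noise - action, Euler integration from t = 1 down to t = 0 with a perfect velocity prediction recovers the action exactly (a sketch; values are arbitrary):

import torch

action = torch.tensor([0.3, -1.2])
noise = torch.randn_like(action)
num_denoise_steps = 10
dt = -1.0 / num_denoise_steps
x_t = noise.clone()                  # start from pure noise at t = 1
for _ in range(num_denoise_steps):
    v_t = noise - action             # an ideal model would predict exactly u_t
    x_t = x_t + dt * v_t
assert torch.allclose(x_t, action, atol=1e-6)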
+282
@@ -0,0 +1,282 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Any
import torch
from lerobot.configs.types import FeatureType, PipelineFeatureType, PolicyFeature
from lerobot.policies.eo1.configuration_eo1 import EO1Config
from lerobot.processor import (
AddBatchDimensionProcessorStep,
ComplementaryDataProcessorStep,
DeviceProcessorStep,
NormalizerProcessorStep,
PolicyAction,
PolicyProcessorPipeline,
ProcessorStep,
ProcessorStepRegistry,
RenameObservationsProcessorStep,
UnnormalizerProcessorStep,
)
from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
from lerobot.types import TransitionKey
from lerobot.utils.constants import (
OBS_STATE,
POLICY_POSTPROCESSOR_DEFAULT_NAME,
POLICY_PREPROCESSOR_DEFAULT_NAME,
)
from lerobot.utils.import_utils import _transformers_available, require_package
if TYPE_CHECKING or _transformers_available:
from transformers.models.qwen2_5_vl import Qwen2_5_VLProcessor
else:
Qwen2_5_VLProcessor = None
SYSTEM_MESSAGE = "You are a helpful physical assistant."
# EO-1 special tokens
ACTION_START_TOKEN = "<|action_start|>" # nosec B105
DEFAULT_ACTION_TOKEN = "<|action_pad|>" # nosec B105
ACTION_END_TOKEN = "<|action_end|>" # nosec B105
STATE_START_TOKEN = "<|state_start|>" # nosec B105
DEFAULT_STATE_TOKEN = "<|state_pad|>" # nosec B105
STATE_END_TOKEN = "<|state_end|>" # nosec B105
TASK_VLA_TOKEN = "<|vla|>" # nosec B105
EO1_SPECIAL_TOKENS = [
ACTION_START_TOKEN,
DEFAULT_ACTION_TOKEN,
ACTION_END_TOKEN,
STATE_START_TOKEN,
DEFAULT_STATE_TOKEN,
STATE_END_TOKEN,
TASK_VLA_TOKEN,
]
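# Example (illustrative): the per-sample strings assembled from these tokens by
# EO1ConversationTemplateStep below ("pick up the cube" is a made-up task; images are
# attached as separate "image" content blocks).
chunk_size = 8
user_text = f"{STATE_START_TOKEN}{DEFAULT_STATE_TOKEN}{STATE_END_TOKEN}pick up the cube{TASK_VLA_TOKEN}"
assistant_text = f"{ACTION_START_TOKEN}{DEFAULT_ACTION_TOKEN * chunk_size}{ACTION_END_TOKEN}"
# user_text      == "<|state_start|><|state_pad|><|state_end|>pick up the cube<|vla|>"
# assistant_text == "<|action_start|>" + "<|action_pad|>" * 8 + "<|action_end|>"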
@dataclass
@ProcessorStepRegistry.register(name="eo1_conversation_template_processor")
class EO1ConversationTemplateStep(ComplementaryDataProcessorStep):
input_features: dict[str, PolicyFeature] | dict[str, dict[str, Any]]
chunk_size: int
_image_keys: list[str] = field(default_factory=list, init=False, repr=False)
def __post_init__(self):
# Robust JSON deserialization handling (guard empty maps).
if self.input_features:
first_val = next(iter(self.input_features.values()))
if isinstance(first_val, dict):
reconstructed = {}
for key, ft_dict in self.input_features.items():
reconstructed[key] = PolicyFeature(
type=FeatureType(ft_dict["type"]), shape=tuple(ft_dict["shape"])
)
self.input_features = reconstructed
self._image_keys = [
key for key, value in self.input_features.items() if value.type == FeatureType.VISUAL
]
def complementary_data(self, complementary_data):
tasks = complementary_data.get("task")
if tasks is None:
raise ValueError("Task is required for EO1ConversationTemplateStep.")
observation = self.transition.get(TransitionKey.OBSERVATION)
if observation is None:
raise ValueError("Observation is required for EO1ConversationTemplateStep.")
if OBS_STATE in observation and observation[OBS_STATE].shape[0] != len(tasks):
raise ValueError("Batch size mismatch between observation.state and task list.")
# LeRobot visual observations reach this processor as float32 tensors in [0, 1].
# Convert to uint8 in [0, 255] to meet the input requirement of Qwen2.5-VL-3B-Instruct.
images = {
key: observation[key].clamp(0, 1).mul(255.0).round().to(torch.uint8) for key in self._image_keys
}
messages = []
for i in range(len(tasks)):
content = [
*[{"type": "image", "image": images[key][i]} for key in self._image_keys],
{
"type": "text",
"text": (
f"{STATE_START_TOKEN}{DEFAULT_STATE_TOKEN}{STATE_END_TOKEN}{tasks[i]}{TASK_VLA_TOKEN}"
),
},
]
messages.append(
[
{"role": "system", "content": [{"type": "text", "text": SYSTEM_MESSAGE}]},
{"role": "user", "content": content},
{
"role": "assistant",
"content": [
{
"type": "text",
"text": f"{ACTION_START_TOKEN}{DEFAULT_ACTION_TOKEN * self.chunk_size}{ACTION_END_TOKEN}",
}
],
},
]
)
complementary_data["messages"] = messages
return complementary_data
def transform_features(
self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
"""
This step only materializes EO1-specific message objects in complementary_data.
PipelineFeatureType tracks only ACTION and OBSERVATION, so there is no static
feature contract change to record here.
"""
return features
def get_config(self) -> dict[str, Any]:
return {
"input_features": {
key: {"type": ft.type.value, "shape": ft.shape} for key, ft in self.input_features.items()
},
"chunk_size": self.chunk_size,
}
@dataclass
@ProcessorStepRegistry.register(name="eo1_qwen_processor")
class EO1QwenProcessorStep(ComplementaryDataProcessorStep):
processor_name: str = "Qwen/Qwen2.5-VL-3B-Instruct"
image_min_pixels: int | None = 64 * 28 * 28
image_max_pixels: int | None = 128 * 28 * 28
use_fast_processor: bool = False
_processor: Qwen2_5_VLProcessor | None = field(default=None, init=False, repr=False)
_state_token_id: int | None = field(default=None, init=False, repr=False)
_action_token_id: int | None = field(default=None, init=False, repr=False)
def __post_init__(self):
require_package("transformers", extra="eo1")
self._processor = Qwen2_5_VLProcessor.from_pretrained(
self.processor_name,
use_fast=self.use_fast_processor,
)
self._processor.tokenizer.add_tokens(EO1_SPECIAL_TOKENS, special_tokens=True)
self._state_token_id = self._processor.tokenizer.convert_tokens_to_ids(DEFAULT_STATE_TOKEN)
self._action_token_id = self._processor.tokenizer.convert_tokens_to_ids(DEFAULT_ACTION_TOKEN)
def complementary_data(self, complementary_data):
messages = complementary_data.pop("messages", None)
if messages is None:
raise ValueError("Messages are required for EO1QwenProcessorStep.")
# Rollout batches use left padding so action spans stay aligned across samples.
# Supervised batches use right padding to match standard training collation.
padding_side = "right" if self.transition.get(TransitionKey.ACTION) is not None else "left"
inputs = self._processor.apply_chat_template(
messages,
tokenize=True,
padding=True,
padding_side=padding_side,
min_pixels=self.image_min_pixels,
max_pixels=self.image_max_pixels,
add_generation_prompt=False,
return_dict=True,
return_tensors="pt",
)
complementary_data["input_ids"] = inputs["input_ids"]
complementary_data["pixel_values"] = inputs["pixel_values"]
complementary_data["image_grid_thw"] = inputs["image_grid_thw"]
complementary_data["attention_mask"] = inputs["attention_mask"]
complementary_data["mm_token_type_ids"] = inputs["mm_token_type_ids"]
complementary_data["state_token_id"] = self._state_token_id
complementary_data["action_token_id"] = self._action_token_id
return complementary_data
def get_config(self) -> dict[str, Any]:
return {
"processor_name": self.processor_name,
"image_min_pixels": self.image_min_pixels,
"image_max_pixels": self.image_max_pixels,
"use_fast_processor": self.use_fast_processor,
}
def transform_features(
self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
"""
This step only converts the messages to the model input format.
"""
return features
def make_eo1_pre_post_processors(
config: EO1Config,
dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
) -> tuple[
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
"""Build pre/post processor pipelines for EO1."""
input_steps: list[ProcessorStep] = [
RenameObservationsProcessorStep(rename_map={}),
AddBatchDimensionProcessorStep(),
NormalizerProcessorStep(
features={**config.input_features, **config.output_features},
norm_map=config.normalization_mapping,
stats=dataset_stats,
),
EO1ConversationTemplateStep(input_features=config.input_features, chunk_size=config.chunk_size),
EO1QwenProcessorStep(
processor_name=config.vlm_base,
image_min_pixels=config.image_min_pixels,
image_max_pixels=config.image_max_pixels,
use_fast_processor=config.use_fast_processor,
),
DeviceProcessorStep(device=config.device),
]
output_steps: list[ProcessorStep] = [
UnnormalizerProcessorStep(
features=config.output_features,
norm_map=config.normalization_mapping,
stats=dataset_stats,
),
DeviceProcessorStep(device="cpu"),
]
return (
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
steps=input_steps,
name=POLICY_PREPROCESSOR_DEFAULT_NAME,
),
PolicyProcessorPipeline[PolicyAction, PolicyAction](
steps=output_steps,
name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
to_transition=policy_action_to_transition,
to_output=transition_to_policy_action,
),
)
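For readers skimming this new file: a minimal standalone sketch (not part of the diff) of the conversation that ``EO1ConversationTemplateStep`` assembles for one sample, using the token constants defined above. The task string and ``chunk_size`` are made-up illustration values, and the per-camera image blocks are indicated by a comment instead of real tensors.
# Standalone sketch: the per-sample chat template built by EO1ConversationTemplateStep.
ACTION_START_TOKEN = "<|action_start|>"
DEFAULT_ACTION_TOKEN = "<|action_pad|>"
ACTION_END_TOKEN = "<|action_end|>"
STATE_START_TOKEN = "<|state_start|>"
DEFAULT_STATE_TOKEN = "<|state_pad|>"
STATE_END_TOKEN = "<|state_end|>"
TASK_VLA_TOKEN = "<|vla|>"
SYSTEM_MESSAGE = "You are a helpful physical assistant."
task = "pick up the red cube"   # hypothetical task string
chunk_size = 4                  # hypothetical action-chunk length
user_text = f"{STATE_START_TOKEN}{DEFAULT_STATE_TOKEN}{STATE_END_TOKEN}{task}{TASK_VLA_TOKEN}"
assistant_text = f"{ACTION_START_TOKEN}{DEFAULT_ACTION_TOKEN * chunk_size}{ACTION_END_TOKEN}"
sample_messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_MESSAGE}]},
    # one {"type": "image", "image": <uint8 tensor>} block per camera would precede the text block
    {"role": "user", "content": [{"type": "text", "text": user_text}]},
    {"role": "assistant", "content": [{"type": "text", "text": assistant_text}]},
]
print(assistant_text)
# <|action_start|><|action_pad|><|action_pad|><|action_pad|><|action_pad|><|action_end|>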
+18 -2
View File
@@ -46,6 +46,7 @@ from lerobot.utils.feature_utils import dataset_to_policy_features
from .act.configuration_act import ACTConfig
from .diffusion.configuration_diffusion import DiffusionConfig
from .eo1.configuration_eo1 import EO1Config
from .groot.configuration_groot import GrootConfig
from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
from .pi0.configuration_pi0 import PI0Config
@@ -146,6 +147,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
from .wall_x.modeling_wall_x import WallXPolicy
return WallXPolicy
elif name == "eo1":
from .eo1.modeling_eo1 import EO1Policy
return EO1Policy
else:
try:
return _get_policy_cls_from_policy_name(name=name)
@@ -196,6 +201,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
return XVLAConfig(**kwargs)
elif policy_type == "wall_x":
return WallXConfig(**kwargs)
elif policy_type == "eo1":
return EO1Config(**kwargs)
else:
try:
config_cls = PreTrainedConfig.get_choice_class(policy_type)
@@ -399,6 +406,13 @@ def make_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
elif isinstance(policy_cfg, EO1Config):
from .eo1.processor_eo1 import make_eo1_pre_post_processors
processors = make_eo1_pre_post_processors(
config=policy_cfg,
dataset_stats=kwargs.get("dataset_stats"),
)
else:
try:
@@ -514,7 +528,7 @@ def make_policy(
logging.info("Loading policy's PEFT adapter.")
peft_pretrained_path = cfg.pretrained_path
peft_pretrained_path = str(cfg.pretrained_path)
peft_config = PeftConfig.from_pretrained(peft_pretrained_path)
kwargs["pretrained_name_or_path"] = peft_config.base_model_name_or_path
@@ -527,7 +541,9 @@ def make_policy(
)
policy = policy_cls.from_pretrained(**kwargs)
policy = PeftModel.from_pretrained(policy, peft_pretrained_path, config=peft_config)
policy = PeftModel.from_pretrained(
policy, peft_pretrained_path, config=peft_config, is_trainable=True
)
else:
# Make a fresh policy.
@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from dataclasses import field
from typing import TYPE_CHECKING
import torch
@@ -109,7 +109,6 @@ class MultiEmbodimentActionEncoder(nn.Module):
return x
@dataclass
class FlowmatchingActionHeadConfig(PretrainedConfig):
"""NOTE: N1.5 uses XEmbFlowmatchingPolicyHeadConfig as action head"""
+8 -14
View File
@@ -444,13 +444,13 @@ class PaliGemmaWithExpertModel(
if image.dtype != torch.float32:
image = image.to(torch.float32)
image_outputs = self.paligemma.model.get_image_features(image)
features = image_outputs.pooler_output * self.paligemma.config.text_config.hidden_size**0.5
features = image_outputs.pooler_output
if features.dtype != out_dtype:
features = features.to(out_dtype)
return features
def embed_language_tokens(self, tokens: torch.Tensor):
return self.paligemma.model.language_model.embed_tokens(tokens)
return self.paligemma.model.language_model.get_input_embeddings()(tokens)
def forward(
self,
@@ -666,8 +666,7 @@ class PI0Pytorch(nn.Module): # see openpi `PI0Pytorch`
# Process language tokens
def lang_embed_func(lang_tokens):
lang_emb = self.paligemma_with_expert.embed_language_tokens(lang_tokens)
lang_emb_dim = lang_emb.shape[-1]
return lang_emb * math.sqrt(lang_emb_dim)
return lang_emb
lang_emb = self._apply_checkpoint(lang_embed_func, lang_tokens)
embs.append(lang_emb)
@@ -748,16 +747,8 @@ class PI0Pytorch(nn.Module): # see openpi `PI0Pytorch`
return embs, pad_masks, att_masks, adarms_cond
def forward(
self, images, img_masks, lang_tokens, lang_masks, state, actions, noise=None, time=None
) -> Tensor:
def forward(self, images, img_masks, lang_tokens, lang_masks, state, actions, noise, time) -> Tensor:
"""Do a full training forward pass and compute the loss."""
if noise is None:
noise = self.sample_noise(actions.shape, actions.device)
if time is None:
time = self.sample_time(actions.shape[0], actions.device)
time_expanded = time[:, None, None]
x_t = time_expanded * noise + (1 - time_expanded) * actions
u_t = noise - actions
@@ -1292,8 +1283,11 @@ class PI0Policy(PreTrainedPolicy):
state = self.prepare_state(batch)
actions = self.prepare_action(batch)
noise = self.model.sample_noise(actions.shape, actions.device)
time = self.model.sample_time(actions.shape[0], actions.device)
# Compute loss
losses = self.model.forward(images, img_masks, lang_tokens, lang_masks, state, actions)
losses = self.model.forward(images, img_masks, lang_tokens, lang_masks, state, actions, noise, time)
# Truncate losses to actual action dimensions
original_action_dim = self.config.output_features[ACTION].shape[0]
+5 -8
View File
@@ -728,14 +728,8 @@ class PI05Pytorch(nn.Module): # see openpi `PI0Pytorch`
return embs, pad_masks, att_masks, adarms_cond
def forward(self, images, img_masks, tokens, masks, actions, noise=None, time=None) -> Tensor:
def forward(self, images, img_masks, tokens, masks, actions, noise, time) -> Tensor:
"""Do a full training forward pass and compute the loss."""
if noise is None:
noise = self.sample_noise(actions.shape, actions.device)
if time is None:
time = self.sample_time(actions.shape[0], actions.device)
time_expanded = time[:, None, None]
x_t = time_expanded * noise + (1 - time_expanded) * actions
u_t = noise - actions
@@ -1262,8 +1256,11 @@ class PI05Policy(PreTrainedPolicy):
actions = self.prepare_action(batch)
noise = self.model.sample_noise(actions.shape, actions.device)
time = self.model.sample_time(actions.shape[0], actions.device)
# Compute loss (no separate state needed for PI05)
losses = self.model.forward(images, img_masks, tokens, masks, actions)
losses = self.model.forward(images, img_masks, tokens, masks, actions, noise, time)
# Truncate losses to actual action dimensions
original_action_dim = self.config.output_features[ACTION].shape[0]
@@ -16,7 +16,6 @@
import builtins
import logging
import math
from collections import deque
from pathlib import Path
from typing import TYPE_CHECKING, Literal, TypedDict, Unpack
@@ -261,13 +260,15 @@ class PI0FastPaliGemma(nn.Module):
if image.dtype != torch.float32:
image = image.to(torch.float32)
image_outputs = self.paligemma.model.get_image_features(image)
features = image_outputs.pooler_output * self.paligemma.config.text_config.hidden_size**0.5
features = image_outputs.pooler_output
norm = 2048**0.5
features = features / norm * norm
if features.dtype != out_dtype:
features = features.to(out_dtype)
return features
def embed_language_tokens(self, tokens: torch.Tensor):
return self.paligemma.model.language_model.embed_tokens(tokens)
return self.paligemma.model.language_model.get_input_embeddings()(tokens)
def forward(
self,
@@ -417,8 +418,7 @@ class PI0FastPytorch(nn.Module): # see openpi `PI0Pytorch`
# Process language instruction tokens
def lang_embed_func(tokens):
lang_emb = self.paligemma_with_expert.embed_language_tokens(tokens)
lang_emb_dim = lang_emb.shape[-1]
return lang_emb * math.sqrt(lang_emb_dim)
return lang_emb
lang_emb = self._apply_checkpoint(lang_embed_func, tokens)
embs.append(lang_emb)
@@ -432,8 +432,7 @@ class PI0FastPytorch(nn.Module): # see openpi `PI0Pytorch`
def fast_action_embed_func(fast_action_tokens):
fast_emb = self.paligemma_with_expert.embed_language_tokens(fast_action_tokens)
fast_emb_dim = fast_emb.shape[-1]
return fast_emb * math.sqrt(fast_emb_dim)
return fast_emb
fast_action_emb = self._apply_checkpoint(fast_action_embed_func, fast_action_tokens)
embs.append(fast_action_emb)
@@ -666,7 +665,6 @@ class PI0FastPytorch(nn.Module): # see openpi `PI0Pytorch`
if t < max_decoding_steps - 1:
# embed the newly generated token
next_token_emb = self.paligemma_with_expert.embed_language_tokens(next_token)
next_token_emb = next_token_emb * math.sqrt(next_token_emb.shape[-1])
if prefix_embs.dtype == torch.bfloat16:
next_token_emb = next_token_emb.to(dtype=torch.bfloat16)
@@ -771,7 +769,6 @@ class PI0FastPytorch(nn.Module): # see openpi `PI0Pytorch`
# Embed the single previous token
# We use embed_language_tokens directly to avoid overhead of full prefix embedding
next_token_emb = self.paligemma_with_expert.embed_language_tokens(next_token)
next_token_emb = next_token_emb * math.sqrt(next_token_emb.shape[-1])
if prefix_embs.dtype == torch.bfloat16:
next_token_emb = next_token_emb.to(dtype=torch.bfloat16)
@@ -97,8 +97,8 @@ class VQBeTConfig(PreTrainedConfig):
vision_backbone: str = "resnet18"
crop_shape: tuple[int, int] | None = (84, 84)
crop_is_random: bool = True
pretrained_backbone_weights: str | None = None
use_group_norm: bool = True
pretrained_backbone_weights: str | None = "ResNet18_Weights.IMAGENET1K_V1"
use_group_norm: bool = False
spatial_softmax_num_keypoints: int = 32
# VQ-VAE
n_vqvae_training_steps: int = 20000
@@ -22,7 +22,7 @@ from transformers.utils import (
add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_2_available,
is_flash_attn_greater_or_equal_2_10,
is_flash_attn_greater_or_equal,
is_torchdynamo_compiling,
logging,
replace_return_docstrings,
@@ -890,7 +890,7 @@ class Qwen2_5_VLFlashAttention2(Qwen2_5_VLAttention):
# TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
# flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal("2.1.0")
def forward(
self,
@@ -45,7 +45,7 @@ from transformers.utils import (
add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_2_available,
is_flash_attn_greater_or_equal_2_10,
is_flash_attn_greater_or_equal,
logging,
replace_return_docstrings,
)
@@ -909,7 +909,7 @@ class Florence2FlashAttention2(Florence2Attention):
# TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
# flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal("2.1.0")
def _reshape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
return tensor.view(bsz, seq_len, self.num_heads, self.head_dim)
+7
View File
@@ -95,6 +95,13 @@ from .relative_action_processor import (
from .rename_processor import RenameObservationsProcessorStep, rename_stats
from .tokenizer_processor import ActionTokenizerProcessorStep, TokenizerProcessorStep
# RenderMessagesStep is intentionally NOT re-exported here: it pulls in
# `lerobot.datasets.language`, which requires the `[dataset]` extra
# (`datasets`, `pyarrow`). Importing it from the processor package would
# break every base-install consumer of `lerobot.processor`. Users that
# need it import directly:
# from lerobot.processor.render_messages_processor import RenderMessagesStep
__all__ = [
"ActionProcessorStep",
"AddTeleopActionAsComplimentaryDataStep",
+18
View File
@@ -174,6 +174,24 @@ class AddBatchDimensionComplementaryDataStep(ComplementaryDataProcessorStep):
task_index_value = complementary_data["task_index"]
if isinstance(task_index_value, Tensor) and task_index_value.dim() == 0:
complementary_data["task_index"] = task_index_value.unsqueeze(0)
complementary_data.pop("language_persistent", None)
complementary_data.pop("language_events", None)
if "messages" in complementary_data:
messages = complementary_data["messages"]
if isinstance(messages, list) and (not messages or isinstance(messages[0], dict)):
complementary_data["messages"] = [messages]
if "message_streams" in complementary_data:
streams = complementary_data["message_streams"]
if isinstance(streams, list) and (not streams or isinstance(streams[0], str)):
complementary_data["message_streams"] = [streams]
if "target_message_indices" in complementary_data:
indices = complementary_data["target_message_indices"]
if isinstance(indices, list) and (not indices or isinstance(indices[0], int)):
complementary_data["target_message_indices"] = [indices]
return complementary_data
def transform_features(
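The three new guards above share one rule: a per-sample value (message dicts, stream strings, integer indices) gets wrapped into a one-element batch, while an already-batched value is left untouched. A standalone re-creation of that rule, independent of the lerobot step classes:
# Standalone re-creation of the batch-dimension rule added above (illustration only).
def add_batch_dim(value, elem_type):
    """Wrap a per-sample list into a 1-element batch; leave batched input untouched."""
    if isinstance(value, list) and (not value or isinstance(value[0], elem_type)):
        return [value]
    return value
messages = [{"role": "user", "content": []}]   # one sample's messages
streams = ["speech", "plan"]                   # one sample's stream labels
print(add_batch_dim(messages, dict))   # [[{'role': 'user', 'content': []}]]
print(add_batch_dim(streams, str))     # [['speech', 'plan']]
print(add_batch_dim([streams], str))   # already batched -> unchanged: [['speech', 'plan']]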
+20 -16
View File
@@ -153,26 +153,30 @@ def from_tensor_to_numpy(x: torch.Tensor | Any) -> np.ndarray | float | int | An
return x
_COMPLEMENTARY_KEYS = (
"task",
"index",
"task_index",
"episode_index",
"timestamp",
"language_persistent",
"language_events",
"messages",
"message_streams",
"target_message_indices",
)
def _extract_complementary_data(batch: dict[str, Any]) -> dict[str, Any]:
"""
Extract complementary data from a batch dictionary.
"""Extract complementary data from a batch dictionary.
This includes padding flags, task description, and indices.
Args:
batch: The batch dictionary.
Returns:
A dictionary with the extracted complementary data.
Includes padding flags (any key containing ``_is_pad``) plus the fixed
set of metadata / language keys defined in ``_COMPLEMENTARY_KEYS``,
each only when present in ``batch``.
"""
pad_keys = {k: v for k, v in batch.items() if "_is_pad" in k}
task_key = {"task": batch["task"]} if "task" in batch else {}
subtask_key = {"subtask": batch["subtask"]} if "subtask" in batch else {}
index_key = {"index": batch["index"]} if "index" in batch else {}
task_index_key = {"task_index": batch["task_index"]} if "task_index" in batch else {}
episode_index_key = {"episode_index": batch["episode_index"]} if "episode_index" in batch else {}
return {**pad_keys, **task_key, **subtask_key, **index_key, **task_index_key, **episode_index_key}
extras = {k: batch[k] for k in _COMPLEMENTARY_KEYS if k in batch}
return {**pad_keys, **extras}
def create_transition(
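A standalone re-creation of the new extraction rule (illustration only, not the library function itself): padding flags are collected by substring match, and the fixed key set is copied only when present in the batch.
# Standalone re-creation of the extraction rule above.
_COMPLEMENTARY_KEYS = (
    "task", "index", "task_index", "episode_index", "timestamp",
    "language_persistent", "language_events",
    "messages", "message_streams", "target_message_indices",
)
def extract_complementary_data(batch):
    pad_keys = {k: v for k, v in batch.items() if "_is_pad" in k}
    extras = {k: batch[k] for k in _COMPLEMENTARY_KEYS if k in batch}
    return {**pad_keys, **extras}
batch = {
    "observation.state": [0.0, 1.0],
    "action_is_pad": [False, False],
    "task": "stack the blocks",
    "language_events": [{"t": 1.2, "text": "watch the gripper"}],
}
print(extract_complementary_data(batch))
# {'action_is_pad': [False, False], 'task': 'stack the blocks', 'language_events': [{'t': 1.2, 'text': 'watch the gripper'}]}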
@@ -0,0 +1,94 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
from dataclasses import dataclass
from typing import Any
from lerobot.configs import PipelineFeatureType, PolicyFeature
from lerobot.configs.recipe import TrainingRecipe
from lerobot.datasets.language import LANGUAGE_EVENTS, LANGUAGE_PERSISTENT
from lerobot.datasets.language_render import render_sample
from lerobot.types import EnvTransition, TransitionKey
from .pipeline import ProcessorStep, ProcessorStepRegistry
@dataclass
@ProcessorStepRegistry.register(name="render_messages_processor")
class RenderMessagesStep(ProcessorStep):
"""Processor step that turns raw language columns into rendered chat messages.
Reads ``language_persistent`` and ``language_events`` from the transition's
complementary data, renders them through ``recipe`` at the sample timestamp,
and replaces the raw columns with the resulting ``messages`` /
``message_streams`` / ``target_message_indices`` keys.
"""
recipe: TrainingRecipe
dataset_ctx: Any | None = None
def __call__(self, transition: EnvTransition) -> EnvTransition | None:
"""Render messages for a single transition; return ``None`` to drop it."""
complementary_data = transition.get(TransitionKey.COMPLEMENTARY_DATA) or {}
persistent = complementary_data.get(LANGUAGE_PERSISTENT) or []
events = complementary_data.get(LANGUAGE_EVENTS) or []
if not persistent and not events:
return transition
timestamp = complementary_data.get("timestamp")
if timestamp is None:
raise KeyError("RenderMessagesStep requires sample timestamp in complementary data.")
sample_idx = complementary_data.get("index", 0)
rendered = render_sample(
recipe=self.recipe,
persistent=persistent,
events=events,
t=_scalar(timestamp),
sample_idx=int(_scalar(sample_idx)),
task=complementary_data.get("task"),
dataset_ctx=self.dataset_ctx,
)
if rendered is None:
return None
new_transition = transition.copy()
new_complementary_data = dict(complementary_data)
new_complementary_data.pop(LANGUAGE_PERSISTENT, None)
new_complementary_data.pop(LANGUAGE_EVENTS, None)
new_complementary_data.update(rendered)
new_transition[TransitionKey.COMPLEMENTARY_DATA] = new_complementary_data
return new_transition
def transform_features(
self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
"""Pass features through unchanged; rendering only touches complementary data."""
return features
def _scalar(value: Any) -> float | int:
"""Unwrap a tensor/array/single-element list into a Python scalar."""
if hasattr(value, "item"):
return value.item()
if isinstance(value, list):
if len(value) != 1:
raise ValueError(f"Expected a scalar, got list of length {len(value)}: {value!r}")
return _scalar(value[0])
return value
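A hedged illustration of the key swap ``RenderMessagesStep`` performs on complementary data. ``fake_render_sample`` below is a made-up stand-in for ``render_sample``; the real signature and return contract live in ``lerobot.datasets.language_render``.
# Illustration only: raw language columns go in, rendered message keys come out.
def fake_render_sample(persistent, events, t):
    # stand-in for lerobot.datasets.language_render.render_sample (hypothetical output shape)
    return {
        "messages": [{"role": "user", "content": [{"type": "text", "text": events[0]["text"]}]}],
        "message_streams": ["speech"],
        "target_message_indices": [0],
    }
complementary = {
    "timestamp": 3.4,
    "index": 17,
    "language_persistent": [{"text": "overall task: tidy the desk"}],
    "language_events": [{"t": 3.2, "text": "now grab the pen"}],
}
rendered = fake_render_sample(
    complementary["language_persistent"], complementary["language_events"], complementary["timestamp"]
)
updated = dict(complementary)
updated.pop("language_persistent", None)
updated.pop("language_events", None)
updated.update(rendered)
print(sorted(updated))
# ['index', 'message_streams', 'messages', 'target_message_indices', 'timestamp']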
@@ -54,6 +54,7 @@ class BiOpenArmFollower(Robot):
calibration_dir=config.calibration_dir,
port=config.left_arm_config.port,
disable_torque_on_disconnect=config.left_arm_config.disable_torque_on_disconnect,
use_velocity_and_torque=config.left_arm_config.use_velocity_and_torque,
max_relative_target=config.left_arm_config.max_relative_target,
cameras=left_cameras,
side=config.left_arm_config.side,
@@ -72,6 +73,7 @@ class BiOpenArmFollower(Robot):
calibration_dir=config.calibration_dir,
port=config.right_arm_config.port,
disable_torque_on_disconnect=config.right_arm_config.disable_torque_on_disconnect,
use_velocity_and_torque=config.right_arm_config.use_velocity_and_torque,
max_relative_target=config.right_arm_config.max_relative_target,
cameras=right_cameras,
side=config.right_arm_config.side,
+1 -1
View File
@@ -46,7 +46,7 @@ class LeKiwiConfig(RobotConfig):
cameras: dict[str, CameraConfig] = field(default_factory=lekiwi_cameras_config)
# Set to `True` for backward compatibility with previous policies/dataset
use_degrees: bool = False
use_degrees: bool = True
@dataclass
@@ -66,6 +66,10 @@ class OpenArmFollowerConfigBase:
# Whether to disable torque when disconnecting
disable_torque_on_disconnect: bool = True
# When True, expose `.vel` and `.torque` per motor in observation features.
# Default False for compatibility with the position-only openarm_mini teleoperator.
use_velocity_and_torque: bool = False
# Safety limit for relative target positions
# Set to a positive scalar for all motors, or a dict mapping motor names to limits
max_relative_target: float | dict[str, float] | None = None
@@ -93,8 +93,9 @@ class OpenArmFollower(Robot):
features: dict[str, type] = {}
for motor in self.bus.motors:
features[f"{motor}.pos"] = float
features[f"{motor}.vel"] = float # Add this
features[f"{motor}.torque"] = float # Add this
if self.config.use_velocity_and_torque:
features[f"{motor}.vel"] = float
features[f"{motor}.torque"] = float
return features
@property
@@ -235,8 +236,9 @@ class OpenArmFollower(Robot):
for motor in self.bus.motors:
state = states.get(motor, {})
obs_dict[f"{motor}.pos"] = state.get("position", 0.0)
obs_dict[f"{motor}.vel"] = state.get("velocity", 0.0)
obs_dict[f"{motor}.torque"] = state.get("torque", 0.0)
if self.config.use_velocity_and_torque:
obs_dict[f"{motor}.vel"] = state.get("velocity", 0.0)
obs_dict[f"{motor}.torque"] = state.get("torque", 0.0)
# Capture images from cameras
for cam_key, cam in self.cameras.items():
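A standalone re-creation of the gated feature construction above, with a hypothetical two-motor bus (the real class iterates ``self.bus.motors``), showing how the new ``use_velocity_and_torque`` flag changes the exposed keys:
# Illustration only: feature keys with the flag off vs on.
def observation_features(motors, use_velocity_and_torque):
    features = {}
    for motor in motors:
        features[f"{motor}.pos"] = float
        if use_velocity_and_torque:
            features[f"{motor}.vel"] = float
            features[f"{motor}.torque"] = float
    return features
motors = ["joint_1", "gripper"]  # hypothetical motor names
print(list(observation_features(motors, False)))
# ['joint_1.pos', 'gripper.pos']
print(list(observation_features(motors, True)))
# ['joint_1.pos', 'joint_1.vel', 'joint_1.torque', 'gripper.pos', 'gripper.vel', 'gripper.torque']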
+2 -2
View File
@@ -332,7 +332,7 @@ def build_rollout_context(
cfg.dataset.repo_id,
root=cfg.dataset.root,
batch_encoding_size=cfg.dataset.video_encoding_batch_size,
camera_encoder_config=cfg.dataset.camera_encoder_config,
vcodec=cfg.dataset.vcodec,
streaming_encoding=cfg.dataset.streaming_encoding,
encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
encoder_threads=cfg.dataset.encoder_threads,
@@ -367,7 +367,7 @@ def build_rollout_context(
image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera
* len(robot.cameras if hasattr(robot, "cameras") else []),
batch_encoding_size=cfg.dataset.video_encoding_batch_size,
camera_encoder_config=cfg.dataset.camera_encoder_config,
vcodec=cfg.dataset.vcodec,
streaming_encoding=cfg.dataset.streaming_encoding,
encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
encoder_threads=cfg.dataset.encoder_threads,
+123 -58
View File
@@ -33,12 +33,13 @@ Recording modes:
``record_autonomous=False``: Only correction windows are recorded.
Each correction (start to stop) becomes one episode.
Teleoperator expectations:
The user is responsible for keeping the leader arm aligned with the
follower arm at the moment a correction begins. Programmatic motor
handover (``enable_torque`` / ``disable_torque`` / ``write_goal_positions``)
is intentionally not invoked here; see the TODO in
:func:`DAggerStrategy._apply_transition` for the open design decision.
Teleoperator handover:
On AUTONOMOUS -> PAUSED, actuated teleops (those with non-empty
``feedback_features``, e.g. SO-101, OpenArmMini) are smoothly driven to
the follower's last position via ``send_feedback`` so the operator takes
over without a jerk. Non-actuated teleops cannot be driven,
so on PAUSED -> CORRECTING the follower is instead slid to the teleop's
current pose before the correction begins.
"""
from __future__ import annotations
@@ -175,17 +176,27 @@ class DAggerEvents:
# ---------------------------------------------------------------------------
# TODO(Steven): re-enable programmatic teleop alignment once we decide whether
# to enforce motor-control methods on every Teleoperator. Until then the user
# is responsible for moving the leader arm to the follower's pose at the moment
# a correction begins.
def _teleop_smooth_move_to(
teleop: Teleoperator, target_pos: dict, duration_s: float = 2.0, fps: int = 50
) -> None:
"""Smoothly move teleop to target position via linear interpolation.
def _teleop_supports_feedback(teleop: Teleoperator) -> bool:
"""Return True when the teleop can receive position feedback (is actuated).
TODO(Maxime): See if it is possible to unify this interface across teleops instead of duck-typing.
"""
return (
bool(teleop.feedback_features)
and hasattr(teleop, "disable_torque")
and hasattr(teleop, "enable_torque")
)
Requires the teleoperator to support motor control methods
(``enable_torque``, ``write_goal_positions``, ``get_action``).
def _teleop_smooth_move_to(
teleop: Teleoperator, target_pos: dict, duration_s: float = 2.0, fps: int = 30
) -> None:
"""Smoothly move an actuated teleop to ``target_pos`` via linear interpolation.
Requires the teleoperator to support feedback
(i.e. have non-empty ``feedback_features`` and implement ``disable_torque`` / ``enable_torque``).
TODO(Maxime): This blocks for up to ``duration_s`` seconds, during which
the follower robot receives no new actions; this could be an issue on LeKiwi.
"""
teleop.enable_torque()
current = teleop.get_action()
@@ -193,13 +204,28 @@ def _teleop_smooth_move_to(
for step in range(steps + 1):
t = step / steps
interp = {}
for k in current:
if k in target_pos:
interp[k] = current[k] * (1 - t) + target_pos[k] * t
else:
interp[k] = current[k]
teleop.write_goal_positions(interp)
interp = {
k: current[k] * (1 - t) + target_pos[k] * t if k in target_pos else current[k] for k in current
}
teleop.send_feedback(interp)
time.sleep(1 / fps)
def _follower_smooth_move_to(
robot: ThreadSafeRobot, current: dict, target: dict, duration_s: float = 1.0, fps: int = 30
) -> None:
"""Smoothly move the follower robot from ``current`` to ``target`` action.
Used when the teleop is non-actuated: instead of driving the leader arm
to the follower, we bring the follower to the teleop's current pose.
Both ``current`` and ``target`` must be in robot-action key space.
"""
steps = max(int(duration_s * fps), 1)
for step in range(steps + 1):
t = step / steps
interp = {k: current[k] * (1 - t) + target[k] * t if k in target else current[k] for k in current}
robot.send_action(interp)
time.sleep(1 / fps)
@@ -415,9 +441,6 @@ class DAggerStrategy(RolloutStrategy):
engine.reset()
interpolator.reset()
events.reset()
# TODO(Steven): re-enable once Teleoperator motor-control methods are
# standardised; until then the user pre-aligns the leader by hand.
# teleop.disable_torque()
engine.resume()
last_action: dict[str, Any] | None = None
@@ -441,8 +464,16 @@ class DAggerStrategy(RolloutStrategy):
transition = events.consume_transition()
if transition is not None:
old_phase, new_phase = transition
self._apply_transition(old_phase, new_phase, engine, interpolator, robot, teleop)
last_action = None
self._apply_transition(
old_phase,
new_phase,
engine,
interpolator,
ctx,
last_action,
)
if new_phase == DAggerPhase.AUTONOMOUS:
last_action = None
phase = events.phase
obs = robot.get_observation()
@@ -532,9 +563,6 @@ class DAggerStrategy(RolloutStrategy):
finally:
logger.info("DAgger continuous control loop ended — pausing engine")
engine.pause()
# TODO(Steven): re-enable once Teleoperator motor-control methods
# are standardised across all teleop implementations.
# teleop.disable_torque()
with contextlib.suppress(Exception):
with self._episode_lock:
dataset.save_episode()
@@ -570,9 +598,6 @@ class DAggerStrategy(RolloutStrategy):
engine.reset()
interpolator.reset()
events.reset()
# TODO(Steven): re-enable once Teleoperator motor-control methods are
# standardised; until then the user pre-aligns the leader by hand.
# teleop.disable_torque()
engine.resume()
last_action: dict[str, Any] | None = None
@@ -600,8 +625,16 @@ class DAggerStrategy(RolloutStrategy):
transition = events.consume_transition()
if transition is not None:
old_phase, new_phase = transition
self._apply_transition(old_phase, new_phase, engine, interpolator, robot, teleop)
last_action = None
self._apply_transition(
old_phase,
new_phase,
engine,
interpolator,
ctx,
last_action,
)
if new_phase == DAggerPhase.AUTONOMOUS:
last_action = None
# Correction ended -> save episode (blocking if not streaming)
if old_phase == DAggerPhase.CORRECTING and new_phase == DAggerPhase.PAUSED:
@@ -679,9 +712,6 @@ class DAggerStrategy(RolloutStrategy):
finally:
logger.info("DAgger corrections-only loop ended — pausing engine")
engine.pause()
# TODO(Steven): re-enable once Teleoperator motor-control methods
# are standardised across all teleop implementations.
# teleop.disable_torque()
with contextlib.suppress(Exception):
with self._episode_lock:
dataset.save_episode()
@@ -698,36 +728,71 @@ class DAggerStrategy(RolloutStrategy):
new_phase: DAggerPhase,
engine,
interpolator,
robot: ThreadSafeRobot,
teleop: Teleoperator,
ctx: RolloutContext,
prev_action: dict | None,
) -> None:
"""Execute side-effects for a validated phase transition."""
"""Execute side-effects for a validated phase transition, including smooth handovers.
AUTONOMOUS -> PAUSED (actuated teleop):
Pause the engine, then drive the leader arm to the follower's last
commanded position so the operator takes over without a jerk.
PAUSED -> CORRECTING (non-actuated teleop):
Slide the follower to the teleop's current pose so the robot meets
the operator's hand rather than jumping to it on the first frame.
CORRECTING -> PAUSED (actuated teleop):
Re-enable torque to hold position after correction.
This may be useful when cancelling the correction recording.
PAUSED -> AUTONOMOUS:
Reset and resume the inference engine.
"""
teleop = ctx.hardware.teleop
robot = ctx.hardware.robot_wrapper
logger.info("Phase transition: %s -> %s", old_phase.value, new_phase.value)
if old_phase == DAggerPhase.AUTONOMOUS and new_phase == DAggerPhase.PAUSED:
logger.info("Pausing engine robot holds position")
logger.info("Pausing engine - robot holds position")
engine.pause()
obs = robot.get_observation()
_robot_pos = {
k: v for k, v in obs.items() if k.endswith(".pos") and k in robot.observation_features
}
# TODO(Steven): once Teleoperator motor-control methods are
# standardised, drive the leader to the follower's pose here so the
# operator does not need to pre-align the arm by hand. Until then
# the user is responsible for the alignment.
# _teleop_smooth_move_to(teleop, _robot_pos, duration_s=2.0, fps=50)
elif new_phase == DAggerPhase.CORRECTING:
logger.info("Entering correction mode — human teleop control")
# TODO(Steven): re-enable once Teleoperator motor-control methods
# are standardised across all teleop implementations.
# teleop.disable_torque()
if _teleop_supports_feedback(teleop) and prev_action is not None:
# TODO(Maxime): prev_action is in robot action key space (output of robot_action_processor).
# send_feedback expects teleop feedback key space. For homogeneous setups (e.g. SO-101
# leader + SO-101 follower) the keys are identical so this works. If the processor pipeline
# does non-trivial key renaming (e.g. a rename_map on action keys), the interpolation in
# _teleop_smooth_move_to silently no-ops and the arm doesn't move.
logger.info("Smooth handover: moving leader arm to follower position")
_teleop_smooth_move_to(teleop, prev_action)
elif old_phase == DAggerPhase.PAUSED and new_phase == DAggerPhase.CORRECTING:
logger.info("Entering correction mode - human teleop control")
if not _teleop_supports_feedback(teleop) and prev_action is not None:
logger.info("Smooth handover: sliding follower to teleop position")
obs = robot.get_observation()
teleop_action = teleop.get_action()
processed = ctx.processors.teleop_action_processor((teleop_action, obs))
target = ctx.processors.robot_action_processor((processed, obs))
_follower_smooth_move_to(robot, prev_action, target)
# unlock the teleop for human control
if _teleop_supports_feedback(teleop):
teleop.disable_torque()
elif old_phase == DAggerPhase.CORRECTING and new_phase == DAggerPhase.PAUSED:
if _teleop_supports_feedback(teleop):
teleop.enable_torque()
elif new_phase == DAggerPhase.AUTONOMOUS:
logger.info("Resuming autonomous mode resetting engine and interpolator")
logger.info("Resuming autonomous mode - resetting engine and interpolator")
interpolator.reset()
engine.reset()
engine.resume()
# release teleop before resuming the policy
if _teleop_supports_feedback(teleop):
teleop.disable_torque()
# ------------------------------------------------------------------
# Background push (shared by both modes)
# ------------------------------------------------------------------
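Both smooth-move helpers above use the same linear handover rule: at each step, send ``current * (1 - t) + target * t`` for every key that has a target and hold the rest. A tiny standalone illustration with made-up joint values:
# Illustration of the handover interpolation used by _teleop_smooth_move_to and _follower_smooth_move_to.
current = {"shoulder.pos": 10.0, "elbow.pos": 40.0}
target = {"shoulder.pos": 30.0}  # elbow has no target -> held at its current value
steps = 4
for step in range(steps + 1):
    t = step / steps
    interp = {k: current[k] * (1 - t) + target[k] * t if k in target else current[k] for k in current}
    print(step, interp)
# 0 {'shoulder.pos': 10.0, 'elbow.pos': 40.0}
# ...
# 4 {'shoulder.pos': 30.0, 'elbow.pos': 40.0}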
+149
View File
@@ -0,0 +1,149 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""``lerobot-annotate`` — populate ``language_persistent`` and
``language_events`` columns on a LeRobot dataset.
Annotations live directly in ``data/chunk-*/file-*.parquet``.
Example:
uv run lerobot-annotate \\
--root=/path/to/dataset \\
--vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
For distributed runs, see ``examples/annotation/run_hf_job.py``.
"""
import logging
from pathlib import Path
from lerobot.annotations.steerable_pipeline.config import AnnotationPipelineConfig
from lerobot.annotations.steerable_pipeline.executor import Executor
from lerobot.annotations.steerable_pipeline.frames import make_frame_provider
from lerobot.annotations.steerable_pipeline.modules import (
GeneralVqaModule,
InterjectionsAndSpeechModule,
PlanSubtasksMemoryModule,
)
from lerobot.annotations.steerable_pipeline.validator import StagingValidator
from lerobot.annotations.steerable_pipeline.vlm_client import make_vlm_client
from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
from lerobot.configs import parser
logger = logging.getLogger(__name__)
def _resolve_root(cfg: AnnotationPipelineConfig) -> Path:
if cfg.root is not None:
return Path(cfg.root)
if cfg.repo_id is not None:
from huggingface_hub import snapshot_download
return Path(snapshot_download(repo_id=cfg.repo_id, repo_type="dataset"))
raise ValueError("Either --root or --repo_id must be provided.")
@parser.wrap()
def annotate(cfg: AnnotationPipelineConfig) -> None:
"""Run the steerable annotation pipeline against a dataset."""
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
root = _resolve_root(cfg)
logger.info("annotate: root=%s", root)
vlm = make_vlm_client(cfg.vlm)
frame_provider = make_frame_provider(root, camera_key=cfg.vlm.camera_key)
# Surface the resolved cameras up front so silent Module-3-no-op
# regressions are obvious in job output rather than discovered post-hoc
# by counting parquet rows.
cam_keys = list(getattr(frame_provider, "camera_keys", []) or [])
logger.info(
"annotate: frame_provider default camera=%r, all cameras=%s",
getattr(frame_provider, "camera_key", None),
cam_keys,
)
if cfg.module_3.enabled and not cam_keys:
logger.warning(
"annotate: Module 3 (VQA) is enabled but no cameras were "
"resolved — Module 3 will produce zero VQA rows. Check "
"meta/info.json for observation.images.* features, or pass "
"--vlm.camera_key=<key> to seed the cameras list."
)
module_1 = PlanSubtasksMemoryModule(vlm=vlm, config=cfg.module_1, frame_provider=frame_provider)
module_2 = InterjectionsAndSpeechModule(
vlm=vlm, config=cfg.module_2, seed=cfg.seed, frame_provider=frame_provider
)
module_3 = GeneralVqaModule(vlm=vlm, config=cfg.module_3, seed=cfg.seed, frame_provider=frame_provider)
writer = LanguageColumnsWriter()
validator = StagingValidator(
dataset_camera_keys=tuple(getattr(frame_provider, "camera_keys", []) or []) or None,
)
executor = Executor(
config=cfg,
module_1=module_1,
module_2=module_2,
module_3=module_3,
writer=writer,
validator=validator,
)
summary = executor.run(root)
logger.info("annotate: wrote %d shard(s)", len(summary.written_paths))
for phase in summary.phases:
logger.info(
"annotate: phase=%s processed=%d skipped=%d",
phase.name,
phase.episodes_processed,
phase.episodes_skipped,
)
if summary.validation_report.warnings:
for w in summary.validation_report.warnings:
logger.warning(w)
if cfg.push_to_hub:
_push_to_hub(root, cfg)
def _push_to_hub(root: Path, cfg: AnnotationPipelineConfig) -> None:
"""Upload the annotated dataset directory to the Hugging Face Hub."""
from huggingface_hub import HfApi # noqa: PLC0415
repo_id = cfg.push_to_hub
commit_message = cfg.push_commit_message or "Add steerable annotations (lerobot-annotate)"
api = HfApi()
print(f"[lerobot-annotate] creating/locating dataset repo {repo_id}...", flush=True)
api.create_repo(
repo_id=repo_id,
repo_type="dataset",
private=cfg.push_private,
exist_ok=True,
)
print(f"[lerobot-annotate] uploading {root} -> {repo_id}...", flush=True)
api.upload_folder(
folder_path=str(root),
repo_id=repo_id,
repo_type="dataset",
commit_message=commit_message,
ignore_patterns=[".annotate_staging/**", "**/.DS_Store"],
)
print(f"[lerobot-annotate] uploaded to https://huggingface.co/datasets/{repo_id}", flush=True)
def main() -> None:
annotate()
if __name__ == "__main__":
main()
+11 -26
View File
@@ -49,14 +49,6 @@ Delete episodes and save to a new dataset at a specific path and with a new repo
--operation.type delete_episodes \
--operation.episode_indices "[0, 2, 5]"
Delete episodes and re-encode video segments with h264:
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type delete_episodes \
--operation.episode_indices "[0, 2, 5]" \
--operation.camera_encoder_config.vcodec h264 \
--operation.camera_encoder_config.crf 23
Split dataset by fractions (pusht_train, pusht_val):
lerobot-edit-dataset \
--repo_id lerobot/pusht \
@@ -82,14 +74,6 @@ Split into more than two splits:
--operation.type split \
--operation.splits '{"train": 0.6, "val": 0.2, "test": 0.2}'
Split dataset and re-encode video segments with h264:
lerobot-edit-dataset \
--repo_id lerobot/pusht \
--operation.type split \
--operation.splits '{"train": 0.8, "val": 0.2}' \
--operation.camera_encoder_config.vcodec h264 \
--operation.camera_encoder_config.crf 23
Merge multiple datasets:
lerobot-edit-dataset \
--new_repo_id lerobot/pusht_merged \
@@ -203,7 +187,7 @@ import abc
import logging
import shutil
import sys
from dataclasses import dataclass, field
from dataclasses import dataclass
from pathlib import Path
import draccus
@@ -211,8 +195,6 @@ import draccus
from lerobot.configs import parser
from lerobot.datasets import (
LeRobotDataset,
VideoEncoderConfig,
camera_encoder_defaults,
convert_image_to_video_dataset,
delete_episodes,
merge_datasets,
@@ -236,14 +218,12 @@ class OperationConfig(draccus.ChoiceRegistry, abc.ABC):
@dataclass
class DeleteEpisodesConfig(OperationConfig):
episode_indices: list[int] | None = None
camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
@OperationConfig.register_subclass("split")
@dataclass
class SplitConfig(OperationConfig):
splits: dict[str, float | list[int]] | None = None
camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
@OperationConfig.register_subclass("merge")
@@ -270,7 +250,11 @@ class ModifyTasksConfig(OperationConfig):
@dataclass
class ConvertImageToVideoConfig(OperationConfig):
output_dir: str | None = None
camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
vcodec: str = "libsvtav1"
pix_fmt: str = "yuv420p"
g: int = 2
crf: int = 30
fast_decode: int = 0
episode_indices: list[int] | None = None
num_workers: int = 4
max_episodes_per_batch: int | None = None
@@ -372,7 +356,6 @@ def handle_delete_episodes(cfg: EditDatasetConfig) -> None:
episode_indices=cfg.operation.episode_indices,
output_dir=output_dir,
repo_id=output_repo_id,
camera_encoder_config=cfg.operation.camera_encoder_config,
)
logging.info(f"Dataset saved to {output_dir}")
@@ -404,7 +387,6 @@ def handle_split(cfg: EditDatasetConfig) -> None:
dataset,
splits=cfg.operation.splits,
output_dir=cfg.new_root,
camera_encoder_config=cfg.operation.camera_encoder_config,
)
for split_name, split_ds in split_datasets.items():
@@ -575,8 +557,11 @@ def handle_convert_image_to_video(cfg: EditDatasetConfig) -> None:
dataset=dataset,
output_dir=output_dir,
repo_id=output_repo_id,
camera_encoder_config=getattr(cfg.operation, "camera_encoder_config", None)
or camera_encoder_defaults(),
vcodec=getattr(cfg.operation, "vcodec", "libsvtav1"),
pix_fmt=getattr(cfg.operation, "pix_fmt", "yuv420p"),
g=getattr(cfg.operation, "g", 2),
crf=getattr(cfg.operation, "crf", 30),
fast_decode=getattr(cfg.operation, "fast_decode", 0),
episode_indices=getattr(cfg.operation, "episode_indices", None),
num_workers=getattr(cfg.operation, "num_workers", 4),
max_episodes_per_batch=getattr(cfg.operation, "max_episodes_per_batch", None),
+5 -26
View File
@@ -63,27 +63,6 @@ lerobot-record \\
--dataset.streaming_encoding=true \\
--dataset.encoder_threads=2
```
Example recording with custom video encoding parameters:
```shell
lerobot-record \\
--robot.type=so100_follower \\
--robot.port=/dev/tty.usbmodem58760431541 \\
--robot.cameras="{laptop: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \\
--robot.id=black \\
--teleop.type=so100_leader \\
--teleop.port=/dev/tty.usbmodem58760431551 \\
--teleop.id=blue \\
--dataset.repo_id=<my_username>/<my_dataset_name> \\
--dataset.num_episodes=2 \\
--dataset.single_task="Grab the cube" \\
--dataset.streaming_encoding=true \\
--dataset.encoder_threads=2 \\
--dataset.camera_encoder_config.vcodec=h264 \\
--dataset.camera_encoder_config.preset=fast \\
--dataset.camera_encoder_config.extra_options={"tune": "film", "profile:v": "high", "bf": 2} \\
--display_data=true
```
"""
import logging
@@ -398,10 +377,10 @@ def record(
cfg.dataset.repo_id,
root=cfg.dataset.root,
batch_encoding_size=cfg.dataset.video_encoding_batch_size,
camera_encoder_config=cfg.dataset.camera_encoder_config,
encoder_threads=cfg.dataset.encoder_threads,
vcodec=cfg.dataset.vcodec,
streaming_encoding=cfg.dataset.streaming_encoding,
encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
encoder_threads=cfg.dataset.encoder_threads,
image_writer_processes=cfg.dataset.num_image_writer_processes if num_cameras > 0 else 0,
image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * num_cameras
if num_cameras > 0
@@ -427,10 +406,10 @@ def record(
image_writer_processes=cfg.dataset.num_image_writer_processes,
image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * len(robot.cameras),
batch_encoding_size=cfg.dataset.video_encoding_batch_size,
camera_encoder_config=cfg.dataset.camera_encoder_config,
encoder_threads=cfg.dataset.encoder_threads,
vcodec=cfg.dataset.vcodec,
streaming_encoding=cfg.dataset.streaming_encoding,
encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
encoder_threads=cfg.dataset.encoder_threads,
)
robot.connect()
@@ -441,7 +420,7 @@ def record(
if not cfg.dataset.streaming_encoding:
logging.info(
"Streaming encoding is disabled. If you have capable hardware, consider enabling it for way faster episode saving. --dataset.streaming_encoding=true --dataset.encoder_threads=2 # --dataset.camera_encoder_config.vcodec=auto. More info in the documentation: https://huggingface.co/docs/lerobot/streaming_video_encoding"
"Streaming encoding is disabled. If you have capable hardware, consider enabling it for way faster episode saving. --dataset.streaming_encoding=true --dataset.encoder_threads=2 # --dataset.vcodec=auto. More info in the documentation: https://huggingface.co/docs/lerobot/streaming_video_encoding"
)
with VideoEncodingManager(dataset):
+14 -3
View File
@@ -48,6 +48,7 @@ from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
from lerobot.optim.factory import make_optimizer_and_scheduler
from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
from lerobot.rewards import make_reward_pre_post_processors
from lerobot.utils.collate import lerobot_collate_fn
from lerobot.utils.import_utils import register_third_party_plugins
from lerobot.utils.logging_utils import AverageMeter, MetricsTracker
from lerobot.utils.random_utils import set_seed
@@ -277,9 +278,14 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
if cfg.peft is not None:
if cfg.is_reward_model_training:
raise ValueError("PEFT is only supported for policy training. ")
logging.info("Using PEFT! Wrapping model.")
peft_cli_overrides = dataclasses.asdict(cfg.peft)
policy = policy.wrap_with_peft(peft_cli_overrides=peft_cli_overrides)
from peft import PeftModel
if isinstance(policy, PeftModel):
logging.info("PEFT adapter already loaded from checkpoint, skipping wrap_with_peft.")
else:
logging.info("Using PEFT! Wrapping model.")
peft_cli_overrides = dataclasses.asdict(cfg.peft)
policy = policy.wrap_with_peft(peft_cli_overrides=peft_cli_overrides)
# Wait for all processes to finish model creation before continuing
accelerator.wait_for_everyone()
@@ -396,6 +402,10 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
shuffle = True
sampler = None
# Only swap in the language-aware collate when the dataset actually
# declares language columns; otherwise stay on PyTorch's default
# collate so non-language training runs are unaffected.
collate_fn = lerobot_collate_fn if dataset.meta.has_language_columns else None
dataloader = torch.utils.data.DataLoader(
dataset,
num_workers=cfg.num_workers,
@@ -404,6 +414,7 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
sampler=sampler,
pin_memory=device.type == "cuda",
drop_last=False,
collate_fn=collate_fn,
prefetch_factor=cfg.prefetch_factor if cfg.num_workers > 0 else None,
persistent_workers=cfg.persistent_workers and cfg.num_workers > 0,
)
@@ -49,6 +49,7 @@ class BiOpenArmLeader(Teleoperator):
can_data_bitrate=config.left_arm_config.can_data_bitrate,
motor_config=config.left_arm_config.motor_config,
manual_control=config.left_arm_config.manual_control,
use_velocity_and_torque=config.left_arm_config.use_velocity_and_torque,
position_kd=config.left_arm_config.position_kd,
position_kp=config.left_arm_config.position_kp,
)
@@ -63,6 +64,7 @@ class BiOpenArmLeader(Teleoperator):
can_data_bitrate=config.right_arm_config.can_data_bitrate,
motor_config=config.right_arm_config.motor_config,
manual_control=config.right_arm_config.manual_control,
use_velocity_and_torque=config.right_arm_config.use_velocity_and_torque,
position_kd=config.right_arm_config.position_kd,
position_kp=config.right_arm_config.position_kp,
)
@@ -60,6 +60,10 @@ class OpenArmLeaderConfigBase:
# When enabled, motors have torque disabled for manual movement
manual_control: bool = True
# When True, expose `.vel` and `.torque` per motor in action features.
# Default False for compatibility with the position-only openarm_mini teleoperator.
use_velocity_and_torque: bool = False
# TODO(Steven, Pepijn): Not used ... ?
# MIT control parameters (used when manual_control=False for torque control)
# List of 8 values: [joint_1, joint_2, joint_3, joint_4, joint_5, joint_6, joint_7, gripper]
@@ -70,8 +70,9 @@ class OpenArmLeader(Teleoperator):
features: dict[str, type] = {}
for motor in self.bus.motors:
features[f"{motor}.pos"] = float
features[f"{motor}.vel"] = float
features[f"{motor}.torque"] = float
if self.config.use_velocity_and_torque:
features[f"{motor}.vel"] = float
features[f"{motor}.torque"] = float
return features
@property
@@ -201,8 +202,9 @@ class OpenArmLeader(Teleoperator):
for motor in self.bus.motors:
state = states.get(motor, {})
action_dict[f"{motor}.pos"] = state.get("position")
action_dict[f"{motor}.vel"] = state.get("velocity")
action_dict[f"{motor}.torque"] = state.get("torque")
if self.config.use_velocity_and_torque:
action_dict[f"{motor}.vel"] = state.get("velocity")
action_dict[f"{motor}.torque"] = state.get("torque")
dt_ms = (time.perf_counter() - start) * 1e3
logger.debug(f"{self} read state: {dt_ms:.1f}ms")

Some files were not shown because too many files have changed in this diff Show More