diff --git a/Makefile b/Makefile
index e02f02403..d3987101f 100644
--- a/Makefile
+++ b/Makefile
@@ -178,3 +178,9 @@ test-smolvla-ete-eval:
 		--env.episode_length=5 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1
+
+# E2E annotation pipeline smoke test against a tiny in-memory fixture
+# dataset. Opt-in (not part of `make test-end-to-end`) and uses a stub VLM
+# backend, so it does not require a real model checkpoint or GPU.
+annotation-e2e:
+	uv run python -m tests.annotations.run_e2e_smoke
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 0d4e36172..5d847a94d 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -45,6 +45,8 @@
     title: Language Columns and Recipes
   - local: tools
     title: Tools
+  - local: annotation_pipeline
+    title: Annotation Pipeline
   - local: video_encoding_parameters
     title: Video encoding parameters
   - local: streaming_video_encoding
diff --git a/docs/source/annotation_pipeline.mdx b/docs/source/annotation_pipeline.mdx
new file mode 100644
index 000000000..02658ec9a
--- /dev/null
+++ b/docs/source/annotation_pipeline.mdx
@@ -0,0 +1,291 @@
+# Annotation Pipeline
+
+`lerobot-annotate` watches each episode's video with a vision-language
+model (VLM) and writes natural-language annotations back into your
+dataset. It writes the two language columns described on the
+[Language Columns and Recipes](./language_and_recipes) page,
+`language_persistent` and `language_events`, directly into
+`data/chunk-*/file-*.parquet`.
+
+In short: point it at a LeRobot dataset, and it adds subtasks, plans,
+memory, interjections, speech, and visual Q&A that a policy can be
+trained on.
+
+## How it fits together
+
+```text
+  your dataset                  lerobot-annotate
+  (LeRobot v3.1)
+        │
+        ▼
+  ┌─────────────────────────────────────────────────────┐
+  │                    read episodes                     │
+  └──────────────────────────┬──────────────────────────┘
+                             │
+        ┌────────────────────┼────────────────────┐
+        ▼                    ▼                     ▼
+  ┌──────────┐      ┌───────────────┐        ┌──────────┐       one shared Qwen-VL
+  │   plan   │      │ interjections │        │   vqa    │  ◀──   server (vLLM, OpenAI
+  └────┬─────┘      └───────┬───────┘        └────┬─────┘        API) drives all three
+       └────────────────────┼─────────────────────┘
+                            │   each module stages raw JSONL
+                            ▼   into .annotate_staging/
+                  ┌─────────────────┐
+                  │    validator    │  ◀──  checks everything
+                  └────────┬────────┘
+                           ▼
+                  ┌─────────────────┐
+                  │     writer      │
+                  └────────┬────────┘
+                           ▼
+              data/chunk-*/file-*.parquet
+              (+ meta/info.json tools)
+```
+
+Three modules (`plan`, `interjections`, `vqa`) all talk to **one** shared
+VLM. Each module stages its output to disk, a validator checks it, and a
+single writer rewrites the dataset shards in place.
+
+## What the pipeline produces
+
+Each module emits a few kinds of annotation ("styles"), routed to one of
+the two language columns:
+
+| Style / atom                                | Column                | Module          |
+| ------------------------------------------- | --------------------- | --------------- |
+| `subtask` (Pi0.7-style "how, not what")     | `language_persistent` | `plan`          |
+| `plan` (initial + refresh on interjection)  | `language_persistent` | `plan`          |
+| `memory` (MEM-style compression)            | `language_persistent` | `plan`          |
+| `task_aug` (rephrasings of the task)        | `language_persistent` | `plan`          |
+| `interjection`                              | `language_events`     | `interjections` |
+| speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections` |
+| `vqa` (user / assistant pair)               | `language_events`     | `vqa`           |
+
+### How subtasks are generated
+
+The `plan` module doesn't ask the VLM for subtasks in one shot. Instead
+it uses a two-step **describe → segment** flow:
+
+1. **Describe** — the VLM narrates only what it actually sees in the
+   chosen camera (no guessing about the task).
+2. **Segment** — that description is fed back in, and the VLM splits the
+   episode into consecutive atomic subtasks.
+
+Both passes see the episode as **timestamped contact sheets** — frames
+sampled at `frames_per_second` (`2.0` by default, one every 0.5s) and packed into JPEG
+grids with each frame's time burned into its corner, so the VLM cites
+exact boundary times directly. This is far cheaper in vision tokens than
+one image per frame, so the sampling can stay dense; episodes longer than
+`max_frames_per_prompt` are split into windows at the same density and
+merged. Both prompts also carry a causal **event-boundary** definition (a
+new event starts when an object becomes held / is released / reaches a new
+location / a lid changes state / contents move) to sharpen where cuts land.
+
+The resulting spans are then stitched into a gap-free, full-episode
+cover, so **every frame has exactly one active subtask**. See
+[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
+for the production settings (single camera, timestamped contact sheets,
+auto-windowed subtask generation).
+
+### Tools
+
+The writer does **not** add a `tools` column to the parquet. The tool
+catalog lives in `meta/info.json["tools"]` instead (see [Tools](./tools)).
+After every run, the pipeline makes sure the canonical `say` schema is in
+that list, keeping any tools you declared beforehand.
+
+Want to add your own tool? Edit `meta/info.json["tools"]` directly — the
+pipeline preserves whatever is already there. That makes the tool visible
+to the chat template, so the model can learn to _generate_ the call. The
+runtime layer that actually _executes_ a generated call (the `Tool`
+protocol / `TOOL_REGISTRY` under `src/lerobot/tools/`) is not shipped
+with the annotation pipeline; the [Tools](./tools) doc marks those
+pieces as not-yet-implemented.
+
+## Running on Hugging Face Jobs
+
+Annotation runs on [Hugging Face Jobs](https://huggingface.co/docs/hub/en/jobs).
+The repo ships a launcher script you copy and tweak for your dataset:
+
+```bash
+HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
+```
+
+[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
+starts a single-GPU `h200` job (bump it to `h200x4` for big datasets)
+that:
+
+1. installs `lerobot` (from `main`) plus the annotation extras,
+2. boots one vLLM server per GPU (using the `vllm/vllm-openai` image) and
+   drives it over the OpenAI-compatible API,
+3. runs the `plan` / `interjections` / `vqa` modules across the dataset
+   with `lerobot-annotate`,
+4. with `--push_to_hub=true`, uploads the result to `--new_repo_id` (or
+   back to `--repo_id` in place if you leave that unset).
+
+To use a different dataset, model, or hub repo, edit the `CMD` block in
+the script. Every flag there maps directly to a `lerobot-annotate` flag
+(run `lerobot-annotate --help` for the full list).
+
+## Key options
+
+These are the flags you'll reach for most often. Run
+`lerobot-annotate --help` for everything else; the defaults are tuned for
+short manipulation episodes.
+
+### Dataset in / out
+
+| Flag              | Default | What it does                                                            |
+| ----------------- | ------- | ----------------------------------------------------------------------- |
+| `--repo_id`       | —       | Hub dataset to annotate (downloaded if `--root` unset).                 |
+| `--root`          | —       | Annotate a local dataset directory instead.                             |
+| `--new_repo_id`   | —       | Push the result to a new repo (leaves the source repo untouched).       |
+| `--push_to_hub`   | `false` | Upload after annotating (to `--new_repo_id`, else back to `--repo_id`). |
+| `--only_episodes` | all     | Annotate just these episode indices (handy for a test run).             |
+| `--seed`          | `1729`  | Seeds the RNGs that pick interjection timestamps + VQA question types.  |
+
+### Which modules run
+
+Every module is on by default and can be toggled independently (set to
+`false` to skip it, e.g. to iterate on one module at a time):
+
+| Flag                      | Default | Turns off                           |
+| ------------------------- | ------- | ----------------------------------- |
+| `--plan.enabled`          | `true`  | subtasks + plan + memory + task_aug |
+| `--interjections.enabled` | `true`  | interjections + speech atoms        |
+| `--vqa.enabled`           | `true`  | the VQA pairs                       |
+
+### The VLM (`--vlm.*`)
+
+| Flag                       | Default            | What it does                                                                        |
+| -------------------------- | ------------------ | ----------------------------------------------------------------------------------- |
+| `--vlm.model_id`           | `Qwen/Qwen3.6-27B` | The model to serve and prompt.                                                      |
+| `--vlm.camera_key`         | first `images.*`   | Which camera every prompt is grounded on.                                           |
+| `--vlm.serve_command`      | auto               | The exact `vllm serve …` command (set TP size, GPU memory, `--max-model-len` here). |
+| `--vlm.parallel_servers`   | `1`                | Independent servers for round-robin routing (one per GPU).                          |
+| `--vlm.num_gpus`           | `0`                | GPUs per server (`0` = one each).                                                   |
+| `--vlm.client_concurrency` | `16`               | In-flight requests across all servers.                                              |
+| `--vlm.max_new_tokens`     | `512`              | Generation cap per call.                                                            |
+| `--vlm.temperature`        | `0.2`              | Sampling temperature.                                                               |
+
+### Subtasks / plan / memory (`--plan.*`)
+
+| Flag                            | Default    | What it does                                                                                                              |
+| ------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------- |
+| `--plan.frames_per_second`      | `2.0`      | Frame sampling rate for the contact sheets (`2.0` = one frame every 0.5s).                                                |
+| `--plan.max_frames_per_prompt`  | `60`       | Frame budget per VLM call. Episodes whose sampling exceeds this are auto-windowed at the same density, then stitched.     |
+| `--plan.contact_sheet_columns`  | `5`        | Columns per contact-sheet grid (`contact_sheet_frames_per_sheet` tiles, time row-major).                                  |
+| `--plan.plan_max_steps`         | `8`        | Upper bound on subtasks per episode.                                                                                      |
+| `--plan.subtask_describe_first` | `true`     | Run the describe→segment grounding pass (best subtask quality; +1 call/episode).                                          |
+| `--plan.emit_plan`              | `true`     | Emit the numbered `plan` rows (`false` = subtasks + memory only).                                                         |
+| `--plan.emit_memory`            | `true`     | Emit the `memory` rows (`false` = subtasks + plan only); symmetric to `emit_plan`.                                        |
+| `--plan.n_task_rephrasings`     | `10`       | How many `task_aug` rephrasings to emit (`0` disables).                                                                   |
+| `--plan.derive_task_from_video` | `if_short` | Use the dataset task as-is (`off`), only when it's missing/short (`if_short`), or always re-derive from video (`always`). |
+
+### Interjections, VQA, and executor
+
+| Flag                                            | Default | What it does                                               |
+| ----------------------------------------------- | ------- | ---------------------------------------------------------- |
+| `--interjections.max_interjections_per_episode` | `3`     | Cap on interjection/speech pairs per episode.              |
+| `--vqa.vqa_emission_hz`                         | `1.0`   | How often VQA pairs are emitted.                           |
+| `--vqa.restrict_to_default_camera`              | `false` | Ground VQA only on `--vlm.camera_key` (else every camera). |
+| `--executor.episode_parallelism`                | `16`    | Episodes processed concurrently within each phase.         |
+
+## Contributing new modules
+
+The pipeline is built to grow, and **contributions are very welcome** —
+a brand-new module (say, trajectory traces or affordances), a new prompt
+template, a smarter grounding flow, or quality fixes to the existing
+`plan` / `interjections` / `vqa` modules.
+
+Every module lives under
+`src/lerobot/annotations/steerable_pipeline/modules/`, shares the VLM
+client and the keyframe cache, writes its raw output to the staging
+tree, and plugs into the executor as its own phase. Got an idea? Open an
+issue or PR on [the repo](https://github.com/huggingface/lerobot).
+
+## How recipes consume the output
+
+The annotations are meant to be read by recipes (see
+[Language Columns and Recipes](./language_and_recipes)). Typically:
+
+- low-level / high-level / memory-update branches read
+  `subtask` / `plan` / `memory` from `language_persistent`.
+- an interjection-response branch reads `interjection` events plus the
+  paired speech atom (merged into one assistant turn via `tool_calls_from`)
+  and the matching `plan` refresh at the same timestamp.
+- a VQA branch reads the `(vqa, user)` and `(vqa, assistant)` pairs from
+  `language_events`.
+
+## Why state and events are split
+
+Two ideas shape the design:
+
+1. **Persistent state vs. exact events.** Persistent rows (`subtask`,
+   `plan`, `memory`) apply to the whole episode and answer "what's true
+   right now?". Event rows (`interjection`, `vqa`, speech) appear only on
+   the one frame whose timestamp matches. Timestamps are copied straight
+   from the source parquet — never recomputed in floating point.
+2. **One shared VLM.** All three modules share a single VLM client (the
+   OpenAI-compatible client talking to the job's vLLM server), so you pay
+   for one model load per dataset, not three.
+
+## Re-running a single module
+
+Each module stages its raw output to
+`<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. This makes
+prompt iteration cheap: re-running one module overwrites only its own
+JSONL, then the writer recomposes the final parquet. Disable modules you
+don't want with `--plan.enabled=false` (and likewise
+`--interjections.enabled` / `--vqa.enabled`) to test one at a time.
+
+## What the validator checks
+
+Before the writer runs, `StagingValidator` confirms:
+
+- every event row lands exactly on a real frame timestamp;
+- no speech / interjection pairs are left orphaned;
+- `plan` is refreshed at every interjection timestamp;
+- `memory` rows fall on subtask boundaries (a warning, not an error);
+- each VQA assistant `content` is valid JSON in one of the
+  bbox / keypoint / count / attribute / spatial shapes;
+- every row goes to the column chosen by `column_for_style(style)`.
+
+Any error aborts the writer. Pass `--skip_validation=true` to override
+while debugging.
+
+## Where each module's ideas come from
+
+- **`plan` — subtasks.** Hi Robot ([Shi 2025](https://arxiv.org/abs/2502.19417))
+  for atom granularity ("pick up one piece of lettuce", "place bowl to
+  box"); Pi0.7 ([Physical Intelligence 2025](https://pi.website/pi07))
+  for "how, not what" detail.
+- **`plan` — memory.** MEM ([Torne 2026](https://arxiv.org/abs/2603.03596)):
+  keep only the minimal relevant information — preserve outcomes, drop
+  specific attributes.
+- **`interjections`.** Hi Robot's scenario taxonomy: negative task,
+  situated correction, specific constraint, preference. Speech is a
+  tool-call-only atom
+  (`tool_calls=[{type:function, function:{name:"say", arguments:{text:...}}}]`).
+- **`vqa`.** ECoT ([Zawalski 2024](https://arxiv.org/abs/2407.08693)) for
+  grounded features (pixel bounding boxes `[x_min, y_min, x_max, y_max]`,
+  keypoints) and Steerable VLA Policies
+  ([Zhao 2025](https://arxiv.org/abs/2509.07626)) for multi-abstraction
+  grounding. Pi0.7 also grounds answers across abstraction levels.
+
+When improving a module, tweak its prompt template in
+`src/lerobot/annotations/steerable_pipeline/prompts/` rather than
+rewriting from scratch.
+
+## Roughly how much it costs
+
+Per episode, the pipeline makes about `plan_max_steps` plan calls,
+`max_interjections_per_episode` interjection calls, and
+`vqa_emission_hz × episode_seconds` VQA calls. With the defaults (8
+subtasks, 3 interjections, 1 Hz × 3 pairs) on a 30-second episode, that's
+~50 VLM calls.
+
+Storage stays small: `language_persistent` is at most tens of KB per
+episode (parquet dictionary-encodes the one entry that repeats across
+frames), and `language_events` is empty on most frames — its size scales
+with the number of emissions, not `num_frames × num_emissions`.
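
Note on the cost section of `annotation_pipeline.mdx`: the per-episode estimate can be checked with a few lines of arithmetic. The sketch below is not part of the PR. `CostInputs` and `estimate_vlm_calls` are hypothetical names, and the defaults are copied from `config.py` in this diff. It counts only the three terms the doc names (plan, interjection, and VQA calls), so it lands somewhat below the rounded figure quoted in the doc.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CostInputs:
    """Hypothetical container; defaults mirror config.py in this diff."""

    plan_max_steps: int = 8
    max_interjections_per_episode: int = 3
    vqa_emission_hz: float = 1.0


def estimate_vlm_calls(episode_seconds: float, cfg: CostInputs | None = None) -> int:
    """Sum the three per-episode terms named in the docs.

    Assumption: one VLM call per plan step, per interjection, and per VQA
    emission tick. Extra passes (describe, task rephrasings) are not modeled.
    """
    cfg = cfg or CostInputs()
    vqa_calls = round(cfg.vqa_emission_hz * episode_seconds)
    return cfg.plan_max_steps + cfg.max_interjections_per_episode + vqa_calls


if __name__ == "__main__":
    # 30-second episode with the default settings.
    print(estimate_vlm_calls(30.0))
```

Because the VQA term scales with episode length, it dominates for anything longer than a few seconds; lowering `--vqa.vqa_emission_hz` is the most direct way to cut the call count.
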
diff --git a/examples/annotations/run_hf_job.py b/examples/annotations/run_hf_job.py
new file mode 100644
index 000000000..a77e22f14
--- /dev/null
+++ b/examples/annotations/run_hf_job.py
@@ -0,0 +1,77 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6-27B VLM).
+
+Spawns one single-GPU ``h200`` job that:
+
+  1. installs ``lerobot`` from ``main`` plus the annotation extras,
+  2. boots one vllm server with Qwen3.6-27B (dense VLM),
+  3. runs the plan / interjections / vqa modules across the dataset
+     in free-form mode (each episode generates its own subtasks +
+     memory),
+  4. uploads the annotated dataset to ``--new_repo_id`` (when set)
+     or back to ``--repo_id``.
+
+Usage:
+
+    HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
+
+Adjust ``CMD`` (dataset, model, hub repo) and ``flavor`` below for your
+run. For larger datasets, scale to ``h200x4`` and raise
+``--vlm.parallel_servers`` / ``--vlm.num_gpus`` to match.
+"""
+
+import os
+
+from huggingface_hub import get_token, run_job
+
+token = os.environ.get("HF_TOKEN") or get_token()
+if not token:
+    raise RuntimeError("No HF token. Run `huggingface-cli login` or `export HF_TOKEN=hf_...`")
+
+CMD = (
+    "apt-get update -qq && apt-get install -y -qq git ffmpeg && "
+    "pip install --no-deps "
+    "'lerobot @ git+https://github.com/huggingface/lerobot.git@main' && "
+    "pip install --upgrade-strategy only-if-needed "
+    "datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include toml typing-inspect "
+    "openai && "
+    "export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 && "
+    "export VLLM_VIDEO_BACKEND=pyav && "
+    "lerobot-annotate "
+    "--repo_id=pepijn223/robocasa_pretrain_human300_v4 "
+    "--new_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated "
+    "--push_to_hub=true "
+    "--vlm.backend=openai "
+    "--vlm.model_id=Qwen/Qwen3.6-27B "
+    "--vlm.num_gpus=1 "
+    '--vlm.serve_command="vllm serve Qwen/Qwen3.6-27B '
+    "--tensor-parallel-size 1 --max-model-len 32768 "
+    '--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
+    "--vlm.serve_ready_timeout_s=1800 "
+    # Qwen3.6 ships with thinking on; annotation wants plain JSON answers.
+    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}'"
+)
+
+job = run_job(
+    image="vllm/vllm-openai:latest",
+    command=["bash", "-c", CMD],
+    flavor="h200",
+    secrets={"HF_TOKEN": token},
+    timeout="2h",
+)
+print(f"Job URL: {job.url}")
+print(f"Job ID:  {job.id}")
diff --git a/pyproject.toml b/pyproject.toml
index e43f8ef81..0dc86d7ff 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -229,6 +229,21 @@ vla_jepa = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]", "lerobot[qwen
 async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
 peft = ["lerobot[transformers-dep]", "lerobot[peft-dep]"]
 
+# Annotation pipeline (lerobot-annotate). The only backend is ``openai``,
+# which talks to any OpenAI-compatible server (``vllm serve`` /
+# ``transformers serve`` / hosted). Distributed runs use Hugging Face Jobs
+# (see examples/annotations/run_hf_job.py).
+annotations = [
+    "lerobot[dataset]",
+    "lerobot[transformers-dep]",
+    "openai>=1.40,<2.0",
+    # ``vllm`` is intentionally NOT a hard dep: it pins an older torch, and
+    # uv's single unified lock would then cap ``torch`` for every extra
+    # (e.g. forcing 2.8 while ``torchcodec`` in [dataset] needs 2.11 -> ABI
+    # break in CI). The HF Jobs image (``vllm/vllm-openai``) provides vLLM;
+    # install it locally only if you run your own ``vllm serve``.
+]
+
 # Development
 dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools>=1.73.1,<2.0.0", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
 notebook = ["jupyter>=1.0.0,<2.0.0", "ipykernel>=6.0.0,<7.0.0"]
@@ -323,6 +338,7 @@ lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
 lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
 lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
 lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
+lerobot-annotate="lerobot.scripts.lerobot_annotate:main"
 lerobot-rollout="lerobot.scripts.lerobot_rollout:main"
 
 # ---------------- Tool Configurations ----------------
@@ -341,7 +357,7 @@ torch = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]
 torchvision = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]
 
 [tool.setuptools.package-data]
-lerobot = ["envs/*.json"]
+lerobot = ["envs/*.json", "annotations/steerable_pipeline/prompts/*.txt"]
 
 [tool.setuptools.packages.find]
 where = ["src"]
diff --git a/src/lerobot/annotations/__init__.py b/src/lerobot/annotations/__init__.py
new file mode 100644
index 000000000..67782f192
--- /dev/null
+++ b/src/lerobot/annotations/__init__.py
@@ -0,0 +1,15 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/src/lerobot/annotations/steerable_pipeline/__init__.py b/src/lerobot/annotations/steerable_pipeline/__init__.py
new file mode 100644
index 000000000..a8da5e05e
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/__init__.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Steerable annotation pipeline producing ``language_persistent`` and
+``language_events`` columns for LeRobot datasets.
+
+The pipeline is decomposed into three independently runnable modules whose
+outputs are staged per-episode before a final parquet rewrite:
+
+- :mod:`.modules.plan_subtasks_memory` (the ``plan`` module) — persistent styles
+- :mod:`.modules.interjections_and_speech` (the ``interjections`` module) — event styles + speech
+- :mod:`.modules.general_vqa` (the ``vqa`` module) — event-style VQA pairs
+"""
+
+from .config import AnnotationPipelineConfig
+from .validator import StagingValidator, ValidationReport
+from .writer import LanguageColumnsWriter
+
+__all__ = [
+    "AnnotationPipelineConfig",
+    "LanguageColumnsWriter",
+    "StagingValidator",
+    "ValidationReport",
+]
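
Note on per-episode staging: the docs in this PR give the staging layout as `<root>/.annotate_staging/episode_{N:06d}/<module>.jsonl`. A minimal sketch of building such a path follows. `staging_path` is a hypothetical helper, not a function from this package, and the module file stems used here (`plan`, `interjections`, `vqa`) are assumed to match the module names in the docs.

```python
from __future__ import annotations

from pathlib import Path


def staging_path(root: Path, episode_index: int, module: str) -> Path:
    """Build the staged JSONL path for one module and one episode.

    Follows the documented pattern; the zero-padded six-digit episode
    directory comes from the `episode_{N:06d}` format in the docs.
    """
    return root / ".annotate_staging" / f"episode_{episode_index:06d}" / f"{module}.jsonl"


if __name__ == "__main__":
    for name in ("plan", "interjections", "vqa"):
        print(staging_path(Path("/data/my_dataset"), 7, name))
```

Keeping one file per module per episode is what lets a single module be re-run without touching the other modules' staged output.
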
diff --git a/src/lerobot/annotations/steerable_pipeline/config.py b/src/lerobot/annotations/steerable_pipeline/config.py
new file mode 100644
index 000000000..86d6cadd9
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/config.py
@@ -0,0 +1,211 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+
+@dataclass
+class PlanConfig:
+    """``plan`` module: subtasks + plan + memory + task augmentation."""
+
+    enabled: bool = True
+
+    # ``task_aug`` rephrasings at t=0 (renderer rotates ${task} among them); 0 disables.
+    n_task_rephrasings: int = 10
+
+    # Derive the task from video instead of episode_task: off / if_short / always.
+    # Affects prompts only; ``meta/tasks.parquet`` is untouched.
+    derive_task_from_video: str = "if_short"
+    derive_task_min_words: int = 3
+
+    # --- Frame input: timestamped contact sheets (always on) ---------------
+    # The subtask describe/segment passes ALWAYS render the episode as
+    # macrodata/refiner-style contact sheets: sampled frames packed into JPEG
+    # grids with each frame's timestamp burned into its corner, so the VLM
+    # cites the exact source time of a boundary directly. This is far cheaper
+    # in vision tokens than one image per frame (≈2× faster subtask generation
+    # in practice), which is why the sampling is dense by default.
+    #
+    # ``frames_per_second`` is the sampling rate: 2.0 = one frame every 0.5s.
+    frames_per_second: float = 2.0
+    # Frame budget per VLM call (= columns × rows × sheets). When a whole
+    # episode sampled at ``frames_per_second`` exceeds this, the episode is
+    # AUTOMATICALLY split into consecutive windows of
+    # ``max_frames_per_prompt`` frames each (one describe→segment call per
+    # window, still at the full ``frames_per_second`` density), and the
+    # per-window spans are merged + stitched into one contiguous cover. So an
+    # episode of any length is always covered at the full sampling density.
+    max_frames_per_prompt: int = 60
+    contact_sheet_columns: int = 5
+    contact_sheet_frames_per_sheet: int = 20
+    contact_sheet_frame_width: int = 224
+    contact_sheet_quality: int = 84
+
+    min_subtask_seconds: float = 1.5
+    plan_max_steps: int = 8
+
+    # Narrate-only grounding pass before segmenting — best defense against subtasks
+    # invented from the task text (+1 VLM call/episode).
+    subtask_describe_first: bool = True
+
+    # Emit ``style="plan"`` rows at each boundary; False = subtasks + memory only.
+    emit_plan: bool = True
+
+    # Emit ``style="memory"`` rows at each boundary; False = subtasks (+ plan) only.
+    # Symmetric counterpart of ``emit_plan``.
+    emit_memory: bool = True
+
+    # (subtask spans are always stitched to a contiguous full-episode cover; not configurable.)
+
+    # Optional EgoMimic-style 5-axis task augmentation; replaces n_task_rephrasings.
+    task_aug_axes: TaskAugAxesConfig = field(default_factory=lambda: TaskAugAxesConfig())
+
+
+@dataclass
+class TaskAugAxesConfig:
+    """5-axis t=0 task augmentation (EgoMimic-style): synonym / omit_arm /
+    omit_orientation / omit_grasp_method / combined. Replaces n_task_rephrasings
+    when enabled; each variant becomes a ``task_aug`` row. Axes with nothing to
+    omit emit fewer entries. Defaults (3+3+2+2+2) match EgoMimic."""
+
+    enabled: bool = False
+
+    synonym_paraphrase: int = 3
+    omit_arm: int = 3
+    omit_orientation: int = 2
+    omit_grasp_method: int = 2
+    combined_omissions: int = 2
+
+
+@dataclass
+class InterjectionsConfig:
+    """``interjections`` module: interjections + paired speech."""
+
+    enabled: bool = True
+
+    # Each emits a paired (interjection, speech) row + a plan refresh at that ts.
+    max_interjections_per_episode: int = 3
+    interjection_min_t: float = 2.0
+
+    # Frame window centered on the timestamp so the VLM sees motion, not one frame.
+    interjection_window_seconds: float = 2.0
+    interjection_window_frames: int = 4
+
+
+@dataclass
+class VqaConfig:
+    """``vqa`` module: general VQA."""
+
+    enabled: bool = True
+    vqa_emission_hz: float = 1.0
+    K: int = 1
+    """Consecutive frames per emission tick. The VLM grounds on the FIRST frame,
+    so K>1 smears stale labels onto moved frames. Default 1 (no smear)."""
+    question_types: tuple[str, ...] = ("bbox", "keypoint", "count", "attribute", "spatial")
+
+    # True: ground VQA only on --vlm.camera_key (default: every camera).
+    restrict_to_default_camera: bool = False
+
+
+@dataclass
+class VlmConfig:
+    """Shared Qwen-VL client configuration."""
+
+    # Only ``openai`` (OpenAI-compatible vLLM server, auto-spawned when
+    # auto_serve=True); ``stub`` is for tests.
+    backend: str = "openai"
+    model_id: str = "Qwen/Qwen3.6-27B"
+
+    # OpenAI-compatible endpoint; ``EMPTY`` key works for local servers.
+    api_base: str = "http://localhost:8000/v1"
+    api_key: str = "EMPTY"
+
+    # Spawn a server if none answers api_base; False = fail fast on a remote.
+    auto_serve: bool = True
+    serve_port: int = 8000
+    # Override the auto-serve command; ``{port}`` substituted per replica.
+    serve_command: str | None = None
+
+    # Independent servers for round-robin routing (one per GPU). num_gpus=0 = one each.
+    parallel_servers: int = 1
+    num_gpus: int = 0
+    client_concurrency: int = 16
+    serve_ready_timeout_s: float = 600.0
+
+    max_new_tokens: int = 512
+    temperature: float = 0.2
+
+    # Auto-serve context length (None → 32768); other vLLM flags go in serve_command.
+    max_model_len: int | None = None
+
+    # Camera for keyframes; None → first ``observation.images.*`` key.
+    camera_key: str | None = None
+    # Forwarded as extra_body.chat_template_kwargs (e.g. {"enable_thinking": false}).
+    chat_template_kwargs: dict[str, Any] | None = None
+
+
+@dataclass
+class ExecutorConfig:
+    """Executor settings (intra-process episode concurrency; distribution via HF Jobs)."""
+
+    # Episodes processed concurrently per phase; main knob for saturating the servers.
+    episode_parallelism: int = 16
+
+
+@dataclass
+class AnnotationPipelineConfig:
+    """Top-level config for ``lerobot-annotate`` (rewrites data shards in place)."""
+
+    # Hub dataset: download source when ``root`` unset; push target when push_to_hub
+    # is on and ``new_repo_id`` unset.
+    repo_id: str | None = None
+
+    # Separate push target (matches the LeRobot edit tools). Unset → push in place.
+    new_repo_id: str | None = None
+
+    root: Path | None = None
+
+    # Defaults to ``<root>/.annotate_staging/``.
+    staging_dir: Path | None = None
+
+    seed: int = 1729
+
+    plan: PlanConfig = field(default_factory=PlanConfig)
+    interjections: InterjectionsConfig = field(default_factory=InterjectionsConfig)
+    vqa: VqaConfig = field(default_factory=VqaConfig)
+
+    vlm: VlmConfig = field(default_factory=VlmConfig)
+    executor: ExecutorConfig = field(default_factory=ExecutorConfig)
+
+    skip_validation: bool = False
+    only_episodes: tuple[int, ...] | None = None
+
+    # Keyframe decode backend forwarded to ``decode_video_frames``. None →
+    # library default (torchcodec when available, else PyAV). Or pin
+    # ``"torchcodec"`` / ``"pyav"`` explicitly.
+    video_backend: str | None = None
+
+    # Upload to the Hub (new_repo_id if set, else repo_id; one must be set).
+    push_to_hub: bool = False
+    push_private: bool = False
+    push_commit_message: str | None = None
+
+    def resolved_staging_dir(self, root: Path) -> Path:
+        return self.staging_dir if self.staging_dir is not None else root / ".annotate_staging"
diff --git a/src/lerobot/annotations/steerable_pipeline/executor.py b/src/lerobot/annotations/steerable_pipeline/executor.py
new file mode 100644
index 000000000..69d10bc89
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/executor.py
@@ -0,0 +1,253 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""In-process executor that runs the annotation phases.
+
+The executor runs **six phases** in dependency order:
+
+    phase 1: ``plan`` module (plan + subtasks + memory)
+    phase 2: ``interjections`` module (interjections + speech)
+    phase 3: ``plan`` plan-update pass — re-runs plan emission at every
+             interjection timestamp produced by phase 2
+    phase 4: ``vqa`` module (VQA)
+    phase 5: validator
+    phase 6: writer
+
+Phase 3 is why the ``plan`` module must be re-entered after the
+``interjections`` module — to refresh ``plan`` rows at interjection
+timestamps.
+
+Distributed execution is provided by Hugging Face Jobs (see
+``examples/annotations/run_hf_job.py``); the runner inside the job
+invokes ``lerobot-annotate`` which uses this in-process executor.
+Episode-level concurrency is controlled by
+``ExecutorConfig.episode_parallelism``.
+"""
+
+from __future__ import annotations
+
+import logging
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+from .config import AnnotationPipelineConfig
+from .reader import EpisodeRecord, iter_episodes
+from .staging import EpisodeStaging
+from .validator import StagingValidator
+from .writer import LanguageColumnsWriter
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class PhaseResult:
+    """Summary of one pipeline phase across all episodes."""
+
+    name: str
+    episodes_processed: int
+    episodes_skipped: int
+
+
+@dataclass
+class PipelineRunSummary:
+    """Aggregated result returned by :meth:`Executor.run`."""
+
+    phases: list[PhaseResult]
+    written_paths: list[Path]
+    validation_report: Any  # ValidationReport, kept Any to avoid import cycle
+
+
+@dataclass
+class Executor:
+    """Run all six phases over a dataset root in-process.
+
+    Episode-level concurrency comes from ``ExecutorConfig.episode_parallelism``
+    (a thread pool); cluster-level concurrency comes from running this
+    executor inside a Hugging Face Job. Tests construct the executor
+    directly with stub modules.
+    """
+
+    config: AnnotationPipelineConfig
+    plan: Any  # PlanSubtasksMemoryModule
+    interjections: Any  # InterjectionsAndSpeechModule
+    vqa: Any  # GeneralVqaModule
+    writer: LanguageColumnsWriter
+    validator: StagingValidator
+
+    def run(self, root: Path) -> PipelineRunSummary:
+        records = list(iter_episodes(root, only_episodes=self.config.only_episodes))
+        n = len(records)
+        if n == 0:
+            raise ValueError(f"No episodes found under {root}/data/")
+
+        print(f"[annotate] {n} episodes total", flush=True)
+
+        staging_dir = self.config.resolved_staging_dir(root)
+        staging_dir.mkdir(parents=True, exist_ok=True)
+
+        phases: list[PhaseResult] = []
+
+        # Phase 1: ``plan`` module (plan + subtasks + memory)
+        phases.append(self._run_module_phase("plan", records, staging_dir, self.plan))
+        # Phase 2: ``interjections`` module (interjections + speech). It
+        # reads the ``plan`` module's subtask rows from the same staging
+        # tree to ground the interjection prompt in the correct local subtask.
+        phases.append(self._run_module_phase("interjections", records, staging_dir, self.interjections))
+        # Phase 3: ``plan`` plan-update pass at interjection timestamps.
+        phases.append(self._run_plan_update_phase(records, staging_dir))
+        # Phase 4: ``vqa`` module (VQA)
+        phases.append(self._run_module_phase("vqa", records, staging_dir, self.vqa))
+
+        print("[annotate] running validator...", flush=True)
+        report = self.validator.validate(records, staging_dir)
+        if not report.ok and not self.config.skip_validation:
+            raise RuntimeError(f"Staging validation failed: {report.summary()}")
+        print(f"[annotate] validator: {report.summary()}", flush=True)
+
+        print(f"[annotate] writing parquet shards into {root}/data/...", flush=True)
+        written = self.writer.write_all(records, staging_dir, root)
+        print(f"[annotate] wrote {len(written)} shard(s); pipeline complete", flush=True)
+
+        # Keep meta/info.json aligned with the parquet schema we just wrote.
+        # Idempotent and additive: existing user metadata is preserved.
+        self._ensure_annotation_metadata_in_info(root)
+
+        return PipelineRunSummary(phases=phases, written_paths=written, validation_report=report)
+
+    @staticmethod
+    def _ensure_annotation_metadata_in_info(root: Path) -> None:
+        """Write language features and canonical tools to ``meta/info.json``.
+
+        ``LanguageColumnsWriter`` adds ``language_persistent`` and
+        ``language_events`` to parquet shards. The metadata must advertise
+        those columns too, otherwise non-streaming ``LeRobotDataset`` loads
+        cast against the old schema and fail on the extra parquet columns.
+        """
+        from lerobot.datasets.io_utils import load_info, write_info  # noqa: PLC0415
+        from lerobot.datasets.language import SAY_TOOL_SCHEMA, language_feature_info  # noqa: PLC0415
+
+        info_path = root / "meta" / "info.json"
+        if not info_path.exists():
+            return
+        try:
+            info = load_info(root)
+        except Exception as exc:  # noqa: BLE001
+            print(f"[annotate] could not read {info_path}: {exc}", flush=True)
+            return
+
+        changed = False
+
+        merged_features = {**info.features, **language_feature_info()}
+        if merged_features != info.features:
+            info.features = merged_features
+            changed = True
+
+        existing = info.tools or []
+        names = {(t.get("function") or {}).get("name") for t in existing if isinstance(t, dict)}
+        if SAY_TOOL_SCHEMA["function"]["name"] not in names:
+            info.tools = [*existing, SAY_TOOL_SCHEMA]
+            changed = True
+
+        if changed:
+            write_info(info, root)
+            print(
+                "[annotate] meta/info.json: "
+                f"language_features={list(language_feature_info())}, "
+                f"tools={[t['function']['name'] for t in (info.tools or [])]}",
+                flush=True,
+            )
+
+    def _run_module_phase(
+        self,
+        name: str,
+        records: list[EpisodeRecord],
+        staging_dir: Path,
+        module: Any,
+    ) -> PhaseResult:
+        if not module.enabled:
+            print(f"[annotate] phase={name} skipped (module disabled)", flush=True)
+            return PhaseResult(name=name, episodes_processed=0, episodes_skipped=len(records))
+        n = len(records)
+        parallelism = max(1, min(self.config.executor.episode_parallelism, n))
+        print(
+            f"[annotate] phase={name} starting on {n} episode(s) (parallelism={parallelism})",
+            flush=True,
+        )
+        t0 = time.time()
+
+        def _do(idx_record: tuple[int, EpisodeRecord]) -> tuple[int, int, float]:
+            i, record = idx_record
+            ep_start = time.time()
+            staging = EpisodeStaging(staging_dir, record.episode_index)
+            module.run_episode(record, staging)
+            return i, record.episode_index, time.time() - ep_start
+
+        processed = 0
+        if parallelism == 1:
+            for i, record in enumerate(records, 1):
+                _, ep_idx, elapsed = _do((i, record))
+                processed += 1
+                print(
+                    f"[annotate]   {name} episode {i}/{n} (idx={ep_idx}) done in {elapsed:.1f}s",
+                    flush=True,
+                )
+        else:
+            with ThreadPoolExecutor(max_workers=parallelism) as pool:
+                futures = [pool.submit(_do, (i, r)) for i, r in enumerate(records, 1)]
+                for fut in as_completed(futures):
+                    i, ep_idx, elapsed = fut.result()
+                    processed += 1
+                    print(
+                        f"[annotate]   {name} episode {processed}/{n} "
+                        f"(idx={ep_idx}, submit_order={i}) done in {elapsed:.1f}s",
+                        flush=True,
+                    )
+        total = time.time() - t0
+        print(f"[annotate] phase={name} complete: {processed}/{n} in {total:.1f}s", flush=True)
+        return PhaseResult(name=name, episodes_processed=processed, episodes_skipped=0)
+
+    def _run_plan_update_phase(
+        self, records: list[EpisodeRecord], staging_dir: Path
+    ) -> PhaseResult:
+        """Re-emit ``plan`` rows at each timestamp the ``interjections`` module produced.
+
+        The ``plan`` module owns the prompt; the ``interjections`` module
+        produced the timestamps. This phase therefore calls back into the
+        ``plan`` module with the interjection timestamps so its existing
+        prompt path is reused.
+        """
+        if not self.plan.enabled or not self.interjections.enabled:
+            return PhaseResult(name="plan_update", episodes_processed=0, episodes_skipped=len(records))
+        processed = 0
+        for record in records:
+            staging = EpisodeStaging(staging_dir, record.episode_index)
+            interjection_rows = [
+                row for row in staging.read("interjections") if row.get("style") == "interjection"
+            ]
+            interjection_times = [float(row["timestamp"]) for row in interjection_rows]
+            interjection_texts = [str(row.get("content") or "") for row in interjection_rows]
+            if interjection_times:
+                self.plan.run_plan_updates(record, staging, interjection_times, interjection_texts)
+                processed += 1
+        # Episodes without any interjections are skipped (no plan refresh
+        # needed); count them so the summary's processed+skipped == total.
+        return PhaseResult(
+            name="plan_update",
+            episodes_processed=processed,
+            episodes_skipped=len(records) - processed,
+        )
diff --git a/src/lerobot/annotations/steerable_pipeline/frames.py b/src/lerobot/annotations/steerable_pipeline/frames.py
new file mode 100644
index 000000000..a6c904673
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/frames.py
@@ -0,0 +1,481 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Keyframe extraction for the annotation pipeline.
+
+Modules attach decoded camera frames to their VLM prompts so the model can
+ground subtask decomposition, interjection scenarios, and VQA in actual
+visual content. The pipeline shares one provider across modules and
+worker threads, with a small bounded frame cache so multiple modules
+querying the same timestamp pay the decode cost once.
+"""
+
+from __future__ import annotations
+
+import io
+import logging
+import math
+import threading
+from collections.abc import Sequence
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, Protocol
+
+import PIL.Image
+import torch
+
+from lerobot.configs.video import VideoEncoderConfig
+from lerobot.datasets.video_utils import decode_video_frames, reencode_video
+
+from .reader import EpisodeRecord, snap_to_frame
+
+logger = logging.getLogger(__name__)
+
+
+class FrameProvider(Protocol):
+    """Decodes camera frames at episode-relative timestamps."""
+
+    @property
+    def camera_keys(self) -> list[str]:
+        """All ``observation.images.*`` feature keys this provider can decode."""
+
+    def frames_at(
+        self,
+        record: EpisodeRecord,
+        timestamps: list[float],
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        """Return one decoded frame per timestamp from ``camera_key`` (or default).
+
+        Frames are ``torch.Tensor`` (``C, H, W`` uint8) — the shape
+        :func:`lerobot.datasets.video_utils.decode_video_frames` returns.
+        :func:`to_image_blocks` converts them to PIL only at the VLM-message
+        boundary.
+
+        Empty list if the camera is unavailable. ``camera_key=None`` falls back
+        to the provider's default camera so existing single-camera callers
+        (the ``plan`` and ``interjections`` modules) keep working unchanged.
+        """
+
+    def video_for_episode(
+        self,
+        record: EpisodeRecord,
+        max_frames: int,
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        """Return up to ``max_frames`` decoded frames covering the whole episode.
+
+        Sampling is uniform across the episode duration. Frames are
+        ``torch.Tensor`` (``C, H, W`` uint8); :func:`to_video_block` wraps
+        them into one ``{"type":"video", "video":<list>}`` block for a
+        Qwen-VL-compatible model that pools temporally itself. Empty list if
+        no camera available.
+        """
+
+
+@dataclass
+class _NullProvider:
+    """No-op provider used when the dataset has no video keys or in tests."""
+
+    @property
+    def camera_keys(self) -> list[str]:
+        return []
+
+    def frames_at(
+        self,
+        record: EpisodeRecord,
+        timestamps: list[float],
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        return []
+
+    def video_for_episode(
+        self,
+        record: EpisodeRecord,
+        max_frames: int,
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        return []
+
+
+def null_provider() -> FrameProvider:
+    return _NullProvider()
+
+
+@dataclass
+class VideoFrameProvider:
+    """Decodes frames from the dataset's ``observation.images.*`` streams.
+
+    By default the *first* camera key is used for the ``plan`` module
+    (subtask decomposition) and the ``interjections`` module (interjection
+    scenarios) — those prompts care about *what is happening*, not which
+    angle. The ``vqa`` module instead iterates over every camera in
+    :attr:`camera_keys` so each frame's
+    grounded answer (bbox/keypoint/...) is tagged with the camera it was
+    grounded against.
+
+    ``camera_key`` overrides the default-camera choice but does not restrict
+    :attr:`camera_keys`. Pass ``camera_key`` explicitly to ``frames_at`` /
+    ``video_for_episode`` to read a non-default stream.
+
+    Caches up to ``cache_size`` decoded frames per process to keep
+    co-timestamped ``interjections`` + ``plan`` plan-update calls cheap.
+    """
+
+    root: Path
+    camera_key: str | None = None
+    tolerance_s: float = 1e-2
+    cache_size: int = 256
+    # Keyframe decode backend forwarded to
+    # :func:`lerobot.datasets.video_utils.decode_video_frames`. ``None``
+    # uses the library default (torchcodec when available, else PyAV).
+    video_backend: str | None = None
+    _meta: Any = field(default=None, init=False, repr=False)
+    _cache: dict = field(default_factory=dict, init=False, repr=False)
+    _camera_keys: list[str] = field(default_factory=list, init=False, repr=False)
+    # Pipeline runs the three module phases under a ThreadPoolExecutor (see
+    # ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
+    # one-shot warn flag against concurrent updates from worker threads.
+    _lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
+    # Serializes decode_video_frames calls: torchcodec hands out one
+    # ``VideoDecoder`` per file from a process-wide cache, and the decoder
+    # is not safe to drive from multiple threads at once.
+    _decode_lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
+    _warned_decode_fail: bool = field(default=False, init=False, repr=False)
+
+    def __post_init__(self) -> None:
+        from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata  # noqa: PLC0415
+
+        self._meta = LeRobotDatasetMetadata(repo_id="local", root=self.root)
+        # Only ``video_keys`` are decodable here: the clip/decode paths read
+        # ``videos/<key>/from_timestamp`` from episode metadata, which exists
+        # only for video-stored cameras. Image-stored cameras (also in
+        # ``camera_keys``) would KeyError, so restrict the list — and the
+        # default — to video keys.
+        keys = list(self._meta.video_keys)
+        # Last-resort fallback: if metadata didn't surface any video keys but
+        # the caller explicitly named a camera (``--vlm.camera_key=...``),
+        # trust it and assume that key exists on the dataset.
+        if not keys and self.camera_key:
+            keys = [self.camera_key]
+        self._camera_keys = keys
+        if self.camera_key is None:
+            self.camera_key = keys[0] if keys else None
+
+    @property
+    def camera_keys(self) -> list[str]:
+        """Video-stored ``observation.images.*`` keys this provider can decode."""
+        return list(self._camera_keys)
+
+    def frames_at(
+        self,
+        record: EpisodeRecord,
+        timestamps: list[float],
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        target = camera_key if camera_key is not None else self.camera_key
+        if not timestamps or target is None:
+            return []
+        # Snap each request to the nearest real frame timestamp: callers
+        # sample uniform grids whose points land mid-frame, and
+        # ``decode_video_frames`` rejects queries farther than
+        # ``tolerance_s`` from a decodable frame. Snapping also dedupes
+        # repeat queries through the cache.
+        if record.frame_timestamps:
+            timestamps = [snap_to_frame(float(ts), record.frame_timestamps) for ts in timestamps]
+
+        out: list[Any] = []
+        misses: list[float] = []
+        miss_indices: list[int] = []
+        with self._lock:
+            for i, ts in enumerate(timestamps):
+                key = (record.episode_index, target, round(float(ts), 6))
+                cached = self._cache.get(key)
+                if cached is not None:
+                    out.append(cached)
+                else:
+                    out.append(None)
+                    misses.append(float(ts))
+                    miss_indices.append(i)
+
+        if misses:
+            decoded = self._decode(record.episode_index, misses, target)
+            # ``_decode`` returns exactly one frame per requested timestamp,
+            # or an empty list if decoding failed wholesale. A partial list
+            # would mean a frame/timestamp misalignment, so only pair them up
+            # when the counts match (``strict=True`` then guards regressions).
+            if len(decoded) == len(miss_indices):
+                with self._lock:
+                    for i, frame in zip(miss_indices, decoded, strict=True):
+                        out[i] = frame
+                        key = (record.episode_index, target, round(float(timestamps[i]), 6))
+                        if len(self._cache) >= self.cache_size:
+                            self._cache.pop(next(iter(self._cache)))
+                        self._cache[key] = frame
+        # Drop ``None`` placeholders left by failed decodes (result may be shorter than the request).
+        return [frame for frame in out if frame is not None]
+
+    def video_for_episode(
+        self,
+        record: EpisodeRecord,
+        max_frames: int,
+        camera_key: str | None = None,
+    ) -> list[Any]:
+        """Return up to ``max_frames`` frames uniformly sampled across the episode.
+
+        The whole episode duration is covered; the model picks subtask
+        boundaries from the temporal pooling it does internally. Frames are
+        ``torch.Tensor`` (see :meth:`frames_at`).
+        """
+        target = camera_key if camera_key is not None else self.camera_key
+        if max_frames <= 0 or target is None or not record.frame_timestamps:
+            return []
+        n_frames = min(max_frames, len(record.frame_timestamps))
+        if n_frames == len(record.frame_timestamps):
+            timestamps = list(record.frame_timestamps)
+        else:
+            t0 = record.frame_timestamps[0]
+            t_last = record.frame_timestamps[-1]
+            if t_last <= t0:
+                timestamps = [float(t0)] * n_frames
+            else:
+                step = (t_last - t0) / (n_frames - 1) if n_frames > 1 else 0.0
+                timestamps = [float(t0 + i * step) for i in range(n_frames)]
+        return self.frames_at(record, timestamps, camera_key=target)
+
+    def episode_clip_path(self, record: EpisodeRecord, cache_dir: Path) -> Path | None:
+        """Extract the episode's subclip to ``cache_dir/ep_{idx:06d}.mp4``.
+
+        Returns ``None`` if the dataset has no video tracks or extraction
+        failed. Skips re-extraction when the cached clip already exists.
+        Re-encodes to H.264 via
+        :func:`lerobot.datasets.video_utils.reencode_video` so the resulting
+        mp4 is decodable by every downstream video processor — stream-copy
+        would inherit the source codec (often AV1 in modern LeRobot
+        datasets), which vllm's libav build cannot decode.
+        """
+        if self.camera_key is None:
+            return None
+        cache_dir.mkdir(parents=True, exist_ok=True)
+        out_path = cache_dir / f"ep_{record.episode_index:06d}.mp4"
+        if out_path.exists() and out_path.stat().st_size > 0:
+            return out_path
+        ep = self._meta.episodes[record.episode_index]
+        from_timestamp = float(ep[f"videos/{self.camera_key}/from_timestamp"])
+        to_timestamp = float(ep[f"videos/{self.camera_key}/to_timestamp"])
+        src = self.root / self._meta.get_video_file_path(record.episode_index, self.camera_key)
+        encoder = VideoEncoderConfig(vcodec="h264", pix_fmt="yuv420p", g=None, crf=23, preset="ultrafast")
+        try:
+            reencode_video(
+                src,
+                out_path,
+                camera_encoder=encoder,
+                overwrite=True,
+                start_time_s=from_timestamp,
+                end_time_s=to_timestamp,
+            )
+        except Exception:
+            logger.warning(
+                "clip extraction failed for episode %s (%s)", record.episode_index, src, exc_info=True
+            )
+            return None
+        return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
+
+    def _decode(self, episode_index: int, timestamps: list[float], camera_key: str) -> list[Any]:
+        """Decode ``timestamps`` from the episode's video as ``(C, H, W)`` tensors.
+
+        Delegates to :func:`lerobot.datasets.video_utils.decode_video_frames`
+        (torchcodec when available, PyAV otherwise; ``video_backend`` pins
+        one explicitly). Returns one frame per requested timestamp, or ``[]``
+        if decoding failed — callers treat ``[]`` as "no frames available".
+        """
+        ep = self._meta.episodes[episode_index]
+        from_timestamp = float(ep[f"videos/{camera_key}/from_timestamp"])
+        shifted = [from_timestamp + ts for ts in timestamps]
+        video_path = self.root / self._meta.get_video_file_path(episode_index, camera_key)
+
+        try:
+            # The module phases decode under a ThreadPoolExecutor (see
+            # ``ExecutorConfig.episode_parallelism``) but torchcodec's cached
+            # per-file decoder is single-threaded, so serialize decodes on a
+            # dedicated lock. Frame extraction is a small fraction of episode
+            # wall time (VLM calls dominate), so the contention is cheap.
+            with self._decode_lock:
+                # Stacked ``(N, C, H, W)`` uint8 tensor; one row per timestamp.
+                decoded = decode_video_frames(
+                    video_path, shifted, self.tolerance_s, backend=self.video_backend, return_uint8=True
+                )
+            return list(decoded)
+        except Exception as exc:
+            # Log loudly the first time so a silent vqa-module no-op (every
+            # prompt skipped because frames_at returned []) is debuggable from
+            # the job log instead of post-hoc parquet inspection. Subsequent
+            # failures stay quiet.
+            with self._lock:
+                already_warned = self._warned_decode_fail
+                if not already_warned:
+                    self._warned_decode_fail = True
+            if not already_warned:
+                logger.warning(
+                    "VideoFrameProvider._decode failed for episode=%s camera=%s video_path=%s backend=%s: %s",
+                    episode_index,
+                    camera_key,
+                    video_path,
+                    self.video_backend,
+                    exc,
+                    exc_info=exc,
+                )
+            return []
+
+
+def make_frame_provider(
+    root: Path, camera_key: str | None = None, video_backend: str | None = None
+) -> FrameProvider:
+    """Build a :class:`VideoFrameProvider` if videos are present, else null."""
+    try:
+        provider = VideoFrameProvider(root=root, camera_key=camera_key, video_backend=video_backend)
+    except Exception:
+        return null_provider()
+    if provider.camera_key is None:
+        return null_provider()
+    return provider
+
+
+def _frame_to_pil(frame: Any) -> Any:
+    """Materialise a decoded frame as a ``PIL.Image`` for the VLM message.
+
+    Frames flow through the provider as ``torch.Tensor`` (``C, H, W`` uint8,
+    straight from :func:`decode_video_frames`); PIL is only created here, at
+    the VLM-message boundary, because the chat backends expect PIL images /
+    data URLs. Non-tensor inputs (e.g. test stubs) pass through untouched.
+    """
+    if not isinstance(frame, torch.Tensor):
+        return frame
+    array = frame.detach().cpu()
+    if array.ndim == 3 and array.shape[0] in (1, 3):
+        array = array.permute(1, 2, 0)  # (C, H, W) -> (H, W, C)
+    if array.shape[-1] == 1:
+        array = array.squeeze(-1)
+    return PIL.Image.fromarray(array.to(torch.uint8).numpy())
+
+
+def to_image_blocks(frames: list[Any]) -> list[dict[str, Any]]:
+    """Convert decoded frames to Qwen-VL-compatible image content blocks."""
+    return [{"type": "image", "image": _frame_to_pil(frame)} for frame in frames]
+
+
+def to_video_block(frames: list[Any]) -> list[dict[str, Any]]:
+    """Wrap a list of decoded frames as one Qwen-VL video block.
+
+    Returns ``[]`` when the list is empty, so the caller can splat the result
+    into a content array without a separate emptiness check.
+    """
+    if not frames:
+        return []
+    return [{"type": "video", "video": [_frame_to_pil(frame) for frame in frames]}]
+
+
+def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]]:
+    """Wrap a video file URL as one ``video_url`` block.
+
+    Used by the ``openai`` backend (transformers serve / vllm serve /
+    ktransformers serve), where the server handles frame sampling.
+    Returns ``[]`` when ``url`` is ``None`` or empty so the caller can splat.
+    """
+    if not url:
+        return []
+    return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]
+
+
+def _draw_timestamp_badge(image: PIL.Image.Image, timestamp: float) -> PIL.Image.Image:
+    """Burn ``timestamp`` (seconds) into the top-left corner of ``image``.
+
+    A solid black badge with white text, so a VLM reading a contact sheet can
+    cite the exact source time of each tile (e.g. ``012.50s``) directly,
+    instead of the caller having to map tile position back to time. Mirrors
+    the macrodata/refiner contact-sheet convention.
+    """
+    from PIL import ImageDraw, ImageFont
+
+    result = image.copy()
+    draw = ImageDraw.Draw(result)
+    font = ImageFont.load_default()
+    label = f"{timestamp:06.2f}s"
+    left, top, right, bottom = draw.textbbox((0, 0), label, font=font)
+    text_w, text_h = right - left, bottom - top
+    pad = max(3, round(min(image.width, image.height) * 0.018))
+    draw.rectangle((0, 0, text_w + pad * 2, text_h + pad * 2), fill=(0, 0, 0))
+    draw.text((pad - left, pad - top), label, fill=(255, 255, 255), font=font)
+    return result
+
+
+def to_contact_sheet_blocks(
+    frames: Sequence[Any],
+    timestamps: Sequence[float],
+    *,
+    columns: int = 5,
+    frames_per_sheet: int = 20,
+    frame_width: int = 224,
+    quality: int = 84,
+) -> list[dict[str, Any]]:
+    """Pack decoded frames into timestamped JPEG contact-sheet image blocks.
+
+    Each frame is resized to ``frame_width`` wide, stamped with its
+    episode-relative timestamp, and tiled row-major into grids of
+    ``frames_per_sheet`` (``columns`` wide). One ``{"type":"image", ...}``
+    block is returned per grid; many frames collapse into a few images, so a
+    long episode's temporal coverage stays dense at a fraction of the vision
+    tokens N separate frames would cost. ``frames`` and ``timestamps`` should be
+    aligned and equal length (extras are dropped). Returns ``[]`` for empty input.
+    """
+    from PIL import Image
+
+    if not frames:
+        return []
+    columns = max(1, columns)
+    frames_per_sheet = max(1, frames_per_sheet)
+    rows_per_sheet = math.ceil(frames_per_sheet / columns)
+
+    tiles: list[PIL.Image.Image] = []
+    for ts, frame in zip(timestamps, frames, strict=False):
+        img = _frame_to_pil(frame)
+        if not isinstance(img, PIL.Image.Image):
+            continue
+        img = img.convert("RGB")
+        if img.width != frame_width:
+            height = max(1, round(img.height * frame_width / img.width))
+            img = img.resize((frame_width, height), resample=Image.Resampling.BILINEAR)
+        tiles.append(_draw_timestamp_badge(img, float(ts)))
+    if not tiles:
+        return []
+
+    blocks: list[dict[str, Any]] = []
+    for start in range(0, len(tiles), frames_per_sheet):
+        chunk = tiles[start : start + frames_per_sheet]
+        cell_w = max(tile.width for tile in chunk)
+        cell_h = max(tile.height for tile in chunk)
+        sheet = Image.new("RGB", (cell_w * columns, cell_h * rows_per_sheet), color=(0, 0, 0))
+        for i, tile in enumerate(chunk):
+            x = (i % columns) * cell_w
+            y = (i // columns) * cell_h
+            sheet.paste(tile, (x, y))
+        # JPEG round-trip at ``quality`` to match the refiner convention and
+        # shrink the wire payload; vision-token count is set by resolution, so
+        # the real saving is the grid packing, not the codec.
+        buf = io.BytesIO()
+        sheet.save(buf, format="JPEG", quality=quality)
+        buf.seek(0)
+        blocks.append({"type": "image", "image": Image.open(buf).convert("RGB")})
+    return blocks
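
The grid arithmetic in `to_contact_sheet_blocks` can be checked without Pillow. The following is a minimal sketch, not the module's code: `tile_layout` and `timestamp_label` are hypothetical helper names, and the sketch assumes a fixed cell size, whereas the function above sizes cells from the largest tile in each sheet.

```python
import math


def tile_layout(n_tiles, columns=5, frames_per_sheet=20, cell_w=224, cell_h=126):
    """Return (sheet_size, placements) for row-major contact sheets.

    placements holds one (sheet_index, x, y) tuple per tile, in input order.
    cell_w / cell_h are assumed fixed here purely for illustration.
    """
    columns = max(1, columns)
    frames_per_sheet = max(1, frames_per_sheet)
    rows_per_sheet = math.ceil(frames_per_sheet / columns)
    sheet_size = (cell_w * columns, cell_h * rows_per_sheet)
    placements = []
    for i in range(n_tiles):
        # Which sheet the tile lands on, and its slot within that sheet.
        sheet, slot = divmod(i, frames_per_sheet)
        x = (slot % columns) * cell_w
        y = (slot // columns) * cell_h
        placements.append((sheet, x, y))
    return sheet_size, placements


def timestamp_label(seconds):
    """Same zero-padded format the badge uses: width 6, two decimals."""
    return f"{seconds:06.2f}s"
```

With 7 tiles, 3 columns, and 6 frames per sheet, the seventh tile starts a second sheet at its top-left corner, and the last sheet is left partly black.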
diff --git a/src/lerobot/annotations/steerable_pipeline/modules/__init__.py b/src/lerobot/annotations/steerable_pipeline/modules/__init__.py
new file mode 100644
index 000000000..e9ff8ed23
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/modules/__init__.py
@@ -0,0 +1,25 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .general_vqa import GeneralVqaModule
+from .interjections_and_speech import InterjectionsAndSpeechModule
+from .plan_subtasks_memory import PlanSubtasksMemoryModule
+
+__all__ = [
+    "GeneralVqaModule",
+    "InterjectionsAndSpeechModule",
+    "PlanSubtasksMemoryModule",
+]
diff --git a/src/lerobot/annotations/steerable_pipeline/modules/general_vqa.py b/src/lerobot/annotations/steerable_pipeline/modules/general_vqa.py
new file mode 100644
index 000000000..cdc87b579
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/modules/general_vqa.py
@@ -0,0 +1,248 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``vqa`` module: general VQA at a timed cadence.
+
+Every ``1/hz`` seconds an emission tick fires; each tick anchors ``K``
+consecutive frames, and every anchored frame gets its own VQA pair. Each
+pair is grounded on that single anchor frame — there is no per-pair frame
+window. For datasets with multiple cameras, every anchored frame produces
+one ``(vqa, user)`` + ``(vqa, assistant)`` pair *per camera*: each pair is
+generated against that camera's frame and stamped with the matching
+``camera`` field on the emitted rows. The resolver disambiguates via
+``camera=...``; recipes that consume VQA do so through one sub-recipe
+per camera (see ``recipes/pi05_hirobot.yaml``).
+
+Within a single (frame, camera) we still emit at most one ``(vqa, user)``
+and one ``(vqa, assistant)`` row, so the resolver contract stays scalar.
+
+Question types covered (per the plan's ``vqa`` table): bbox, keypoint,
+count, attribute, spatial. The assistant's ``content`` is a JSON string
+whose schema depends on the question type. Malformed JSON triggers one
+retry inside :meth:`VlmClient.generate_json`.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import random
+from collections.abc import Sequence
+from dataclasses import dataclass, field
+from typing import Any
+
+from ..config import VqaConfig
+from ..frames import FrameProvider, null_provider, to_image_blocks
+from ..prompts import load as load_prompt
+from ..reader import EpisodeRecord
+from ..staging import EpisodeStaging
+from ..validator import classify_vqa_answer
+from ..vlm_client import VlmClient
+
+
+def _emission_anchor_indices(frame_timestamps: Sequence[float], hz: float, k: int) -> list[int]:
+    """Return the relative frame indices to anchor VQA emissions to.
+
+    For each emission tick (every ``1/hz`` seconds), we anchor ``k``
+    consecutive frames starting at the tick. Ticks fall on the nearest
+    available source frame timestamp.
+    """
+    if hz <= 0 or k <= 0 or not frame_timestamps:
+        return []
+    t0 = frame_timestamps[0]
+    t_last = frame_timestamps[-1]
+    period = 1.0 / hz
+    indices: list[int] = []
+    t = t0
+    while t <= t_last + 1e-9:
+        # find the index of the nearest frame to t
+        nearest_i = min(range(len(frame_timestamps)), key=lambda i: abs(frame_timestamps[i] - t))
+        for offset in range(k):
+            j = nearest_i + offset
+            if j >= len(frame_timestamps):
+                break
+            if not indices or indices[-1] != j:
+                indices.append(j)
+        t += period
+    # dedupe while preserving order
+    seen: set[int] = set()
+    deduped: list[int] = []
+    for i in indices:
+        if i in seen:
+            continue
+        seen.add(i)
+        deduped.append(i)
+    return deduped
+
+
+@dataclass
+class GeneralVqaModule:
+    """Emit grounded VQA pairs at a timed cadence."""
+
+    vlm: VlmClient
+    config: VqaConfig
+    seed: int = 1729
+    frame_provider: FrameProvider = field(default_factory=null_provider)
+    _warned_no_camera: bool = field(default=False, init=False, repr=False)
+
+    @property
+    def enabled(self) -> bool:
+        return self.config.enabled
+
+    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
+        if not record.frame_timestamps:
+            staging.write("vqa", [])
+            return
+        rng = random.Random(f"{self.seed}:{record.episode_index}:vqa")
+        anchor_idx = _emission_anchor_indices(
+            record.frame_timestamps, self.config.vqa_emission_hz, self.config.K
+        )
+        cameras = self._target_cameras()
+        if not cameras:
+            # No camera available — emit nothing rather than producing
+            # untagged rows that would fail validation. Surface a loud one-
+            # time warning so this is never silently a no-op.
+            if not self._warned_no_camera:
+                logging.getLogger(__name__).warning(
+                    "vqa module found no cameras on the frame provider — "
+                    "every episode will emit zero VQA rows. Check that the "
+                    "dataset declares observation.images.* features in "
+                    "meta/info.json; passing --vlm.camera_key=<key> at the "
+                    "CLI also seeds the cameras list as a fallback."
+                )
+                self._warned_no_camera = True
+            staging.write("vqa", [])
+            return
+
+        # Build all messages first (one per (frame, camera)), then issue them
+        # as a single batched generate_json call so the client can fan them
+        # out concurrently.
+        per_call: list[tuple[float, str, str, list[dict[str, Any]]]] = []
+        for idx in anchor_idx:
+            ts = float(record.frame_timestamps[idx])
+            qtype = rng.choice(self.config.question_types)
+            for camera in cameras:
+                messages = self._build_messages(record, qtype, ts, camera)
+                # Skip cameras that decoded to zero frames at this ts: no point
+                # asking the VLM to ground a bbox without an image.
+                if not _has_image_block(messages):
+                    continue
+                per_call.append((ts, camera, qtype, messages))
+
+        if not per_call:
+            staging.write("vqa", [])
+            return
+
+        results = self.vlm.generate_json([m for _, _, _, m in per_call])
+
+        rows: list[dict[str, Any]] = []
+        for (ts, camera, _qtype, _messages), result in zip(per_call, results, strict=True):
+            qa = self._postprocess(result)
+            if qa is None:
+                continue
+            question, answer = qa
+            rows.append(
+                {
+                    "role": "user",
+                    "content": question,
+                    "style": "vqa",
+                    "timestamp": ts,
+                    "camera": camera,
+                    "tool_calls": None,
+                }
+            )
+            rows.append(
+                {
+                    "role": "assistant",
+                    "content": json.dumps(answer, sort_keys=True),
+                    "style": "vqa",
+                    "timestamp": ts,
+                    "camera": camera,
+                    "tool_calls": None,
+                }
+            )
+        staging.write("vqa", rows)
+
+    def _target_cameras(self) -> list[str]:
+        """Return the cameras the ``vqa`` module should iterate per anchored frame.
+
+        Defaults to every camera the provider exposes. Datasets with no
+        cameras (or test/null providers) yield an empty list, which makes
+        ``run_episode`` a no-op.
+
+        When ``config.restrict_to_default_camera`` is set, VQA grounds on
+        only the provider's default camera (the single ``--vlm.camera_key``
+        stream), matching the plan / interjection modules so the whole
+        pipeline focuses on one view.
+        """
+        all_cameras = list(getattr(self.frame_provider, "camera_keys", []) or [])
+        if getattr(self.config, "restrict_to_default_camera", False):
+            default = getattr(self.frame_provider, "camera_key", None)
+            if default and default in all_cameras:
+                return [default]
+            # ``restrict_to_default_camera`` is set but the configured default
+            # isn't one the provider exposes. Returning it anyway would make
+            # ``_decode`` raise a KeyError deep in frame extraction, so warn and
+            # fall through to every available camera instead.
+            if default:
+                logging.getLogger(__name__).warning(
+                    "restrict_to_default_camera is set but camera_key=%r is not in the "
+                    "provider's cameras %s; grounding VQA on all available cameras instead.",
+                    default,
+                    all_cameras,
+                )
+        return all_cameras
+
+    def _build_messages(
+        self,
+        record: EpisodeRecord,
+        question_type: str,
+        frame_timestamp: float,
+        camera_key: str,
+    ) -> list[dict[str, Any]]:
+        prompt = load_prompt("vqa").format(
+            episode_task=record.episode_task,
+            question_type=question_type,
+        )
+        images = self.frame_provider.frames_at(record, [frame_timestamp], camera_key=camera_key)
+        content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
+        return [{"role": "user", "content": content}]
+
+    def _postprocess(self, result: Any) -> tuple[str, dict[str, Any]] | None:
+        if not isinstance(result, dict):
+            return None
+        question = result.get("question")
+        answer = result.get("answer")
+        if not isinstance(question, str) or not question.strip():
+            return None
+        if not isinstance(answer, dict):
+            return None
+        # The validator will enforce shape; here we just sanity-check that the
+        # answer matches *some* known shape so we can drop garbage early.
+        if classify_vqa_answer(answer) is None:
+            return None
+        return question.strip(), answer
+
+
+def _has_image_block(messages: list[dict[str, Any]]) -> bool:
+    """Return True if any message's content list contains an ``image`` block."""
+    for msg in messages:
+        content = msg.get("content")
+        if not isinstance(content, list):
+            continue
+        for block in content:
+            if isinstance(block, dict) and block.get("type") == "image":
+                return True
+    return False
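
To make the emission cadence concrete, here is a standalone sketch of the anchoring rule described in `_emission_anchor_indices`: one tick every `1/hz` seconds, each tick snapped to the nearest frame, then `k` consecutive frames anchored from there. `anchor_indices` is a hypothetical name, and the sketch computes each tick as `t0 + n * period` rather than accumulating `t += period`, which is an illustrative choice and not what the module does.

```python
def anchor_indices(frame_timestamps, hz, k):
    """Return de-duplicated frame indices anchored by the tick schedule."""
    if hz <= 0 or k <= 0 or not frame_timestamps:
        return []
    t0, t_last = frame_timestamps[0], frame_timestamps[-1]
    period = 1.0 / hz
    seen = set()
    out = []
    n = 0
    while t0 + n * period <= t_last + 1e-9:
        tick = t0 + n * period
        # Snap the tick to the closest available frame.
        nearest = min(
            range(len(frame_timestamps)),
            key=lambda i: abs(frame_timestamps[i] - tick),
        )
        # Anchor k consecutive frames, stopping at the end of the episode.
        for j in range(nearest, min(nearest + k, len(frame_timestamps))):
            if j not in seen:
                seen.add(j)
                out.append(j)
        n += 1
    return out


# 11 frames at 10 fps, ticks every 0.5 s, two frames per tick.
timestamps = [i / 10 for i in range(11)]
print(anchor_indices(timestamps, hz=2.0, k=2))
```

Ticks land at 0.0 s, 0.5 s, and 1.0 s, so the anchored indices are 0, 1, 5, 6, and 10; the last tick contributes a single frame because the episode ends there.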
diff --git a/src/lerobot/annotations/steerable_pipeline/modules/interjections_and_speech.py b/src/lerobot/annotations/steerable_pipeline/modules/interjections_and_speech.py
new file mode 100644
index 000000000..616f9ce1b
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/modules/interjections_and_speech.py
@@ -0,0 +1,211 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``interjections`` module: interjections + paired speech (EVENT styles + speech atoms).
+
+Two sub-passes:
+
+1. At ``t=0``, emit ONLY a speech tool-call atom (acknowledgement of the
+   canonical task). No interjection row — the canonical task is already the
+   user utterance from ``meta/tasks.parquet``.
+
+2. For mid-episode interruptions, emit a co-timestamped pair:
+       {role:user, style:interjection, content:<text>}
+       speech atom (role:assistant, style:None, tool_calls=[say(...)])
+   Both rows go in ``language_events`` at the same timestamp.
+
+The ``plan`` module's :meth:`run_plan_updates` reuses this module's
+interjection timestamps to refresh the ``plan`` row at the same instant.
+"""
+
+from __future__ import annotations
+
+import random
+from collections.abc import Sequence
+from dataclasses import dataclass, field
+from typing import Any
+
+from ..config import InterjectionsConfig
+from ..frames import FrameProvider, null_provider, to_image_blocks
+from ..prompts import load as load_prompt
+from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
+from ..staging import EpisodeStaging
+from ..vlm_client import VlmClient
+from ..writer import speech_atom
+
+
+@dataclass
+class InterjectionsAndSpeechModule:
+    """Generate task-start speech and mid-episode interjection/speech pairs."""
+
+    vlm: VlmClient
+    config: InterjectionsConfig
+    seed: int = 1729
+    frame_provider: FrameProvider = field(default_factory=null_provider)
+
+    @property
+    def enabled(self) -> bool:
+        return self.config.enabled
+
+    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
+        rows: list[dict[str, Any]] = []
+        if record.frame_timestamps:
+            t0 = float(record.frame_timestamps[0])
+            initial = self._initial_speech(record)
+            if initial:
+                rows.append(speech_atom(t0, initial))
+        # Pull the ``plan`` module's subtask spans for this episode so the
+        # interjection prompt can ground itself in the actual current
+        # subtask at each chosen timestamp. The ``plan`` module ran first.
+        episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
+        subtask_spans = reconstruct_subtask_spans(staging.read("plan"), episode_end_t=episode_end_t)
+        rows.extend(self._mid_episode_interjections(record, subtask_spans))
+        staging.write("interjections", rows)
+
+    @staticmethod
+    def _subtask_at(spans: Sequence[dict[str, Any]], t: float) -> str | None:
+        current: str | None = None
+        for span in spans:
+            if float(span["start"]) <= t:
+                current = span.get("text")
+            else:
+                break
+        return current
+
+    def _initial_speech(self, record: EpisodeRecord) -> str | None:
+        prompt = load_prompt("interjections_initial_speech").format(
+            episode_task=record.episode_task,
+        )
+        messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
+        result = self.vlm.generate_json([messages])[0]
+        if isinstance(result, dict) and isinstance(result.get("text"), str):
+            text = result["text"].strip()
+            if text:
+                return text
+        return None
+
+    def _mid_episode_interjections(
+        self,
+        record: EpisodeRecord,
+        subtask_spans: Sequence[dict[str, Any]],
+    ) -> list[dict[str, Any]]:
+        """Generate interjections aligned with the actual demo trajectory.
+
+        Teleop data is frozen — the robot already executed every step in
+        the video. A *counterfactual* interjection like "actually skip
+        the wipe" contradicts what then happens in the video, which is
+        what qwen36moe-10/11 surfaced as low-quality interjections.
+
+        Instead, anchor every interjection at a subtask boundary and
+        write it as a natural user request for the *upcoming* subtask.
+        The robot's visible next behavior IS the interjection's effect,
+        so the training signal stays consistent: interjection text →
+        plan refresh → action stream all line up.
+        """
+        if self.config.max_interjections_per_episode <= 0:
+            return []
+        if len(subtask_spans) < 2:
+            # Need at least one transition (subtask 0 → subtask 1).
+            return []
+        # Deterministic per-episode RNG so reruns are stable across SLURM jobs.
+        rng = random.Random(f"{self.seed}:{record.episode_index}:interjection")
+
+        # Boundaries: the start time of every subtask except the first
+        # (which is just t0 and is covered by the initial-task speech atom).
+        boundaries: list[tuple[float, str, str]] = []
+        for i in range(1, len(subtask_spans)):
+            ts = float(subtask_spans[i]["start"])
+            if ts < self.config.interjection_min_t:
+                continue
+            prev_text = (subtask_spans[i - 1].get("text") or "").strip()
+            next_text = (subtask_spans[i].get("text") or "").strip()
+            if not next_text:
+                continue
+            boundaries.append((ts, prev_text, next_text))
+        if not boundaries:
+            return []
+
+        n = min(self.config.max_interjections_per_episode, len(boundaries))
+        chosen = sorted(rng.sample(boundaries, n), key=lambda b: b[0])
+
+        out: list[dict[str, Any]] = []
+        for t, prev_subtask, next_subtask in chosen:
+            t_snap = snap_to_frame(t, record.frame_timestamps)
+            # Window straddles the boundary so the VLM sees the end of the
+            # previous subtask and the start of the next one — same
+            # conditioning the policy will see at training time.
+            window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
+            prompt = load_prompt("interjections_interjection").format(
+                episode_task=record.episode_task,
+                prev_subtask=prev_subtask or "(starting from initial state)",
+                next_subtask=next_subtask,
+                timestamp=t_snap,
+                window_seconds=self.config.interjection_window_seconds,
+            )
+            images = self.frame_provider.frames_at(record, window_ts)
+            content = [*to_image_blocks(images), {"type": "text", "text": prompt}]
+            messages = [{"role": "user", "content": content}]
+            result = self.vlm.generate_json([messages])[0]
+            if not isinstance(result, dict):
+                continue
+            interjection_text = result.get("interjection")
+            speech_text = result.get("speech")
+            if not isinstance(interjection_text, str) or not interjection_text.strip():
+                continue
+            if not isinstance(speech_text, str) or not speech_text.strip():
+                continue
+            out.append(
+                {
+                    "role": "user",
+                    "content": interjection_text.strip(),
+                    "style": "interjection",
+                    "timestamp": t_snap,
+                    "tool_calls": None,
+                }
+            )
+            out.append(speech_atom(t_snap, speech_text.strip()))
+        return out
+
+    def _window_timestamps(self, t_anchor: float, frame_timestamps: Sequence[float]) -> list[float]:
+        """Return a small set of frame timestamps centered on ``t_anchor``.
+
+        The window straddles the subtask boundary the interjection sits
+        on: roughly half the frames cover the end of the previous
+        subtask, half cover the start of the next one. The VLM therefore
+        sees BOTH what just finished AND what's about to start, which is
+        the conditioning we need to write a natural "now please do X"
+        request that matches the visible upcoming behavior.
+        """
+        if not frame_timestamps:
+            return [t_anchor]
+        n = max(1, int(self.config.interjection_window_frames))
+        if n == 1:
+            return [t_anchor]
+        window = float(self.config.interjection_window_seconds)
+        step = window / max(1, n - 1)
+        # Center the window on the anchor so half lands before, half after.
+        start_offset = -window / 2.0
+        targets = [t_anchor + start_offset + step * i for i in range(n)]
+        first_ts = float(frame_timestamps[0])
+        last_ts = float(frame_timestamps[-1])
+        snapped: list[float] = []
+        seen: set[float] = set()
+        for tgt in targets:
+            clamped = min(last_ts, max(first_ts, tgt))
+            t = snap_to_frame(clamped, frame_timestamps)
+            if t not in seen:
+                seen.add(t)
+                snapped.append(t)
+        return snapped or [t_anchor]
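
The centered window in `_window_timestamps` is easiest to see on a small example. This is a sketch under stated assumptions: `window_timestamps` is a hypothetical free function taking the two config values as arguments, and `snap_to_nearest` stands in for the real `snap_to_frame` from `reader`, whose implementation is not shown here and is assumed to pick the nearest frame timestamp.

```python
def snap_to_nearest(t, frame_timestamps):
    """Assumed behaviour of snap_to_frame: closest frame timestamp to t."""
    return min(frame_timestamps, key=lambda ft: abs(ft - t))


def window_timestamps(t_anchor, frame_timestamps, n_frames, window_seconds):
    """Spread n_frames targets over a window centered on t_anchor."""
    if not frame_timestamps:
        return [t_anchor]
    n = max(1, int(n_frames))
    if n == 1:
        return [t_anchor]
    step = window_seconds / (n - 1)
    first_ts, last_ts = frame_timestamps[0], frame_timestamps[-1]
    snapped = []
    for i in range(n):
        target = t_anchor - window_seconds / 2.0 + step * i
        # Clamp into the episode, then snap to a real frame.
        clamped = min(last_ts, max(first_ts, target))
        t = snap_to_nearest(clamped, frame_timestamps)
        if t not in snapped:
            snapped.append(t)
    return snapped or [t_anchor]


frames = [i * 0.5 for i in range(21)]  # 0.0 s .. 10.0 s at 2 fps
print(window_timestamps(5.0, frames, n_frames=5, window_seconds=2.0))
print(window_timestamps(0.0, frames, n_frames=5, window_seconds=2.0))
```

Mid-episode the window yields five distinct timestamps from 4.0 s to 6.0 s. At the episode start the targets before `t=0` clamp to the first frame and collapse, so fewer frames are returned.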
diff --git a/src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py b/src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
new file mode 100644
index 000000000..b6df6551c
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/modules/plan_subtasks_memory.py
@@ -0,0 +1,780 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``plan`` module: subtask decomposition + plan + memory (PERSISTENT styles)."""
+
+from __future__ import annotations
+
+import logging
+from collections.abc import Sequence
+from dataclasses import dataclass, field
+from typing import Any
+
+from ..config import PlanConfig
+from ..frames import (
+    FrameProvider,
+    null_provider,
+    to_contact_sheet_blocks,
+)
+from ..prompts import load as load_prompt
+from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
+from ..staging import EpisodeStaging
+from ..vlm_client import VlmClient
+
+logger = logging.getLogger(__name__)
+
+
+# Prepended to every describe / segment prompt so the VLM knows the images are
+# timestamped contact-sheet grids, not a single video, and reads the burned-in
+# per-tile timestamp when choosing boundaries.
+def _contact_sheet_preamble(columns: int) -> str:
+    return (
+        "CONTACT SHEETS — how to read the images below:\n"
+        f"- Each image is a grid of sampled video frames, {columns} per row, "
+        "with time running left-to-right then top-to-bottom (row-major).\n"
+        "- Each frame has its timestamp burned into the top-left corner, e.g. "
+        '"012.50s". Use that printed timestamp (not the tile position) when you '
+        "choose start/end times; boundaries should land on or near a printed "
+        "timestamp.\n"
+        "- Frames continue across grids: an action may span the end of one sheet "
+        "and the start of the next, so do not place a boundary just because a new "
+        "image begins.\n\n"
+    )
+
+
+# Appended to every describe (and segment) prompt. A visual, causal definition
+# of where one event ends and the next begins — adapted from macrodata/refiner —
+# to sharpen cut points while the existing prompt keeps owning the imperative
+# phrasing.
+_CAUSAL_BOUNDARY_RULES = (
+    "EVENT BOUNDARIES — where one event ends and the next begins:\n"
+    "- Start a new event whenever the world state changes: an object becomes "
+    "held (the gripper closes on it), an object is released (the gripper opens "
+    "and it stays put), an object reaches a new location, a lid/door/drawer "
+    "changes open/closed state, a tool starts or stops affecting a surface, or "
+    "contents visibly move (e.g. poured).\n"
+    "- If a single action changes the same state gradually and continuously, "
+    "keep it as ONE event — do not split it.\n"
+    "- If the same action repeats on different objects or target locations, "
+    "treat each repetition as a separate event.\n"
+    "- Do NOT create boundaries for idle time, camera motion, hesitation, or "
+    "tiny hand adjustments."
+)
+
+
+@dataclass
+class PlanSubtasksMemoryModule:
+    """Generate subtask spans, plan, and memory rows.
+
+    All output is persistent (lives in ``language_persistent``):
+
+    - ``subtask`` rows: one per span, stamped at the span's *start* timestamp
+      (snapped to an exact frame).
+    - ``plan`` rows: emitted at every subtask boundary (incl. ``t=0``) and
+      refreshed at every interjection timestamp via :meth:`run_plan_updates`
+      (called by the executor after the ``interjections`` module completes).
+    - ``memory`` rows: emitted at each subtask boundary (= subtask start
+      timestamp from the second subtask onward).
+    """
+
+    vlm: VlmClient
+    config: PlanConfig
+    frame_provider: FrameProvider = field(default_factory=null_provider)
+
+    @property
+    def enabled(self) -> bool:
+        return self.config.enabled
+
+    def run_episode(self, record: EpisodeRecord, staging: EpisodeStaging) -> None:
+        rows: list[dict[str, Any]] = []
+        # Task driving every plan-module prompt: canonical episode_task, or a
+        # video-derived one when it's empty/placeholder (see derive_task_*).
+        effective_task = self._resolve_effective_task(record)
+        # task_aug rows at t=0: phrasings the renderer rotates ${task} through.
+        # Either the structured 5-axis taxonomy (task_aug_axes.enabled) or
+        # free-form n_task_rephrasings; the effective task is always emitted
+        # first so the rotation covers the source-of-truth phrasing.
+        t0 = float(record.frame_timestamps[0]) if record.frame_timestamps else 0.0
+        variants: list[str] | None = None
+        if self.config.task_aug_axes.enabled and effective_task:
+            variants = self._generate_task_aug_by_axes(effective_task, self.config.task_aug_axes)
+        elif self.config.n_task_rephrasings > 0 and effective_task:
+            variants = self._generate_task_rephrasings(effective_task, n=self.config.n_task_rephrasings)
+        if variants is not None:
+            rows.extend(self._task_aug_rows([effective_task, *variants], t0))
+
+        subtask_spans = self._generate_subtasks(record, task=effective_task)
+
+        # subtask rows
+        for span in subtask_spans:
+            rows.append(
+                {
+                    "role": "assistant",
+                    "content": span["text"],
+                    "style": "subtask",
+                    "timestamp": snap_to_frame(span["start"], record.frame_timestamps),
+                    "tool_calls": None,
+                }
+            )
+        # Plan rows at every subtask boundary (incl. t=0). The plan is a
+        # numbered list of still-todo subtasks, so re-emitting at each
+        # boundary makes it shrink as work progresses — ${plan} at frame t is
+        # exactly what's left to do.
+        if self.config.emit_plan:
+            for span in subtask_spans:
+                boundary_t = snap_to_frame(span["start"], record.frame_timestamps)
+                plan_text = self._generate_plan(
+                    record, subtask_spans, refresh_t=boundary_t, task=effective_task
+                )
+                if plan_text is not None:
+                    rows.append(
+                        {
+                            "role": "assistant",
+                            "content": plan_text,
+                            "style": "plan",
+                            "timestamp": float(boundary_t),
+                            "tool_calls": None,
+                        }
+                    )
+        # memory rows at every subtask boundary except the very first start;
+        # skipped entirely when ``emit_memory`` is False (subtasks-only / plan-only).
+        prior_memory = ""
+        memory_boundaries = enumerate(subtask_spans[1:], start=1) if self.config.emit_memory else []
+        for i, span in memory_boundaries:
+            completed = subtask_spans[i - 1]["text"]
+            remaining = [s["text"] for s in subtask_spans[i:]]
+            mem_text = self._generate_memory(record, prior_memory, completed, remaining, task=effective_task)
+            if mem_text:
+                ts = snap_to_frame(span["start"], record.frame_timestamps)
+                rows.append(
+                    {
+                        "role": "assistant",
+                        "content": mem_text,
+                        "style": "memory",
+                        "timestamp": ts,
+                        "tool_calls": None,
+                    }
+                )
+                prior_memory = mem_text
+        staging.write("plan", rows)
+
+    # ------------------------------------------------------------------
+    # Task derivation + rephrasings
+    # ------------------------------------------------------------------
+
+    _PLACEHOLDER_TASKS: frozenset[str] = frozenset(
+        {
+            "debug",
+            "test",
+            "tbd",
+            "todo",
+            "n/a",
+            "na",
+            "untitled",
+            "unnamed",
+            "default",
+            "placeholder",
+        }
+    )
+
+    def _resolve_effective_task(self, record: EpisodeRecord) -> str:
+        """Decide which task string drives the ``plan`` module for this episode.
+
+        Returns the user-supplied ``record.episode_task`` unless
+        ``derive_task_from_video`` says otherwise (see config docstring).
+        Falls back gracefully to the canonical task if video derivation
+        fails.
+        """
+        canonical = (record.episode_task or "").strip()
+        mode = (self.config.derive_task_from_video or "off").strip().lower()
+        if mode == "always":
+            derived = self._derive_task_from_video(record)
+            return derived or canonical
+        if mode == "if_short" and self._task_seems_bad(canonical):
+            derived = self._derive_task_from_video(record)
+            if derived:
+                return derived
+        return canonical
+
+    def _task_seems_bad(self, task: str) -> bool:
+        if not task:
+            return True
+        if len(task.split()) < int(self.config.derive_task_min_words):
+            return True
+        return task.lower() in self._PLACEHOLDER_TASKS
+
+    @staticmethod
+    def _task_aug_rows(phrasings: Sequence[str], t0: float) -> list[dict[str, Any]]:
+        """Build deduplicated ``task_aug`` rows (role=user) at ``t0``."""
+        seen: set[str] = set()
+        rows: list[dict[str, Any]] = []
+        for phrasing in phrasings:
+            key = phrasing.strip()
+            if not key or key in seen:
+                continue
+            seen.add(key)
+            rows.append(
+                {"role": "user", "content": key, "style": "task_aug", "timestamp": t0, "tool_calls": None}
+            )
+        return rows
+
+    # ------------------------------------------------------------------
+    # VLM call helpers — every plan-module prompt follows the same shape:
+    # build messages → single VLM call → pull a named field.
+    # ------------------------------------------------------------------
+
+    def _vlm_field(self, messages: list[dict[str, Any]], field: str) -> Any:
+        """Run a single VLM call and return ``result[field]`` or ``None``.
+
+        Centralizes the ``vlm.generate_json([m])[0]`` + ``isinstance(dict)``
+        dance every prompt-call site needs.
+        """
+        result = self.vlm.generate_json([messages])[0]
+        if isinstance(result, dict):
+            return result.get(field)
+        return None
+
+    @staticmethod
+    def _text_message(text: str) -> list[dict[str, Any]]:
+        """One-shot text-only user message wrapped for ``generate_json``."""
+        return [{"role": "user", "content": [{"type": "text", "text": text}]}]
+
+    def _video_message(
+        self,
+        record: EpisodeRecord,
+        prompt: str,
+        window: tuple[float, float] | None = None,
+    ) -> list[dict[str, Any]]:
+        """User message combining the (optionally windowed) contact sheets with ``prompt``.
+
+        The prompt is always prefixed with a short explanation of how to read
+        the timestamped grids, so the model treats them as one ordered
+        sequence of frames rather than unrelated images.
+        """
+        prompt = _contact_sheet_preamble(self.config.contact_sheet_columns) + prompt
+        content = [*self._episode_video_block(record, window=window), {"type": "text", "text": prompt}]
+        return [{"role": "user", "content": content}]
+
+    def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
+        """Ask the VLM "what is this video about" with no task hint at all."""
+        text = self._vlm_field(self._video_message(record, load_prompt("plan_video_task")), "task")
+        return text.strip() if isinstance(text, str) and text.strip() else None
+
+    def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
+        """Generate ``n`` text-only paraphrases of ``base_task``."""
+        if n <= 0 or not base_task:
+            return []
+        prompt = load_prompt("plan_task_rephrasings").format(base_task=base_task, n=n)
+        raw = self._vlm_field(self._text_message(prompt), "rephrasings")
+        if not isinstance(raw, list):
+            return []
+        out = [item.strip().strip('"').strip("'") for item in raw if isinstance(item, str)]
+        return [s for s in out if s][:n]
+
+    # ------------------------------------------------------------------
+    # Structured 5-axis task augmentation (EgoMimic-style taxonomy)
+    # ------------------------------------------------------------------
+
+    def _generate_task_aug_by_axes(self, base_task: str, axes_cfg: Any) -> list[str]:
+        """One VLM call → variants along the 5-axis taxonomy.
+
+        Variants from all axes are flattened into a single list (the
+        downstream pipeline doesn't need to know about the per-axis
+        bucketing — every variant becomes a ``task_aug`` row). Order
+        is preserved for reproducibility: synonym_paraphrase first,
+        then omit_arm, then omit_orientation, then omit_grasp_method,
+        then combined_omissions.
+        """
+        if not base_task:
+            return []
+        prompt = load_prompt("plan_task_aug_axes").format(
+            base_task=base_task,
+            n_synonym=axes_cfg.synonym_paraphrase,
+            n_omit_arm=axes_cfg.omit_arm,
+            n_omit_orientation=axes_cfg.omit_orientation,
+            n_omit_grasp_method=axes_cfg.omit_grasp_method,
+            n_combined=axes_cfg.combined_omissions,
+        )
+        result = self.vlm.generate_json([self._text_message(prompt)])[0]
+        if not isinstance(result, dict):
+            return []
+        ordered_axes = (
+            "synonym_paraphrase",
+            "omit_arm",
+            "omit_orientation",
+            "omit_grasp_method",
+            "combined_omissions",
+        )
+        flat: list[str] = []
+        seen: set[str] = set()
+        for axis in ordered_axes:
+            entries = result.get(axis)
+            if not isinstance(entries, list):
+                continue
+            for item in entries:
+                if not isinstance(item, str):
+                    continue
+                key = item.strip().strip('"').strip("'")
+                if not key or key in seen:
+                    continue
+                seen.add(key)
+                flat.append(key)
+        return flat
+
+    def _episode_video_block(
+        self, record: EpisodeRecord, window: tuple[float, float] | None = None
+    ) -> list[dict[str, Any]]:
+        """Timestamped contact sheets for the describe / segmentation prompts.
+
+        Always renders the (optionally windowed) episode as contact sheets:
+        frames sampled at ``frames_per_second`` and packed into timestamped
+        JPEG grids. ``max_frames_per_prompt`` caps the frame count; whole
+        episodes that exceed it are windowed upstream in
+        :meth:`_generate_subtasks` so each call stays within budget while the
+        full episode keeps its sampling density.
+
+        When ``window=(w0, w1)`` is given the badges are WINDOW-RELATIVE
+        (``ts - w0``) to match the window-relative time frame the
+        segmentation prompt works in (spans are offset back to absolute time
+        afterwards).
+        """
+        if not record.frame_timestamps:
+            return []
+        if window is not None:
+            w0, w1 = float(window[0]), float(window[1])
+            dur = max(0.0, w1 - w0)
+            n = max(1, int(round(dur * self.config.frames_per_second)) + 1)
+            n = min(n, self.config.max_frames_per_prompt)
+            if n <= 1 or dur <= 0.0:
+                timestamps = [0.5 * (w0 + w1)]
+            else:
+                step = dur / (n - 1)
+                timestamps = [w0 + i * step for i in range(n)]
+            frames = self.frame_provider.frames_at(record, timestamps)
+            rel = [ts - w0 for ts in timestamps[: len(frames)]]
+            return self._contact_sheet_blocks(frames, rel)
+        episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
+        n = max(1, int(round(episode_duration * self.config.frames_per_second)) + 1)
+        n = min(n, self.config.max_frames_per_prompt)
+        timestamps = self._uniform_episode_timestamps(record, n)
+        frames = self.frame_provider.frames_at(record, timestamps)
+        return self._contact_sheet_blocks(frames, timestamps[: len(frames)])
+
+    @staticmethod
+    def _uniform_episode_timestamps(record: EpisodeRecord, n: int) -> list[float]:
+        """``n`` episode-relative timestamps spanning ``[t0, t_last]`` uniformly."""
+        ts = record.frame_timestamps
+        if n >= len(ts):
+            return [float(t) for t in ts]
+        t0, t_last = float(ts[0]), float(ts[-1])
+        if t_last <= t0 or n <= 1:
+            return [t0] * max(1, n)
+        step = (t_last - t0) / (n - 1)
+        return [t0 + i * step for i in range(n)]
+
+    def _contact_sheet_blocks(self, frames: list[Any], timestamps: list[float]) -> list[dict[str, Any]]:
+        """Build timestamped contact-sheet image blocks from decoded frames."""
+        return to_contact_sheet_blocks(
+            frames,
+            timestamps,
+            columns=self.config.contact_sheet_columns,
+            frames_per_sheet=self.config.contact_sheet_frames_per_sheet,
+            frame_width=self.config.contact_sheet_frame_width,
+            quality=self.config.contact_sheet_quality,
+        )
+
+    def run_plan_updates(
+        self,
+        record: EpisodeRecord,
+        staging: EpisodeStaging,
+        interjection_times: Sequence[float],
+        interjection_texts: Sequence[str] | None = None,
+    ) -> None:
+        """Append additional ``plan`` rows at every interjection timestamp.
+
+        These refreshes are event-driven: one per user interjection, on top
+        of the per-boundary plan rows ``run_episode`` already wrote. The text
+        is passed to ``_generate_plan``, which currently ignores it.
+        """
+        if not self.config.emit_plan:
+            return
+        existing = staging.read("plan")
+        # Pass the last frame timestamp so the final span is closed (else its
+        # end == start, zero duration, and a refresh inside it is missed).
+        episode_end_t = float(record.frame_timestamps[-1]) if record.frame_timestamps else None
+        spans = reconstruct_subtask_spans(existing, episode_end_t=episode_end_t)
+        already_planned: set[float] = {float(r["timestamp"]) for r in existing if r.get("style") == "plan"}
+        new_rows = list(existing)
+
+        texts: list[str | None] = (
+            [None] * len(interjection_times)
+            if interjection_texts is None
+            else [str(t) if t else None for t in interjection_texts]
+        )
+        for raw_t, inter_text in zip(interjection_times, texts, strict=True):
+            t = snap_to_frame(raw_t, record.frame_timestamps)
+            if t in already_planned:
+                continue
+            already_planned.add(t)
+            plan_text = self._generate_plan(record, spans, refresh_t=t, interjection=inter_text)
+            if plan_text is not None:
+                new_rows.append(
+                    {
+                        "role": "assistant",
+                        "content": plan_text,
+                        "style": "plan",
+                        "timestamp": t,
+                        "tool_calls": None,
+                    }
+                )
+        staging.write("plan", new_rows)
+
+    def _generate_subtasks(self, record: EpisodeRecord, *, task: str | None = None) -> list[dict[str, Any]]:
+        """Generate subtask spans, optionally via a multi-call quality chain.
+
+        Single call (default): watch video → emit subtask JSON.
+
+        Multi-call (opt-in, higher quality, more VLM calls):
+          1. ``subtask_describe_first`` — a grounding pass that narrates
+             ONLY what is visible (no JSON commitment to subtasks yet);
+             its description is injected into the segmentation prompt so
+             the model segments its own grounded observations instead of
+             pattern-matching the task text.
+          2. segmentation — emit subtask JSON (as before).
+        """
+        if record.row_count == 0 or not record.frame_timestamps:
+            return []
+        episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
+        effective_task = task if task is not None else record.episode_task
+
+        # ---- Auto-windowing (keeps the full sampling density) --------
+        # Contact sheets are cheap, but a whole long episode sampled at
+        # ``frames_per_second`` can still exceed ``max_frames_per_prompt``.
+        # When it does, split into consecutive windows of exactly that many
+        # frames (one describe→segment call each, still at the full sampling
+        # density), then merge + stitch — so an episode of any length is
+        # covered at full density rather than subsampled into one sparse call.
+        fps = max(1e-6, float(self.config.frames_per_second))
+        n_whole = int(round(episode_duration * fps)) + 1
+        if n_whole > self.config.max_frames_per_prompt:
+            window_s = self.config.max_frames_per_prompt / fps
+            return self._generate_subtasks_windowed(record, effective_task, window_s)
+
+        # ---- Pass 1 (optional): grounding description ----------------
+        observation_block = ""
+        if getattr(self.config, "subtask_describe_first", False):
+            description = self._describe_episode(record, effective_task)
+            if description:
+                observation_block = (
+                    "You watched this video and described, chronologically, "
+                    "ONLY what the robot actually does:\n"
+                    f'"""{description}"""\n\n'
+                    "Segment THAT grounded description (cross-checked against "
+                    "the video) into atomic subtasks. Do not introduce any "
+                    "action that is not in your description above.\n\n"
+                )
+
+        # ---- Pass 2: segmentation ------------------------------------
+        prompt = self._with_causal_rules(
+            load_prompt("plan_subtasks").format(
+                episode_task=effective_task,
+                min_subtask_seconds=self.config.min_subtask_seconds,
+                max_steps=self.config.plan_max_steps,
+                episode_duration=f"{episode_duration:.3f}",
+                observation_block=observation_block,
+            )
+        )
+        spans = self._vlm_field(self._video_message(record, prompt), "subtasks")
+        cleaned = self._clean_spans(spans, record)
+        if not cleaned:
+            return []
+
+        # ---- Full-episode coverage stitch ----------------------------
+        # The VLM can start after t0 or leave gaps, so frames fall through
+        # with no active subtask. Always stitch into a contiguous
+        # [t0, t_last] cover.
+        cleaned = self._stitch_full_coverage(cleaned, record)
+
+        return cleaned
+
+    def _generate_subtasks_windowed(
+        self, record: EpisodeRecord, task: str, window_s: float
+    ) -> list[dict[str, Any]]:
+        """Subtask generation in fixed-length windows at constant fps.
+
+        Splits ``[t0, t_last]`` into consecutive windows of ``window_s``
+        seconds, runs the describe -> segment chain on each window's own
+        frames (sampled at ``frames_per_second``), offsets
+        each window's spans back to absolute episode time, then merges +
+        stitches into a contiguous whole-episode cover.
+        """
+        t0 = float(record.frame_timestamps[0])
+        t_last = float(record.frame_timestamps[-1])
+        all_spans: list[dict[str, Any]] = []
+        w0 = t0
+        n_windows = 0
+        while w0 < t_last - 1e-6:
+            w1 = min(w0 + window_s, t_last)
+            all_spans.extend(self._subtasks_for_window(record, task, w0, w1))
+            n_windows += 1
+            w0 = w1
+        logger.info(
+            "episode %d: windowed subtask gen over %d window(s) of %.1fs -> %d raw spans",
+            record.episode_index,
+            n_windows,
+            window_s,
+            len(all_spans),
+        )
+        # Merge across windows: clamp to the absolute episode, sort, and
+        # frame-snap to distinct starts (handles any boundary collisions).
+        cleaned = self._clean_spans(all_spans, record)
+        if not cleaned:
+            return []
+        return self._stitch_full_coverage(cleaned, record)
+
+    def _subtasks_for_window(
+        self, record: EpisodeRecord, task: str, w0: float, w1: float
+    ) -> list[dict[str, Any]]:
+        """Run describe -> segment on one ``[w0, w1]`` window.
+
+        The model works in window-RELATIVE time ``[0, L]`` (it perceives
+        the window as a clip starting at 0); spans are offset back to
+        absolute ``[w0, w1]`` before returning.
+        """
+        window = (w0, w1)
+        win_len = max(0.0, w1 - w0)
+
+        observation_block = ""
+        if getattr(self.config, "subtask_describe_first", False):
+            description = self._describe_episode(record, task, window=window)
+            if description:
+                observation_block = (
+                    "You watched this video clip and described, chronologically, "
+                    "ONLY what the robot actually does:\n"
+                    f'"""{description}"""\n\n'
+                    "Segment THAT grounded description (cross-checked against "
+                    "the clip) into atomic subtasks. Do not introduce any "
+                    "action that is not in your description above.\n\n"
+                )
+
+        prompt = self._with_causal_rules(
+            load_prompt("plan_subtasks").format(
+                episode_task=task,
+                min_subtask_seconds=self.config.min_subtask_seconds,
+                max_steps=self.config.plan_max_steps,
+                episode_duration=f"{win_len:.3f}",
+                observation_block=observation_block,
+            )
+        )
+        spans = self._vlm_field(self._video_message(record, prompt, window=window), "subtasks")
+        # Window-relative clamp; no frame-snap dedupe yet (done on the
+        # merged absolute set).
+        cleaned = self._clean_spans(spans, record, bounds=(0.0, win_len), dedupe=False)
+        if not cleaned:
+            return []
+
+        # Offset window-relative spans back to absolute episode time.
+        for s in cleaned:
+            s["start"] = w0 + float(s["start"])
+            s["end"] = w0 + float(s["end"])
+        return cleaned
+
+    def _stitch_full_coverage(
+        self, spans: list[dict[str, Any]], record: EpisodeRecord
+    ) -> list[dict[str, Any]]:
+        """Make subtask spans tile the full episode with no gaps.
+
+        * The first subtask starts at the episode's first frame ``t0``
+          (any idle / approach before the first labelled action is folded
+          into it), so every early frame has an active subtask.
+        * Each subtask's ``end`` is snapped to the next subtask's
+          ``start`` (gaps between spans are closed), and the final
+          subtask's ``end`` extends to the last frame ``t_last``.
+
+        Starts are otherwise left as the (already frame-snapped, distinct)
+        values the VLM produced — only the FIRST start is pulled
+        back to ``t0``, which can't collide with a later span because it
+        was already the earliest. Purely deterministic; runs after the
+        VLM passes.
+        """
+        if not spans or not record.frame_timestamps:
+            return spans
+        t0 = float(record.frame_timestamps[0])
+        t_last = float(record.frame_timestamps[-1])
+        spans = sorted(spans, key=lambda s: float(s["start"]))
+        spans[0]["start"] = t0
+        for i in range(len(spans) - 1):
+            spans[i]["end"] = float(spans[i + 1]["start"])
+        spans[-1]["end"] = t_last
+        for s in spans:
+            if float(s["end"]) < float(s["start"]):
+                s["end"] = float(s["start"])
+        return spans
+
+    @staticmethod
+    def _with_causal_rules(prompt: str) -> str:
+        """Append the causal event-boundary rules to a describe/segment prompt."""
+        return f"{prompt}\n\n{_CAUSAL_BOUNDARY_RULES}"
+
+    def _clean_spans(
+        self,
+        spans: Any,
+        record: EpisodeRecord,
+        bounds: tuple[float, float] | None = None,
+        dedupe: bool = True,
+    ) -> list[dict[str, Any]]:
+        """Clamp / sort / (optionally) dedupe raw VLM subtask spans into valid rows.
+
+        ``bounds`` overrides the clamp range — pass the window's
+        ``(w_lo, w_hi)`` when cleaning window-relative spans, or leave
+        ``None`` to clamp to the whole episode ``[t0, t_last]``.
+        ``dedupe`` runs the frame-snap distinct-start step; skip it for
+        window-relative spans (frame snapping is done once on the merged,
+        absolute-time set).
+        """
+        if not spans:
+            return []
+        if bounds is not None:
+            lo, hi = float(bounds[0]), float(bounds[1])
+        else:
+            lo = record.frame_timestamps[0]
+            hi = record.frame_timestamps[-1]
+        cleaned: list[dict[str, Any]] = []
+        for span in spans:
+            try:
+                start = float(span["start"])
+                end = float(span["end"])
+                text = str(span["text"]).strip()
+            except (KeyError, ValueError, TypeError):
+                continue
+            start = max(lo, min(start, hi))
+            end = max(lo, min(end, hi))
+            if end < start:
+                start, end = end, start
+            if not text:
+                continue
+            cleaned.append({"text": text, "start": start, "end": end})
+        cleaned.sort(key=lambda s: s["start"])
+        if dedupe:
+            return self._dedupe_starts_to_distinct_frames(cleaned, record)
+        return cleaned
+
+    def _describe_episode(
+        self, record: EpisodeRecord, task: str, window: tuple[float, float] | None = None
+    ) -> str:
+        """Grounding pass: free-form chronological description of the (windowed) video."""
+        prompt = self._with_causal_rules(load_prompt("plan_subtask_describe").format(episode_task=task))
+        text = self._vlm_field(self._video_message(record, prompt, window=window), "description")
+        return text.strip() if isinstance(text, str) and text.strip() else ""
+
+    @staticmethod
+    def _dedupe_starts_to_distinct_frames(
+        spans: list[dict[str, Any]], record: EpisodeRecord
+    ) -> list[dict[str, Any]]:
+        """Bump same-frame subtask starts onto distinct frames.
+
+        Two consecutive VLM spans whose ``start`` rounds to the same
+        source frame (after :func:`snap_to_frame`) would otherwise emit
+        two ``style=subtask`` rows at the identical persistent
+        timestamp. The training-time renderer's ``active_at(t,
+        style=subtask)`` resolver can't disambiguate that and raises
+        ``Ambiguous resolver for style='subtask'``.
+
+        Walk the (sorted-by-start) spans, snap each to its frame, and
+        if the snapped frame is already taken push the span onto the
+        next unused frame so both subtasks survive on distinct
+        timestamps. If the episode ends before a free frame is found,
+        the trailing span is dropped with a warning — better than
+        poisoning the render.
+        """
+        if not spans:
+            return spans
+        frames = record.frame_timestamps
+        if not frames:
+            return spans
+        used: set[float] = set()
+        out: list[dict[str, Any]] = []
+        for span in spans:
+            ts = snap_to_frame(span["start"], frames)
+            if ts in used:
+                next_ts = next((f for f in frames if f > ts and f not in used), None)
+                if next_ts is None:
+                    logger.warning(
+                        "episode %d: subtask %r snapped to occupied frame "
+                        "%.3f and no free later frame exists — dropping",
+                        record.episode_index,
+                        span.get("text"),
+                        ts,
+                    )
+                    continue
+                ts = next_ts
+            used.add(ts)
+            new_span = {**span, "start": ts}
+            if float(new_span.get("end", ts)) < ts:
+                new_span["end"] = ts
+            out.append(new_span)
+        return out
+
+    def _generate_plan(
+        self,
+        record: EpisodeRecord,  # noqa: ARG002  (kept for signature stability)
+        subtask_spans: Sequence[dict[str, Any]],
+        *,
+        refresh_t: float | None = None,
+        interjection: str | None = None,  # noqa: ARG002
+        task: str | None = None,  # noqa: ARG002
+    ) -> str | None:
+        """Deterministic plan = numbered list of *still-todo* subtasks.
+
+        No VLM call: a plain numbered list keeps the plan aligned with the
+        upcoming subtasks (the old VLM "compact hierarchical plan" prompt
+        cost a round-trip per episode/refresh and could diverge).
+
+            1. <subtask 1>
+            2. <subtask 2>
+
+        On a refresh at ``refresh_t`` (from ``run_plan_updates`` on
+        interjections, and ``run_episode`` at each boundary), only subtasks
+        starting at or after ``refresh_t`` are included — so it always
+        describes what's left.
+        """
+        if not subtask_spans:
+            return None
+        remaining = [
+            s for s in subtask_spans if refresh_t is None or float(s.get("start", 0.0)) >= float(refresh_t)
+        ]
+        if not remaining:
+            # Past the last subtask boundary on a late refresh — nothing
+            # left to plan; emit None so the caller skips the row.
+            return None
+        return "\n".join(f"{i}. {span.get('text', '').strip()}" for i, span in enumerate(remaining, start=1))
+
+    def _generate_memory(
+        self,
+        record: EpisodeRecord,
+        prior_memory: str,
+        completed: str,
+        remaining: Sequence[str],
+        *,
+        task: str | None = None,
+    ) -> str:
+        prompt = load_prompt("plan_memory").format(
+            episode_task=(task if task is not None else record.episode_task),
+            prior_memory=prior_memory or "(none)",
+            completed_subtask=completed,
+            remaining_subtasks=", ".join(remaining) if remaining else "(none)",
+        )
+        memory = self._vlm_field(self._text_message(prompt), "memory")
+        return memory.strip() if isinstance(memory, str) else ""
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/__init__.py b/src/lerobot/annotations/steerable_pipeline/prompts/__init__.py
new file mode 100644
index 000000000..5ce8e163b
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/__init__.py
@@ -0,0 +1,33 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Prompt templates loaded as plain text.
+
+One file per use site. Templates use ``str.format(**vars)`` substitution; we
+intentionally avoid jinja2 here so the templates remain inspectable in
+plain editors and roundtrip cleanly through ``ruff format``.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+_DIR = Path(__file__).parent
+
+
+def load(name: str) -> str:
+    """Read the prompt template ``<name>.txt`` from the ``prompts/`` directory."""
+    path = _DIR / f"{name}.txt"
+    return path.read_text(encoding="utf-8")
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/interjections_initial_speech.txt b/src/lerobot/annotations/steerable_pipeline/prompts/interjections_initial_speech.txt
new file mode 100644
index 000000000..625ce920c
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/interjections_initial_speech.txt
@@ -0,0 +1,12 @@
+The user just asked the robot: "{episode_task}".
+
+Generate a short verbal acknowledgement the robot would speak back before
+beginning the task. Style: compact, confident, friendly.
+
+Examples (Hi Robot, Shi 2025): "Sure, I won't put cheese on it.",
+"OK, starting with the sponge.", "Got it.".
+
+Prefer very short replies: "Got it.", "On it.", "OK."
+
+Output strictly valid JSON:
+  {{ "text": "<the spoken acknowledgement>" }}
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/interjections_interjection.txt b/src/lerobot/annotations/steerable_pipeline/prompts/interjections_interjection.txt
new file mode 100644
index 000000000..4a4719f54
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/interjections_interjection.txt
@@ -0,0 +1,46 @@
+You are generating training data for a Hi Robot-style hierarchical
+robot policy. The robot in this demonstration has ALREADY executed
+every step shown in the video — we cannot retroactively change the
+action stream. To keep training data consistent with the video, the
+"interjection" must align with what the robot is *about to do next* in
+the demonstration, framed as a natural mid-task user request.
+
+The episode's overall task: "{episode_task}".
+
+The images above show roughly {window_seconds:.1f} seconds straddling a
+subtask boundary in the demonstration:
+
+- Subtask the robot just finished: "{prev_subtask}"
+- Subtask the robot is about to start: "{next_subtask}"
+- Time into episode: {timestamp:.2f}s
+
+Write ONE compact interjection the user would naturally say at this
+moment to prompt / confirm / encourage the robot to do "{next_subtask}".
+Keep it like a mid-task coaching cue, not a full instruction paragraph.
+Also write the robot's compact verbal acknowledgement.
+
+Hard rules:
+
+- The interjection MUST be consistent with the next subtask. The user
+  cannot ask for something different from what the robot then does in
+  the video. If you're tempted to say "actually skip X" or "do Y
+  instead", DO NOT — those would contradict the demonstration.
+- The interjection must reference an object, location, or action that
+  is plausible given the visible scene and the next subtask text.
+- One short phrase or sentence each. Conversational, not robotic.
+- Prefer direct cues: "{next_subtask}, please."; "Now {next_subtask}."
+- Keep robot speech very short: "OK.", "On it.", "Doing that."
+
+Style examples (vary the phrasing — don't reuse these verbatim):
+  - "Now go ahead and {next_subtask}."
+  - "Great, can you {next_subtask} next?"
+  - "{next_subtask}, please."
+  - "Before you continue, please {next_subtask}."
+  - "Looking good — {next_subtask} now."
+  - "Okay, {next_subtask}."
+
+Output strictly valid JSON:
+  {{
+    "interjection": "<short cue from the user, asking for the next subtask>",
+    "speech":       "<short robot acknowledgement>"
+  }}
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/plan_memory.txt b/src/lerobot/annotations/steerable_pipeline/prompts/plan_memory.txt
new file mode 100644
index 000000000..b5278368b
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/plan_memory.txt
@@ -0,0 +1,36 @@
+You are updating the robot's compressed semantic memory at the boundary of
+a completed subtask.
+
+Reference (verbatim from MEM, Torne 2026):
+"Remove or compress information in the language memory whenever
+appropriate. Keep ONLY the minimal set of relevant information for future
+task execution. Specific object attributes (colors, precise quantities of
+each item) get discarded when their details won't affect subsequent
+actions. Functional outcomes (where items went, how many) are preserved."
+
+Episode task: "{episode_task}"
+Previous memory: {prior_memory}
+Just-completed subtask: "{completed_subtask}"
+Remaining subtasks (for relevance judgement only): {remaining_subtasks}
+
+Write the memory as a short FIRST-PERSON, PAST-TENSE narrative of what the
+robot has accomplished so far — the running story it would tell itself.
+
+Authoring rules:
+- First person, past tense. Every sentence starts with "I": "I picked
+  up...", "I opened...", "I moved to...".
+- One or two short sentences. Extend the previous memory with the
+  just-completed subtask; do not rewrite it from scratch.
+- Keep WHAT happened (functional outcomes — where items went, how many),
+  drop HOW (grasp details, motions).
+- Compress completed steps and drop object attributes (colors, exact
+  counts) once they no longer affect the remaining subtasks.
+
+Example (MEM, Torne 2026):
+  Before: "I prepared the pot and got the potatoes, milk, and butter. I
+           moved to the drawer."
+  After:  "I prepared the pot and got the ingredients. I opened the
+           drawer with the masher."
+
+Output strictly valid JSON:
+  {{ "memory": "<one or two short first-person past-tense sentences>" }}
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/plan_subtask_describe.txt b/src/lerobot/annotations/steerable_pipeline/prompts/plan_subtask_describe.txt
new file mode 100644
index 000000000..6b709e41d
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/plan_subtask_describe.txt
@@ -0,0 +1,27 @@
+You are watching a teleoperated robot demonstration from a single
+camera. The user asked the robot to: "{episode_task}"
+
+This is an OBSERVATION pass. Watch the entire clip and describe, in
+chronological order, ONLY what the robot physically does — the concrete
+motions, approaches, contacts, grasps, releases, and relocations you can
+actually SEE in the frames.
+
+Hard rules:
+- Describe only motion visible in the video. Do NOT use the task
+  instruction to guess steps that aren't shown. The instruction is the
+  goal; the video is ground truth.
+- Do NOT segment into named subtasks yet and do NOT output JSON beyond
+  the single field below. Just narrate what happens.
+- Give an approximate timestamp (in seconds) for each distinct event,
+  e.g. "0.0-1.4s: the base drives forward toward the stove".
+- Do NOT invent objects, grasps, destinations, or steps. If the robot
+  only does one thing (e.g. it just navigates and the clip ends), say
+  exactly that and nothing more.
+- Be concrete and literal. "the gripper closes on the mug" — not "the
+  robot prepares to make coffee".
+
+Output strictly valid JSON:
+
+  {{
+    "description": "<chronological, timestamped description of ONLY what is visible>"
+  }}
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/plan_subtasks.txt b/src/lerobot/annotations/steerable_pipeline/prompts/plan_subtasks.txt
new file mode 100644
index 000000000..e6a5260a7
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/plan_subtasks.txt
@@ -0,0 +1,112 @@
+You are labeling a teleoperated robot demonstration.
+
+The user originally asked: "{episode_task}"
+
+You are shown the entire demonstration as a single video. Watch the
+whole clip, then segment it into a list of consecutive atomic subtasks
+the robot performs.
+
+{observation_block}GROUNDING — read this first, it overrides everything below:
+- Label ONLY what the robot actually does in the video. Every subtask
+  you emit must correspond to motion you can SEE in specific frames.
+- Do NOT invent, anticipate, or pad. If the robot only does one thing
+  (e.g. it just navigates to a location and the clip ends), emit
+  EXACTLY ONE subtask. Many demonstrations are a single atomic skill.
+- ``max_steps`` below is a hard CEILING, not a target. Emitting fewer
+  subtasks than the ceiling is not just allowed, it is expected for
+  short / atomic demonstrations. One correct subtask is far better
+  than several invented ones.
+- If the video does not clearly show the action implied by the task,
+  describe what you actually see — do NOT fabricate the task's steps
+  from the instruction text. The instruction tells you the goal; the
+  VIDEO is the ground truth for what happened.
+
+Authoring rules — Hi Robot atom granularity, pi0.7-style short prompts:
+
+- Each subtask = one COMPOSITE atomic skill the low-level policy can
+  execute end-to-end. A "skill" bundles its own approach motion with
+  its terminal action — do NOT split the approach off as its own
+  subtask. The whole-arm policy already learns to reach as part of
+  every manipulation primitive.
+- Write each subtask as an IMPERATIVE COMMAND, starting with one of
+  these verbs (extend only when none fits):
+    pick up <obj>           — approach + grasp + lift in one subtask
+    put <obj> on/in <loc>   — transport + release in one subtask
+    place <obj> on/in <loc> — synonym of "put"; pick one and stay consistent
+    push <obj>              — contact + linear shove
+    pull <obj>              — contact + linear retract
+    turn <knob/dial/handle> — rotary actuation
+    press <button>          — single-press contact
+    open <drawer/door/lid>  — full open motion
+    close <drawer/door/lid> — full close motion
+    pour <src> into <dst>   — tilt + flow
+    insert <obj> into <slot>— alignment + push-fit
+    go to <loc>             — ONLY when no grasp / actuation follows
+                             (e.g. a pure relocation between phases).
+                             If the next subtask grasps something at
+                             that location, drop "go to ..." and just
+                             write "pick up ..." instead.
+- Forbidden ultra-fine splits — the VLM is NOT allowed to emit these
+  as standalone subtasks; fold them into the parent composite:
+    "move to X"   → fold into "pick up X" (or whatever follows)
+    "reach for X" → fold into "pick up X"
+    "grasp X"     → fold into "pick up X"
+    "lift X"      → fold into "pick up X" (or "put X on Y" if it's
+                    the transport phase of a place)
+    "release X"   → fold into "put X on Y" (or "place X in Y")
+- Keep it SHORT — a verb phrase, not a sentence. Drop articles
+  ("the", "a") and adverbs ("carefully", "slowly"). Add a "how"
+  detail (which hand, which grasp point) ONLY when it is needed to
+  disambiguate. Every subtask must begin with a verb listed above or in
+  the special cases below (no leading nouns, no "then", no "first").
+- NEVER use third person. Never write "the robot", "the arm", "the
+  gripper moves", "it picks up" — the robot is implied. Command it,
+  do not describe it.
+- Use the exact object nouns from the task above. If the task says
+  "cube", every subtask says "cube" — never switch to "block". If it
+  says "box", never switch to "bin"/"container". Keep vocabulary
+  consistent across the whole episode.
+- Good: "pick up blue cube", "put blue cube in box", "open drawer",
+  "turn red knob", "press start button", "go to sink".
+- Bad: "move to blue cube" (approach as its own subtask — forbidden,
+  must be folded into "pick up blue cube"); "the robot arm moves
+  towards the blue cube" (third person, too long); "carefully pick
+  up the cube" (adverb, article); "release the yellow block"
+  ("block" when the task said "cube", and "release" must be folded
+  into a "put"/"place" subtask).
+- Subtasks are non-overlapping and cover the full episode in order.
+  Choose the cut points yourself based on what you see in the video
+  (gripper open/close events, contact, regrasps, transitions).
+- Each subtask spans at least {min_subtask_seconds} seconds. If a
+  candidate span would be shorter, merge it into its neighbour
+  rather than emitting it.
+- Do not exceed {max_steps} subtasks total. Fewer, larger composites
+  are preferred over many micro-steps.
+- Every subtask's [start_time, end_time] must lie within
+  [0.0, {episode_duration}] seconds.
+
+SPECIAL CASES — verb disambiguation (each rule is narrowly visual and
+fires ONLY on the spatial situation it names; it must not change how you
+label any other situation):
+- STACK vs PUT: if an object is placed ON TOP OF another specific object
+  (not on a flat table / shelf / counter), use "stack ... on ...", not "put":
+  "stack blue book on green book", NOT "put blue book on green book".
+- INSERT vs PUT: if an object goes INTO a fitted slot / hole / socket /
+  receptacle (push-fit), use "insert ... into ...", not "put".
+- RETRIEVE/PICK-UP vs PUT (direction): watch the gripper. If it CLOSES
+  on the object and the object moves WITH the hand, it is "pick up" /
+  "retrieve" (object leaves its location). If the gripper OPENS and the
+  object stays where the hand left it, it is "put" / "place" (object
+  arrives at a location). Decide by which way the object moves, not by
+  where the hand ends up.
+- POUR vs PUT: only use "pour" when the source is tilted and contents
+  flow out; moving a full container without tilting is "put"/"place".
+
+Output strictly valid JSON of shape:
+
+  {{
+    "subtasks": [
+      {{"text": "<short imperative verb phrase>", "start": <float>, "end": <float>}},
+      ...
+    ]
+  }}
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/plan_task_aug_axes.txt b/src/lerobot/annotations/steerable_pipeline/prompts/plan_task_aug_axes.txt
new file mode 100644
index 000000000..8b19a0a8e
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/plan_task_aug_axes.txt
@@ -0,0 +1,67 @@
+You are generating structured augmentations of a robot task instruction
+for training a language-conditioned policy. Unlike free-form rephrasing,
+your variants follow a NAMED 5-axis taxonomy — each axis omits or varies
+a specific element of the task while preserving its meaning.
+
+Original task: "{base_task}"
+
+Produce variants along five named axes. Each axis has a target count.
+The whole batch should expose the policy to maximum linguistic diversity
+WITHOUT changing what the robot is supposed to do.
+
+Axes and target counts:
+
+  synonym_paraphrase ({n_synonym}):
+    Different wording / verbs / sentence structure. ALL information
+    from the original task is preserved — same object, same arm
+    specification if present, same orientation if present, same grasp
+    if present.
+
+  omit_arm ({n_omit_arm}):
+    Drop the left/right/both arm specification from the task. Skip
+    entirely (emit 0 entries) if the original task does NOT mention an
+    arm. Do not invent an arm specification just to omit it.
+
+  omit_orientation ({n_omit_orientation}):
+    Drop orientation cues (upright, sideways, facing the user,
+    long-edge-first, etc.). Skip entirely if no orientation cue is
+    present in the original task.
+
+  omit_grasp_method ({n_omit_grasp_method}):
+    Drop the grip / grasp method specification (pinch, wrap, hold by
+    the rim, etc.). Skip entirely if no grasp method is mentioned.
+
+  combined_omissions ({n_combined}):
+    Combine TWO of the above omissions simultaneously (e.g. drop both
+    arm and orientation). Skip entirely if fewer than two of (arm,
+    orientation, grasp_method) appear in the original task.
+
+Hard rules:
+- Each variant MUST preserve the core action, the target object, AND
+  the goal / destination. Do not change which object is involved, where
+  it goes, or the high-level action. "Navigate to the stove" may become
+  "go to the stove" or "head over to the stove" — it must NEVER become
+  "wander around the kitchen", "explore the room", or anything that
+  drops or generalises the stove destination. If you cannot vary the
+  wording without changing the goal, emit fewer variants.
+- Only the FIVE listed elements (wording, arm, orientation, grasp
+  method, or a combination) may be varied or omitted. The verb's
+  meaning, the object, and the destination are fixed.
+- Each variant is plain prose, no markdown, no quotes, no list numbers.
+- Each variant must be DISTINCT from every other variant in the entire
+  output, both within and across axes. Near-duplicates are not allowed.
+- If an axis cannot reach its target count because the original task
+  lacks the omittable element, emit fewer entries — do NOT pad the
+  axis with paraphrases that belong to a different axis.
+- Variants should not all start with verbs — vary sentence structure
+  (some imperative, some polite request, some question).
+
+Output strictly valid JSON of shape:
+
+  {{
+    "synonym_paraphrase": ["<v1>", "<v2>", ...],
+    "omit_arm": ["<v1>", "<v2>", ...],
+    "omit_orientation": ["<v1>", ...],
+    "omit_grasp_method": ["<v1>", ...],
+    "combined_omissions": ["<v1>", ...]
+  }}
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/plan_task_rephrasings.txt b/src/lerobot/annotations/steerable_pipeline/prompts/plan_task_rephrasings.txt
new file mode 100644
index 000000000..602892bd3
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/plan_task_rephrasings.txt
@@ -0,0 +1,32 @@
+You are generating training data for a Hi Robot-style policy. We need
+{n} alternative phrasings of the same robot task so the policy sees
+diverse user prompts during training instead of the same canonical
+string repeated every frame.
+
+Original task:
+"{base_task}"
+
+Generate exactly {n} alternative phrasings of the same task. Vary:
+
+- formality (casual / polite / curt)
+- verbosity (mostly short imperative; occasional polite request)
+- word choice (synonyms, different verbs)
+- sentence structure (imperative / question / suggestion)
+
+Hard rules:
+- Each phrasing MUST preserve the exact meaning of the original task.
+  Do not change which object is involved, the destination, or the
+  action. Do not add extra steps. Do not invent new objects.
+- Each phrasing must be a short phrase or sentence, plain prose, no
+  markdown, no quotes, no list numbers.
+- Phrasings must be distinct — no near-duplicates.
+- Output exactly {n} entries.
+
+Output strictly valid JSON:
+  {{
+    "rephrasings": [
+      "<phrasing 1>",
+      "<phrasing 2>",
+      ...
+    ]
+  }}
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/plan_video_task.txt b/src/lerobot/annotations/steerable_pipeline/prompts/plan_video_task.txt
new file mode 100644
index 000000000..fcaae7046
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/plan_video_task.txt
@@ -0,0 +1,17 @@
+The video above shows a robot manipulation episode in full. Look at
+the entire video and describe in ONE concise sentence what the robot
+is doing.
+
+Rules:
+- One sentence, in natural English, like a user instruction.
+- Capture the goal of the demonstration, not low-level motions.
+  Example: "place the yellow cube into the red bin" — not "move the
+  end-effector down 5cm and close the gripper".
+- 4 to 15 words. Plain prose, no markdown, no bullets, no quotes.
+- Do not invent objects or actions that aren't visible.
+- Do not output anything other than the JSON object below.
+
+Output strictly valid JSON:
+  {{
+    "task": "<single concise sentence describing what the robot does in this video>"
+  }}
diff --git a/src/lerobot/annotations/steerable_pipeline/prompts/vqa.txt b/src/lerobot/annotations/steerable_pipeline/prompts/vqa.txt
new file mode 100644
index 000000000..23590b381
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/prompts/vqa.txt
@@ -0,0 +1,32 @@
+You are generating a frame-grounded visual question/answer pair for
+chain-of-thought training. Reference: ECoT (Zawalski 2024) and Steerable
+Policies — both train policies on grounded features such as bounding box
+pixel coordinates, keypoints, counts, attributes, and spatial relations.
+
+The frame shows a robot working on: "{episode_task}".
+
+Question types and the EXACT answer JSON shape required for each:
+
+  bbox       => {{"detections": [{{"label": "<obj>", "bbox_format": "xyxy",
+                                    "bbox": [x1, y1, x2, y2]}}, ...]}}
+                bbox is in pixel coordinates (x_min, y_min, x_max, y_max).
+                ECoT example: "a white cup [124, 25, 176, 113]".
+
+  keypoint   => {{"label": "<point>", "point_format": "xy",
+                  "point": [x, y]}}
+
+  count      => {{"label": "<obj>", "count": <int>,
+                  "note": "<optional short note>"}}
+
+  attribute  => {{"label": "<obj>", "attribute": "<color|shape|state|...>",
+                  "value": "<observed value>"}}
+
+  spatial    => {{"subject": "<obj>", "object": "<obj>",
+                  "relation": "<left_of|right_of|on|in|above|below|near>"}}
+
+Generate a question of type "{question_type}". Output strictly valid JSON:
+
+  {{
+    "question": "<short, frame-grounded question>",
+    "answer":   <object whose shape matches the schema above>
+  }}
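Reviewer note: the doubled braces in these prompt files suggest they are rendered with Python's `str.format`, although the loader is not shown in this part of the diff. The sketch below illustrates that assumption with a shortened stand-in template; `TEMPLATE` and `render` are hypothetical names, not part of the pipeline.

```python
# Hypothetical stand-in for a prompt file; the real loader is not shown here.
TEMPLATE = (
    'The images above show roughly {window_seconds:.1f} seconds.\n'
    'Time into episode: {timestamp:.2f}s\n'
    'Output strictly valid JSON:\n'
    '  {{ "interjection": "<cue>", "speech": "<ack>" }}\n'
)


def render(template: str, **fields: object) -> str:
    """Fill named placeholders; doubled braces come out as literal braces."""
    return template.format(**fields)


rendered = render(TEMPLATE, window_seconds=2.04, timestamp=7.5)
print(rendered)
```

Under this assumption, every literal brace in a prompt file has to stay doubled, and a placeholder missing from the call raises `KeyError` rather than being left blank.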
diff --git a/src/lerobot/annotations/steerable_pipeline/reader.py b/src/lerobot/annotations/steerable_pipeline/reader.py
new file mode 100644
index 000000000..22fe4ac26
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/reader.py
@@ -0,0 +1,216 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Datatrove-shaped reader.
+
+The reader walks ``data/chunk-*/file-*.parquet`` and yields one record per
+episode containing:
+
+- ``episode_index``: int
+- ``frame_timestamps``: tuple[float, ...]
+- ``frame_indices``: tuple[int, ...]
+- ``episode_task``: str (canonical task from ``meta/tasks.parquet``)
+- ``data_path``: pathlib.Path of the source parquet shard
+- ``frames_df()``: method returning the pandas.DataFrame slice for the episode (loaded on demand)
+
+This shape lets each module operate per-episode without loading all parquet
+rows into memory at once.
+"""
+
+from __future__ import annotations
+
+from collections.abc import Iterator, Sequence
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+import pyarrow.parquet as pq
+
+from lerobot.datasets.io_utils import load_tasks
+from lerobot.datasets.utils import DEFAULT_TASKS_PATH
+
+
+@dataclass
+class EpisodeRecord:
+    """Per-episode record yielded by the reader."""
+
+    episode_index: int
+    episode_task: str
+    frame_timestamps: tuple[float, ...]
+    frame_indices: tuple[int, ...]
+    data_path: Path
+    row_offset: int  # row offset within the parquet file where this episode starts
+    row_count: int  # number of rows for this episode
+
+    # Memoized parquet slice — populated on first ``frames_df()`` call so
+    # repeat queries from different modules don't re-read the whole shard.
+    _frames_df_cache: Any = field(default=None, init=False, repr=False, compare=False)
+
+    def frames_df(self):  # type: ignore[no-untyped-def]
+        """Lazy-load the pandas slice for this episode (memoized)."""
+        if self._frames_df_cache is None:
+            import pandas as pd  # noqa: PLC0415  - deferred for optional dataset extra
+
+            table = pq.read_table(self.data_path)
+            df: pd.DataFrame = table.to_pandas()
+            self._frames_df_cache = df.iloc[self.row_offset : self.row_offset + self.row_count].reset_index(
+                drop=True
+            )
+        return self._frames_df_cache
+
+
+def reconstruct_subtask_spans(
+    rows: Sequence[dict[str, Any]],
+    *,
+    episode_end_t: float | None = None,
+) -> list[dict[str, Any]]:
+    """Turn ``style="subtask"`` rows into ``{text, start, end}`` spans.
+
+    Each span's ``end`` is the next span's ``start``. The final span's
+    ``end`` defaults to its own ``start`` (zero-duration) — pass
+    ``episode_end_t`` to extend it to the episode's last frame instead,
+    which is what downstream consumers (memory, interjection boundary
+    selection) expect.
+
+    Used by the ``plan`` module (plan-update pass) and the
+    ``interjections`` module (interjection anchoring), which both need the
+    same span shape.
+    """
+    sorted_rows = sorted(
+        (r for r in rows if r.get("style") == "subtask"),
+        key=lambda r: float(r["timestamp"]),
+    )
+    spans: list[dict[str, Any]] = []
+    for r in sorted_rows:
+        t = float(r["timestamp"])
+        if spans:
+            spans[-1]["end"] = t
+        spans.append({"text": r.get("content") or "", "start": t, "end": t})
+    if spans and episode_end_t is not None and float(episode_end_t) > spans[-1]["start"]:
+        spans[-1]["end"] = float(episode_end_t)
+    return spans
+
+
+def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
+    """Snap an arbitrary float to the nearest exact source frame timestamp.
+
+    Modules use this when emitting event-style rows so the row's
+    timestamp matches a real parquet frame: event rows must land on an
+    exact frame, otherwise the per-frame event lookup the writer does
+    would never match them.
+    """
+    if not frame_timestamps:
+        return float(t)
+    nearest = min(frame_timestamps, key=lambda f: abs(f - t))
+    return float(nearest)
+
+
+def _load_tasks_lookup(root: Path) -> dict[int, str]:
+    """Map ``task_index -> task`` from ``meta/tasks.parquet``.
+
+    Returns an empty dict when the file is absent — the task description is
+    derived later from the video if needed. Reuses the library-level
+    :func:`lerobot.datasets.io_utils.load_tasks`, which returns the tasks
+    frame indexed by task string with a ``task_index`` column.
+    """
+    if not (root / DEFAULT_TASKS_PATH).exists():
+        return {}
+    tasks = load_tasks(root)
+    return {int(idx): str(task) for task, idx in zip(tasks.index, tasks["task_index"], strict=True)}
+
+
+def iter_episodes(root: Path, *, only_episodes: tuple[int, ...] | None = None) -> Iterator[EpisodeRecord]:
+    """Yield :class:`EpisodeRecord` for every episode under ``root/data/``.
+
+    Shards are visited in sorted path order and episodes in row order, which
+    is ascending ``episode_index`` for the standard layout. The reader scans
+    every ``*.parquet`` under ``data/`` and groups contiguous rows by episode.
+    """
+    tasks = _load_tasks_lookup(root)
+    data_dir = root / "data"
+    parquet_files = sorted(data_dir.rglob("*.parquet"))
+
+    only_set = set(only_episodes) if only_episodes is not None else None
+
+    for path in parquet_files:
+        yield from _iter_one_path(path, tasks, only_set)
+
+
+def _iter_one_path(path: Path, tasks: dict[int, str], only_set: set[int] | None) -> Iterator[EpisodeRecord]:
+    table = pq.read_table(path)
+    names = table.column_names
+    if "episode_index" not in names:
+        return
+    episode_col = table.column("episode_index").to_pylist()
+    timestamp_col = (
+        table.column("timestamp").to_pylist() if "timestamp" in names else [0.0] * len(episode_col)
+    )
+    frame_col = (
+        table.column("frame_index").to_pylist() if "frame_index" in names else list(range(len(episode_col)))
+    )
+    task_col = table.column("task_index").to_pylist() if "task_index" in names else None
+
+    def _build(
+        ep: int,
+        start: int,
+        end: int,
+        task_idx: int | None,
+        ts_buf: list[float],
+        fi_buf: list[int],
+    ) -> EpisodeRecord | None:
+        if only_set is not None and ep not in only_set:
+            return None
+        task = tasks.get(task_idx, "") if task_idx is not None else ""
+        return EpisodeRecord(
+            episode_index=ep,
+            episode_task=task,
+            frame_timestamps=tuple(ts_buf),
+            frame_indices=tuple(fi_buf),
+            data_path=path,
+            row_offset=start,
+            row_count=end - start,
+        )
+
+    cur_ep: int | None = None
+    start_offset = 0
+    ts_buf: list[float] = []
+    fi_buf: list[int] = []
+    cur_task_idx: int | None = None
+
+    for i, ep in enumerate(episode_col):
+        if cur_ep is None:
+            cur_ep = ep
+            start_offset = i
+            ts_buf = [timestamp_col[i]]
+            fi_buf = [frame_col[i]]
+            cur_task_idx = task_col[i] if task_col is not None else None
+            continue
+        if ep != cur_ep:
+            rec = _build(cur_ep, start_offset, i, cur_task_idx, ts_buf, fi_buf)
+            if rec is not None:
+                yield rec
+            cur_ep = ep
+            start_offset = i
+            ts_buf = [timestamp_col[i]]
+            fi_buf = [frame_col[i]]
+            cur_task_idx = task_col[i] if task_col is not None else None
+        else:
+            ts_buf.append(timestamp_col[i])
+            fi_buf.append(frame_col[i])
+
+    if cur_ep is not None:
+        rec = _build(cur_ep, start_offset, len(episode_col), cur_task_idx, ts_buf, fi_buf)
+        if rec is not None:
+            yield rec
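Reviewer note: a small worked example of the span reconstruction and frame snapping described in the docstrings above. The two helpers are condensed copies so the snippet runs without lerobot installed; `spans_from_rows`, the sample rows, and the `"memory"` style value are illustrative assumptions.

```python
from __future__ import annotations

from collections.abc import Sequence
from typing import Any


def snap_to_frame(t: float, frame_timestamps: Sequence[float]) -> float:
    # Same logic as the helper in reader.py, copied here for illustration.
    if not frame_timestamps:
        return float(t)
    return float(min(frame_timestamps, key=lambda f: abs(f - t)))


def spans_from_rows(
    rows: Sequence[dict[str, Any]], episode_end_t: float | None = None
) -> list[dict[str, Any]]:
    # Condensed copy of reconstruct_subtask_spans.
    ordered = sorted(
        (r for r in rows if r.get("style") == "subtask"),
        key=lambda r: float(r["timestamp"]),
    )
    spans: list[dict[str, Any]] = []
    for r in ordered:
        t = float(r["timestamp"])
        if spans:
            spans[-1]["end"] = t  # each span ends where the next one starts
        spans.append({"text": r.get("content") or "", "start": t, "end": t})
    if spans and episode_end_t is not None and episode_end_t > spans[-1]["start"]:
        spans[-1]["end"] = float(episode_end_t)
    return spans


frame_timestamps = [0.0, 0.5, 1.0, 1.5, 2.0]
rows = [
    {"style": "subtask", "timestamp": 1.0, "content": "put cube in box"},
    {"style": "subtask", "timestamp": 0.0, "content": "pick up cube"},
    {"style": "memory", "timestamp": 1.0, "content": "I picked up the cube."},
]
spans = spans_from_rows(rows, episode_end_t=frame_timestamps[-1])
snapped = snap_to_frame(1.2, frame_timestamps)
```

Non-subtask rows are ignored, input order does not matter, and a timestamp exactly halfway between two frames snaps to the earlier one because `min` returns the first minimum.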
diff --git a/src/lerobot/annotations/steerable_pipeline/staging.py b/src/lerobot/annotations/steerable_pipeline/staging.py
new file mode 100644
index 000000000..0b47c4dd6
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/staging.py
@@ -0,0 +1,92 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Per-episode staging.
+
+Each module writes its raw output as a JSONL file under
+``<staging_dir>/episode_{ep:06d}/<module>.jsonl``. The writer reads back this
+staging tree and partitions rows into the two language columns.
+
+JSONL is preferred over parquet here because the staging artifact is meant to
+be human-inspectable, easy to diff between prompt iterations, and trivially
+appended to. The final dataset format is parquet; staging is just an
+intermediate.
+"""
+
+from __future__ import annotations
+
+import json
+from collections.abc import Iterable
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+ModuleName = str
+
+_MODULES: tuple[ModuleName, ...] = (
+    "plan",
+    "interjections",
+    "vqa",
+)
+
+
+@dataclass
+class EpisodeStaging:
+    """Filesystem layout for a single episode's staged module outputs."""
+
+    root: Path
+    episode_index: int
+
+    @property
+    def episode_dir(self) -> Path:
+        return self.root / f"episode_{self.episode_index:06d}"
+
+    def path_for(self, module: ModuleName) -> Path:
+        if module not in _MODULES:
+            raise ValueError(f"Unknown module {module!r}; expected one of {_MODULES}")
+        return self.episode_dir / f"{module}.jsonl"
+
+    def write(self, module: ModuleName, rows: Iterable[dict[str, Any]]) -> Path:
+        path = self.path_for(module)
+        path.parent.mkdir(parents=True, exist_ok=True)
+        # Atomic replace: a crash mid-write would otherwise leave a
+        # half-written JSONL file that ``read()`` would then fail to
+        # parse. Write to a sibling .tmp and rename so the target path
+        # only ever points at a complete file.
+        tmp_path = path.with_suffix(path.suffix + ".tmp")
+        with tmp_path.open("w", encoding="utf-8") as f:
+            for row in rows:
+                f.write(json.dumps(row, ensure_ascii=False, sort_keys=True))
+                f.write("\n")
+        tmp_path.replace(path)
+        return path
+
+    def read(self, module: ModuleName) -> list[dict[str, Any]]:
+        path = self.path_for(module)
+        if not path.exists():
+            return []
+        out: list[dict[str, Any]] = []
+        with path.open(encoding="utf-8") as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    out.append(json.loads(line))
+        return out
+
+    def read_all(self) -> dict[ModuleName, list[dict[str, Any]]]:
+        return {m: self.read(m) for m in _MODULES}
+
+    def has(self, module: ModuleName) -> bool:
+        return self.path_for(module).exists()
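Reviewer note: the write-then-rename pattern used by `EpisodeStaging.write` can be exercised in isolation. The sketch below is a standalone approximation using only the standard library; `write_jsonl_atomic` and `read_jsonl` are hypothetical names, and the sample row is made up.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl_atomic(path: Path, rows: list[dict]) -> Path:
    """Write rows to a sibling .tmp file, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    with tmp_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False, sort_keys=True) + "\n")
    tmp_path.replace(path)  # the target only ever points at a complete file
    return path


def read_jsonl(path: Path) -> list[dict]:
    """Return parsed rows, or an empty list when the file is absent."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as staging_root:
    target = Path(staging_root) / "episode_000003" / "plan.jsonl"
    rows = [{"style": "subtask", "timestamp": 0.0, "content": "pick up cube"}]
    write_jsonl_atomic(target, rows)
    round_tripped = read_jsonl(target)
    leftover_tmp = target.with_suffix(".jsonl.tmp").exists()
```

A crash before the rename can leave a stale `.tmp` file behind, but readers never see it because they only open the final `.jsonl` path.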
diff --git a/src/lerobot/annotations/steerable_pipeline/validator.py b/src/lerobot/annotations/steerable_pipeline/validator.py
new file mode 100644
index 000000000..f08074c9a
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/validator.py
@@ -0,0 +1,332 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Pre-write validation against staged outputs.
+
+Runs after all three modules have written their per-episode artifacts but
+*before* the writer rewrites parquet shards. The validator never touches
+parquet; it only inspects the staging tree and the source frame timestamps
+exposed by :class:`EpisodeRecord`.
+
+Checks (per the plan's "Intermediate staging and validation" section):
+
+- exact timestamp alignment against source frame timestamps
+- no orphan speech / interjection pairs
+- plan / memory emission consistency (events have a paired persistent row)
+- VQA assistant ``content`` is valid JSON (one of bbox / keypoint / count /
+  attribute / spatial)
+- every row maps to its correct column under :func:`column_for_style`
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+from collections.abc import Iterable, Sequence
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from lerobot.datasets.language import (
+    LANGUAGE_EVENTS,
+    LANGUAGE_PERSISTENT,
+    column_for_style,
+    is_view_dependent_style,
+    validate_camera_field,
+)
+
+from .reader import EpisodeRecord
+from .staging import EpisodeStaging
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class ValidationReport:
+    """Outcome of one validation pass across all episodes."""
+
+    errors: list[str] = field(default_factory=list)
+    warnings: list[str] = field(default_factory=list)
+    episodes_checked: int = 0
+
+    @property
+    def ok(self) -> bool:
+        return not self.errors
+
+    def add_error(self, message: str) -> None:
+        self.errors.append(message)
+
+    def add_warning(self, message: str) -> None:
+        self.warnings.append(message)
+
+    def summary(self) -> str:
+        return f"checked={self.episodes_checked} errors={len(self.errors)} warnings={len(self.warnings)}"
+
+
+VQA_ANSWER_SHAPES: dict[str, set[str]] = {
+    "bbox": {"detections"},
+    "keypoint": {"label", "point_format", "point"},
+    "count": {"label", "count"},
+    "attribute": {"label", "attribute", "value"},
+    "spatial": {"subject", "relation", "object"},
+}
+
+
+def classify_vqa_answer(payload: Any) -> str | None:
+    """Best-effort classification of a VQA answer payload to a question type."""
+    if not isinstance(payload, dict):
+        return None
+    keys = set(payload.keys())
+    for kind, required in VQA_ANSWER_SHAPES.items():
+        if required.issubset(keys):
+            return kind
+    return None
+
+
+@dataclass
+class StagingValidator:
+    """Walks the staging tree and produces a :class:`ValidationReport`."""
+
+    timestamp_atol: float = 0.0  # exact-match by default
+    dataset_camera_keys: tuple[str, ...] | None = None
+    """Known ``observation.images.*`` keys on the dataset. When set, the
+    validator additionally enforces that every view-dependent row's
+    ``camera`` field references one of these keys. Pass ``None`` (default)
+    to skip that cross-check (e.g. in unit tests with no real dataset)."""
+
+    def validate(
+        self,
+        records: Sequence[EpisodeRecord],
+        staging_dir: Path,
+    ) -> ValidationReport:
+        report = ValidationReport()
+        for record in records:
+            self._validate_episode(record, staging_dir, report)
+            report.episodes_checked += 1
+        return report
+
+    def _validate_episode(
+        self,
+        record: EpisodeRecord,
+        staging_dir: Path,
+        report: ValidationReport,
+    ) -> None:
+        staging = EpisodeStaging(staging_dir, record.episode_index)
+        staged = staging.read_all()
+        all_rows: list[dict[str, Any]] = []
+        for module_name, rows in staged.items():
+            for row in rows:
+                row = {**row, "_module": module_name}
+                all_rows.append(row)
+
+        frame_ts = set(record.frame_timestamps)
+
+        events: list[dict[str, Any]] = []
+        persistent: list[dict[str, Any]] = []
+        for row in all_rows:
+            self._check_column_routing(row, report, record.episode_index)
+            self._check_camera_field(row, report, record.episode_index, self.dataset_camera_keys)
+            # ``_check_column_routing`` already recorded any unknown-style error;
+            # don't let the same ``column_for_style`` lookup raise here uncaught.
+            try:
+                column = column_for_style(row.get("style"))
+            except ValueError:
+                continue
+            if column == LANGUAGE_PERSISTENT:
+                persistent.append(row)
+            else:
+                events.append(row)
+
+        for row in events:
+            self._check_event_timestamp_alignment(row, frame_ts, report, record.episode_index)
+
+        self._check_speech_interjection_pairs(events, report, record.episode_index)
+        self._check_plan_memory_consistency(persistent, events, report, record.episode_index)
+        self._check_vqa_json(events, report, record.episode_index)
+        self._check_vqa_uniqueness_per_frame_camera(events, report, record.episode_index)
+
+    def _check_camera_field(
+        self,
+        row: dict[str, Any],
+        report: ValidationReport,
+        episode_index: int,
+        dataset_camera_keys: Sequence[str] | None,
+    ) -> None:
+        """Enforce the camera invariant + that the key matches the dataset's cameras."""
+        style = row.get("style")
+        camera = row.get("camera")
+        try:
+            validate_camera_field(style, camera)
+        except ValueError as exc:
+            report.add_error(f"ep={episode_index} module={row.get('_module')}: {exc}")
+            return
+        if is_view_dependent_style(style) and dataset_camera_keys and camera not in dataset_camera_keys:
+            report.add_error(
+                f"ep={episode_index} module={row.get('_module')}: camera {camera!r} on style "
+                f"{style!r} is not one of the dataset's video keys {sorted(dataset_camera_keys)!r}"
+            )
+
+    def _check_vqa_uniqueness_per_frame_camera(
+        self,
+        events: Iterable[dict[str, Any]],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        """Ensure at most one (vqa, user) and one (vqa, assistant) per (t, camera)."""
+        counts: dict[tuple[float, str, str], int] = {}
+        for row in events:
+            if row.get("style") != "vqa":
+                continue
+            ts = row.get("timestamp")
+            camera = row.get("camera")
+            role = row.get("role")
+            if ts is None or camera is None or role is None:
+                continue  # other validators flag these
+            key = (float(ts), str(camera), str(role))
+            counts[key] = counts.get(key, 0) + 1
+        for (ts, camera, role), n in counts.items():
+            if n > 1:
+                report.add_error(
+                    f"ep={episode_index}: {n} duplicate vqa rows at t={ts} "
+                    f"camera={camera!r} role={role!r}; expected at most one per (t, camera, role)"
+                )
+
+    def _check_column_routing(
+        self,
+        row: dict[str, Any],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        style = row.get("style")
+        module = row.get("_module")
+        try:
+            target_col = column_for_style(style)
+        except ValueError:
+            report.add_error(f"ep={episode_index} module={module}: unknown style {style!r}")
+            return
+        if module == "plan" and target_col != LANGUAGE_PERSISTENT:
+            report.add_error(
+                f"ep={episode_index} module=plan emitted style {style!r} that routes to {target_col} (must be persistent)"
+            )
+        if module in {"interjections", "vqa"} and target_col != LANGUAGE_EVENTS:
+            report.add_error(
+                f"ep={episode_index} module={module} emitted style {style!r} that routes to {target_col} (must be events)"
+            )
+
+    def _check_event_timestamp_alignment(
+        self,
+        row: dict[str, Any],
+        frame_ts: set[float],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        ts = row.get("timestamp")
+        if ts is None:
+            report.add_error(f"ep={episode_index}: event row missing timestamp: {row!r}")
+            return
+        if self.timestamp_atol == 0.0:
+            if float(ts) not in frame_ts:
+                report.add_error(
+                    f"ep={episode_index}: event row timestamp {ts!r} does not match any source frame timestamp"
+                )
+        else:
+            if not any(abs(float(ts) - f) <= self.timestamp_atol for f in frame_ts):
+                report.add_error(
+                    f"ep={episode_index}: event row timestamp {ts!r} not within {self.timestamp_atol}s of any frame"
+                )
+
+    def _check_speech_interjection_pairs(
+        self,
+        events: Iterable[dict[str, Any]],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        speech_ts: dict[float, int] = {}
+        interjection_ts: dict[float, int] = {}
+        for row in events:
+            ts = row.get("timestamp")
+            if ts is None:
+                continue
+            ts_f = float(ts)
+            if row.get("style") is None and row.get("role") == "assistant":
+                speech_ts[ts_f] = speech_ts.get(ts_f, 0) + 1
+            if row.get("style") == "interjection":
+                interjection_ts[ts_f] = interjection_ts.get(ts_f, 0) + 1
+
+        for ts in interjection_ts:
+            if ts not in speech_ts:
+                report.add_error(f"ep={episode_index}: interjection at t={ts} has no paired speech atom")
+
+    def _check_plan_memory_consistency(
+        self,
+        persistent: Sequence[dict[str, Any]],
+        events: Sequence[dict[str, Any]],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        plan_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "plan"})
+        memory_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "memory"})
+        subtask_ts = sorted({float(r["timestamp"]) for r in persistent if r.get("style") == "subtask"})
+        interjection_ts = sorted(
+            {
+                float(r["timestamp"])
+                for r in events
+                if r.get("style") == "interjection" and r.get("timestamp") is not None
+            }
+        )
+
+        if persistent and not plan_ts:
+            report.add_warning(f"ep={episode_index}: persistent rows present but no plan emitted")
+        # every interjection should have a same-timestamp plan refresh
+        for ts in interjection_ts:
+            if ts not in plan_ts:
+                report.add_error(
+                    f"ep={episode_index}: interjection at t={ts} has no co-timestamped plan update"
+                )
+        # memory should be emitted at subtask boundaries (subset relation)
+        if memory_ts and subtask_ts:
+            mem_set = set(memory_ts)
+            sub_set = set(subtask_ts)
+            stray = sorted(mem_set - sub_set)
+            if stray:
+                report.add_warning(f"ep={episode_index}: memory rows at {stray} not at any subtask boundary")
+
+    def _check_vqa_json(
+        self,
+        events: Iterable[dict[str, Any]],
+        report: ValidationReport,
+        episode_index: int,
+    ) -> None:
+        for row in events:
+            if row.get("style") != "vqa" or row.get("role") != "assistant":
+                continue
+            content = row.get("content")
+            if content is None:
+                report.add_error(
+                    f"ep={episode_index}: VQA assistant row at t={row.get('timestamp')} has null content"
+                )
+                continue
+            try:
+                payload = json.loads(content)
+            except (TypeError, ValueError) as exc:
+                report.add_error(
+                    f"ep={episode_index}: VQA assistant content not valid JSON at t={row.get('timestamp')}: {exc}"
+                )
+                continue
+            shape = classify_vqa_answer(payload)
+            if shape is None:
+                report.add_error(
+                    f"ep={episode_index}: VQA assistant payload at t={row.get('timestamp')} does not match any known shape: keys={list(payload) if isinstance(payload, dict) else type(payload).__name__}"
+                )
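
The two purely local checks in the validator above (VQA answer shape and event timestamp alignment) can be tried without a staging tree. The following is a minimal standalone sketch, not part of the patch: `VQA_ANSWER_SHAPES` and `classify_vqa_answer` are copied from the file above, while `timestamp_is_aligned` is a hypothetical helper that only mirrors the logic of `_check_event_timestamp_alignment`.

```python
from __future__ import annotations

import json
from typing import Any

# Copied from validator.py above: the keys each VQA answer kind must carry.
VQA_ANSWER_SHAPES: dict[str, set[str]] = {
    "bbox": {"detections"},
    "keypoint": {"label", "point_format", "point"},
    "count": {"label", "count"},
    "attribute": {"label", "attribute", "value"},
    "spatial": {"subject", "relation", "object"},
}


def classify_vqa_answer(payload: Any) -> str | None:
    """Return the first kind whose required keys are all present, else None."""
    if not isinstance(payload, dict):
        return None
    keys = set(payload.keys())
    for kind, required in VQA_ANSWER_SHAPES.items():
        if required.issubset(keys):
            return kind
    return None


def timestamp_is_aligned(ts: float, frame_ts: set[float], atol: float = 0.0) -> bool:
    """Hypothetical helper mirroring the validator's alignment rule."""
    if atol == 0.0:
        # Exact float equality against the source frame timestamps.
        return float(ts) in frame_ts
    return any(abs(float(ts) - f) <= atol for f in frame_ts)


# Illustrative values only; real rows come from the staging tree.
count_payload = json.loads('{"label": "cube", "count": 2}')
kind = classify_vqa_answer(count_payload)
frames = {0.0, 0.1, 0.2}
exact_hit = timestamp_is_aligned(0.1, frames)
near_miss = timestamp_is_aligned(0.1 + 1e-9, frames)
near_hit = timestamp_is_aligned(0.1 + 1e-9, frames, atol=1e-6)
```

Because classification returns the first matching kind in dictionary order, a payload that satisfies several shapes is reported under the earliest one. With the default `atol=0.0`, a timestamp that differs from a frame timestamp by even a rounding error fails the check, which is why the sketch includes a case with a small tolerance.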
diff --git a/src/lerobot/annotations/steerable_pipeline/vlm_client.py b/src/lerobot/annotations/steerable_pipeline/vlm_client.py
new file mode 100644
index 000000000..9a86317f1
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/vlm_client.py
@@ -0,0 +1,617 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Shared Qwen-VL client.
+
+The pipeline uses a single shared VLM across modules. Real inference goes
+through an OpenAI-compatible server (``backend="openai"``), typically vLLM
+or ``transformers serve``. A ``stub`` backend is used for unit tests so
+fixtures never call into a real model.
+
+The client speaks one method, :meth:`VlmClient.generate_json`, which:
+
+- accepts a batch of OpenAI/HF-style multimodal message lists,
+- parses each reply as JSON (stripping ``<think>`` blocks and code fences),
+- fans requests out concurrently when ``client_concurrency`` allows,
+- and reprompts once on a JSON parse failure with an inline correction
+  message, returning ``None`` for that item if the retry also fails.
+"""
+
+from __future__ import annotations
+
+import atexit
+import base64
+import io
+import json
+import os
+import shlex
+import signal
+import subprocess
+import sys
+import threading
+import time
+import urllib.request
+from collections.abc import Callable, Sequence
+from concurrent.futures import ThreadPoolExecutor
+from dataclasses import dataclass
+from typing import Any, Protocol
+
+from .config import VlmConfig
+
+
+class VlmClient(Protocol):
+    """Protocol every backend must implement."""
+
+    def generate_json(
+        self,
+        messages_batch: Sequence[Sequence[dict[str, Any]]],
+        *,
+        max_new_tokens: int | None = None,
+        temperature: float | None = None,
+    ) -> list[Any]:
+        """Generate one JSON-decoded response per messages list."""
+
+
+@dataclass
+class StubVlmClient:
+    """Deterministic stub used in unit tests.
+
+    A test passes a callable that maps the full message list of one request
+    to a JSON-serializable response; it is called once per batch item.
+    """
+
+    responder: Callable[[Sequence[dict[str, Any]]], Any]
+
+    def generate_json(
+        self,
+        messages_batch: Sequence[Sequence[dict[str, Any]]],
+        *,
+        max_new_tokens: int | None = None,
+        temperature: float | None = None,
+    ) -> list[Any]:
+        return [self.responder(list(messages)) for messages in messages_batch]
+
+
+def _strip_to_json(text: str) -> Any:
+    text = text.strip()
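+    # Strip <think>...</think> blocks (Qwen3 Thinking style). Only remove a
+    # block whose closing tag follows its opening tag; a stray "</think>"
+    # before "<think>" would otherwise make ``find`` return -1 and loop forever.
+    while (start := text.find("<think>")) != -1 and (close := text.find("</think>", start)) != -1:
+        text = (text[:start] + text[close + len("</think>") :]).strip()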
+    # Strip ```json ... ``` fences from chat-tuned backbones
+    if text.startswith("```"):
+        first = text.find("\n")
+        last = text.rfind("```")
+        if first != -1 and last != -1 and last > first:
+            text = text[first + 1 : last].strip()
+    try:
+        return json.loads(text)
+    except (ValueError, json.JSONDecodeError):
+        pass
+    # Fall back to extracting the first balanced {...} block.
+    obj_text = _extract_first_json_object(text)
+    if obj_text is None:
+        raise json.JSONDecodeError("No JSON object found", text, 0)
+    return json.loads(obj_text)
+
+
+def _extract_first_json_object(text: str) -> str | None:
+    """Return the first balanced ``{...}`` substring, ignoring braces in
+    string literals. Returns ``None`` if no balanced block is found."""
+    start = text.find("{")
+    if start < 0:
+        return None
+    depth = 0
+    in_string = False
+    escape = False
+    for i in range(start, len(text)):
+        ch = text[i]
+        if escape:
+            escape = False
+            continue
+        if ch == "\\":
+            escape = True
+            continue
+        # Note: ``escape`` is always False here — the ``if escape`` branch
+        # above already handled and reset it.
+        if ch == '"':
+            in_string = not in_string
+            continue
+        if in_string:
+            continue
+        if ch == "{":
+            depth += 1
+        elif ch == "}":
+            depth -= 1
+            if depth == 0:
+                return text[start : i + 1]
+    return None
+
+
+@dataclass
+class _GenericTextClient:
+    """Wraps any text-generation callable in JSON-mode + one-retry semantics."""
+
+    generate_text: Callable[[Sequence[Sequence[dict[str, Any]]], int, float], list[str]]
+    config: VlmConfig
+
+    def generate_json(
+        self,
+        messages_batch: Sequence[Sequence[dict[str, Any]]],
+        *,
+        max_new_tokens: int | None = None,
+        temperature: float | None = None,
+    ) -> list[Any]:
+        max_tok = max_new_tokens if max_new_tokens is not None else self.config.max_new_tokens
+        temp = temperature if temperature is not None else self.config.temperature
+        raw = self.generate_text(messages_batch, max_tok, temp)
+        out: list[Any] = []
+        for messages, text in zip(messages_batch, raw, strict=True):
+            try:
+                out.append(_strip_to_json(text))
+                continue
+            except (ValueError, json.JSONDecodeError):
+                pass
+            retry = list(messages) + [
+                {"role": "assistant", "content": text},
+                {
+                    "role": "user",
+                    "content": (
+                        "Your previous reply was not valid JSON. "
+                        "Reply with strictly valid JSON, no prose, no fences."
+                    ),
+                },
+            ]
+            retry_text = self.generate_text([retry], max_tok, temp)[0]
+            try:
+                out.append(_strip_to_json(retry_text))
+            except (ValueError, json.JSONDecodeError):
+                # After retry: log preview and return None instead of crashing
+                # the whole pipeline. Modules treat None as "skip".
+                preview = retry_text.strip().replace("\n", " ")[:200]
+                print(
+                    f"[vlm] WARNING: failed to parse JSON after retry; preview: {preview!r}",
+                    flush=True,
+                )
+                out.append(None)
+        return out
+
+
+def make_vlm_client(config: VlmConfig) -> VlmClient:
+    """Build the shared VLM client.
+
+    Only the ``openai`` backend is supported for now. The shipped workflow
+    is Hugging Face Jobs (``examples/annotations/run_hf_job.py``): it boots
+    a vLLM server inside the ``vllm/vllm-openai`` image and the pipeline
+    talks to it over the OpenAI-compatible API (``--vlm.backend=openai``,
+    optionally auto-spawning the server via ``auto_serve`` /
+    ``serve_command``). The former in-process ``vllm`` / ``transformers``
+    backends were removed to keep the support surface to the HF Jobs path.
+
+    For ``stub``, construct :class:`StubVlmClient` directly with a responder
+    callable; it is rejected here to make accidental misuse obvious.
+    """
+    if config.backend == "openai":
+        return _make_openai_client(config)
+    if config.backend == "stub":
+        raise ValueError(
+            "Use StubVlmClient(...) directly for the stub backend; make_vlm_client builds real clients."
+        )
+    if config.backend in {"vllm", "transformers"}:
+        raise ValueError(
+            f"backend={config.backend!r} (in-process local model) is not supported for now — "
+            "only backend='openai' (the Hugging Face Jobs flow) is. Run the pipeline via "
+            "examples/annotations/run_hf_job.py, which serves the model with vLLM in the "
+            "vllm/vllm-openai image and talks to it over the OpenAI-compatible API."
+        )
+    raise ValueError(f"Unknown VLM backend: {config.backend!r}")
+
+
+def _make_openai_client(config: VlmConfig) -> VlmClient:
+    """Backend that talks to any OpenAI-compatible server.
+
+    Compatible with ``vllm serve``, ``transformers serve``,
+    ``ktransformers serve``, and hosted endpoints. By default the server
+    is expected to be already running. Set ``auto_serve=True`` to spawn
+    one (``transformers serve`` by default, or ``vllm serve`` replicas when
+    ``parallel_servers > 1``), wait for readiness, and stop it on exit.
+
+    Image blocks ``{"type":"image", "image":<PIL.Image>}`` are
+    auto-converted to ``image_url`` data-URLs. Video blocks
+    ``{"type":"video", "video":[<PIL>...]}`` are forwarded as
+    multi-frame ``video_url`` items where supported.
+    """
+    try:
+        from openai import OpenAI  # type: ignore[import-not-found]
+    except ImportError as exc:
+        raise ImportError(
+            "openai package is required for backend='openai'. Install with `pip install openai`."
+        ) from exc
+
+    api_base = config.api_base
+    api_key = config.api_key
+    auto_serve = config.auto_serve
+    api_bases: list[str] = [api_base]
+
+    print(
+        f"[lerobot-annotate] backend=openai model={config.model_id} "
+        f"api_base={api_base} auto_serve={auto_serve}",
+        flush=True,
+    )
+    if auto_serve:
+        if config.parallel_servers > 1:
+            print(
+                f"[lerobot-annotate] spawning {config.parallel_servers} parallel servers",
+                flush=True,
+            )
+            api_bases = _spawn_parallel_inference_servers(config)
+        elif _server_is_up(api_base):
+            print(f"[lerobot-annotate] reusing server already up at {api_base}", flush=True)
+        else:
+            print("[lerobot-annotate] no server reachable; spawning one", flush=True)
+            api_base = _spawn_inference_server(config)
+            api_bases = [api_base]
+            print(f"[lerobot-annotate] server ready at {api_base}", flush=True)
+
+    clients = [OpenAI(base_url=base, api_key=api_key) for base in api_bases]
+    # round-robin counter for parallel mode
+    rr_counter = {"i": 0}
+
+    # ``mm_processor_kwargs`` is a vllm-specific extra; transformers serve
+    # rejects it with HTTP 422. Send it only when explicitly opted in via
+    # an env var (e.g. ``LEROBOT_OPENAI_SEND_MM_KWARGS=1`` for vllm).
+    send_mm_kwargs = os.environ.get("LEROBOT_OPENAI_SEND_MM_KWARGS", "").lower() in {"1", "true", "yes"}
+
+    rr_lock = threading.Lock()
+
+    def _one_call(messages: Sequence[dict[str, Any]], max_tok: int, temp: float) -> str:
+        api_messages, mm_kwargs = _to_openai_messages(messages)
+        kwargs: dict[str, Any] = {
+            "model": config.model_id,
+            "messages": api_messages,
+            "max_tokens": max_tok,
+            "temperature": temp,
+        }
+        extra_body: dict[str, Any] = {}
+        if send_mm_kwargs and mm_kwargs:
+            extra_body["mm_processor_kwargs"] = {**mm_kwargs, "do_sample_frames": True}
+        if config.chat_template_kwargs:
+            extra_body["chat_template_kwargs"] = config.chat_template_kwargs
+        if extra_body:
+            kwargs["extra_body"] = extra_body
+        with rr_lock:
+            chosen = clients[rr_counter["i"] % len(clients)]
+            rr_counter["i"] += 1
+        response = chosen.chat.completions.create(**kwargs)
+        return response.choices[0].message.content or ""
+
+    def _gen(batch: Sequence[Sequence[dict[str, Any]]], max_tok: int, temp: float) -> list[str]:
+        if len(batch) <= 1 or config.client_concurrency <= 1:
+            return [_one_call(messages, max_tok, temp) for messages in batch]
+        # Parallel fan-out — vllm batches these on the server side.
+        max_workers = min(config.client_concurrency, len(batch))
+        with ThreadPoolExecutor(max_workers=max_workers) as pool:
+            futures = [pool.submit(_one_call, messages, max_tok, temp) for messages in batch]
+            return [f.result() for f in futures]
+
+    return _GenericTextClient(_gen, config)
+
+
+def _bind_serve_port(cmd: str, port: int) -> str:
+    """Bind a serve command to ``port``: substitute a ``{port}`` placeholder
+    if present, else append ``--port`` when the command omits it (leaving an
+    explicit ``--port`` untouched). Shared by the single- and parallel-server
+    paths so a serve_command never reaches the server with a literal
+    ``{port}``."""
+    if "{port}" in cmd:
+        return cmd.replace("{port}", str(port))
+    if "--port" not in cmd:
+        return f"{cmd} --port {port}"
+    return cmd
+
+
+def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
+    """Spawn ``config.parallel_servers`` independent vllm replicas.
+
+    Each replica:
+    - is pinned to a single GPU via ``CUDA_VISIBLE_DEVICES``
+    - listens on ``serve_port + i``
+    - is shut down via the same atexit hook as the single-server path
+
+    Returns the list of ``api_base`` URLs the client should round-robin
+    across.
+    """
+    n = config.parallel_servers
+    api_bases: list[str] = []
+    procs: list[subprocess.Popen] = []
+    ready_events: list[threading.Event] = []
+    # Multiple readiness signals — uvicorn's own banner is suppressed at
+    # ``--uvicorn-log-level warning``, so we also accept vllm's own
+    # "Starting vLLM API server" line and the route-listing line. The
+    # HTTP probe below is the ultimate fallback.
+    ready_markers = (
+        "Uvicorn running",
+        "Application startup complete",
+        "Starting vLLM API server",
+        "Available routes are",
+    )
+    # Single lock for all server-stream threads so multibyte chars from
+    # different servers don't interleave and tear UTF-8 sequences.
+    print_lock = threading.Lock()
+
+    base_cmd = config.serve_command or (
+        f"vllm serve {shlex.quote(config.model_id)} "
+        f"--tensor-parallel-size 1 "
+        f"--max-model-len {config.max_model_len or 32768} "
+        f"--uvicorn-log-level warning"
+    )
+
+    num_gpus = config.num_gpus if config.num_gpus > 0 else n
+    for i in range(n):
+        port = config.serve_port + i
+        gpu = i % num_gpus
+        env = os.environ.copy()
+        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
+        cmd = _bind_serve_port(base_cmd, port)
+        api_base = f"http://localhost:{port}/v1"
+        api_bases.append(api_base)
+        print(f"[server-{i}] launching on GPU {gpu} port {port}: {cmd}", flush=True)
+        proc = subprocess.Popen(
+            shlex.split(cmd),
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+            text=True,
+            bufsize=1,
+            env=env,
+        )
+        procs.append(proc)
+        ready = threading.Event()
+        ready_events.append(ready)
+
+        def _stream(idx: int, p: subprocess.Popen, ev: threading.Event) -> None:
+            # Read whole lines and emit each line atomically under the
+            # shared print_lock so output from N servers stays readable.
+            assert p.stdout is not None
+            for line in iter(p.stdout.readline, ""):
+                with print_lock:
+                    sys.stdout.write(f"[server-{idx}] {line}")
+                    if not line.endswith(("\n", "\r")):
+                        sys.stdout.write("\n")
+                    sys.stdout.flush()
+                if any(m in line for m in ready_markers):
+                    ev.set()
+
+        threading.Thread(target=_stream, args=(i, proc, ready), daemon=True).start()
+
+        def _probe(idx: int, base: str, ev: threading.Event, p: subprocess.Popen) -> None:
+            while not ev.is_set() and p.poll() is None:
+                if _server_is_up(base):
+                    print(f"[server-{idx}] ready (http probe)", flush=True)
+                    ev.set()
+                    return
+                time.sleep(2)
+
+        threading.Thread(target=_probe, args=(i, api_base, ready, proc), daemon=True).start()
+
+    def _shutdown() -> None:
+        for i, p in enumerate(procs):
+            if p.poll() is None:
+                print(f"[server-{i}] stopping pid={p.pid}", flush=True)
+                p.send_signal(signal.SIGINT)
+        for p in procs:
+            try:
+                p.wait(timeout=15)
+            except subprocess.TimeoutExpired:
+                p.kill()
+                p.wait(timeout=5)
+
+    atexit.register(_shutdown)
+
+    deadline = time.monotonic() + config.serve_ready_timeout_s
+    while any(not ev.is_set() for ev in ready_events) and time.monotonic() < deadline:
+        for i, p in enumerate(procs):
+            if p.poll() is not None:
+                raise RuntimeError(
+                    f"[server-{i}] inference server exited unexpectedly with rc={p.returncode}"
+                )
+        time.sleep(2)
+    if any(not ev.is_set() for ev in ready_events):
+        raise RuntimeError(f"[server] not all replicas became ready within {config.serve_ready_timeout_s}s")
+    print(f"[lerobot-annotate] all {n} servers ready: {api_bases}", flush=True)
+    return api_bases
+
+
+def _server_is_up(api_base: str) -> bool:
+    """Return True if ``api_base/models`` answers 200 within 2 seconds."""
+    url = api_base.rstrip("/") + "/models"
+    # ``api_base`` is the user-configured local-server URL we just spawned
+    # or the user passed in via ``--vlm.api_base``; the bandit B310 warning
+    # is for arbitrary user-controlled URLs with file:/ schemes which
+    # cannot reach this code path.
+    try:
+        with urllib.request.urlopen(url, timeout=2) as resp:  # noqa: S310  # nosec B310
+            return resp.status == 200
+    except Exception:  # noqa: BLE001
+        return False
+
+
+def _spawn_inference_server(config: VlmConfig) -> str:
+    """Spawn ``transformers serve`` (or ``serve_command``), wait until it
+    accepts ``/v1/models``, and register a shutdown hook.
+
+    Streams the server's stdout/stderr to the parent terminal in
+    real-time on a background thread so users can see model-load
+    progress and errors as they happen.
+
+    Returns the full ``api_base`` URL the OpenAI client should use.
+    """
+    cmd = config.serve_command
+    if not cmd:
+        cmd = (
+            f"transformers serve {shlex.quote(config.model_id)} "
+            f"--port {config.serve_port} --continuous-batching"
+        )
+    # Bind the single server to ``serve_port`` (what ``api_base`` below
+    # targets): substitute a literal ``{port}`` placeholder, else append
+    # ``--port``. Without this a serve_command carrying ``{port}`` would
+    # reach the server unsubstituted and fail to parse.
+    cmd = _bind_serve_port(cmd, config.serve_port)
+    api_base = f"http://localhost:{config.serve_port}/v1"
+    print(f"[server] launching: {cmd}", flush=True)
+    proc = subprocess.Popen(
+        shlex.split(cmd),
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        bufsize=1,
+    )
+
+    # Watch the server output for a readiness banner in addition to the
+    # HTTP probe below. The banner is the more reliable signal because
+    # transformers serve rescans its cache on every model-list request,
+    # which can exceed the urllib timeout so the probe never succeeds.
+    ready_event = threading.Event()
+    # See _spawn_parallel_inference_servers for why we accept these.
+    ready_markers = (
+        "Uvicorn running",
+        "Application startup complete",
+        "Starting vLLM API server",
+        "Available routes are",
+    )
+
+    def _probe() -> None:
+        while not ready_event.is_set() and proc.poll() is None:
+            if _server_is_up(api_base):
+                print("[server] ready (http probe)", flush=True)
+                ready_event.set()
+                return
+            time.sleep(2)
+
+    threading.Thread(target=_probe, daemon=True).start()
+
+    def _stream_output() -> None:
+        # Read one character at a time instead of iterating lines so tqdm
+        # progress bars (which overwrite using \r) flush in real time.
+        assert proc.stdout is not None
+        buf = ""
+        prefix_started = False
+        while True:
+            ch = proc.stdout.read(1)
+            if ch == "":
+                # process exited; flush any tail
+                if buf:
+                    sys.stdout.write(buf)
+                    sys.stdout.flush()
+                return
+            if not prefix_started:
+                sys.stdout.write("[server] ")
+                prefix_started = True
+            sys.stdout.write(ch)
+            sys.stdout.flush()
+            buf += ch
+            if ch in ("\n", "\r"):
+                if any(marker in buf for marker in ready_markers):
+                    ready_event.set()
+                buf = ""
+                prefix_started = False
+
+    threading.Thread(target=_stream_output, daemon=True).start()
+
+    def _shutdown() -> None:
+        if proc.poll() is None:
+            print(f"[server] stopping pid={proc.pid}", flush=True)
+            proc.send_signal(signal.SIGINT)
+            try:
+                proc.wait(timeout=15)
+            except subprocess.TimeoutExpired:
+                proc.kill()
+                proc.wait(timeout=5)
+
+    atexit.register(_shutdown)
+
+    deadline = time.monotonic() + config.serve_ready_timeout_s
+    while time.monotonic() < deadline:
+        if proc.poll() is not None:
+            raise RuntimeError(
+                f"[server] inference server exited unexpectedly with rc={proc.returncode}. "
+                f"See [server] log lines above for the cause."
+            )
+        if ready_event.wait(timeout=2):
+            return api_base
+    proc.terminate()
+    raise RuntimeError(f"[server] did not become ready within {config.serve_ready_timeout_s}s")
+
+
+def _to_openai_messages(
+    messages: Sequence[dict[str, Any]],
+) -> tuple[list[dict[str, Any]], dict[str, Any]]:
+    """Convert internal messages to OpenAI chat format.
+
+    Returns ``(api_messages, mm_kwargs)``. Multimodal-processor kwargs (``fps``
+    from ``video_url`` blocks) are extracted so the caller can pass them via
+    ``extra_body.mm_processor_kwargs`` rather than inside the content blocks
+    (which ``transformers serve`` rejects).
+
+    ``video`` frame lists become ``image_url`` blocks; ``file://`` video URLs become base64 data URLs.
+    """
+    out_messages: list[dict[str, Any]] = []
+    mm_kwargs: dict[str, Any] = {}
+    for message in messages:
+        content = message.get("content")
+        if not isinstance(content, list):
+            out_messages.append({"role": message["role"], "content": content})
+            continue
+        out_blocks: list[dict[str, Any]] = []
+        for block in content:
+            block_type = block.get("type") if isinstance(block, dict) else None
+            if block_type == "text":
+                out_blocks.append({"type": "text", "text": block.get("text", "")})
+            elif block_type == "image":
+                out_blocks.append(
+                    {"type": "image_url", "image_url": {"url": _pil_to_data_url(block["image"])}}
+                )
+            elif block_type == "video":
+                frames = block.get("video", [])
+                for img in frames:
+                    out_blocks.append({"type": "image_url", "image_url": {"url": _pil_to_data_url(img)}})
+            elif block_type == "video_url":
+                video_url = dict(block["video_url"])
+                url = video_url.get("url", "")
+                if url.startswith("file://"):
+                    video_url["url"] = _file_to_data_url(url[len("file://") :])
+                out_blocks.append({"type": "video_url", "video_url": video_url})
+                fps = block.get("fps")
+                if fps is not None:
+                    mm_kwargs["fps"] = fps
+            else:
+                out_blocks.append(block)
+        out_messages.append({"role": message["role"], "content": out_blocks})
+    return out_messages, mm_kwargs
+
+
+def _file_to_data_url(path: str) -> str:
+    """Read a local video file and return a base64 ``data:video/mp4`` URL."""
+    with open(path, "rb") as f:
+        b64 = base64.b64encode(f.read()).decode("ascii")
+    return f"data:video/mp4;base64,{b64}"
+
+
+def _pil_to_data_url(image: Any) -> str:
+    """Encode a PIL.Image as a base64 data URL."""
+    buf = io.BytesIO()
+    image.save(buf, format="PNG")
+    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
+    return f"data:image/png;base64,{b64}"
diff --git a/src/lerobot/annotations/steerable_pipeline/writer.py b/src/lerobot/annotations/steerable_pipeline/writer.py
new file mode 100644
index 000000000..e1a544c80
--- /dev/null
+++ b/src/lerobot/annotations/steerable_pipeline/writer.py
@@ -0,0 +1,341 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Final parquet rewrite.
+
+For every episode the writer:
+
+1. reads the staged module outputs,
+2. partitions them into a persistent slice (PERSISTENT_STYLES) and an event
+   slice (EVENT_ONLY_STYLES + style=None tool-call atoms),
+3. sorts each slice deterministically,
+4. broadcasts the persistent slice across every frame in the episode,
+5. for each frame, materializes the sublist of event rows whose timestamp
+   exactly equals that frame's timestamp,
+6. drops the legacy ``subtask_index`` column,
+7. writes the parquet shard back in place.
+
+The writer does NOT add a dataset-level ``tools`` column. Tool *calls* are
+emitted per-row via the existing ``tool_calls`` field on the v3.1 row
+struct for every speech atom. The tool *schema* (the description
+of the ``say`` function and its parameters) is a fixed code constant —
+``SAY_TOOL_SCHEMA`` below — and downstream chat-template consumers import
+it directly rather than reading a redundant per-row column.
+
+Invariants enforced here (and re-checked by the validator):
+
+- per-episode persistent slice is byte-identical across every frame;
+- ``language_events`` rows on a frame all have ``timestamp == frame_ts``
+  (timestamps come straight from the source parquet — never recomputed);
+- every row passes ``column_for_style(style)``.
+"""
+
+from __future__ import annotations
+
+import logging
+from collections import defaultdict
+from collections.abc import Sequence
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+from lerobot.datasets.language import (
+    EVENT_ONLY_STYLES,
+    LANGUAGE_EVENTS,
+    LANGUAGE_PERSISTENT,
+    PERSISTENT_STYLES,
+    column_for_style,
+    validate_camera_field,
+)
+
+from .reader import EpisodeRecord
+from .staging import EpisodeStaging
+
+logger = logging.getLogger(__name__)
+
+
+# Tool schema constants live in lerobot.datasets.language — single
+# source of truth. Re-exported here so existing imports
+# (``from lerobot.annotations.steerable_pipeline.writer import SAY_TOOL_SCHEMA``)
+# keep working.
+from lerobot.datasets.language import DEFAULT_TOOLS, SAY_TOOL_SCHEMA  # noqa: F401, E402
+
+
+def _row_persistent_sort_key(row: dict[str, Any]) -> tuple:
+    return (float(row["timestamp"]), row.get("style") or "", row.get("role") or "")
+
+
+def _row_event_sort_key(row: dict[str, Any]) -> tuple:
+    # events are bucketed per-frame, but within a frame we still want determinism
+    return (
+        row.get("style") or "",
+        row.get("role") or "",
+        row.get("camera") or "",
+    )
+
+
+def _normalize_row(row: dict[str, Any], style: str | None, *, with_timestamp: bool) -> dict[str, Any]:
+    """Coerce a staged row into the language-column struct shape.
+
+    Key order matches ``PERSISTENT_ROW_FIELDS`` / ``EVENT_ROW_FIELDS`` — the
+    writer infers the parquet struct schema from insertion order, so
+    ``timestamp`` (persistent rows only) sits between ``style`` and ``camera``.
+    """
+    camera = row.get("camera")
+    validate_camera_field(style, camera)
+    out: dict[str, Any] = {
+        "role": str(row["role"]),
+        "content": None if row.get("content") is None else str(row["content"]),
+        "style": style,
+    }
+    if with_timestamp:
+        out["timestamp"] = float(row["timestamp"])
+    out["camera"] = None if camera is None else str(camera)
+    out["tool_calls"] = _normalize_tool_calls(row.get("tool_calls"))
+    return out
+
+
+def _normalize_persistent_row(row: dict[str, Any]) -> dict[str, Any]:
+    """Coerce a staged row into the persistent column's struct shape."""
+    style = row.get("style")
+    if style not in PERSISTENT_STYLES:
+        raise ValueError(
+            f"persistent slice contains row with non-persistent style {style!r}; "
+            "row would be misrouted under column_for_style()"
+        )
+    if "timestamp" not in row:
+        raise ValueError(f"persistent row missing timestamp: {row!r}")
+    if "role" not in row:
+        # Friendly error from the writer instead of a raw KeyError below;
+        # the validator doesn't check ``role`` yet.
+        raise ValueError(f"persistent row missing role: {row!r}")
+    return _normalize_row(row, style, with_timestamp=True)
+
+
+def _normalize_event_row(row: dict[str, Any]) -> dict[str, Any]:
+    """Coerce a staged row into the event column's struct shape (no timestamp)."""
+    style = row.get("style")
+    if style is not None and style not in EVENT_ONLY_STYLES:
+        raise ValueError(
+            f"event slice contains row with style {style!r}; expected None or one of {EVENT_ONLY_STYLES}"
+        )
+    if column_for_style(style) != LANGUAGE_EVENTS:
+        raise ValueError(f"event row with style {style!r} would not route to language_events")
+    if "role" not in row:
+        raise ValueError(f"event row missing role: {row!r}")
+    return _normalize_row(row, style, with_timestamp=False)
+
+
+def _normalize_tool_calls(value: Any) -> list[Any] | None:
+    if value is None:
+        return None
+    if not isinstance(value, list):
+        raise ValueError(f"tool_calls must be a list or None, got {type(value).__name__}")
+    return list(value)
+
+
+def _validate_atom_invariants(row: dict[str, Any]) -> None:
+    """Require content or tool_calls (or both); style=None additionally requires tool_calls."""
+    has_content = row.get("content") is not None
+    has_tools = row.get("tool_calls") is not None
+    if not (has_content or has_tools):
+        raise ValueError(f"row has neither content nor tool_calls: {row!r}")
+    if row.get("style") is None and not has_tools:
+        raise ValueError(f"style=None requires tool_calls: {row!r}")
+
+
+def _validate_speech_atom(row: dict[str, Any]) -> None:
+    """Speech atoms: role=assistant, style=None, content=None, say tool call."""
+    if row.get("style") is not None:
+        return  # not a speech atom
+    if row.get("role") != "assistant":
+        raise ValueError(f"speech atom must have role=assistant: {row!r}")
+    if row.get("content") is not None:
+        raise ValueError(f"speech atom must have content=null: {row!r}")
+    tool_calls = row.get("tool_calls")
+    if not tool_calls or not isinstance(tool_calls, list):
+        raise ValueError(f"speech atom must have non-empty tool_calls list: {row!r}")
+    first = tool_calls[0]
+    if not isinstance(first, dict):
+        raise ValueError(f"speech atom tool_calls[0] must be a dict: {row!r}")
+    if first.get("type") != "function":
+        raise ValueError(f"speech atom tool_calls[0].type must be 'function': {row!r}")
+    fn = first.get("function") or {}
+    if fn.get("name") != "say":
+        raise ValueError(f"speech atom tool_calls[0].function.name must be 'say': {row!r}")
+    args = fn.get("arguments") or {}
+    if not isinstance(args, dict) or "text" not in args or not isinstance(args["text"], str):
+        raise ValueError(f"speech atom must carry 'text' string in arguments: {row!r}")
+
+
+@dataclass
+class LanguageColumnsWriter:
+    """Rewrite ``data/chunk-*/file-*.parquet`` with the two language columns."""
+
+    drop_existing_subtask_index: bool = True
+
+    def write_all(
+        self,
+        records: Sequence[EpisodeRecord],
+        staging_dir: Path,
+        root: Path,
+    ) -> list[Path]:
+        episodes_by_path: dict[Path, list[EpisodeRecord]] = defaultdict(list)
+        for record in records:
+            episodes_by_path[record.data_path].append(record)
+
+        written: list[Path] = []
+        for path, eps in episodes_by_path.items():
+            self._rewrite_one(path, eps, staging_dir, root)
+            written.append(path)
+        return written
+
+    def _rewrite_one(
+        self,
+        path: Path,
+        episodes: Sequence[EpisodeRecord],
+        staging_dir: Path,
+        root: Path,
+    ) -> None:
+        table = pq.read_table(path)
+        n_rows = table.num_rows
+
+        # Ensure we cover every episode in the file. Episodes that don't have
+        # staging artifacts are passed through with empty annotation lists —
+        # this keeps the writer idempotent and safe for partial reruns.
+        staged_per_ep: dict[int, dict[str, list[dict[str, Any]]]] = {}
+        for record in episodes:
+            staging = EpisodeStaging(staging_dir, record.episode_index)
+            staged_per_ep[record.episode_index] = staging.read_all()
+
+        persistent_by_ep: dict[int, list[dict[str, Any]]] = {}
+        events_by_ep_ts: dict[int, dict[float, list[dict[str, Any]]]] = {}
+
+        for ep_index, ep_staged in staged_per_ep.items():
+            persistent_rows: list[dict[str, Any]] = []
+            event_rows: list[dict[str, Any]] = []  # carry timestamp until bucketed
+            for _module_name, rows in ep_staged.items():
+                for row in rows:
+                    style = row.get("style")
+                    if column_for_style(style) == LANGUAGE_PERSISTENT:
+                        persistent_rows.append(row)
+                    else:
+                        event_rows.append(row)
+
+            persistent_rows.sort(key=_row_persistent_sort_key)
+            normalized_persistent = []
+            for r in persistent_rows:
+                _validate_atom_invariants(r)
+                _validate_speech_atom(r)
+                normalized_persistent.append(_normalize_persistent_row(r))
+            persistent_by_ep[ep_index] = normalized_persistent
+
+            buckets: dict[float, list[dict[str, Any]]] = defaultdict(list)
+            for r in event_rows:
+                _validate_atom_invariants(r)
+                _validate_speech_atom(r)
+                ts = float(r["timestamp"])
+                buckets[ts].append(_normalize_event_row(r))
+            for ts in list(buckets.keys()):
+                buckets[ts].sort(key=_row_event_sort_key)
+            events_by_ep_ts[ep_index] = buckets
+
+        episode_col = (
+            table.column("episode_index").to_pylist() if "episode_index" in table.column_names else None
+        )
+        ts_col = table.column("timestamp").to_pylist() if "timestamp" in table.column_names else None
+        if episode_col is None or ts_col is None:
+            raise ValueError(f"{path} is missing 'episode_index' or 'timestamp' — required by the writer.")
+
+        per_row_persistent: list[list[dict[str, Any]]] = []
+        per_row_events: list[list[dict[str, Any]]] = []
+        for i in range(n_rows):
+            ep = episode_col[i]
+            ts = float(ts_col[i])
+            per_row_persistent.append(persistent_by_ep.get(ep, []))
+            buckets = events_by_ep_ts.get(ep, {})
+            per_row_events.append(buckets.get(ts, []))
+
+        new_table = self._materialize_table(
+            table, per_row_persistent, per_row_events, drop_old=self.drop_existing_subtask_index
+        )
+        # Atomic replace: write to a sibling tmp path and rename so a crash
+        # mid-write can't leave a half-written shard that ``pq.read_table``
+        # would then fail to open. ``Path.replace`` is atomic on POSIX +
+        # Windows when source and target sit on the same filesystem.
+        tmp_path = path.with_suffix(path.suffix + ".tmp")
+        pq.write_table(new_table, tmp_path)
+        tmp_path.replace(path)
+
+    def _materialize_table(
+        self,
+        table: pa.Table,
+        persistent: list[list[dict[str, Any]]],
+        events: list[list[dict[str, Any]]],
+        *,
+        drop_old: bool,
+    ) -> pa.Table:
+        cols = []
+        names = []
+        for name in table.column_names:
+            if drop_old and name == "subtask_index":
+                continue
+            if name in (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS):
+                continue  # we'll re-add canonical versions
+            # Strip any legacy ``tools`` column previously emitted by older
+            # writers — the schema no longer uses it (constant lives in
+            # SAY_TOOL_SCHEMA / DEFAULT_TOOLS).
+            if name == "tools":
+                continue
+            cols.append(table.column(name))
+            names.append(name)
+
+        # We let pyarrow infer struct/list schema rather than passing the
+        # canonical type from `lerobot.datasets.language` directly: that type
+        # uses `pa.json_()` for the `tool_calls` element type, which
+        # `pa.array(..., type=...)` cannot materialize from Python lists on
+        # current pyarrow versions. The inferred schema round-trips through
+        # parquet and `LeRobotDataset` correctly — `tests/datasets/test_language.py`
+        # exercises the same flow.
+        persistent_arr = pa.array(persistent)
+        events_arr = pa.array(events)
+
+        cols.extend([persistent_arr, events_arr])
+        names.extend([LANGUAGE_PERSISTENT, LANGUAGE_EVENTS])
+
+        return pa.Table.from_arrays(cols, names=names)
+
+
+def speech_atom(timestamp: float, text: str) -> dict[str, Any]:
+    """Build a canonical speech tool-call atom for the events column."""
+    return {
+        "role": "assistant",
+        "content": None,
+        "style": None,
+        "timestamp": float(timestamp),
+        "camera": None,
+        "tool_calls": [
+            {
+                "type": "function",
+                "function": {
+                    "name": "say",
+                    "arguments": {"text": text},
+                },
+            }
+        ],
+    }
diff --git a/src/lerobot/scripts/lerobot_annotate.py b/src/lerobot/scripts/lerobot_annotate.py
new file mode 100644
index 000000000..e95036a6b
--- /dev/null
+++ b/src/lerobot/scripts/lerobot_annotate.py
@@ -0,0 +1,206 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``lerobot-annotate`` — populate ``language_persistent`` and
+``language_events`` columns on a LeRobot dataset.
+
+Annotations live directly in ``data/chunk-*/file-*.parquet``.
+
+Example:
+
+  uv run lerobot-annotate \\
+      --root=/path/to/dataset \\
+      --vlm.model_id=Qwen/Qwen2.5-VL-7B-Instruct
+
+For distributed runs, see ``examples/annotations/run_hf_job.py``.
+"""
+
+import logging
+from pathlib import Path
+
+from lerobot.annotations.steerable_pipeline.config import AnnotationPipelineConfig
+from lerobot.annotations.steerable_pipeline.executor import Executor
+from lerobot.annotations.steerable_pipeline.frames import make_frame_provider
+from lerobot.annotations.steerable_pipeline.modules import (
+    GeneralVqaModule,
+    InterjectionsAndSpeechModule,
+    PlanSubtasksMemoryModule,
+)
+from lerobot.annotations.steerable_pipeline.validator import StagingValidator
+from lerobot.annotations.steerable_pipeline.vlm_client import make_vlm_client
+from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
+from lerobot.configs import parser
+
+logger = logging.getLogger(__name__)
+
+
+def _resolve_root(cfg: AnnotationPipelineConfig) -> Path:
+    if cfg.root is not None:
+        return Path(cfg.root)
+    if cfg.repo_id is not None:
+        from huggingface_hub import snapshot_download
+
+        return Path(snapshot_download(repo_id=cfg.repo_id, repo_type="dataset"))
+    raise ValueError("Either --root or --repo_id must be provided.")
+
+
+@parser.wrap()
+def annotate(cfg: AnnotationPipelineConfig) -> None:
+    """Run the steerable annotation pipeline against a dataset."""
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+    root = _resolve_root(cfg)
+    logger.info("annotate: root=%s", root)
+
+    vlm = make_vlm_client(cfg.vlm)
+    frame_provider = make_frame_provider(root, camera_key=cfg.vlm.camera_key, video_backend=cfg.video_backend)
+    # Surface the resolved cameras up front so a silent vqa-module no-op
+    # is obvious in job output rather than discovered post-hoc by counting
+    # parquet rows.
+    cam_keys = list(getattr(frame_provider, "camera_keys", []) or [])
+    logger.info(
+        "annotate: frame_provider default camera=%r, all cameras=%s",
+        getattr(frame_provider, "camera_key", None),
+        cam_keys,
+    )
+    if cfg.vqa.enabled and not cam_keys:
+        logger.warning(
+            "annotate: the vqa module is enabled but no cameras were "
+            "resolved — it will produce zero VQA rows. Check "
+            "meta/info.json for observation.images.* features, or pass "
+            "--vlm.camera_key=<key> to seed the cameras list."
+        )
+    plan = PlanSubtasksMemoryModule(vlm=vlm, config=cfg.plan, frame_provider=frame_provider)
+    interjections = InterjectionsAndSpeechModule(
+        vlm=vlm, config=cfg.interjections, seed=cfg.seed, frame_provider=frame_provider
+    )
+    vqa = GeneralVqaModule(vlm=vlm, config=cfg.vqa, seed=cfg.seed, frame_provider=frame_provider)
+    writer = LanguageColumnsWriter()
+    validator = StagingValidator(
+        dataset_camera_keys=tuple(cam_keys) or None,
+    )
+
+    executor = Executor(
+        config=cfg,
+        plan=plan,
+        interjections=interjections,
+        vqa=vqa,
+        writer=writer,
+        validator=validator,
+    )
+    summary = executor.run(root)
+    logger.info("annotate: wrote %d shard(s)", len(summary.written_paths))
+    for phase in summary.phases:
+        logger.info(
+            "annotate: phase=%s processed=%d skipped=%d",
+            phase.name,
+            phase.episodes_processed,
+            phase.episodes_skipped,
+        )
+    if summary.validation_report.warnings:
+        for w in summary.validation_report.warnings:
+            logger.warning(w)
+
+    if cfg.push_to_hub:
+        if cfg.repo_id is None and cfg.new_repo_id is None:
+            raise ValueError(
+                "--push_to_hub requires --repo_id or --new_repo_id (the dataset repo to push to)."
+            )
+        _push_to_hub(root, cfg)
+
+
+def _push_to_hub(root: Path, cfg: AnnotationPipelineConfig) -> None:
+    """Upload the annotated dataset directory to the Hub.
+
+    Pushes to ``cfg.new_repo_id`` when set, otherwise back to ``cfg.repo_id``.
+    """
+    from huggingface_hub import HfApi  # noqa: PLC0415
+
+    repo_id = cfg.new_repo_id or cfg.repo_id
+    commit_message = cfg.push_commit_message or "Add steerable annotations (lerobot-annotate)"
+    api = HfApi()
+    print(f"[lerobot-annotate] creating/locating dataset repo {repo_id}...", flush=True)
+    api.create_repo(
+        repo_id=repo_id,
+        repo_type="dataset",
+        private=cfg.push_private,
+        exist_ok=True,
+    )
+    print(f"[lerobot-annotate] uploading {root} -> {repo_id}...", flush=True)
+    commit_info = api.upload_folder(
+        folder_path=str(root),
+        repo_id=repo_id,
+        repo_type="dataset",
+        commit_message=commit_message,
+        ignore_patterns=[".annotate_staging/**", "**/.DS_Store"],
+    )
+    print(f"[lerobot-annotate] uploaded to https://huggingface.co/datasets/{repo_id}", flush=True)
+
+    # Tag the upload with the codebase version. ``LeRobotDatasetMetadata``
+    # resolves the dataset revision via ``get_safe_version`` which scans
+    # for tags like ``v3.0``; without a tag it raises
+    # ``RevisionNotFoundError``. Read the version straight from the
+    # dataset's own ``meta/info.json`` so we tag whatever the writer
+    # actually wrote (no accidental drift if the codebase floor moves).
+    from lerobot.datasets.dataset_metadata import CODEBASE_VERSION  # noqa: PLC0415
+
+    info_path = root / "meta" / "info.json"
+    version_tag = CODEBASE_VERSION
+    if info_path.exists():
+        try:
+            from lerobot.utils.io_utils import load_json  # noqa: PLC0415
+
+            info = load_json(info_path)
+            ds_version = info.get("codebase_version")
+            if isinstance(ds_version, str) and ds_version.startswith("v"):
+                version_tag = ds_version
+        except Exception as exc:  # noqa: BLE001
+            print(
+                f"[lerobot-annotate] could not read codebase_version from info.json ({exc}); falling back to {version_tag}",
+                flush=True,
+            )
+    revision = getattr(commit_info, "oid", None)
+    tag_kwargs = {
+        "repo_id": repo_id,
+        "tag": version_tag,
+        "repo_type": "dataset",
+    }
+    if revision is not None:
+        tag_kwargs["revision"] = revision
+
+    try:
+        from contextlib import suppress  # noqa: PLC0415
+
+        from huggingface_hub.errors import RevisionNotFoundError  # noqa: PLC0415
+
+        with suppress(RevisionNotFoundError):
+            api.delete_tag(repo_id, tag=version_tag, repo_type="dataset")
+        api.create_tag(**tag_kwargs)
+        print(f"[lerobot-annotate] tagged {repo_id} as {version_tag}", flush=True)
+    except Exception as exc:  # noqa: BLE001
+        print(
+            f"[lerobot-annotate] WARNING: could not create tag {version_tag!r} on {repo_id}: {exc}. "
+            "Dataset is uploaded but ``LeRobotDataset`` won't be able to load it until it's tagged. "
+            "Run: from huggingface_hub import HfApi; "
+            f"HfApi().create_tag({repo_id!r}, tag={version_tag!r}, repo_type='dataset', exist_ok=True)",
+            flush=True,
+        )
+
+
+def main() -> None:
+    annotate()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tests/annotations/__init__.py b/tests/annotations/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/tests/annotations/_helpers.py b/tests/annotations/_helpers.py
new file mode 100644
index 000000000..6a6290a1d
--- /dev/null
+++ b/tests/annotations/_helpers.py
@@ -0,0 +1,58 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Helpers shared across annotation-pipeline tests."""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient
+
+
+def make_canned_responder(
+    responses_by_marker: dict[str, Any],
+    default: Any = None,
+) -> StubVlmClient:
+    """Return a stub that picks a response by inspecting the user prompt.
+
+    For each call the responder examines the last user-message text and
+    returns the response keyed by the first marker substring it contains.
+    Falls back to ``default`` if no marker matches.
+    """
+
+    def responder(messages: list[dict[str, Any]]) -> Any:
+        last_user_text = ""
+        for message in messages:
+            if message.get("role") != "user":
+                continue
+            content = message.get("content")
+            if isinstance(content, str):
+                last_user_text = content
+            elif isinstance(content, list):
+                for block in content:
+                    if isinstance(block, dict) and block.get("type") == "text":
+                        last_user_text = block.get("text", "")
+        for marker, response in responses_by_marker.items():
+            if marker in last_user_text:
+                return response
+        return default
+
+    return StubVlmClient(responder=responder)
+
+
+def encode_vqa_answer(payload: dict[str, Any]) -> str:
+    return json.dumps(payload, sort_keys=True)
diff --git a/tests/annotations/conftest.py b/tests/annotations/conftest.py
new file mode 100644
index 000000000..69e0d595e
--- /dev/null
+++ b/tests/annotations/conftest.py
@@ -0,0 +1,58 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Shared fixtures for annotation-pipeline tests.
+
+The on-disk dataset builder lives with the other dataset factories in
+``tests/fixtures/dataset_factories.py`` (:func:`build_annotation_dataset`);
+these fixtures only wire it into pytest.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+# ``build_annotation_dataset`` pulls in ``lerobot.datasets`` (HF ``datasets``
+# + ``pandas``, only in the ``dataset`` extra), so it's imported lazily inside
+# each fixture — this conftest stays importable without that extra. The test
+# modules ``pytest.importorskip("datasets")`` so they skip rather than error.
+
+
+@pytest.fixture
+def fixture_dataset_root(tmp_path: Path) -> Path:
+    """A tiny dataset with two episodes, 12 frames each at 10 fps."""
+    from tests.fixtures.dataset_factories import build_annotation_dataset
+
+    return build_annotation_dataset(
+        tmp_path / "ds",
+        episode_specs=[
+            (0, 12, "Could you tidy the kitchen please?"),
+            (1, 12, "Please clean up the kitchen"),
+        ],
+        fps=10,
+    )
+
+
+@pytest.fixture
+def single_episode_root(tmp_path: Path) -> Path:
+    from tests.fixtures.dataset_factories import build_annotation_dataset
+
+    return build_annotation_dataset(
+        tmp_path / "ds_one",
+        episode_specs=[(0, 30, "Pour water from the bottle into the cup.")],
+        fps=10,
+    )
diff --git a/tests/annotations/run_e2e_smoke.py b/tests/annotations/run_e2e_smoke.py
new file mode 100644
index 000000000..723f49a5e
--- /dev/null
+++ b/tests/annotations/run_e2e_smoke.py
@@ -0,0 +1,116 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Opt-in E2E smoke run for ``make annotation-e2e``.
+
+Builds the shared annotation fixture (:func:`build_annotation_dataset`),
+runs the full annotation pipeline against it with a stub VLM, prints a short
+report, and asserts that interjection and speech rows were actually written.
+This is intentionally not a pytest test (it drives the ``Executor`` wiring
+directly), but it reuses the on-disk dataset builder of the pytest fixtures.
+"""
+
+from __future__ import annotations
+
+import sys
+import tempfile
+from pathlib import Path
+
+from lerobot.annotations.steerable_pipeline.config import AnnotationPipelineConfig
+from lerobot.annotations.steerable_pipeline.executor import Executor
+from lerobot.annotations.steerable_pipeline.modules import (
+    GeneralVqaModule,
+    InterjectionsAndSpeechModule,
+    PlanSubtasksMemoryModule,
+)
+from lerobot.annotations.steerable_pipeline.validator import StagingValidator
+from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient
+from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter
+from tests.fixtures.dataset_factories import build_annotation_dataset
+
+
+def _stub_responder(messages):
+    text = ""
+    for m in messages:
+        if m.get("role") == "user":
+            content = m.get("content")
+            if isinstance(content, list):
+                for block in content:
+                    if isinstance(block, dict) and block.get("type") == "text":
+                        text = block.get("text", "")
+            elif isinstance(content, str):
+                text = content
+    if "atomic subtasks" in text:
+        return {
+            "subtasks": [
+                {"text": "grasp the bottle", "start": 0.0, "end": 1.0},
+                {"text": "pour into the cup", "start": 1.0, "end": 2.0},
+                {"text": "place the bottle down", "start": 2.0, "end": 3.0},
+            ]
+        }
+    if "compressed semantic memory" in text:
+        return {"memory": "poured once"}
+    if "acknowledgement the robot" in text:
+        return {"text": "Sure."}
+    if "compact interjection" in text:
+        return {"interjection": "use less water", "speech": "Using less water."}
+    if "frame-grounded visual question" in text:
+        return {"question": "How many cups?", "answer": {"label": "cup", "count": 1}}
+    return None
+
+
+def main() -> int:
+    with tempfile.TemporaryDirectory() as tmp:
+        root = build_annotation_dataset(
+            Path(tmp) / "ds",
+            episode_specs=[(0, 30, "Pour water into the cup.")],
+            fps=10,
+        )
+        vlm = StubVlmClient(responder=_stub_responder)
+        cfg = AnnotationPipelineConfig()
+        executor = Executor(
+            config=cfg,
+            plan=PlanSubtasksMemoryModule(vlm=vlm, config=cfg.plan),
+            interjections=InterjectionsAndSpeechModule(vlm=vlm, config=cfg.interjections, seed=cfg.seed),
+            vqa=GeneralVqaModule(vlm=vlm, config=cfg.vqa, seed=cfg.seed),
+            writer=LanguageColumnsWriter(),
+            validator=StagingValidator(),
+        )
+        summary = executor.run(root)
+        print(f"phases={[(p.name, p.episodes_processed) for p in summary.phases]}")
+        print(f"validation: {summary.validation_report.summary()}")
+        print(f"shards rewritten: {len(summary.written_paths)}")
+
+        # Assert the interjection code path actually fired — otherwise a stale
+        # canned-VLM marker would silently produce zero interjections and this
+        # smoke run would still "pass" by only printing.
+        import pyarrow.parquet as pq  # noqa: PLC0415
+
+        events = [
+            r
+            for shard in summary.written_paths
+            for ev in pq.read_table(shard).column("language_events").to_pylist()
+            for r in ev
+        ]
+        n_interjections = sum(1 for r in events if r.get("style") == "interjection")
+        n_speech = sum(1 for r in events if r.get("style") is None and r.get("role") == "assistant")
+        print(f"interjections={n_interjections} speech_atoms={n_speech}")
+        assert n_interjections > 0, "no interjection rows produced — check the interjection prompt marker"
+        assert n_speech > 0, "no speech tool-call atoms produced — check the speech prompt marker"
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/tests/annotations/test_frames.py b/tests/annotations/test_frames.py
new file mode 100644
index 000000000..5c9c58f7b
--- /dev/null
+++ b/tests/annotations/test_frames.py
@@ -0,0 +1,246 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Unit tests for :class:`VideoFrameProvider` method bindings.
+
+These were prompted by a real regression: ``video_for_episode`` was once
+indented one level too deep so it ended up nested *inside* a module-level
+helper (after that function's ``return`` statement) — silently dead code
+that meant production runs with ``use_video_url=False`` would
+``AttributeError`` on ``self.frame_provider.video_for_episode(...)``. The
+existing module tests didn't catch it because they exercise stub providers.
+
+The binding tests below assert on the class itself (not on an instance), so a
+future reindent regression flips them to red without a real LeRobot dataset
+on disk. The decode and clip tests further down use fake metadata instead.
+"""
+
+from __future__ import annotations
+
+import shutil
+import subprocess
+from pathlib import Path
+
+import pytest
+import torch
+
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+
+from lerobot.annotations.steerable_pipeline.frames import VideoFrameProvider  # noqa: E402
+
+
+class _FakeMeta:
+    """Minimal metadata stub exposing ``video_keys`` / ``camera_keys``."""
+
+    def __init__(self, video_keys: list[str], image_keys: list[str], video_path: Path | None = None) -> None:
+        self.video_keys = video_keys
+        self.camera_keys = [*video_keys, *image_keys]
+        self._video_path = video_path
+        self.episodes = {0: {f"videos/{key}/from_timestamp": 0.0 for key in video_keys}}
+
+    def get_video_file_path(self, episode_index: int, camera_key: str) -> Path | None:
+        return self._video_path
+
+
+def test_default_camera_key_skips_image_only_cameras(tmp_path: Path, monkeypatch) -> None:
+    """The default camera must be a *video* key — image-stored cameras have no
+    ``videos/<key>/from_timestamp`` and would KeyError in the clip/decode path.
+
+    Regression: a dataset whose first ``camera_keys`` entry was an image-stored
+    camera (e.g. ``observation.images.wrist``) crashed at clip extraction.
+    """
+    fake = _FakeMeta(
+        video_keys=["observation.images.robot0_agentview_right"],
+        image_keys=["observation.images.wrist"],
+    )
+    import lerobot.datasets.dataset_metadata as meta_mod
+
+    monkeypatch.setattr(meta_mod, "LeRobotDatasetMetadata", lambda *a, **k: fake, raising=True)
+    provider = VideoFrameProvider(root=tmp_path)
+    assert provider.camera_key == "observation.images.robot0_agentview_right"
+    assert "observation.images.wrist" not in provider.camera_keys
+
+
+def test_video_for_episode_is_a_method_of_videoframeprovider():
+    """``video_for_episode`` must be defined on the class, not nested dead code."""
+    assert callable(getattr(VideoFrameProvider, "video_for_episode", None))
+
+
+def test_episode_clip_path_is_a_method_of_videoframeprovider():
+    """``episode_clip_path`` is now a method (was a free function reaching
+    into ``provider._meta`` from outside the class)."""
+    assert callable(getattr(VideoFrameProvider, "episode_clip_path", None))
+
+
+def test_videoframeprovider_has_a_lock_for_concurrent_use():
+    """A ``ThreadPoolExecutor`` runs the plan / interjections / vqa phases
+    concurrently; the cache + warn-flag accesses must be guarded.
+    """
+    import threading
+
+    # Inspect the dataclass field rather than constructing an instance, so the
+    # test never touches the hub. The lock is declared with ``init=False`` and
+    # a ``threading.Lock`` default factory, so each instance owns its own lock.
+    lock_field = next(
+        (f for f in VideoFrameProvider.__dataclass_fields__.values() if f.name == "_lock"),
+        None,
+    )
+    assert lock_field is not None
+    assert lock_field.default_factory is threading.Lock
+
+
+@pytest.fixture
+def sample_video(tmp_path: Path) -> Path:
+    """A 3 s 10 fps test-pattern mp4, written with ffmpeg."""
+    if shutil.which("ffmpeg") is None:
+        pytest.skip("ffmpeg not available")
+    out = tmp_path / "sample.mp4"
+    subprocess.run(
+        [
+            "ffmpeg",
+            "-y",
+            "-f",
+            "lavfi",
+            "-i",
+            "testsrc=duration=3:size=160x120:rate=10",
+            "-pix_fmt",
+            "yuv420p",
+            str(out),
+        ],
+        check=True,
+        capture_output=True,
+    )
+    return out
+
+
+def _provider_for_video(tmp_path: Path, video: Path, monkeypatch) -> VideoFrameProvider:
+    """A provider whose single camera resolves to ``video`` via fake metadata."""
+    fake = _FakeMeta(video_keys=["observation.images.cam"], image_keys=[], video_path=video)
+    import lerobot.datasets.dataset_metadata as meta_mod
+
+    monkeypatch.setattr(meta_mod, "LeRobotDatasetMetadata", lambda *a, **k: fake, raising=True)
+    return VideoFrameProvider(root=tmp_path, tolerance_s=0.2)
+
+
+def test_decode_returns_one_uint8_frame_per_timestamp(
+    sample_video: Path, tmp_path: Path, monkeypatch
+) -> None:
+    """``_decode`` routes through ``decode_video_frames`` (torchcodec when
+    available, PyAV otherwise) — no subprocess fallback.
+    """
+    provider = _provider_for_video(tmp_path, sample_video, monkeypatch)
+    timestamps = [0.0, 1.0, 2.5]
+    frames = provider._decode(0, timestamps, "observation.images.cam")
+
+    assert len(frames) == len(timestamps)
+    for frame in frames:
+        assert isinstance(frame, torch.Tensor)
+        assert frame.dtype == torch.uint8
+        assert frame.shape == (3, 120, 160)
+
+
+def test_frames_at_snaps_mid_frame_grid_to_real_frames(
+    sample_video: Path, tmp_path: Path, monkeypatch
+) -> None:
+    """Uniform sampling grids land mid-frame; ``frames_at`` must snap them to
+    real frame timestamps before decoding.
+
+    Regression: ``decode_video_frames`` rejects queries farther than
+    ``tolerance_s`` (default 10 ms) from a decodable frame, so un-snapped
+    mid-frame queries raised ``FrameTimestampError`` wholesale and the plan
+    module silently lost its contact sheets for most episodes.
+    """
+    from types import SimpleNamespace
+
+    fake = _FakeMeta(video_keys=["observation.images.cam"], image_keys=[], video_path=sample_video)
+    import lerobot.datasets.dataset_metadata as meta_mod
+
+    monkeypatch.setattr(meta_mod, "LeRobotDatasetMetadata", lambda *a, **k: fake, raising=True)
+    provider = VideoFrameProvider(root=tmp_path)  # default 10 ms tolerance
+    # 10 fps fixture -> frames at 0.0, 0.1, ...; queries sit mid-frame.
+    record = SimpleNamespace(episode_index=0, frame_timestamps=[i / 10 for i in range(30)])
+
+    frames = provider.frames_at(record, [0.149, 1.234, 2.04], camera_key="observation.images.cam")
+
+    assert len(frames) == 3
+    for frame in frames:
+        assert isinstance(frame, torch.Tensor)
+        assert frame.shape == (3, 120, 160)
+
+
+def test_decode_returns_empty_list_on_missing_file(tmp_path: Path, monkeypatch) -> None:
+    """A missing video is a recoverable no-frames condition, never a crash."""
+    provider = _provider_for_video(tmp_path, tmp_path / "does_not_exist.mp4", monkeypatch)
+    assert provider._decode(0, [0.0], "observation.images.cam") == []
+
+
+def test_episode_clip_path_trims_via_reencode_video(tmp_path: Path, monkeypatch) -> None:
+    """Clip extraction delegates to ``video_utils.reencode_video`` with the
+    episode's ``[from_timestamp, to_timestamp)`` trim window — no subprocess.
+    """
+    from types import SimpleNamespace
+
+    import lerobot.annotations.steerable_pipeline.frames as frames_mod
+
+    src = tmp_path / "src.mp4"
+    src.write_bytes(b"src")
+    fake = _FakeMeta(video_keys=["observation.images.cam"], image_keys=[], video_path=src)
+    fake.episodes[0]["videos/observation.images.cam/from_timestamp"] = 1.5
+    fake.episodes[0]["videos/observation.images.cam/to_timestamp"] = 4.0
+    import lerobot.datasets.dataset_metadata as meta_mod
+
+    monkeypatch.setattr(meta_mod, "LeRobotDatasetMetadata", lambda *a, **k: fake, raising=True)
+
+    captured = {}
+
+    def fake_reencode(
+        input_video_path,
+        output_video_path,
+        camera_encoder=None,
+        overwrite=False,
+        start_time_s=None,
+        end_time_s=None,
+    ):
+        captured.update(
+            src=Path(input_video_path),
+            encoder=camera_encoder,
+            start_time_s=start_time_s,
+            end_time_s=end_time_s,
+        )
+        Path(output_video_path).write_bytes(b"clip")
+
+    monkeypatch.setattr(frames_mod, "reencode_video", fake_reencode, raising=True)
+    provider = VideoFrameProvider(root=tmp_path)
+    record = SimpleNamespace(episode_index=0, frame_timestamps=[0.0, 1.0])
+
+    out = provider.episode_clip_path(record, tmp_path / "clips")
+
+    assert out == tmp_path / "clips" / "ep_000000.mp4"
+    assert captured["src"] == src
+    assert captured["start_time_s"] == 1.5
+    assert captured["end_time_s"] == 4.0
+    # H.264 so the clip is decodable by vllm's libav build (sources are often AV1).
+    assert captured["encoder"].vcodec == "h264"
+
+
+def test_videoframeprovider_serializes_decodes_with_a_lock() -> None:
+    """torchcodec's cached per-file decoder is single-threaded; the provider
+    must own a dedicated lock that ``_decode`` holds around the decoder call.
+    """
+    import threading
+
+    lock_field = VideoFrameProvider.__dataclass_fields__.get("_decode_lock")
+    assert lock_field is not None
+    assert lock_field.default_factory is threading.Lock
diff --git a/tests/annotations/test_modules.py b/tests/annotations/test_modules.py
new file mode 100644
index 000000000..d4f07f83b
--- /dev/null
+++ b/tests/annotations/test_modules.py
@@ -0,0 +1,390 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Unit tests for Module 1 (plan), Module 2 (interjections) and Module 3 (VQA) with stubbed VLMs."""
+
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+import PIL.Image
+import pytest
+
+# ``lerobot.annotations`` imports pull in ``lerobot.datasets`` (-> the HF
+# ``datasets`` library), which only ships under the ``dataset`` extra. Skip
+# this module in tiers without it instead of erroring at import.
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+pytest.importorskip("pandas", reason="pandas is required (install lerobot[dataset])")
+
+from lerobot.annotations.steerable_pipeline.config import (  # noqa: E402
+    InterjectionsConfig,
+    PlanConfig,
+    VqaConfig,
+)
+from lerobot.annotations.steerable_pipeline.modules import (  # noqa: E402
+    GeneralVqaModule,
+    InterjectionsAndSpeechModule,
+    PlanSubtasksMemoryModule,
+)
+from lerobot.annotations.steerable_pipeline.reader import iter_episodes  # noqa: E402
+from lerobot.annotations.steerable_pipeline.staging import EpisodeStaging  # noqa: E402
+from lerobot.annotations.steerable_pipeline.vlm_client import StubVlmClient  # noqa: E402
+
+from ._helpers import make_canned_responder  # noqa: E402
+
+
+@dataclass
+class _StubFrameProvider:
+    """Returns one sentinel object per requested timestamp."""
+
+    # A real (tiny) PIL image so the contact-sheet builder, which resizes and
+    # tiles frames, has something to draw. VQA still passes it through by
+    # identity via ``to_image_blocks``.
+    sentinel: Any = field(default_factory=lambda: PIL.Image.new("RGB", (32, 24)))
+    cameras: tuple[str, ...] = ("observation.images.top",)
+    calls: list[tuple[int, tuple[float, ...], str | None]] = field(default_factory=list)
+    video_calls: list[tuple[int, int, str | None]] = field(default_factory=list)
+
+    @property
+    def camera_keys(self) -> list[str]:
+        return list(self.cameras)
+
+    def frames_at(self, record, timestamps, camera_key=None):
+        self.calls.append((record.episode_index, tuple(timestamps), camera_key))
+        return [self.sentinel] * len(timestamps)
+
+    def video_for_episode(self, record, max_frames, camera_key=None):
+        self.video_calls.append((record.episode_index, max_frames, camera_key))
+        n = min(max_frames, len(record.frame_timestamps))
+        return [self.sentinel] * n
+
+
+def _spy_responder(captured: list[list[dict[str, Any]]], reply: Any):
+    def responder(messages):
+        captured.append(list(messages))
+        return reply
+
+    return StubVlmClient(responder=responder)
+
+
+def test_module1_plan_memory_subtask_smoke(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    vlm = make_canned_responder(
+        {
+            "atomic subtasks": {
+                "subtasks": [
+                    {"text": "grasp the handle of the sponge", "start": 0.0, "end": 0.4},
+                    {"text": "wipe the counter from left to right", "start": 0.4, "end": 0.8},
+                    {"text": "place the sponge into the sink", "start": 0.8, "end": 1.1},
+                ]
+            },
+            "compressed semantic memory": {"memory": "wiped the counter once"},
+        },
+    )
+    module = PlanSubtasksMemoryModule(vlm=vlm, config=PlanConfig())
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("plan")
+
+    styles = {r["style"] for r in rows}
+    assert {"subtask", "plan", "memory"}.issubset(styles)
+    # every staged row, not just subtasks, must land on an exact frame timestamp
+    frame_set = set(record.frame_timestamps)
+    for row in rows:
+        assert row["timestamp"] in frame_set
+    # one plan row per subtask boundary; the first lands at t0 and each
+    # plan is the deterministic numbered list of still-todo subtasks
+    plan_rows = sorted((r for r in rows if r["style"] == "plan"), key=lambda r: r["timestamp"])
+    subtask_rows = [r for r in rows if r["style"] == "subtask"]
+    assert len(plan_rows) == len(subtask_rows)
+    assert plan_rows[0]["timestamp"] == record.frame_timestamps[0]
+    # the t0 plan enumerates all subtasks; later plans shrink
+    assert plan_rows[0]["content"].startswith("1. ")
+    assert len(plan_rows[0]["content"].splitlines()) == len(subtask_rows)
+    assert len(plan_rows[-1]["content"].splitlines()) == 1
+
+
+def test_module1_emit_memory_false_skips_memory_keeps_subtasks_and_plan(
+    fixture_dataset_root: Path, tmp_path: Path
+) -> None:
+    """``emit_memory=False`` drops ``memory`` rows (and their VLM calls) while
+    leaving subtask + plan generation intact — symmetric to ``emit_plan``."""
+    vlm = make_canned_responder(
+        {
+            "atomic subtasks": {
+                "subtasks": [
+                    {"text": "grasp the handle of the sponge", "start": 0.0, "end": 0.4},
+                    {"text": "wipe the counter from left to right", "start": 0.4, "end": 0.8},
+                    {"text": "place the sponge into the sink", "start": 0.8, "end": 1.1},
+                ]
+            },
+            "compressed semantic memory": {"memory": "wiped the counter once"},
+        },
+    )
+    module = PlanSubtasksMemoryModule(vlm=vlm, config=PlanConfig(emit_memory=False))
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("plan")
+
+    styles = {r["style"] for r in rows}
+    assert "memory" not in styles
+    assert {"subtask", "plan"}.issubset(styles)
+
+
+def test_module2_at_t0_emits_speech_only_no_interjection(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    vlm = make_canned_responder(
+        {"acknowledgement the robot": {"text": "Sure, on it."}},
+    )
+    module = InterjectionsAndSpeechModule(
+        vlm=vlm,
+        config=InterjectionsConfig(max_interjections_per_episode=0),
+    )
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("interjections")
+    assert len(rows) == 1
+    only = rows[0]
+    assert only["role"] == "assistant"
+    assert only["style"] is None
+    assert only["content"] is None
+    assert only["timestamp"] == record.frame_timestamps[0]
+    assert only["tool_calls"][0]["function"]["name"] == "say"
+
+
+def test_module2_mid_episode_emits_paired_interjection_and_speech(
+    fixture_dataset_root: Path, tmp_path: Path
+) -> None:
+    """Module 2 anchors interjections on Module 1's subtask boundaries.
+
+    The executor runs Module 1 first, then Module 2 reads the subtask
+    rows back from the same staging tree (see
+    ``_mid_episode_interjections``). Reproduce that contract here by
+    seeding the staging with two subtask rows so a single ``0 → 1``
+    boundary exists for Module 2 to anchor on.
+    """
+    vlm = make_canned_responder(
+        {
+            "acknowledgement the robot": {"text": "OK."},
+            # Marker matches the distinctive line of
+            # ``interjections_interjection.txt`` ("Write ONE compact
+            # interjection ..."). Keep this in sync with that prompt's
+            # wording — the canned responder matches on substring.
+            "Write ONE compact interjection": {
+                "interjection": "now wipe the counter please",
+                "speech": "On it.",
+            },
+        },
+    )
+    module = InterjectionsAndSpeechModule(
+        vlm=vlm,
+        config=InterjectionsConfig(max_interjections_per_episode=1, interjection_min_t=0.2),
+        seed=7,
+    )
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    # Seed Module 1's subtask staging so Module 2 has a boundary to
+    # anchor on (it bails with zero rows when no spans exist — the
+    # production executor guarantees Module 1 ran first).
+    boundary_ts = float(record.frame_timestamps[len(record.frame_timestamps) // 2])
+    staging.write(
+        "plan",
+        [
+            {
+                "role": "assistant",
+                "content": "grasp the sponge",
+                "style": "subtask",
+                "timestamp": float(record.frame_timestamps[0]),
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "wipe the counter",
+                "style": "subtask",
+                "timestamp": boundary_ts,
+                "tool_calls": None,
+            },
+        ],
+    )
+    module.run_episode(record, staging)
+    rows = staging.read("interjections")
+
+    interjections = [r for r in rows if r["style"] == "interjection"]
+    speeches = [r for r in rows if r["style"] is None and r["role"] == "assistant"]
+    assert len(interjections) == 1
+    assert len(speeches) >= 2  # initial t=0 + one paired with the interjection
+    inter_t = interjections[0]["timestamp"]
+    assert any(abs(s["timestamp"] - inter_t) < 1e-9 for s in speeches)
+
+
+def test_module3_vqa_unique_per_frame_and_camera(single_episode_root: Path, tmp_path: Path) -> None:
+    payload = {
+        "question": "How many cups?",
+        "answer": {"label": "cup", "count": 2, "note": "white & blue"},
+    }
+    vlm = make_canned_responder({"frame-grounded visual question": payload})
+    module = GeneralVqaModule(
+        vlm=vlm,
+        config=VqaConfig(vqa_emission_hz=1.0, K=3),
+        seed=1,
+        frame_provider=_StubFrameProvider(cameras=("observation.images.top", "observation.images.wrist")),
+    )
+    record = next(iter_episodes(single_episode_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("vqa")
+    # every vqa row must carry a camera tag naming one of the configured cameras
+    for r in rows:
+        assert r["style"] == "vqa"
+        assert r.get("camera") in {"observation.images.top", "observation.images.wrist"}
+    # at most one (vqa, user) and one (vqa, assistant) per (timestamp, camera)
+    user_keys = [(r["timestamp"], r["camera"]) for r in rows if r["role"] == "user" and r["style"] == "vqa"]
+    assistant_keys = [
+        (r["timestamp"], r["camera"]) for r in rows if r["role"] == "assistant" and r["style"] == "vqa"
+    ]
+    assert len(user_keys) == len(set(user_keys))
+    assert len(assistant_keys) == len(set(assistant_keys))
+    # both cameras must be represented
+    assert {c for _, c in user_keys} == {"observation.images.top", "observation.images.wrist"}
+    # every emitted timestamp must be an exact source frame timestamp
+    frame_set = set(record.frame_timestamps)
+    for ts, _ in user_keys + assistant_keys:
+        assert ts in frame_set
+
+
+def test_module1_attaches_contact_sheets_to_subtask_prompt(
+    fixture_dataset_root: Path, tmp_path: Path
+) -> None:
+    """Module 1 sends timestamped contact-sheet image blocks (not a raw video block)."""
+    captured: list[list[dict[str, Any]]] = []
+    payload = {
+        "subtasks": [
+            {"text": "grasp the handle of the sponge", "start": 0.0, "end": 0.5},
+            {"text": "wipe the counter", "start": 0.5, "end": 1.1},
+        ]
+    }
+    memory_payload = {"memory": "wiped once"}
+
+    def responder(messages):
+        captured.append(list(messages))
+        text = ""
+        for m in messages:
+            for block in m.get("content", []):
+                if isinstance(block, dict) and block.get("type") == "text":
+                    text = block.get("text", "")
+        if "compressed semantic memory" in text:
+            return memory_payload
+        return payload
+
+    provider = _StubFrameProvider()
+    module = PlanSubtasksMemoryModule(
+        vlm=StubVlmClient(responder=responder),
+        # Disable the rephrasings sub-prompt so the test's only video-bearing
+        # call is the subtask one — keeps the assertions below focused on
+        # ``_generate_subtasks`` rather than fighting the order of unrelated
+        # text-only Module-1 sub-prompts.
+        config=PlanConfig(frames_per_second=2.0, max_frames_per_prompt=60, n_task_rephrasings=0),
+        frame_provider=provider,
+    )
+    record = next(iter_episodes(fixture_dataset_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+
+    # Find the call carrying the subtask prompt rather than blindly taking
+    # captured[0] — Module 1 issues several sub-prompts and their order is
+    # not part of the contract.
+    assert captured, "no VLM calls made"
+
+    def _prompt_text(messages):
+        for m in messages:
+            for block in m.get("content", []):
+                if isinstance(block, dict) and block.get("type") == "text":
+                    return block.get("text", "")
+        return ""
+
+    subtask_calls = [m for m in captured if "atomic subtasks" in _prompt_text(m)]
+    assert len(subtask_calls) == 1, "expected exactly one subtask-prompt VLM call"
+    content = subtask_calls[0][0]["content"]
+    video_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "video"]
+    image_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "image"]
+    text_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
+    assert video_blocks == [], "contact-sheet mode must not emit a raw video block"
+    assert len(image_blocks) >= 1, f"expected >=1 contact-sheet image block, got {content}"
+    assert all(isinstance(b["image"], PIL.Image.Image) for b in image_blocks)
+    assert len(text_blocks) == 1
+    # the prompt is prefixed with the contact-sheet reading instructions
+    assert text_blocks[0]["text"].startswith("CONTACT SHEETS")
+    # frames were decoded for this episode at episode-relative timestamps
+    assert provider.calls and provider.calls[0][0] == record.episode_index
+
+
+def test_module3_attaches_frame_image_block_to_prompt(single_episode_root: Path, tmp_path: Path) -> None:
+    """Each VQA prompt must carry a single image block at the emission frame."""
+    captured: list[list[dict[str, Any]]] = []
+    payload = {
+        "question": "How many cups?",
+        "answer": {"label": "cup", "count": 1},
+    }
+    provider = _StubFrameProvider()
+    module = GeneralVqaModule(
+        vlm=_spy_responder(captured, payload),
+        config=VqaConfig(vqa_emission_hz=1.0, K=1),
+        seed=0,
+        frame_provider=provider,
+    )
+    record = next(iter_episodes(single_episode_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+
+    assert captured, "no VLM calls made"
+    for messages in captured:
+        content = messages[0]["content"]
+        image_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "image"]
+        text_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
+        assert len(image_blocks) == 1, f"expected 1 image block per VQA prompt, got {content}"
+        assert image_blocks[0]["image"] is provider.sentinel
+        assert len(text_blocks) == 1
+    # provider was called once per emission per camera with the exact emission timestamp
+    for ep_idx, ts_tuple, camera in provider.calls:
+        assert ep_idx == record.episode_index
+        assert len(ts_tuple) == 1
+        assert ts_tuple[0] in record.frame_timestamps
+        assert camera in provider.cameras
+
+
+def test_module3_assistant_content_is_valid_json(single_episode_root: Path, tmp_path: Path) -> None:
+    payload = {
+        "question": "Where is the cup?",
+        "answer": {"detections": [{"label": "cup", "bbox_format": "xyxy", "bbox": [10, 20, 50, 80]}]},
+    }
+    vlm = make_canned_responder({"frame-grounded visual question": payload})
+    module = GeneralVqaModule(
+        vlm=vlm,
+        config=VqaConfig(vqa_emission_hz=1.0, K=2),
+        seed=2,
+        frame_provider=_StubFrameProvider(),
+    )
+    record = next(iter_episodes(single_episode_root))
+    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
+    module.run_episode(record, staging)
+    rows = staging.read("vqa")
+    for row in rows:
+        if row["role"] == "assistant" and row["style"] == "vqa":
+            decoded = json.loads(row["content"])
+            assert "detections" in decoded
diff --git a/tests/annotations/test_pipeline_recipe_render.py b/tests/annotations/test_pipeline_recipe_render.py
new file mode 100644
index 000000000..614c2e45e
--- /dev/null
+++ b/tests/annotations/test_pipeline_recipe_render.py
@@ -0,0 +1,183 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""End-to-end smoke: pipeline output → canonical recipe rendering."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+# ``pyarrow`` and the ``lerobot.datasets`` chain (-> the HF ``datasets``
+# library) only ship under the ``dataset`` extra. Skip this module in
+# tiers without it instead of erroring at import.
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+pytest.importorskip("pandas", reason="pandas is required (install lerobot[dataset])")
+
+import pyarrow.parquet as pq  # noqa: E402
+
+from lerobot.annotations.steerable_pipeline.config import (  # noqa: E402
+    AnnotationPipelineConfig,
+    InterjectionsConfig,
+    PlanConfig,
+    VqaConfig,
+)
+from lerobot.annotations.steerable_pipeline.executor import Executor  # noqa: E402
+from lerobot.annotations.steerable_pipeline.modules import (  # noqa: E402
+    GeneralVqaModule,
+    InterjectionsAndSpeechModule,
+    PlanSubtasksMemoryModule,
+)
+from lerobot.annotations.steerable_pipeline.validator import StagingValidator  # noqa: E402
+from lerobot.annotations.steerable_pipeline.writer import LanguageColumnsWriter  # noqa: E402
+from lerobot.configs.recipe import MessageTurn, TrainingRecipe  # noqa: E402
+from lerobot.datasets.language_render import render_sample  # noqa: E402
+
+from ._helpers import make_canned_responder  # noqa: E402
+
+
+def _build_style_blend_recipe() -> TrainingRecipe:
+    """Inline blend recipe covering the subtask, plan, memory, interjection, and speech output.
+
+    The language schema/DSL work used to ship
+    ``src/lerobot/configs/recipes/pi05_hirobot.yaml`` as a canonical
+    example, but that file was dropped during review. The contract this
+    test guards is "the recipe DSL can render non-empty messages from
+    pipeline output", which doesn't require a specific YAML — so we build
+    the equivalent blend in code.
+    """
+    return TrainingRecipe(
+        blend={
+            "low_level_execution": TrainingRecipe(
+                weight=0.35,
+                messages=[
+                    MessageTurn(
+                        role="user",
+                        content="${task}\nPlan: ${plan}\nMemory: ${memory}",
+                        stream="high_level",
+                    ),
+                    MessageTurn(role="assistant", content="${subtask}", stream="low_level", target=True),
+                ],
+            ),
+            "user_interjection_response": TrainingRecipe(
+                weight=0.16,
+                bindings={
+                    "speech": "emitted_at(t, role=assistant, tool_name=say)",
+                    "interjection": "emitted_at(t, style=interjection)",
+                },
+                messages=[
+                    MessageTurn(role="user", content="${task}", stream="high_level"),
+                    MessageTurn(
+                        role="user",
+                        content="${interjection}",
+                        stream="high_level",
+                        if_present="interjection",
+                    ),
+                    MessageTurn(
+                        role="assistant",
+                        content="${plan}",
+                        stream="high_level",
+                        target=True,
+                        if_present="plan",
+                        tool_calls_from="speech",
+                    ),
+                ],
+            ),
+        }
+    )
+
+
+def _build_executor() -> Executor:
+    vlm = make_canned_responder(
+        {
+            "atomic subtasks": {
+                "subtasks": [
+                    {"text": "grasp the bottle", "start": 0.0, "end": 0.5},
+                    {"text": "pour into the cup", "start": 0.5, "end": 1.0},
+                    {"text": "place the bottle down", "start": 1.0, "end": 1.5},
+                ]
+            },
+            "compressed semantic memory": {"memory": "poured once"},
+            "acknowledgement the robot": {"text": "Sure."},
+            "compact interjection": {
+                "interjection": "use less water",
+                "speech": "Using less water.",
+            },
+            "frame-grounded visual question": {
+                "question": "How many cups?",
+                "answer": {"label": "cup", "count": 1},
+            },
+        },
+    )
+    config = AnnotationPipelineConfig(
+        plan=PlanConfig(),
+        interjections=InterjectionsConfig(max_interjections_per_episode=1, interjection_min_t=0.5),
+        vqa=VqaConfig(vqa_emission_hz=1.0, K=2),
+    )
+    return Executor(
+        config=config,
+        plan=PlanSubtasksMemoryModule(vlm=vlm, config=config.plan),
+        interjections=InterjectionsAndSpeechModule(vlm=vlm, config=config.interjections, seed=config.seed),
+        vqa=GeneralVqaModule(vlm=vlm, config=config.vqa, seed=config.seed),
+        writer=LanguageColumnsWriter(),
+        validator=StagingValidator(),
+    )
+
+
+def test_canonical_recipe_renders_nonempty_from_pipeline_output(
+    single_episode_root: Path,
+) -> None:
+    executor = _build_executor()
+    summary = executor.run(single_episode_root)
+    # validator may emit warnings but no errors for the synthetic fixture
+    assert summary.validation_report.ok, summary.validation_report.summary()
+
+    table = pq.read_table(single_episode_root / "data" / "chunk-000" / "file-000.parquet")
+    persistent_lists = table.column("language_persistent").to_pylist()
+    events_lists = table.column("language_events").to_pylist()
+    timestamps = table.column("timestamp").to_pylist()
+
+    recipe = _build_style_blend_recipe()
+
+    rendered_any = False
+    for ts, persistent, events in zip(timestamps, persistent_lists, events_lists, strict=True):
+        result = render_sample(
+            recipe=recipe,
+            persistent=persistent,
+            events=events,
+            t=float(ts),
+            sample_idx=0,
+            dataset_ctx={"task": "Pour water from the bottle into the cup."},
+        )
+        if result is None:
+            continue
+        if result["messages"]:
+            rendered_any = True
+            assert result["target_message_indices"]
+            break
+    assert rendered_any, "recipe rendered no messages from pipeline output"
+
+    # Sanity: speech atom appears in events column intact
+    flat_events = [r for ev in events_lists for r in ev]
+    speech_rows = [r for r in flat_events if r.get("style") is None and r.get("role") == "assistant"]
+    assert speech_rows
+    say = speech_rows[0]["tool_calls"][0]
+    assert say["function"]["name"] == "say"
+    assert isinstance(say["function"]["arguments"]["text"], str)
+    # The pipeline does not write a ``tools`` column — the say schema lives
+    # as a constant (``SAY_TOOL_SCHEMA``) so the language row struct is the
+    # single source of truth for the v3.1 schema.
+    assert "tools" not in table.column_names
diff --git a/tests/annotations/test_validator.py b/tests/annotations/test_validator.py
new file mode 100644
index 000000000..6b421cc98
--- /dev/null
+++ b/tests/annotations/test_validator.py
@@ -0,0 +1,133 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Validator behavior tests."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+# ``lerobot.annotations`` imports pull in ``lerobot.datasets`` (-> the HF
+# ``datasets`` library), which only ships under the ``dataset`` extra. Skip
+# this module in tiers without it instead of erroring at import.
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+pytest.importorskip("pandas", reason="pandas is required (install lerobot[dataset])")
+
+from lerobot.annotations.steerable_pipeline.reader import iter_episodes  # noqa: E402
+from lerobot.annotations.steerable_pipeline.staging import EpisodeStaging  # noqa: E402
+from lerobot.annotations.steerable_pipeline.validator import StagingValidator  # noqa: E402
+from lerobot.annotations.steerable_pipeline.writer import speech_atom  # noqa: E402
+
+
+def _validate(root: Path, staging_dir: Path):
+    records = list(iter_episodes(root))
+    return StagingValidator().validate(records, staging_dir)
+
+
+def test_validator_catches_misaligned_timestamps(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    EpisodeStaging(staging_dir, 0).write(
+        "vqa",
+        [
+            {
+                "role": "assistant",
+                "content": json.dumps({"label": "cup", "count": 2}, sort_keys=True),
+                "style": "vqa",
+                "timestamp": 9.999,  # not on any 10 fps frame
+                "tool_calls": None,
+            }
+        ],
+    )
+    report = _validate(fixture_dataset_root, staging_dir)
+    assert not report.ok
+    assert any("does not match any source frame timestamp" in e for e in report.errors)
+
+
+def test_validator_catches_orphan_speech(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    EpisodeStaging(staging_dir, 0).write(
+        "interjections",
+        [
+            speech_atom(0.0, "Got it."),
+            # interjection at 0.3s with NO paired speech
+            {
+                "role": "user",
+                "content": "skip it",
+                "style": "interjection",
+                "timestamp": 0.3,
+                "tool_calls": None,
+            },
+        ],
+    )
+    report = _validate(fixture_dataset_root, staging_dir)
+    assert not report.ok
+    assert any("paired speech" in e for e in report.errors)
+
+
+def test_validator_catches_inconsistent_plan_memory(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    EpisodeStaging(staging_dir, 0).write(
+        "plan",
+        [
+            {
+                "role": "assistant",
+                "content": "1. do x",
+                "style": "plan",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "do x",
+                "style": "subtask",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+        ],
+    )
+    EpisodeStaging(staging_dir, 0).write(
+        "interjections",
+        [
+            speech_atom(0.0, "Got it."),
+            speech_atom(0.4, "Replanning."),
+            {
+                "role": "user",
+                "content": "replan",
+                "style": "interjection",
+                "timestamp": 0.4,
+                "tool_calls": None,
+            },
+        ],
+    )
+    report = _validate(fixture_dataset_root, staging_dir)
+    # missing co-timestamped plan refresh at 0.4s → error
+    assert not report.ok
+    assert any("co-timestamped plan update" in e for e in report.errors)
+
+
+def test_validator_catches_wrong_column(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    EpisodeStaging(staging_dir, 0).write(
+        "plan",
+        [
+            {"role": "user", "content": "where?", "style": "vqa", "timestamp": 0.0, "tool_calls": None},
+        ],
+    )
+    report = _validate(fixture_dataset_root, staging_dir)
+    assert not report.ok
+    assert any("plan emitted style 'vqa'" in e or "must be persistent" in e for e in report.errors)
diff --git a/tests/annotations/test_vlm_client.py b/tests/annotations/test_vlm_client.py
new file mode 100644
index 000000000..5fa2ff904
--- /dev/null
+++ b/tests/annotations/test_vlm_client.py
@@ -0,0 +1,41 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Unit tests for ``vlm_client`` helpers."""
+
+from __future__ import annotations
+
+import pytest
+
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+
+from lerobot.annotations.steerable_pipeline.vlm_client import _bind_serve_port  # noqa: E402
+
+
+def test_bind_serve_port_substitutes_placeholder() -> None:
+    # The {port} placeholder is replaced everywhere it appears, regardless of
+    # parallel vs single server — the bug was the single-server path passing
+    # it through unsubstituted.
+    cmd = "vllm serve M --max-model-len 32768 --port {port}"
+    assert _bind_serve_port(cmd, 8000) == "vllm serve M --max-model-len 32768 --port 8000"
+
+
+def test_bind_serve_port_appends_when_missing() -> None:
+    assert _bind_serve_port("vllm serve M", 8001) == "vllm serve M --port 8001"
+
+
+def test_bind_serve_port_leaves_explicit_port_untouched() -> None:
+    cmd = "vllm serve M --port 9000"
+    assert _bind_serve_port(cmd, 8000) == cmd
diff --git a/tests/annotations/test_writer.py b/tests/annotations/test_writer.py
new file mode 100644
index 000000000..0ea550327
--- /dev/null
+++ b/tests/annotations/test_writer.py
@@ -0,0 +1,357 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Writer correctness tests."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+# ``pyarrow`` and the ``lerobot.annotations`` -> ``lerobot.datasets`` chain
+# (-> the HF ``datasets`` library) only ship under the ``dataset`` extra.
+# Skip this module in tiers without it instead of erroring at import.
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+pytest.importorskip("pandas", reason="pandas is required (install lerobot[dataset])")
+
+import pyarrow.parquet as pq  # noqa: E402
+
+from lerobot.annotations.steerable_pipeline.reader import iter_episodes  # noqa: E402
+from lerobot.annotations.steerable_pipeline.staging import EpisodeStaging  # noqa: E402
+from lerobot.annotations.steerable_pipeline.writer import (  # noqa: E402
+    LanguageColumnsWriter,
+    speech_atom,
+)
+
+
+def _stage_episode(
+    staging_dir: Path,
+    episode_index: int,
+    *,
+    plan: list[dict] | None = None,
+    interjections: list[dict] | None = None,
+    vqa: list[dict] | None = None,
+) -> None:
+    staging = EpisodeStaging(staging_dir, episode_index)
+    if plan is not None:
+        staging.write("plan", plan)
+    if interjections is not None:
+        staging.write("interjections", interjections)
+    if vqa is not None:
+        staging.write("vqa", vqa)
+
+
+def test_writer_persistence_identity(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    """Every frame in an episode has a byte-identical persistent list."""
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {
+                "role": "assistant",
+                "content": "grasp the sponge",
+                "style": "subtask",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "1. wipe\n2. dry",
+                "style": "plan",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "wiped the counter",
+                "style": "memory",
+                "timestamp": 0.5,
+                "tool_calls": None,
+            },
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+
+    table = pq.read_table(fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet")
+    persistent = table.column("language_persistent").to_pylist()
+    first = persistent[0]
+    assert first  # non-empty
+    for row in persistent:
+        assert row == first, "persistent slice must be byte-identical across all frames"
+
+
+def test_writer_events_exact_timestamp(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        interjections=[
+            speech_atom(0.0, "Got it."),
+            {
+                "role": "user",
+                "content": "skip the dishes",
+                "style": "interjection",
+                "timestamp": 0.5,
+                "tool_calls": None,
+            },
+            speech_atom(0.5, "Skipping the dishes."),
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+
+    table = pq.read_table(fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet")
+    timestamps = table.column("timestamp").to_pylist()
+    events = table.column("language_events").to_pylist()
+    for ts, ev in zip(timestamps, events, strict=True):
+        if abs(ts - 0.0) < 1e-9:
+            assert any(r["role"] == "assistant" and r.get("style") is None for r in ev), ev
+        elif abs(ts - 0.5) < 1e-9:
+            assert any(r.get("style") == "interjection" for r in ev), ev
+            assert any(r.get("style") is None for r in ev), ev
+        else:
+            assert ev == []
+
+
+def test_writer_column_routing(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {
+                "role": "assistant",
+                "content": "do X",
+                "style": "subtask",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "1. do X",
+                "style": "plan",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": "did X",
+                "style": "memory",
+                "timestamp": 0.3,
+                "tool_calls": None,
+            },
+        ],
+        interjections=[
+            speech_atom(0.0, "OK"),
+            {
+                "role": "user",
+                "content": "wait",
+                "style": "interjection",
+                "timestamp": 0.2,
+                "tool_calls": None,
+            },
+            speech_atom(0.2, "Waiting"),
+        ],
+        vqa=[
+            {
+                "role": "user",
+                "content": "where is the cup?",
+                "style": "vqa",
+                "timestamp": 0.4,
+                "camera": "observation.images.front",
+                "tool_calls": None,
+            },
+            {
+                "role": "assistant",
+                "content": json.dumps(
+                    {"detections": [{"label": "cup", "bbox_format": "xyxy", "bbox": [1, 2, 3, 4]}]},
+                    sort_keys=True,
+                ),
+                "style": "vqa",
+                "timestamp": 0.4,
+                "camera": "observation.images.front",
+                "tool_calls": None,
+            },
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+    table = pq.read_table(fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet")
+
+    persistent = table.column("language_persistent").to_pylist()[0]
+    persistent_styles = {r["style"] for r in persistent}
+    assert persistent_styles == {"subtask", "plan", "memory"}
+
+    all_events = [r for ev in table.column("language_events").to_pylist() for r in ev]
+    event_styles = {r.get("style") for r in all_events}
+    assert event_styles == {None, "interjection", "vqa"}
+
+
+def test_writer_drops_subtask_index_idempotent(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {
+                "role": "assistant",
+                "content": "do X",
+                "style": "subtask",
+                "timestamp": 0.0,
+                "tool_calls": None,
+            },
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    writer = LanguageColumnsWriter()
+    writer.write_all(records, staging_dir, fixture_dataset_root)
+
+    path = fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet"
+    table_a = pq.read_table(path)
+    assert "subtask_index" not in table_a.column_names
+    assert "language_persistent" in table_a.column_names
+    assert "language_events" in table_a.column_names
+    # The writer no longer emits a dataset-level ``tools`` column; the
+    # ``say`` tool schema lives as a code constant (``SAY_TOOL_SCHEMA``)
+    # so the parquet stays small and the pipeline doesn't extend the schema.
+    assert "tools" not in table_a.column_names
+
+    # second pass: the language columns must decode to identical values
+    records_again = list(iter_episodes(fixture_dataset_root))
+    writer.write_all(records_again, staging_dir, fixture_dataset_root)
+    table_b = pq.read_table(path)
+    assert (
+        table_a.column("language_persistent").to_pylist() == table_b.column("language_persistent").to_pylist()
+    )
+    assert table_a.column("language_events").to_pylist() == table_b.column("language_events").to_pylist()
+
+
+def test_writer_normalize_rejects_misrouted_persistent_style() -> None:
+    """``_normalize_persistent_row`` must reject any non-persistent style."""
+    from lerobot.annotations.steerable_pipeline.writer import _normalize_persistent_row
+
+    with pytest.raises(ValueError, match="non-persistent style"):
+        _normalize_persistent_row(
+            {"role": "assistant", "content": "oops", "style": "vqa", "timestamp": 0.0, "tool_calls": None}
+        )
+
+
+def test_writer_normalize_rejects_misrouted_event_style() -> None:
+    """``_normalize_event_row`` must reject any persistent style."""
+    from lerobot.annotations.steerable_pipeline.writer import _normalize_event_row
+
+    with pytest.raises(ValueError):
+        _normalize_event_row({"role": "assistant", "content": "oops", "style": "subtask", "tool_calls": None})
+
+
+def test_say_tool_schema_constant_is_well_formed() -> None:
+    """``SAY_TOOL_SCHEMA`` (and ``DEFAULT_TOOLS``) replace the parquet
+    ``tools`` column — chat-template consumers import them directly.
+    """
+    from lerobot.annotations.steerable_pipeline.writer import (
+        DEFAULT_TOOLS,
+        SAY_TOOL_SCHEMA,
+    )
+
+    assert DEFAULT_TOOLS == [SAY_TOOL_SCHEMA]
+    assert SAY_TOOL_SCHEMA["function"]["name"] == "say"
+    params = SAY_TOOL_SCHEMA["function"]["parameters"]
+    assert params["properties"]["text"]["type"] == "string"
+    assert params["required"] == ["text"]
+
+
+def test_writer_does_not_add_tools_column(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    """The written parquet must not contain a ``tools`` column. This test
+    stages no legacy ``tools`` column itself, so it checks absence only.
+    """
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {"role": "assistant", "content": "x", "style": "subtask", "timestamp": 0.0, "tool_calls": None}
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+    table = pq.read_table(fixture_dataset_root / "data" / "chunk-000" / "file-000.parquet")
+    assert "tools" not in table.column_names
+
+
+def test_annotation_metadata_sync_allows_non_streaming_load(
+    fixture_dataset_root: Path, tmp_path: Path
+) -> None:
+    """Annotated parquet columns must be declared in ``meta/info.json``.
+
+    ``LeRobotDataset`` loads non-streaming datasets by casting parquet
+    against metadata-derived HF features. If the annotation writer adds
+    language columns but metadata stays stale, that cast fails with a column
+    mismatch.
+    """
+    from lerobot.annotations.steerable_pipeline.executor import Executor
+    from lerobot.datasets.feature_utils import get_hf_features_from_features
+    from lerobot.datasets.io_utils import load_info, load_nested_dataset
+    from lerobot.datasets.language import LANGUAGE_EVENTS, LANGUAGE_PERSISTENT, language_feature_info
+
+    info_path = fixture_dataset_root / "meta" / "info.json"
+    info = json.loads(info_path.read_text())
+    info["features"] = {
+        "episode_index": {"dtype": "int64", "shape": (1,), "names": None},
+        "frame_index": {"dtype": "int64", "shape": (1,), "names": None},
+        "timestamp": {"dtype": "float32", "shape": (1,), "names": None},
+        "task_index": {"dtype": "int64", "shape": (1,), "names": None},
+    }
+    info_path.write_text(json.dumps(info, indent=2))
+
+    staging_dir = tmp_path / "stage"
+    _stage_episode(
+        staging_dir,
+        0,
+        plan=[
+            {"role": "assistant", "content": "do X", "style": "subtask", "timestamp": 0.0, "tool_calls": None}
+        ],
+    )
+    records = list(iter_episodes(fixture_dataset_root))
+    LanguageColumnsWriter().write_all(records, staging_dir, fixture_dataset_root)
+
+    Executor._ensure_annotation_metadata_in_info(fixture_dataset_root)
+
+    synced = load_info(fixture_dataset_root)
+    for key, feature in language_feature_info().items():
+        assert synced["features"][key] == feature
+
+    hf_features = get_hf_features_from_features(synced["features"])
+    dataset = load_nested_dataset(fixture_dataset_root / "data", features=hf_features)
+
+    assert LANGUAGE_PERSISTENT in dataset.column_names
+    assert LANGUAGE_EVENTS in dataset.column_names
+    assert len(dataset) == 24
+
+
+def test_speech_atom_shape_matches_plan_spec() -> None:
+    atom = speech_atom(2.5, "I'm cleaning up!")
+    assert atom["role"] == "assistant"
+    assert atom["style"] is None
+    assert atom["content"] is None
+    assert atom["timestamp"] == 2.5
+    assert isinstance(atom["tool_calls"], list)
+    call = atom["tool_calls"][0]
+    assert call["type"] == "function"
+    assert call["function"]["name"] == "say"
+    assert call["function"]["arguments"]["text"] == "I'm cleaning up!"
diff --git a/tests/fixtures/dataset_factories.py b/tests/fixtures/dataset_factories.py
index a6e349778..2f4d41ff8 100644
--- a/tests/fixtures/dataset_factories.py
+++ b/tests/fixtures/dataset_factories.py
@@ -552,3 +552,64 @@ def lerobot_dataset_factory(
 @pytest.fixture(scope="session")
 def empty_lerobot_dataset_factory() -> LeRobotDatasetFactory:
     return partial(LeRobotDataset.create, repo_id=DUMMY_REPO_ID, fps=DEFAULT_FPS)
+
+
+def build_annotation_dataset(
+    root: Path,
+    episode_specs: list[tuple[int, int, str]],
+    *,
+    fps: int = 10,
+) -> Path:
+    """Build a minimal LeRobot-shaped dataset on disk for annotation tests.
+
+    ``episode_specs`` is a list of ``(episode_index, num_frames, task_text)``.
+    Each episode is written to its own
+    ``data/chunk-000/file-{ep:03d}.parquet`` so the writer's per-shard
+    rewrite path is exercised. The dataset carries the minimum
+    ``meta/tasks.parquet`` + ``meta/info.json`` the reader / executor need;
+    it has no videos, so the modules fall back to text-only prompts.
+
+    Shared by the annotation-pipeline pytest fixtures
+    (``tests/annotations/conftest.py``) and the opt-in E2E smoke run so the
+    fixture shape lives in exactly one place.
+    """
+    from lerobot.datasets.io_utils import write_tasks
+    from lerobot.utils.io_utils import write_json
+
+    data_dir = root / "data" / "chunk-000"
+    data_dir.mkdir(parents=True, exist_ok=True)
+
+    tasks: dict[int, str] = {}
+    for episode_index, num_frames, task_text in episode_specs:
+        if task_text not in tasks.values():
+            tasks[len(tasks)] = task_text
+        task_index = next(k for k, v in tasks.items() if v == task_text)
+        frame = pd.DataFrame(
+            {
+                "episode_index": [episode_index] * num_frames,
+                "frame_index": list(range(num_frames)),
+                "timestamp": [round(i / fps, 6) for i in range(num_frames)],
+                "task_index": [task_index] * num_frames,
+                "subtask_index": [0] * num_frames,  # legacy column the writer must drop
+            }
+        )
+        frame.to_parquet(data_dir / f"file-{episode_index:03d}.parquet", index=False)
+
+    # Canonical tasks frame: indexed by task string with a ``task_index``
+    # column, matching what ``lerobot.datasets.io_utils.load_tasks`` expects.
+    tasks_df = pd.DataFrame(
+        {"task_index": list(tasks.keys())},
+        index=pd.Index(list(tasks.values()), name="task"),
+    )
+    write_tasks(tasks_df, root)
+
+    write_json(
+        {
+            "codebase_version": "v3.1",
+            "fps": fps,
+            "features": {},
+            "total_episodes": len(episode_specs),
+        },
+        root / "meta" / "info.json",
+    )
+    return root
diff --git a/tests/scripts/test_lerobot_annotate.py b/tests/scripts/test_lerobot_annotate.py
new file mode 100644
index 000000000..6405bdc52
--- /dev/null
+++ b/tests/scripts/test_lerobot_annotate.py
@@ -0,0 +1,86 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+from types import SimpleNamespace
+
+import pytest
+
+# ``lerobot.scripts.lerobot_annotate`` (and the ``_push_to_hub`` path this
+# test exercises) imports ``lerobot.datasets``, which needs the ``datasets``
+# package from the ``dataset`` extra. Skip in tiers without it instead of erroring.
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+
+
+def test_push_to_hub_tags_uploaded_dataset_revision(tmp_path, monkeypatch):
+    from lerobot.scripts.lerobot_annotate import _push_to_hub
+
+    root = tmp_path / "dataset"
+    (root / "meta").mkdir(parents=True)
+    (root / "meta" / "info.json").write_text(json.dumps({"codebase_version": "v3.0"}))
+
+    calls = {}
+
+    class FakeHfApi:
+        def create_repo(self, **kwargs):
+            calls["create_repo"] = kwargs
+
+        def upload_folder(self, **kwargs):
+            calls["upload_folder"] = kwargs
+            return SimpleNamespace(oid="abc123")
+
+        def delete_tag(self, repo_id, **kwargs):
+            import requests
+            from huggingface_hub.errors import RevisionNotFoundError
+
+            calls["delete_tag"] = {"repo_id": repo_id, **kwargs}
+            # Simulate the common case: no stale tag to delete.
+            raise RevisionNotFoundError("no such tag", response=requests.Response())
+
+        def create_tag(self, **kwargs):
+            calls["create_tag"] = kwargs
+
+    monkeypatch.setattr("huggingface_hub.HfApi", FakeHfApi)
+
+    cfg = SimpleNamespace(
+        repo_id="source/dataset",
+        new_repo_id="annotated/dataset",
+        push_private=True,
+        push_commit_message=None,
+    )
+
+    _push_to_hub(root, cfg)
+
+    assert calls["create_repo"] == {
+        "repo_id": "annotated/dataset",
+        "repo_type": "dataset",
+        "private": True,
+        "exist_ok": True,
+    }
+    assert calls["upload_folder"]["repo_id"] == "annotated/dataset"
+    # A stale tag (e.g. from a previous annotation run) is deleted first so
+    # the new tag always points at the upload we just made.
+    assert calls["delete_tag"] == {
+        "repo_id": "annotated/dataset",
+        "tag": "v3.0",
+        "repo_type": "dataset",
+    }
+    assert calls["create_tag"] == {
+        "repo_id": "annotated/dataset",
+        "tag": "v3.0",
+        "repo_type": "dataset",
+        "revision": "abc123",
+    }
diff --git a/uv.lock b/uv.lock
index 4072828e7..ca5a7b34a 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1207,6 +1207,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/33/6b/e0547afaf41bf2c42e52430072fa5658766e3d65bd4b03a563d1b6336f57/distlib-0.4.0-py2.py3-none-any.whl", hash = "sha256:9659f7d87e46584a30b5780e43ac7a2143098441670ff0a49d5f9034c54a6c16", size = 469047, upload-time = "2025-07-17T16:51:58.613Z" },
 ]
 
+[[package]]
+name = "distro"
+version = "1.9.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/fc/f8/98eea607f65de6527f8a2e8885fc8015d3e6f5775df186e443e0964a11c3/distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed", size = 60722, upload-time = "2023-12-24T09:54:32.31Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/12/b3/231ffd4ab1fc9d679809f356cebee130ac7daa00d6d6f3206dd4fd137e9e/distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2", size = 20277, upload-time = "2023-12-24T09:54:30.421Z" },
+]
+
 [[package]]
 name = "dm-control"
 version = "1.0.41"
@@ -2252,6 +2261,78 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67", size = 134899, upload-time = "2025-03-05T20:05:00.369Z" },
 ]
 
+[[package]]
+name = "jiter"
+version = "0.15.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/66/b5/55f06bb281d92fb3cc86d14e1def2bd908bb77693183e7cb1f5a3c388b0c/jiter-0.15.0.tar.gz", hash = "sha256:4251acc80e2b7c9b7b8823456ea0fceeb0734dac2df7636d3c711b38476b5a76", size = 166640, upload-time = "2026-05-19T10:09:48.361Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/44/53/4f6bddbcde3c71e56d0aa1337ec95950f3d27dd4153e25aadf0feac71751/jiter-0.15.0-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:0e90a1c315a0226ec822d973817967f9223b7701546c8c2a7913e7ab0926294d", size = 308793, upload-time = "2026-05-19T10:07:35.25Z" },
+    { url = "https://files.pythonhosted.org/packages/01/84/c01099b59a285a1ebba64ae93f62bfa036675340fd1b0045ae65890a0442/jiter-0.15.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:8c9004af7c8d67cce7f1aae1026fb55607f4aa600710d08ede3a3ce4aeefe7e0", size = 309570, upload-time = "2026-05-19T10:07:36.919Z" },
+    { url = "https://files.pythonhosted.org/packages/58/64/8fb7f9d45bb98190355454cd04dad8d8f27223d6bd52f83af07f637168a6/jiter-0.15.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c210f8b35dc6f30aafd4b4365ca89b9d1189f21ab49b8e68fa6322a847aef138", size = 336783, upload-time = "2026-05-19T10:07:38.694Z" },
+    { url = "https://files.pythonhosted.org/packages/c3/b6/f5739011d009b3a30f6a53c5240979030ba29ae46a8c67e3a15759f7c37d/jiter-0.15.0-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:5f30bae8bc1c2d613e28e5af3e8cceb09b742f1c8a8a5f839fb67afaffc03b61", size = 363555, upload-time = "2026-05-19T10:07:40.832Z" },
+    { url = "https://files.pythonhosted.org/packages/e5/12/98a9d9f766665e8a3b6252454e17cb0c464606a28cf2fa09399b003345fa/jiter-0.15.0-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:c60e71b6d10cfc284c9bf36bd885e8d44c46f688ce50aa91b5edd90181dea687", size = 452255, upload-time = "2026-05-19T10:07:42.62Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/d5/60f972840f79c5e7544fce567c56f1e4e50468f996baba3e78d823dd62a6/jiter-0.15.0-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:0ab068bce62a45aa3e7367eceaffb5dde60b7eb853be8dece45132e3d0ff4879", size = 373559, upload-time = "2026-05-19T10:07:44.201Z" },
+    { url = "https://files.pythonhosted.org/packages/ee/cf/d46ef1234ba335aabc2f013210db8e0821a22f5e644a2e9449df199ecc23/jiter-0.15.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:fa248c9eb220197d363f688818dac2fd4b2f0cd7d843ca7105d652034823427d", size = 346055, upload-time = "2026-05-19T10:07:46.005Z" },
+    { url = "https://files.pythonhosted.org/packages/f0/63/4d2749d8d54d230bad9b3a6b0d00cc28c6ff6b2fdffc26a8ccf76cc5a974/jiter-0.15.0-cp312-cp312-manylinux_2_31_riscv64.whl", hash = "sha256:2a77aadd57cac1682e4401a72724d2796d89a4ba129b1a5812aa94ee480826eb", size = 351406, upload-time = "2026-05-19T10:07:47.855Z" },
+    { url = "https://files.pythonhosted.org/packages/d9/b9/9965b990035d8773328e0a8c8b457a87bf2b19f6c4126d9d99296be5d16a/jiter-0.15.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:2ae901f3a55bfafdde31d289590fa25e3245735a2b1e8c7cc15871710a002871", size = 389357, upload-time = "2026-05-19T10:07:49.665Z" },
+    { url = "https://files.pythonhosted.org/packages/2d/55/9ddf903deda1413e87fed792f416b7123daee5b8efbad6a202a7421c36a5/jiter-0.15.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:f0b271b462769543716f92d3a4f90527df6ef5ed05ee95ec4137f513e21e1b77", size = 517263, upload-time = "2026-05-19T10:07:51.537Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/76/a0c40ad064d3a20a4fde231e35d56e9a01ce82164278180e82d5daf85469/jiter-0.15.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:2fb6a5d26af81fc0f00f9360a891e05cf755e149bba391c4d563adc54812973d", size = 548646, upload-time = "2026-05-19T10:07:53.196Z" },
+    { url = "https://files.pythonhosted.org/packages/23/4f/eca9b954942916ba2f453891b8593ab444cd872396fe66a3936616f236f3/jiter-0.15.0-cp312-cp312-win32.whl", hash = "sha256:c2f6bb8b5216ab9e7873bc08b5d7bef2b8abbb578a3069bf1cd14a45d71d771d", size = 206427, upload-time = "2026-05-19T10:07:55.307Z" },
+    { url = "https://files.pythonhosted.org/packages/95/bf/8ead82a87495149542748e828d153fd232a512a22c83b02c4815c1a9c7d8/jiter-0.15.0-cp312-cp312-win_amd64.whl", hash = "sha256:40b2c7e92c44a84d748d21706c68dc6ff8161d80b59c99d774721a0d2317d7c7", size = 197300, upload-time = "2026-05-19T10:07:56.651Z" },
+    { url = "https://files.pythonhosted.org/packages/f4/e4/9b8a78fb2d894471bc344e37f1949bdd784bd914d031dba0ba3a40c71dd7/jiter-0.15.0-cp312-cp312-win_arm64.whl", hash = "sha256:cc0bc345cf2df9d1c00ac443f50d543c1ccfa8b0422cb85b1ab70d681c0b255b", size = 192702, upload-time = "2026-05-19T10:07:58.307Z" },
+    { url = "https://files.pythonhosted.org/packages/e5/f4/f708c900ecee41b2025ef8413d5351e5649eb2125c506f6720cc69b06f5c/jiter-0.15.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:1c11465f97e2abf45a014b83b730222f8f1c5335e802c7055a67d50de6f1f4e3", size = 307829, upload-time = "2026-05-19T10:07:59.704Z" },
+    { url = "https://files.pythonhosted.org/packages/86/59/db537c0949e83668c38481d426b9f2fd5ab758c4ee53a811dd0a510626a0/jiter-0.15.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:d1e7b1776f0797956c509e123d0952d10d293a9492dea9f288ab9570ec01d1a5", size = 308445, upload-time = "2026-05-19T10:08:01.184Z" },
+    { url = "https://files.pythonhosted.org/packages/37/38/ea0e13b18c30ef951da0d47d39e7fa9edb82a93a62990ffbd7cea9b622d4/jiter-0.15.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:351a341c2105aa430b7047e30f1bf7975f6313b00165d3fc07be2edaf741f279", size = 336181, upload-time = "2026-05-19T10:08:02.688Z" },
+    { url = "https://files.pythonhosted.org/packages/58/fc/2303901b16c4ba05865588990a420c0b4156270b44379c20931544a1d962/jiter-0.15.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:4ab395feec8d249ec4044e228e98a7033f043426a265df439dc3698823f0a4e4", size = 362985, upload-time = "2026-05-19T10:08:04.394Z" },
+    { url = "https://files.pythonhosted.org/packages/5b/6f/11bace093c52e7d4d26c8e606ccd7ae8c972189622469ec0d9e28161e28b/jiter-0.15.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:a2a438005b6f22d0273413484d6094d7c2c5d10ec1b3a3bf128e0d1d3ba53258", size = 453292, upload-time = "2026-05-19T10:08:05.967Z" },
+    { url = "https://files.pythonhosted.org/packages/22/db/987f2f086ca4d7a6582eb4ccd513f9b26b42d9e4243a087609a3137a8fc7/jiter-0.15.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:f18f85e4218d1b40f000f42a92239a7a61a902cd42c65e6c360dbd17dcb20894", size = 373501, upload-time = "2026-05-19T10:08:07.857Z" },
+    { url = "https://files.pythonhosted.org/packages/8f/7c/89fbcabb2739b7a5b8dc959a1b6c5761f6484f5fed3486854b3c789bb1de/jiter-0.15.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:d1aa62e277fc1cbd80e6deacae6f4d983b41b3d7728e0645c5d741a6149bba45", size = 344683, upload-time = "2026-05-19T10:08:09.431Z" },
+    { url = "https://files.pythonhosted.org/packages/30/6f/6cca7692e7dddfec6d8d76c54dc97f2af2a41df4ac0674b999df1f09a5f3/jiter-0.15.0-cp313-cp313-manylinux_2_31_riscv64.whl", hash = "sha256:6550fa135c7deb8ead6af49ed7ff648532ea8334a1447fe34a36315ef79c5c29", size = 350892, upload-time = "2026-05-19T10:08:11.352Z" },
+    { url = "https://files.pythonhosted.org/packages/39/14/0338d6190cb8e6d22e677ab1d4eabd4117f67cca70c54cd04b82ff64e068/jiter-0.15.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:066f8f33f18b2419cd8213b2436fa7fbc9c499f315971cfa3ce1f9820c001b1b", size = 388723, upload-time = "2026-05-19T10:08:12.912Z" },
+    { url = "https://files.pythonhosted.org/packages/90/31/cc19f4a1bdb6afb09ce6a2f2615aa8d44d994eba0d8e6105ed1af920e736/jiter-0.15.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:75e8a04e91432dde9f1838373cf93d23726c79d3e908d319acf0e796f85592e7", size = 516648, upload-time = "2026-05-19T10:08:14.808Z" },
+    { url = "https://files.pythonhosted.org/packages/49/9f/833c541512cd091b63c10c0381973dfe11bc7a503a818c16384417e0c81e/jiter-0.15.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:a97261f1fccb8e50ecd2890a96e46efdc3f57c80a197324c6777827231eca712", size = 547382, upload-time = "2026-05-19T10:08:16.927Z" },
+    { url = "https://files.pythonhosted.org/packages/d2/11/e7b70e91f90bc4477e8eee9e8a5f7cf3cb41b4525d6394dc98a714eb8f7f/jiter-0.15.0-cp313-cp313-win32.whl", hash = "sha256:c77496cb10bd7549690fbbab3e5ec05857b83e49276f4a9423a766ddd2afcd4c", size = 205845, upload-time = "2026-05-19T10:08:18.401Z" },
+    { url = "https://files.pythonhosted.org/packages/4b/23/5c20d9ad6f02c493e4023e5d2d09e1c1f15fe2753c9102c544aff068a88e/jiter-0.15.0-cp313-cp313-win_amd64.whl", hash = "sha256:b15741f501469009ae0ae90b7147958a664a7dede40aa7ff174a8a4645f546d0", size = 196842, upload-time = "2026-05-19T10:08:20.131Z" },
+    { url = "https://files.pythonhosted.org/packages/6b/11/1eb400ef248e8c925fd883fbe325daf5e42cd1b0d308539dd332bd4f7ffc/jiter-0.15.0-cp313-cp313-win_arm64.whl", hash = "sha256:5d6a60072b44c3c2b797a7ddcbcbbf2b34ea3cfd4721580fbfd2a09d9d9b84ba", size = 192212, upload-time = "2026-05-19T10:08:21.807Z" },
+    { url = "https://files.pythonhosted.org/packages/8a/60/2fd8d7c79da8acf9b7b277c7616847773779356b92acfc9bb158452174da/jiter-0.15.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:ef1fd24d9413f6209e00d3d5a453e67acfe004a25cc6c8e8484faed4311ab9e8", size = 315065, upload-time = "2026-05-19T10:08:23.218Z" },
+    { url = "https://files.pythonhosted.org/packages/46/f4/008fb7d65e8ac2abf00811651a661e025c4ba80bbc6f378450384ddd3aed/jiter-0.15.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:144f8e72cb53dab146347b91cceac01f5481237f2b93b4a339a1ee8f8878b67c", size = 339444, upload-time = "2026-05-19T10:08:24.701Z" },
+    { url = "https://files.pythonhosted.org/packages/00/55/90b0c7b9c6896c0f2a591dd36d36b71d22e09674bfef178fa03ba3f81499/jiter-0.15.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:553fcac2ef2cb990877f9fc0833b8b629a3e6a5670b6b5fd58219b41a653ddc4", size = 347779, upload-time = "2026-05-19T10:08:26.408Z" },
+    { url = "https://files.pythonhosted.org/packages/51/6b/69666cec5000fd57734c118437394516c749ae8dbeea9fb66d6fef9c4775/jiter-0.15.0-cp313-cp313t-win_amd64.whl", hash = "sha256:774f93f65031856bf14ad9f59bdcab8b8cad501e5ceabd51ba3525f76937a25b", size = 200395, upload-time = "2026-05-19T10:08:28.055Z" },
+    { url = "https://files.pythonhosted.org/packages/39/04/a6aa62cd27e8149b0d28df5561f10f6cceaf7935a9ccf3f1c5a05f9a0cd8/jiter-0.15.0-cp313-cp313t-win_arm64.whl", hash = "sha256:f1e1754960f38ec40613a07e5e372df67acb3b890fb383b6fb3de3e49ddbf3c7", size = 190516, upload-time = "2026-05-19T10:08:29.35Z" },
+    { url = "https://files.pythonhosted.org/packages/eb/d2/079f350ebf7859d081de30aa890f9e3be68516f754f3ba32366ffff4dcee/jiter-0.15.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:ac0d9ddea4350974be7a221fc25895f251a8fee748c889bdced2141c0fec1a49", size = 308884, upload-time = "2026-05-19T10:08:31.667Z" },
+    { url = "https://files.pythonhosted.org/packages/04/4e/a2c30a7f69b48c03b20935d647479106fe932f6e63f75faf53937197e05d/jiter-0.15.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:01a8222cf05ab1128e239421156c207949808acaaea2bdfd33130ae666786e86", size = 310028, upload-time = "2026-05-19T10:08:33.304Z" },
+    { url = "https://files.pythonhosted.org/packages/40/90/2e7cdfd3cf8ca967be38c48f5cf474d79f089efaf559a40f15984a77ae69/jiter-0.15.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:182226cbc930c9fab81bc2e41a4da672f89539906dadb05e75670ac07b94f71f", size = 337485, upload-time = "2026-05-19T10:08:35.259Z" },
+    { url = "https://files.pythonhosted.org/packages/9b/11/15a1aa28b120b8ee5b4f1fb894c125046225f09847738bd64233d3b84883/jiter-0.15.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:71683c38c825452999b5717fcae07ea708e8c93003e808be4319c1b02e3d176e", size = 364223, upload-time = "2026-05-19T10:08:36.694Z" },
+    { url = "https://files.pythonhosted.org/packages/b7/25/f442e8af5f3d0dcf47b39e83a0efd9ee45ea946aa6d04625dc3181eae3b6/jiter-0.15.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:30f2218e6a9e5c18bc10fe6d41ac189c442c88eacf11bad9f28ef95a9bef00e6", size = 456387, upload-time = "2026-05-19T10:08:38.143Z" },
+    { url = "https://files.pythonhosted.org/packages/da/f4/37f2d2c9f64f49af7da652ed7532bb5a2372e588e6927c3fdd76f911db65/jiter-0.15.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5157de9f76eb4bc5ea74a1219366a25f945ad305641d74e04f59c54087091aa9", size = 374461, upload-time = "2026-05-19T10:08:39.869Z" },
+    { url = "https://files.pythonhosted.org/packages/60/28/edcfbbbf0cb15436f36664a8908a0df47ab9006298d4cd937dc08ea932d6/jiter-0.15.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:90c5db5527c221249a876160663ab891ace358c17f7b9c93ec1478b7f0550e5c", size = 345924, upload-time = "2026-05-19T10:08:41.668Z" },
+    { url = "https://files.pythonhosted.org/packages/47/13/89fba6398dab7f202b7278c4b4aac122399d2c0183971c4a57a3b7088df5/jiter-0.15.0-cp314-cp314-manylinux_2_31_riscv64.whl", hash = "sha256:3e4540b8e74e4268811ac05db226a6a128ff572e7e0ce3f1163b693cadb184cd", size = 352283, upload-time = "2026-05-19T10:08:43.091Z" },
+    { url = "https://files.pythonhosted.org/packages/1b/da/0f6af8cef2c565a1ab44d970f268c43ccaa72707386ea6388e6fe2b6cd26/jiter-0.15.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:62ebd14e47e9aed9df4472afcb2663668ce4d74891cd54f86bf6e44029d6dc89", size = 389985, upload-time = "2026-05-19T10:08:44.915Z" },
+    { url = "https://files.pythonhosted.org/packages/a1/ec/b9cb7d6d29e24ee14910266157d2a279d7a8f60ee0df7fa840882976ba64/jiter-0.15.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:0be6f5ad41a809f303f416d17cec92a7a725902fb9b4f3de3d19362ac0ef8554", size = 517695, upload-time = "2026-05-19T10:08:46.486Z" },
+    { url = "https://files.pythonhosted.org/packages/64/5e/6d1bda880723aae0ad86b4b763f044362448efe31e3e819635d41cb03451/jiter-0.15.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:813dfbb17d65328bf86e5f0905dd277ba2265d3ca20556e86c0c7035b7182e5a", size = 548868, upload-time = "2026-05-19T10:08:48.026Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/72/7de501cf38dcacaf35098796f3a50e0f2e338baba18a58946c618544b809/jiter-0.15.0-cp314-cp314-win32.whl", hash = "sha256:50e51156192722a9c58db112837d3f8ef96fb3c5ecc14e95f409134b08b158ec", size = 206380, upload-time = "2026-05-19T10:08:49.738Z" },
+    { url = "https://files.pythonhosted.org/packages/1e/a9/e19addf4b0c1bdce52c6da12351e6bc42c340c45e7c09e2158e46d293ccc/jiter-0.15.0-cp314-cp314-win_amd64.whl", hash = "sha256:30ce1a5d16b5641dc935d50ef775af6a0871e3d14ab05d6fc54dff371b78e558", size = 197687, upload-time = "2026-05-19T10:08:51.088Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/c9/776b1db01db25fc6c1d58d1979a37b0a9fe787e5f5b1d062d2eaacb77923/jiter-0.15.0-cp314-cp314-win_arm64.whl", hash = "sha256:510c8b3c17a0ed9ac69850c0438dada3c9b82d9c4d589fcb62002a5a9cf3a866", size = 192571, upload-time = "2026-05-19T10:08:52.451Z" },
+    { url = "https://files.pythonhosted.org/packages/a0/f6/45bb4670bacf300fd2c7abadbfb3af376e5f1b6ae75fd9bc069891d15870/jiter-0.15.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:7553333dd0930c104a5a0db8df72bf7219fe663d731383b576bb6ed6351c984d", size = 317151, upload-time = "2026-05-19T10:08:53.867Z" },
+    { url = "https://files.pythonhosted.org/packages/d7/68/ed635ad5acd7b73e454283083bbb7c8205ad10e88b0d9d7d793b09fe8226/jiter-0.15.0-cp314-cp314t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:f2143ab06181d2b029eedcb6af3cebe95f11bbac62441781860f98ee9330a6a6", size = 341243, upload-time = "2026-05-19T10:08:55.383Z" },
+    { url = "https://files.pythonhosted.org/packages/5d/db/3ff4176b817b8ea33879e71e13d8bc2b0d481a7ed3fe9e080f333d415c16/jiter-0.15.0-cp314-cp314t-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:6eac374c5c975709b69c10f09afd199df74150172156ad10c8d4fd785b7da995", size = 363629, upload-time = "2026-05-19T10:08:56.928Z" },
+    { url = "https://files.pythonhosted.org/packages/ab/24/5f8270e0ba9c883582f96f722f8a0b58015c7ce1f8c6d4571cf394e99b6b/jiter-0.15.0-cp314-cp314t-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b3b3b775e33d3bfaec9899edc526ae97b0da0bf9d071a46124ba419149a414f8", size = 456198, upload-time = "2026-05-19T10:08:58.618Z" },
+    { url = "https://files.pythonhosted.org/packages/45/5b/76fc02b0b5c54c3d18c60653156e2f76fde1816f9b4722db68d6ee2c897e/jiter-0.15.0-cp314-cp314t-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:eda3071db3346334beae1360b46da4606da57bf3528c167b3c38533afaf9f2c5", size = 373710, upload-time = "2026-05-19T10:09:00.151Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/52/4310821b0ea9277994d3e1f49fc6a4b34e4800caebacb2c0af81da59a454/jiter-0.15.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c6694a173ecabc12eb60efbc0b474464ead1951ff65cd8b1e72100715c64512b", size = 349901, upload-time = "2026-05-19T10:09:01.621Z" },
+    { url = "https://files.pythonhosted.org/packages/93/fe/67648c35b3594fba8854ac64cc8a826d8bcd18324bbdb53d77697c60b6ef/jiter-0.15.0-cp314-cp314t-manylinux_2_31_riscv64.whl", hash = "sha256:a254e10b593624d230c365b6d616b22ca0ad65e63a16e6631c2b3466022e6ba8", size = 352438, upload-time = "2026-05-19T10:09:03.216Z" },
+    { url = "https://files.pythonhosted.org/packages/cb/28/0a1879d07ad6b3e025a2750027363452ced93c2d16d1c9d4b153ffd51c91/jiter-0.15.0-cp314-cp314t-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:d8d2955167274e15d79a7a020afdd9b39c990eb80b2d89fca695d92dcfdd38ec", size = 388152, upload-time = "2026-05-19T10:09:04.741Z" },
+    { url = "https://files.pythonhosted.org/packages/c1/78/46c6f6b56ba85c90021f4afd72ed42f691f8f84daacb5fe27277070e3858/jiter-0.15.0-cp314-cp314t-musllinux_1_1_aarch64.whl", hash = "sha256:acf4ee4d1fc55917239fe72972fb292dd773055d05eb040d36f4326e02cc2c0e", size = 517707, upload-time = "2026-05-19T10:09:06.231Z" },
+    { url = "https://files.pythonhosted.org/packages/ca/cb/720662d4c88fcad606e826fef5424365527ba43ce4868a479aed8f8c507e/jiter-0.15.0-cp314-cp314t-musllinux_1_1_x86_64.whl", hash = "sha256:e7196e56f1cd69af1dbb07dff02dcfb260a50b45a82d409d92a06fedb32473b5", size = 548241, upload-time = "2026-05-19T10:09:08.093Z" },
+    { url = "https://files.pythonhosted.org/packages/60/e3/935b8034fd143f21125c87d51404a9e0e1449186a494405721ff5d1d695e/jiter-0.15.0-cp314-cp314t-win32.whl", hash = "sha256:7f6163c0f10b055245f814dcc59f4818da60dfe72f3e72ab89fc24b6bd5e9c52", size = 207950, upload-time = "2026-05-19T10:09:09.616Z" },
+    { url = "https://files.pythonhosted.org/packages/93/59/984fd9ece895953dad3e0880a650e766f5a2da2c5514f0eafdaaabbeb5f9/jiter-0.15.0-cp314-cp314t-win_amd64.whl", hash = "sha256:980c256edb05b78a111b99c4de3b1d32e31634b867fd1fc2cf726e7b7bba9854", size = 200055, upload-time = "2026-05-19T10:09:11.367Z" },
+    { url = "https://files.pythonhosted.org/packages/0e/a4/cf8d779feb133a27a2e3bc833bccb9e13aa332cdf820497ebf72c10ce8c3/jiter-0.15.0-cp314-cp314t-win_arm64.whl", hash = "sha256:66b1880df2d01e206e8339769d1c7c1753bcb653efd6289e203f6f24ebada0c0", size = 191244, upload-time = "2026-05-19T10:09:12.74Z" },
+    { url = "https://files.pythonhosted.org/packages/73/38/505941b2b092fd5bbbd60a52a880db1173f1690ae6751bed3af1c9ddcb4e/jiter-0.15.0-graalpy312-graalpy250_312_native-macosx_10_12_x86_64.whl", hash = "sha256:631f13a3d04e97d4e083993b10f4b99530e3a10d953e2eb5e196b7dc7f812ce0", size = 303769, upload-time = "2026-05-19T10:09:42.203Z" },
+    { url = "https://files.pythonhosted.org/packages/e7/95/a06692b29e77473f286e1ec1f426d3ca44d7b5843be8ad21d7a5f3fcdcc0/jiter-0.15.0-graalpy312-graalpy250_312_native-macosx_11_0_arm64.whl", hash = "sha256:b6c0ffae686c39bf3737be60793783267628783ea42545632c10b291105aee45", size = 305128, upload-time = "2026-05-19T10:09:43.657Z" },
+    { url = "https://files.pythonhosted.org/packages/23/85/7270d7ad41d6061a25b950c6bf91d638bd9aacb113200a8c8d57a055fd67/jiter-0.15.0-graalpy312-graalpy250_312_native-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:1d54fb5b31dea401a41af3f8a7d2512e9b6a6a005491e6166c7e4ffab9639a9c", size = 340459, upload-time = "2026-05-19T10:09:45.452Z" },
+    { url = "https://files.pythonhosted.org/packages/c8/8d/302cb2057b7513327b4d575cff6b1d066ee6431a5357fc3f8867cd684406/jiter-0.15.0-graalpy312-graalpy250_312_native-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:54d5d6090cdc1b7c9e780dfb04949a990adb1e301a2fc0bbcee7de4638d33f9a", size = 344469, upload-time = "2026-05-19T10:09:46.864Z" },
+]
+
 [[package]]
 name = "json5"
 version = "0.14.0"
@@ -2760,6 +2841,16 @@ aloha = [
     { name = "scipy" },
     { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" },
 ]
+annotations = [
+    { name = "av" },
+    { name = "datasets" },
+    { name = "jsonlines" },
+    { name = "openai" },
+    { name = "pandas" },
+    { name = "pyarrow" },
+    { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" },
+    { name = "transformers" },
+]
 async = [
     { name = "contourpy" },
     { name = "grpcio" },
@@ -3121,6 +3212,7 @@ requires-dist = [
     { name = "lerobot", extras = ["damiao"], marker = "extra == 'openarms'" },
     { name = "lerobot", extras = ["dataset"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["dataset"], marker = "extra == 'aloha'" },
+    { name = "lerobot", extras = ["dataset"], marker = "extra == 'annotations'" },
     { name = "lerobot", extras = ["dataset"], marker = "extra == 'core-scripts'" },
     { name = "lerobot", extras = ["dataset"], marker = "extra == 'dataset-viz'" },
     { name = "lerobot", extras = ["dataset"], marker = "extra == 'hilserl'" },
@@ -3205,6 +3297,7 @@ requires-dist = [
     { name = "lerobot", extras = ["test"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["topreward"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["training"], marker = "extra == 'all'" },
+    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'annotations'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'eo1'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'groot'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'hilserl'" },
@@ -3239,6 +3332,7 @@ requires-dist = [
     { name = "numpy", specifier = ">=2.0.0,<2.3.0" },
     { name = "onnx", marker = "extra == 'unitree-g1'", specifier = ">=1.16.0,<2.0.0" },
     { name = "onnxruntime", marker = "extra == 'unitree-g1'", specifier = ">=1.16.0,<2.0.0" },
+    { name = "openai", marker = "extra == 'annotations'", specifier = ">=1.40,<2.0" },
     { name = "opencv-python-headless", specifier = ">=4.9.0,<4.14.0" },
     { name = "packaging", specifier = ">=24.2,<26.0" },
     { name = "pandas", marker = "extra == 'dataset'", specifier = ">=2.0.0,<3.0.0" },
@@ -3287,7 +3381,7 @@ requires-dist = [
     { name = "transformers", marker = "extra == 'transformers-dep'", specifier = ">=5.4.0,<5.6.0" },
     { name = "wandb", marker = "extra == 'training'", specifier = ">=0.24.0,<0.28.0" },
 ]
-provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "accelerate-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "molmoact2", "smolvla", "multi-task-dit", "groot", "sarm", "robometer", "topreward", "xvla", "eo1", "hilserl", "vla-jepa", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
+provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "accelerate-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "molmoact2", "smolvla", "multi-task-dit", "groot", "sarm", "robometer", "topreward", "xvla", "eo1", "hilserl", "vla-jepa", "async", "peft", "annotations", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
 
 [[package]]
 name = "librt"
@@ -4380,6 +4474,25 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/93/99/f2be40a31b908d96b861ae0ce98582fa376c18a7f816b9d5eb4cd6aa0a4c/onnxruntime-1.26.0-cp314-cp314t-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:4eefd386a45202aefb7a5132b94f32df9d506c9edcc7faf2fc60d65183f4b183", size = 18197382, upload-time = "2026-05-08T19:07:46.965Z" },
 ]
 
+[[package]]
+name = "openai"
+version = "1.109.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "anyio" },
+    { name = "distro" },
+    { name = "httpx" },
+    { name = "jiter" },
+    { name = "pydantic" },
+    { name = "sniffio" },
+    { name = "tqdm" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/c6/a1/a303104dc55fc546a3f6914c842d3da471c64eec92043aef8f652eb6c524/openai-1.109.1.tar.gz", hash = "sha256:d173ed8dbca665892a6db099b4a2dfac624f94d20a93f46eb0b56aae940ed869", size = 564133, upload-time = "2025-09-24T13:00:53.075Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/1d/2a/7dd3d207ec669cacc1f186fd856a0f61dbc255d24f6fdc1a6715d6051b0f/openai-1.109.1-py3-none-any.whl", hash = "sha256:6bcaf57086cf59159b8e27447e4e7dd019db5d29a438072fbd49c290c7e65315", size = 948627, upload-time = "2025-09-24T13:00:50.754Z" },
+]
+
 [[package]]
 name = "opencv-python"
 version = "4.13.0.92"
@@ -6098,6 +6211,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/c1/d4/59e74daffcb57a07668852eeeb6035af9f32cbfd7a1d2511f17d2fe6a738/smmap-5.0.3-py3-none-any.whl", hash = "sha256:c106e05d5a61449cf6ba9a1e650227ecfb141590d2a98412103ff35d89fc7b2f", size = 24390, upload-time = "2026-03-09T03:43:24.361Z" },
 ]
 
+[[package]]
+name = "sniffio"
+version = "1.3.1"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/a2/87/a6771e1546d97e7e041b6ae58d80074f81b7d5121207425c964ddf5cfdbd/sniffio-1.3.1.tar.gz", hash = "sha256:f4324edc670a0f49750a81b895f35c3adb843cca46f0530f79fc1babb23789dc", size = 20372, upload-time = "2024-02-25T23:20:04.057Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e9/44/75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40/sniffio-1.3.1-py3-none-any.whl", hash = "sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2", size = 10235, upload-time = "2024-02-25T23:20:01.196Z" },
+]
+
 [[package]]
 name = "soupsieve"
 version = "2.8.3"