annotate: address review feedback — bug fixes, docs/code drift, naming, cleanup

Bugs

* validator: don't re-raise on unknown style. The second
  column_for_style lookup (used to route persistent vs event) now sits
  in try/except so an unknown style is recorded by _check_column_routing
  and skipped instead of crashing the whole validation pass.
* general_vqa._target_cameras: when restrict_to_default_camera is set
  but the configured camera_key isn't one the provider exposes, warn and
  fall back to all cameras instead of returning a phantom key that
  KeyErrors deep in frame decode.
* interjections: clamp interjection timestamps to frame_timestamps[0]
  rather than a hardcoded 0.0 (datasets can start at non-zero t).

Docs / code drift

* annotation_pipeline.mdx: drop the phantom 'vocabulary discovery /
  phase 0 / --vocabulary.* / canonical_vocabulary.json' section (none of
  it exists); describe the real describe->segment + coverage-stitch
  flow. Soften the src/lerobot/tools/ + TOOL_REGISTRY reference to 'not
  part of this PR' (matches tools.mdx, which already marks the runtime
  layer as not-yet-implemented). Fix the --push_to_hub/--new_repo_id
  wording. Note the default is now a single h200. Add a 'Contributing
  new modules' section inviting module / prompt / quality contributions.
* executor docstring: six phases, no phantom phase 0.

run_hf_job.py

* add the Apache 2.0 license header (was flagged repeatedly).
* default to a single GPU: flavor=h200, parallel_servers=1, num_gpus=1
  (scale to h200x4 noted in the docstring).
* pin the install to @main instead of the feature branch (won't break
  after merge).

Naming / cleanup

* rename dest_repo_id -> new_repo_id across config / script / example /
  test to match the LeRobot dataset edit tools.
* rename prompt templates module_N_*.txt -> descriptive (plan_*,
  interjections_*, vqa.txt) and update every load_prompt() call.
* remove dead _messages_to_prompt (used only by the removed in-process
  backends).
* declare _warned_decode_fail (frames) and _warned_no_camera (vqa) as
  real init=False dataclass fields instead of getattr monkey-patches.
* scope bandit B607 to the two ffmpeg subprocess.run sites via
  '# nosec B607' and drop it from the global skip list.

Tests

* fix stale canned-VLM markers ('ONE realistic interruption' ->
  'compact interjection', 'Update the memory' -> 'compressed semantic
  memory') and drop the dead 'concise hierarchical PLAN' plan responders
  (plan generation is deterministic now) in run_e2e_smoke,
  test_pipeline_recipe_render, test_modules.
* run_e2e_smoke now asserts interjection + speech rows are produced so a
  stale marker can't silently pass again.
* drop remaining 'PR 1' / 'PR 2' references from test comments / names.

Verified: tests/annotations + tests/datasets/test_language +
tests/scripts/test_lerobot_annotate (31 passed); make-style E2E smoke
(interjections=1 speech_atoms=2); pre-commit (ruff, mypy, bandit,
prettier) clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-17 16:27:04 +00:00 · 2026-06-03 18:30:46 +02:00
parent 3a24e426df
commit eba3ab3741
27 changed files with 148 additions and 104 deletions
@@ -7,8 +7,7 @@

 ## What the pipeline produces

-A vocabulary-discovery phase derives a small canonical wording, then three
-modules write into a per-episode staging tree, then a single writer
+Three modules write into a per-episode staging tree, then a single writer
 rewrites the data shards in place:

 | Style / atom                                | Column                | Module          |
@@ -21,20 +20,15 @@ rewrites the data shards in place:
 | speech tool-call atom (`style=null`, `say`) | `language_events`     | `interjections` |
 | `vqa` (user / assistant pair)               | `language_events`     | `vqa`           |

-The `plan` module is constrained to a **canonical vocabulary** discovered
-once per dataset by the `vocabulary` module (phase 0). It watches a few
-sample episode videos (`--vocabulary.sample_episodes`, default `3`) and
-asks the VLM to derive a small set of imperative subtask labels and
-first-person memory milestones that recur across the demos. The VLM
-picks the right number of entries itself based on what it sees in the
-clips — short pick-and-place demos get ~6 subtask labels, longer
-multi-step recipes get more. The result lands at
-`meta/canonical_vocabulary.json` (human-readable / hand-editable) and
-is reused on every subsequent run. The `plan` module then constrains
-both subtask + memory generation to those exact strings — the
-downstream low-level policy sees a small, repeatable target
-distribution instead of thousands of LLM paraphrases. Disable with
-`--vocabulary.enabled=False` to fall back to free-form generation.
+The `plan` module generates subtasks per episode with a **describe → segment**
+grounding flow: a first pass narrates only what is visible in the chosen
+camera, and its description is fed into a second pass that segments the
+episode into consecutive atomic subtasks. The resulting spans are then
+deterministically stitched into a contiguous full-episode cover so every
+frame has exactly one active subtask. See
+[`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
+for the production flag set (single camera, embedded frames, windowed
+subtask generation).

 The writer does **not** add a `tools` column to the parquet — the tool
 catalog lives at `meta/info.json["tools"]` instead (see
@@ -44,9 +38,11 @@ user pre-declared.

 If you want to declare additional tools for a dataset before annotation
 runs, edit `meta/info.json["tools"]` directly — the pipeline preserves
-anything already there. Implementations of those tools live under
-`src/lerobot/tools/`; one file per tool, registered via
-`TOOL_REGISTRY`. See the [Tools](./tools) doc for the authoring guide.
+anything already there. That makes the tool visible to the chat template
+so the model can learn to _generate_ the call. The runtime layer that
+_executes_ a generated call (the `Tool` protocol / `TOOL_REGISTRY` under
+`src/lerobot/tools/`) is not part of this PR — see the
+[Tools](./tools) doc, which marks those pieces as not-yet-implemented.

 ## Running on Hugging Face Jobs

@@ -59,19 +55,33 @@ HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py
 ```

 [`examples/annotations/run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
-spawns a multi-GPU `h200` job that:
+spawns a single-GPU `h200` job (scale up to `h200x4` for larger datasets) that:

 1. installs the branch under test plus the annotation extras,
 2. boots one vLLM server per GPU (in the `vllm/vllm-openai` image) for the
   chosen model, which the pipeline drives over the OpenAI-compatible API,
 3. runs the `plan` / `interjections` / `vqa` modules across the dataset
   via `lerobot-annotate`,
-4. uploads the annotated dataset to `--push_to_hub`.
+4. with `--push_to_hub=true`, uploads the annotated dataset to
+   `--new_repo_id` (or back to `--repo_id` in place when that is unset).

 To target a different dataset, model, or hub repo, edit the `CMD` block
 inside the script — every flag in there maps directly onto a CLI flag of
 `lerobot-annotate` (see `lerobot-annotate --help` for the full list).

+## Contributing new modules
+
+The pipeline is built to be extended, and **contributions are very
+welcome** — whether that's a brand-new annotation module (e.g. a
+trajectory-trace or affordance module), a new prompt template, a better
+grounding flow, or quality improvements to the existing `plan` /
+`interjections` / `vqa` modules. Each module lives under
+`src/lerobot/annotations/steerable_pipeline/modules/`, shares the VLM
+client and keyframe cache, writes its raw output to the per-episode
+staging tree, and is wired into the executor as an independent phase.
+If you have an idea for a module or an improvement, open an issue or PR
+on [the repo](https://github.com/huggingface/lerobot).
+
 ## Style-to-recipe consumer mapping

 The pipeline's outputs are designed to be consumed by recipes (see
@@ -1,21 +1,37 @@
 #!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """Launch ``lerobot-annotate`` on a Hugging Face job (vllm + Qwen3.6-27B VLM).

-Spawns one ``h200x4`` job that:
+Spawns one single-GPU ``h200`` job that:

-  1. installs this branch of ``lerobot`` plus the annotation extras,
-  2. boots four vllm servers (one per GPU) with Qwen3.6-27B (dense VLM),
+  1. installs ``lerobot`` plus the annotation extras,
+  2. boots one vllm server with Qwen3.6-27B (dense VLM),
  3. runs the plan / interjections / vqa modules across the dataset
     in free-form mode (each episode generates its own subtasks +
     memory),
-  4. uploads the annotated dataset to ``--dest_repo_id`` (when set)
+  4. uploads the annotated dataset to ``--new_repo_id`` (when set)
     or back to ``--repo_id``.

 Usage:

    HF_TOKEN=hf_... uv run python examples/annotations/run_hf_job.py

-Adjust ``CMD`` below to point at your own dataset / target hub repo.
+Adjust ``CMD`` (dataset, model, hub repo) and ``flavor`` below for your
+run. For larger datasets, scale to ``h200x4`` and raise
+``--vlm.parallel_servers`` / ``--vlm.num_gpus`` to match.
 """

 import os
@@ -29,7 +45,7 @@ if not token:
 CMD = (
    "apt-get update -qq && apt-get install -y -qq git ffmpeg && "
    "pip install --no-deps "
-    "'lerobot @ git+https://github.com/huggingface/lerobot.git@feat/language-annotation-pipeline' && "
+    "'lerobot @ git+https://github.com/huggingface/lerobot.git@main' && "
    "pip install --upgrade-strategy only-if-needed "
    "datasets pyarrow av jsonlines draccus gymnasium torchcodec mergedeep pyyaml-include toml typing-inspect "
    "openai && "
@@ -37,12 +53,12 @@ CMD = (
    "export VLLM_VIDEO_BACKEND=pyav && "
    "lerobot-annotate "
    "--repo_id=pepijn223/robocasa_pretrain_human300_v4 "
-    "--dest_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated5 "
+    "--new_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated5 "
    "--push_to_hub=true "
    "--vlm.backend=openai "
    "--vlm.model_id=Qwen/Qwen3.6-27B "
-    "--vlm.parallel_servers=4 "
-    "--vlm.num_gpus=4 "
+    "--vlm.parallel_servers=1 "
+    "--vlm.num_gpus=1 "
    '--vlm.serve_command="vllm serve Qwen/Qwen3.6-27B '
    "--tensor-parallel-size 1 --max-model-len 32768 "
    '--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
@@ -111,7 +127,7 @@ CMD = (
 job = run_job(
    image="vllm/vllm-openai:latest",
    command=["bash", "-c", CMD],
-    flavor="h200x4",
+    flavor="h200",
    secrets={"HF_TOKEN": token},
    timeout="2h",
 )
@@ -417,7 +417,7 @@ exclude_dirs = [
    "benchmarks",
    "src/lerobot/datasets/push_dataset_to_hub",
 ]
-skips = ["B101", "B311", "B404", "B603", "B607", "B615"]
+skips = ["B101", "B311", "B404", "B603", "B615"]

 [tool.typos]
 default.extend-ignore-re = [
@@ -370,13 +370,14 @@ class AnnotationPipelineConfig:

    # Hub dataset id. Used as the download source when ``root`` is unset,
    # and as the destination repo when ``push_to_hub`` is enabled and
-    # ``dest_repo_id`` is unset.
+    # ``new_repo_id`` is unset.
    repo_id: str | None = None

-    # Optional separate Hub dataset id to push the annotated result to. When
-    # unset, ``push_to_hub`` uploads back to ``repo_id`` (annotate in place);
-    # when set, the source ``repo_id`` is left untouched.
-    dest_repo_id: str | None = None
+    # Optional separate Hub dataset id to push the annotated result to (named
+    # ``new_repo_id`` to match the LeRobot dataset edit tools). When unset,
+    # ``push_to_hub`` uploads back to ``repo_id`` (annotate in place); when
+    # set, the source ``repo_id`` is left untouched.
+    new_repo_id: str | None = None

    root: Path | None = None

@@ -404,7 +405,7 @@ class AnnotationPipelineConfig:
    video_backend: str | None = None

    # When True, upload the annotated dataset to the Hugging Face Hub:
-    # to ``dest_repo_id`` if set, otherwise back to ``repo_id``. One of
+    # to ``new_repo_id`` if set, otherwise back to ``repo_id``. One of
    # the two must be set for this to take effect.
    push_to_hub: bool = False
    push_private: bool = False
@@ -15,14 +15,8 @@
 # limitations under the License.
 """In-process executor that runs the annotation phases.

-The executor plans **seven phases** in the dependency order from the plan:
+The executor runs **six phases** in dependency order:

-    phase 0: vocabulary discovery — derive a small canonical vocabulary
-             from the first few sample-episode videos (subtask labels +
-             memory milestones) and persist it next to the dataset; the
-             ``plan`` module then constrains every per-episode generation
-             to those strings, so the downstream policy sees a small,
-             repeatable conditioning distribution
    phase 1: ``plan`` module (plan + subtasks + memory)
    phase 2: ``interjections`` module (interjections + speech)
    phase 3: ``plan`` plan-update pass — re-runs plan emission at every
@@ -146,6 +146,7 @@ class VideoFrameProvider:
    # ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
    # one-shot warn flag against concurrent updates from worker threads.
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
+    _warned_decode_fail: bool = field(default=False, init=False, repr=False)

    def __post_init__(self) -> None:
        from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata  # noqa: PLC0415
@@ -285,7 +286,9 @@ class VideoFrameProvider:
            str(out_path),
        ]
        try:
-            subprocess.run(cmd, check=True, timeout=300)
+            # ffmpeg is invoked by name via PATH lookup (the standard way to
+            # call the CLI); the arg list is fully controlled here, not shell.
+            subprocess.run(cmd, check=True, timeout=300)  # nosec B607
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
            return None
        return out_path if out_path.exists() and out_path.stat().st_size > 0 else None
@@ -335,7 +338,7 @@ class VideoFrameProvider:
        # []) is debuggable from the job log instead of post-hoc parquet
        # inspection. Subsequent failures stay quiet.
        with self._lock:
-            already_warned = getattr(self, "_warned_decode_fail", False)
+            already_warned = self._warned_decode_fail
            if not already_warned:
                self._warned_decode_fail = True
        if not already_warned:
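
Reviewer note: the hunk above swaps a `getattr` default for a declared dataclass field. A minimal standalone sketch of that one-shot flag pattern follows; `OneShotWarner` and `should_warn` are hypothetical names used only for illustration, not part of `VideoFrameProvider`.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class OneShotWarner:
    """Hypothetical stand-in showing a declared, lock-guarded one-shot flag."""

    name: str
    # init=False keeps both fields out of __init__; repr=False keeps them out of repr().
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
    _warned: bool = field(default=False, init=False, repr=False)

    def should_warn(self) -> bool:
        """Return True exactly once, even when called from several threads."""
        with self._lock:
            already_warned = self._warned
            self._warned = True
        return not already_warned
```

Because the flag is a real field, type checkers see it and a misspelled attribute name fails loudly instead of silently reading the `getattr` default.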
@@ -382,7 +385,8 @@ def _decode_frames_ffmpeg(video_path: Path, timestamps: list[float]) -> list[Any

    frames: list[Any] = []
    for ts in timestamps:
-        proc = subprocess.run(
+        # ffmpeg invoked by name via PATH lookup; fully-controlled arg list, no shell.
+        proc = subprocess.run(  # nosec B607
            [
                "ffmpeg",
                "-nostdin",
@@ -95,6 +95,7 @@ class GeneralVqaModule:
    config: VqaConfig
    seed: int = 1729
    frame_provider: FrameProvider = field(default_factory=null_provider)
+    _warned_no_camera: bool = field(default=False, init=False, repr=False)

    @property
    def enabled(self) -> bool:
@@ -113,7 +114,7 @@ class GeneralVqaModule:
            # No camera available — emit nothing rather than producing
            # untagged rows that would fail validation. Surface a loud one-
            # time warning so this is never silently a no-op.
-            if not getattr(self, "_warned_no_camera", False):
+            if not self._warned_no_camera:
                logging.getLogger(__name__).warning(
                    "vqa module found no cameras on the frame provider — "
                    "every episode will emit zero VQA rows. Check that the "
@@ -191,8 +192,17 @@ class GeneralVqaModule:
            default = getattr(self.frame_provider, "camera_key", None)
            if default and default in all_cameras:
                return [default]
+            # ``restrict_to_default_camera`` is set but the configured default
+            # isn't one the provider exposes. Returning it anyway would make
+            # ``_decode`` raise a KeyError deep in frame extraction, so warn and
+            # fall through to every available camera instead.
            if default:
-                return [default]
+                logging.getLogger(__name__).warning(
+                    "restrict_to_default_camera is set but camera_key=%r is not in the "
+                    "provider's cameras %s; grounding VQA on all available cameras instead.",
+                    default,
+                    all_cameras,
+                )
        return all_cameras
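
Reviewer note: the fallback logic in this hunk can be read in isolation as the sketch below. The free function `target_cameras` and its parameters are assumptions made for illustration; the real method reads these values from the module config and the frame provider.

```python
import logging

logger = logging.getLogger(__name__)


def target_cameras(all_cameras, default, restrict_to_default_camera):
    """Pick the cameras to ground on (illustrative free-function version)."""
    if restrict_to_default_camera:
        if default and default in all_cameras:
            return [default]
        if default:
            # The configured key is not exposed by the provider: warn and
            # fall through rather than return a key that cannot be decoded.
            logger.warning(
                "camera_key=%r is not in the provider's cameras %s; using all cameras.",
                default,
                all_cameras,
            )
    return list(all_cameras)
```

With `all_cameras=["front", "wrist"]`, a default of `"front"` yields only that camera, while an unknown default such as `"side"` logs a warning and yields both.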

    def _build_messages(
@@ -202,7 +212,7 @@ class GeneralVqaModule:
        frame_timestamp: float,
        camera_key: str,
    ) -> list[dict[str, Any]]:
-        prompt = load_prompt("module_3_vqa").format(
+        prompt = load_prompt("vqa").format(
            episode_task=record.episode_task,
            question_type=question_type,
        )
@@ -85,7 +85,7 @@ class InterjectionsAndSpeechModule:
        return current

    def _initial_speech(self, record: EpisodeRecord) -> str | None:
-        prompt = load_prompt("module_2_initial_speech").format(
+        prompt = load_prompt("interjections_initial_speech").format(
            episode_task=record.episode_task,
        )
        messages = [{"role": "user", "content": [{"type": "text", "text": prompt}]}]
@@ -147,7 +147,7 @@ class InterjectionsAndSpeechModule:
            # previous subtask and the start of the next one — same
            # conditioning the policy will see at training time.
            window_ts = self._window_timestamps(t_snap, record.frame_timestamps)
-            prompt = load_prompt("module_2_interjection").format(
+            prompt = load_prompt("interjections_interjection").format(
                episode_task=record.episode_task,
                prev_subtask=prev_subtask or "(starting from initial state)",
                next_subtask=next_subtask,
@@ -198,11 +198,12 @@ class InterjectionsAndSpeechModule:
        # Center the window on the anchor so half lands before, half after.
        start_offset = -window / 2.0
        targets = [t_anchor + start_offset + step * i for i in range(n)]
+        first_ts = float(frame_timestamps[0])
        last_ts = float(frame_timestamps[-1])
        snapped: list[float] = []
        seen: set[float] = set()
        for tgt in targets:
-            clamped = min(last_ts, max(0.0, tgt))
+            clamped = min(last_ts, max(first_ts, tgt))
            t = snap_to_frame(clamped, frame_timestamps)
            if t not in seen:
                seen.add(t)
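
Reviewer note: a self-contained sketch of the clamp-then-snap loop, for an episode whose timestamps do not start at zero. The `snap_to_frame` below is a hypothetical nearest-neighbour stand-in (the real helper is not shown in this diff), and the `window` / `n` defaults are illustrative assumptions.

```python
from bisect import bisect_left


def snap_to_frame(t, frame_timestamps):
    """Hypothetical stand-in: return the frame timestamp nearest to ``t``."""
    i = bisect_left(frame_timestamps, t)
    candidates = frame_timestamps[max(0, i - 1) : i + 1]
    return min(candidates, key=lambda ts: abs(ts - t))


def window_timestamps(t_anchor, frame_timestamps, window=2.0, n=4):
    """Spread ``n`` targets over a window centred on ``t_anchor``, clamped to the episode."""
    step = window / max(1, n - 1)
    targets = [t_anchor - window / 2.0 + step * i for i in range(n)]
    first_ts = float(frame_timestamps[0])
    last_ts = float(frame_timestamps[-1])
    snapped = []
    seen = set()
    for tgt in targets:
        # Clamp to the observed range, not to a hardcoded 0.0.
        clamped = min(last_ts, max(first_ts, tgt))
        t = snap_to_frame(clamped, frame_timestamps)
        if t not in seen:
            seen.add(t)
            snapped.append(t)
    return snapped
```

For frames at `[10.0, 10.5, 11.0, 11.5, 12.0]` and an anchor of `10.2`, the two targets that fall before the first frame both clamp to `10.0` and are deduplicated, so the call returns `[10.0, 10.5, 11.0]`.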
@@ -285,14 +285,14 @@ class PlanSubtasksMemoryModule:

    def _derive_task_from_video(self, record: EpisodeRecord) -> str | None:
        """Ask the VLM "what is this video about" with no task hint at all."""
-        text = self._vlm_field(self._video_message(record, load_prompt("module_1_video_task")), "task")
+        text = self._vlm_field(self._video_message(record, load_prompt("plan_video_task")), "task")
        return text.strip() if isinstance(text, str) and text.strip() else None

    def _generate_task_rephrasings(self, base_task: str, *, n: int) -> list[str]:
        """Generate ``n`` text-only paraphrases of ``base_task``."""
        if n <= 0 or not base_task:
            return []
-        prompt = load_prompt("module_1_task_rephrasings").format(base_task=base_task, n=n)
+        prompt = load_prompt("plan_task_rephrasings").format(base_task=base_task, n=n)
        raw = self._vlm_field(self._text_message(prompt), "rephrasings")
        if not isinstance(raw, list):
            return []
@@ -343,7 +343,7 @@ class PlanSubtasksMemoryModule:
            )
            return None

-        prompt = load_prompt("module_1_action_record").format(
+        prompt = load_prompt("plan_action_record").format(
            episode_task=episode_task,
            subtask_text=span.get("text", ""),
            start_time=start_t,
@@ -416,7 +416,7 @@ class PlanSubtasksMemoryModule:
        """
        if not base_task:
            return []
-        prompt = load_prompt("module_1_task_aug_axes").format(
+        prompt = load_prompt("plan_task_aug_axes").format(
            base_task=base_task,
            n_synonym=axes_cfg.synonym_paraphrase,
            n_omit_arm=axes_cfg.omit_arm,
@@ -596,7 +596,7 @@ class PlanSubtasksMemoryModule:
                )

        # ---- Pass 2: segmentation ------------------------------------
-        prompt = load_prompt("module_1_subtasks").format(
+        prompt = load_prompt("plan_subtasks").format(
            episode_task=effective_task,
            min_subtask_seconds=self.config.min_subtask_seconds,
            max_steps=self.config.plan_max_steps,
@@ -679,7 +679,7 @@ class PlanSubtasksMemoryModule:
                    "action that is not in your description above.\n\n"
                )

-        prompt = load_prompt("module_1_subtasks").format(
+        prompt = load_prompt("plan_subtasks").format(
            episode_task=task,
            min_subtask_seconds=self.config.min_subtask_seconds,
            max_steps=self.config.plan_max_steps,
@@ -778,7 +778,7 @@ class PlanSubtasksMemoryModule:
        self, record: EpisodeRecord, task: str, window: tuple[float, float] | None = None
    ) -> str:
        """Grounding pass: free-form chronological description of the (windowed) video."""
-        prompt = load_prompt("module_1_subtask_describe").format(episode_task=task)
+        prompt = load_prompt("plan_subtask_describe").format(episode_task=task)
        text = self._vlm_field(self._video_message(record, prompt, window=window), "description")
        return text.strip() if isinstance(text, str) and text.strip() else ""

@@ -882,7 +882,7 @@ class PlanSubtasksMemoryModule:
        *,
        task: str | None = None,
    ) -> str:
-        prompt = load_prompt("module_1_memory").format(
+        prompt = load_prompt("plan_memory").format(
            episode_task=(task if task is not None else record.episode_task),
            prior_memory=prior_memory or "(none)",
            completed_subtask=completed,
@@ -138,7 +138,13 @@ class StagingValidator:
        for row in all_rows:
            self._check_column_routing(row, report, record.episode_index)
            self._check_camera_field(row, report, record.episode_index, self.dataset_camera_keys)
-            if column_for_style(row.get("style")) == LANGUAGE_PERSISTENT:
+            # ``_check_column_routing`` already recorded any unknown-style error;
+            # don't let the same ``column_for_style`` lookup raise here uncaught.
+            try:
+                column = column_for_style(row.get("style"))
+            except ValueError:
+                continue
+            if column == LANGUAGE_PERSISTENT:
                persistent.append(row)
            else:
                events.append(row)
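
Reviewer note: the routing change can be exercised on its own with the sketch below. The style table and `route_rows` are assumptions for illustration only (the real mapping lives in `column_for_style`), and here the error is recorded inside the `except` branch, whereas the validator relies on `_check_column_routing` having recorded it already.

```python
LANGUAGE_PERSISTENT = "language_persistent"
LANGUAGE_EVENTS = "language_events"

# Illustrative style table; the real one is defined by the pipeline.
_STYLE_TO_COLUMN = {
    "subtask": LANGUAGE_PERSISTENT,
    "interjection": LANGUAGE_EVENTS,
    "vqa": LANGUAGE_EVENTS,
    None: LANGUAGE_EVENTS,
}


def column_for_style(style):
    """Hypothetical lookup that raises ValueError on an unknown style."""
    try:
        return _STYLE_TO_COLUMN[style]
    except KeyError:
        raise ValueError(f"unknown style: {style!r}") from None


def route_rows(rows):
    """Split rows by column, skipping (not crashing on) unknown styles."""
    persistent, events, errors = [], [], []
    for row in rows:
        try:
            column = column_for_style(row.get("style"))
        except ValueError as exc:
            errors.append(str(exc))
            continue
        if column == LANGUAGE_PERSISTENT:
            persistent.append(row)
        else:
            events.append(row)
    return persistent, events, errors
```

One bad row then costs one recorded error rather than the whole validation pass.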
@@ -598,13 +598,3 @@ def _pil_to_data_url(image: Any) -> str:
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
    return f"data:image/png;base64,{b64}"
-
-
-def _messages_to_prompt(messages: Sequence[dict[str, Any]]) -> Any:
-    """Pass-through hook used by the vllm backend.
-
-    vllm exposes its own multimodal entry points that vary by version; for the
-    base flow we simply forward the raw message list and let the caller's
-    custom backend handle templating. Real deployments override this.
-    """
-    return list(messages)
@@ -113,9 +113,9 @@ def annotate(cfg: AnnotationPipelineConfig) -> None:
            logger.warning(w)

    if cfg.push_to_hub:
-        if cfg.repo_id is None and cfg.dest_repo_id is None:
+        if cfg.repo_id is None and cfg.new_repo_id is None:
            raise ValueError(
-                "--push_to_hub requires --repo_id or --dest_repo_id (the dataset repo to push to)."
+                "--push_to_hub requires --repo_id or --new_repo_id (the dataset repo to push to)."
            )
        _push_to_hub(root, cfg)

@@ -123,11 +123,11 @@ def annotate(cfg: AnnotationPipelineConfig) -> None:
 def _push_to_hub(root: Path, cfg: AnnotationPipelineConfig) -> None:
    """Upload the annotated dataset directory to the Hub.

-    Pushes to ``cfg.dest_repo_id`` when set, otherwise back to ``cfg.repo_id``.
+    Pushes to ``cfg.new_repo_id`` when set, otherwise back to ``cfg.repo_id``.
    """
    from huggingface_hub import HfApi  # noqa: PLC0415

-    repo_id = cfg.dest_repo_id or cfg.repo_id
+    repo_id = cfg.new_repo_id or cfg.repo_id
    commit_message = cfg.push_commit_message or "Add steerable annotations (lerobot-annotate)"
    api = HfApi()
    print(f"[lerobot-annotate] creating/locating dataset repo {repo_id}...", flush=True)
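
Reviewer note: after the rename, destination resolution reduces to the sketch below. `resolve_push_repo` is a hypothetical helper that combines the guard in `annotate` with the `or` fallback in `_push_to_hub`; the config is faked with `SimpleNamespace`, as the test in this commit does.

```python
from types import SimpleNamespace


def resolve_push_repo(cfg):
    """Return the Hub repo to push to: ``new_repo_id`` if set, else ``repo_id``."""
    if cfg.repo_id is None and cfg.new_repo_id is None:
        raise ValueError("--push_to_hub requires --repo_id or --new_repo_id")
    return cfg.new_repo_id or cfg.repo_id


in_place = SimpleNamespace(repo_id="source/dataset", new_repo_id=None)
separate = SimpleNamespace(repo_id="source/dataset", new_repo_id="annotated/dataset")
```

Here `resolve_push_repo(in_place)` gives `"source/dataset"` (annotate in place) and `resolve_push_repo(separate)` gives `"annotated/dataset"`, leaving the source repo untouched.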
@@ -60,13 +60,11 @@ def _stub_responder(messages):
                {"text": "place the bottle down", "start": 2.0, "end": 3.0},
            ]
        }
-    if "concise hierarchical PLAN" in text:
-        return {"plan": "1. grasp\n2. pour\n3. place"}
-    if "Update the memory" in text:
+    if "compressed semantic memory" in text:
        return {"memory": "poured once"}
    if "acknowledgement the robot" in text:
        return {"text": "Sure."}
-    if "ONE realistic interruption" in text:
+    if "compact interjection" in text:
        return {"interjection": "use less water", "speech": "Using less water."}
    if "frame-grounded visual question" in text:
        return {"question": "How many cups?", "answer": {"label": "cup", "count": 1}}
@@ -94,6 +92,23 @@ def main() -> int:
        print(f"phases={[(p.name, p.episodes_processed) for p in summary.phases]}")
        print(f"validation: {summary.validation_report.summary()}")
        print(f"shards rewritten: {len(summary.written_paths)}")
+
+        # Assert the interjection code path actually fired — otherwise a stale
+        # canned-VLM marker would silently produce zero interjections and this
+        # smoke run would still "pass" by only printing.
+        import pyarrow.parquet as pq  # noqa: PLC0415
+
+        events = [
+            r
+            for shard in summary.written_paths
+            for ev in pq.read_table(shard).column("language_events").to_pylist()
+            for r in ev
+        ]
+        n_interjections = sum(1 for r in events if r.get("style") == "interjection")
+        n_speech = sum(1 for r in events if r.get("style") is None and r.get("role") == "assistant")
+        print(f"interjections={n_interjections} speech_atoms={n_speech}")
+        assert n_interjections > 0, "no interjection rows produced — check the interjection prompt marker"
+        assert n_speech > 0, "no speech tool-call atoms produced — check the speech prompt marker"
    return 0
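
Reviewer note: the counting logic behind the new assertions, shown on plain Python lists so it runs without parquet shards. `count_event_rows` is a hypothetical helper; the input mimics what `column("language_events").to_pylist()` would return, one list of row dicts per frame.

```python
def count_event_rows(events_per_frame):
    """Count interjection rows and speech atoms in per-frame event lists."""
    events = [r for ev in events_per_frame for r in ev]
    n_interjections = sum(1 for r in events if r.get("style") == "interjection")
    # Speech tool-call atoms carry style=None and the assistant role.
    n_speech = sum(
        1 for r in events if r.get("style") is None and r.get("role") == "assistant"
    )
    return n_interjections, n_speech


sample = [
    [{"style": None, "role": "assistant"}],
    [],
    [{"style": "interjection", "role": "user"}, {"style": None, "role": "assistant"}],
]
```

On `sample` this returns `(1, 2)`, the same shape as the `interjections=1 speech_atoms=2` line reported in the commit message.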


@@ -151,7 +151,7 @@ def test_module2_mid_episode_emits_paired_interjection_and_speech(
        {
            "acknowledgement the robot": {"text": "OK."},
            # Marker matches the distinctive line of
-            # ``module_2_interjection.txt`` ("Write ONE compact
+            # ``interjections_interjection.txt`` ("Write ONE compact
            # interjection ..."). Keep this in sync with that prompt's
            # wording — the canned responder matches on substring.
            "Write ONE compact interjection": {
@@ -245,7 +245,6 @@ def test_module1_attaches_video_block_to_subtask_prompt(fixture_dataset_root: Pa
            {"text": "wipe the counter", "start": 0.5, "end": 1.1},
        ]
    }
-    plan_payload = {"plan": "1. grasp\n2. wipe"}
    memory_payload = {"memory": "wiped once"}

    def responder(messages):
@@ -255,9 +254,7 @@ def test_module1_attaches_video_block_to_subtask_prompt(fixture_dataset_root: Pa
            for block in m.get("content", []):
                if isinstance(block, dict) and block.get("type") == "text":
                    text = block.get("text", "")
-        if "concise hierarchical PLAN" in text:
-            return plan_payload
-        if "Update the memory" in text:
+        if "compressed semantic memory" in text:
            return memory_payload
        return payload

@@ -13,7 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-"""End-to-end smoke: pipeline output → PR 1 canonical recipe rendering."""
+"""End-to-end smoke: pipeline output → canonical recipe rendering."""

 from __future__ import annotations

@@ -49,14 +49,15 @@ from lerobot.datasets.language_render import render_sample  # noqa: E402
 from ._helpers import make_canned_responder  # noqa: E402


-def _build_pr1_style_blend_recipe() -> TrainingRecipe:
+def _build_style_blend_recipe() -> TrainingRecipe:
    """Inline blend recipe that consumes every style this pipeline produces.

-    PR 1 used to ship ``src/lerobot/configs/recipes/pi05_hirobot.yaml`` as
-    a canonical example, but that file was dropped during PR 1 review. The
-    cross-PR contract this test guards is "the recipe DSL can render
-    non-empty messages from pipeline output", which doesn't require a
-    specific YAML — so we build the equivalent blend in code.
+    The language schema/DSL work used to ship
+    ``src/lerobot/configs/recipes/pi05_hirobot.yaml`` as a canonical
+    example, but that file was dropped during review. The contract this
+    test guards is "the recipe DSL can render non-empty messages from
+    pipeline output", which doesn't require a specific YAML — so we build
+    the equivalent blend in code.
    """
    return TrainingRecipe(
        blend={
@@ -109,10 +110,9 @@ def _build_executor() -> Executor:
                    {"text": "place the bottle down", "start": 1.0, "end": 1.5},
                ]
            },
-            "concise hierarchical PLAN": {"plan": "1. grasp\n2. pour\n3. place"},
-            "Update the memory": {"memory": "poured once"},
+            "compressed semantic memory": {"memory": "poured once"},
            "acknowledgement the robot": {"text": "Sure."},
-            "ONE realistic interruption": {
+            "compact interjection": {
                "interjection": "use less water",
                "speech": "Using less water.",
            },
@@ -137,7 +137,7 @@ def _build_executor() -> Executor:
    )


-def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
+def test_canonical_recipe_renders_nonempty_from_pipeline_output(
    single_episode_root: Path,
 ) -> None:
    executor = _build_executor()
@@ -150,7 +150,7 @@ def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
    events_lists = table.column("language_events").to_pylist()
    timestamps = table.column("timestamp").to_pylist()

-    recipe = _build_pr1_style_blend_recipe()
+    recipe = _build_style_blend_recipe()

    rendered_any = False
    for ts, persistent, events in zip(timestamps, persistent_lists, events_lists, strict=True):
@@ -168,7 +168,7 @@ def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
            rendered_any = True
            assert result["target_message_indices"]
            break
-    assert rendered_any, "PR 1 recipe rendered no messages from pipeline output"
+    assert rendered_any, "recipe rendered no messages from pipeline output"

    # Sanity: speech atom appears in events column intact
    flat_events = [r for ev in events_lists for r in ev]
@@ -177,7 +177,7 @@ def test_pr1_canonical_recipe_renders_nonempty_from_pipeline_output(
    say = speech_rows[0]["tool_calls"][0]
    assert say["function"]["name"] == "say"
    assert isinstance(say["function"]["arguments"]["text"], str)
-    # PR 2 no longer writes a ``tools`` column — the say schema lives as a
-    # constant (``SAY_TOOL_SCHEMA``) so PR 1's row struct is the single
-    # source of truth for the v3.1 schema.
+    # The pipeline does not write a ``tools`` column — the say schema lives
+    # as a constant (``SAY_TOOL_SCHEMA``) so the language row struct is the
+    # single source of truth for the v3.1 schema.
    assert "tools" not in table.column_names
@@ -229,7 +229,7 @@ def test_writer_drops_subtask_index_idempotent(fixture_dataset_root: Path, tmp_p
    assert "language_events" in table_a.column_names
    # The writer no longer emits a dataset-level ``tools`` column; the
    # ``say`` tool schema lives as a code constant (``SAY_TOOL_SCHEMA``)
-    # so the parquet stays small and PR 2 doesn't extend PR 1's schema.
+    # so the parquet stays small and the pipeline doesn't extend the schema.
    assert "tools" not in table_a.column_names

    # second pass — must produce identical bytes for the language columns
@@ -35,7 +35,7 @@ def test_push_to_hub_tags_uploaded_dataset_revision(tmp_path, monkeypatch):

    cfg = SimpleNamespace(
        repo_id="source/dataset",
-        dest_repo_id="annotated/dataset",
+        new_repo_id="annotated/dataset",
        push_private=True,
        push_commit_message=None,
    )