feat(pi052): condition low-level prompt on state + fix eval slowdown

- Inject discretized proprioceptive state (256 bins, pi05 format) into low-level action-conditioning prompts in both training (PI052TextTokenizerStep) and eval (_with_low_level_subtask_prompt), matching the recipe's documented "[images, subtask, state]" intent. Higher-level subtask/memory text streams stay state-free.
- Cache the loc-token tokenizer (_get_loc_tokenizer) instead of reloading it from disk on every _build_text_batch/select_message call (it ran twice per env per replan and dominated eval runtime).
- Add a KV cache to select_message decode (bit-identical output to the recompute path) to avoid O(n^2) generation.

Net: pi052 eval ~2.9 s/it -> ~0.1 s/it (~25x).

Co-authored-by: Cursor <cursoragent@cursor.com>
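
For readers unfamiliar with the state-conditioning idea in the first bullet, here is a minimal sketch of binning a proprioceptive state into 256 integer tokens and appending them to a low-level prompt. It is not the code from this PR: the function names `discretize_state` and `low_level_prompt`, the assumption that the state is already normalized to [-1, 1], and the prompt layout are all hypothetical, chosen only to illustrate the technique.

```python
from __future__ import annotations

from collections.abc import Sequence

N_BINS = 256  # bin count named in the commit message


def discretize_state(state: Sequence[float], n_bins: int = N_BINS) -> list[int]:
    """Map each value to a bin index in [0, n_bins - 1].

    Assumption: values are already normalized to [-1, 1]; anything outside
    that range is clipped before binning.
    """
    bins: list[int] = []
    for value in state:
        clipped = max(-1.0, min(1.0, float(value)))
        # Uniform bins of width 2 / n_bins over [-1, 1].
        index = int((clipped + 1.0) / 2.0 * n_bins)
        # The upper edge (value == 1.0) would land on n_bins; fold it into the last bin.
        bins.append(min(index, n_bins - 1))
    return bins


def low_level_prompt(subtask: str, state: Sequence[float]) -> str:
    """Build a state-conditioned prompt (hypothetical layout, for illustration only)."""
    state_str = " ".join(str(b) for b in discretize_state(state))
    return f"Subtask: {subtask}, State: {state_str};\nAction: "


if __name__ == "__main__":
    print(low_level_prompt("pick up the mug", [-1.0, 0.0, 0.5, 1.0]))
```

Integer bins keep the state within the text tokenizer's ordinary vocabulary, which is why the same string can be appended in both the training and eval paths without a separate state encoder input.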
Merge branch 'main' into feat/smolvla-on-steerable
2026-06-16 15:57:03 +00:00 · 2026-06-14 13:57:55 +02:00 · 2026-06-08 11:02:54 +02:00 · 2026-06-05 16:10:00 +02:00 · 2026-06-05 16:09:08 +02:00 · 2026-06-05 14:38:47 +02:00
101 changed files with 11671 additions and 2280 deletions
@@ -167,9 +167,9 @@ jobs:

      # ── LIBERO TRAIN+EVAL SMOKE ──────────────────────────────────────────────
      # Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only) then
-      # immediately runs eval inside the training loop (env_eval_freq=1, 1 episode).
+      # immediately runs eval inside the training loop (eval_freq=1, 1 episode).
      # Tests the full train→eval-within-training pipeline end-to-end.
-      - name: Run Libero train+eval smoke (1 step, env_eval_freq=1)
+      - name: Run Libero train+eval smoke (1 step, eval_freq=1)
        if: env.HF_USER_TOKEN != ''
        run: |
          docker run --name libero-train-smoke --gpus all \
@@ -196,7 +196,7 @@ jobs:
                --output_dir=/tmp/train-smoke \
                --steps=1 \
                --batch_size=1 \
-                --env_eval_freq=1 \
+                --eval_freq=1 \
                --eval.n_episodes=1 \
                --eval.batch_size=1 \
                --eval.use_async_envs=false \
@@ -65,9 +65,6 @@ repos:
        name: Format Markdown with Prettier
        types_or: [markdown, mdx]
        args: [--prose-wrap=preserve]
-        # Jinja2 model-card templates use a .md extension but contain {% ... %} /
-        # {{ ... }} tags that prettier's Markdown formatter mangles (e.g. table loops).
-        exclude: ^src/lerobot/templates/.*\.md$

  ##### Security #####
  - repo: https://github.com/gitleaks/gitleaks
@@ -58,7 +58,7 @@ test-act-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=4 \
-		--env_eval_freq=2 \
+		--eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_freq=2 \
@@ -96,7 +96,7 @@ test-diffusion-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=2 \
-		--env_eval_freq=2 \
+		--eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_checkpoint=true \
@@ -126,7 +126,7 @@ test-tdmpc-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=2 \
-		--env_eval_freq=2 \
+		--eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_checkpoint=true \
@@ -161,7 +161,7 @@ test-smolvla-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=4 \
-		--env_eval_freq=2 \
+		--eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_freq=2 \
@@ -58,7 +58,7 @@ action = model.select_action(obs)
 robot.send_action(action)
 ```

-**Supported Hardware:** SO100, LeKiwi, Koch, HopeJR, OMX, EarthRover, Reachy2, Gamepads, Keyboards, Phones, OpenARM, Unitree G1, reBot B601.
+**Supported Hardware:** SO100, LeKiwi, Koch, HopeJR, OMX, EarthRover, Reachy2, Gamepads, Keyboards, Phones, OpenARM, Unitree G1.

 While these devices are natively integrated into the LeRobot codebase, the library is designed to be extensible. You can easily implement the Robot interface to utilize LeRobot's data collection, training, and visualization tools for your own custom robot.

@@ -101,13 +101,11 @@ lerobot-train \
  --dataset.repo_id=lerobot/aloha_mobile_cabinet
 ```

-| Category                   | Models                                                                                                                                                                                                                                                                                                                                                     |
-| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| **Imitation Learning**     | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md), [Multitask DiT Policy](./docs/source/policy_multi_task_dit_README.md)                                                                                                                                    |
-| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon)                                                                                                                                                                                                                                                |
-| **VLAs Models**            | [Pi0](./docs/source/pi0.mdx), [Pi0Fast](./docs/source/pi0fast.mdx), [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx), [EO-1](./docs/source/eo1.mdx), [MolmoAct2](./docs/source/molmoact2.mdx), [WALL-OSS](./docs/source/walloss.mdx) |
-| **World Models**           | [VLA-JEPA](./docs/source/vla_jepa.mdx) (more coming soon)                                                                                                                                                                                                                                                                                                  |
-| **Reward Models**          | [SARM](./docs/source/sarm.mdx), [TOPReward](./docs/source/topreward.mdx), [Robometer](./docs/source/robometer.mdx)                                                                                                                                                                                                                                         |
+| Category                   | Models                                                                                                                                                                                                                  |
+| -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **Imitation Learning**     | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md), [Multitask DiT Policy](./docs/source/policy_multi_task_dit_README.md) |
+| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon)                                                                                                             |
+| **VLAs Models**            | [Pi0Fast](./docs/source/pi0fast.mdx), [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx)            |

 Similarly to the hardware, you can easily implement your own policy & leverage LeRobot's data collection, training, and visualization tools, and share your model to the HF Hub

@@ -135,7 +133,6 @@ Learn how to implement your own simulation environment or benchmark and distribu
 - **[Discord](https://discord.gg/q8Dzzpym3f):** Join the `LeRobot` server to discuss with the community.
 - **[X](https://x.com/LeRobotHF):** Follow us on X to stay up-to-date with the latest developments.
 - **[Robot Learning Tutorial](https://huggingface.co/spaces/lerobot/robot-learning-tutorial):** A free, hands-on course to learn robot learning using LeRobot.
-- **[T-Shirt Folding Experiment](https://huggingface.co/spaces/lerobot/robot-folding):** An end-to-end demonstration of folding t-shirts with LeRobot.

 ## Citation

@@ -143,7 +140,7 @@ If you use LeRobot in your project, please cite the GitHub repository to acknowl

 ```bibtex
@misc{cadene2024lerobot,
-    author = {Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Meftah, Khalil and Ellerbach, Maxime and Moss, Jess and Wolf, Thomas},
+    author = {Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Moss, Jess and Wolf, Thomas},
    title = {LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
    howpublished = "\url{https://github.com/huggingface/lerobot}",
    year = {2024}
@@ -71,21 +71,11 @@ it uses a two-step **describe → segment** flow:
 2. **Segment** — that description is fed back in, and the VLM splits the
   episode into consecutive atomic subtasks.

-Both passes see the episode as **timestamped contact sheets** — frames
-sampled at `frames_per_second` (0.5s by default) and packed into JPEG
-grids with each frame's time burned into its corner, so the VLM cites
-exact boundary times directly. This is far cheaper in vision tokens than
-one image per frame, so the sampling can stay dense; episodes longer than
-`max_frames_per_prompt` are split into windows at the same density and
-merged. Both prompts also carry a causal **event-boundary** definition (a
-new event starts when an object becomes held / is released / reaches a new
-location / a lid changes state / contents move) to sharpen where cuts land.
-
 The resulting spans are then stitched into a gap-free, full-episode
 cover, so **every frame has exactly one active subtask**. See
 [`run_hf_job.py`](https://github.com/huggingface/lerobot/blob/main/examples/annotations/run_hf_job.py)
-for the production settings (single camera, timestamped contact sheets,
-auto-windowed subtask generation).
+for the production settings (single camera, embedded frames, windowed
+subtask generation).

 ### Tools

@@ -172,15 +162,15 @@ Every module is on by default and can be toggled independently (set to

 | Flag                            | Default    | What it does                                                                                                              |
 | ------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------- |
-| `--plan.frames_per_second`      | `2.0`      | Frame sampling rate for the contact sheets (`2.0` = one frame every 0.5s).                                                |
-| `--plan.max_frames_per_prompt`  | `60`       | Frame budget per VLM call. Episodes whose sampling exceeds this are auto-windowed at the same density, then stitched.     |
-| `--plan.contact_sheet_columns`  | `5`        | Columns per contact-sheet grid (`contact_sheet_frames_per_sheet` tiles, time row-major).                                  |
+| `--plan.frames_per_second`      | `1.0`      | How densely the episode video is sampled.                                                                                 |
+| `--plan.max_video_frames`       | `32`       | Hard cap on frames per call (context-budget guard — don't exceed ~32 for a 32k context).                                  |
+| `--plan.subtask_window_seconds` | `0`        | Split long episodes into fixed windows for constant frame density (`0` = whole episode).                                  |
 | `--plan.plan_max_steps`         | `8`        | Upper bound on subtasks per episode.                                                                                      |
 | `--plan.subtask_describe_first` | `true`     | Run the describe→segment grounding pass (best subtask quality; +1 call/episode).                                          |
 | `--plan.emit_plan`              | `true`     | Emit the numbered `plan` rows (`false` = subtasks + memory only).                                                         |
-| `--plan.emit_memory`            | `true`     | Emit the `memory` rows (`false` = subtasks + plan only); symmetric to `emit_plan`.                                        |
 | `--plan.n_task_rephrasings`     | `10`       | How many `task_aug` rephrasings to emit (`0` disables).                                                                   |
 | `--plan.derive_task_from_video` | `if_short` | Use the dataset task as-is (`off`), only when it's missing/short (`if_short`), or always re-derive from video (`always`). |
+| `--plan.use_video_url`          | `false`    | Send a server-side video clip instead of embedded frames.                                                                 |

 ### Interjections + VQA

@@ -719,7 +719,7 @@ Example configuration for training the [reward classifier](https://huggingface.c
  "num_workers": 4,
  "steps": 5000,
  "log_freq": 10,
-  "env_eval_freq": 1000,
+  "eval_freq": 1000,
  "save_freq": 1000,
  "save_checkpoint": true,
  "seed": 2,
@@ -141,6 +141,11 @@ sample["target_message_indices"]

 The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages.

+## Blends
+
+Blend recipes select one weighted sub-recipe deterministically from the sample index.
+`recipes/subtasks_vqa.yaml` trains the core blend — high-level subtask prediction, low-level execution, and VQA. `recipes/subtask_mem_vqa_speech.yaml` is the fuller variant that also adds memory updates and spoken interjection responses.
+
 ## Graceful absence

 If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
@@ -143,7 +143,7 @@ lerobot-train \
  --batch_size=4 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
-  --env_eval_freq=1000
+  --eval_freq=1000
 ```

 ## Reproducing published results
@@ -173,7 +173,7 @@ lerobot-train \
    --batch_size=4 \
    --eval.batch_size=1 \
    --eval.n_episodes=1 \
-    --env_eval_freq=1000
+    --eval_freq=1000
 ```

 ## Relationship to LIBERO
@@ -120,11 +120,11 @@ lerobot-train \
  --batch_size=4 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
-  --env_eval_freq=1000
+  --eval_freq=1000
 ```

 ## Practical tips

 - Use the one-hot task conditioning for multi-task training (MT10/MT50 conventions) so policies have explicit task context.
 - Inspect the dataset task descriptions and the `info["is_success"]` keys when writing post-processing or logging so your success metrics line up with the benchmark.
-- Adjust `batch_size`, `steps`, and `env_eval_freq` to match your compute budget.
+- Adjust `batch_size`, `steps`, and `eval_freq` to match your compute budget.
@@ -103,7 +103,7 @@ accelerate launch \
  --batch_size=32 \
  --num_workers=4 \
  --log_freq=20 \
-  --env_eval_freq=-1 \
+  --eval_freq=-1 \
  --save_checkpoint=true \
  --save_freq=2000
 ```
@@ -142,7 +142,7 @@ accelerate launch \
  --batch_size=32 \
  --num_workers=4 \
  --log_freq=20 \
-  --env_eval_freq=-1 \
+  --eval_freq=-1 \
  --save_checkpoint=true \
  --save_freq=2000
 ```
@@ -314,7 +314,7 @@ lerobot-train \
  --steps=30000 \
  --save_freq=1000 \
  --log_freq=100 \
-  --env_eval_freq=1000 \
+  --eval_freq=1000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
@@ -166,7 +166,7 @@ lerobot-train \
  --output_dir=./outputs/smolvla_robocasa_CloseFridge \
  --steps=100000 \
  --batch_size=4 \
-  --env_eval_freq=5000 \
+  --eval_freq=5000 \
  --eval.batch_size=1 \
  --eval.n_episodes=5 \
  --save_freq=10000
@@ -165,7 +165,7 @@ lerobot-train \
  --output_dir=./outputs/smolvla_vlabench_primitive \
  --steps=100000 \
  --batch_size=4 \
-  --env_eval_freq=5000 \
+  --eval_freq=5000 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --save_freq=10000
@@ -53,17 +53,49 @@ CMD = (
    "export VLLM_VIDEO_BACKEND=pyav && "
    "lerobot-annotate "
    "--repo_id=pepijn223/robocasa_pretrain_human300_v4 "
-    "--new_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated "
+    "--new_repo_id=pepijn223/robocasa_pretrain_human300_v4_annotated5 "
    "--push_to_hub=true "
    "--vlm.backend=openai "
    "--vlm.model_id=Qwen/Qwen3.6-27B "
+    "--vlm.parallel_servers=1 "
    "--vlm.num_gpus=1 "
    '--vlm.serve_command="vllm serve Qwen/Qwen3.6-27B '
    "--tensor-parallel-size 1 --max-model-len 32768 "
    '--gpu-memory-utilization 0.8 --uvicorn-log-level warning --port {port}" '
    "--vlm.serve_ready_timeout_s=1800 "
-    # Qwen3.6 ships with thinking on; annotation wants plain JSON answers.
-    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}'"
+    "--vlm.client_concurrency=128 "
+    "--vlm.max_new_tokens=512 "
+    "--vlm.temperature=0.7 "
+    "--executor.episode_parallelism=16 "
+    "--vlm.chat_template_kwargs='{\"enable_thinking\": false}' "
+    "--vlm.camera_key=observation.images.robot0_agentview_right "
+    # Phase 1 — plan module (subtasks + memory).
+    # Embed decoded frames (not a file:// clip): if clip extraction fails,
+    # the video_url path silently sends no video and the VLM hallucinates.
+    "--plan.use_video_url=false "
+    "--plan.frames_per_second=1.0 "
+    # 32 frames ≈ 8-10k vision tokens, fits the 32768 context. Don't push
+    # toward 128 — that overflows the context (BadRequestError 400).
+    "--plan.max_video_frames=32 "
+    # Window long episodes into 32s chunks (constant 1 fps density) so they
+    # get more subtasks; per-window spans are merged + stitched. 0 disables.
+    "--plan.subtask_window_seconds=32 "
+    # RoboCasa: the dataset task string is authoritative (eval uses it), so
+    # keep it driving subtasks. ``always`` would throw it away and hallucinate.
+    "--plan.derive_task_from_video=off "
+    # No task augmentation: eval conditions on the exact task strings, so
+    # rephrasings are unused at best and harmful when they drift.
+    "--plan.n_task_rephrasings=0 "
+    # Keep subtask decomposition tight for atomic tasks.
+    "--plan.plan_max_steps=10 "
+    # Only subtasks + memory — skip the numbered "plan" rows. true re-enables.
+    "--plan.emit_plan=false "
+    # The describe->segment grounding pass (+1 VLM call/episode) is ON by
+    # default; pass --plan.subtask_describe_first=false to skip it.
+    # Phase 2 — interjections + speech.
+    "--interjections.max_interjections_per_episode=6 "
+    # Phase 4 — general VQA: disabled for this run.
+    "--vqa.enabled=false"
 )

 job = run_job(
@@ -85,6 +85,11 @@ dependencies = [
    "termcolor>=2.4.0,<4.0.0",
    "tqdm>=4.66.0,<5.0.0",

+    # Training utilities
+    # EMA of policy parameters (Diffusion Policy / pi05 style). Tiny
+    # pure-python dependency — preferred over a hand-rolled implementation.
+    "ema-pytorch>=0.7.7,<1.0.0",
+
    # Build tools (required by opencv-python-headless on some platforms)
    "cmake>=3.29.0.1,<4.2.0",
    "setuptools>=71.0.0,<81.0.0",
@@ -115,8 +120,8 @@ dataset = [
 ]
 training = [
    "lerobot[dataset]",
-    "wandb>=0.24.0,<0.28.0",
-    "lerobot[accelerate-dep]",
+    "accelerate>=1.10.0,<2.0.0",
+    "wandb>=0.24.0,<0.25.0",
 ]
 hardware = [
    "lerobot[pynput-dep]",
@@ -142,8 +147,8 @@ pygame-dep = ["pygame>=2.5.1,<2.7.0"]
 # (noble ships urdfdom 3.x). Cap below 0.9.16 until system urdfdom 4.x is broadly available.
 placo-dep = ["placo>=0.9.6,<0.9.16"]
 transformers-dep = ["transformers>=5.4.0,<5.6.0"]
-grpcio-dep = ["grpcio>=1.73.1,<2.0.0", "protobuf>=6.31.1,<8.0.0"]
-accelerate-dep = ["accelerate>=1.14.0,<2.0.0"]
+sentencepiece-dep = ["sentencepiece>=0.2.0,<0.3.0"] # FAST action tokenizer backend (pi052, pi0_fast)
+grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
 can-dep = ["python-can>=4.2.0,<5.0.0"]
 peft-dep = ["peft>=0.18.0,<1.0.0"]
 scipy-dep = ["scipy>=1.14.0,<2.0.0"]
@@ -178,12 +183,7 @@ unitree_g1 = [
    "lerobot[matplotlib-dep]",
    "lerobot[pygame-dep]",
 ]
-# reachy2-sdk caps grpcio<=1.73.1 and protobuf<=6.32.0; quarantined here so downstream users aren't held back. reachy2-sdk is unlikely to release new versions.
-reachy2 = [
-    "reachy2_sdk>=1.0.15,<1.1.0",
-    "grpcio<=1.73.1",
-    "protobuf<=6.32.0",
-]
+reachy2 = ["reachy2_sdk>=1.0.15,<1.1.0"]
 # Seeed Studio reBot B601-DM follower (motorbridge / CAN) + StarArm102 / reBot Arm 102
 # leader (motorbridge-smart-servo / FashionStar UART servos).
 rebot = ["lerobot[motorbridge-dep]", "lerobot[motorbridge-smart-servo-dep]"]
@@ -203,9 +203,9 @@ wallx = [
    "torchdiffeq>=0.2.4,<0.3.0",
    "lerobot[qwen-vl-utils-dep]",
 ]
-pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]"]
+pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]", "lerobot[sentencepiece-dep]"]
 molmoact2 = ["lerobot[transformers-dep]", "lerobot[peft-dep]", "lerobot[scipy-dep]"]
-smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "lerobot[accelerate-dep]"]
+smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0"]
 multi_task_dit = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]"]
 groot = [
    "lerobot[transformers-dep]",
@@ -222,7 +222,7 @@ robometer = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]", "lerobot
 topreward = ["lerobot[transformers-dep]"]
 xvla = ["lerobot[transformers-dep]"]
 eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
-hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.14,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
+hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
 vla_jepa = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]", "lerobot[qwen-vl-utils-dep]"]

 # Features
@@ -244,17 +244,25 @@ annotations = [
    # install it locally only if you run your own ``vllm serve``.
 ]

+# Tool implementations under src/lerobot/tools/. Each tool's dependencies
+# are isolated so adding a new tool doesn't bloat the base install.
+# Currently only `say` (Kyutai pocket-tts; CPU-only, ~100M params).
+tools = [
+    "pocket-tts>=1.0.0,<3.0.0",
+    "scipy>=1.11.0,<2.0.0",  # SayTool.output_dir uses scipy.io.wavfile
+]
+
 # Development
-dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools>=1.73.1,<2.0.0", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
+dev = ["pre-commit>=3.7.0,<5.0.0", "debugpy>=1.8.1,<1.9.0", "lerobot[grpcio-dep]", "grpcio-tools==1.73.1", "mypy>=1.19.1", "ruff>=0.14.1", "lerobot[notebook]"]
 notebook = ["jupyter>=1.0.0,<2.0.0", "ipykernel>=6.0.0,<7.0.0"]
 test = ["pytest>=8.1.0,<9.0.0", "pytest-timeout>=2.4.0,<3.0.0", "pytest-cov>=5.0.0,<8.0.0", "mock-serial>=0.0.1,<0.1.0 ; sys_platform != 'win32'"]
 video_benchmark = ["scikit-image>=0.23.2,<0.26.0", "pandas>=2.2.2,<2.4.0"]

 # Simulation
 # NOTE: Explicitly listing scipy helps flatten the dependency tree.
-aloha = ["lerobot[dataset]", "gym-aloha>=0.1.4,<0.2.0", "lerobot[scipy-dep]"]
+aloha = ["lerobot[dataset]", "gym-aloha>=0.1.2,<0.2.0", "lerobot[scipy-dep]"]
 pusht = ["lerobot[dataset]", "gym-pusht>=0.1.5,<0.2.0", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk version in gym-pusht instead
-libero = ["lerobot[dataset]", "lerobot[transformers-dep]", "hf-libero>=0.1.4,<0.2.0; sys_platform == 'linux'", "lerobot[scipy-dep]"]
+libero = ["lerobot[dataset]", "lerobot[transformers-dep]", "hf-libero>=0.1.3,<0.2.0; sys_platform == 'linux'", "lerobot[scipy-dep]"]
 metaworld = ["lerobot[dataset]", "metaworld==3.0.0", "lerobot[scipy-dep]"]
 # NOTE: vlabench is NOT exposed as a `lerobot` extra. Its only distribution
 # is the OpenMOSS/VLABench GitHub repo (package name `VLABench`, no PyPI
@@ -340,6 +348,8 @@ lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
 lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
 lerobot-annotate="lerobot.scripts.lerobot_annotate:main"
 lerobot-rollout="lerobot.scripts.lerobot_rollout:main"
+# Interactive hierarchical-VLA runtime for PI052 (PaliGemma backbone).
+lerobot-pi052-runtime="lerobot.scripts.lerobot_pi052_runtime:main"

 # ---------------- Tool Configurations ----------------

@@ -35,28 +35,14 @@ class PlanConfig:
    derive_task_from_video: str = "if_short"
    derive_task_min_words: int = 3

-    # --- Frame input: timestamped contact sheets (always on) ---------------
-    # The subtask describe/segment passes ALWAYS render the episode as
-    # macrodata/refiner-style contact sheets: sampled frames packed into JPEG
-    # grids with each frame's timestamp burned into its corner, so the VLM
-    # cites the exact source time of a boundary directly. This is far cheaper
-    # in vision tokens than one image per frame (≈2× faster subtask generation
-    # in practice), which is why the sampling is dense by default.
-    #
-    # ``frames_per_second`` is the sampling rate: 2.0 = one frame every 0.5s.
-    frames_per_second: float = 2.0
-    # Frame budget per VLM call (= columns × rows × sheets). When a whole
-    # episode sampled at ``frames_per_second`` exceeds this, the episode is
-    # AUTOMATICALLY split into consecutive windows of
-    # ``max_frames_per_prompt`` frames each (one describe→segment call per
-    # window, still at the full ``frames_per_second`` density), and the
-    # per-window spans are merged + stitched into one contiguous cover. So an
-    # episode of any length is always covered at the full sampling density.
-    max_frames_per_prompt: int = 60
-    contact_sheet_columns: int = 5
-    contact_sheet_frames_per_sheet: int = 20
-    contact_sheet_frame_width: int = 224
-    contact_sheet_quality: int = 84
+    # Frames sampled uniformly, capped at max_video_frames — a hard context cap
+    # (~300 tokens/frame, so 32 fit a 32k VLM; 128 overflow).
+    frames_per_second: float = 1.0
+    max_video_frames: int = 32
+
+    # >0: split long episodes into windows of this length (constant fps density)
+    # instead of subsampling the whole episode; spans merged + stitched. 0 disables.
+    subtask_window_seconds: float = 0.0

    min_subtask_seconds: float = 1.5
    plan_max_steps: int = 8
@@ -68,12 +54,12 @@ class PlanConfig:
    # Emit ``style="plan"`` rows at each boundary; False = subtasks + memory only.
    emit_plan: bool = True

-    # Emit ``style="memory"`` rows at each boundary; False = subtasks (+ plan) only.
-    # Symmetric counterpart of ``emit_plan``.
-    emit_memory: bool = True
-
    # (subtask spans are always stitched to a contiguous full-episode cover; not configurable.)

+    # Send a server-side ``video_url`` clip (at use_video_url_fps) instead of embedded frames.
+    use_video_url: bool = False
+    use_video_url_fps: float = 1.0
+
    # Optional EgoMimic-style 5-axis task augmentation; replaces n_task_rephrasings.
    task_aug_axes: TaskAugAxesConfig = field(default_factory=lambda: TaskAugAxesConfig())

@@ -197,9 +183,8 @@ class AnnotationPipelineConfig:
    skip_validation: bool = False
    only_episodes: tuple[int, ...] | None = None

-    # Keyframe decode backend forwarded to ``decode_video_frames``. None →
-    # library default (torchcodec when available, else PyAV). Or pin
-    # ``"torchcodec"`` / ``"pyav"`` explicitly.
+    # Keyframe decode backend. None → ffmpeg CLI (crash-/thread-safe; torchcodec
+    # SIGSEGVs under concurrent decode). Or ``"torchcodec"`` / ``"pyav"``.
    video_backend: str | None = None

    # Upload to the Hub (new_repo_id if set, else repo_id; one must be set).
@@ -24,11 +24,8 @@ querying the same timestamp pay decode cost once.

 from __future__ import annotations

-import io
 import logging
-import math
 import threading
-from collections.abc import Sequence
 from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any, Protocol
@@ -36,10 +33,9 @@ from typing import Any, Protocol
 import PIL.Image
 import torch

-from lerobot.configs.video import VideoEncoderConfig
-from lerobot.datasets.video_utils import decode_video_frames, reencode_video
+from lerobot.datasets.video_utils import decode_video_frames

-from .reader import EpisodeRecord, snap_to_frame
+from .reader import EpisodeRecord

 logger = logging.getLogger(__name__)

@@ -138,9 +134,10 @@ class VideoFrameProvider:
    camera_key: str | None = None
    tolerance_s: float = 1e-2
    cache_size: int = 256
-    # Keyframe decode backend forwarded to
-    # :func:`lerobot.datasets.video_utils.decode_video_frames`. ``None``
-    # uses the library default (torchcodec when available, else PyAV).
+    # Keyframe decode backend. ``None`` uses the ffmpeg CLI — the
+    # concurrency- and crash-safe default for the pipeline's threaded
+    # decode. Set to ``"torchcodec"`` or ``"pyav"`` to pin an in-process
+    # decoder when the build is known thread-safe.
    video_backend: str | None = None
    _meta: Any = field(default=None, init=False, repr=False)
    _cache: dict = field(default_factory=dict, init=False, repr=False)
@@ -149,10 +146,6 @@ class VideoFrameProvider:
    # ``ExecutorConfig.episode_parallelism``); guard the dict cache and the
    # one-shot warn flag against concurrent updates from worker threads.
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
-    # Serializes decode_video_frames calls: torchcodec hands out one
-    # ``VideoDecoder`` per file from a process-wide cache, and the decoder
-    # is not safe to drive from multiple threads at once.
-    _decode_lock: threading.Lock = field(default_factory=threading.Lock, init=False, repr=False)
    _warned_decode_fail: bool = field(default=False, init=False, repr=False)

    def __post_init__(self) -> None:
@@ -188,13 +181,6 @@ class VideoFrameProvider:
        target = camera_key if camera_key is not None else self.camera_key
        if not timestamps or target is None:
            return []
-        # Snap each request to the nearest real frame timestamp: callers
-        # sample uniform grids whose points land mid-frame, and
-        # ``decode_video_frames`` rejects queries farther than
-        # ``tolerance_s`` from a decodable frame. Snapping also dedupes
-        # repeat queries through the cache.
-        if record.frame_timestamps:
-            timestamps = [snap_to_frame(float(ts), record.frame_timestamps) for ts in timestamps]

        out: list[Any] = []
        misses: list[float] = []
@@ -258,14 +244,15 @@ class VideoFrameProvider:
    def episode_clip_path(self, record: EpisodeRecord, cache_dir: Path) -> Path | None:
        """Extract the episode's subclip to ``cache_dir/ep_{idx:06d}.mp4``.

-        Returns ``None`` if the dataset has no video tracks or extraction
-        failed. Skips re-extract when the cached clip already exists.
-        Re-encodes to H.264 via
-        :func:`lerobot.datasets.video_utils.reencode_video` so the resulting
-        mp4 is decodable by every downstream video processor — stream-copy
-        would inherit the source codec (often AV1 in modern LeRobot
-        datasets), which vllm's libav build cannot decode.
+        Returns ``None`` if the dataset has no video tracks. Skips
+        re-extract when the cached clip already exists. Re-encodes to
+        H.264 (libx264) so the resulting mp4 is decodable by every
+        downstream video processor — stream-copy would inherit the
+        source codec (often AV1 in modern LeRobot datasets), which
+        vllm's libav build cannot decode.
        """
+        import subprocess  # noqa: PLC0415
+
        if self.camera_key is None:
            return None
        cache_dir.mkdir(parents=True, exist_ok=True)
@@ -276,20 +263,33 @@ class VideoFrameProvider:
        from_timestamp = float(ep[f"videos/{self.camera_key}/from_timestamp"])
        to_timestamp = float(ep[f"videos/{self.camera_key}/to_timestamp"])
        src = self.root / self._meta.get_video_file_path(record.episode_index, self.camera_key)
-        encoder = VideoEncoderConfig(vcodec="h264", pix_fmt="yuv420p", g=None, crf=23, preset="ultrafast")
+        cmd = [
+            "ffmpeg",
+            "-y",
+            "-loglevel",
+            "error",
+            "-ss",
+            f"{from_timestamp:.3f}",
+            "-to",
+            f"{to_timestamp:.3f}",
+            "-i",
+            str(src),
+            "-c:v",
+            "libx264",
+            "-preset",
+            "ultrafast",
+            "-crf",
+            "23",
+            "-pix_fmt",
+            "yuv420p",
+            "-an",
+            str(out_path),
+        ]
        try:
-            reencode_video(
-                src,
-                out_path,
-                camera_encoder=encoder,
-                overwrite=True,
-                start_time_s=from_timestamp,
-                end_time_s=to_timestamp,
-            )
-        except Exception:
-            logger.warning(
-                "clip extraction failed for episode %s (%s)", record.episode_index, src, exc_info=True
-            )
+            # ffmpeg is resolved by name via PATH lookup (the standard way to
+            # call the CLI); the argument list is built here and no shell is used.
+            subprocess.run(cmd, check=True, timeout=300)  # nosec B607
+        except (subprocess.CalledProcessError, subprocess.TimeoutExpired, FileNotFoundError):
            return None
        return out_path if out_path.exists() and out_path.stat().st_size > 0 else None

@@ -297,47 +297,61 @@ class VideoFrameProvider:
        """Decode ``timestamps`` from the episode's video as ``(C, H, W)`` tensors.

        Delegates to :func:`lerobot.datasets.video_utils.decode_video_frames`
-        (torchcodec when available, PyAV otherwise; ``video_backend`` pins
-        one explicitly). Returns one frame per requested timestamp, or ``[]``
-        if decoding failed — callers treat ``[]`` as "no frames available".
+        only when ``video_backend`` pins ``torchcodec``; the default is the
+        ffmpeg CLI. Returns one frame per requested timestamp, or ``[]`` if
+        every backend failed; callers treat ``[]`` as "no frames available".
        """
        ep = self._meta.episodes[episode_index]
        from_timestamp = ep[f"videos/{camera_key}/from_timestamp"]
        shifted = [from_timestamp + ts for ts in timestamps]
        video_path = self.root / self._meta.get_video_file_path(episode_index, camera_key)

-        try:
-            # The module phases decode under a ThreadPoolExecutor (see
-            # ``ExecutorConfig.episode_parallelism``) but torchcodec's cached
-            # per-file decoder is single-threaded, so serialize decodes on a
-            # dedicated lock. Frame extraction is a small fraction of episode
-            # wall time (VLM calls dominate), so the contention is cheap.
-            with self._decode_lock:
+        # Default to the ffmpeg CLI. The pipeline decodes under a 16-wide
+        # ThreadPoolExecutor and the in-process decoders are unsafe there:
+        # torchcodec is not thread-safe and SIGSEGVs under concurrent decode
+        # (a crash no try/except can catch), PyAV can likewise segfault on
+        # AV1, and lerobot's ``pyav`` backend routes through the removed
+        # ``torchvision.io.VideoReader``. ``_decode_frames_ffmpeg`` shells
+        # out per frame: each decode is an isolated child process, so it is
+        # both crash-safe and concurrency-safe. ``video_backend`` can pin
+        # ``torchcodec`` / ``pyav`` explicitly for callers that know their
+        # build is safe.
+        chain = [self.video_backend] if self.video_backend else ["ffmpeg"]
+
+        exc: Exception | None = None
+        for backend in chain:
+            try:
+                if backend == "ffmpeg":
+                    return _decode_frames_ffmpeg(video_path, shifted)
+                if backend in ("pyav", "av"):
+                    return _decode_frames_av(video_path, shifted)
                # Stacked ``(N, C, H, W)`` uint8 tensor; one row per timestamp.
                decoded = decode_video_frames(
-                    video_path, shifted, self.tolerance_s, backend=self.video_backend, return_uint8=True
+                    video_path, shifted, self.tolerance_s, backend=backend, return_uint8=True
                )
-            return list(decoded)
-        except Exception as exc:
-            # Log loudly the first time so a silent vqa-module no-op (every
-            # prompt skipped because frames_at returned []) is debuggable from
-            # the job log instead of post-hoc parquet inspection. Subsequent
-            # failures stay quiet.
-            with self._lock:
-                already_warned = self._warned_decode_fail
-                if not already_warned:
-                    self._warned_decode_fail = True
+                return list(decoded)
+            except Exception as e:  # noqa: PERF203
+                exc = e
+
+        # Every backend raised. Log loudly the first time so a silent
+        # vqa-module no-op (every prompt skipped because frames_at returned
+        # []) is debuggable from the job log instead of post-hoc parquet
+        # inspection. Subsequent failures stay quiet.
+        with self._lock:
+            already_warned = self._warned_decode_fail
            if not already_warned:
-                logger.warning(
-                    "VideoFrameProvider._decode failed for episode=%s camera=%s video_path=%s backend=%s: %s",
-                    episode_index,
-                    camera_key,
-                    video_path,
-                    self.video_backend,
-                    exc,
-                    exc_info=exc,
-                )
-            return []
+                self._warned_decode_fail = True
+        if not already_warned:
+            logger.warning(
+                "VideoFrameProvider._decode failed for episode=%s camera=%s video_path=%s backends=%s: %s",
+                episode_index,
+                camera_key,
+                video_path,
+                chain,
+                exc,
+                exc_info=exc,
+            )
+        return []


 def make_frame_provider(
@@ -353,6 +367,91 @@ def make_frame_provider(
    return provider


+def _decode_frames_ffmpeg(video_path: Path, timestamps: list[float]) -> list[Any]:
+    """Decode the frames nearest to ``timestamps`` via the ffmpeg CLI.
+
+    Runs one ``ffmpeg`` process per timestamp, seeking with ``-ss`` and
+    piping a single PNG to stdout. Unlike the in-process decoders this
+    survives a hostile container: a full ffmpeg build decodes AV1 (the codec
+    modern LeRobot datasets use) where torchcodec raises and PyAV can
+    SIGSEGV, and a crash stays isolated to the child process — a non-zero
+    exit is a catchable error, not a segfault of the whole job. Returns one
+    ``(C, H, W)`` uint8 tensor per timestamp.
+    """
+    import io  # noqa: PLC0415
+    import subprocess  # noqa: PLC0415
+
+    import numpy as np  # noqa: PLC0415
+
+    frames: list[Any] = []
+    for ts in timestamps:
+        # ffmpeg invoked by name via PATH lookup; fully-controlled arg list, no shell.
+        proc = subprocess.run(  # nosec B607
+            [
+                "ffmpeg",
+                "-nostdin",
+                "-loglevel",
+                "error",
+                "-ss",
+                f"{max(ts, 0.0):.3f}",
+                "-i",
+                str(video_path),
+                "-frames:v",
+                "1",
+                "-f",
+                "image2pipe",
+                "-vcodec",
+                "png",
+                "pipe:1",
+            ],
+            capture_output=True,
+            check=True,
+            timeout=120,
+        )
+        if not proc.stdout:
+            raise RuntimeError(f"ffmpeg returned no frame for t={ts:.3f}s of {video_path}")
+        img = PIL.Image.open(io.BytesIO(proc.stdout)).convert("RGB")
+        frames.append(torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1).contiguous())
+    return frames
+
+
+def _decode_frames_av(video_path: Path, timestamps: list[float]) -> list[Any]:
+    """Decode the frames nearest to ``timestamps`` using PyAV directly.
+
+    lerobot's ``decode_video_frames(backend="pyav")`` routes through
+    ``torchvision.io.VideoReader``, removed in torchvision 0.23+. This helper
+    talks to the ``av`` package directly. Note PyAV can SIGSEGV on AV1
+    streams in some builds — prefer ``_decode_frames_ffmpeg`` as the default
+    fallback; this stays available behind ``video_backend="pyav"``. Returns
+    one ``(C, H, W)`` uint8 tensor per timestamp.
+    """
+    import av  # noqa: PLC0415
+
+    first_ts = min(timestamps)
+    last_ts = max(timestamps)
+    loaded_frames: list[torch.Tensor] = []
+    loaded_ts: list[float] = []
+    with av.open(str(video_path)) as container:
+        stream = container.streams.video[0]
+        # Seek to the keyframe at or before the first requested timestamp.
+        offset = max(int(first_ts / stream.time_base), 0) if stream.time_base else 0
+        container.seek(offset, stream=stream, backward=True, any_frame=False)
+        for idx, frame in enumerate(container.decode(stream)):
+            ts = frame.time
+            if ts is None:
+                ts = float(frame.pts * stream.time_base) if frame.pts is not None else float(idx)
+            loaded_ts.append(ts)
+            loaded_frames.append(
+                torch.from_numpy(frame.to_ndarray(format="rgb24")).permute(2, 0, 1).contiguous()
+            )
+            if ts >= last_ts:
+                break
+    if not loaded_frames:
+        raise RuntimeError(f"PyAV decoded no frames from {video_path}")
+    ts_tensor = torch.tensor(loaded_ts)
+    return [loaded_frames[int(torch.argmin((ts_tensor - q).abs()))] for q in timestamps]
+
+
 def _frame_to_pil(frame: Any) -> Any:
    """Materialise a decoded frame as a ``PIL.Image`` for the VLM message.

@@ -397,85 +496,3 @@ def to_video_url_block(url: str | None, fps: float = 2.0) -> list[dict[str, Any]
    if not url:
        return []
    return [{"type": "video_url", "video_url": {"url": url}, "fps": fps}]
-
-
-def _draw_timestamp_badge(image: PIL.Image.Image, timestamp: float) -> PIL.Image.Image:
-    """Burn ``timestamp`` (seconds) into the top-left corner of ``image``.
-
-    A solid black badge with white text, so a VLM reading a contact sheet can
-    cite the exact source time of each tile (e.g. ``012.50s``) directly,
-    instead of the caller having to map tile position back to time. Mirrors
-    the macrodata/refiner contact-sheet convention.
-    """
-    from PIL import ImageDraw, ImageFont
-
-    result = image.copy()
-    draw = ImageDraw.Draw(result)
-    font = ImageFont.load_default()
-    label = f"{timestamp:06.2f}s"
-    left, top, right, bottom = draw.textbbox((0, 0), label, font=font)
-    text_w, text_h = right - left, bottom - top
-    pad = max(3, round(min(image.width, image.height) * 0.018))
-    draw.rectangle((0, 0, text_w + pad * 2, text_h + pad * 2), fill=(0, 0, 0))
-    draw.text((pad - left, pad - top), label, fill=(255, 255, 255), font=font)
-    return result
-
-
-def to_contact_sheet_blocks(
-    frames: Sequence[Any],
-    timestamps: Sequence[float],
-    *,
-    columns: int = 5,
-    frames_per_sheet: int = 20,
-    frame_width: int = 224,
-    quality: int = 84,
-) -> list[dict[str, Any]]:
-    """Pack decoded frames into timestamped JPEG contact-sheet image blocks.
-
-    Each frame is resized to ``frame_width`` wide, stamped with its
-    episode-relative timestamp, and tiled row-major into grids of
-    ``frames_per_sheet`` (``columns`` wide). One ``{"type":"image", ...}``
-    block is returned per grid; many frames collapse into a few images, so a
-    long episode's temporal coverage stays dense at a fraction of the vision
-    tokens N separate frames would cost. ``frames`` and ``timestamps`` must be
-    aligned and equal length. Returns ``[]`` for empty input.
-    """
-    from PIL import Image
-
-    if not frames:
-        return []
-    columns = max(1, columns)
-    frames_per_sheet = max(1, frames_per_sheet)
-    rows_per_sheet = math.ceil(frames_per_sheet / columns)
-
-    tiles: list[PIL.Image.Image] = []
-    for ts, frame in zip(timestamps, frames, strict=False):
-        img = _frame_to_pil(frame)
-        if not isinstance(img, PIL.Image.Image):
-            continue
-        img = img.convert("RGB")
-        if img.width != frame_width:
-            height = max(1, round(img.height * frame_width / img.width))
-            img = img.resize((frame_width, height), resample=Image.Resampling.BILINEAR)
-        tiles.append(_draw_timestamp_badge(img, float(ts)))
-    if not tiles:
-        return []
-
-    blocks: list[dict[str, Any]] = []
-    for start in range(0, len(tiles), frames_per_sheet):
-        chunk = tiles[start : start + frames_per_sheet]
-        cell_w = max(tile.width for tile in chunk)
-        cell_h = max(tile.height for tile in chunk)
-        sheet = Image.new("RGB", (cell_w * columns, cell_h * rows_per_sheet), color=(0, 0, 0))
-        for i, tile in enumerate(chunk):
-            x = (i % columns) * cell_w
-            y = (i // columns) * cell_h
-            sheet.paste(tile, (x, y))
-        # JPEG round-trip at ``quality`` to match the refiner convention and
-        # shrink the wire payload; vision-token count is set by resolution, so
-        # the real saving is the grid packing, not the codec.
-        buf = io.BytesIO()
-        sheet.save(buf, format="JPEG", quality=quality)
-        buf.seek(0)
-        blocks.append({"type": "image", "image": Image.open(buf).convert("RGB")})
-    return blocks
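A minimal sketch of the per-timestamp argument list that `_decode_frames_ffmpeg` passes to `subprocess.run`, pulled out into a pure function so it can be inspected without ffmpeg installed. The helper name `build_single_frame_cmd` and the file name `episode.mp4` are hypothetical and do not appear in the change; the flags are copied from the added code above.

```python
from pathlib import Path


def build_single_frame_cmd(video_path: Path, ts: float) -> list[str]:
    """Hypothetical helper: argv that extracts one PNG frame at ``ts`` seconds."""
    return [
        "ffmpeg",
        "-nostdin",
        "-loglevel", "error",
        # ``-ss`` precedes ``-i`` so the seek applies to the input;
        # negative timestamps are clamped to the start of the file.
        "-ss", f"{max(ts, 0.0):.3f}",
        "-i", str(video_path),
        "-frames:v", "1",       # emit exactly one video frame
        "-f", "image2pipe",
        "-vcodec", "png",
        "pipe:1",               # write the PNG bytes to stdout
    ]


cmd = build_single_frame_cmd(Path("episode.mp4"), -0.25)
print(cmd)
```

Because the command is a list and no shell is involved, the path and timestamp are passed as single arguments regardless of spaces or special characters in the file name.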
@@ -20,13 +20,16 @@ from __future__ import annotations
 import logging
 from collections.abc import Sequence
 from dataclasses import dataclass, field
+from pathlib import Path
 from typing import Any

 from ..config import PlanConfig
 from ..frames import (
    FrameProvider,
+    VideoFrameProvider,
    null_provider,
-    to_contact_sheet_blocks,
+    to_video_block,
+    to_video_url_block,
 )
 from ..prompts import load as load_prompt
 from ..reader import EpisodeRecord, reconstruct_subtask_spans, snap_to_frame
@@ -36,44 +39,6 @@ from ..vlm_client import VlmClient
 logger = logging.getLogger(__name__)


-# Prepended to every describe / segment prompt so the VLM knows the images are
-# timestamped contact-sheet grids, not a single video, and reads the burned-in
-# per-tile timestamp when choosing boundaries.
-def _contact_sheet_preamble(columns: int) -> str:
-    return (
-        "CONTACT SHEETS — how to read the images below:\n"
-        f"- Each image is a grid of sampled video frames, {columns} per row, "
-        "with time running left-to-right then top-to-bottom (row-major).\n"
-        "- Each frame has its timestamp burned into the top-left corner, e.g. "
-        '"012.50s". Use that printed timestamp (not the tile position) when you '
-        "choose start/end times; boundaries should land on or near a printed "
-        "timestamp.\n"
-        "- Frames continue across grids: an action may span the end of one sheet "
-        "and the start of the next, so do not place a boundary just because a new "
-        "image begins.\n\n"
-    )
-
-
-# Appended to every describe (and segment) prompt. A visual, causal definition
-# of where one event ends and the next begins — adapted from macrodata/refiner —
-# to sharpen cut points while the existing prompt keeps owning the imperative
-# phrasing.
-_CAUSAL_BOUNDARY_RULES = (
-    "EVENT BOUNDARIES — where one event ends and the next begins:\n"
-    "- Start a new event whenever the world state changes: an object becomes "
-    "held (the gripper closes on it), an object is released (the gripper opens "
-    "and it stays put), an object reaches a new location, a lid/door/drawer "
-    "changes open/closed state, a tool starts or stops affecting a surface, or "
-    "contents visibly move (e.g. poured).\n"
-    "- If a single action changes the same state gradually and continuously, "
-    "keep it as ONE event — do not split it.\n"
-    "- If the same action repeats on different objects or target locations, "
-    "treat each repetition as a separate event.\n"
-    "- Do NOT create boundaries for idle time, camera motion, hesitation, or "
-    "tiny hand adjustments."
-)
-
-
 @dataclass
 class PlanSubtasksMemoryModule:
    """Generate subtask spans, plan, and memory rows.
@@ -148,11 +113,9 @@ class PlanSubtasksMemoryModule:
                            "tool_calls": None,
                        }
                    )
-        # memory rows at every subtask boundary except the very first start;
-        # skipped entirely when ``emit_memory`` is False (subtasks-only / plan-only).
+        # memory rows at every subtask boundary except the very first start
        prior_memory = ""
-        memory_boundaries = enumerate(subtask_spans[1:], start=1) if self.config.emit_memory else []
-        for i, span in memory_boundaries:
+        for i, span in enumerate(subtask_spans[1:], start=1):
            completed = subtask_spans[i - 1]["text"]
            remaining = [s["text"] for s in subtask_spans[i:]]
            mem_text = self._generate_memory(record, prior_memory, completed, remaining, task=effective_task)
@@ -257,13 +220,7 @@ class PlanSubtasksMemoryModule:
        prompt: str,
        window: tuple[float, float] | None = None,
    ) -> list[dict[str, Any]]:
-        """User message combining the (optionally windowed) contact sheets with ``prompt``.
-
-        The prompt is always prefixed with a short explanation of how to read
-        the timestamped grids, so the model treats them as one ordered
-        sequence of frames rather than unrelated images.
-        """
-        prompt = _contact_sheet_preamble(self.config.contact_sheet_columns) + prompt
+        """User message combining the (optionally windowed) video block with ``prompt``."""
        content = [*self._episode_video_block(record, window=window), {"type": "text", "text": prompt}]
        return [{"role": "user", "content": content}]

@@ -336,19 +293,24 @@ class PlanSubtasksMemoryModule:
    def _episode_video_block(
        self, record: EpisodeRecord, window: tuple[float, float] | None = None
    ) -> list[dict[str, Any]]:
-        """Timestamped contact sheets for the describe / segmentation prompts.
+        """Video block for the segmentation / describe prompts.

-        Always renders the (optionally windowed) episode as contact sheets:
-        frames sampled at ``frames_per_second`` and packed into timestamped
-        JPEG grids. ``max_frames_per_prompt`` caps the frame count; whole
-        episodes that exceed it are windowed upstream in
-        :meth:`_generate_subtasks` so each call stays within budget while the
-        full episode keeps its sampling density.
+        Always returns a block that actually carries the video. When
+        ``use_video_url`` is set we try the server-side ``video_url``
+        path first, but if clip extraction fails we FALL BACK to
+        decoding + embedding frames rather than returning an empty
+        block — an empty block would leave the VLM with no visual
+        grounding at all and it would hallucinate subtasks purely from
+        the task text.

-        When ``window=(w0, w1)`` is given the badges are WINDOW-RELATIVE
-        (``ts - w0``) to match the window-relative time frame the
-        segmentation prompt works in (spans are offset back to absolute time
-        afterwards).
+        When ``window=(w0, w1)`` is given (windowed subtask generation,
+        ``subtask_window_seconds > 0``), embed frames sampled at the FIXED
+        ``frames_per_second`` rate within ``[w0, w1]`` — constant temporal
+        density regardless of episode length, so long episodes are split
+        into windows rather than subsampled to a sparse 32-frame whole-
+        episode view. The ``video_url`` path is skipped for windows (it is
+        a whole-episode clip). ``max_video_frames`` still caps each window
+        as a context-budget safety net.
        """
        if not record.frame_timestamps:
            return []
@@ -356,44 +318,28 @@ class PlanSubtasksMemoryModule:
            w0, w1 = float(window[0]), float(window[1])
            dur = max(0.0, w1 - w0)
            n = max(1, int(round(dur * self.config.frames_per_second)) + 1)
-            n = min(n, self.config.max_frames_per_prompt)
+            n = min(n, self.config.max_video_frames)
            if n <= 1 or dur <= 0.0:
                timestamps = [0.5 * (w0 + w1)]
            else:
                step = dur / (n - 1)
                timestamps = [w0 + i * step for i in range(n)]
-            frames = self.frame_provider.frames_at(record, timestamps)
-            rel = [ts - w0 for ts in timestamps[: len(frames)]]
-            return self._contact_sheet_blocks(frames, rel)
+            return to_video_block(self.frame_provider.frames_at(record, timestamps))
+        if self.config.use_video_url and isinstance(self.frame_provider, VideoFrameProvider):
+            cache_dir = Path(self.frame_provider.root) / ".annotate_staging" / ".video_clips"
+            clip = self.frame_provider.episode_clip_path(record, cache_dir)
+            if clip is not None:
+                return to_video_url_block(f"file://{clip}", fps=self.config.use_video_url_fps)
+            logger.warning(
+                "episode %d: video_url clip extraction failed — falling back to "
+                "embedded frames so the VLM still sees the demonstration",
+                record.episode_index,
+            )
        episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
-        n = max(1, int(round(episode_duration * self.config.frames_per_second)) + 1)
-        n = min(n, self.config.max_frames_per_prompt)
-        timestamps = self._uniform_episode_timestamps(record, n)
-        frames = self.frame_provider.frames_at(record, timestamps)
-        return self._contact_sheet_blocks(frames, timestamps[: len(frames)])
-
-    @staticmethod
-    def _uniform_episode_timestamps(record: EpisodeRecord, n: int) -> list[float]:
-        """``n`` episode-relative timestamps spanning ``[t0, t_last]`` uniformly."""
-        ts = record.frame_timestamps
-        if n >= len(ts):
-            return [float(t) for t in ts]
-        t0, t_last = float(ts[0]), float(ts[-1])
-        if t_last <= t0 or n <= 1:
-            return [t0] * max(1, n)
-        step = (t_last - t0) / (n - 1)
-        return [t0 + i * step for i in range(n)]
-
-    def _contact_sheet_blocks(self, frames: list[Any], timestamps: list[float]) -> list[dict[str, Any]]:
-        """Build timestamped contact-sheet image blocks from decoded frames."""
-        return to_contact_sheet_blocks(
-            frames,
-            timestamps,
-            columns=self.config.contact_sheet_columns,
-            frames_per_sheet=self.config.contact_sheet_frames_per_sheet,
-            frame_width=self.config.contact_sheet_frame_width,
-            quality=self.config.contact_sheet_quality,
-        )
+        target_count = max(1, int(round(episode_duration * self.config.frames_per_second)))
+        target_count = min(target_count, self.config.max_video_frames)
+        video_frames = self.frame_provider.video_for_episode(record, target_count)
+        return to_video_block(video_frames)

    def run_plan_updates(
        self,
@@ -459,17 +405,12 @@ class PlanSubtasksMemoryModule:
        episode_duration = record.frame_timestamps[-1] - record.frame_timestamps[0]
        effective_task = task if task is not None else record.episode_task

-        # ---- Auto-windowing (keeps the full sampling density) --------
-        # Contact sheets are cheap, but a whole long episode sampled at
-        # ``frames_per_second`` can still exceed ``max_frames_per_prompt``.
-        # When it does, split into consecutive windows of exactly that many
-        # frames (one describe→segment call each, still at the full sampling
-        # density), then merge + stitch — so an episode of any length is
-        # covered at full density rather than subsampled into one sparse call.
-        fps = max(1e-6, float(self.config.frames_per_second))
-        n_whole = int(round(episode_duration * fps)) + 1
-        if n_whole > self.config.max_frames_per_prompt:
-            window_s = self.config.max_frames_per_prompt / fps
+        # ---- Windowed path (constant temporal density) ---------------
+        # If subtask_window_seconds > 0 and the episode exceeds one window,
+        # process fixed-length windows so the VLM always sees
+        # frames_per_second density; results are merged + stitched.
+        window_s = float(getattr(self.config, "subtask_window_seconds", 0.0) or 0.0)
+        if window_s > 0.0 and episode_duration > window_s:
            return self._generate_subtasks_windowed(record, effective_task, window_s)

        # ---- Pass 1 (optional): grounding description ----------------
@@ -487,14 +428,12 @@ class PlanSubtasksMemoryModule:
                )

        # ---- Pass 2: segmentation ------------------------------------
-        prompt = self._with_causal_rules(
-            load_prompt("plan_subtasks").format(
-                episode_task=effective_task,
-                min_subtask_seconds=self.config.min_subtask_seconds,
-                max_steps=self.config.plan_max_steps,
-                episode_duration=f"{episode_duration:.3f}",
-                observation_block=observation_block,
-            )
+        prompt = load_prompt("plan_subtasks").format(
+            episode_task=effective_task,
+            min_subtask_seconds=self.config.min_subtask_seconds,
+            max_steps=self.config.plan_max_steps,
+            episode_duration=f"{episode_duration:.3f}",
+            observation_block=observation_block,
        )
        spans = self._vlm_field(self._video_message(record, prompt), "subtasks")
        cleaned = self._clean_spans(spans, record)
@@ -569,14 +508,12 @@ class PlanSubtasksMemoryModule:
                    "action that is not in your description above.\n\n"
                )

-        prompt = self._with_causal_rules(
-            load_prompt("plan_subtasks").format(
-                episode_task=task,
-                min_subtask_seconds=self.config.min_subtask_seconds,
-                max_steps=self.config.plan_max_steps,
-                episode_duration=f"{win_len:.3f}",
-                observation_block=observation_block,
-            )
+        prompt = load_prompt("plan_subtasks").format(
+            episode_task=task,
+            min_subtask_seconds=self.config.min_subtask_seconds,
+            max_steps=self.config.plan_max_steps,
+            episode_duration=f"{win_len:.3f}",
+            observation_block=observation_block,
        )
        spans = self._vlm_field(self._video_message(record, prompt, window=window), "subtasks")
        # Window-relative clamp; no frame-snap dedupe yet (done on the
@@ -623,11 +560,6 @@ class PlanSubtasksMemoryModule:
                s["end"] = float(s["start"])
        return spans

-    @staticmethod
-    def _with_causal_rules(prompt: str) -> str:
-        """Append the causal event-boundary rules to a describe/segment prompt."""
-        return f"{prompt}\n\n{_CAUSAL_BOUNDARY_RULES}"
-
    def _clean_spans(
        self,
        spans: Any,
@@ -675,7 +607,7 @@ class PlanSubtasksMemoryModule:
        self, record: EpisodeRecord, task: str, window: tuple[float, float] | None = None
    ) -> str:
        """Grounding pass: free-form chronological description of the (windowed) video."""
-        prompt = self._with_causal_rules(load_prompt("plan_subtask_describe").format(episode_task=task))
+        prompt = load_prompt("plan_subtask_describe").format(episode_task=task)
        text = self._vlm_field(self._video_message(record, prompt, window=window), "description")
        return text.strip() if isinstance(text, str) and text.strip() else ""
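The windowed branch of `_episode_video_block` samples timestamps at a fixed rate and then caps the count. A minimal standalone sketch of that arithmetic, assuming the same formula as the hunk above; the function name `window_timestamps` and its parameters are hypothetical stand-ins for `self.config.frames_per_second` and `self.config.max_video_frames`.

```python
def window_timestamps(w0: float, w1: float, fps: float, max_frames: int) -> list[float]:
    """Hypothetical helper: uniform timestamps in [w0, w1] at ``fps``, capped at ``max_frames``."""
    dur = max(0.0, w1 - w0)
    # One frame per 1/fps seconds, plus one so both window endpoints are included.
    n = max(1, int(round(dur * fps)) + 1)
    n = min(n, max_frames)
    if n <= 1 or dur <= 0.0:
        # Degenerate window: fall back to its midpoint.
        return [0.5 * (w0 + w1)]
    step = dur / (n - 1)
    return [w0 + i * step for i in range(n)]


print(window_timestamps(10.0, 12.0, fps=2.0, max_frames=32))
# [10.0, 10.5, 11.0, 11.5, 12.0]
```

When the cap applies, the step widens so the capped frames still span the whole window instead of being truncated at the front.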

@@ -310,19 +310,6 @@ def _make_openai_client(config: VlmConfig) -> VlmClient:
    return _GenericTextClient(_gen, config)


-def _bind_serve_port(cmd: str, port: int) -> str:
-    """Bind a serve command to ``port``: substitute a ``{port}`` placeholder
-    if present, else append ``--port`` when the command omits it (leaving an
-    explicit ``--port`` untouched). Shared by the single- and parallel-server
-    paths so a serve_command never reaches the server with a literal
-    ``{port}``."""
-    if "{port}" in cmd:
-        return cmd.replace("{port}", str(port))
-    if "--port" not in cmd:
-        return f"{cmd} --port {port}"
-    return cmd
-
-
 def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
    """Spawn ``config.parallel_servers`` independent vllm replicas.

@@ -365,7 +352,7 @@ def _spawn_parallel_inference_servers(config: VlmConfig) -> list[str]:
        gpu = i % num_gpus
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
-        cmd = _bind_serve_port(base_cmd, port)
+        cmd = base_cmd.replace("{port}", str(port)) if "{port}" in base_cmd else f"{base_cmd} --port {port}"
        api_base = f"http://localhost:{port}/v1"
        api_bases.append(api_base)
        print(f"[server-{i}] launching on GPU {gpu} port {port}: {cmd}", flush=True)
@@ -464,11 +451,6 @@ def _spawn_inference_server(config: VlmConfig) -> str:
            f"transformers serve {shlex.quote(config.model_id)} "
            f"--port {config.serve_port} --continuous-batching"
        )
-    # Bind the single server to ``serve_port`` (what ``api_base`` below
-    # targets): substitute a literal ``{port}`` placeholder, else append
-    # ``--port``. Without this a serve_command carrying ``{port}`` would
-    # reach the server unsubstituted and fail to parse.
-    cmd = _bind_serve_port(cmd, config.serve_port)
    api_base = f"http://localhost:{config.serve_port}/v1"
    print(f"[server] launching: {cmd}", flush=True)
    proc = subprocess.Popen(
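The inlined port expression in `_spawn_parallel_inference_servers` handles two cases, while the removed `_bind_serve_port` handled three. A minimal sketch comparing them on plain strings; the function names and the `my-server` command are hypothetical, and the bodies are copied from the added and removed lines above.

```python
def bind_port_inline(base_cmd: str, port: int) -> str:
    """Two-case expression, as in the added line."""
    return base_cmd.replace("{port}", str(port)) if "{port}" in base_cmd else f"{base_cmd} --port {port}"


def bind_port_three_case(cmd: str, port: int) -> str:
    """Three-case logic, as in the removed helper."""
    if "{port}" in cmd:
        return cmd.replace("{port}", str(port))
    if "--port" not in cmd:
        return f"{cmd} --port {port}"
    return cmd  # an explicit --port is left untouched


for cmd in ("my-server --port {port}", "my-server", "my-server --port 9000"):
    print(repr(bind_port_inline(cmd, 8001)), repr(bind_port_three_case(cmd, 8001)))
```

The two agree on the first two inputs. On a command that already carries an explicit `--port`, the inline form appends a second `--port`, whereas the removed helper returned the command unchanged.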
@@ -49,19 +49,8 @@ def get_step_checkpoint_dir(output_dir: Path, total_steps: int, step: int) -> Pa
    return output_dir / CHECKPOINTS_DIR / step_identifier


-def save_training_step(
-    step: int, save_dir: Path, num_processes: int | None = None, batch_size: int | None = None
-) -> None:
-    state: dict = {"step": step}
-    # num_processes and batch_size are recorded so a resumed run can detect a changed world size or
-    # batch size: the sampler's resume offset is computed from the (num_processes, batch_size) that
-    # produced `step`, since both scale how many sampler positions a step consumes (see
-    # compute_sampler_state).
-    if num_processes is not None:
-        state["num_processes"] = num_processes
-    if batch_size is not None:
-        state["batch_size"] = batch_size
-    write_json(state, save_dir / TRAINING_STEP)
+def save_training_step(step: int, save_dir: Path) -> None:
+    write_json({"step": step}, save_dir / TRAINING_STEP)


 def load_training_step(save_dir: Path) -> int:
@@ -69,16 +58,6 @@ def load_training_step(save_dir: Path) -> int:
    return training_step["step"]


-def load_training_num_processes(checkpoint_dir: Path) -> int | None:
-    """World size recorded at checkpoint time, or None for checkpoints written before it was stored."""
-    return load_json(checkpoint_dir / TRAINING_STATE_DIR / TRAINING_STEP).get("num_processes")
-
-
-def load_training_batch_size(checkpoint_dir: Path) -> int | None:
-    """Per-process batch size recorded at checkpoint time, or None for older checkpoints."""
-    return load_json(checkpoint_dir / TRAINING_STATE_DIR / TRAINING_STEP).get("batch_size")
-
-
 def update_last_checkpoint(checkpoint_dir: Path) -> Path:
    last_checkpoint_dir = checkpoint_dir.parent / LAST_CHECKPOINT_LINK
    if last_checkpoint_dir.is_symlink():
@@ -96,8 +75,6 @@ def save_checkpoint(
    scheduler: LRScheduler | None = None,
    preprocessor: PolicyProcessorPipeline | None = None,
    postprocessor: PolicyProcessorPipeline | None = None,
-    num_processes: int | None = None,
-    batch_size: int | None = None,
 ) -> None:
    """This function creates the following directory structure:

@@ -123,10 +100,6 @@ def save_checkpoint(
        scheduler (LRScheduler | None, optional): The scheduler to save the state from. Defaults to None.
        preprocessor: The preprocessor/pipeline to save. Defaults to None.
        postprocessor: The postprocessor/pipeline to save. Defaults to None.
-        num_processes (int | None, optional): Distributed world size to record for sample-exact
-            resume. Defaults to None (not recorded).
-        batch_size (int | None, optional): Per-process batch size to record for sample-exact
-            resume. Defaults to None (not recorded).
    """
    pretrained_dir = checkpoint_dir / PRETRAINED_MODEL_DIR
    policy.save_pretrained(pretrained_dir)
@@ -139,9 +112,7 @@ def save_checkpoint(
        preprocessor.save_pretrained(pretrained_dir)
    if postprocessor is not None:
        postprocessor.save_pretrained(pretrained_dir)
-    save_training_state(
-        checkpoint_dir, step, optimizer, scheduler, num_processes=num_processes, batch_size=batch_size
-    )
+    save_training_state(checkpoint_dir, step, optimizer, scheduler)


 def save_training_state(
@@ -149,8 +120,6 @@ def save_training_state(
    train_step: int,
    optimizer: Optimizer | None = None,
    scheduler: LRScheduler | None = None,
-    num_processes: int | None = None,
-    batch_size: int | None = None,
 ) -> None:
    """
    Saves the training step, optimizer state, scheduler state, and rng state.
@@ -162,12 +131,10 @@ def save_training_state(
            Defaults to None.
        scheduler (LRScheduler | None, optional): The scheduler from which to save the state_dict.
            Defaults to None.
-        num_processes (int | None, optional): Distributed world size to record. Defaults to None.
-        batch_size (int | None, optional): Per-process batch size to record. Defaults to None.
    """
    save_dir = checkpoint_dir / TRAINING_STATE_DIR
    save_dir.mkdir(parents=True, exist_ok=True)
-    save_training_step(train_step, save_dir, num_processes=num_processes, batch_size=batch_size)
+    save_training_step(train_step, save_dir)
    save_rng_state(save_dir)
    if optimizer is not None:
        save_optimizer_state(optimizer, save_dir)
@@ -205,3 +205,149 @@ class WandBLogger:

        wandb_video = self._wandb.Video(video_path, fps=self.env_fps, format="mp4")
        self._wandb.log({f"{mode}/video": wandb_video}, step=step)
+
+    def log_training_examples(
+        self,
+        batch: dict,
+        step: int,
+        *,
+        camera_keys: list[str],
+        n_samples: int = 4,
+        policy=None,
+        predict_actions: bool = False,
+        mode: str = "train",
+    ) -> None:
+        """Push a ``wandb.Table`` of training-example rows for the current batch.
+
+        Each row is one batch element with:
+          * one ``wandb.Image`` column per camera in ``camera_keys`` (CHW or
+            HWC, uint8 or float in [0,1] — auto-detected),
+          * any text fields present in the batch (``task`` / ``subtask`` /
+            ``memory`` / ``instruction``),
+          * ground-truth action first/last frame (the action chunk's
+            endpoints — gives a quick sense of trajectory direction),
+          * if ``predict_actions=True`` and ``policy`` is supplied, the model's
+            ``predict_action_chunk`` first/last frame alongside.
+
+        Controlled by ``--wandb.log_examples_freq=N`` on the CLI (default
+        5000, 0 disables); the training loop calls it once every N steps.
+        Cheap to keep on: with ``n_samples=4`` and 3 cameras you upload 12
+        small PNGs per dump and (if enabled) run one extra forward pass.
+        """
+        import logging  # noqa: PLC0415
+        import numpy as np  # noqa: PLC0415
+        import torch  # noqa: PLC0415
+
+        if mode not in {"train", "eval"}:
+            raise ValueError(f"mode must be 'train' or 'eval', got {mode!r}")
+
+        # Batch size — first tensor-like value wins.
+        bsz = next(
+            (int(v.shape[0]) for v in batch.values() if hasattr(v, "shape") and v.ndim > 0),
+            None,
+        )
+        if not bsz:
+            return
+        n = min(int(n_samples), bsz)
+
+        # Optional predicted-action forward pass on the first n samples.
+        pred_actions: np.ndarray | None = None
+        if predict_actions and policy is not None:
+            was_training = policy.training
+            try:
+                policy.eval()
+                sub_batch = {}
+                for k, v in batch.items():
+                    if isinstance(v, torch.Tensor):
+                        sub_batch[k] = v[:n]
+                    elif isinstance(v, (list, tuple)):
+                        sub_batch[k] = list(v[:n])
+                    else:
+                        sub_batch[k] = v
+                with torch.no_grad():
+                    pred = policy.predict_action_chunk(sub_batch)
+                pred_actions = pred.detach().cpu().float().numpy()
+            except Exception as exc:  # noqa: BLE001
+                logging.warning(
+                    "log_training_examples: predict_action_chunk failed (%s) — "
+                    "skipping predicted-action columns",
+                    exc,
+                )
+                pred_actions = None
+            finally:
+                if was_training:
+                    policy.train()
+
+        present_cameras = [c for c in camera_keys if c in batch]
+        text_keys = [k for k in ("task", "subtask", "memory", "instruction") if k in batch]
+
+        columns = ["sample"]
+        columns.extend(c.removeprefix("observation.images.") or c for c in present_cameras)
+        columns.extend(text_keys)
+        columns.append("gt_action_first")
+        columns.append("gt_action_last")
+        if pred_actions is not None:
+            columns.append("pred_action_first")
+            columns.append("pred_action_last")
+
+        table = self._wandb.Table(columns=columns)
+
+        def _to_uint8_hwc(t: torch.Tensor) -> np.ndarray:
+            # Strip an outer time dim if present: (T, C, H, W) -> first frame.
+            if t.ndim == 4:
+                t = t[0]
+            # CHW -> HWC.
+            if t.ndim == 3 and t.shape[0] in (1, 3, 4) and t.shape[-1] not in (1, 3, 4):
+                t = t.permute(1, 2, 0)
+            arr = t.detach().cpu().float().numpy()
+            if arr.size and t.is_floating_point() and float(arr.max()) <= 1.5:
+                arr = arr * 255.0
+            return np.clip(arr, 0, 255).astype(np.uint8)
+
+        def _action_endpoints(a: torch.Tensor) -> tuple[str, str]:
+            arr = a.detach().cpu().float().numpy()
+            if arr.ndim == 2:  # (T, D)
+                return (
+                    str(np.round(arr[0], 3).tolist()),
+                    str(np.round(arr[-1], 3).tolist()),
+                )
+            if arr.ndim == 1:
+                rounded = np.round(arr, 3).tolist()
+                return (str(rounded), str(rounded))
+            return (str(arr.tolist()), str(arr.tolist()))
+
+        for i in range(n):
+            row: list = [i]
+            for cam in present_cameras:
+                try:
+                    row.append(self._wandb.Image(_to_uint8_hwc(batch[cam][i])))
+                except Exception as exc:  # noqa: BLE001
+                    logging.warning(
+                        "log_training_examples: camera %s sample %d failed (%s)",
+                        cam,
+                        i,
+                        exc,
+                    )
+                    row.append(None)
+            for tk in text_keys:
+                v = batch[tk]
+                if isinstance(v, (list, tuple)):
+                    row.append(str(v[i]) if i < len(v) else "")
+                else:
+                    row.append(str(v))
+            action = batch.get("action")
+            if isinstance(action, torch.Tensor) and action.ndim >= 1:
+                first, last = _action_endpoints(action[i])
+                row.append(first)
+                row.append(last)
+            else:
+                row.append("")
+                row.append("")
+            if pred_actions is not None:
+                p = torch.from_numpy(pred_actions[i])
+                pfirst, plast = _action_endpoints(p)
+                row.append(pfirst)
+                row.append(plast)
+            table.add_data(*row)
+
+        self._wandb.log({f"{mode}/examples": table}, step=step)
@@ -39,8 +39,6 @@ class DatasetConfig:
    # This reduces memory and speeds up DataLoader IPC. The training pipeline handles the conversion.
    return_uint8: bool = False
    streaming: bool = False
-    # Fraction of episodes held out per task for offline evaluation (0.0 = disabled).
-    eval_split: float = 0.0

    def __post_init__(self) -> None:
        if self.episodes is not None:
@@ -64,6 +62,72 @@ class WandBConfig:
    run_id: str | None = None
    mode: str | None = None  # Allowed values: 'online', 'offline' 'disabled'. Defaults to 'online'
    add_tags: bool = True  # If True, save configuration as tags in the WandB run.
+    # Periodic training-example dump (independent of ``log_freq``). When > 0,
+    # every ``log_examples_freq`` steps the trainer pushes a ``wandb.Table``
+    # with one row per sampled batch element containing each camera view
+    # (rendered as ``wandb.Image``), any text fields present in the batch
+    # (``task`` / ``subtask`` / ``memory`` / ``instruction``), and the
+    # ground-truth action chunk's first + last frames. Defaults to 5000 — set
+    # to 0 to disable. Only fires when ``enable=True``, so runs without wandb
+    # are unaffected.
+    log_examples_freq: int = 5000
+    # Number of batch elements to include in each example dump.
+    log_examples_n: int = 4
+    # If True (default), also run ``policy.predict_action_chunk`` on the logged
+    # samples (in eval mode, no_grad) and add predicted vs ground-truth action
+    # columns to the table. Costs one extra forward pass per dump — negligible
+    # at the 5k-step default cadence. Set to ``False`` if your policy doesn't
+    # implement ``predict_action_chunk`` or you want to skip the extra forward.
+    log_examples_predict_actions: bool = True
+
+
+@dataclass
+class EMAConfig:
+    """Exponential Moving Average of trainable policy parameters.
+
+    Diffusion / flow-matching policies (Diffusion Policy, π0/π0.5,
+    pi052) benefit substantially from averaging late-training
+    parameter oscillations — see Chi et al. 2023 §V.D. The official
+    JAX openpi trainer ships EMA with ``ema_decay=0.99`` (default) and
+    ``0.999`` for its pi05_libero config; the openpi PyTorch port
+    explicitly lists EMA as unsupported, and LeRobot main inherited
+    that gap. Enabling this flag plugs ema-pytorch
+    (https://github.com/lucidrains/ema-pytorch) into the LeRobot
+    training loop with a shadow ``nn.Module`` clone of the policy.
+
+    Cost: 1× model params in fp32 shadow (~13 GB for pi052's 3.3B
+    params) + one elementwise update per training step (~1% step time).
+
+    Off by default (opt-in): EMA is only beneficial for flow-matching /
+    diffusion policies (pi0/pi05/pi052), and the fp32 shadow copy is pure
+    overhead for other policies (e.g. VLA-JEPA). Set ``--ema.enable=true``
+    to turn it on (the pi05/pi052 training recipes do this). openpi (JAX)
+    ships EMA on for every config; enable it explicitly to match that.
+    """
+
+    enable: bool = False
+    # Target EMA decay β in θ_ema ← β·θ_ema + (1-β)·θ_live (passed to
+    # ema-pytorch as ``beta``).
+    #   0.999  — last ~1000 steps; pi05_libero default in openpi
+    #   0.99   — last ~100 steps; openpi top-level default
+    #   0.75   — very fast EMA, last ~4 steps (Diffusion Policy's 0.75 is its warmup power, not a decay)
+    #   0.9999 — very slow EMA (long classification runs)
+    decay: float = 0.99
+    # Skip the first N calls to ``ema.update()``; during this window
+    # the shadow is just a hard copy of the live weights (no averaging).
+    # Lets early-training rapid changes settle before averaging begins.
+    # Maps to ema-pytorch's ``update_after_step`` (NOT a smooth decay
+    # ramp like older lerobot EMA implementations).
+    warmup_steps: int = 0
+    # When True, the periodic eval block uses the EMA shadow model
+    # directly (``ema.ema_model``) instead of the live policy. Standard
+    # practice for diffusion-style policies — eval scores are usually
+    # 1–3% higher than the live policy at the same step.
+    use_for_eval: bool = True
+    # When True, the periodic wandb training-example dump uses the EMA
+    # shadow for the optional predicted-action columns (so what you see
+    # in W&B matches eval behavior).
+    use_for_wandb_examples: bool = True


@dataclass
@@ -147,7 +147,16 @@ class TrainingRecipe:
        return cls.from_dict(data)

    def _validate_message_recipe(self) -> None:
-        """Ensure every templated binding is known and at least one turn is a target."""
+        """Ensure every templated binding is known and the recipe supervises something.
+
+        A recipe is valid if it has at least one of:
+
+        * a ``target: true`` assistant turn (drives text-CE supervision), or
+        * a ``stream: low_level`` turn (drives flow / action supervision via
+          ``predict_actions=True``, even when no assistant turn is targeted —
+          e.g. π0.5-style ``low_level_execution`` where the action expert
+          conditions on a user-only ``${subtask}`` prompt).
+        """
        assert self.messages is not None
        known_bindings = set(DEFAULT_BINDINGS) | set(self.bindings or {}) | {"task"}

@@ -156,8 +165,14 @@ class TrainingRecipe:
            if missing:
                raise ValueError(f"MessageTurn references unknown binding(s): {sorted(missing)}")

-        if not any(turn.target for turn in self.messages):
-            raise ValueError("Message recipes must contain at least one target turn.")
+        has_target = any(turn.target for turn in self.messages)
+        has_low_level = any(turn.stream == "low_level" for turn in self.messages)
+        if not (has_target or has_low_level):
+            raise ValueError(
+                "Message recipes must contain at least one supervised turn — "
+                "either ``target: true`` (text CE) or ``stream: low_level`` "
+                "(flow/action loss)."
+            )

    def _validate_blend_recipe(self) -> None:
        """Ensure each blend component is a non-empty, weighted message recipe."""
@@ -0,0 +1,62 @@
+# subtask_mem — Hi-Robot blend + memory.
+#
+# Keeps the core subtask + action training and adds one
+# text-supervised task, the memory update:
+#
+#   high_level_subtask  — predict the subtask from the task.
+#   low_level_execution — flow loss with [images, subtask, state].
+#   memory_update       — compress progress into a memory note.
+#
+# Plan is intentionally left out — memory is the only persistent
+# high-level state here, keeping the prompt short.
+#
+# VQA and spoken interjection responses are not part of this blend;
+# see subtask_mem_vqa_speech.yaml for the blend that adds them.
+#
+# Requires the dataset to carry `memory` annotations (the annotation
+# pipeline's memory module) in addition to `subtask`. Sub-recipes
+# whose `if_present` bindings are missing simply don't render for
+# that sample, so a dataset without memory annotations still trains
+# the rest of the blend.
+
+blend:
+
+  high_level_subtask:
+    weight: 0.30
+    messages:
+      - {role: user, content: "${task}", stream: high_level}
+      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
+
+  low_level_execution:
+    weight: 0.55
+    messages:
+      # The action expert is conditioned on the SUBTASK — at inference
+      # `HighLevelSubtaskFwd` generates it via the LM head and feeds it
+      # here. `stream: low_level` flips `predict_actions=True` so the
+      # flow loss fires; no text-CE target (subtask prediction is owned
+      # by `high_level_subtask`).
+      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
+
+  memory_update:
+    # At inference, `MemoryUpdateFwd` is triggered only on
+    # `subtask_change` events (sparse). Training densely with
+    # `active_at` — i.e. on every frame inside a subtask interval,
+    # not just the boundary frame — supervises the same
+    # (prior_memory, completed_subtask) → current_memory mapping
+    # against varied observations within the interval. The model
+    # learns a stateless transformation; the *when* to emit lives in
+    # the inference trigger, not the model. Annotations only exist
+    # for ~1% of frames as boundary events, so `emitted_at` would
+    # waste 99% of the blend draws (and silently leak them into a
+    # task-conditioned fallback); `active_at` lifts the renderable
+    # rate to ~87% on this dataset.
+    weight: 0.15
+    bindings:
+      prior_memory: "nth_prev(style=memory, offset=1)"
+      current_memory: "active_at(t, style=memory)"
+      completed_subtask: "nth_prev(style=subtask, offset=1)"
+    messages:
+      - {role: user, content: "${task}", stream: high_level}
+      - {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
+      - {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
+      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
@@ -0,0 +1,99 @@
+# subtask_mem_vqa_robocasa — Hi-Robot blend tuned for RoboCasa cameras.
+#
+# Same supervision as ``subtask_mem.yaml`` (subtask + memory) plus
+# camera-grounded VQA across the three RoboCasa camera keys produced
+# by ``slurm_build_robocasa_composite_seen.py``:
+#
+#   observation.images.robot0_agentview_left   (left scene view)
+#   observation.images.robot0_agentview_right  (right scene view)
+#   observation.images.robot0_eye_in_hand      (wrist)
+#
+# The annotation pipeline (``examples/annotations/run_hf_job.py``) emits
+# VQA per camera, so each anchor frame produces three (user, assistant)
+# rows tagged with their source camera. Each VQA sub-recipe consumes
+# the rows for one camera via ``camera=...`` resolver bindings.
+#
+# Spatial VQA targets (bbox / point) are rewritten from JSON to
+# PaliGemma ``<locDDDD>`` tokens by ``_messages_vqa_to_loc`` —
+# ``register_paligemma_loc_tokens`` already collapses them to single
+# detection-vocab ids so the LM head learns the pretrained pointing /
+# detection prior, not a 7-piece BPE salad.
+#
+# Interjections / spoken responses are intentionally absent — the
+# annotation job runs with ``--interjections.enabled=false``.
+
+blend:
+
+  high_level_subtask:
+    weight: 0.25
+    messages:
+      - {role: user, content: "${task}", stream: high_level}
+      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
+
+  low_level_execution:
+    weight: 0.45
+    messages:
+      # Action expert is conditioned on the SUBTASK; at inference the
+      # high-level loop generates it via the LM head and feeds it here.
+      # ``stream: low_level`` flips ``predict_actions=True`` so the flow
+      # loss fires; subtask CE is owned by ``high_level_subtask``.
+      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
+
+  memory_update:
+    # Trained densely with ``active_at`` — every frame inside a subtask
+    # interval — so the (prior_memory, completed_subtask) → current_memory
+    # mapping is supervised against varied observations. The *when* to
+    # emit lives in the inference trigger (subtask_change), not the
+    # model. See ``subtask_mem.yaml`` for the long version of this note.
+    weight: 0.15
+    bindings:
+      prior_memory: "nth_prev(style=memory, offset=1)"
+      current_memory: "active_at(t, style=memory)"
+      completed_subtask: "nth_prev(style=subtask, offset=1)"
+    messages:
+      - {role: user, content: "${task}", stream: high_level}
+      - {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
+      - {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
+      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
+
+  ask_vqa_agentview_left:
+    weight: 0.05
+    bindings:
+      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_agentview_left)"
+      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_agentview_left)"
+    messages:
+      - role: user
+        stream: high_level
+        if_present: vqa_query
+        content:
+          - {type: image, feature: observation.images.robot0_agentview_left}
+          - {type: text, text: "${vqa_query}"}
+      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
+
+  ask_vqa_agentview_right:
+    weight: 0.05
+    bindings:
+      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_agentview_right)"
+      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_agentview_right)"
+    messages:
+      - role: user
+        stream: high_level
+        if_present: vqa_query
+        content:
+          - {type: image, feature: observation.images.robot0_agentview_right}
+          - {type: text, text: "${vqa_query}"}
+      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
+
+  ask_vqa_wrist:
+    weight: 0.05
+    bindings:
+      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.robot0_eye_in_hand)"
+      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.robot0_eye_in_hand)"
+    messages:
+      - role: user
+        stream: high_level
+        if_present: vqa_query
+        content:
+          - {type: image, feature: observation.images.robot0_eye_in_hand}
+          - {type: text, text: "${vqa_query}"}
+      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
@@ -0,0 +1,114 @@
+# subtask_mem_vqa_speech — Hi-Robot blend + memory + spoken responses.
+#
+# Superset of subtasks_vqa.yaml: keeps the core subtask + action + VQA
+# training and adds `memory_update` + `user_interjection_response`.
+#
+#   high_level_subtask         — predict the subtask from the task.
+#   low_level_execution        — flow loss with [images, subtask, state].
+#   memory_update              — compress progress into a memory note.
+#   user_interjection_response — reply to a user interjection with a
+#                                spoken `say` tool call (no plan, no
+#                                subtask text — just the spoken reply).
+#   ask_vqa_{top,wrist}        — camera-grounded VQA.
+#
+# Plan is intentionally left out — memory is the only persistent
+# high-level state here, keeping the prompt short.
+#
+# Requires the dataset to carry `memory`, `interjection` and `say`-tool
+# annotations (the annotation pipeline's memory + interjection modules)
+# in addition to `subtask` and `vqa`. Sub-recipes whose `if_present`
+# bindings are missing simply don't render for that sample, so a
+# dataset without interjections still trains the rest of the blend.
+#
+# Tool-call note: the `say` tool call on the interjection-response turn
+# is flattened to a `<say>...</say>` text marker by the tokenizer step
+# (`_flatten_say_tool_calls`) so the LM head learns to emit exactly the
+# marker the runtime parses back (`_split_plan_and_say`).
+
+blend:
+
+  high_level_subtask:
+    weight: 0.25
+    messages:
+      - {role: user, content: "${task}", stream: high_level}
+      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
+
+  low_level_execution:
+    weight: 0.40
+    messages:
+      # The action expert is conditioned on the SUBTASK — at inference
+      # `HighLevelSubtaskFwd` generates it via the LM head and feeds it
+      # here. `stream: low_level` flips `predict_actions=True` so the
+      # flow loss fires; no text-CE target (subtask prediction is owned
+      # by `high_level_subtask`).
+      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
+
+  memory_update:
+    # At inference, `MemoryUpdateFwd` is triggered only on
+    # `subtask_change` events (sparse). Training densely with
+    # `active_at` — i.e. on every frame inside a subtask interval,
+    # not just the boundary frame — supervises the same
+    # (prior_memory, completed_subtask) → current_memory mapping
+    # against varied observations within the interval. The model
+    # learns a stateless transformation; the *when* to emit lives in
+    # the inference trigger, not the model. Annotations only exist
+    # for ~1% of frames as boundary events, so `emitted_at` would
+    # waste 99% of the blend draws (and silently leak them into the
+    # task-conditioned fallback); `active_at` lifts the renderable
+    # rate to ~87% on Hi-Robot-style datasets.
+    weight: 0.10
+    bindings:
+      prior_memory: "nth_prev(style=memory, offset=1)"
+      current_memory: "active_at(t, style=memory)"
+      completed_subtask: "nth_prev(style=subtask, offset=1)"
+    messages:
+      - {role: user, content: "${task}", stream: high_level}
+      - {role: assistant, content: "Previous memory: ${prior_memory}", stream: high_level, if_present: prior_memory}
+      - {role: user, content: "Completed subtask: ${completed_subtask}", stream: high_level, if_present: completed_subtask}
+      - {role: assistant, content: "${current_memory}", stream: high_level, target: true, if_present: current_memory}
+
+  user_interjection_response:
+    weight: 0.10
+    bindings:
+      interjection: "emitted_at(t, style=interjection)"
+      speech: "emitted_at(t, role=assistant, tool_name=say)"
+    messages:
+      - {role: user, content: "${task}", stream: high_level}
+      - {role: user, content: "${interjection}", stream: high_level, if_present: interjection}
+      # Spoken reply only: the assistant turn carries no text content,
+      # just a `say` tool call (`tool_calls_from: speech`). The chat
+      # tokenizer flattens it to a `<say>...</say>` marker, so the
+      # supervised target trains the model to respond to an
+      # interjection with a spoken acknowledgement.
+      - {role: assistant, stream: high_level, target: true, if_present: speech, tool_calls_from: speech}
+
+  # VQA is view-dependent — each camera gets its own sub-recipe so the
+  # resolver disambiguates via `camera=...`. Camera keys match
+  # subtasks_vqa.yaml (`front` + `wrist`); adjust to your dataset.
+  ask_vqa_top:
+    weight: 0.075
+    bindings:
+      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
+      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
+    messages:
+      - role: user
+        stream: high_level
+        if_present: vqa_query
+        content:
+          - {type: image, feature: observation.images.front}
+          - {type: text, text: "${vqa_query}"}
+      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
+
+  ask_vqa_wrist:
+    weight: 0.075
+    bindings:
+      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
+      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
+    messages:
+      - role: user
+        stream: high_level
+        if_present: vqa_query
+        content:
+          - {type: image, feature: observation.images.wrist}
+          - {type: text, text: "${vqa_query}"}
+      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
@@ -0,0 +1,61 @@
+# subtasks_vqa — Hi-Robot blend for PI052 (PaliGemma backbone).
+#
+#   Trains two things only: subtasks and VQA. Plan and memory are
+#   intentionally left out — keeps the prompt short and the training
+#   surface small. The fuller blend with memory + spoken replies is
+#   ``subtask_mem_vqa_speech.yaml``.
+#
+#     high_level_subtask  — predict the subtask from the task.
+#     low_level_execution — flow loss with [images, subtask, state].
+#     ask_vqa_{top,wrist} — camera-grounded VQA.
+#
+# PI052's text tokenizer renders these messages as plain
+# ``Role: content`` text (PaliGemma is not chat-pretrained).
+
+blend:
+
+  high_level_subtask:
+    weight: 0.40
+    messages:
+      - {role: user, content: "${task}", stream: high_level}
+      - {role: assistant, content: "${subtask}", stream: high_level, target: true, if_present: subtask}
+
+  low_level_execution:
+    weight: 0.40
+    messages:
+      # The action expert is conditioned on the SUBTASK — at inference
+      # the high-level loop (``HighLevelSubtaskFwd``) generates the
+      # subtask via the LM head and feeds it here. The action expert's
+      # prefix is [images, subtask, state]. ``stream: low_level`` flips
+      # ``predict_actions=True`` so the flow loss fires; no text-CE
+      # target here (subtask prediction is owned by
+      # ``high_level_subtask``).
+      - {role: user, content: "${subtask}", stream: low_level, if_present: subtask}
+
+  ask_vqa_top:
+    weight: 0.10
+    bindings:
+      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.front)"
+      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.front)"
+    messages:
+      - role: user
+        stream: high_level
+        if_present: vqa_query
+        content:
+          - {type: image, feature: observation.images.front}
+          - {type: text, text: "${vqa_query}"}
+      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
+
+  ask_vqa_wrist:
+    weight: 0.10
+    bindings:
+      vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.wrist)"
+      vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.wrist)"
+    messages:
+      - role: user
+        stream: high_level
+        if_present: vqa_query
+        content:
+          - {type: image, feature: observation.images.wrist}
+          - {type: text, text: "${vqa_query}"}
+      - {role: assistant, content: "${vqa}", stream: high_level, target: true, if_present: vqa}
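Reviewer note: the recipe header says messages are rendered as plain ``Role: content`` text rather than through a chat template. A minimal sketch of what that could look like follows. The function name `render_plain`, the capitalized role, the newline separator, and the example strings are all assumptions for illustration, not the PI052 tokenizer's actual behavior.

```python
def render_plain(messages):
    """Join chat turns as ``Role: content`` lines (hypothetical format)."""
    return "\n".join(f"{m['role'].capitalize()}: {m['content']}" for m in messages)


# Illustrative turns standing in for a resolved ``high_level_subtask`` sample.
messages = [
    {"role": "user", "content": "put the cup on the shelf"},
    {"role": "assistant", "content": "grasp the cup"},
]
print(render_plain(messages))
```

Flat text like this avoids special chat tokens, which fits the stated reason that PaliGemma is not chat-pretrained.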
@@ -30,7 +30,7 @@ from lerobot.utils.hub import HubMixin
 from lerobot.utils.sample_weighting import SampleWeightingConfig

 from . import parser
-from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
+from .default import DatasetConfig, EMAConfig, EvalConfig, PeftConfig, WandBConfig
 from .policies import PreTrainedConfig
 from .rewards import RewardModelConfig

@@ -100,13 +100,8 @@ class TrainPipelineConfig(HubMixin):
    prefetch_factor: int = 4
    persistent_workers: bool = True
    steps: int = 100_000
-    # Run policy in the simulation environment every N steps to measure reward/success (0 = disabled).
-    env_eval_freq: int = 20_000
+    eval_freq: int = 20_000
    log_freq: int = 200
-    # Compute eval loss on held-out episodes every N steps (0 = disabled). Requires eval_split > 0.
-    eval_steps: int = 0
-    # Cap on total eval samples, split uniformly across tasks (0 = use all held-out data).
-    max_eval_samples: int = 0
    tolerance_s: float = 1e-4
    save_checkpoint: bool = True
    # Checkpoint is saved every `save_freq` training iterations and after the last training step.
@@ -116,9 +111,20 @@ class TrainPipelineConfig(HubMixin):
    scheduler: LRSchedulerConfig | None = None
    eval: EvalConfig = field(default_factory=EvalConfig)
    wandb: WandBConfig = field(default_factory=WandBConfig)
+    ema: EMAConfig = field(default_factory=EMAConfig)
    peft: PeftConfig | None = None

-    # Sample weighting configuration (e.g., for RA-BC training)
+    # VQA oversampling. When set (a fraction in (0, 1)), the training
+    # dataloader uses a WeightedEpisodeAwareSampler that draws frames
+    # carrying a `vqa` language annotation often enough that they make
+    # up roughly this fraction of the training stream. VQA annotations
+    # are typically sparse, so without this they are underrepresented.
+    # `None` (default) keeps uniform episode-aware sampling.
+    vqa_target_fraction: float | None = None
+
+    # Sample weighting configuration (e.g., for RA-BC training). Old
+    # inline ``use_rabc`` / ``rabc_*`` params are migrated to this
+    # field by ``_migrate_legacy_rabc_keys`` above.
    sample_weighting: SampleWeightingConfig | None = None

    # Rename map for the observation to override the image and state keys
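Reviewer note: the diff does not show how `vqa_target_fraction` is turned into per-frame weights. One way to do it, sketched below under that assumption, is to solve `w * n_vqa / (w * n_vqa + n_other) = f` for the weight `w` given to VQA frames while other frames keep weight 1. The helper names `vqa_frame_weights` and `expected_vqa_share` are hypothetical.

```python
import random


def vqa_frame_weights(has_vqa, target_fraction):
    """Per-frame weights so VQA frames make up ~target_fraction of draws."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must lie in (0, 1)")
    n_vqa = sum(has_vqa)
    n_other = len(has_vqa) - n_vqa
    if n_vqa == 0 or n_other == 0:
        # Nothing to rebalance: fall back to uniform weights.
        return [1.0] * len(has_vqa)
    w = target_fraction * n_other / ((1.0 - target_fraction) * n_vqa)
    return [w if flag else 1.0 for flag in has_vqa]


def expected_vqa_share(has_vqa, weights):
    """Expected fraction of draws that land on a VQA frame."""
    vqa_mass = sum(w for w, flag in zip(weights, has_vqa) if flag)
    return vqa_mass / sum(weights)


# 5 VQA frames out of 100, oversampled towards a 25% share.
has_vqa = [True] * 5 + [False] * 95
weights = vqa_frame_weights(has_vqa, 0.25)

# One "epoch": draw with replacement, keeping the original length.
rng = random.Random(0)
epoch = rng.choices(range(len(has_vqa)), weights=weights, k=len(has_vqa))
```

The expected share matches the target exactly; the realized share in any single epoch fluctuates around it because draws are random.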
@@ -35,7 +35,6 @@ from .dataset_tools import (
    remove_feature,
    split_dataset,
 )
-from .factory import make_dataset, make_train_eval_datasets, resolve_delta_timestamps
 from .image_writer import safe_stop_image_writer
 from .io_utils import load_episodes, write_stats
 from .language import (
@@ -50,11 +49,24 @@ from .lerobot_dataset import LeRobotDataset
 from .multi_dataset import MultiLeRobotDataset
 from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
 from .pyav_utils import check_video_encoder_parameters_pyav, detect_available_encoders_pyav
-from .sampler import EpisodeAwareSampler, compute_sampler_state
+from .sampler import EpisodeAwareSampler, WeightedEpisodeAwareSampler
 from .streaming_dataset import StreamingLeRobotDataset
 from .utils import DEFAULT_EPISODES_PATH, create_lerobot_dataset_card
 from .video_utils import VideoEncodingManager

+
+def make_dataset(*args, **kwargs):
+    from .factory import make_dataset as _make_dataset
+
+    return _make_dataset(*args, **kwargs)
+
+
+def resolve_delta_timestamps(*args, **kwargs):
+    from .factory import resolve_delta_timestamps as _resolve_delta_timestamps
+
+    return _resolve_delta_timestamps(*args, **kwargs)
+
+
 # NOTE: Low-level I/O functions (cast_stats_to_numpy, get_parquet_file_size_in_mb, etc.)
 # and legacy migration constants are intentionally NOT re-exported here.
 # Import directly: ``from lerobot.datasets.io_utils import ...``
@@ -65,6 +77,7 @@ __all__ = [
    "DEFAULT_QUANTILES",
    "EVENT_ONLY_STYLES",
    "EpisodeAwareSampler",
+    "WeightedEpisodeAwareSampler",
    "LANGUAGE_EVENTS",
    "LANGUAGE_PERSISTENT",
    "LeRobotDataset",
@@ -82,14 +95,12 @@ __all__ = [
    "aggregate_stats",
    "convert_image_to_video_dataset",
    "create_initial_features",
-    "compute_sampler_state",
    "create_lerobot_dataset_card",
    "column_for_style",
    "delete_episodes",
    "get_feature_stats",
    "load_episodes",
    "make_dataset",
-    "make_train_eval_datasets",
    "merge_datasets",
    "modify_features",
    "modify_tasks",
@@ -286,8 +286,6 @@ def aggregate_datasets(
    data_files_size_in_mb: int | None = None,
    video_files_size_in_mb: int | None = None,
    chunk_size: int | None = None,
-    concatenate_videos: bool = True,
-    concatenate_data: bool = True,
 ):
    """Aggregates multiple LeRobot datasets into a single unified dataset.

@@ -305,8 +303,6 @@ def aggregate_datasets(
        data_files_size_in_mb: Maximum size for data files in MB (defaults to DEFAULT_DATA_FILE_SIZE_IN_MB)
        video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
        chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
-        concatenate_videos: When False, keep one mp4 per source file instead of packing into shards.
-        concatenate_data: When False, keep one parquet per source file instead of packing into shards.
    """
    logging.info("Start aggregate_datasets")

@@ -355,12 +351,8 @@ def aggregate_datasets(
    dst_meta.episodes = {}

    for src_meta in tqdm.tqdm(all_metadata, desc="Copy data and videos"):
-        videos_idx = aggregate_videos(
-            src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size, concatenate_videos
-        )
-        data_idx = aggregate_data(
-            src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size, concatenate_data
-        )
+        videos_idx = aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size)
+        data_idx = aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size)

        meta_idx = aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx)

@@ -375,9 +367,7 @@ def aggregate_datasets(
    logging.info("Aggregation complete.")


-def aggregate_videos(
-    src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size, concatenate_videos=True
-):
+def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size):
    """Aggregates video chunks from a source dataset into the destination dataset.

    Handles video file concatenation and rotation based on file size limits.
@@ -389,7 +379,6 @@ def aggregate_videos(
        videos_idx: Dictionary tracking video chunk and file indices.
        video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
        chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
-        concatenate_videos: When False, keep one mp4 per source file instead of packing into shards.
    Returns:
        dict: Updated videos_idx with current chunk and file indices.
    """
@@ -450,7 +439,7 @@ def aggregate_videos(
            src_size = get_file_size_in_mb(src_path)
            dst_size = get_file_size_in_mb(dst_path)

-            if not concatenate_videos or dst_size + src_size >= video_files_size_in_mb:
+            if dst_size + src_size >= video_files_size_in_mb:
                # Rotate to a new file - offset is 0
                chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, chunk_size)
                dst_key = (chunk_idx, file_idx)
@@ -488,7 +477,7 @@ def aggregate_videos(
    return videos_idx


-def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size, concatenate_data=True):
+def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size):
    """Aggregates data chunks from a source dataset into the destination dataset.

    Reads source data files, updates indices to match the aggregated dataset,
@@ -504,7 +493,6 @@ def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_si
        data_idx: Dictionary tracking data chunk and file indices.
        data_files_size_in_mb: Maximum size for data files in MB.
        chunk_size: Maximum number of files per chunk.
-        concatenate_data: When False, keep one parquet per source file instead of packing into shards.

    Returns:
        dict: Updated data_idx with current chunk and file indices.
@@ -550,7 +538,6 @@ def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_si
            contains_images=contains_images,
            aggr_root=dst_meta.root,
            hf_features=hf_features,
-            concatenate=concatenate_data,
        )

        # Record the mapping from source to actual destination
@@ -627,7 +614,6 @@ def append_or_create_parquet_file(
    contains_images: bool = False,
    aggr_root: Path = None,
    hf_features: datasets.Features | None = None,
-    concatenate: bool = True,
 ) -> tuple[dict[str, int], tuple[int, int]]:
    """Appends data to an existing parquet file or creates a new one based on size constraints.

@@ -644,7 +630,6 @@ def append_or_create_parquet_file(
        contains_images: Whether the data contains images requiring special handling.
        aggr_root: Root path for the aggregated dataset.
        hf_features: Optional HuggingFace Features schema for proper image typing.
-        concatenate: When False, always rotate to a new file instead of appending to the current one.

    Returns:
        tuple: (updated_idx, (dst_chunk, dst_file)) where updated_idx is the index dict
@@ -664,7 +649,7 @@ def append_or_create_parquet_file(
    src_size = get_parquet_file_size_in_mb(src_path)
    dst_size = get_parquet_file_size_in_mb(dst_path)

-    if not concatenate or dst_size + src_size >= max_mb:
+    if dst_size + src_size >= max_mb:
        idx["chunk"], idx["file"] = update_chunk_file_indices(idx["chunk"], idx["file"], chunk_size)
        dst_chunk, dst_file = idx["chunk"], idx["file"]
        new_path = aggr_root / default_path.format(chunk_index=dst_chunk, file_index=dst_file)
@@ -59,8 +59,6 @@ class RunningQuantileStats:
            batch: An array where all dimensions except the last are batch dimensions.
        """
        batch = batch.reshape(-1, batch.shape[-1])
-        # Promote integer and low-precision inputs before computing squared statistics.
-        batch = batch.astype(np.result_type(batch.dtype, np.float32), copy=False)
        num_elements, vector_length = batch.shape

        if self._count == 0:
@@ -126,10 +126,53 @@ class DatasetReader:
    def _load_hf_dataset(self) -> datasets.Dataset:
        """hf_dataset contains all the observations, states, actions, rewards, etc."""
        features = get_hf_features_from_features(self._meta.features)
+        # Datasets annotated with the PR1 language columns may have been
+        # written without registering those columns in ``meta/info.json``
+        # (e.g. they predate ``CODEBASE_VERSION="v3.1"`` and were
+        # back-filled by ``lerobot-annotate``). Probe a single parquet
+        # shard and graft the column features on so the strict
+        # ``Dataset.from_parquet`` cast doesn't fail with
+        # ``column names don't match``.
+        features = self._extend_features_with_language_columns(features)
        hf_dataset = load_nested_dataset(self.root / "data", features=features, episodes=self.episodes)
        hf_dataset.set_transform(hf_transform_to_torch)
        return hf_dataset

+    def _extend_features_with_language_columns(
+        self, features: datasets.Features
+    ) -> datasets.Features:
+        """Add ``language_persistent`` / ``language_events`` to ``features``
+        when the underlying parquet shards declare them but the metadata
+        doesn't. No-op when neither column is present or both are
+        already registered.
+        """
+        # Find any one parquet to peek at; bail if there are none yet
+        # (the dataset will fail later for an unrelated reason and we
+        # want that error to surface as-is).
+        try:
+            sample = next((self.root / "data").glob("*/*.parquet"))
+        except StopIteration:
+            return features
+
+        from pyarrow import parquet as _pq  # noqa: PLC0415
+
+        schema_names = set(_pq.read_schema(sample).names)
+        from .language import (  # noqa: PLC0415
+            LANGUAGE_EVENTS,
+            LANGUAGE_PERSISTENT,
+            language_events_column_feature,
+            language_persistent_column_feature,
+        )
+
+        extra: dict[str, object] = {}
+        if LANGUAGE_PERSISTENT in schema_names and LANGUAGE_PERSISTENT not in features:
+            extra[LANGUAGE_PERSISTENT] = language_persistent_column_feature()
+        if LANGUAGE_EVENTS in schema_names and LANGUAGE_EVENTS not in features:
+            extra[LANGUAGE_EVENTS] = language_events_column_feature()
+        if not extra:
+            return features
+        return datasets.Features({**features, **extra})
+
    def _check_cached_episodes_sufficient(self) -> bool:
        """Check if the cached dataset contains all requested episodes and their video files."""
        if self.hf_dataset is None or len(self.hf_dataset) == 0:
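Reviewer note: the grafting rule in `_extend_features_with_language_columns` can be sketched with plain dicts and sets, independent of `datasets` and `pyarrow`. The function name `extend_features` and the placeholder feature values are hypothetical; only the rule (add a column when the shard declares it and the metadata does not) comes from the hunk above.

```python
def extend_features(features, schema_names, optional_columns):
    """Add optional columns that the shard declares but the metadata lacks."""
    extra = {
        name: feature
        for name, feature in optional_columns.items()
        if name in schema_names and name not in features
    }
    # Return the original mapping untouched when there is nothing to add.
    return {**features, **extra} if extra else features


# Placeholder values stand in for real feature objects.
features = {"action": "float32[7]"}
optional = {"language_persistent": "rows", "language_events": "rows"}

grafted = extend_features(features, {"action", "language_events"}, optional)
unchanged = extend_features(features, {"action"}, optional)
```

Returning the same object in the no-op case keeps the common path (metadata already complete, or no language columns) free of copies.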
@@ -261,8 +261,6 @@ def merge_datasets(
    datasets: list[LeRobotDataset],
    output_repo_id: str,
    output_dir: str | Path | None = None,
-    concatenate_videos: bool = True,
-    concatenate_data: bool = True,
 ) -> LeRobotDataset:
    """Merge multiple LeRobotDatasets into a single dataset.

@@ -272,8 +270,6 @@ def merge_datasets(
        datasets: List of LeRobotDatasets to merge.
        output_repo_id: Merged dataset identifier.
        output_dir: Root directory where the merged dataset will be stored. If not specified, defaults to $HF_LEROBOT_HOME/output_repo_id.
-        concatenate_videos: When False, keep one mp4 per source file instead of packing into shards.
-        concatenate_data: When False, keep one parquet per source file instead of packing into shards.
    """
    if not datasets:
        raise ValueError("No datasets to merge")
@@ -288,8 +284,6 @@ def merge_datasets(
        aggr_repo_id=output_repo_id,
        roots=roots,
        aggr_root=output_dir,
-        concatenate_videos=concatenate_videos,
-        concatenate_data=concatenate_data,
    )

    merged_dataset = LeRobotDataset(
@@ -14,7 +14,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import logging
-import math
 from pprint import pformat

 import torch
@@ -131,81 +130,3 @@ def make_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset | MultiLeRobotDatas
                dataset.meta.stats[key][stats_type] = torch.tensor(stats, dtype=torch.float32)

    return dataset
-
-
-def make_train_eval_datasets(
-    cfg: TrainPipelineConfig,
-) -> tuple[LeRobotDataset | MultiLeRobotDataset, LeRobotDataset | None]:
-    """Create train and optional eval datasets by splitting episodes based on eval_split.
-
-    The last ceil(n_episodes * eval_split) episodes per task are held out for evaluation.
-    If eval_split == 0.0, returns (full_dataset, None).
-    """
-    full_dataset = make_dataset(cfg)
-
-    if cfg.dataset.eval_split == 0.0:
-        return full_dataset, None
-
-    base_episodes = (
-        full_dataset.episodes if full_dataset.episodes is not None else list(range(full_dataset.num_episodes))
-    )
-
-    episode_tasks = full_dataset.meta.episodes["tasks"]
-    task_to_episodes: dict[str, list[int]] = {}
-    for ep_idx in base_episodes:
-        task_key = episode_tasks[ep_idx][0] if episode_tasks[ep_idx] else ""
-        task_to_episodes.setdefault(task_key, []).append(ep_idx)
-
-    train_episodes, eval_episodes = [], []
-    for eps in task_to_episodes.values():
-        n_eval = math.ceil(len(eps) * cfg.dataset.eval_split)
-        train_episodes.extend(eps[: len(eps) - n_eval])
-        eval_episodes.extend(eps[len(eps) - n_eval :])
-
-    if not train_episodes:
-        raise ValueError(
-            f"eval_split={cfg.dataset.eval_split} leaves 0 training episodes from {len(base_episodes)} total."
-        )
-
-    logging.info(
-        f"Train/eval split: {len(train_episodes)} train, {len(eval_episodes)} eval "
-        f"(eval_split={cfg.dataset.eval_split}, {len(task_to_episodes)} tasks)"
-    )
-
-    delta_timestamps = resolve_delta_timestamps(cfg.trainable_config, full_dataset.meta)
-
-    train_image_transforms = (
-        ImageTransforms(cfg.dataset.image_transforms) if cfg.dataset.image_transforms.enable else None
-    )
-
-    train_dataset = LeRobotDataset(
-        cfg.dataset.repo_id,
-        root=cfg.dataset.root,
-        episodes=train_episodes,
-        delta_timestamps=delta_timestamps,
-        image_transforms=train_image_transforms,
-        revision=cfg.dataset.revision,
-        video_backend=cfg.dataset.video_backend,
-        return_uint8=True,
-        tolerance_s=cfg.tolerance_s,
-    )
-
-    eval_dataset = LeRobotDataset(
-        cfg.dataset.repo_id,
-        root=cfg.dataset.root,
-        episodes=eval_episodes,
-        delta_timestamps=delta_timestamps,
-        image_transforms=None,
-        revision=cfg.dataset.revision,
-        video_backend=cfg.dataset.video_backend,
-        return_uint8=True,
-        tolerance_s=cfg.tolerance_s,
-    )
-
-    if cfg.dataset.use_imagenet_stats:
-        for ds in (train_dataset, eval_dataset):
-            for key in ds.meta.camera_keys:
-                for stats_type, stats in IMAGENET_STATS.items():
-                    ds.meta.stats[key][stats_type] = torch.tensor(stats, dtype=torch.float32)
-
-    return train_dataset, eval_dataset
@@ -170,6 +170,29 @@ def render_sample(
    """
    persistent_rows = _normalize_rows(persistent or [])
    event_rows = _normalize_rows(events or [])
+
+    # VQA-priority routing. A ``vqa`` annotation is sparse and
+    # view-dependent; the plain weighted blend would (a) waste a draw
+    # whenever it picks an ``ask_vqa*`` sub-recipe for a frame that has
+    # no VQA, and (b) silently drop a VQA-annotated frame whenever it
+    # picks a non-VQA sub-recipe. So: if the blend has ``ask_vqa*``
+    # sub-recipes and *this* frame carries one of their VQA bindings,
+    # render VQA here regardless of the weighted draw. The share of
+    # rendered VQA samples then equals the fraction of VQA-annotated
+    # frames (the maximum reachable without oversampling in the sampler).
+    if recipe.blend is not None:
+        vqa_rendered = _render_vqa_if_present(
+            recipe,
+            persistent=persistent_rows,
+            events=event_rows,
+            t=t,
+            sample_idx=sample_idx,
+            task=task,
+            dataset_ctx=dataset_ctx,
+        )
+        if vqa_rendered is not None:
+            return vqa_rendered
+
    selected_recipe = _select_recipe(recipe, sample_idx)
    bindings = _resolve_bindings(
        selected_recipe,
@@ -183,6 +206,59 @@ def render_sample(
    return _render_message_recipe(selected_recipe, bindings)


+def _render_vqa_if_present(
+    recipe: TrainingRecipe,
+    *,
+    persistent: Sequence[LanguageRow],
+    events: Sequence[LanguageRow],
+    t: float,
+    sample_idx: int,
+    task: str | None,
+    dataset_ctx: Any | None,
+) -> RenderedMessages | None:
+    """Render an ``ask_vqa*`` sub-recipe iff this frame carries a VQA
+    annotation; otherwise return ``None`` so the caller falls back to the
+    normal weighted blend.
+
+    When several VQA sub-recipes resolve (e.g. a frame annotated for more
+    than one camera), one is chosen deterministically by relative weight.
+    """
+    assert recipe.blend is not None
+    renderable: list[tuple[float, RenderedMessages]] = []
+    for name, component in recipe.blend.items():
+        if not name.startswith("ask_vqa"):
+            continue
+        bindings = _resolve_bindings(
+            component,
+            persistent=persistent,
+            events=events,
+            t=t,
+            sample_idx=sample_idx,
+            task=task,
+            dataset_ctx=dataset_ctx,
+        )
+        rendered = _render_message_recipe(component, bindings)
+        if rendered is not None:
+            renderable.append((float(component.weight or 0.0), rendered))
+
+    if not renderable:
+        return None
+    if len(renderable) == 1:
+        return renderable[0][1]
+
+    # Multiple cameras have a VQA for this frame — deterministic pick by
+    # relative weight (fall back to a uniform draw if all weights are 0).
+    total = sum(w for w, _ in renderable) or float(len(renderable))
+    digest = hashlib.blake2b(f"vqa:{sample_idx}".encode(), digest_size=8).digest()
+    draw = int.from_bytes(digest, "big") / 2**64 * total
+    cumulative = 0.0
+    for w, rendered in renderable:
+        cumulative += w if any(x > 0 for x, _ in renderable) else 1.0
+        if draw < cumulative:
+            return rendered
+    return renderable[-1][1]
+
+
 def _select_recipe(recipe: TrainingRecipe, sample_idx: int) -> TrainingRecipe:
    """Pick a deterministic blend component for ``sample_idx`` (or return ``recipe``)."""
    if recipe.blend is None:
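Reviewer note: the multi-camera tie-break in `_render_vqa_if_present` is a hash-seeded weighted pick, so the same `sample_idx` always selects the same candidate. A standalone sketch of that idea follows. The name `pick_by_weight` is hypothetical, and the uniform fallback here applies only when every weight is zero, which is one reading of the comment in the hunk.

```python
import hashlib


def pick_by_weight(sample_idx, weights):
    """Deterministically pick an index in proportion to ``weights``."""
    # Fall back to uniform weights only if no candidate has positive weight.
    if any(w > 0 for w in weights):
        effective = list(weights)
    else:
        effective = [1.0] * len(weights)
    total = sum(effective)
    # Hash the sample index into a reproducible draw in [0, total).
    digest = hashlib.blake2b(f"vqa:{sample_idx}".encode(), digest_size=8).digest()
    draw = int.from_bytes(digest, "big") / 2**64 * total
    cumulative = 0.0
    for i, w in enumerate(effective):
        cumulative += w
        if draw < cumulative:
            return i
    # Guard against float rounding at the upper edge.
    return len(effective) - 1
```

Because the draw depends only on `sample_idx`, the pick is stable across workers and epochs without any shared RNG state.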
@@ -346,7 +422,15 @@ def _render_message_recipe(
        if turn.target:
            target_indices.append(message_idx)

-    if not target_indices:
+    # A render is meaningful if it supervises *something*: either a
+    # text-CE target turn, or a ``low_level`` stream turn (flow / action
+    # supervision — e.g. the flow-only ``low_level_execution`` recipe,
+    # ``user(${subtask})`` with ``stream: low_level`` and no target).
+    # Without this, a flow-only recipe renders to ``None`` every time
+    # the blend draws it → ``predict_actions`` is never True → the
+    # action expert never receives a flow loss.
+    has_low_level = any(stream == "low_level" for stream in streams)
+    if not target_indices and not has_low_level:
        return None

    rendered = {
@@ -403,8 +487,10 @@ def _validate_rendered(rendered: RenderedMessages) -> None:

    if len(streams) != len(messages):
        raise ValueError("message_streams must be aligned with messages.")
-    if not target_indices:
-        raise ValueError("Rendered samples must contain at least one target message.")
+    # Valid iff it supervises something: a text-CE target turn OR a
+    # ``low_level`` stream turn (flow / action supervision).
+    if not target_indices and not any(s == "low_level" for s in streams):
+        raise ValueError("Rendered samples must contain a target message or a low_level-stream message.")
    for idx in target_indices:
        if idx < 0 or idx >= len(messages):
            raise ValueError(f"Target message index {idx} is out of bounds.")
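Reviewer note: both `_render_message_recipe` and `_validate_rendered` now apply the same rule, namely that a render is kept if it has a text-CE target turn or a ``low_level`` stream turn. The predicate can be stated on its own as below; the function name `supervises_something` is hypothetical and not part of the diff.

```python
def supervises_something(target_indices, streams):
    """True if a render has a text target or a flow-supervised turn."""
    return bool(target_indices) or any(s == "low_level" for s in streams)


# Flow-only ``low_level_execution``: no target, one low_level turn.
flow_only = supervises_something([], ["low_level"])
# Text recipe whose target turn was skipped: nothing left to supervise.
empty = supervises_something([], ["high_level"])
# Ordinary text recipe with the assistant turn as target.
text = supervises_something([1], ["high_level", "high_level"])
```

Sharing one helper between the renderer and the validator would keep the two checks from drifting apart.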
@@ -474,8 +474,6 @@ class LeRobotDataset(torch.utils.data.Dataset):
        if reader.hf_dataset is None:
            # One-shot load after finalize()
            reader.load_and_activate()
-        if reader._absolute_to_relative_idx is not None and idx in reader._absolute_to_relative_idx:
-            idx = reader._absolute_to_relative_idx[idx]
        return reader.get_item(idx)

    def select_columns(self, column_names: str | list[str]):
@@ -14,36 +14,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import logging
-import math
 from collections.abc import Iterator

-import numpy as np
 import torch

 logger = logging.getLogger(__name__)


 class EpisodeAwareSampler:
-    """Sampler over episode frames that stores only per-episode boundaries.
-
-    Logical positions map to frame indices on the fly (O(num_episodes) construction memory)
-    instead of materializing a Python list of every frame index.
-
-    Each epoch is shuffled with a `torch.randperm` seeded from `(seed, epoch)`, so the data order
-    is a pure function of `(seed, epoch)`: it reproduces on every rank without synchronizing the
-    global RNG (no `generator` to sync across distributed ranks), and `state_dict` /
-    `load_state_dict` resume a run sample-exactly by regenerating the epoch's permutation and
-    continuing from the saved offset. Each call to `__iter__` advances the epoch. During a
-    resumed epoch, `__len__` still reports the full length.
-
-    Epoch advancement: `__iter__` eagerly advances the epoch, and `set_epoch` / `load_state_dict`
-    set it explicitly. Within a single run callers should rely on exactly one of these mechanisms,
-    not both: advancing the epoch by hand *and* letting `__iter__` auto-advance over the same
-    iterations would skip or repeat epochs. The training loop drives it purely through `__iter__`
-    (via `cycle`); `set_epoch` / `load_state_dict` are used only to (re)position before iteration
-    starts (e.g. on resume or in tests).
-    """
-
    def __init__(
        self,
        dataset_from_indices: list[int],
@@ -52,125 +30,120 @@ class EpisodeAwareSampler:
        drop_n_first_frames: int = 0,
        drop_n_last_frames: int = 0,
        shuffle: bool = False,
-        seed: int = 0,
    ):
-        """
+        """Sampler that optionally incorporates episode boundary information.
+
        Args:
-            dataset_from_indices: Start index of each episode in the dataset.
-            dataset_to_indices: End index of each episode in the dataset.
-            episode_indices_to_use: Episode indices to use; None means all.
-            drop_n_first_frames: Frames to drop from the start of each episode.
-            drop_n_last_frames: Frames to drop from the end of each episode.
+            dataset_from_indices: List of indices containing the start of each episode in the dataset.
+            dataset_to_indices: List of indices containing the end of each episode in the dataset.
+            episode_indices_to_use: List of episode indices to use. If None, all episodes are used.
+                                    Assumes that episodes are indexed from 0 to N-1.
+            drop_n_first_frames: Number of frames to drop from the start of each episode.
+            drop_n_last_frames: Number of frames to drop from the end of each episode.
            shuffle: Whether to shuffle the indices.
-            seed: Seed the permutation is derived from (together with the epoch).
        """
        if drop_n_first_frames < 0:
            raise ValueError(f"drop_n_first_frames must be >= 0, got {drop_n_first_frames}")
        if drop_n_last_frames < 0:
            raise ValueError(f"drop_n_last_frames must be >= 0, got {drop_n_last_frames}")

-        from_indices = np.asarray(dataset_from_indices, dtype=np.int64)
-        to_indices = np.asarray(dataset_to_indices, dtype=np.int64)
-        if from_indices.shape != to_indices.shape:
-            raise ValueError(
-                f"dataset_from_indices and dataset_to_indices must have the same length, "
-                f"got {len(from_indices)} and {len(to_indices)}"
-            )
+        indices = []
+        for episode_idx, (start_index, end_index) in enumerate(
+            zip(dataset_from_indices, dataset_to_indices, strict=True)
+        ):
+            if episode_indices_to_use is None or episode_idx in episode_indices_to_use:
+                ep_length = end_index - start_index
+                if drop_n_first_frames + drop_n_last_frames >= ep_length:
+                    logger.warning(
+                        "Episode %d has %d frames but drop_n_first_frames=%d and "
+                        "drop_n_last_frames=%d removes all frames. Skipping.",
+                        episode_idx,
+                        ep_length,
+                        drop_n_first_frames,
+                        drop_n_last_frames,
+                    )
+                    continue
+                indices.extend(range(start_index + drop_n_first_frames, end_index - drop_n_last_frames))

-        used = np.ones(len(from_indices), dtype=bool)
-        if episode_indices_to_use is not None:
-            used = np.zeros(len(from_indices), dtype=bool)
-            used[np.asarray(episode_indices_to_use, dtype=np.int64)] = True
-
-        starts = from_indices + drop_n_first_frames
-        lengths = to_indices - drop_n_last_frames - starts
-        for episode_idx in np.flatnonzero(used & (lengths <= 0)):
-            logger.warning(
-                "Episode %d has %d frames but drop_n_first_frames=%d and "
-                "drop_n_last_frames=%d removes all frames. Skipping.",
-                episode_idx,
-                to_indices[episode_idx] - from_indices[episode_idx],
-                drop_n_first_frames,
-                drop_n_last_frames,
-            )
-        used &= lengths > 0
-        if not used.any():
+        if not indices:
            raise ValueError(
                "No valid frames remain after applying drop_n_first_frames and drop_n_last_frames. "
                "All episodes were either filtered out or had too few frames."
            )

-        self._starts = starts[used]
-        self._cum_lengths = np.cumsum(lengths[used])
-        self._num_frames = int(self._cum_lengths[-1])
+        self.indices = indices
        self.shuffle = shuffle
-        self.seed = seed
-        self._epoch = 0
-        self._start_index = 0
-
-    @property
-    def indices(self) -> list[int]:
-        """Materialized frame indices in unshuffled order; O(num_frames), introspection only."""
-        return [self._frame_index(k) for k in range(self._num_frames)]
-
-    def set_epoch(self, epoch: int) -> None:
-        self._epoch = epoch
-
-    def state_dict(self) -> dict:
-        return {"epoch": self._epoch, "start_index": self._start_index}
-
-    def load_state_dict(self, state: dict) -> None:
-        self._epoch = state["epoch"]
-        self._start_index = state["start_index"]
-
-    def _epoch_generator(self, epoch: int) -> torch.Generator:
-        # Derive a per-epoch seed from (seed, epoch) so the permutation is a pure function of both
-        # and reproduces identically on every rank without touching the global RNG.
-        epoch_seed = int(np.random.SeedSequence([self.seed, epoch]).generate_state(1, dtype=np.uint64)[0])
-        return torch.Generator().manual_seed(epoch_seed)
-
-    def _frame_index(self, position: int) -> int:
-        episode = int(np.searchsorted(self._cum_lengths, position, side="right"))
-        position_in_episode = position - (int(self._cum_lengths[episode - 1]) if episode > 0 else 0)
-        return int(self._starts[episode]) + position_in_episode

    def __iter__(self) -> Iterator[int]:
-        # Advance epoch state eagerly, not on first consumption of the generator.
-        epoch, start = self._epoch, self._start_index
-        self._epoch += 1
-        self._start_index = 0
-        return self._iter_epoch(epoch, start)
-
-    def _iter_epoch(self, epoch: int, start: int) -> Iterator[int]:
        if self.shuffle:
-            order = torch.randperm(self._num_frames, generator=self._epoch_generator(epoch))
-            for k in range(start, self._num_frames):
-                yield self._frame_index(int(order[k]))
+            for i in torch.randperm(len(self.indices)):
+                yield self.indices[i]
        else:
-            for k in range(start, self._num_frames):
-                yield self._frame_index(k)
+            for i in self.indices:
+                yield i

    def __len__(self) -> int:
-        return self._num_frames
+        return len(self.indices)


-def compute_sampler_state(step: int, num_frames: int, batch_size: int, num_processes: int) -> dict:
-    """Map an optimization step to an `EpisodeAwareSampler` state for sample-exact resume.
+class WeightedEpisodeAwareSampler(EpisodeAwareSampler):
+    """``EpisodeAwareSampler`` that draws frames *with replacement* in
+    proportion to per-frame weights.

-    Under accelerate's batch sharding, one step consumes `batch_size * num_processes` sampler
-    positions and each rank sees `ceil(ceil(num_frames / batch_size) / num_processes)` batches
-    per epoch (`even_batches` padding included). The start index provably stays below
-    `num_frames`; the `min` is defensive.
-
-    Assumptions (resume is only sample-exact when they hold):
-        - `num_processes` and `batch_size` match the run that wrote the checkpoint. Both scale how
-          many positions a step consumes, so the epoch/offset are wrong if either changed. The
-          caller passes the checkpoint's `num_processes` and `batch_size` and warns on a mismatch.
-        - accelerate uses `even_batches=True` (its default). The `ceil(... / num_processes)` term
-          mirrors that padding; with `even_batches=False` the per-epoch batch count differs and
-          the boundary is off.
+    Used to oversample frames carrying a sparse annotation (e.g. a VQA
+    question) so the policy sees them more often than their natural
+    dataset density. One epoch still yields ``len(self.indices)``
+    samples — the weights only change the *composition* of the stream,
+    not its length. Each epoch re-draws from the global torch RNG, so
+    the oversampled subset varies from epoch to epoch.
    """
-    batches_per_epoch = math.ceil(math.ceil(num_frames / batch_size) / num_processes)
-    epoch, batches_into_epoch = divmod(step, batches_per_epoch)
-    start_index = min(batches_into_epoch * batch_size * num_processes, num_frames)
-    return {"epoch": epoch, "start_index": start_index}
+
+    def __init__(
+        self,
+        dataset_from_indices: list[int],
+        dataset_to_indices: list[int],
+        frame_weights,
+        *,
+        episode_indices_to_use: list | None = None,
+        drop_n_first_frames: int = 0,
+        drop_n_last_frames: int = 0,
+    ):
+        """
+        Args:
+            dataset_from_indices: Episode start indices (see ``EpisodeAwareSampler``).
+            dataset_to_indices: Episode end indices.
+            frame_weights: 1-D sequence/tensor of non-negative weights, one per
+                dataset frame (length == total dataset frames). Higher weight ⇒
+                that frame is sampled more often.
+            episode_indices_to_use / drop_n_first_frames / drop_n_last_frames:
+                Same meaning as ``EpisodeAwareSampler`` — the episode-boundary
+                frame filtering is applied first, then weighting is restricted
+                to the surviving frames.
+        """
+        super().__init__(
+            dataset_from_indices,
+            dataset_to_indices,
+            episode_indices_to_use=episode_indices_to_use,
+            drop_n_first_frames=drop_n_first_frames,
+            drop_n_last_frames=drop_n_last_frames,
+            shuffle=False,
+        )
+        weights = torch.as_tensor(frame_weights, dtype=torch.double).flatten()
+        idx = torch.tensor(self.indices, dtype=torch.long)
+        if weights.numel() <= int(idx.max()):
+            raise ValueError(
+                f"frame_weights has {weights.numel()} entries but the sampler "
+                f"references frame index {int(idx.max())}."
+            )
+        selected = weights[idx]
+        if not torch.isfinite(selected).all() or bool((selected < 0).any()):
+            raise ValueError("frame_weights must be finite and non-negative.")
+        if float(selected.sum()) <= 0.0:
+            # All surviving frames have zero weight — fall back to uniform.
+            selected = torch.ones_like(selected)
+        self._weights = selected
+
+    def __iter__(self) -> Iterator[int]:
+        picks = torch.multinomial(self._weights, num_samples=len(self.indices), replacement=True)
+        for i in picks.tolist():
+            yield self.indices[i]
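
Review note: the weighted draw above can be illustrated without torch. The sketch below is a minimal stand-in, not the sampler itself: it swaps `torch.multinomial` for the standard library's `random.choices`, and `weighted_epoch`, `indices` and `weights` are hypothetical names used only for illustration. It keeps the two behaviours the docstring promises: one epoch yields `len(indices)` samples, and an all-zero weight vector falls back to uniform sampling.

```python
import math
import random


def weighted_epoch(indices, frame_weights, seed=None):
    """Draw len(indices) frame ids with replacement, proportional to weight."""
    # Restrict the per-frame weights to the frames that survived filtering.
    selected = [float(frame_weights[i]) for i in indices]
    if any((not math.isfinite(w)) or w < 0 for w in selected):
        raise ValueError("frame_weights must be finite and non-negative.")
    if sum(selected) <= 0.0:
        # Same fallback as the diff: all-zero weights degrade to uniform.
        selected = [1.0] * len(indices)
    rng = random.Random(seed)
    # random.choices samples with replacement (stand-in for torch.multinomial).
    return rng.choices(indices, weights=selected, k=len(indices))


indices = [2, 3, 4, 7, 8]                  # frames kept after boundary filtering
weights = [0, 0, 1, 1, 5, 0, 0, 0, 1]      # one weight per dataset frame
epoch = weighted_epoch(indices, weights, seed=0)
uniform = weighted_epoch(indices, [0] * 9, seed=0)
```

Frame 7 carries zero weight here, so it can never appear in `epoch`, while frame 4 is drawn most often on average.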
@@ -366,17 +366,24 @@ def get_safe_version(repo_id: str, version: str | packaging.version.Version) ->
    hub_versions = get_repo_versions(repo_id)

    if not hub_versions:
-        raise RevisionNotFoundError(
-            f"""Your dataset must be tagged with a codebase version.
-            Assuming _version_ is the codebase_version value in the info.json, you can run this:
-            ```python
-            from huggingface_hub import HfApi
-
-            hub_api = HfApi()
-            hub_api.create_tag("{repo_id}", tag="_version_", repo_type="dataset")
-            ```
-            """
+        msg = (
+            f"Repo {repo_id!r} has no codebase-version tags. The dataset "
+            f"either doesn't exist on the Hub yet, or it was uploaded "
+            f"without a ``v3.x``-style tag. To tag an existing dataset run:\n"
+            f"  from huggingface_hub import HfApi\n"
+            f"  HfApi().create_tag({repo_id!r}, tag='v3.0', repo_type='dataset', exist_ok=True)"
        )
+        # ``RevisionNotFoundError`` extends ``HfHubHTTPError`` whose
+        # ``__init__`` indexes ``response.headers`` unconditionally on
+        # current ``huggingface_hub`` versions. Constructing it without
+        # a real ``Response`` object crashes with either
+        # ``TypeError: missing 1 required keyword-only argument`` (old
+        # builds) or ``AttributeError: 'NoneType' object has no attribute
+        # 'headers'`` (new builds). Skip that path entirely — this isn't
+        # really an HTTP error, it's a configuration issue — and raise a
+        # plain ``RuntimeError`` so the message actually reaches the
+        # caller.
+        raise RuntimeError(msg)

    if target_version in hub_versions:
        return f"v{target_version}"
@@ -481,10 +481,8 @@ def reencode_video(
    encoder_threads: int | None = None,
    log_level: int | None = av.logging.WARNING,
    overwrite: bool = False,
-    start_time_s: float | None = None,
-    end_time_s: float | None = None,
 ) -> None:
-    """Re-encode a video file, optionally trimming it to ``[start_time_s, end_time_s)``.
+    """Re-encode a video file using the given encoder configuration.

    Args:
        input_video_path: Existing video file to read.
@@ -493,17 +491,10 @@ def reencode_video(
        encoder_threads: Optional thread count forwarded to :meth:`VideoEncoderConfig.get_codec_options`.
        log_level: libav log level while encoding, or ``None`` to leave logging unchanged. Defaults to WARNING.
        overwrite: When ``False`` and ``output_video_path`` already exists, skip and log a warning.
-        start_time_s: When set, trim the output to start at this timestamp (seconds).
-        end_time_s: When set, trim the output to end at this timestamp (seconds, exclusive).
    """

    camera_encoder = camera_encoder or camera_encoder_defaults()

-    if (start_time_s is not None and start_time_s < 0) or (end_time_s is not None and end_time_s < 0):
-        raise ValueError(f"Trim times must be non-negative, got start={start_time_s}, end={end_time_s}.")
-    if start_time_s is not None and end_time_s is not None and end_time_s <= start_time_s:
-        raise ValueError(f"end_time_s ({end_time_s}) must be greater than start_time_s ({start_time_s}).")
-
    output_video_path = Path(output_video_path)

    if output_video_path.exists() and not overwrite:
@@ -535,10 +526,6 @@ def reencode_video(
            width = int(in_stream.width)
            height = int(in_stream.height)

-            # Seek to the keyframe at or before start_time_s to avoid reading from the start.
-            if start_time_s is not None:
-                src.seek(int(start_time_s * av.time_base), backward=True)
-
            with av.open(
                tmp_output_video_path,
                mode="w",
@@ -552,14 +539,7 @@ def reencode_video(
                out_stream.height = height

                for frame in src.decode(in_stream):
-                    frame_time_s = frame.time
-                    if start_time_s is not None and frame_time_s < start_time_s:
-                        continue
-                    if end_time_s is not None and frame_time_s >= end_time_s:
-                        break
                    frame = frame.reformat(width=width, height=height, format=pix_fmt)
-                    if start_time_s is not None:
-                        frame.pts = None  # reset timestamps so the trimmed output starts at t=0
                    packet = out_stream.encode(frame)
                    if packet:
                        dst.mux(packet)
@@ -33,8 +33,8 @@ logger = logging.getLogger(__name__)

 # Dimensions for the flat action/state vectors used by the LeRobot wrapper.
 # These correspond to the PandaOmron robot in RoboCasa365.
-OBS_STATE_DIM = 16  # base_pos(3) + base_quat(4) + ee_pos_rel(3) + ee_quat_rel(4) + gripper_qpos(2)
-ACTION_DIM = 12  # base_motion(4) + control_mode(1) + ee_pos(3) + ee_rot(3) + gripper(1)
+OBS_STATE_DIM = 16  # ee_pos_rel(3) + ee_quat_rel(4) + base_pos(3) + base_quat(4) + gripper_qpos(2)
+ACTION_DIM = 12  # ee_pos(3) + ee_rot(3) + gripper(1) + base_motion(4) + control_mode(1)
 ACTION_LOW = -1.0
 ACTION_HIGH = 1.0

@@ -101,14 +101,15 @@ def _resolve_tasks(task: str) -> tuple[list[str], str | None]:
 def convert_action(flat_action: np.ndarray) -> dict[str, Any]:
    """Split a flat (12,) action vector into a RoboCasa action dict.

-    Layout: base_motion(4) + control_mode(1) + ee_pos(3) + ee_rot(3) + gripper(1)
+    Layout (openpi / robocasa.utils.env_utils.convert_action order):
+        ee_pos(3) + ee_rot(3) + gripper(1) + base_motion(4) + control_mode(1)
    """
    return {
-        "action.base_motion": flat_action[0:4],
-        "action.control_mode": flat_action[4:5],
-        "action.end_effector_position": flat_action[5:8],
-        "action.end_effector_rotation": flat_action[8:11],
-        "action.gripper_close": flat_action[11:12],
+        "action.end_effector_position": flat_action[0:3],
+        "action.end_effector_rotation": flat_action[3:6],
+        "action.gripper_close": flat_action[6:7],
+        "action.base_motion": flat_action[7:11],
+        "action.control_mode": flat_action[11:12],
    }
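
Review note: because this hunk reorders slices by hand, a table-driven check can catch gaps or overlaps in the layout. The sketch below is an illustration under assumptions: it uses plain Python lists instead of `np.ndarray`, and `ACTION_LAYOUT`, `split_action` and `layout_is_contiguous` are hypothetical helpers that are not part of the wrapper.

```python
ACTION_DIM = 12

# (key, start, stop) in the ee-first order introduced by this hunk.
ACTION_LAYOUT = [
    ("action.end_effector_position", 0, 3),
    ("action.end_effector_rotation", 3, 6),
    ("action.gripper_close", 6, 7),
    ("action.base_motion", 7, 11),
    ("action.control_mode", 11, 12),
]


def split_action(flat_action):
    """Split a flat 12-element action into named slices."""
    if len(flat_action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(flat_action)}")
    return {key: list(flat_action[a:b]) for key, a, b in ACTION_LAYOUT}


def layout_is_contiguous(layout, dim):
    """True when the slices tile [0, dim) with no gap or overlap."""
    cursor = 0
    for _, start, stop in layout:
        if start != cursor or stop <= start:
            return False
        cursor = stop
    return cursor == dim


parts = split_action(list(range(ACTION_DIM)))
```

With the input `0..11`, each value equals its own index, so `parts` shows directly which flat positions feed each key.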


@@ -230,12 +231,14 @@ class RoboCasaEnv(gym.Env):
            return {"pixels": images}

        # `state.*` keys come from PandaOmronKeyConverter inside the wrapper.
+        # openpi state order: ee first, then base, then gripper (matches the
+        # openpi robocasa pipeline / examples/robocasa/main.py state layout).
        agent_pos = np.concatenate(
            [
-                raw_obs.get("state.base_position", np.zeros(3)),
-                raw_obs.get("state.base_rotation", np.zeros(4)),
                raw_obs.get("state.end_effector_position_relative", np.zeros(3)),
                raw_obs.get("state.end_effector_rotation_relative", np.zeros(4)),
+                raw_obs.get("state.base_position", np.zeros(3)),
+                raw_obs.get("state.base_rotation", np.zeros(4)),
                raw_obs.get("state.gripper_qpos", np.zeros(2)),
            ],
            axis=-1,
@@ -104,6 +104,8 @@ class AdamWConfig(OptimizerConfig):
    eps: float = 1e-8
    weight_decay: float = 1e-2
    grad_clip_norm: float = 10.0
+    foreach: bool | None = None
+    fused: bool | None = None

    def build(self, params: OptimizerParams) -> torch.optim.Optimizer:
        kwargs = asdict(self)
@@ -25,6 +25,7 @@ from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as M
 from .pi0.configuration_pi0 import PI0Config as PI0Config
 from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
 from .pi05.configuration_pi05 import PI05Config as PI05Config
+from .pi052.configuration_pi052 import PI052Config as PI052Config
 from .pretrained import PreTrainedPolicy as PreTrainedPolicy
 from .smolvla.configuration_smolvla import SmolVLAConfig as SmolVLAConfig
 from .tdmpc.configuration_tdmpc import TDMPCConfig as TDMPCConfig
@@ -49,6 +50,7 @@ __all__ = [
    "PI0Config",
    "PI0FastConfig",
    "PI05Config",
+    "PI052Config",
    "SmolVLAConfig",
    "TDMPCConfig",
    "VQBeTConfig",
@@ -63,6 +63,79 @@ from .wall_x.configuration_wall_x import WallXConfig
 from .xvla.configuration_xvla import XVLAConfig


+def _restore_pi052_pretrained_state(
+    preprocessor: PolicyProcessorPipeline,
+    postprocessor: PolicyProcessorPipeline,
+    pretrained_path: str,
+) -> None:
+    """Transplant saved stateful blobs from a pi052 checkpoint into fresh pipelines.
+
+    pi052's preprocessor includes steps whose constructor args don't
+    JSON-roundtrip (``RenderMessagesStep.recipe`` is a Python object,
+    ``ActionTokenizerProcessorStep.action_tokenizer_name`` is a
+    fitted-tokenizer path that may not exist at eval time). We rebuild
+    those pipelines fresh from ``config.recipe_path`` and then walk
+    over the saved ``policy_{pre,post}processor.json`` files to find
+    each step's ``state_file`` reference and load the bytes back into
+    the corresponding fresh step. Today that's only the
+    NormalizerProcessorStep / UnnormalizerProcessorStep (the action /
+    state quantile stats), but the loop is generic so any future
+    stateful step picks up its blob automatically.
+
+    Pairing is by ``registry_name`` AND position so a benign reorder
+    on the saved side surfaces a warning rather than silently feeding
+    the wrong tensors into the wrong step.
+    """
+    import json  # noqa: PLC0415
+    import logging  # noqa: PLC0415
+    from pathlib import Path  # noqa: PLC0415
+
+    from safetensors.torch import load_file  # noqa: PLC0415
+
+    base = Path(pretrained_path)
+    if not base.exists():
+        return
+
+    log = logging.getLogger(__name__)
+
+    for pipeline, config_filename in [
+        (preprocessor, f"{POLICY_PREPROCESSOR_DEFAULT_NAME}.json"),
+        (postprocessor, f"{POLICY_POSTPROCESSOR_DEFAULT_NAME}.json"),
+    ]:
+        config_path = base / config_filename
+        if not config_path.exists():
+            continue
+        saved = json.loads(config_path.read_text())
+
+        for idx, (saved_step, fresh_step) in enumerate(
+            zip(saved.get("steps", []), pipeline.steps, strict=False)
+        ):
+            state_file = saved_step.get("state_file")
+            if not state_file:
+                continue
+            saved_name = saved_step.get("registry_name")
+            fresh_name = getattr(type(fresh_step), "_registry_name", None)
+            if saved_name and fresh_name and saved_name != fresh_name:
+                log.warning(
+                    "PI052 state restore: %s step %d registry name mismatch "
+                    "(saved=%s, fresh=%s); skipping %s",
+                    config_filename, idx, saved_name, fresh_name, state_file,
+                )
+                continue
+            state_path = base / state_file
+            if not state_path.exists():
+                log.warning(
+                    "PI052 state restore: %s missing at %s; %s left at fresh init",
+                    state_file, base, fresh_name,
+                )
+                continue
+            fresh_step.load_state_dict(load_file(str(state_path)))
+            log.info(
+                "PI052 state restore: loaded %s into %s (step %d)",
+                state_file, fresh_name, idx,
+            )
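
Review note: the pairing rule in `_restore_pi052_pretrained_state` (match by position, then veto on a registry-name mismatch) can be exercised in isolation. The sketch below reproduces only that decision logic on plain dicts and strings; `pair_stateful_steps` and the step names and file names in the example are hypothetical, and no safetensors file is read.

```python
def pair_stateful_steps(saved_steps, fresh_names):
    """Decide which saved state files would be loaded into which fresh step.

    Returns (loads, skipped), each a list of (step_index, state_file).
    """
    loads, skipped = [], []
    # zip stops at the shorter sequence, like zip(..., strict=False) in the diff.
    for idx, (saved, fresh_name) in enumerate(zip(saved_steps, fresh_names)):
        state_file = saved.get("state_file")
        if not state_file:
            continue  # stateless step: nothing to restore
        saved_name = saved.get("registry_name")
        if saved_name and fresh_name and saved_name != fresh_name:
            skipped.append((idx, state_file))  # the diff logs a warning here
            continue
        loads.append((idx, state_file))
    return loads, skipped


# Hypothetical saved config and fresh pipeline, for illustration only.
saved = [
    {"registry_name": "render"},
    {"registry_name": "normalizer", "state_file": "step_1.safetensors"},
    {"registry_name": "tokenizer", "state_file": "step_2.safetensors"},
]
fresh = ["render", "normalizer", "device"]
loads, skipped = pair_stateful_steps(saved, fresh)
```

Note that when either name is missing the mismatch check is bypassed and the blob is loaded by position alone, which mirrors the condition in the diff.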
+
+
 def _reconnect_relative_absolute_steps(
    preprocessor: PolicyProcessorPipeline, postprocessor: PolicyProcessorPipeline
 ) -> None:
@@ -130,6 +203,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
        from .pi05.modeling_pi05 import PI05Policy

        return PI05Policy
+    elif name == "pi052":
+        from .pi052.modeling_pi052 import PI052Policy
+
+        return PI052Policy
    elif name == "gaussian_actor":
        from .gaussian_actor.modeling_gaussian_actor import GaussianActorPolicy

@@ -178,8 +255,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:

    Args:
        policy_type: The type of the policy. Supported types include "tdmpc",
-                     "multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "gaussian_actor",
-                     "smolvla", "wall_x", "molmoact2".
+                     "multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05",
+                     "pi052", "gaussian_actor", "smolvla", "wall_x", "molmoact2".
        **kwargs: Keyword arguments to be passed to the configuration class constructor.

    Returns:
@@ -202,6 +279,10 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
        return PI0Config(**kwargs)
    elif policy_type == "pi05":
        return PI05Config(**kwargs)
+    elif policy_type == "pi052":
+        from .pi052.configuration_pi052 import PI052Config
+
+        return PI052Config(**kwargs)
    elif policy_type == "gaussian_actor":
        return GaussianActorConfig(**kwargs)
    elif policy_type == "smolvla":
@@ -246,6 +327,12 @@ class ProcessorConfigKwargs(TypedDict, total=False):
    preprocessor_overrides: dict[str, Any] | None
    postprocessor_overrides: dict[str, Any] | None
    dataset_stats: dict[str, dict[str, torch.Tensor]] | None
+    # Optional: HF Hub repo id of the dataset the policy is being
+    # trained on. Used by policies that auto-fit pieces of their
+    # preprocessing (e.g. pi052's FAST action tokenizer per
+    # Pertsch et al. 2025 [64], π0.5 §III.C). When omitted, those
+    # policies fall back to their universal pre-fitted tokenizers.
+    dataset_repo_id: str | None
    dataset_meta: Any | None


@@ -279,6 +366,29 @@ def make_pre_post_processors(
        NotImplementedError: If a processor factory is not implemented for the given
            policy configuration type.
    """
+    if pretrained_path and getattr(policy_cfg, "type", None) == "pi052":
+        # pi052 pipelines don't roundtrip through the saved
+        # ``policy_preprocessor.json``: ``RenderMessagesStep`` holds a
+        # Python ``TrainingRecipe`` (not JSON-serializable; saved as
+        # ``{}``) and ``ActionTokenizerProcessorStep`` saves a host-only
+        # FAST tokenizer path. Generic ``from_pretrained`` then dies
+        # with ``RenderMessagesStep.__init__() missing 1 required
+        # positional argument: 'recipe'`` (job 22164494).
+        #
+        # Mirror ``lerobot_pi052_runtime``'s bootstrap: build pipelines
+        # fresh from ``config.recipe_path`` and transplant the saved
+        # stateful blobs (normalizer stats) from the checkpoint dir.
+        from .pi052.processor_pi052 import make_pi052_pre_post_processors
+
+        preprocessor, postprocessor = make_pi052_pre_post_processors(
+            config=policy_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+            dataset_repo_id=kwargs.get("dataset_repo_id"),
+        )
+        _restore_pi052_pretrained_state(preprocessor, postprocessor, pretrained_path)
+        _reconnect_relative_absolute_steps(preprocessor, postprocessor)
+        return preprocessor, postprocessor
+
    if pretrained_path:
        # TODO(Steven): Temporary patch, implement correctly the processors for Gr00t
        if isinstance(policy_cfg, GrootConfig):
@@ -373,6 +483,22 @@ def make_pre_post_processors(
            dataset_stats=kwargs.get("dataset_stats"),
        )

+    elif policy_cfg.type == "pi052":
+        # NOTE: PI052Config subclasses PI05Config, so this branch MUST
+        # come before the PI05Config isinstance check below (otherwise
+        # pi052 would silently pick up π0.5's processor).
+        from .pi052.processor_pi052 import make_pi052_pre_post_processors
+
+        processors = make_pi052_pre_post_processors(
+            config=policy_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+            # ``dataset_repo_id`` flows in via kwargs when FAST CE is
+            # enabled — the train loop sets it from ``--dataset.repo_id``.
+            # When ``None``, ``make_pi052_pre_post_processors`` skips
+            # the auto-fit and uses the universal tokenizer.
+            dataset_repo_id=kwargs.get("dataset_repo_id"),
+        )
+
    elif isinstance(policy_cfg, PI05Config):
        from .pi05.processor_pi05 import make_pi05_pre_post_processors

@@ -178,7 +178,6 @@ N_COLOR_CHANNELS = 3


 # config
-@strict
 class GR00TN15Config(PretrainedConfig):
    model_type = "gr00t_n1_5"

@@ -0,0 +1,42 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""π0.5 v2 — full reproduction of the π0.5 paper's hierarchical
+inference recipe on lerobot.
+
+Extends :class:`lerobot.policies.pi05.PI05Policy` with:
+
+* recipe-driven training (PR 1's :class:`RenderMessagesStep`),
+* PaliGemma ``lm_head`` cross-entropy on supervised subtask spans
+  (the "high-level subtask prediction" of the paper, §IV.D),
+* AR text generation at inference (:meth:`PI052Policy.select_message`),
+* per-component prompt dropout (Pi0.7 §V.E) for regularising the
+  text head against missing context at inference.
+
+See ``src/lerobot/configs/recipes/subtasks_vqa.yaml`` for the
+canonical training recipe and
+``examples/training/pi052_hirobot.slurm`` for the launcher.
+"""
+
+from .configuration_pi052 import PI052Config
+from .modeling_pi052 import PI052Policy
+from .processor_pi052 import make_pi052_pre_post_processors
+from .text_processor_pi052 import PI052TextTokenizerStep
+
+__all__ = [
+    "PI052Config",
+    "PI052Policy",
+    "PI052TextTokenizerStep",
+    "make_pi052_pre_post_processors",
+]
@@ -0,0 +1,235 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""π0.5 v2 (with text head) — reproduction of the π0.5 paper's
+hierarchical inference recipe.
+
+Same architecture as the existing ``PI05Policy`` (PaliGemma 2B VLM +
+~300M Gemma action expert, joint training with FAST tokens during
+pre-train and flow matching during post-train), but with the
+PaliGemma ``lm_head`` re-enabled so the same model can be supervised
+to predict both:
+
+  * **subtask strings** at the high level (cross-entropy on the LM
+    head), and
+  * **action chunks** at the low level (flow matching on the
+    action-expert tokens).
+
+This is the dual-head co-training pattern from the paper:
+
+    L = H(x, f_θ_text) + α * ‖ω - a - f_θ_action(a_τ, o, ℓ)‖²
+
+with α = 10.0 per § IV.D of arxiv:2504.16054. The π0.5 model splits
+inference into a text-prediction step followed by an action-prediction
+step, which the multi-rate ``PI052Runtime`` (in
+``lerobot.policies.pi052.inference``) drives at separate rates.
+"""
+
+from dataclasses import dataclass
+
+from lerobot.configs import PreTrainedConfig
+from lerobot.optim.optimizers import AdamWConfig
+
+from ..pi05.configuration_pi05 import PI05Config
+
+
+@PreTrainedConfig.register_subclass("pi052")
+@dataclass
+class PI052Config(PI05Config):
+    """π0.5 with the PaliGemma LM head re-enabled for subtask prediction.
+
+    Recipe-driven dual-head training: the flow head supervises actions,
+    the LM head supervises subtask / plan / memory / VQA text. The
+    flow:text loss weighting is 5:1, milder than the paper's α=10
+    (see ``flow_loss_weight``).
+    """
+
+    # Recipe / language stack ---------------------------------------------
+    recipe_path: str | None = "recipes/subtasks_vqa.yaml"
+    """Path (absolute or relative to ``src/lerobot/configs/``) to a
+    ``TrainingRecipe`` YAML. Defaults to the canonical Hi-Robot blend
+    shipped alongside this policy. Set to ``None`` to disable recipe
+    rendering and fall back to π0.5's single-task ``Task: ... Action:``
+    prompt path (unannotated datasets keep working that way)."""
+
+    apply_chat_template: bool = False
+    """PaliGemma is *not* chat-pretrained — its tokenizer doesn't ship a
+    chat template, so we don't apply one. The recipe renderer's output
+    is concatenated as a plain prefix + assistant suffix instead,
+    mirroring how the π0.5 paper's high-level inference samples text
+    auto-regressively after the prefix."""
+
+    # Loss weights --------------------------------------------------------
+    # Paper §IV.D uses α=10 between the flow and text terms, assuming
+    # text is a rare auxiliary task. With the recipe stack the flow-only
+    # `low_level` branch fires on a large share of samples, so α=10
+    # swamps the LM head and collapses generation into degenerate
+    # repetition. We use the milder 5:1 split here.
+    text_loss_weight: float = 1.0
+    """Weight on the LM-head cross-entropy term. Set to ``0`` to disable
+    text training entirely (reverts to flow-only / π0.5 behaviour)."""
+
+    flow_loss_weight: float = 5.0
+    """Weight on the action-expert flow-matching term. ``5.0`` — a milder
+    flow:text split than the paper's α=10, since the flow-only
+    ``low_level`` recipe already gives the action expert frequent
+    gradient. Lower it further if the LM head still underfits."""
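
Review note: the effect of the 5:1 weighting versus the paper's α=10 is easy to see numerically. The sketch below assumes the three terms are combined as a plain weighted sum; how `PI052Policy.forward` actually combines them is not shown in this diff section, so `combined_loss` is a hypothetical helper and the loss values are made up for illustration.

```python
def combined_loss(
    text_ce,
    flow_mse,
    fast_ce=0.0,
    *,
    text_loss_weight=1.0,
    flow_loss_weight=5.0,
    fast_action_loss_weight=1.0,
):
    """Weighted sum of the text, flow and FAST terms (assumed form)."""
    return (
        text_loss_weight * text_ce
        + flow_loss_weight * flow_mse
        + fast_action_loss_weight * fast_ce
    )


# Illustrative per-batch loss values, not measurements.
default_split = combined_loss(2.0, 0.5)
paper_split = combined_loss(2.0, 0.5, flow_loss_weight=10.0)
flow_only = combined_loss(2.0, 0.5, text_loss_weight=0.0)
```

With these numbers the flow term contributes 2.5 of 4.5 under the default weights and 5.0 of 7.0 under α=10, which is the imbalance the comment above describes.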
+
+    # Backbone training ---------------------------------------------------
+    unfreeze_lm_head: bool = True
+    """Whether to keep the PaliGemma ``lm_head`` unfrozen for fine-tuning.
+    The existing ``PI05Policy`` zeroes / freezes the head on load
+    because it never reads from it. Must be ``True`` for π0.5-style
+    hierarchical inference."""
+
+    # Per-component prompt dropout (Pi0.7 §V.E) ---------------------------
+    # Randomly drop non-target context messages so the LM head learns
+    # to handle missing / stale plan / memory at inference. Defaults to
+    # 0.0 so behaviour is identical until explicitly enabled.
+    plan_dropout_prob: float = 0.0
+    memory_dropout_prob: float = 0.0
+    subtask_dropout_prob: float = 0.0
+
+    # FAST discrete-action supervision — paper §III.B-C ------------------
+    # When enabled, actions are *also* tokenised via the FAST tokenizer
+    # ("physical-intelligence/fast") and supervised with cross-entropy
+    # on the PaliGemma LM head — exactly as in the paper's pre-training
+    # objective (Eq. 1 mixes FAST CE + flow MSE + subtask CE). The
+    # ActionTokenizerProcessorStep is wired into the preprocessor
+    # pipeline when this flag is set; the loss is computed in
+    # PI052Policy.forward.
+    enable_fast_action_loss: bool = True
+    """If True, tokenise actions with the FAST tokenizer and add a
+    cross-entropy loss on the LM head. On by default to match the
+    π0.5 paper's three-loss objective (text CE + FAST CE + flow MSE,
+    §III.B-C Eq. 1). Set to False if you only want the
+    post-training-style flow + text recipe."""
+
+    action_tokenizer_name: str = "physical-intelligence/fast"
+    """HF identifier for the FAST action tokenizer."""
+
+    max_action_tokens: int = 256
+    """Maximum number of FAST tokens per action chunk."""
+
+    fast_skip_tokens: int = 128
+    """Number of low-vocab tokens the FAST tokenizer skips to avoid
+    collisions with PaliGemma's text vocabulary."""
+
+    fast_action_loss_weight: float = 1.0
+    """Weight on the FAST-action-token CE loss. Paper §III.C uses 1.0."""
+
+    auto_fit_fast_tokenizer: bool = False
+    """If True, the processor factory checks ``fast_tokenizer_cache_dir``
+    for a previously-fitted tokenizer keyed on ``(dataset_repo_id,
+    base_tokenizer_name, fit_samples)``. On cache miss, it loads
+    ``action_tokenizer_name`` as a base, samples
+    ``fast_tokenizer_fit_samples`` action chunks from the dataset, runs
+    ``.fit()``, saves the result, and uses *that* fitted path as the
+    actual tokenizer. Pertsch et al. 2025 (FAST paper [64], π0.5 §III.C)
+    explicitly recommend per-dataset fitting for best compression.
+
+    Off by default because the fit requires a separate pass over the
+    dataset before training (~1-2 min on a medium dataset) and depends
+    on the FAST tokenizer snapshot having a ``.fit()`` method. Opt in
+    when you want paper-faithful compression; leave off to fall back
+    on the universal ``physical-intelligence/fast`` codebook."""
+
+    fast_tokenizer_cache_dir: str = "~/.cache/lerobot/fast_tokenizers"
+    """Where fitted FAST tokenizers are stored. ``~`` expands."""
+
+    fast_tokenizer_fit_samples: int = 1024
+    """Number of action chunks to sample for the fit. The FAST paper uses
+    a few thousand; 1024 is a reasonable default for medium datasets."""
+
+    # Knowledge insulation — paper §III.B --------------------------------
+    # When enabled, gradients from the action expert's flow loss are
+    # blocked from flowing back into the VLM's K/V projections. This
+    # prevents the action loss from over-fitting the language backbone
+    # to robot-specific features. Implemented in ``modeling_pi052`` as
+    # a per-instance monkey-patch on ``paligemma_with_expert.forward``
+    # that splits queries into VLM and action halves and ``.detach()``-s
+    # the VLM K/V tensors used in the action-half's attention.
+    knowledge_insulation: bool = False
+    """If True, route every transformer layer through the KI
+    attention path that blocks action→VLM gradient flow on K/V."""
+
+    # Learning-rate defaults --------------------------------------------
+    # pi052 inherits π0.5's openpi-validated optimizer config (peak LR
+    # 2.5e-5, cosine→2.5e-6, 1k warmup, AdamW (0.9, 0.95), wd=0.01,
+    # grad_clip=1.0). The only place pi052 needs to diverge from pi05
+    # is the LM-head LR multiplier: pi05 has no text supervision so the
+    # head doesn't get gradients; pi052 always has text supervision
+    # (subtask / memory / VQA) via the recipe, and under KI the LM head
+    # only sees gradients on ~30–45% of the batch (the text-CE mask
+    # share of the recipe). Under aggressive cosine decay this is too
+    # weak to keep the head pinned, so it drifts back toward PaliGemma's
+    # pretrained ``<loc>`` first-token bias. 5x is the documented fix
+    # (see ``PI05Config.lm_head_lr_scale`` docstring); the wiring is
+    # already in ``PI05Policy.get_optim_params`` — it splits the LM head
+    # + tied ``embed_tokens`` into their own param group while sharing
+    # the same cosine lambda, so the 5x ratio is preserved across decay.
+    lm_head_lr_scale: float = 5.0
+
+    # PaLM-style z-loss on text CE. Penalises the log-partition function
+    # ``z = log Σ exp(logits)`` drifting away from zero — without it, large-
+    # vocab models (PaliGemma is 257k) can let ``logsumexp`` grow unbounded
+    # while CE stays low, because a uniform additive logit bias cancels in
+    # softmax. PaLM appendix B / Chinchilla report z-loss is essential for
+    # stable large-vocab CE; it especially helps under ``lm_head_lr_scale=
+    # 5.0`` which amplifies drift risk on the LM head. ``1e-4`` is the
+    # commonly cited weight; set 0 to disable entirely.
+    text_ce_z_loss_weight: float = 1e-4
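
Review note: the claim that a uniform logit bias cancels in softmax, and that z-loss is what penalises it, can be checked on a toy example. The sketch below uses only the standard library and the commonly cited form `weight * (log Z)^2`; the actual implementation in `modeling_pi052` is not shown in this section, so `ce_with_z_loss` is a hypothetical helper.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def ce_with_z_loss(logits, target, z_weight=1e-4):
    """Return (cross_entropy, z_penalty) for a single token."""
    z = logsumexp(logits)
    ce = z - logits[target]       # -log softmax(logits)[target]
    return ce, z_weight * z * z   # penalty grows as log Z drifts from 0


logits = [2.0, 0.5, -1.0]
shifted = [v + 50.0 for v in logits]   # uniform additive bias
ce_a, z_a = ce_with_z_loss(logits, target=0)
ce_b, z_b = ce_with_z_loss(shifted, target=0)
```

The cross-entropy is identical for both inputs, so CE alone gives no gradient against the shift; only the z term distinguishes them.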
+
+    # Liger Triton kernels (rope + geglu + layer_norm) are now patched
+    # unconditionally at model build time — see ``_enable_hf_kernels``
+    # in ``modeling_pi052``. The patch is process-global, idempotent
+    # and degrades gracefully if ``liger-kernel`` is missing. Measured
+    # at -4.5% step time on H100 (bench job 22161421); peak memory
+    # unchanged. ``fused_linear_cross_entropy`` ships separately via
+    # ``_shifted_lin_ce`` / ``_fast_lin_ce``.
+    use_hf_kernels: bool = True
+    """Deprecated. Liger HF kernels are patched unconditionally by
+    ``_enable_hf_kernels`` — this field is retained as a no-op for
+    backward compatibility with checkpoints saved before commit
+    d70c8104 (which still serialize ``use_hf_kernels: true`` into
+    ``config.json``). Loading those configs would otherwise raise
+    ``DecodingError: The fields use_hf_kernels are not valid for
+    PI052Config`` (job 22164492). Remove in a future major bump."""
+
+    # Optimizer foreach/fused. pi052 carries these locally because the shared
+    # PI05Config (kept identical to upstream main) does not define them; the
+    # checkpoints we train serialize both keys into config.json, so they must
+    # be valid PI052Config fields and flow into the AdamW preset below.
+    optimizer_foreach: bool | None = False
+    optimizer_fused: bool | None = True
+
+    def get_optimizer_preset(self) -> AdamWConfig:
+        return AdamWConfig(
+            lr=self.optimizer_lr,
+            betas=self.optimizer_betas,
+            eps=self.optimizer_eps,
+            weight_decay=self.optimizer_weight_decay,
+            grad_clip_norm=self.optimizer_grad_clip_norm,
+            foreach=self.optimizer_foreach,
+            fused=self.optimizer_fused,
+        )
+
+    def __post_init__(self) -> None:
+        super().__post_init__()
+        # Backbone needs gradients flowing through the text head when
+        # we're training it. Override the π0.5 default
+        # (``train_expert_only=True``) unless the user opts out of text
+        # training via ``text_loss_weight=0`` or ``unfreeze_lm_head=False``.
+        if self.text_loss_weight > 0 and self.unfreeze_lm_head:
+            self.train_expert_only = False
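Reviewer note on ``text_ce_z_loss_weight``: the sketch below illustrates why a z-loss term is needed at all. It assumes the PaLM-style form ``weight * z**2`` with ``z = logsumexp(logits)``; the function name and the NumPy implementation are illustrative only and are not taken from ``modeling_pi052``.

```python
import numpy as np


def text_ce_with_z_loss(logits, targets, z_loss_weight=1e-4):
    """Return (mean CE, mean z-loss) for (N, V) logits and (N,) int targets.

    Hypothetical helper: assumes the auxiliary term is weight * z**2.
    """
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-partition z = log sum_j exp(logits_j), per row.
    row_max = logits.max(axis=-1, keepdims=True)
    z = (row_max + np.log(np.exp(logits - row_max).sum(axis=-1, keepdims=True))).squeeze(-1)
    # Cross-entropy is z minus the target logit.
    ce = z - logits[np.arange(logits.shape[0]), targets]
    z_loss = z_loss_weight * z**2
    return float(ce.mean()), float(z_loss.mean())


logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
targets = np.array([0, 2])

ce, zl = text_ce_with_z_loss(logits, targets)
# A uniform additive bias cancels in softmax, so CE cannot see it ...
ce_shifted, zl_shifted = text_ce_with_z_loss(logits + 10.0, targets)
# ... but the z-loss term grows with the bias and pulls logsumexp back.
```

CE is unchanged by the ``+10`` shift while the z-loss term increases, which is the drift the config comment describes.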
@@ -0,0 +1,304 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Dataset-specific FAST action tokenizer fitting.
+
+The published ``physical-intelligence/fast`` tokenizer is a *universal*
+codebook fitted on a heterogeneous mix of robot datasets. Per Pertsch
+et al. 2025 (the FAST paper, [64] in the π0.5 paper) and §III.C of
+π0.5 itself, the recommended practice is to **finetune the tokenizer on
+your specific dataset's action distribution** before training the
+policy — same way one would adapt a language tokenizer to a domain
+corpus. Without this finetune step, action sequences from your robot
+may require more tokens per chunk than necessary, lowering effective
+compression and slowing convergence of the action-CE loss.
+
+This module provides a single utility, :func:`fit_fast_tokenizer`,
+that does the finetune. The training entry point invokes it
+automatically when the policy's ``enable_fast_action_loss`` and
+``auto_fit_fast_tokenizer`` flags are both ``True`` and no cached
+fitted tokenizer is found at ``fast_tokenizer_cache_dir``.
+
+The fitted tokenizer is saved to ``{cache_dir}/{signature}/``, where
+``signature`` hashes the dataset id, base tokenizer, sample count and
+chunk size, so successive runs with the same settings re-use it.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import logging
+import os
+import time
+from pathlib import Path
+
+import numpy as np
+
+logger = logging.getLogger(__name__)
+
+# Marker file the cache-hit check looks for. ``ProcessorMixin.save_pretrained``
+# writes ``processor_config.json`` (NOT ``preprocessor_config.json`` —
+# that's the image / feature-extractor convention). Centralised here so
+# the cache-hit check and the rank-N readiness wait agree on the same
+# sentinel.
+_CACHE_SENTINEL = "processor_config.json"
+
+
+def _dataset_signature(
+    dataset_repo_id: str,
+    base_tokenizer_name: str,
+    n_samples: int,
+    chunk_size: int,
+) -> str:
+    """Deterministic short hash for naming the cache directory.
+
+    Keys on (dataset, base tokenizer, sample count, chunk size) so any
+    of those changing re-runs the fit. ``chunk_size`` matters because
+    the tokenizer is fit on chunks of that length.
+    """
+    h = hashlib.sha256()
+    h.update(dataset_repo_id.encode("utf-8"))
+    h.update(b"\0")
+    h.update(base_tokenizer_name.encode("utf-8"))
+    h.update(b"\0")
+    h.update(str(n_samples).encode("utf-8"))
+    h.update(b"\0")
+    h.update(str(chunk_size).encode("utf-8"))
+    return h.hexdigest()[:16]
+
+
+def fit_fast_tokenizer(
+    *,
+    dataset_repo_id: str,
+    cache_dir: str | Path,
+    base_tokenizer_name: str = "physical-intelligence/fast",
+    n_samples: int = 1024,
+    chunk_size: int = 50,
+    seed: int = 42,
+) -> str:
+    """Fit a FAST tokenizer on a LeRobot dataset's action distribution.
+
+    Args:
+        dataset_repo_id: HF Hub repo id of the LeRobotDataset to fit on.
+        cache_dir: Directory under which to save (and look up) fitted
+            tokenizers. The actual save path is
+            ``{cache_dir}/{signature}``.
+        base_tokenizer_name: HF identifier for the base FAST tokenizer
+            to finetune from. ``physical-intelligence/fast`` is the
+            universal one.
+        n_samples: Number of action chunks to sample for the fit. The
+            FAST paper uses a few thousand; ``1024`` is a good default
+            for medium datasets.
+        chunk_size: Length of each action chunk (matches
+            ``policy.chunk_size``). The FAST tokenizer is fit on
+            sequences of this length.
+        seed: RNG seed for sample selection.
+
+    Returns:
+        The local path to the fitted tokenizer. Passed directly to
+        ``--policy.action_tokenizer_name`` for the training run.
+
+    Raises:
+        ImportError: If the ``transformers`` library doesn't expose
+            ``AutoProcessor`` or the FAST tokenizer doesn't have a
+            ``.fit()`` method (then you're on an older FAST snapshot —
+            update to the current published model).
+        RuntimeError: If no data shards or usable chunks exist, or a rank times out.
+    """
+    cache_dir = Path(cache_dir)
+    sig = _dataset_signature(dataset_repo_id, base_tokenizer_name, n_samples, chunk_size)
+    out_dir = cache_dir / sig
+
+    if out_dir.exists() and (out_dir / _CACHE_SENTINEL).exists():
+        logger.info(
+            "FAST tokenizer cache hit: %s — re-using fitted tokenizer for "
+            "dataset=%s base=%s n_samples=%d",
+            out_dir, dataset_repo_id, base_tokenizer_name, n_samples,
+        )
+        return str(out_dir)
+
+    # DDP-safe fit: only the (local) main process actually fits + saves;
+    # other ranks poll the cache sentinel until the leader is done.
+    # Without this guard, all N ranks fit concurrently and race on
+    # ``save_pretrained`` + ``AutoProcessor.from_pretrained`` (the latter
+    # copies ``processing_action_tokenizer.py`` into ``HF_MODULES_CACHE``
+    # and compiles a ``.pyc`` — concurrent writers occasionally produce
+    # a stale / partial ``.pyc`` and the subsequent ``from .. import
+    # UniversalActionProcessor`` raises ``AttributeError``).
+    is_leader = (
+        int(os.environ.get("RANK", "0")) == 0
+        and int(os.environ.get("LOCAL_RANK", "0")) == 0
+    )
+    if not is_leader:
+        timeout_s = 1800.0  # 30 min — covers ~1024-sample fits on cold caches
+        start = time.monotonic()
+        while not (out_dir / _CACHE_SENTINEL).exists():
+            if time.monotonic() - start > timeout_s:
+                raise RuntimeError(
+                    f"FAST tokenizer fit: non-leader rank timed out after "
+                    f"{timeout_s:.0f}s waiting for {out_dir / _CACHE_SENTINEL}. "
+                    "Leader rank likely crashed during the fit."
+                )
+            time.sleep(2.0)
+        logger.info("FAST tokenizer ready (leader populated cache): %s", out_dir)
+        return str(out_dir)
+
+    logger.info(
+        "FAST tokenizer cache miss — fitting on dataset=%s "
+        "base=%s n_samples=%d chunk_size=%d → %s",
+        dataset_repo_id, base_tokenizer_name, n_samples, chunk_size, out_dir,
+    )
+
+    from transformers import AutoProcessor  # noqa: PLC0415
+
+    from lerobot.datasets.lerobot_dataset import LeRobotDataset  # noqa: PLC0415
+
+    # Sample chunks per episode: a random episode order plus random
+    # start offsets gives a reasonable spread. The action column is
+    # loaded once, up front (see the parquet read below).
+    #
+    # Actions are read straight from the underlying HF dataset's
+    # ``action`` *column* — never via ``ds[i]``. ``ds[i]`` builds a full
+    # training item (delta-timestamp expansion + video decode + image
+    # transforms); a single bad video frame would then throw and, since
+    # the failure was swallowed at debug level, silently starve the fit
+    # of every chunk. The action column carries no video, so reading it
+    # directly is both faster and immune to decode errors.
+    rng = np.random.default_rng(seed)
+    actions_buf: list[np.ndarray] = []
+
+    # Resolve the dataset's data parquet shards directly, sidestepping
+    # ``LeRobotDataset(repo_id, episodes=[N])`` which on v3-format
+    # datasets routes through HF datasets'' split lookup and raises
+    # ``ValueError: Instruction "train" corresponds to no data!`` for
+    # every episode (job 22182985 looped through 13,293 skipped episodes
+    # for ~2.5 h before NCCL killed it). Reading the ``action`` column
+    # straight from the parquet shards is also faster: each per-episode
+    # ``LeRobotDataset`` instantiation re-parses every meta file.
+    from huggingface_hub import snapshot_download  # noqa: PLC0415
+    import pyarrow as _pa  # noqa: PLC0415
+    import pyarrow.parquet as _pq  # noqa: PLC0415
+
+    snap = Path(snapshot_download(repo_id=dataset_repo_id, repo_type="dataset"))
+    data_files = sorted((snap / "data").glob("chunk-*/file-*.parquet"))
+    if not data_files:
+        raise RuntimeError(
+            f"FAST fit: no ``data/chunk-*/file-*.parquet`` shards found under {snap!s}."
+        )
+
+    # Read just the (episode_index, action) columns once across all
+    # shards. This is the same pattern used elsewhere in the codebase
+    # for whole-dataset audits and stays under ~2 GB even on 32 k-episode
+    # / 29 M-frame datasets because the action column is a fixed-length
+    # float vector.
+    tables = [_pq.read_table(f, columns=["episode_index", "action"]) for f in data_files]
+    table = _pa.concat_tables(tables)
+    eps = table["episode_index"].to_numpy()
+    acts_col = table["action"]
+    # ``action`` may be a fixed-shape ListArray or a 2-D NumericArray;
+    # ``to_numpy(zero_copy_only=False)`` produces an object array of
+    # 1-D NumPy actions either way, which we stack into (N, D).
+    try:
+        acts = np.stack(acts_col.to_numpy(zero_copy_only=False)).astype(np.float32)
+    except Exception:  # noqa: BLE001
+        # Fallback path for nested-list types: flatten via to_pylist().
+        acts = np.asarray(acts_col.to_pylist(), dtype=np.float32)
+    if acts.ndim != 2:
+        raise RuntimeError(
+            f"FAST fit: expected ``action`` rows to be 1-D vectors; got shape {acts.shape}."
+        )
+
+    # Episode index → slice (start, stop) into ``acts`` along axis 0.
+    # ``eps`` is monotonically increasing within each parquet shard but
+    # we make no assumption across shards — sort once and group.
+    order = np.argsort(eps, kind="stable")
+    eps_sorted = eps[order]
+    boundaries = np.searchsorted(eps_sorted, np.arange(int(eps_sorted.max()) + 2))
+    ep_to_slice: dict[int, tuple[int, int]] = {
+        int(ep): (int(boundaries[ep]), int(boundaries[ep + 1]))
+        for ep in range(len(boundaries) - 1)
+        if boundaries[ep] < boundaries[ep + 1]
+    }
+    num_episodes = len(ep_to_slice)
+    # ``acts`` is in original (un-sorted-by-episode) row order; reorder
+    # so per-episode slices are contiguous.
+    acts = acts[order]
+
+    samples_per_episode = max(1, n_samples // max(num_episodes, 1))
+    collected = 0
+    eps_visited = 0
+    short_episodes = 0
+    ep_indices = list(ep_to_slice.keys())
+    for ep_idx in rng.permutation(ep_indices):
+        if collected >= n_samples:
+            break
+        start, stop = ep_to_slice[int(ep_idx)]
+        ep_actions = acts[start:stop]
+        if ep_actions.shape[0] < chunk_size:
+            short_episodes += 1
+            continue
+        starts = rng.integers(0, ep_actions.shape[0] - chunk_size + 1, size=samples_per_episode)
+        for s in starts:
+            actions_buf.append(ep_actions[int(s) : int(s) + chunk_size])
+            collected += 1
+            if collected >= n_samples:
+                break
+        eps_visited += 1
+
+    if not actions_buf:
+        raise RuntimeError(
+            f"FAST fit collected zero action chunks from {dataset_repo_id!r}: "
+            f"all {num_episodes} episodes were shorter than chunk_size="
+            f"{chunk_size} ({short_episodes} too short) or had an unreadable "
+            "``action`` column. Lower ``chunk_size`` to match your episode "
+            "lengths."
+        )
+
+    actions = np.stack(actions_buf, axis=0).astype(np.float32)  # (N, H, D)
+    logger.info(
+        "FAST fit: collected %d chunks of shape %s from %d episodes",
+        actions.shape[0], actions.shape[1:], eps_visited,
+    )
+
+    # Quantile-normalise per dimension before fitting.
+    #
+    # The FAST tokenizer DCT-transforms actions, scales by ``scale`` and
+    # rounds to integer tokens; the integer *range* must fit the
+    # codebook (vocab_size, default 1024). Raw motor units (e.g. encoder
+    # ticks) blow that range up — hence "Vocab size 1024 is too small".
+    # More importantly, at training time ``ActionTokenizerProcessorStep``
+    # runs *after* the QUANTILES ``NormalizerProcessorStep``, so it
+    # encodes normalised actions. Fitting on raw actions would mismatch
+    # that space. We replicate QUANTILES normalisation here (per-dim
+    # [q01, q99] → [-1, 1], clipped) so the fit and the training-time
+    # encode see the same distribution.
+    flat = actions.reshape(-1, actions.shape[-1])
+    q01 = np.quantile(flat, 0.01, axis=0)
+    q99 = np.quantile(flat, 0.99, axis=0)
+    span = np.where((q99 - q01) > 1e-6, q99 - q01, 1.0)
+    actions = np.clip((actions - q01) / span * 2.0 - 1.0, -1.0, 1.0).astype(np.float32)
+
+    base = AutoProcessor.from_pretrained(base_tokenizer_name, trust_remote_code=True)
+    if not hasattr(base, "fit"):
+        raise ImportError(
+            f"Base FAST tokenizer {base_tokenizer_name!r} has no ``.fit()`` "
+            "method — your transformers / model snapshot is too old. Update "
+            "to the current ``physical-intelligence/fast`` revision."
+        )
+
+    fitted = base.fit(actions)
+    out_dir.mkdir(parents=True, exist_ok=True)
+    fitted.save_pretrained(str(out_dir))
+    logger.info("FAST fit: saved fitted tokenizer to %s", out_dir)
+    return str(out_dir)
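Reviewer note on the quantile normalisation step in ``fit_fast_tokenizer``: a minimal standalone sketch of the same per-dimension ``[q01, q99] → [-1, 1]`` mapping, run on synthetic data. The helper name ``quantile_normalize`` and the toy array are assumptions for illustration; the arithmetic mirrors the lines in the diff.

```python
import numpy as np


def quantile_normalize(actions, lo=0.01, hi=0.99, eps=1e-6):
    """Map each action dimension from [q_lo, q_hi] to [-1, 1], clipped.

    ``actions`` has shape (N, H, D): N chunks, H steps, D action dims.
    """
    flat = actions.reshape(-1, actions.shape[-1])
    q_lo = np.quantile(flat, lo, axis=0)
    q_hi = np.quantile(flat, hi, axis=0)
    # Guard against (near-)constant dimensions to avoid dividing by zero.
    span = np.where((q_hi - q_lo) > eps, q_hi - q_lo, 1.0)
    return np.clip((actions - q_lo) / span * 2.0 - 1.0, -1.0, 1.0).astype(np.float32)


rng = np.random.default_rng(0)
raw = rng.normal(size=(8, 50, 3))
raw[..., 0] = raw[..., 0] * 500.0 + 2048.0  # mimic raw encoder ticks
raw[..., 2] = 7.0                            # a constant dimension
norm = quantile_normalize(raw)
```

Note that a constant dimension falls back to ``span = 1`` and therefore lands on ``-1`` everywhere, not ``0``; that is harmless for the fit but worth knowing when inspecting normalised chunks.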
@@ -0,0 +1,73 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PI052 inference / runtime orchestration.
+
+Multi-rate runtime that mirrors the recipe-time training shape:
+
+  low_level_execution        → LowLevelForward + DispatchAction (high Hz)
+  high_level_subtask         → HighLevelSubtaskFwd (~1 Hz)
+  memory_update              → MemoryUpdateFwd (event: subtask_change)
+  user_interjection_response → UserInterjectionFwd (event: stdin)
+  ask_vqa_*                  → AskVQAFwd (event: stdin question)
+  speech tool calls          → DispatchToolCalls (event: tool_call_pending)
+
+The CLI ``lerobot-pi052-runtime`` builds a ``PI052Runtime`` and calls
+``run()``.
+"""
+
+from .repl import StdinReader
+from .runtime import PI052Runtime
+from .runtime_state import initial_runtime_state, push_log, set_if_changed, take_event
+from .steps import (
+    AskVQAFwd,
+    DispatchAction,
+    DispatchToolCalls,
+    HighLevelSubtaskFwd,
+    InferenceStep,
+    LowLevelForward,
+    MemoryUpdateFwd,
+    UserInterjectionFwd,
+)
+from .triggers import EventTrigger, HzTrigger, Tick, TickClock, Trigger
+from .ui import make_state_panel, print_robot_lines, print_user_line
+
+__all__ = [
+    # runtime
+    "PI052Runtime",
+    "StdinReader",
+    # state helpers
+    "initial_runtime_state",
+    "push_log",
+    "set_if_changed",
+    "take_event",
+    # triggers
+    "Trigger",
+    "Tick",
+    "TickClock",
+    "HzTrigger",
+    "EventTrigger",
+    # steps
+    "InferenceStep",
+    "LowLevelForward",
+    "DispatchAction",
+    "HighLevelSubtaskFwd",
+    "MemoryUpdateFwd",
+    "UserInterjectionFwd",
+    "AskVQAFwd",
+    "DispatchToolCalls",
+    # UI
+    "make_state_panel",
+    "print_robot_lines",
+    "print_user_line",
+]
@@ -0,0 +1,105 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Stdin REPL event collector for the PI052 runtime.
+
+Reads stdin without blocking and classifies each line (first match wins):
+
+  "stop" / "quit" / "exit"               → state["stop"] = True
+  "/action" / "/pause" (and aliases)      → set state["mode"]
+  any line while no task is set           → set runtime task ("task:" stripped)
+  ends with "?"                           → user_vqa_query event
+  anything else                           → user_interjection event
+
+Plugged into the runtime via ``event_collector=StdinReader().poll``.
+
+Note: the shipped CLI (``lerobot-pi052-runtime``) drives stdin
+directly in its REPL / autonomous loops and does *not* wire this
+collector; it's kept as the documented embedding hook and for tests.
+"""
+
+from __future__ import annotations
+
+import select
+import sys
+from dataclasses import dataclass, field
+from typing import Any
+
+
+@dataclass
+class StdinReader:
+    """Non-blocking stdin line collector for the runtime loop."""
+
+    prompt: str = "> "
+    _seen_first_line: bool = field(default=False, init=False)
+    _prompted: bool = field(default=False, init=False)
+
+    def poll(self, state: dict[str, Any]) -> None:
+        """Read at most one pending stdin line and turn it into a runtime event."""
+        # Print the input prompt once on every fresh tick if we don't
+        # already have a pending line; matches the expected REPL feel.
+        if not self._prompted:
+            print(self.prompt, end="", flush=True)
+            self._prompted = True
+
+        # ``select`` with timeout=0 makes this non-blocking. Only works
+        # for actual TTY / pipe stdins; CI / scripted runs hit EOF.
+        try:
+            ready, _, _ = select.select([sys.stdin], [], [], 0)
+        except (ValueError, OSError):
+            return
+        if not ready:
+            return
+
+        line = sys.stdin.readline()
+        if not line:  # EOF
+            state["stop"] = True
+            return
+        line = line.strip()
+        self._prompted = False  # we'll re-prompt next tick
+        if not line:
+            return
+
+        lower = line.lower()
+        if lower in {"stop", "quit", "exit"}:
+            state["stop"] = True
+            return
+
+        # Slash commands flip the run mode. ``/pause`` stops the action
+        # loop (the action steps gate on ``state["mode"]``); ``/action``
+        # resumes it.
+        if lower.split(" ", 1)[0] in {"/action", "/act", "/run"}:
+            state["mode"] = "action"
+            return
+        if lower in {"/pause", "/p"}:
+            state["mode"] = "paused"
+            queue = state.get("action_queue")
+            if hasattr(queue, "clear"):
+                queue.clear()
+            return
+
+        # First non-control line sets the task if no task is active.
+        if not state.get("task"):
+            task = line[5:].strip() if lower.startswith("task:") else line
+            state["task"] = task
+            print(f"[pi052] Task: {task}", flush=True)
+            self._seen_first_line = True
+            return
+
+        # Question → VQA; statement → interjection.
+        if lower.endswith("?"):
+            state["recent_vqa_query"] = line
+            state.setdefault("events_this_tick", []).append("user_vqa_query")
+        else:
+            state["recent_interjection"] = line
+            state.setdefault("events_this_tick", []).append("user_interjection")
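Reviewer note on the classification order in ``StdinReader.poll``: the sketch below pulls the branching into a pure function so the precedence is easy to test without a TTY. ``classify_line`` and its ``(kind, payload)`` return shape are hypothetical; the branch order and the accepted spellings follow the diff.

```python
def classify_line(line, has_task):
    """Return a (kind, payload) pair for one stdin line.

    Hypothetical pure-function mirror of the branch order in ``poll``.
    """
    text = line.strip()
    lower = text.lower()
    if not text:
        return ("ignore", "")
    if lower in {"stop", "quit", "exit"}:
        return ("stop", "")
    # Slash commands: the resume aliases match on the first word only.
    if lower.split(" ", 1)[0] in {"/action", "/act", "/run"}:
        return ("mode", "action")
    if lower in {"/pause", "/p"}:
        return ("mode", "paused")
    # While no task is active, any other line becomes the task,
    # even one that ends with a question mark.
    if not has_task:
        task = text[5:].strip() if lower.startswith("task:") else text
        return ("task", task)
    if lower.endswith("?"):
        return ("user_vqa_query", text)
    return ("user_interjection", text)


first = classify_line("Is the cup full?", has_task=False)
later = classify_line("Is the cup full?", has_task=True)
prefixed = classify_line("task: stack the blocks", has_task=False)
resume = classify_line("/run now", has_task=True)
```

The first two calls show the one surprising case: a question typed before any task exists is taken as the task, not routed to VQA.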
@@ -0,0 +1,205 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PI052 runtime loop.
+
+Threads the multi-rate inference pipeline together with a stdin REPL
+event collector, drives ticks through :class:`TickClock`, and prints
+state-change updates to the user.
+"""
+
+from __future__ import annotations
+
+import logging
+from collections import deque
+from dataclasses import dataclass, field
+from typing import Any, Callable
+
+from .runtime_state import initial_runtime_state, push_log
+from .steps import (
+    AskVQAFwd,
+    DispatchAction,
+    DispatchToolCalls,
+    HighLevelSubtaskFwd,
+    InferenceStep,
+    LowLevelForward,
+    MemoryUpdateFwd,
+)
+from .triggers import EventTrigger, HzTrigger, TickClock
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class PI052Runtime:
+    """Compose the inference pipeline and drive it tick-by-tick."""
+
+    policy: Any
+    tools: dict[str, Any] = field(default_factory=dict)
+    """Name → tool-instance dict, e.g. ``{"say": SayTool(...)}``. Read
+    from :func:`lerobot.tools.get_tools(meta)` when wiring the
+    runtime."""
+    observation_provider: Callable[[], dict | None] | None = None
+    """Closure returning the current preprocessed observation batch.
+    ``None`` for dry-run / language-only sessions."""
+    robot_executor: Callable[[Any], None] | None = None
+    """Closure that takes one action chunk and forwards it to the
+    robot. ``None`` for dry-run."""
+    event_collector: Callable[[dict], None] | None = None
+    """Per-tick hook that polls external sources (stdin, network) and
+    appends event names to ``state["events_this_tick"]``."""
+    chunk_hz: float = 4.0
+    ctrl_hz: float = 50.0
+    high_level_hz: float = 1.0
+    max_rate_hz: float = 50.0
+
+    pipeline: list[InferenceStep] = field(init=False)
+    state: dict[str, Any] = field(init=False)
+    _stop: bool = field(default=False, init=False)
+
+    def __post_init__(self) -> None:
+        # Subtask + memory + VQA configuration. Pipeline:
+        #
+        #   HighLevelSubtaskFwd → generate the next subtask via the LM
+        #                         head at ~``high_level_hz``; writes
+        #                         ``current_subtask`` and emits
+        #                         ``subtask_change`` on a transition.
+        #   MemoryUpdateFwd     → on ``subtask_change``, refresh
+        #                         ``current_memory`` from the
+        #                         ``memory_update`` head.
+        #   AskVQAFwd           → answer camera-grounded stdin questions.
+        #   LowLevelForward     → action chunk conditioned on the
+        #                         generated ``current_subtask``.
+        #   DispatchAction      → drain the chunk to the robot.
+        #   DispatchToolCalls   → fire any pending tool calls.
+        #
+        # Order matters: ``HighLevelSubtaskFwd`` must run before
+        # ``MemoryUpdateFwd`` so the event is visible the same tick, and
+        # both must run before ``LowLevelForward`` (which is gated on
+        # "action queue empty") so the chunk consumes the freshest
+        # subtask. ``UserInterjectionFwd`` is still importable but
+        # disabled until plan generation is wired in.
+        self.pipeline = [
+            HighLevelSubtaskFwd(
+                trigger=HzTrigger(self.high_level_hz),
+                policy=self.policy,
+                observation_provider=self.observation_provider,
+            ),
+            # Listens for the ``subtask_change`` event raised by
+            # ``HighLevelSubtaskFwd`` and refreshes ``current_memory``.
+            MemoryUpdateFwd(
+                trigger=EventTrigger("subtask_change"),
+                policy=self.policy,
+                observation_provider=self.observation_provider,
+            ),
+            AskVQAFwd(
+                policy=self.policy,
+                observation_provider=self.observation_provider,
+            ),
+            LowLevelForward(
+                trigger=HzTrigger(self.chunk_hz),
+                policy=self.policy,
+                observation_provider=self.observation_provider,
+            ),
+            DispatchAction(
+                trigger=HzTrigger(self.ctrl_hz),
+                robot_executor=self.robot_executor,
+            ),
+            DispatchToolCalls(tools=self.tools),
+        ]
+        self.state = initial_runtime_state()
+
+    # ------------------------------------------------------------------
+    # Lifecycle
+    # ------------------------------------------------------------------
+
+    def set_task(self, task: str) -> None:
+        """Set or replace the active task. Logged for the REPL."""
+        self.state["task"] = task
+        push_log(self.state, f"Task: {task}")
+
+    def stop(self) -> None:
+        self._stop = True
+
+    def run(self, *, max_ticks: int | None = None) -> None:
+        """Main loop. Returns when ``stop()`` is called or after
+        ``max_ticks`` ticks (useful for tests / dry-run)."""
+        clock = TickClock(max_rate_hz=self.max_rate_hz)
+        while not self._stop:
+            tick = clock.advance()
+            self.state["_tick"] = tick
+            self.state["events_this_tick"] = []
+            self.state["log_lines"] = []
+
+            if self.event_collector is not None:
+                self.event_collector(self.state)
+            if self.state.get("stop"):
+                self._stop = True
+                break
+
+            for step in self.pipeline:
+                self.state = step(self.state)
+
+            self._flush_logs()
+            if max_ticks is not None and tick.index >= max_ticks:
+                break
+
+        self._on_shutdown()
+
+    # ------------------------------------------------------------------
+    # REPL helper: drive one full pipeline pass and return its logs
+    # ------------------------------------------------------------------
+
+    def step_once(self) -> list[str]:
+        """Run one tick of the pipeline and return the log lines.
+
+        Used by the interactive REPL: instead of a background thread,
+        the CLI drives ticks synchronously after each user input. Logs
+        are returned (not printed) so the caller can route them into
+        the rich-Live chat scrollback.
+        """
+        from .triggers import Tick  # noqa: PLC0415
+
+        # Synthesize a tick. We don't need the real wall-clock pacing
+        # here (the REPL drives the runtime, not vice versa), but
+        # ``HzTrigger`` uses ``tick.monotonic_seconds`` to gate, so we
+        # stamp the tick with the current monotonic time instead of
+        # advancing a paced ``TickClock``.
+        import time as _time  # noqa: PLC0415
+
+        prev_index = self.state.get("_tick").index if isinstance(self.state.get("_tick"), Tick) else 0
+        self.state["_tick"] = Tick(index=prev_index + 1, monotonic_seconds=_time.monotonic())
+        self.state["log_lines"] = []
+        # ``events_this_tick`` is set up by the caller before
+        # ``step_once`` (the REPL pushes user-driven events first).
+        self.state.setdefault("events_this_tick", [])
+
+        for step in self.pipeline:
+            self.state = step(self.state)
+
+        return list(self.state.get("log_lines") or [])
+
+    # ------------------------------------------------------------------
+    # I/O
+    # ------------------------------------------------------------------
+
+    def _flush_logs(self) -> None:
+        for line in self.state.get("log_lines") or []:
+            print(f"[pi052] {line}", flush=True)
+
+    def _on_shutdown(self) -> None:
+        # Drain any queued action chunks safely.
+        queue = self.state.get("action_queue")
+        if isinstance(queue, deque):
+            queue.clear()
+        print("[pi052] runtime stopped", flush=True)
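Reviewer note on the "order matters" comment in ``PI052Runtime.__post_init__``: a toy sketch of why the subtask step has to precede the memory step within a tick. ``subtask_step``, ``memory_step``, ``run_tick`` and the ``proposed_subtask`` key are hypothetical stand-ins; only ``take_event`` and the per-tick reset of ``events_this_tick`` follow the diff.

```python
def take_event(state, name):
    """Pop ``name`` from this tick's events; True if it was present."""
    events = state.get("events_this_tick") or []
    if name in events:
        events.remove(name)
        return True
    return False


def subtask_step(state):
    # Stand-in for HighLevelSubtaskFwd: raise an event on a transition.
    new = state["proposed_subtask"]
    if new != state.get("current_subtask"):
        state["current_subtask"] = new
        state["events_this_tick"].append("subtask_change")
    return state


def memory_step(state):
    # Stand-in for MemoryUpdateFwd: only acts when the event is visible.
    if take_event(state, "subtask_change"):
        state["current_memory"] = f"finished up to: {state['current_subtask']}"
    return state


def run_tick(pipeline, state):
    state["events_this_tick"] = []  # events never carry over between ticks
    for step in pipeline:
        state = step(state)
    return state


good = run_tick([subtask_step, memory_step], {"proposed_subtask": "grasp cup"})
bad = run_tick([memory_step, subtask_step], {"proposed_subtask": "grasp cup"})
```

With the steps reversed, the event is raised after its consumer has already run, and because the list is reset at the start of the next tick the memory refresh is lost, not merely delayed.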
@@ -0,0 +1,95 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Runtime state passed between inference steps each tick.
+
+The runtime threads a single dict through the pipeline; this module
+documents the shape and provides factories. We use a plain ``dict``
+rather than a frozen dataclass because steps freely add and remove
+keys (``events_this_tick``, ``messages_pending``, ``tool_calls_pending``,
+…) and dataclass field churn would just get in the way.
+
+Stable keys (read by multiple steps):
+
+  task          str             the current top-level task
+  current_plan  str | None      latest plan emitted by the planner
+  current_subtask str | None    latest subtask the policy is executing
+  current_memory str | None     latest compressed memory
+  recent_interjection str | None  most recent user interjection text (consumed)
+
+  action_queue  collections.deque[Tensor]  pending action chunks
+  tool_calls_pending list[dict]  parsed but not-yet-dispatched tool calls
+
+  events_this_tick list[str]    triggers consumed this tick
+  _tick         Tick            current tick (set by the loop)
+
+  mode          str             "action" (run the robot) | "paused"
+                                 (action loop stopped — robot holds)
+
+  log_lines     list[str]       human-readable status lines printed each tick
+"""
+
+from __future__ import annotations
+
+from collections import deque
+from typing import Any
+
+
+def initial_runtime_state(task: str | None = None) -> dict[str, Any]:
+    """Build a fresh runtime state dict with sensible defaults."""
+    return {
+        "task": task,
+        "current_plan": None,
+        "current_subtask": None,
+        "current_memory": None,
+        "recent_interjection": None,
+        "action_queue": deque(),
+        "tool_calls_pending": [],
+        "events_this_tick": [],
+        "log_lines": [],
+        "mode": "action",
+        "stop": False,
+    }
+
+
+def take_event(state: dict[str, Any], event_name: str) -> bool:
+    """Pop ``event_name`` from ``events_this_tick`` if present.
+
+    Steps that consume an event call this so the same event doesn't
+    re-fire on a sibling step within the same tick.
+    """
+    events: list[str] = state.get("events_this_tick") or []
+    if event_name in events:
+        events.remove(event_name)
+        return True
+    return False
+
+
+def push_log(state: dict[str, Any], line: str) -> None:
+    """Append ``line`` to the per-tick log buffer; the runtime prints
+    it at the end of the tick."""
+    state.setdefault("log_lines", []).append(line)
+
+
+def set_if_changed(state: dict[str, Any], key: str, value: Any, label: str | None = None) -> bool:
+    """Update ``state[key]`` and log a diff line if the value changed.
+
+    Returns ``True`` if the value actually changed.
+    """
+    prev = state.get(key)
+    if prev == value:
+        return False
+    state[key] = value
+    if label is not None:
+        push_log(state, f"  {label}: {value}")
+    return True
@@ -0,0 +1,955 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Inference steps for the PI052 multi-rate runtime.
+
+Each step is a tiny class with a ``trigger`` and an ``__call__(state)``;
+the runtime applies them in order each tick. When a step's trigger
+doesn't fire, the step is a no-op and the runtime moves on.
+
+Stream-to-step mapping mirrors the ``subtasks_vqa.yaml`` recipe:
+
+* ``LowLevelForward``        — calls ``policy.predict_action_chunk``
+                                for the action chunk; trained by
+                                ``low_level_execution``
+* ``EnqueueChunk``           — pushes the chunk to ``action_queue``
+* ``DispatchAction``         — pops one action per control tick and
+                                forwards to the robot
+* ``HighLevelSubtaskFwd``    — calls ``policy.select_message`` for the
+                                next subtask; trained by
+                                ``high_level_subtask``
+* ``MemoryUpdateFwd``        — fires on subtask boundary; trained by
+                                ``memory_update``
+* ``UserInterjectionFwd``    — fires on stdin interjection; trained by
+                                ``user_interjection_response``
+* ``AskVQAFwd``              — fires on stdin question; trained by
+                                ``ask_vqa_*``
+* ``DispatchToolCalls``      — pops ``tool_calls_pending`` and calls
+                                the matching ``Tool`` instance
+"""
+
+from __future__ import annotations
+
+import logging
+import re
+from dataclasses import dataclass, field
+from typing import Any
+
+from .runtime_state import push_log, set_if_changed, take_event
+from .triggers import EventTrigger, HzTrigger, Trigger
+
+logger = logging.getLogger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# Step base + runner
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class InferenceStep:
+    """A trigger-gated callable. Subclasses override :meth:`run`."""
+
+    trigger: Trigger
+
+    def __call__(self, state: dict[str, Any]) -> dict[str, Any]:
+        if not self.trigger.should_fire(state["_tick"], state):
+            return state
+        return self.run(state) or state
+
+    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:  # pragma: no cover
+        raise NotImplementedError
+
+
+# ---------------------------------------------------------------------------
+# Low-level (action) path
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class LowLevelForward(InferenceStep):
+    """Run the policy's action head and produce one action chunk."""
+
+    policy: Any = None
+    observation_provider: Any = None
+    """Callable ``() -> dict``: returns the current observation batch
+    (already preprocessed). Typically wraps the robot's camera /
+    proprio reads. ``None`` in dry-run mode → step skips."""
+
+    trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=4.0))
+
+    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
+        if self.policy is None or self.observation_provider is None:
+            return None
+        # ``/vlm`` mode pauses the whole action loop so the robot holds
+        # position while the operator probes the VLM with VQA.
+        if state.get("mode", "action") != "action":
+            return None
+        if not state.get("task"):
+            return None
+
+        # PI052 produces *action chunks* (typically 50 steps via
+        # flow-matching). Every step gets dispatched to the robot;
+        # popping one per dispatch tick is essentially free. Only
+        # generate a new chunk once the previous one has fully
+        # drained — this is the canonical "sense → think → act"
+        # loop. Refreshing while a chunk is still queued causes the
+        # new chunk to "telescope" past the old one (planned from an
+        # observation that's already 25+ steps stale by the time it
+        # starts dispatching).
+        queue = state.setdefault("action_queue", [])
+        if len(queue) > 0:
+            return None
+
+        observation = self.observation_provider()
+        if observation is None:
+            return None
+
+        # The action expert is conditioned on the SUBTASK generated by
+        # the high-level loop (``HighLevelSubtaskFwd`` runs earlier in
+        # the pipeline and writes ``current_subtask``). Matches the
+        # training-time ``low_level_execution`` recipe — ``user(${subtask})``.
+        # Falls back to the task string only on the very first frame,
+        # before the high-level loop has produced a subtask.
+        subtask = state.get("current_subtask") or state.get("task") or ""
+        ctx = [{"role": "user", "content": subtask}]
+        # ``add_generation_prompt=False`` to match the training-time
+        # prefix shape: at training the action expert sees the rendered
+        # user turn ending at ``<|im_end|>`` (no trailing
+        # ``<|im_start|>assistant\n``). Passing True here would append
+        # extra role-marker tokens the action expert never saw during
+        # training.
+        text_batch = _build_text_batch(self.policy, ctx, add_generation_prompt=False)
+        from lerobot.utils.constants import (  # noqa: PLC0415
+            OBS_LANGUAGE_ATTENTION_MASK,
+            OBS_LANGUAGE_TOKENS,
+        )
+
+        observation = dict(observation)
+        observation[OBS_LANGUAGE_TOKENS] = text_batch["lang_tokens"]
+        observation[OBS_LANGUAGE_ATTENTION_MASK] = text_batch["lang_masks"]
+
+        try:
+            # ``predict_action_chunk`` returns the *full* chunk shape
+            # ``(batch, n_action_steps, action_dim)``. Enqueue every
+            # step so DispatchAction at ctrl_hz can drain them
+            # smoothly until the next refresh.
+            chunk = self.policy.predict_action_chunk(observation)
+        except Exception as exc:  # noqa: BLE001
+            logger.warning(
+                "predict_action_chunk failed: %s",
+                exc,
+                exc_info=logger.isEnabledFor(logging.DEBUG),
+            )
+            push_log(
+                state,
+                f"  [warn] predict_action_chunk failed: "
+                f"{type(exc).__name__}: {exc}",
+            )
+            return None
+
+        # ``chunk`` shape: ``(batch, n_action_steps, action_dim)``. Push
+        # each step as a ``(1, action_dim)`` tensor so the existing
+        # action executor's batch-squeeze logic works unchanged.
+        if chunk.ndim == 3:
+            chunk_iter = chunk[0]  # ``(n_action_steps, action_dim)``
+        elif chunk.ndim == 2:
+            chunk_iter = chunk
+        else:
+            chunk_iter = chunk.unsqueeze(0)
+
+        for step in chunk_iter:
+            queue.append(step.unsqueeze(0))
+        state["last_chunk_size"] = int(chunk_iter.shape[0])
+        return None
+
+
+@dataclass
+class DispatchAction(InferenceStep):
+    """Pop one action per tick and hand it to the robot.
+
+    In dry-run mode (``robot_executor=None``) the step still pops the
+    queue so it doesn't grow unbounded — the popped tensor is logged
+    instead of executed.
+
+    Wall-clock catch-up: the action queue represents an open-loop
+    trajectory at a fixed step rate (``trigger.hz`` ≈ ``ctrl_hz``).
+    When the main loop stalls — e.g. an LLM call for the high-level
+    subtask blocks for ~2 s on MPS — the dispatch trigger fires only
+    once over that whole interval. Naively popping a single entry per
+    fire makes the robot lag further and further behind the planned
+    timeline, and a 50-step chunk would take ~125 s to drain instead
+    of ~1.7 s. Track real elapsed time between dispatches and pop
+    ``round(elapsed * hz)`` entries, sending the most recent one. The
+    skipped intermediate joint targets are stale anyway; the Dynamixel
+    servos will smooth toward the latest goal position.
+    """
+
+    robot_executor: Any = None
+    trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=50.0))
+    _last_dispatch_t: float | None = field(default=None, init=False)
+
+    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
+        import time as _time  # noqa: PLC0415
+
+        # ``/vlm`` mode pauses dispatch — the robot holds its last
+        # commanded position while the operator runs VQA.
+        if state.get("mode", "action") != "action":
+            self._last_dispatch_t = None
+            return None
+
+        queue = state.get("action_queue")
+        if not queue:
+            # Reset wall-clock anchor when the queue is empty so the
+            # next chunk doesn't see a huge fake "elapsed" window.
+            self._last_dispatch_t = None
+            return None
+
+        now = _time.monotonic()
+        hz = getattr(self.trigger, "hz", 30.0)
+        if self._last_dispatch_t is None or hz <= 0:
+            n_to_pop = 1
+        else:
+            elapsed = now - self._last_dispatch_t
+            # ``max(1, ...)`` so we always pop at least one when the
+            # trigger fires; ``min(len(queue), ...)`` so we don't run
+            # off the end of the chunk.
+            n_to_pop = max(1, min(len(queue), int(round(elapsed * hz))))
+        self._last_dispatch_t = now
+
+        # Drain ``n_to_pop`` stale entries, keep only the latest as the
+        # action actually sent. The intermediate joint targets would
+        # all be ~10–30 ms apart in chunk time — the robot can't track
+        # them individually anyway when the host loop is slow.
+        latest = None
+        for _ in range(n_to_pop):
+            if not queue:
+                break
+            latest = queue.popleft() if hasattr(queue, "popleft") else queue.pop(0)
+            state["actions_dispatched"] = state.get("actions_dispatched", 0) + 1
+
+        if latest is not None and self.robot_executor is not None:
+            self.robot_executor(latest)
+        return None
+
+
+# ---------------------------------------------------------------------------
+# High-level (text) paths — all use policy.select_message
+# ---------------------------------------------------------------------------
+
+
+_LOC_TOKENIZER_CACHE: dict[str, Any] = {}
+
+
+def _get_loc_tokenizer(tok_name: str, auto_tokenizer_cls: Any, register_loc_fn: Any) -> Any:
+    """Return a loc-token-registered tokenizer, loading from disk only once.
+
+    ``AutoTokenizer.from_pretrained`` + loc-token registration is expensive and
+    the result is immutable, so cache per ``tok_name``.
+    """
+    tokenizer = _LOC_TOKENIZER_CACHE.get(tok_name)
+    if tokenizer is None:
+        tokenizer = register_loc_fn(auto_tokenizer_cls.from_pretrained(tok_name))
+        _LOC_TOKENIZER_CACHE[tok_name] = tokenizer
+    return tokenizer
+
+
+def _build_text_batch(
+    policy: Any,
+    prompt_messages: list[dict[str, Any]],
+    *,
+    add_generation_prompt: bool = True,
+) -> dict[str, Any]:
+    """Tokenize chat messages into the batch ``select_message`` expects.
+
+    PI052's backbone (PaliGemma) ships no chat template, so we train on
+    a plain role-prefixed concatenation built by
+    ``PI052TextTokenizerStep``. We reuse that exact formatter so the
+    inference prefix matches training; ``add_generation_prompt`` appends
+    the bare ``Assistant: `` header the LM head continues from.
+    """
+    import torch  # noqa: PLC0415
+    from transformers import AutoTokenizer  # noqa: PLC0415
+
+    from lerobot.policies.pi052.text_processor_pi052 import (  # noqa: PLC0415
+        _flatten_say_tool_calls,
+        _format_messages,
+        _strip_blocks,
+        register_paligemma_loc_tokens,
+    )
+
+    tok_name = (
+        getattr(policy.config, "tokenizer_name", None) or "google/paligemma-3b-pt-224"
+    )
+    # Register PaliGemma's <locDDDD> tokens so inference encoding /
+    # decoding sees them as single vocab ids — must match training.
+    # The tokenizer is read-only after registration, so cache it: rebuilding it
+    # from disk on every call dominated eval runtime (this runs twice per env
+    # per replan — subtask gen + action prompt).
+    tokenizer = _get_loc_tokenizer(tok_name, AutoTokenizer, register_paligemma_loc_tokens)
+
+    messages = [_strip_blocks(_flatten_say_tool_calls(m)) for m in prompt_messages]
+    prompt, _spans = _format_messages(messages)
+    if add_generation_prompt:
+        prompt = prompt + "Assistant: "
+
+    encoded = tokenizer(prompt, return_tensors="pt")
+    ids = encoded["input_ids"]
+    attn = encoded.get("attention_mask")
+    if attn is None and tokenizer.pad_token_id is not None:
+        attn = ids != tokenizer.pad_token_id
+    if attn is not None and hasattr(attn, "dtype") and attn.dtype != torch.bool:
+        attn = attn.bool()
+
+    # Move tokens onto the policy's device — otherwise prefix embedding
+    # raises a device-mismatch on every forward (CPU tensor vs MPS / CUDA
+    # model), which the caller's broad except would swallow silently.
+    device = getattr(getattr(policy, "config", None), "device", None)
+    if device is not None:
+        try:
+            ids = ids.to(device)
+            if attn is not None and hasattr(attn, "to"):
+                attn = attn.to(device)
+        except Exception as exc:  # noqa: BLE001
+            logger.debug("could not move pi052 lang tokens to %s: %s", device, exc)
+    return {"lang_tokens": ids, "lang_masks": attn, "tokenizer": tokenizer}
+
+
+def _strip_recipe_keys(m: dict[str, Any]) -> dict[str, Any]:
+    new = dict(m)
+    new.pop("stream", None)
+    new.pop("target", None)
+    return new
+
+
+@dataclass
+class HighLevelSubtaskFwd(InferenceStep):
+    """At ~1 Hz, ask the policy for the next subtask.
+
+    Mirrors the ``high_level_subtask`` recipe layout exactly:
+
+        user:   "${task}\\nPlan: ${plan}\\nMemory: ${memory}"
+        user:   "Current subtask: ${subtask}"        (if subtask present)
+        ↓ generate ↓
+        assistant: <next subtask>
+    """
+
+    policy: Any = None
+    observation_provider: Any = None
+    """Same shape as ``LowLevelForward.observation_provider``. When
+    set, the resulting observation is merged into ``select_message``'s
+    batch so text generation runs against real video + state."""
+
+    trigger: Trigger = field(default_factory=lambda: HzTrigger(hz=1.0))
+
+    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
+        if self.policy is None or not state.get("task"):
+            return None
+        # ``/vlm`` mode pauses subtask generation along with the rest of
+        # the action loop.
+        if state.get("mode", "action") != "action":
+            return None
+        # Gate to chunk boundaries: only generate a fresh subtask when
+        # the action queue is empty (i.e. right before LowLevelForward
+        # refreshes the chunk). ``select_message`` takes ~2 s on MPS,
+        # and running it every loop iteration starves DispatchAction
+        # at ctrl_hz=30 — the queue drains at ~0.4 actions/sec instead
+        # of 30/sec and the robot barely moves. Tying it to the same
+        # "queue empty" condition as the chunk refresh produces a
+        # clean sense → think → act cycle.
+        #
+        # Rearm the trigger when skipping so a low-hz schedule
+        # (e.g. ``--high_level_hz=0.2`` = once per 5 s) doesn't lose
+        # the slot: the trigger fires once on the timer but the brief
+        # queue-empty window almost never coincides, so without rearm
+        # HL would effectively never run.
+        queue = state.get("action_queue") or []
+        if len(queue) > 0:
+            if hasattr(self.trigger, "rearm"):
+                self.trigger.rearm()
+            return None
+        # Per-chunk-boundary throttle: ``_hl_chunks_until_gen`` counts
+        # down the "queue empty" moments left to skip, so subtask gen
+        # fires once every ``subtask_chunks_per_gen`` boundaries. Lets
+        # the operator run e.g. 5 action chunks per subtask-gen so the
+        # LM head doesn't churn every 1.7 s (a fresh subtask while the
+        # previous one is still being executed is wasted compute *and*
+        # causes the action expert's flow trajectory to be re-planned
+        # mid-grasp).
+        chunks_per_gen = max(1, int(state.get("subtask_chunks_per_gen", 1) or 1))
+        # Initialise to 0 so the first chunk boundary fires immediately.
+        # After each generation the counter resets to
+        # ``chunks_per_gen - 1`` and decrements once per skipped
+        # boundary until it reaches 0 again.
+        if "_hl_chunks_until_gen" not in state:
+            state["_hl_chunks_until_gen"] = 0
+        if state["_hl_chunks_until_gen"] > 0:
+            state["_hl_chunks_until_gen"] -= 1
+            if hasattr(self.trigger, "rearm"):
+                self.trigger.rearm()
+            return None
+        state["_hl_chunks_until_gen"] = chunks_per_gen - 1
+        ctx = _msgs_for_subtask(state)
+        observation = _maybe_observation(self.observation_provider)
+        # Default: greedy argmax, no min_new_tokens, no special-token
+        # suppression — matches training. Operator can override via
+        # ``--text_min_new_tokens=N --text_temperature=T --text_top_p=P``
+        # on the CLI; useful for under-trained checkpoints whose LM
+        # head still favours EOS at position 0 (pre-trained chat
+        # backbone's short-turn prior hasn't been fully overridden
+        # by the fine-tuning supervision yet).
+        msg = _generate_with_policy(
+            self.policy,
+            ctx,
+            observation=observation,
+            state=state,
+            label="subtask gen",
+            min_new_tokens=int(state.get("text_gen_min_new_tokens") or 0),
+            temperature=float(state.get("text_gen_temperature") or 0.0),
+            top_p=float(state.get("text_gen_top_p") or 1.0),
+            # Subtasks never legitimately contain PaliGemma ``<loc>``
+            # tokens — suppress them so a checkpoint whose LM head
+            # has drifted toward the pretrained loc-prior falls back
+            # to its (still-correct) text mass.
+            suppress_loc_tokens=True,
+        )
+        # Diagnostics: surface what the model is *actually* producing
+        # at chunk boundaries, even when the output gets rejected or
+        # repeats. Memorisation collapse looks like "same accepted
+        # subtask N times in a row" or "gibberish_count rising while
+        # current_subtask is stuck". The state panel renders these.
+        state["last_subtask_raw"] = msg or ""
+        # Persistent empty completion is its own failure mode (model
+        # immediately EOS-es from the chat-template generation
+        # prompt) — surface it once every N occurrences so the
+        # operator can distinguish "generation failing silently"
+        # from "generating fine but filter rejecting".
+        if not msg:
+            empties = state.get("subtask_empty_count", 0) + 1
+            state["subtask_empty_count"] = empties
+            if empties == 1 or empties % 5 == 0:
+                debug = getattr(self.policy, "_last_select_message_debug", "") or ""
+                if debug:
+                    push_log(
+                        state,
+                        f"  [info] subtask gen empty (×{empties}); {debug}",
+                    )
+                else:
+                    push_log(
+                        state,
+                        f"  [info] subtask gen returned empty (×{empties}) — "
+                        "no tokens generated (head EOS-ing before any "
+                        "non-special token).",
+                    )
+        if msg and _looks_like_gibberish(msg):
+            # Bump a counter so the operator can see the model is
+            # struggling without spamming the log every tick. A first
+            # rejection still logs once so the failure is visible.
+            count = state.get("subtask_gibberish_count", 0) + 1
+            state["subtask_gibberish_count"] = count
+            if count == 1 or count % 30 == 0:
+                push_log(
+                    state,
+                    f"  [info] subtask gen rejected (gibberish ×{count}): {msg[:60]!r}",
+                )
+            return None
+        if msg:
+            prev_subtask = state.get("current_subtask")
+            changed = set_if_changed(state, "current_subtask", msg, label="subtask")
+            if changed:
+                # Stash the just-completed subtask so ``MemoryUpdateFwd``
+                # can drop it into its prompt as ``Completed subtask:``
+                # — the recipe binds ``completed_subtask`` to
+                # ``nth_prev(style=subtask, offset=1)``, i.e. the subtask
+                # that was active *before* the change.
+                if prev_subtask:
+                    state["prior_subtask"] = prev_subtask
+                # Subtask change is a downstream trigger.
+                state.setdefault("events_this_tick", []).append("subtask_change")
+                state["subtask_repeat_count"] = 0
+            else:
+                # Same accepted string regenerated — memorisation tell.
+                # Once this counter climbs past a few, you're seeing
+                # the model unable to move past the current subtask
+                # despite the chunk having drained (visual scene may
+                # have changed but the LM is replaying training
+                # tokens).
+                state["subtask_repeat_count"] = (
+                    state.get("subtask_repeat_count", 0) + 1
+                )
+        # Empty completions leave ``current_subtask`` untouched. They
+        # are common when the model warms up or generates only EOS, so
+        # the rate-limited counter above is their only log output.
+        return None
+
+
+@dataclass
+class MemoryUpdateFwd(InferenceStep):
+    """On subtask boundary, refresh the compressed memory.
+
+    Mirrors the ``memory_update`` recipe layout exactly:
+
+        user:      "${task}"
+        assistant: "Previous memory: ${prior_memory}"   (if prior memory)
+        user:      "Completed subtask: ${completed_subtask}"  (if subtask)
+        ↓ generate ↓
+        assistant: <new memory>
+    """
+
+    policy: Any = None
+    observation_provider: Any = None
+    trigger: Trigger = field(default_factory=lambda: EventTrigger("subtask_change"))
+
+    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
+        # Don't consume the event — multiple steps may want to react.
+        if self.policy is None:
+            return None
+        ctx = _msgs_for_memory(state)
+        observation = _maybe_observation(self.observation_provider)
+        new_memory = _generate_with_policy(
+            self.policy,
+            ctx,
+            observation=observation,
+            state=state,
+            label="memory gen",
+            suppress_loc_tokens=True,
+        )
+        state["last_memory_raw"] = new_memory or ""
+        if new_memory and _looks_like_gibberish(new_memory):
+            count = state.get("memory_gibberish_count", 0) + 1
+            state["memory_gibberish_count"] = count
+            push_log(
+                state,
+                f"  [info] memory gen rejected (gibberish ×{count}): {new_memory[:60]!r}",
+            )
+            return None
+        if new_memory:
+            set_if_changed(state, "current_memory", new_memory, label="memory")
+        return None
+
+
+@dataclass
+class UserInterjectionFwd(InferenceStep):
+    """On stdin interjection, refresh the plan + emit a paired ``say``.
+
+    Mirrors the ``user_interjection_response`` recipe layout exactly:
+
+        user:      "${task}"
+        assistant: "Previous plan:\\n${prior_plan}"   (if prior plan)
+        user:      "${interjection}"                  (the new utterance)
+        ↓ generate ↓
+        assistant: <plan + <say>...</say>>
+    """
+
+    policy: Any = None
+    observation_provider: Any = None
+    trigger: Trigger = field(default_factory=lambda: EventTrigger("user_interjection"))
+
+    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
+        if self.policy is None or not take_event(state, "user_interjection"):
+            return None
+        ctx = _msgs_for_interjection(state)
+        observation = _maybe_observation(self.observation_provider)
+        out = _generate_with_policy(
+            self.policy,
+            ctx,
+            observation=observation,
+            state=state,
+            label="plan/say gen",
+            suppress_loc_tokens=True,
+        )
+        if not out:
+            # Don't log every empty completion — happens repeatedly on
+            # MPS during warm-up and floods the panel. The user can
+            # re-trigger by typing again.
+            return None
+        if _looks_like_gibberish(out):
+            count = state.get("plan_gibberish_count", 0) + 1
+            state["plan_gibberish_count"] = count
+            push_log(
+                state,
+                f"  [info] plan/say gen rejected (gibberish ×{count}): {out[:60]!r}",
+            )
+            return None
+        # Heuristic split: model is trained to emit one assistant turn
+        # carrying both plan text AND a `say` tool call. Look for a
+        # "<say>...</say>" or "say(...)" marker; fall back to whole
+        # text → plan, no speech.
+        plan_text, speech_text = _split_plan_and_say(out)
+        if plan_text and _looks_like_gibberish(plan_text):
+            plan_text = ""
+        if plan_text:
+            set_if_changed(state, "current_plan", plan_text, label="plan")
+        if speech_text:
+            push_log(state, f"  speech: {speech_text}")
+            state.setdefault("tool_calls_pending", []).append(
+                {
+                    "type": "function",
+                    "function": {"name": "say", "arguments": {"text": speech_text}},
+                }
+            )
+            state.setdefault("events_this_tick", []).append("tool_call_pending")
+        # Mark interjection consumed.
+        state["recent_interjection"] = None
+        return None
+
+
+@dataclass
+class AskVQAFwd(InferenceStep):
+    """On stdin question, answer a frame-grounded VQA.
+
+    Mirrors the ``ask_vqa_*`` recipe layout exactly: a single user
+    turn carrying just the VQA question, plus the camera image block
+    in training (we drop the image at inference because the dataset's
+    image preprocessing doesn't match SmolVLM's vision tower input).
+
+        user:   <question>
+        ↓ generate ↓
+        assistant: <vqa answer>
+    """
+
+    policy: Any = None
+    observation_provider: Any = None
+    trigger: Trigger = field(default_factory=lambda: EventTrigger("user_vqa_query"))
+
+    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
+        if self.policy is None or not take_event(state, "user_vqa_query"):
+            return None
+        question = state.get("recent_vqa_query")
+        if not question:
+            return None
+        ctx = _msgs_for_vqa(question)
+        observation = _maybe_observation(self.observation_provider)
+        answer = _generate_with_policy(
+            self.policy,
+            ctx,
+            observation=observation,
+            state=state,
+            label="vqa gen",
+        )
+        # VQA answers are intentionally JSON-like during training, so
+        # ``_looks_like_gibberish`` would false-positive on them. Keep
+        # the answer as-is — the VQA panel line lets the user judge.
+        if answer:
+            push_log(state, f"  vqa: {answer}")
+        state["recent_vqa_query"] = None
+        return None
+
+
+# ---------------------------------------------------------------------------
+# Tool dispatch
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class DispatchToolCalls(InferenceStep):
+    """Pop ``tool_calls_pending`` and execute them via :data:`TOOL_REGISTRY`."""
+
+    tools: dict[str, Any] = field(default_factory=dict)
+    trigger: Trigger = field(default_factory=lambda: EventTrigger("tool_call_pending"))
+
+    def run(self, state: dict[str, Any]) -> dict[str, Any] | None:
+        take_event(state, "tool_call_pending")
+        pending = state.get("tool_calls_pending") or []
+        for call in pending:
+            try:
+                fn = (call or {}).get("function") or {}
+                name = fn.get("name")
+                args = fn.get("arguments") or {}
+                tool = self.tools.get(name)
+                if tool is None:
+                    push_log(state, f"  [warn] tool {name!r} not registered — skipping call")
+                    continue
+                tool.call(args)
+            except Exception as exc:  # noqa: BLE001
+                push_log(state, f"  [error] tool dispatch failed: {exc}")
+        state["tool_calls_pending"] = []
+        return None
+
+
+# ---------------------------------------------------------------------------
+# Helpers
+# ---------------------------------------------------------------------------
+
+
+def _looks_like_gibberish(text: str) -> bool:
+    """Heuristically detect generation that's clearly off the rails.
+
+    Memorised models can collapse to dominant-mode outputs when the
+    prompt drifts even slightly from training distribution. Reject:
+
+    * empty / whitespace-only
+    * too few alphabetic characters (mostly punctuation)
+    * at most two distinct characters, longer than four characters
+    * starts with ``":`` and has more double quotes than spaces
+    * too few unique tokens — e.g. ``"the"``, ``"the the the"``,
+      ``"Ass\\n::\\nthe"`` (the collapse seen on real-robot frames
+      where the model emits one or two memorised tokens repeatedly)
+    * chat-template fragment leakage (``Assistant:``, ``User:``,
+      ``Ass\\n``)
+
+    Real subtasks look like ``"close the gripper to grasp the blue
+    cube"`` — multiple unique alphabetic tokens, no role-marker
+    fragments. Anything materially shorter than that is rejected.
+    """
+    if not text or not text.strip():
+        return True
+    stripped = text.strip()
+    alpha = sum(1 for c in stripped if c.isalpha())
+    if alpha < max(3, len(stripped) // 8):
+        return True
+    if stripped.startswith('":') and stripped.count('"') > stripped.count(" "):
+        return True
+    # One or two distinct chars repeated: e.g. ``""""""``.
+    if len(set(stripped)) <= 2 and len(stripped) > 4:
+        return True
+    # Chat-template fragment leakage — the model emits ``Ass``,
+    # ``Assistant:``, ``User:``, often with extra newlines/colons.
+    # Reject if the cleaned text is mostly role-marker shards.
+    cleaned = stripped.replace("\n", " ").replace(":", " ")
+    for marker in ("Assistant", "User", "Ass "):
+        if marker in cleaned and len(cleaned.split()) < 4:
+            return True
+    tokens = [t for t in cleaned.split() if any(c.isalpha() for c in t)]
+    unique_alpha = {t.lower() for t in tokens}
+    # Short degenerate output — model stuck on ``the`` or a couple of
+    # memorised single-token continuations.
+    if len(unique_alpha) < 3 and len(stripped) < 80:
+        return True
+    # Long repetition collapse — the LM head loops an n-gram for the
+    # whole generation budget ("the arm the arm … the the the the").
+    # Length-independent: many tokens but a tiny unique ratio. The
+    # earlier ``< 80`` check missed these because the looped string
+    # blows well past 80 chars.
+    if len(tokens) >= 8 and len(unique_alpha) <= max(3, len(tokens) // 10):
+        return True
+    return False
+
+
+def _control_context_messages(
+    state: dict[str, Any],
+    *,
+    include_completed: bool = False,
+    extra_user: str | None = None,
+) -> list[dict[str, Any]]:
+    """Build a chat-template-ready prompt from current runtime state.
+
+    Mirrors what ``subtasks_vqa.yaml`` renders into ``${task}\\nPlan:
+    ${plan}\\nMemory: ${memory}`` for the high-level branches.
+    """
+    # Always emit ``Plan: `` / ``Memory: `` labels — even with empty
+    # values — to mirror the training-time recipe substitution.
+    task = state.get("task") or ""
+    plan = state.get("current_plan") or ""
+    memory = state.get("current_memory") or ""
+    parts = [task, f"Plan: {plan}", f"Memory: {memory}"]
+    if include_completed and state.get("current_subtask"):
+        parts.append(f"Completed subtask: {state['current_subtask']}")
+    head = "\n".join(parts)
+    msgs: list[dict[str, Any]] = [{"role": "user", "content": head}]
+    if extra_user:
+        msgs.append({"role": "user", "content": extra_user})
+    return msgs
+
+
+# ---------------------------------------------------------------------------
+# Per-recipe prompt builders. Each one mirrors a single sub-recipe's
+# message layout in ``subtasks_vqa.yaml`` so the chat-templated
+# prompt at inference matches what the model saw during training.
+# The generic ``_control_context_messages`` is kept as a fallback
+# for ad-hoc callers, but the four high-level steps now use these.
+# ---------------------------------------------------------------------------
+
+
+def _hirobot_user_head(state: dict[str, Any]) -> str:
+    """Build the ``task\\nPlan: …\\nMemory: …`` user content string.
+
+    Mirrors what the recipe renders at training time, where
+    ``language_render._substitute`` substitutes empty strings for
+    missing ``${plan}`` / ``${memory}`` bindings — i.e. the
+    ``Plan: `` / ``Memory: `` prefix labels are *always* in the
+    user turn, even when their values aren't set yet. Skipping them
+    here (the previous behaviour) produced a different prompt shape
+    on early frames before plan / memory are populated and on
+    samples where the dataset has no plan / memory annotation.
+    """
+    task = state.get("task") or ""
+    plan = state.get("current_plan") or ""
+    memory = state.get("current_memory") or ""
+    return f"{task}\nPlan: {plan}\nMemory: {memory}"
+
+
+def _msgs_for_subtask(state: dict[str, Any]) -> list[dict[str, Any]]:
+    """``high_level_subtask`` recipe layout — predict the subtask from the
+    task. The v-current recipe's user turn is just ``${task}`` (plan and
+    memory are not trained), so the inference prompt is the bare task —
+    no ``Plan: `` / ``Memory: `` lines.
+    """
+    return [{"role": "user", "content": state.get("task") or ""}]
+
+
+def _msgs_for_memory(state: dict[str, Any]) -> list[dict[str, Any]]:
+    """Memory-update prompt — mirrors ``memory_update`` recipe layout.
+
+    Recipe layout (``subtask_mem.yaml``):
+
+        user:      "${task}"
+        assistant: "Previous memory: ${prior_memory}"     (if_present prior)
+        user:      "Completed subtask: ${completed}"       (if_present completed)
+        assistant: → predicts new memory
+
+    Fired by ``MemoryUpdateFwd`` on a ``subtask_change`` event:
+    ``state['current_memory']`` is the memory the policy last emitted
+    (= the ``prior_memory`` binding at training), and
+    ``state['prior_subtask']`` is the subtask that just got replaced
+    (= the ``completed_subtask`` binding at training).
+    """
+    msgs: list[dict[str, Any]] = [
+        {"role": "user", "content": state.get("task") or ""},
+    ]
+    prior_memory = state.get("current_memory")
+    if prior_memory:
+        msgs.append(
+            {"role": "assistant", "content": f"Previous memory: {prior_memory}"}
+        )
+    completed_subtask = state.get("prior_subtask")
+    if completed_subtask:
+        msgs.append(
+            {"role": "user", "content": f"Completed subtask: {completed_subtask}"}
+        )
+    return msgs
+
+
+def _msgs_for_interjection(state: dict[str, Any]) -> list[dict[str, Any]]:
+    """``user_interjection_response`` recipe layout."""
+    msgs: list[dict[str, Any]] = [
+        {"role": "user", "content": state.get("task") or ""}
+    ]
+    if state.get("current_plan"):
+        msgs.append(
+            {"role": "assistant", "content": f"Previous plan:\n{state['current_plan']}"}
+        )
+    interjection = state.get("recent_interjection")
+    if interjection:
+        msgs.append({"role": "user", "content": interjection})
+    return msgs
+
+
+def _msgs_for_plan(state: dict[str, Any]) -> list[dict[str, Any]]:
+    """``plan_generation`` recipe layout — bare task → plan.
+
+    The assistant turn is the generation target, so we only render
+    the user turn at inference; the runtime appends the predicted
+    plan after sampling.
+    """
+    return [{"role": "user", "content": state.get("task") or ""}]
+
+
+def _msgs_for_vqa(question: str) -> list[dict[str, Any]]:
+    """``ask_vqa_*`` recipe layout (text-only at inference)."""
+    return [{"role": "user", "content": question}]
+
+
+def _maybe_observation(provider: Any) -> dict | None:
+    """Pull one observation from ``provider`` if it's set, else ``None``.
+
+    Errors from the provider are logged at debug level and swallowed —
+    text generation still runs (in text-only mode) so a flaky frame
+    source doesn't kill the REPL.
+    """
+    if provider is None:
+        return None
+    try:
+        return provider()
+    except Exception as exc:  # noqa: BLE001
+        logger.debug("observation_provider raised %s — falling back to text-only", exc)
+        return None
+
+
+def _generate_with_policy(
+    policy: Any,
+    messages: list[dict[str, Any]],
+    *,
+    observation: dict | None = None,
+    state: dict[str, Any] | None = None,
+    label: str = "select_message",
+    min_new_tokens: int = 0,
+    temperature: float = 0.0,
+    top_p: float = 1.0,
+    suppress_loc_tokens: bool = False,
+) -> str:
+    """Drive ``policy.select_message`` with a chat batch (and optional obs).
+
+    When ``observation`` carries ``observation.images.*`` and
+    ``observation.state``, those are merged into the batch so
+    ``select_message`` runs the same VLM prefix the policy was trained
+    on. Without an observation the runtime falls back to a text-only
+    prompt — the text head still runs, but generations may drift from
+    the training distribution.
+
+    Failures are surfaced both to the module logger (``warning``) and,
+    when ``state`` is given, to the runtime's user-visible log via
+    :func:`push_log`, so the REPL no longer "looks dead" when
+    something goes wrong inside generation.
+    """
+    if not hasattr(policy, "select_message"):
+        if state is not None:
+            push_log(state, f"  [warn] policy has no select_message — skipping {label}")
+        return ""
+    try:
+        text_batch = _build_text_batch(policy, messages)
+        from lerobot.utils.constants import (  # noqa: PLC0415
+            OBS_LANGUAGE_ATTENTION_MASK,
+            OBS_LANGUAGE_TOKENS,
+        )
+
+        batch: dict[str, Any] = {
+            OBS_LANGUAGE_TOKENS: text_batch["lang_tokens"],
+            OBS_LANGUAGE_ATTENTION_MASK: text_batch["lang_masks"],
+        }
+        if observation:
+            for k, v in observation.items():
+                if isinstance(k, str) and k.startswith("observation.") and k not in batch:
+                    batch[k] = v
+        kwargs: dict[str, Any] = {
+            "tokenizer": text_batch["tokenizer"],
+            "min_new_tokens": min_new_tokens,
+            "temperature": temperature,
+            "top_p": top_p,
+        }
+        kwargs["suppress_loc_tokens"] = suppress_loc_tokens
+        return policy.select_message(batch, **kwargs)
+    except Exception as exc:  # noqa: BLE001
+        logger.warning("%s failed: %s", label, exc, exc_info=logger.isEnabledFor(logging.DEBUG))
+        if state is not None:
+            push_log(state, f"  [warn] {label} failed: {type(exc).__name__}: {exc}")
+        return ""
+
+
+_SAY_RE = re.compile(r"<\s*say\s*>(.*?)<\s*/\s*say\s*>", re.IGNORECASE | re.DOTALL)
+
+
+def _split_plan_and_say(text: str) -> tuple[str, str]:
+    """Pull a ``<say>...</say>`` snippet out of ``text``; remainder is plan.
+
+    The training-time tool-call serializer wraps ``say(text="…")`` in a
+    deterministic textual marker so prefix-LM-style training learns to
+    emit it. The runtime parses it back here. If no marker is present,
+    the entire text is treated as plan with no speech.
+    """
+    if not text:
+        return "", ""
+    match = _SAY_RE.search(text)
+    if not match:
+        return text.strip(), ""
+    speech = match.group(1).strip().strip('"').strip("'")
+    plan = (text[: match.start()] + text[match.end() :]).strip()
+    return plan, speech
@@ -0,0 +1,134 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Trigger primitives for PI052's multi-rate inference runtime.
+
+Mirrors the plan's Section "Runtime orchestration": each
+``InferenceStep`` is gated by a :class:`Trigger` that decides per tick
+whether the step fires. Two trigger flavours cover all the cadences
+the canonical recipe needs:
+
+* :class:`HzTrigger` for periodic beats (action chunks at ~3-5 Hz,
+  high-level subtask generation at ~1 Hz, action dispatch at ~50 Hz)
+* :class:`EventTrigger` for one-shot reactions (subtask boundary →
+  memory update; user interjection → plan refresh; user VQA query →
+  vqa answer; pending tool call → dispatcher)
+
+Triggers are stateless except for ``HzTrigger``'s last-fire timestamp.
+The runtime stores the :class:`Tick` clock as ``state["_tick"]`` so
+every step shares a single time source.
+"""
+
+from __future__ import annotations
+
+import time
+from dataclasses import dataclass, field
+from typing import Any, Protocol
+
+
+@dataclass
+class Tick:
+    """Single tick from :class:`TickClock`. Carries time references the
+    runtime steps consume to gate themselves."""
+
+    index: int
+    """Monotonic counter — increments by one per tick."""
+
+    monotonic_seconds: float
+    """``time.monotonic()`` at the start of this tick."""
+
+
+@dataclass
+class TickClock:
+    """Drives the runtime loop at up to ``max_rate_hz``.
+
+    Sleeps just enough between :meth:`advance` calls to enforce the
+    rate. With ``max_rate_hz=50`` the loop wakes ~every 20ms; the
+    higher-level ``HzTrigger`` slices that timeline into sub-cadences.
+    """
+
+    max_rate_hz: float = 50.0
+    _index: int = field(default=0, init=False)
+    _last_seconds: float | None = field(default=None, init=False)
+
+    def advance(self) -> Tick:
+        period = 1.0 / max(self.max_rate_hz, 0.1)
+        now = time.monotonic()
+        if self._last_seconds is not None:
+            sleep_for = (self._last_seconds + period) - now
+            if sleep_for > 0:
+                time.sleep(sleep_for)
+                now = time.monotonic()
+        self._last_seconds = now
+        self._index += 1
+        return Tick(index=self._index, monotonic_seconds=now)
+
+
+class Trigger(Protocol):
+    """Decide whether the next ``InferenceStep`` should fire."""
+
+    def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool: ...
+
+
+@dataclass
+class HzTrigger:
+    """Fire at most ``hz`` times per second.
+
+    A step that gates further (e.g. ``HighLevelSubtaskFwd`` skipping
+    when the action queue is non-empty) and wants the trigger to
+    retry next tick instead of waiting a full period can call
+    :meth:`rearm` from inside ``run``. Without this, a low-hz trigger
+    (e.g. ``hz=0.2`` = once per 5 s) almost never coincides with the
+    brief queue-empty window and the step never fires at all.
+    """
+
+    hz: float
+    _last_seconds: float | None = field(default=None, init=False)
+
+    def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool:
+        period = 1.0 / max(self.hz, 1e-6)
+        if self._last_seconds is None or (tick.monotonic_seconds - self._last_seconds) >= period:
+            self._last_seconds = tick.monotonic_seconds
+            return True
+        return False
+
+    def rearm(self) -> None:
+        """Mark the trigger as not having fired, so the next tick re-evaluates.
+
+        Used by a step that decided to skip after ``should_fire`` already
+        committed the firing — keeps the cadence honest without losing
+        the slot.
+        """
+        self._last_seconds = None
+
+
+@dataclass
+class EventTrigger:
+    """Fire when ``event_name`` is in ``state["events_this_tick"]``.
+
+    The runtime fills ``events_this_tick`` once per tick from:
+
+    * stdin / network input (``user_interjection``, ``user_vqa_query``,
+      ``stop``)
+    * internal state transitions (``subtask_change``,
+      ``tool_call_pending``)
+
+    The list is consumed (cleared at the end of the tick) so events
+    fire at most once.
+    """
+
+    event_name: str
+
+    def should_fire(self, tick: Tick, state: dict[str, Any]) -> bool:
+        events: list[str] = state.get("events_this_tick") or []
+        return self.event_name in events
@@ -0,0 +1,127 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Rich-based REPL layout for the PI052 runtime.
+
+Two-zone terminal layout:
+
+    [chat scrollback — user messages / robot responses, scrolls naturally]
+
+    ┌── State ──────────────────────────────────────────┐
+    │ task     please clean up the kitchen              │
+    │ subtask  grasp the handle of the sponge           │
+    │ plan     1. grasp sponge  2. wipe  3. tidy        │
+    │ memory   sponge picked up; counter still dirty    │
+    └───────────────────────────────────────────────────┘
+    > _
+
+The state panel re-renders on every state change. Chat lines are
+``console.print``'d above the live region so they accumulate naturally
+in scrollback. Implemented with :class:`rich.live.Live` plus
+:meth:`rich.console.Console.input` for the prompt; when an input is
+pending, ``rich.Live`` auto-suspends so the input doesn't fight the
+panel for cursor position.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+try:  # rich is optional; only required for the interactive REPL.
+    from rich.console import Console
+    from rich.panel import Panel
+    from rich.table import Table
+    from rich.text import Text
+
+    _HAS_RICH = True
+except ImportError:  # pragma: no cover
+    _HAS_RICH = False
+    Console = Any  # type: ignore[assignment]
+    Panel = Any  # type: ignore[assignment]
+    Table = Any  # type: ignore[assignment]
+    Text = Any  # type: ignore[assignment]
+
+
+_STATE_KEYS = (
+    ("task", "task"),
+    ("current_subtask", "subtask"),
+    ("current_plan", "plan"),
+    ("current_memory", "memory"),
+)
+
+
+def make_state_panel(state: dict[str, Any]) -> Any:
+    """Render the persistent state panel for the live region.
+
+    Returns a :class:`rich.panel.Panel`. Caller passes it to
+    ``Live.update(panel)`` whenever the state changes.
+    """
+    if not _HAS_RICH:
+        raise RuntimeError(
+            "rich is required for the interactive REPL. "
+            "`pip install rich` (it's a transitive dep of lerobot)."
+        )
+    table = Table.grid(padding=(0, 2), expand=True)
+    table.add_column(justify="right", style="dim", no_wrap=True, width=10)
+    table.add_column(justify="left")
+    for key, label in _STATE_KEYS:
+        value = state.get(key)
+        if value is None:
+            rendered = Text("(not set)", style="dim italic")
+        else:
+            rendered = Text(str(value), style="bold")
+        table.add_row(label, rendered)
+    queue = state.get("action_queue")
+    queue_len = len(queue) if hasattr(queue, "__len__") else 0
+    pending = state.get("tool_calls_pending") or []
+    footer = Text.assemble(
+        ("queued actions: ", "dim"),
+        (str(queue_len), "bold cyan"),
+        ("    pending tool calls: ", "dim"),
+        (str(len(pending)), "bold magenta"),
+    )
+    table.add_row("", footer)
+    run_mode = state.get("mode", "action")
+    mode_tag = (
+        "[green]action[/]" if run_mode == "action" else "[yellow]paused[/]"
+    )
+    return Panel(
+        table,
+        title=f"[bold]PI052 state[/] · mode: {mode_tag}",
+        border_style="cyan",
+    )
+
+
+def print_user_line(console: Any, line: str) -> None:
+    """Append a user-typed line to the chat scrollback."""
+    if not _HAS_RICH:
+        print(f"you: {line}", flush=True)
+        return
+    console.print(Text.assemble(("you: ", "bold cyan"), line))
+
+
+def print_robot_lines(console: Any, lines: list[str]) -> None:
+    """Append robot/runtime log lines to the chat scrollback."""
+    if not _HAS_RICH:
+        for line in lines:
+            print(f"robot: {line.lstrip()}", flush=True)
+        return
+    for line in lines:
+        # The runtime uses leading whitespace + "label: text"; render a green
+        # ``robot`` tag, then the label dimmed in parentheses, then the value.
+        stripped = line.lstrip()
+        if ":" in stripped:
+            label, _, value = (p.strip() for p in stripped.partition(":"))
+            console.print(Text.assemble(("robot ", "bold green"), (f"({label}) ", "dim"), value))
+        else:
+            console.print(Text.assemble(("robot: ", "bold green"), stripped))
@@ -0,0 +1,423 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Interactive VQA for the PI052 runtime.
+
+In ``/vlm`` mode a typed line is treated as a VQA question. This module
+runs the full interactive flow:
+
+  1. pull the current observation and list available cameras,
+  2. ask the operator which camera to ground the question on,
+  3. generate the answer with the VLM conditioned on that one camera,
+  4. parse the answer (PaliGemma ``<loc>`` text or JSON); if it carries a
+     bounding box (``bbox``) or a point (``keypoint``), draw the overlay on
+     the camera frame, save a PNG to ``./vqa_overlays/`` and auto-open it.
+
+VQA answer schemas mirror the annotation pipeline's ``VQA_ANSWER_SHAPES``
+(see ``lerobot.annotations.steerable_pipeline.validator``):
+
+  * ``bbox``     — ``{"detections": [{"label", "bbox_format": "xyxy",
+                    "bbox": [x1, y1, x2, y2]}, ...]}``
+  * ``keypoint`` — ``{"label", "point_format": "xy", "point": [x, y]}``
+  * ``count`` / ``attribute`` / ``spatial`` — text-only, no overlay.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import os
+import re
+import subprocess
+import sys
+import time
+import webbrowser
+from pathlib import Path
+from typing import Any
+
+from .runtime_state import push_log
+
+logger = logging.getLogger(__name__)
+
+_IMAGE_PREFIX = "observation.images."
+
+# PaliGemma detection / pointing vocabulary. PI052 trains spatial VQA
+# answers in this native ``<locNNNN>`` format (index in [0, 1023],
+# normalized to the image axis) instead of pixel-coordinate JSON, so the
+# answer string the runtime parses can be e.g.
+# ``<loc0512><loc0301> blue cube`` (point) or
+# ``<loc0100><loc0080><loc0400><loc0360> blue cube`` (box).
+_LOC_RE = re.compile(r"<loc(\d{1,4})>")
+
+# Iteration order for shape matching — most specific keys first so an
+# answer is classified deterministically.
+_SHAPE_ORDER = ("bbox", "keypoint", "count", "attribute", "spatial")
+
+_BBOX_COLOR = (255, 64, 64)
+_POINT_COLOR = (64, 220, 64)
+
+
+# ---------------------------------------------------------------------------
+# Camera selection
+# ---------------------------------------------------------------------------
+
+
+def available_cameras(observation: dict | None) -> list[str]:
+    """Return the sorted ``observation.images.*`` keys present in ``observation``."""
+    if not observation:
+        return []
+    return sorted(k for k in observation if isinstance(k, str) and k.startswith(_IMAGE_PREFIX))
+
+
+def camera_short_name(camera_key: str) -> str:
+    """Strip the ``observation.images.`` prefix for display."""
+    return camera_key[len(_IMAGE_PREFIX) :] if camera_key.startswith(_IMAGE_PREFIX) else camera_key
+
+
+def prompt_camera_choice(
+    cameras: list[str],
+    *,
+    input_fn: Any = input,
+    print_fn: Any = print,
+) -> str | None:
+    """Ask the operator which camera frame to draw a VQA overlay on.
+
+    Accepts the menu number or the (short or full) camera name; an empty
+    reply picks the first camera. A single-camera setup auto-selects
+    without prompting. Returns the chosen ``observation.images.*`` key,
+    or ``None`` if the operator cancels or gives an invalid answer.
+    """
+    if not cameras:
+        return None
+    if len(cameras) == 1:
+        return cameras[0]
+    print_fn("Draw the result on which camera?")
+    for i, cam in enumerate(cameras, 1):
+        print_fn(f"  [{i}] {camera_short_name(cam)}")
+    try:
+        raw = str(input_fn("camera> ")).strip()
+    except (EOFError, KeyboardInterrupt):
+        return None
+    if not raw:
+        return cameras[0]
+    if raw.isdigit():
+        idx = int(raw) - 1
+        return cameras[idx] if 0 <= idx < len(cameras) else None
+    for cam in cameras:
+        if raw == cam or raw == camera_short_name(cam):
+            return cam
+    return None
+
+
+# ---------------------------------------------------------------------------
+# Answer parsing
+# ---------------------------------------------------------------------------
+
+
+def _loc_to_norm(idx: int) -> float:
+    """PaliGemma ``<locNNNN>`` index → normalized [0, 1] axis coordinate."""
+    return max(0.0, min(1023.0, float(idx))) / 1023.0
+
+
+def parse_loc_answer(answer: str) -> dict | None:
+    """Parse a PaliGemma ``<loc>``-format spatial VQA answer.
+
+    PI052 trains spatial answers in PaliGemma's native detection
+    vocabulary, label-first: a point is ``<label> <locY><locX>``, a box
+    is ``<label> <locY0><locX0><locY1><locX1>``, and multiple boxes are
+    joined by `` ; `` (e.g. ``cube <loc..><loc..><loc..><loc..> ; box
+    <loc..><loc..><loc..><loc..>``). Loc-first formats are also accepted
+    — this parser strips loc tokens and treats the remainder as the
+    label, so order is irrelevant. Coordinates come back *normalized*
+    ([0, 1]); the overlay denormalizes them against the chosen camera
+    frame's pixel size.
+
+    Returns ``{"kind", "payload", "normalized": True}`` on success
+    (``payload`` mirrors the JSON shapes so the overlay code is shared),
+    or ``None`` when the answer carries no ``<loc>`` tokens.
+    """
+    if not answer or "<loc" not in answer:
+        return None
+    segments = [seg for seg in answer.split(";") if "<loc" in seg]
+    points: list[tuple[float, float, str]] = []
+    boxes: list[tuple[float, float, float, float, str]] = []
+    for seg in segments:
+        locs = [int(m) for m in _LOC_RE.findall(seg)]
+        label = _LOC_RE.sub("", seg).strip()
+        if len(locs) == 2:
+            y, x = (_loc_to_norm(v) for v in locs[:2])
+            points.append((x, y, label))
+        elif len(locs) >= 4:
+            y1, x1, y2, x2 = (_loc_to_norm(v) for v in locs[:4])
+            boxes.append((x1, y1, x2, y2, label))
+    if boxes:
+        detections = [
+            {"label": lbl, "bbox_format": "xyxy", "bbox": [x1, y1, x2, y2]}
+            for (x1, y1, x2, y2, lbl) in boxes
+        ]
+        return {"kind": "bbox", "payload": {"detections": detections}, "normalized": True}
+    if len(points) == 1:
+        x, y, lbl = points[0]
+        return {
+            "kind": "keypoint",
+            "payload": {"label": lbl, "point_format": "xy", "point": [x, y]},
+            "normalized": True,
+        }
+    if points:  # several bare points → treat as detections-as-points
+        detections = [
+            {"label": lbl, "bbox_format": "xyxy", "bbox": [x, y, x, y]} for (x, y, lbl) in points
+        ]
+        return {"kind": "bbox", "payload": {"detections": detections}, "normalized": True}
+    return None
+
+
+def parse_vqa_answer(answer: str) -> dict | None:
+    """Parse a VQA answer string into ``{"kind", "payload"}``.
+
+    ``kind`` is one of the ``VQA_ANSWER_SHAPES`` names (``bbox``,
+    ``keypoint``, ``count``, ``attribute``, ``spatial``) or ``"unknown"``
+    when the JSON doesn't match any known shape. PaliGemma ``<loc>``
+    spatial answers are detected first (PI052 trains them in that native
+    format). Returns ``None`` when the answer is neither ``<loc>`` text
+    nor a parseable JSON object.
+    """
+    if not answer or not answer.strip():
+        return None
+    loc_parsed = parse_loc_answer(answer)
+    if loc_parsed is not None:
+        return loc_parsed
+    try:
+        payload = json.loads(answer)
+    except (ValueError, TypeError):
+        return None
+    if not isinstance(payload, dict):
+        return None
+
+    try:
+        from lerobot.annotations.steerable_pipeline.validator import (  # noqa: PLC0415
+            VQA_ANSWER_SHAPES,
+        )
+
+        shapes = VQA_ANSWER_SHAPES
+    except ImportError:  # pragma: no cover - annotation extra not installed
+        shapes = {
+            "bbox": {"detections"},
+            "keypoint": {"label", "point_format", "point"},
+            "count": {"label", "count"},
+            "attribute": {"label", "attribute", "value"},
+            "spatial": {"subject", "relation", "object"},
+        }
+
+    keys = set(payload)
+    for kind in _SHAPE_ORDER:
+        required = shapes.get(kind)
+        if required and required <= keys:
+            return {"kind": kind, "payload": payload}
+    return {"kind": "unknown", "payload": payload}
+
+
+def answer_has_overlay(parsed: dict | None) -> bool:
+    """True iff ``parsed`` carries drawable spatial coordinates."""
+    return bool(parsed) and parsed.get("kind") in ("bbox", "keypoint")
+
+
+# ---------------------------------------------------------------------------
+# Overlay drawing
+# ---------------------------------------------------------------------------
+
+
+def observation_image_to_pil(image_tensor: Any) -> Any:
+    """Convert an ``observation.images.*`` tensor to a PIL RGB image.
+
+    The runtime observation stores images as ``(1, C, H, W)`` (or
+    ``(C, H, W)``) float tensors in ``[0, 1]``. Reuses
+    ``image_array_to_pil_image`` which handles the CHW→HWC transpose and
+    the float→uint8 scaling.
+    """
+    from lerobot.datasets.image_writer import image_array_to_pil_image  # noqa: PLC0415
+
+    arr = image_tensor
+    if hasattr(arr, "detach"):
+        arr = arr.detach().cpu()
+    if hasattr(arr, "numpy"):
+        arr = arr.numpy()
+    while arr.ndim > 3:  # drop leading batch dim(s)
+        arr = arr[0]
+    return image_array_to_pil_image(arr).convert("RGB")
+
+
+def draw_vqa_overlay(image: Any, parsed: dict) -> Any:
+    """Draw ``bbox`` / ``keypoint`` answers onto a copy of ``image``.
+
+    Non-spatial answers (``count`` / ``attribute`` / ``spatial`` /
+    ``unknown``) are returned as an unmodified copy. When ``parsed`` has
+    ``normalized=True`` (PaliGemma ``<loc>`` answers) the [0, 1]
+    coordinates are scaled to the image's pixel size.
+    """
+    from PIL import ImageDraw  # noqa: PLC0415
+
+    img = image.convert("RGB").copy()
+    kind = parsed.get("kind")
+    payload = parsed.get("payload") or {}
+    draw = ImageDraw.Draw(img)
+    w, h = img.size
+    sx, sy = (w, h) if parsed.get("normalized") else (1, 1)
+
+    if kind == "bbox":
+        for det in payload.get("detections") or []:
+            if not isinstance(det, dict):
+                continue
+            box = det.get("bbox")
+            if not (isinstance(box, list | tuple) and len(box) == 4):
+                continue
+            try:
+                x1, y1, x2, y2 = (float(v) for v in box)
+            except (TypeError, ValueError):
+                continue
+            x1, x2 = x1 * sx, x2 * sx
+            y1, y2 = y1 * sy, y2 * sy
+            draw.rectangle([x1, y1, x2, y2], outline=_BBOX_COLOR, width=3)
+            label = str(det.get("label", "")).strip()
+            if label:
+                draw.text((x1 + 3, max(0.0, y1 - 12)), label, fill=_BBOX_COLOR)
+    elif kind == "keypoint":
+        point = payload.get("point")
+        if isinstance(point, list | tuple) and len(point) == 2:
+            try:
+                x, y = float(point[0]) * sx, float(point[1]) * sy
+            except (TypeError, ValueError):
+                return img
+            r = 6
+            draw.ellipse([x - r, y - r, x + r, y + r], outline=_POINT_COLOR, width=3)
+            draw.line([x - 2 * r, y, x + 2 * r, y], fill=_POINT_COLOR, width=2)
+            draw.line([x, y - 2 * r, x, y + 2 * r], fill=_POINT_COLOR, width=2)
+            label = str(payload.get("label", "")).strip()
+            if label:
+                draw.text((x + r + 3, y - r), label, fill=_POINT_COLOR)
+    return img
+
+
+def _open_file(path: Path) -> None:
+    """Best-effort open ``path`` in the OS default viewer."""
+    try:
+        if sys.platform == "darwin":
+            subprocess.run(["open", str(path)], check=False)
+        elif sys.platform.startswith("linux"):
+            subprocess.run(["xdg-open", str(path)], check=False)
+        elif os.name == "nt":
+            os.startfile(str(path))  # type: ignore[attr-defined]  # noqa: S606
+        else:  # pragma: no cover - exotic platform
+            webbrowser.open(path.resolve().as_uri())
+    except Exception as exc:  # noqa: BLE001
+        logger.debug("could not auto-open %s: %s", path, exc)
+
+
+def save_and_open_overlay(image: Any, out_dir: str | Path = "./vqa_overlays") -> Path:
+    """Save ``image`` as a timestamped PNG under ``out_dir`` and auto-open it."""
+    out = Path(out_dir)
+    out.mkdir(parents=True, exist_ok=True)
+    path = out / f"vqa_{int(time.time() * 1000)}.png"
+    image.save(path)
+    _open_file(path)
+    return path
+
+
+# ---------------------------------------------------------------------------
+# Orchestrator
+# ---------------------------------------------------------------------------
+
+
+def handle_vqa_query(
+    *,
+    policy: Any,
+    observation_provider: Any,
+    question: str,
+    state: dict[str, Any],
+    input_fn: Any = input,
+    print_fn: Any = print,
+) -> None:
+    """Run one interactive VQA question end to end.
+
+    Called synchronously from the input layer while the runtime is in
+    ``/question`` mode (the action loop is gated off, so the policy is
+    not in concurrent use). Progress is reported via both
+    :func:`push_log` (REPL panel scrollback) and ``print_fn`` (direct
+    stdout) — in autonomous question mode the panel redraw is suspended,
+    so the direct print is what the operator actually sees.
+    """
+    from .steps import _generate_with_policy, _msgs_for_vqa  # noqa: PLC0415
+
+    def report(line: str) -> None:
+        """Surface a line both to the panel scrollback and to stdout."""
+        push_log(state, line)
+        try:
+            print_fn(line)
+        except Exception:  # noqa: BLE001
+            pass
+
+    if policy is None or not hasattr(policy, "select_message"):
+        report("  [warn] vqa: policy has no select_message — skipping")
+        return
+
+    observation: dict | None = None
+    if observation_provider is not None:
+        try:
+            observation = observation_provider()
+        except Exception as exc:  # noqa: BLE001
+            logger.debug("observation_provider raised %s", exc)
+
+    # Feed the FULL observation (every camera + state) to the VLM. The
+    # ``ask_vqa_*`` recipes look single-camera, but the image *block* is
+    # stripped before tokenization — the actual frames reach the model
+    # via PI052's ``OBS_IMAGES_*`` channels, and ``embed_prefix``
+    # consumes *all* ``config.image_features`` regardless of which
+    # camera the sub-recipe was tagged for. So the model always sees
+    # every camera; the operator never has to name one to ask.
+    answer = _generate_with_policy(
+        policy,
+        _msgs_for_vqa(question),
+        observation=observation,
+        state=state,
+        label="vqa gen",
+    )
+    if not answer:
+        report("  [info] vqa gen returned empty")
+        return
+    report(f"  vqa: {answer}")
+
+    parsed = parse_vqa_answer(answer)
+    if not answer_has_overlay(parsed):
+        if parsed is None:
+            report("  [info] vqa answer is not JSON — no overlay")
+        return
+
+    # The answer carries a bounding box / point. Its pixel coordinates
+    # are camera-specific and the text answer doesn't say which camera,
+    # so ask the operator *now* — only when there is actually something
+    # to draw — which camera frame to render the overlay on.
+    cameras = available_cameras(observation)
+    if observation is None or not cameras:
+        report("  [info] no camera image — cannot draw overlay")
+        return
+    chosen = prompt_camera_choice(cameras, input_fn=input_fn, print_fn=print_fn)
+    if chosen is None:
+        report("  [info] overlay skipped — no camera selected")
+        return
+    try:
+        pil = observation_image_to_pil(observation[chosen])
+        overlay = draw_vqa_overlay(pil, parsed)
+        path = save_and_open_overlay(overlay)
+        report(f"  vqa overlay ({camera_short_name(chosen)}) saved: {path}")
+    except Exception as exc:  # noqa: BLE001
+        logger.warning("vqa overlay failed: %s", exc, exc_info=logger.isEnabledFor(logging.DEBUG))
+        report(f"  [warn] vqa overlay failed: {type(exc).__name__}: {exc}")
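Review note (not part of the patch): the coordinate handling in ``draw_vqa_overlay`` comes down to a single scaling rule, which is easy to check in isolation. The sketch below assumes the same convention as the code above, i.e. boxes are ``(x1, y1, x2, y2)`` and ``normalized`` answers live in [0, 1]. The helper name ``scale_box`` is hypothetical and does not exist in the patch.

```python
def scale_box(box, image_size, normalized):
    """Return (x1, y1, x2, y2) in pixel coordinates.

    Hypothetical helper mirroring the sx / sy logic in draw_vqa_overlay.
    """
    w, h = image_size
    # Normalized answers are in [0, 1]; pixel answers pass through unchanged.
    sx, sy = (w, h) if normalized else (1, 1)
    x1, y1, x2, y2 = (float(v) for v in box)
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)


# A normalized box on a 640x480 frame.
pixel_box = scale_box([0.25, 0.5, 0.75, 1.0], (640, 480), normalized=True)
```

Keeping the scale factors at ``(1, 1)`` for pixel answers lets both answer kinds share the same drawing path, which appears to be the intent of the ``sx, sy`` line in the patch.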
@@ -0,0 +1,198 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""π0.5 v2 pre/post-processor factory.
+
+When ``config.recipe_path`` is set, the pre-processor pipeline becomes:
+
+    rename observations
+    add batch dim
+    relative-action prep      (inherited from π0.5)
+    NormalizerProcessorStep
+    RenderMessagesStep        — recipe → messages, target_message_indices,
+                                message_streams (PR 1 of the steerable
+                                stack)
+    PI052TextTokenizerStep    — messages → input_ids + label mask +
+                                predict_actions
+    DeviceProcessorStep
+
+When ``recipe_path`` is unset (``None`` or empty) we delegate to the plain
+π0.5 pipeline so unannotated datasets keep working.
+
+Post-processor is unchanged from π0.5.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+import torch
+
+from lerobot.configs.recipe import TrainingRecipe
+from lerobot.processor import (
+    AbsoluteActionsProcessorStep,
+    ActionTokenizerProcessorStep,
+    AddBatchDimensionProcessorStep,
+    DeviceProcessorStep,
+    NormalizerProcessorStep,
+    PolicyAction,
+    PolicyProcessorPipeline,
+    RelativeActionsProcessorStep,
+    RenameObservationsProcessorStep,
+    UnnormalizerProcessorStep,
+    policy_action_to_transition,
+    transition_to_policy_action,
+)
+# RenderMessagesStep is intentionally not re-exported from
+# ``lerobot.processor`` because it pulls in optional language-stack deps;
+# import it directly.
+from lerobot.processor.render_messages_processor import RenderMessagesStep
+from lerobot.utils.constants import POLICY_POSTPROCESSOR_DEFAULT_NAME, POLICY_PREPROCESSOR_DEFAULT_NAME
+
+from ..pi05.processor_pi05 import make_pi05_pre_post_processors
+from .configuration_pi052 import PI052Config
+from .text_processor_pi052 import PI052TextTokenizerStep
+
+
+def make_pi052_pre_post_processors(
+    config: PI052Config,
+    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
+    dataset_repo_id: str | None = None,
+) -> tuple[
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    """Build PI0.5-v2's pre/post-processor pipelines.
+
+    Falls through to π0.5's stock pipeline when ``recipe_path`` is unset.
+    """
+    if not config.recipe_path:
+        return make_pi05_pre_post_processors(config, dataset_stats=dataset_stats)
+
+    recipe = _load_recipe(config.recipe_path)
+
+    relative_step = RelativeActionsProcessorStep(
+        enabled=config.use_relative_actions,
+        exclude_joints=getattr(config, "relative_exclude_joints", []),
+        action_names=getattr(config, "action_feature_names", None),
+    )
+
+    input_steps = [
+        RenameObservationsProcessorStep(rename_map={}),
+        AddBatchDimensionProcessorStep(),
+        relative_step,
+        NormalizerProcessorStep(
+            features={**config.input_features, **config.output_features},
+            norm_map=config.normalization_mapping,
+            stats=dataset_stats,
+        ),
+        RenderMessagesStep(recipe=recipe),
+        PI052TextTokenizerStep(
+            tokenizer_name="google/paligemma-3b-pt-224",
+            max_length=config.tokenizer_max_length,
+            plan_dropout_prob=getattr(config, "plan_dropout_prob", 0.0),
+            memory_dropout_prob=getattr(config, "memory_dropout_prob", 0.0),
+            subtask_dropout_prob=getattr(config, "subtask_dropout_prob", 0.0),
+        ),
+    ]
+
+    # FAST tokenizer for discrete-action CE supervision (paper §III.C).
+    # Only inserted when explicitly enabled — keeps the post-training-
+    # style recipe (flow + text) as the default. When on, the step
+    # writes ACTION_TOKENS / ACTION_TOKEN_MASK into
+    # ``COMPLEMENTARY_DATA`` and the modeling forward picks them up.
+    if getattr(config, "enable_fast_action_loss", False):
+        # Per Pertsch et al. 2025 (FAST [64], π0.5 §III.C): fit the
+        # tokenizer on this dataset's action distribution rather than
+        # using the universal codebook off the shelf. We do this once
+        # and cache to disk, keyed on (dataset, base, n_samples).
+        action_tokenizer_path = config.action_tokenizer_name
+        if (
+            getattr(config, "auto_fit_fast_tokenizer", False)
+            and dataset_repo_id is not None
+        ):
+            from .fit_fast_tokenizer import fit_fast_tokenizer  # noqa: PLC0415
+
+            cache_dir = Path(config.fast_tokenizer_cache_dir).expanduser()
+            try:
+                action_tokenizer_path = fit_fast_tokenizer(
+                    dataset_repo_id=dataset_repo_id,
+                    cache_dir=cache_dir,
+                    base_tokenizer_name=config.action_tokenizer_name,
+                    n_samples=config.fast_tokenizer_fit_samples,
+                    chunk_size=config.chunk_size,
+                )
+            except Exception as exc:  # noqa: BLE001
+                import logging  # noqa: PLC0415
+
+                logging.getLogger(__name__).warning(
+                    "FAST tokenizer fit failed (%s); falling back to "
+                    "the universal base tokenizer %r. Training will still "
+                    "work but compression will be suboptimal.",
+                    exc, config.action_tokenizer_name,
+                )
+
+        input_steps.append(
+            ActionTokenizerProcessorStep(
+                action_tokenizer_name=action_tokenizer_path,
+                max_action_tokens=config.max_action_tokens,
+                fast_skip_tokens=config.fast_skip_tokens,
+                paligemma_tokenizer_name="google/paligemma-3b-pt-224",
+            )
+        )
+
+    input_steps.append(DeviceProcessorStep(device=config.device))
+
+    output_steps = [
+        UnnormalizerProcessorStep(
+            features=config.output_features,
+            norm_map=config.normalization_mapping,
+            stats=dataset_stats,
+        ),
+        AbsoluteActionsProcessorStep(
+            enabled=config.use_relative_actions,
+            relative_step=relative_step,
+        ),
+        DeviceProcessorStep(device="cpu"),
+    ]
+    return (
+        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+            steps=input_steps,
+            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+        ),
+        PolicyProcessorPipeline[PolicyAction, PolicyAction](
+            steps=output_steps,
+            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+            to_transition=policy_action_to_transition,
+            to_output=transition_to_policy_action,
+        ),
+    )
+
+
+def _load_recipe(path_str: str) -> TrainingRecipe:
+    """Resolve ``path_str`` to a ``TrainingRecipe``.
+
+    Accepts an absolute path, a path that exists relative to the current
+    working directory, or a path relative to ``src/lerobot/configs/``.
+    """
+    p = Path(path_str)
+    if not p.is_absolute() and not p.exists():
+        from lerobot.configs import recipe as _recipe_module  # noqa: PLC0415
+
+        configs_dir = Path(_recipe_module.__file__).resolve().parent
+        candidate = configs_dir / path_str
+        if candidate.exists():
+            p = candidate
+    return TrainingRecipe.from_yaml(p)
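Review note (not part of the patch): the lookup order in ``_load_recipe`` can be sketched without importing lerobot. The function below is a hypothetical stand-in that takes the configs directory as an explicit argument (the patch derives it from ``lerobot.configs.recipe.__file__``) and returns the resolved path instead of loading the YAML.

```python
from pathlib import Path


def resolve_recipe_path(path_str: str, configs_dir: Path) -> Path:
    """Hypothetical mirror of the path resolution in _load_recipe."""
    p = Path(path_str)
    # Absolute paths and paths that already exist (relative to the current
    # working directory) are used as given.
    if not p.is_absolute() and not p.exists():
        # Otherwise try the same relative path under the configs directory.
        candidate = configs_dir / path_str
        if candidate.exists():
            p = candidate
    # If nothing matched, the original path is returned and the loader
    # is left to report the missing file.
    return p
```

One consequence of this order is that a file in the current working directory shadows a packaged recipe with the same relative path.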
@@ -0,0 +1,641 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""π0.5 v2 text-tokenisation step.
+
+PaliGemma is *not* chat-pretrained, so we can't lean on
+``tokenizer.apply_chat_template``. Instead we concatenate the rendered
+messages as plain text with simple ``User: ... Assistant: ...`` role
+delimiters — matching the prompt format π0.5 uses in the paper
+(``Task: ... State: ... Action: ...``).
+
+Outputs:
+
+* ``OBS_LANGUAGE_TOKENS`` / ``OBS_LANGUAGE_ATTENTION_MASK`` — the
+  concatenated prompt tokenised by the PaliGemma tokenizer (the same
+  one ``processor_pi05`` already uses).
+* ``text_labels`` — same shape as token ids, ``-100`` everywhere except
+  positions belonging to messages whose index is in
+  ``target_message_indices``. ``modeling_pi052`` runs cross-entropy on
+  those positions via the PaliGemma ``lm_head``.
+* ``predict_actions`` — bool tensor, ``True`` iff any of the rendered
+  target messages has ``message_streams[i] == "low_level"``.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+from dataclasses import dataclass
+from typing import Any
+
+import numpy as np
+import torch
+from torch import Tensor
+
+from lerobot.configs import PipelineFeatureType, PolicyFeature
+from lerobot.processor.pipeline import ProcessorStep, ProcessorStepRegistry
+from lerobot.types import EnvTransition, TransitionKey
+from lerobot.utils.constants import OBS_LANGUAGE_ATTENTION_MASK, OBS_LANGUAGE_TOKENS, OBS_STATE
+
+logger = logging.getLogger(__name__)
+
+
+def discretize_state_str(state_row: Any) -> str:
+    """Discretize a single normalized state vector into 256 bins, space-joined.
+
+    Mirrors pi05's ``Pi05PrepareStateTokenizerProcessorStep`` (same bins /
+    convention) so pi052's low-level action prompt carries proprioception in
+    the exact format pi05 was trained on. Expects state already normalized by
+    the upstream ``NormalizerProcessorStep``.
+    """
+    arr = state_row.detach().cpu().numpy() if hasattr(state_row, "detach") else np.asarray(state_row)
+    disc = np.digitize(arr, bins=np.linspace(-1, 1, 256 + 1)[:-1]) - 1
+    return " ".join(str(int(x)) for x in disc.reshape(-1).tolist())
+
+
+def _state_row_at(state_all: Any, pos: int) -> Any:
+    """Select the per-sample state row from a (possibly batched) state tensor."""
+    if state_all is None:
+        return None
+    if hasattr(state_all, "ndim") and state_all.ndim >= 2:
+        return state_all[pos]
+    return state_all
+
+
+def _content_to_text(content: Any) -> str:
+    """Collapse a message's ``content`` (string or multimodal blocks) to text."""
+    if isinstance(content, str):
+        return content
+    if isinstance(content, list):
+        parts = [
+            b["text"]
+            for b in content
+            if isinstance(b, dict) and b.get("type") == "text" and isinstance(b.get("text"), str)
+        ]
+        return "\n".join(parts)
+    return ""
+
+
+def _flatten_say_tool_calls(message: dict[str, Any]) -> dict[str, Any]:
+    """Serialize assistant ``say`` tool calls into a ``<say>...</say>`` marker.
+
+    PaliGemma's flat text prompt has no notion of structured tool calls,
+    and ``_format_messages`` only reads ``role`` / ``content`` — so
+    without this a ``say`` tool call is dropped entirely and never
+    supervised. Rewriting it into the content text as a ``<say>...</say>``
+    marker lets the LM head learn to emit it; the runtime parses it back
+    via ``_split_plan_and_say``. Messages with no ``say`` call keep their
+    content unchanged; any other structured calls are still dropped.
+    """
+    tool_calls = message.get("tool_calls")
+    if not tool_calls:
+        return message
+    say_texts: list[str] = []
+    for call in tool_calls:
+        if not isinstance(call, dict):
+            continue
+        fn = call.get("function") or {}
+        if fn.get("name") != "say":
+            continue
+        args = fn.get("arguments")
+        if isinstance(args, str):
+            try:
+                import json  # noqa: PLC0415
+
+                args = json.loads(args)
+            except (ValueError, TypeError):
+                args = {}
+        text = args.get("text", "") if isinstance(args, dict) else ""
+        if text:
+            say_texts.append(str(text))
+    new = dict(message)
+    new.pop("tool_calls", None)
+    if not say_texts:
+        return new
+    base = _content_to_text(new.get("content")).strip()
+    marker = "".join(f"<say>{t}</say>" for t in say_texts)
+    new["content"] = f"{base}\n{marker}" if base else marker
+    return new
+
+
+def _strip_blocks(message: dict[str, Any]) -> dict[str, Any]:
+    """Normalise a message's content to a plain string.
+
+    The recipe renderer can emit ``content`` as a string OR as a list
+    of HF-style multimodal blocks (``{type: text, text: ...}``,
+    ``{type: image, feature: ...}``). PaliGemma's text tokenizer can
+    only consume strings, so we flatten: drop image blocks (cameras
+    flow through ``observation.images.*`` separately) and join text
+    block texts.
+    """
+    new = dict(message)
+    new.pop("stream", None)
+    new.pop("target", None)
+    content = new.get("content")
+    if content is None:
+        new["content"] = ""
+    elif isinstance(content, str):
+        pass
+    elif isinstance(content, list):
+        parts: list[str] = []
+        for block in content:
+            if not isinstance(block, dict):
+                continue
+            if block.get("type") == "text":
+                t = block.get("text", "")
+                if isinstance(t, str):
+                    parts.append(t)
+        new["content"] = "\n".join(parts)
+    else:
+        new["content"] = str(content)
+    return new
+
+
+def _is_batched_messages(messages: Any) -> bool:
+    return isinstance(messages, list) and bool(messages) and isinstance(messages[0], list)
+
+
+def _sample_indices(value: Any, batch_size: int) -> list[int | None]:
+    if value is None:
+        return [None] * batch_size
+    if isinstance(value, torch.Tensor):
+        if value.numel() == 1:
+            return [int(value.item())] * batch_size
+        values = value.reshape(-1).tolist()
+        return [int(v) for v in values[:batch_size]]
+    if isinstance(value, (list, tuple)):
+        if len(value) == 1:
+            return _sample_indices(value[0], batch_size)
+        return [int(v.item() if hasattr(v, "item") else v) for v in value[:batch_size]]
+    return [int(value)] * batch_size
+
+
+# ---------------------------------------------------------------------------
+# VQA spatial answers → PaliGemma <loc> format (PI052 only)
+#
+# PaliGemma is pre-trained on detection / pointing with a ``<locNNNN>``
+# vocabulary (normalized [0, 1023]). The recipe's bbox / keypoint VQA
+# answers are stored as JSON in Qwen2.5-VL's grounding convention:
+# **0–1000 normalized coordinates**, NOT pixels. (Verified empirically
+# on the published datasets: x and y both span 0..1000 with ~30% of
+# values exceeding the camera's pixel dimensions — they're not pixels.)
+# Converting to ``<loc>`` is therefore camera-resolution-independent:
+# ``loc_idx = round(coord / 1000 * 1023)``. We do the conversion here —
+# not in the dataset — so the dataset keeps the raw JSON and stays
+# backbone-agnostic.
+# ---------------------------------------------------------------------------
+
+# The 0–1000 scale Qwen2.5-VL emits for grounding coordinates.
+_VQA_COORD_SCALE = 1000.0
+
+
+def register_paligemma_loc_tokens(tokenizer: Any) -> Any:
+    """Encode ``<locDDDD>`` in raw text as PaliGemma's single loc-token ids.
+
+    PaliGemma reserves vocab ids [256000, 257023] for ``<locDDDD>``
+    (detection / pointing) tokens, but the *stock* tokenizer does NOT
+    match them when encoding raw text — it BPE-splits ``<loc0162>`` into
+    7 pieces (``<``, ``loc``, ``0``, ``1``, ``6``, ``2``, ``>``). Training
+    the LM head on a ``<loc>`` target then supervises those 7 generic
+    BPE pieces instead of one detection-vocab id, the LM head learns to
+    emit the *character sequence*, and those pieces' logits dominate
+    other turns (the ``<loc>``-salad on subtasks). Registering the loc
+    tokens once makes them tokenize as their single ids (256000+idx),
+    leveraging PaliGemma's detection prior properly. Idempotent.
+    """
+    if "<loc0000>" in getattr(tokenizer, "added_tokens_encoder", {}):
+        return tokenizer
+    tokenizer.add_tokens([f"<loc{i:04d}>" for i in range(1024)])
+    return tokenizer
+
+
+def _loc_token(coord: float, scale: float = _VQA_COORD_SCALE) -> str:
+    """PaliGemma ``<locNNNN>`` for a coord on a ``[0, scale]`` axis."""
+    idx = round(float(coord) / scale * 1023) if scale > 0 else 0
+    return f"<loc{max(0, min(1023, idx)):04d}>"
+
+
+def _vqa_answer_to_loc(answer: dict[str, Any]) -> str | None:
+    """Convert a bbox / keypoint VQA answer dict to PaliGemma ``<loc>`` text.
+
+    Input coordinates are in Qwen2.5-VL's 0–1000 normalized space (see
+    module-level note). y is emitted before x for each coordinate pair
+    (PaliGemma convention), with the integer indices in [0, 1023].
+
+    **Format: label first, locs after.** PaliGemma's pretraining puts
+    locs first (``<loc><loc> label``), but for our small-dataset VQA
+    blend that turns the LM head into a loc-emission attractor at every
+    ``Assistant:`` position — VQA targets share their first supervised
+    token with ~25% of all text samples, and the head collapses to
+    emitting ``<loc>`` regardless of the prompt. Putting the label
+    first (``label <locY><locX>``) means every text sample (subtask,
+    memory, VQA, …) starts the supervised target with a real word,
+    breaking the attractor. The model still learns the loc vocabulary
+    for the *spatial* portion of the answer; it just can't fire it as
+    the first generation step from a clean prompt.
+
+    Returns ``None`` for non-spatial answers (count / attribute /
+    spatial-relation) — those keep their JSON form.
+    """
+    point = answer.get("point")
+    if isinstance(point, list | tuple) and len(point) == 2 and "point_format" in answer:
+        try:
+            x, y = float(point[0]), float(point[1])
+        except (TypeError, ValueError):
+            return None
+        label = str(answer.get("label", "")).strip()
+        if not label:
+            return None
+        return f"{label} {_loc_token(y)}{_loc_token(x)}"
+
+    detections = answer.get("detections")
+    if isinstance(detections, list) and detections:
+        parts: list[str] = []
+        for det in detections:
+            if not isinstance(det, dict):
+                continue
+            box = det.get("bbox")
+            if not (isinstance(box, list | tuple) and len(box) == 4):
+                continue
+            try:
+                x1, y1, x2, y2 = (float(v) for v in box)
+            except (TypeError, ValueError):
+                continue
+            label = str(det.get("label", "")).strip()
+            if not label:
+                continue
+            toks = (
+                f"{_loc_token(y1)}{_loc_token(x1)}"
+                f"{_loc_token(y2)}{_loc_token(x2)}"
+            )
+            parts.append(f"{label} {toks}")
+        return " ; ".join(parts) if parts else None
+    return None
+
+
+def _messages_vqa_to_loc(
+    messages: list[dict[str, Any]],
+    target_indices: list[int],
+) -> list[dict[str, Any]]:
+    """Rewrite bbox / keypoint VQA *target* answers from JSON to ``<loc>`` text.
+
+    Each target turn whose content parses as a spatial VQA answer is
+    converted. Non-spatial answers and subtask / memory targets (plain
+    text → not JSON) are left untouched. Camera-independent: VQA coords
+    are 0–1000 normalized, so no observation lookup is needed.
+    """
+    if not target_indices:
+        return messages
+    out = list(messages)
+    for idx in target_indices:
+        if not (0 <= idx < len(out)):
+            continue
+        content = out[idx].get("content")
+        if not isinstance(content, str) or not content.strip():
+            continue
+        try:
+            answer = json.loads(content)
+        except (ValueError, TypeError):
+            continue  # subtask / memory targets are plain text — skip
+        if not isinstance(answer, dict):
+            continue
+        loc_text = _vqa_answer_to_loc(answer)
+        if loc_text is not None:
+            out[idx] = {**out[idx], "content": loc_text}
+    return out
+
+
+def _format_messages(
+    messages: list[dict[str, Any]],
+    target_indices: list[int] | None = None,
+    eos_token: str | None = None,
+) -> tuple[str, list[tuple[int, int]]]:
+    """Concatenate messages into the π0.5-style flat prompt.
+
+    When both ``target_indices`` and ``eos_token`` are given, the EOS
+    string is appended to each supervised target turn's content and the
+    returned span covers it — so the label builder marks the EOS token
+    as a supervised label. That teaches the LM head where the answer
+    *ends*: without an EOS in the target span the model is never given a
+    stop signal and rambles to ``max_length`` at inference. Inference
+    callers omit both args (no EOS baked into the prompt — the model
+    generates it and ``select_message`` stops on it).
+
+    Returns:
+        prompt:       the full text the tokenizer will consume.
+        msg_spans:    list of ``(char_start, char_end)`` covering each
+                      message's supervised payload (content, plus the
+                      appended EOS for target turns) within ``prompt``.
+    """
+    targets = set(target_indices or [])
+    parts: list[str] = []
+    spans: list[tuple[int, int]] = []
+    cursor = 0
+    for i, m in enumerate(messages):
+        role = m.get("role", "user")
+        content = m.get("content", "") or ""
+        # Role tag followed by a space; the newline comes after the body.
+        # The model has to learn to emit the same role tokens at
+        # generation time, which is fine for greedy decoding because the
+        # chat template is implicit in the supervised target span.
+        header = f"{role.capitalize()}: "
+        # A supervised target turn ends with EOS so the model learns to
+        # terminate; the span below covers content + EOS. Non-target
+        # turns (and inference) carry no EOS.
+        body = content + eos_token if (eos_token and i in targets) else content
+        # span covers the content (+ EOS) portion only — never the role
+        # tag — so labels are computed over the supervised payload.
+        full = header + body + "\n"
+        start = cursor + len(header)
+        end = start + len(body)
+        parts.append(full)
+        spans.append((start, end))
+        cursor += len(full)
+    return "".join(parts), spans
+
+
+@dataclass
+@ProcessorStepRegistry.register(name="pi052_text_tokenizer")
+class PI052TextTokenizerStep(ProcessorStep):
+    """Render messages → token ids + label mask + predict_actions flag.
+
+    No chat template; concatenates messages as
+    ``User: ... \\nAssistant: ...`` text.
+    """
+
+    tokenizer_name: str = "google/paligemma-3b-pt-224"
+    max_length: int = 200
+    padding: str = "max_length"
+    padding_side: str = "right"
+    plan_dropout_prob: float = 0.0
+    memory_dropout_prob: float = 0.0
+    subtask_dropout_prob: float = 0.0
+    interjection_dropout_prob: float = 0.0
+    dropout_seed: int | None = None
+
+    def __post_init__(self) -> None:
+        self._tokenizer: Any = None
+
+    def _ensure_tokenizer(self) -> Any:
+        if self._tokenizer is not None:
+            return self._tokenizer
+        from transformers import AutoTokenizer  # noqa: PLC0415
+
+        self._tokenizer = register_paligemma_loc_tokens(
+            AutoTokenizer.from_pretrained(self.tokenizer_name)
+        )
+        return self._tokenizer
+
+    # ------------------------------------------------------------------
+    # Pipeline step
+    # ------------------------------------------------------------------
+
+    def __call__(self, transition: EnvTransition) -> EnvTransition | None:
+        transition = transition.copy()
+        complementary = transition.get(TransitionKey.COMPLEMENTARY_DATA, {}) or {}
+        messages = complementary.get("messages") or []
+
+        if not messages:
+            # No recipe was rendered — caller will fall back to the
+            # plain Pi0.5 prompt path. We pass the transition through
+            # unmodified.
+            return transition
+
+        tokenizer = self._ensure_tokenizer()
+        # Normalized proprioceptive state (set by NormalizerProcessorStep, which
+        # runs before this step). Injected into low-level action prompts so the
+        # action expert sees proprioception, matching pi05's discretized State:.
+        state_all = (transition.get(TransitionKey.OBSERVATION) or {}).get(OBS_STATE)
+        # VQA coords are 0–1000 normalized (Qwen2.5-VL convention) — the
+        # <loc> conversion is camera-resolution-independent and needs no
+        # observation lookup here.
+        if _is_batched_messages(messages):
+            indices_iter = _sample_indices(complementary.get("index"), len(messages))
+            encoded = [
+                self._encode_messages(
+                    tokenizer,
+                    msg,
+                    list(streams),
+                    list(tgt_indices),
+                    complementary,
+                    sample_idx=int(s_idx) if s_idx is not None else None,
+                    state_row=_state_row_at(state_all, pos),
+                )
+                for pos, (msg, streams, tgt_indices, s_idx) in enumerate(
+                    zip(
+                        messages,
+                        complementary.get("message_streams") or [[] for _ in messages],
+                        complementary.get("target_message_indices") or [[] for _ in messages],
+                        indices_iter,
+                        strict=False,
+                    )
+                )
+            ]
+        else:
+            sample_idx = _sample_indices(complementary.get("index"), 1)[0]
+            encoded = [
+                self._encode_messages(
+                    tokenizer,
+                    messages,
+                    list(complementary.get("message_streams") or []),
+                    list(complementary.get("target_message_indices") or []),
+                    complementary,
+                    sample_idx=sample_idx,
+                    state_row=_state_row_at(state_all, 0),
+                )
+            ]
+
+        obs = dict(transition.get(TransitionKey.OBSERVATION) or {})
+        obs[OBS_LANGUAGE_TOKENS] = torch.stack([ids for ids, _, _, _, _ in encoded])
+        obs[OBS_LANGUAGE_ATTENTION_MASK] = torch.stack([attn for _, attn, _, _, _ in encoded])
+        transition[TransitionKey.OBSERVATION] = obs
+
+        transition[TransitionKey.COMPLEMENTARY_DATA] = {
+            **complementary,
+            "text_labels": torch.stack([labels for _, _, labels, _, _ in encoded]),
+            "predict_actions": torch.stack([pred for _, _, _, pred, _ in encoded]),
+        }
+        return transition
+
+    def _encode_messages(
+        self,
+        tokenizer: Any,
+        messages: list[dict[str, Any]],
+        message_streams: list[str | None],
+        target_indices: list[int],
+        complementary: dict[str, Any],
+        sample_idx: int | None = None,
+        state_row: Any = None,
+    ) -> tuple[Tensor, Tensor, Tensor, Tensor, str]:
+        # Optional: drop non-target messages per the dropout config.
+        # Keeps the supervised-target indices stable by re-mapping
+        # after removal.
+        if (
+            self.plan_dropout_prob
+            or self.memory_dropout_prob
+            or self.subtask_dropout_prob
+            or self.interjection_dropout_prob
+        ):
+            messages, target_indices = self._apply_prompt_dropout(
+                messages,
+                target_indices,
+                complementary,
+                sample_idx=sample_idx,
+            )
+
+        # Rewrite bbox / keypoint VQA target answers from JSON to
+        # PaliGemma <loc> text. Coords are 0–1000 normalized so this is
+        # camera-independent.
+        messages = _messages_vqa_to_loc(messages, target_indices)
+
+        # Flatten ``say`` tool calls into ``<say>...</say>`` text before
+        # stripping, so the spoken reply is actually tokenized and
+        # supervised (PaliGemma's flat prompt has no structured calls).
+        messages = [_strip_blocks(_flatten_say_tool_calls(m)) for m in messages]
+        # Low-level (action-conditioning) samples get the discretized state
+        # appended to their user message, mirroring pi05's
+        # "..., State: {256-bin};" so the action expert sees proprioception.
+        # Higher-level text streams (subtask/memory generation) stay state-free.
+        if state_row is not None and any(s == "low_level" for s in message_streams):
+            state_str = discretize_state_str(state_row)
+            for m in reversed(messages):
+                if m.get("role") == "user":
+                    base = _content_to_text(m.get("content", ""))
+                    m["content"] = f"{base}, State: {state_str};"
+                    break
+        # Append EOS to supervised target turns so the LM head learns to
+        # stop (the span covers it → it becomes a supervised label).
+        prompt, spans = _format_messages(
+            messages, target_indices, getattr(tokenizer, "eos_token", None)
+        )
+
+        encoded = tokenizer(
+            prompt,
+            max_length=self.max_length,
+            padding=self.padding,
+            truncation=True,
+            return_tensors="pt",
+            return_offsets_mapping=True,
+            padding_side=self.padding_side,
+        )
+
+        input_ids = encoded["input_ids"][0]
+        attention_mask = encoded["attention_mask"][0].bool()
+        offsets = encoded["offset_mapping"][0]  # (seq, 2), char (start,end)
+
+        # Build label mask: -100 everywhere except over supervised
+        # target message char ranges.
+        labels = torch.full_like(input_ids, fill_value=-100)
+        for idx in target_indices:
+            if idx >= len(spans):
+                continue
+            char_start, char_end = spans[idx]
+            for token_pos in range(input_ids.shape[0]):
+                if not attention_mask[token_pos]:
+                    continue
+                tok_start, tok_end = int(offsets[token_pos, 0]), int(offsets[token_pos, 1])
+                if tok_end <= char_start or tok_start >= char_end:
+                    continue
+                labels[token_pos] = input_ids[token_pos]
+
+        # Scan ALL message streams (not just targets): the
+        # ``low_level_execution`` recipe drops ``target: true`` on
+        # the assistant to avoid trivial copy-from-user text-CE; the
+        # flow loss still needs to fire, gated by ``stream: low_level``.
+        predict_actions = torch.tensor(
+            bool(any(s == "low_level" for s in message_streams)),
+            dtype=torch.bool,
+        )
+        return input_ids, attention_mask, labels, predict_actions, prompt
+
+    # ------------------------------------------------------------------
+    # Per-component prompt dropout (Pi0.7 §V.E)
+    # ------------------------------------------------------------------
+
+    def _apply_prompt_dropout(
+        self,
+        messages: list[dict[str, Any]],
+        target_indices: list[int],
+        complementary: dict[str, Any],
+        sample_idx: int | None = None,
+    ) -> tuple[list[dict[str, Any]], list[int]]:
+        """Drop messages classified as plan/memory/subtask context.
+
+        Targets are *never* dropped (they're the supervised payload).
+        Re-maps target_indices to the new positions after drops.
+        """
+        import random  # noqa: PLC0415
+
+        seed = self.dropout_seed
+        if seed is None:
+            # Canonical row-index key set by ``BatchProcessor`` /
+            # ``render_messages_processor``. Falling back to other
+            # keys silently gave every sample seed=0 → identical
+            # dropout pattern across the whole epoch.
+            seed_src = sample_idx if sample_idx is not None else complementary.get("index", 0)
+            try:
+                if hasattr(seed_src, "item"):
+                    seed_src = seed_src.item()
+                seed = int(seed_src)
+            except (TypeError, ValueError, RuntimeError):  # multi-element tensor .item() raises RuntimeError
+                seed = 0
+        rng = random.Random(seed)
+
+        keep_indices: list[int] = []
+        for idx, msg in enumerate(messages):
+            if idx in target_indices:
+                keep_indices.append(idx)
+                continue
+            kind = _classify_for_dropout(msg)
+            prob = {
+                "plan": self.plan_dropout_prob,
+                "memory": self.memory_dropout_prob,
+                "subtask": self.subtask_dropout_prob,
+                "interjection": self.interjection_dropout_prob,
+            }.get(kind, 0.0)
+            if prob > 0.0 and rng.random() < prob:
+                continue
+            keep_indices.append(idx)
+
+        # Build remap and apply
+        new_messages = [messages[i] for i in keep_indices]
+        old_to_new = {old: new for new, old in enumerate(keep_indices)}
+        new_targets = [old_to_new[t] for t in target_indices if t in old_to_new]
+        return new_messages, new_targets
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        return features
+
+
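+def _classify_for_dropout(message: dict[str, Any]) -> str | None:
+    """Heuristic content-prefix classifier.
+
+    Returns ``"plan"``, ``"memory"``, ``"subtask"``, or ``None``. It never returns
+    ``"interjection"``, so no message is currently dropped under
+    ``interjection_dropout_prob``.
+    """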
+    content = message.get("content")
+    if isinstance(content, list):
+        text_parts = [b.get("text", "") for b in content if isinstance(b, dict) and b.get("type") == "text"]
+        content = " ".join(text_parts)
+    elif content is None:
+        return None
+    elif not isinstance(content, str):
+        return None
+    s = content.strip()
+    if s.startswith("Plan:") or s.startswith("Previous plan"):
+        return "plan"
+    if s.startswith("Memory:") or s.startswith("Previous memory"):
+        return "memory"
+    if s.startswith("Current subtask") or s.startswith("Completed subtask"):
+        return "subtask"
+    return None
@@ -14,18 +14,28 @@

 from __future__ import annotations

-from typing import TYPE_CHECKING
+import copy
+from typing import TYPE_CHECKING, Literal

 import torch
-from torch import nn
+from torch import Tensor, nn
+from torch.nn import functional as F  # noqa: N812

 from lerobot.utils.import_utils import _transformers_available

+# Default PaliGemma SigLIP input resolution. Mirrors
+# ``pi05.configuration_pi05.DEFAULT_IMAGE_SIZE``; duplicated as a plain constant
+# to avoid importing the pi05 package here (which would create an import cycle:
+# pi_gemma -> pi05.__init__ -> modeling_pi05 -> pi_gemma).
+DEFAULT_IMAGE_SIZE = 224
+
 if TYPE_CHECKING or _transformers_available:
    from transformers.cache_utils import DynamicCache
    from transformers.masking_utils import create_causal_mask
    from transformers.modeling_layers import GradientCheckpointingLayer
    from transformers.modeling_outputs import BaseModelOutputWithPast
+    from transformers.models.auto import CONFIG_MAPPING
+    from transformers.models.gemma import modeling_gemma
    from transformers.models.gemma.modeling_gemma import (
        GemmaAttention,
        GemmaConfig,
@@ -49,6 +59,8 @@ else:
    GradientCheckpointingLayer = None
    BaseModelOutputWithPast = None
    create_causal_mask = None
+    CONFIG_MAPPING = None
+    modeling_gemma = None


 def _gated_residual(
@@ -275,6 +287,8 @@ class PiGemmaModel(GemmaModel):  # type: ignore[misc]
        # Convert to bfloat16 if the first layer uses bfloat16
        if len(self.layers) > 0 and self.layers[0].self_attn.q_proj.weight.dtype == torch.bfloat16:
            hidden_states = hidden_states.to(torch.bfloat16)
+        if causal_mask is not None and torch.is_floating_point(causal_mask):
+            causal_mask = causal_mask.to(dtype=hidden_states.dtype)

        # create position embeddings to be shared across the decoder layers
        position_embeddings = self.rotary_emb(hidden_states, position_ids)
@@ -367,3 +381,374 @@ __all__ = [
    "PaliGemmaModelWithPiGemma",
    "PaliGemmaForConditionalGenerationWithPiGemma",
 ]
+
+
+# PI0.5 / PI052 dual-expert backbone: generic PaliGemma + Gemma action-expert
+# transformer machinery used by the pi052 policy. GemmaVariantConfig is openpi's
+# width/depth variant config (renamed from GemmaConfig to avoid clashing with
+# transformers' GemmaConfig).
+
+def sdpa_attention_forward(
+    module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: torch.Tensor | None,
+    scaling: float,
+    dropout: float = 0.0,
+):
+    """Drop-in for ``modeling_gemma.eager_attention_forward`` using
+    ``torch.nn.functional.scaled_dot_product_attention``.
+
+    PyTorch SDPA picks the memory-efficient kernel for arbitrary additive
+    bias masks (the FA backend only accepts causal/sliding-window). On
+    H100 that is ~1.3-1.7x faster and uses ~30-40% less attention memory
+    than the eager softmax(QK^T)+matmul path. Mirrors eager's signature
+    and output shape (``(B, Lq, H, D)``) so call sites are unchanged.
+    """
+    n_rep = module.num_key_value_groups
+    if n_rep > 1:
+        key = key.repeat_interleave(n_rep, dim=1)
+        value = value.repeat_interleave(n_rep, dim=1)
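+    # Cast only additive (floating-point) masks. A bool mask must stay bool:
+    # SDPA reads it as keep/drop, and casting it to float would turn it into a 0/1 bias.
+    if (
+        attention_mask is not None
+        and torch.is_floating_point(attention_mask)
+        and attention_mask.dtype != query.dtype
+    ):
+        attention_mask = attention_mask.to(dtype=query.dtype)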
+    attn_output = F.scaled_dot_product_attention(
+        query,
+        key,
+        value,
+        attn_mask=attention_mask,
+        dropout_p=dropout if module.training else 0.0,
+        is_causal=False,
+        scale=scaling,
+    )
+    return attn_output.transpose(1, 2).contiguous(), None
+
+
+# Define the complete layer computation function for gradient checkpointing
+def compute_layer_complete(
+    layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond, paligemma, gemma_expert
+):
+    models = [paligemma.model.language_model, gemma_expert.model]
+    query_states = []
+    key_states = []
+    value_states = []
+    gates = []
+    for i, hidden_states in enumerate(inputs_embeds):
+        layer = models[i].layers[layer_idx]
+        hidden_states, gate = layernorm_forward(layer.input_layernorm, hidden_states, adarms_cond[i])
+        gates.append(gate)
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, layer.self_attn.head_dim)
+        query_state = layer.self_attn.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        key_state = layer.self_attn.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        value_state = layer.self_attn.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+        query_states.append(query_state)
+        key_states.append(key_state)
+        value_states.append(value_state)
+    # Concatenate and process attention
+    query_states = torch.cat(query_states, dim=2)
+    key_states = torch.cat(key_states, dim=2)
+    value_states = torch.cat(value_states, dim=2)
+    dummy_tensor = torch.zeros(
+        query_states.shape[0],
+        query_states.shape[2],
+        query_states.shape[-1],
+        device=query_states.device,
+        dtype=query_states.dtype,
+    )
+    cos, sin = paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
+    query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
+        query_states, key_states, cos, sin, unsqueeze_dim=1
+    )
+    batch_size = query_states.shape[0]
+    scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
+    att_output, _ = sdpa_attention_forward(
+        paligemma.model.language_model.layers[layer_idx].self_attn,
+        query_states,
+        key_states,
+        value_states,
+        attention_mask,
+        scaling,
+    )
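+    # Get head_dim from the current layer, not from the model
+    head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
+    # att_output is (B, Lq, H, D); read H from the tensor instead of hard-coding 8 heads.
+    num_heads = att_output.shape[2]
+    att_output = att_output.reshape(batch_size, -1, num_heads * head_dim)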
+    # Process layer outputs
+    outputs_embeds = []
+    start_pos = 0
+    for i, hidden_states in enumerate(inputs_embeds):
+        layer = models[i].layers[layer_idx]
+        end_pos = start_pos + hidden_states.shape[1]
+        if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
+            att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
+        out_emb = layer.self_attn.o_proj(att_output[:, start_pos:end_pos])
+        # first residual
+        out_emb = _gated_residual(hidden_states, out_emb, gates[i])
+        after_first_residual = out_emb.clone()
+        out_emb, gate = layernorm_forward(layer.post_attention_layernorm, out_emb, adarms_cond[i])
+        # Convert to bfloat16 if the next layer (mlp) uses bfloat16
+        if layer.mlp.up_proj.weight.dtype == torch.bfloat16:
+            out_emb = out_emb.to(dtype=torch.bfloat16)
+        out_emb = layer.mlp(out_emb)
+        # second residual
+        out_emb = _gated_residual(after_first_residual, out_emb, gate)
+        outputs_embeds.append(out_emb)
+        start_pos = end_pos
+    return outputs_embeds
+
+
+class GemmaVariantConfig:  # see openpi `gemma.py: Config`
+    """Configuration for Gemma model variants."""
+
+    def __init__(self, width, depth, mlp_dim, num_heads, num_kv_heads, head_dim):
+        self.width = width
+        self.depth = depth
+        self.mlp_dim = mlp_dim
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads
+        self.head_dim = head_dim
+
+
+def get_gemma_config(variant: str) -> GemmaVariantConfig:  # see openpi `gemma.py: get_config`
+    """Returns config for specified gemma variant."""
+    if variant == "gemma_300m":
+        return GemmaVariantConfig(
+            width=1024,
+            depth=18,
+            mlp_dim=4096,
+            num_heads=8,
+            num_kv_heads=1,
+            head_dim=256,
+        )
+    elif variant == "gemma_2b":
+        return GemmaVariantConfig(
+            width=2048,
+            depth=18,
+            mlp_dim=16_384,
+            num_heads=8,
+            num_kv_heads=1,
+            head_dim=256,
+        )
+    else:
+        raise ValueError(f"Unknown variant: {variant}")
+
+
+# See openpi `gemma_pytorch.py: PaliGemmaWithExpertModel`; this class is almost an exact copy of it.
+class PaliGemmaWithExpertModel(nn.Module):
+    """PaliGemma model with action expert for PI05."""
+
+    def __init__(
+        self,
+        vlm_config,
+        action_expert_config,
+        use_adarms=None,
+        precision: Literal["bfloat16", "float32"] = "bfloat16",
+        image_size: int = DEFAULT_IMAGE_SIZE,
+        freeze_vision_encoder: bool = False,
+        train_expert_only: bool = False,
+    ):
+        if use_adarms is None:
+            use_adarms = [False, False]
+        super().__init__()
+        self.freeze_vision_encoder = freeze_vision_encoder
+        self.train_expert_only = train_expert_only
+
+        vlm_config_hf = CONFIG_MAPPING["paligemma"]()
+        vlm_config_hf._vocab_size = 257152  # noqa: SLF001
+        vlm_config_hf.image_token_index = 257152
+        vlm_config_hf.text_config.hidden_size = vlm_config.width
+        vlm_config_hf.text_config.intermediate_size = vlm_config.mlp_dim
+        vlm_config_hf.text_config.num_attention_heads = vlm_config.num_heads
+        vlm_config_hf.text_config.head_dim = vlm_config.head_dim
+        vlm_config_hf.text_config.num_hidden_layers = vlm_config.depth
+        vlm_config_hf.text_config.num_key_value_heads = vlm_config.num_kv_heads
+        vlm_config_hf.text_config.hidden_activation = "gelu_pytorch_tanh"
+        vlm_config_hf.text_config.dtype = "float32"
+        vlm_config_hf.text_config.vocab_size = 257152
+        vlm_config_hf.text_config.use_adarms = use_adarms[0]
+        vlm_config_hf.text_config.adarms_cond_dim = vlm_config.width if use_adarms[0] else None
+        vlm_config_hf.vision_config.image_size = image_size
+        vlm_config_hf.vision_config.intermediate_size = 4304
+        vlm_config_hf.vision_config.projection_dim = 2048
+        vlm_config_hf.vision_config.projector_hidden_act = "gelu_fast"
+        vlm_config_hf.vision_config.dtype = "float32"
+
+        action_expert_config_hf = CONFIG_MAPPING["gemma"](
+            head_dim=action_expert_config.head_dim,
+            hidden_size=action_expert_config.width,
+            intermediate_size=action_expert_config.mlp_dim,
+            num_attention_heads=action_expert_config.num_heads,
+            num_hidden_layers=action_expert_config.depth,
+            num_key_value_heads=action_expert_config.num_kv_heads,
+            vocab_size=257152,
+            hidden_activation="gelu_pytorch_tanh",
+            dtype="float32",
+            use_adarms=use_adarms[1],
+            adarms_cond_dim=action_expert_config.width if use_adarms[1] else None,
+        )
+
+        self.paligemma = PaliGemmaForConditionalGenerationWithPiGemma(config=vlm_config_hf)
+        self.gemma_expert = PiGemmaForCausalLM(config=action_expert_config_hf)
+        self.gemma_expert.model.embed_tokens = None
+
+        self.to_bfloat16_for_selected_params(precision)
+        self._set_requires_grad()
+
+    def to_bfloat16_for_selected_params(self, precision: Literal["bfloat16", "float32"] = "bfloat16"):
+        if precision == "bfloat16":
+            self.to(dtype=torch.bfloat16)
+        elif precision == "float32":
+            self.to(dtype=torch.float32)
+            return
+        else:
+            raise ValueError(f"Invalid precision: {precision}")
+
+        # Keep the full vision path (plus the norms and lm_head listed below) in float32 so
+        # parameter dtypes never toggle (toggling causes the optimizer "same dtype" error).
+        # Saves memory vs full float32; uses more memory than keeping only 3 params in float32.
+        params_to_keep_float32 = [
+            "vision_tower",
+            "multi_modal_projector",
+            "lm_head",
+            "input_layernorm",
+            "post_attention_layernorm",
+            "model.norm",
+        ]
+
+        for name, param in self.named_parameters():
+            if any(selector in name for selector in params_to_keep_float32):
+                param.data = param.data.to(dtype=torch.float32)
+
+    def _set_requires_grad(self):
+        if self.freeze_vision_encoder:
+            self.paligemma.model.vision_tower.eval()
+            for param in self.paligemma.model.vision_tower.parameters():
+                param.requires_grad = False
+        if self.train_expert_only:
+            self.paligemma.eval()
+            for param in self.paligemma.parameters():
+                param.requires_grad = False
+
+    def train(self, mode: bool = True):
+        super().train(mode)
+        if self.freeze_vision_encoder:
+            self.paligemma.model.vision_tower.eval()
+        if self.train_expert_only:
+            self.paligemma.eval()
+        # nn.Module.train() returns self; preserve that so calls can be chained.
+        return self
+
+    def embed_image(self, image: torch.Tensor):
+        # Vision tower and multi_modal_projector are kept in float32 (params_to_keep_float32).
+        out_dtype = image.dtype
+        if image.dtype != torch.float32:
+            image = image.to(torch.float32)
+        image_outputs = self.paligemma.model.get_image_features(image)
+        # OpenPI / big_vision convention: image (soft) tokens are NOT scaled by the
+        # Gemma embedder normalizer (sqrt(hidden_size)) — only text tokens are. lerobot/pi05_base
+        # was trained in this regime, so scaling image features here over-scales them ~45x and
+        # breaks the pretrained vision-language alignment. Keep image features un-normalized.
+        features = image_outputs.pooler_output
+        if features.dtype != out_dtype:
+            features = features.to(out_dtype)
+        return features
+
+    def embed_language_tokens(self, tokens: torch.Tensor):
+        return self.paligemma.model.language_model.embed_tokens(tokens)
+
+    def forward(
+        self,
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_values: list[torch.FloatTensor] | None = None,
+        inputs_embeds: list[torch.FloatTensor] | None = None,
+        use_cache: bool | None = None,
+        adarms_cond: list[torch.Tensor] | None = None,
+    ):
+        if adarms_cond is None:
+            adarms_cond = [None, None]
+        if inputs_embeds[1] is None:
+            prefix_output = self.paligemma.model.language_model.forward(
+                inputs_embeds=inputs_embeds[0],
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                adarms_cond=adarms_cond[0],
+            )
+            prefix_past_key_values = prefix_output.past_key_values
+            prefix_output = prefix_output.last_hidden_state
+            suffix_output = None
+        elif inputs_embeds[0] is None:
+            suffix_output = self.gemma_expert.model.forward(
+                inputs_embeds=inputs_embeds[1],
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                adarms_cond=adarms_cond[1],
+            )
+            suffix_output = suffix_output.last_hidden_state
+            prefix_output = None
+            prefix_past_key_values = None
+        else:
+            models = [self.paligemma.model.language_model, self.gemma_expert.model]
+            num_layers = self.paligemma.config.text_config.num_hidden_layers
+
+            # Check if gradient checkpointing is enabled for any of the models
+            use_gradient_checkpointing = (
+                hasattr(self.gemma_expert.model, "gradient_checkpointing")
+                and self.gemma_expert.model.gradient_checkpointing
+                and self.training
+            ) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)
+
+            # Process all layers with gradient checkpointing if enabled
+            for layer_idx in range(num_layers):
+                if use_gradient_checkpointing:
+                    inputs_embeds = torch.utils.checkpoint.checkpoint(
+                        compute_layer_complete,
+                        layer_idx,
+                        inputs_embeds,
+                        attention_mask,
+                        position_ids,
+                        adarms_cond,
+                        use_reentrant=False,
+                        preserve_rng_state=False,
+                        paligemma=self.paligemma,
+                        gemma_expert=self.gemma_expert,
+                    )
+                else:
+                    inputs_embeds = compute_layer_complete(
+                        layer_idx,
+                        inputs_embeds,
+                        attention_mask,
+                        position_ids,
+                        adarms_cond,
+                        paligemma=self.paligemma,
+                        gemma_expert=self.gemma_expert,
+                    )
+
+            # final norm
+            def compute_final_norms(inputs_embeds, adarms_cond):
+                outputs_embeds = []
+                for i, hidden_states in enumerate(inputs_embeds):
+                    out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
+                    outputs_embeds.append(out_emb)
+                return outputs_embeds
+
+            # Apply gradient checkpointing to final norm if enabled
+            if use_gradient_checkpointing:
+                outputs_embeds = torch.utils.checkpoint.checkpoint(
+                    compute_final_norms,
+                    inputs_embeds,
+                    adarms_cond,
+                    use_reentrant=False,
+                    preserve_rng_state=False,
+                )
+            else:
+                outputs_embeds = compute_final_norms(inputs_embeds, adarms_cond)
+
+            prefix_output = outputs_embeds[0]
+            suffix_output = outputs_embeds[1]
+            prefix_past_key_values = None
+
+        return [prefix_output, suffix_output], prefix_past_key_values
+
@@ -29,7 +29,6 @@ from huggingface_hub.errors import HfHubHTTPError
 from safetensors.torch import load_model as load_model_as_safetensor, save_model as save_model_as_safetensor
 from torch import Tensor, nn

-from lerobot.__version__ import __version__
 from lerobot.configs import PreTrainedConfig
 from lerobot.configs.train import TrainPipelineConfig
 from lerobot.utils.hub import HubMixin
@@ -39,67 +38,6 @@ from .utils import log_model_loading_keys
 T = TypeVar("T", bound="PreTrainedPolicy")


-def _build_card_context(
-    cfg: TrainPipelineConfig | None,
-    dataset_repo_id: str | None,
-    input_features: dict | None,
-    output_features: dict | None,
-) -> dict:
-    """Collect optional data for the model-card template.
-
-    Returns plain values only (no Markdown) — the template in
-    ``lerobot/templates/lerobot_modelcard_template.md`` decides how and whether to show
-    each one. Everything is best-effort: anything unavailable is left empty/None and the
-    template simply skips that section, so this never breaks a Hub push.
-    """
-    context = {
-        "training": None,
-        "input_features": input_features or {},
-        "output_features": output_features or {},
-        "dataset": None,
-        "robot_type": None,
-        "cameras": [],
-    }
-
-    if cfg is not None:
-        optimizer = getattr(cfg, "optimizer", None)
-        context["training"] = {
-            "steps": cfg.steps,
-            "batch_size": cfg.batch_size,
-            "seed": cfg.seed,
-            "optimizer": getattr(optimizer, "type", None) if optimizer else None,
-            "lr": getattr(optimizer, "lr", None) if optimizer else None,
-            "lerobot_version": __version__,
-        }
-
-    if dataset_repo_id:
-        dataset_cfg = getattr(cfg, "dataset", None)
-        try:
-            from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
-
-            meta = LeRobotDatasetMetadata(
-                dataset_repo_id,
-                root=getattr(dataset_cfg, "root", None),
-                revision=getattr(dataset_cfg, "revision", None),
-            )
-            context["dataset"] = {
-                "repo_id": dataset_repo_id,
-                "episodes": meta.total_episodes,
-                "frames": meta.total_frames,
-                "fps": meta.fps,
-                "tasks": [str(task) for task in meta.tasks.index],
-            }
-            context["robot_type"] = meta.robot_type
-            context["cameras"] = [key.split(".")[-1] for key in meta.camera_keys]
-        except Exception as e:  # noqa: BLE001 — dataset details are optional, never fail the push
-            logging.warning(
-                f"Could not load dataset metadata for '{dataset_repo_id}'; those sections will be "
-                f"omitted from the model card. ({e})"
-            )
-
-    return context
-
-
 class ActionSelectKwargs(TypedDict, total=False):
    noise: Tensor | None

@@ -290,7 +228,7 @@ class PreTrainedPolicy(nn.Module, HubMixin, abc.ABC):
                self.save_pretrained(saved_path)  # Calls _save_pretrained and stores model tensors

            card = self.generate_model_card(
-                cfg.dataset.repo_id, self.config.type, self.config.license, self.config.tags, cfg=cfg
+                cfg.dataset.repo_id, self.config.type, self.config.license, self.config.tags
            )
            card.save(str(saved_path / "README.md"))

@@ -308,20 +246,9 @@ class PreTrainedPolicy(nn.Module, HubMixin, abc.ABC):
            logging.info(f"Model pushed to {commit_info.repo_url.url}")

    def generate_model_card(
-        self,
-        dataset_repo_id: str,
-        model_type: str,
-        license: str | None,
-        tags: list[str] | None,
-        cfg: TrainPipelineConfig | None = None,
+        self, dataset_repo_id: str, model_type: str, license: str | None, tags: list[str] | None
    ) -> ModelCard:
-        base_model_mapping = {
-            "smolvla": "lerobot/smolvla_base",
-            "pi0": "lerobot/pi0_base",
-            "pi05": "lerobot/pi05_base",
-            "pi0_fast": "lerobot/pi0fast-base",
-            "xvla": "lerobot/xvla-base",
-        }
+        base_model = "lerobot/smolvla_base" if model_type == "smolvla" else None  # Set a base model

        card_data = ModelCardData(
            license=license or "apache-2.0",
@@ -330,20 +257,13 @@ class PreTrainedPolicy(nn.Module, HubMixin, abc.ABC):
            tags=list(set(tags or []).union({"robotics", "lerobot", model_type})),
            model_name=model_type,
            datasets=dataset_repo_id,
-            base_model=base_model_mapping.get(model_type),
+            base_model=base_model,
        )

-        context = _build_card_context(
-            cfg, dataset_repo_id, self.config.input_features, self.config.output_features
-        )
-        # Used by the template to pre-fill commands and the "Fine-tuned from" line.
-        context["policy_repo_id"] = getattr(self.config, "repo_id", None)
-        context["base_model"] = base_model_mapping.get(model_type)
-
        template_card = (
            files("lerobot.templates").joinpath("lerobot_modelcard_template.md").read_text(encoding="utf-8")
        )
-        card = ModelCard.from_template(card_data, template_str=template_card, **context)
+        card = ModelCard.from_template(card_data, template_str=template_card)
        card.validate()
        return card

@@ -175,9 +175,6 @@ class AddBatchDimensionComplementaryDataStep(ComplementaryDataProcessorStep):
            if isinstance(task_index_value, Tensor) and task_index_value.dim() == 0:
                complementary_data["task_index"] = task_index_value.unsqueeze(0)

-        complementary_data.pop("language_persistent", None)
-        complementary_data.pop("language_events", None)
-
        if "messages" in complementary_data:
            messages = complementary_data["messages"]
            if isinstance(messages, list) and (not messages or isinstance(messages[0], dict)):
@@ -32,6 +32,7 @@ from __future__ import annotations

 import importlib
 import json
+import os
 import re
 from abc import ABC, abstractmethod
 from collections.abc import Callable, Iterable, Sequence
@@ -280,11 +281,6 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):

    before_step_hooks: list[Callable[[int, EnvTransition], None]] = field(default_factory=list, repr=False)
    after_step_hooks: list[Callable[[int, EnvTransition], None]] = field(default_factory=list, repr=False)
-    _serialized_state_filenames: tuple[str | None, ...] | None = field(
-        default=None,
-        init=False,
-        repr=False,
-    )

    def __call__(self, data: TInput) -> TOutput:
        """Processes input data through the full pipeline.
@@ -342,108 +338,30 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):
            transition = processor_step(transition)
            yield transition

-    def _get_sanitized_name(self) -> str:
-        """Return a filename-safe version of the pipeline name.
+    def _save_pretrained(self, save_directory: Path, **kwargs):
+        """Internal method to comply with `HubMixin`'s saving mechanism.

-        Returns:
-            The lower-cased pipeline name with non-alphanumeric characters replaced by underscores.
+        This method does the actual saving work and is called by HubMixin.save_pretrained.
        """
-        return re.sub(r"[^a-zA-Z0-9_]", "_", self.name.lower())
+        config_filename = kwargs.pop("config_filename", None)

-    @staticmethod
-    def _get_state_filename(
-        *,
-        step_index: int,
-        registry_name: str | None,
-        sanitized_name: str,
-    ) -> str:
-        """Return the safetensors filename for one stateful processor step.
+        # Sanitize the pipeline name to create a valid filename prefix.
+        sanitized_name = re.sub(r"[^a-zA-Z0-9_]", "_", self.name.lower())

-        Args:
-            step_index: The index of the processor step in this pipeline.
-            registry_name: The registered processor step name, if available.
-            sanitized_name: The filename-safe pipeline name.
+        if config_filename is None:
+            config_filename = f"{sanitized_name}.json"

-        Returns:
-            The state filename used by the existing disk serialization format.
-        """
-        if registry_name:
-            return f"{sanitized_name}_step_{step_index}_{registry_name}.safetensors"
-
-        return f"{sanitized_name}_step_{step_index}.safetensors"
-
-    @staticmethod
-    def _get_state_key(state_filename: str) -> str:
-        """Return the in-memory state key for a serialized state filename.
-
-        Args:
-            state_filename: The `.safetensors` filename from the serialized config.
-
-        Returns:
-            The state key used by the in-memory pipeline state dictionary.
-        """
-        return state_filename.removesuffix(".safetensors")
-
-    @staticmethod
-    def _get_state_filenames_from_config(loaded_config: dict[str, Any]) -> tuple[str | None, ...]:
-        """Return serialized state filenames in step order.
-
-        Args:
-            loaded_config: A validated processor pipeline config.
-
-        Returns:
-            A tuple containing each step's serialized state filename, or None for stateless steps.
-        """
-        return tuple(step_entry.get("state_file") for step_entry in loaded_config["steps"])
-
-    def _get_state_filenames_for_loading(self) -> tuple[str | None, ...]:
-        """Return expected state filenames in step order for `load_state_dict()`.
-
-        Returns:
-            The preserved serialized state filenames when available, otherwise filenames derived from
-            current non-empty step state.
-        """
-        if self._serialized_state_filenames is not None and len(self._serialized_state_filenames) == len(
-            self.steps
-        ):
-            return self._serialized_state_filenames
-
-        sanitized_name = self._get_sanitized_name()
-        state_filenames: list[str | None] = []
-
-        for step_index, processor_step in enumerate(self.steps):
-            step_state_dict = processor_step.state_dict()
-            if not step_state_dict:
-                state_filenames.append(None)
-                continue
-
-            registry_name = getattr(processor_step.__class__, "_registry_name", None)
-            state_filenames.append(
-                self._get_state_filename(
-                    step_index=step_index,
-                    registry_name=registry_name,
-                    sanitized_name=sanitized_name,
-                )
-            )
-
-        return tuple(state_filenames)
-
-    def get_config(self) -> dict[str, Any]:
-        """Return the JSON-serializable pipeline configuration.
-
-        Returns:
-            A dictionary with the same content that `save_pretrained()` writes as JSON.
-        """
-        sanitized_name = self._get_sanitized_name()
-        pipeline_config: dict[str, Any] = {
+        config: dict[str, Any] = {
            "name": self.name,
            "steps": [],
        }

+        # Iterate through each step to build its configuration entry.
        for step_index, processor_step in enumerate(self.steps):
            registry_name = getattr(processor_step.__class__, "_registry_name", None)
-            step_entry: dict[str, Any] = {}

+            step_entry: dict[str, Any] = {}
+            # Prefer registry name for portability, otherwise fall back to full class path.
            if registry_name:
                step_entry["registry_name"] = registry_name
            else:
@@ -451,110 +369,31 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):
                    f"{processor_step.__class__.__module__}.{processor_step.__class__.__name__}"
                )

-            step_entry["config"] = processor_step.get_config()
+            # Save step configuration if `get_config` is implemented.
+            if hasattr(processor_step, "get_config"):
+                step_entry["config"] = processor_step.get_config()

-            step_state_dict = processor_step.state_dict()
-            if step_state_dict:
-                step_entry["state_file"] = self._get_state_filename(
-                    step_index=step_index,
-                    registry_name=registry_name,
-                    sanitized_name=sanitized_name,
-                )
+            # Save step state if `state_dict` is implemented and returns a non-empty dict.
+            if hasattr(processor_step, "state_dict"):
+                state = processor_step.state_dict()
+                if state:
+                    # Clone tensors to avoid modifying the original state.
+                    cloned_state = {key: tensor.clone() for key, tensor in state.items()}

-            pipeline_config["steps"].append(step_entry)
+                    # Create a unique filename for the state file.
+                    if registry_name:
+                        state_filename = f"{sanitized_name}_step_{step_index}_{registry_name}.safetensors"
+                    else:
+                        state_filename = f"{sanitized_name}_step_{step_index}.safetensors"

-        return pipeline_config
+                    save_file(cloned_state, os.path.join(str(save_directory), state_filename))
+                    step_entry["state_file"] = state_filename

-    def state_dict(self) -> dict[str, dict[str, torch.Tensor]]:
-        """Return pipeline state tensors grouped by state key.
+            config["steps"].append(step_entry)

-        Returns:
-            A dictionary mapping suffixless state keys to cloned step state dictionaries.
-        """
-        sanitized_name = self._get_sanitized_name()
-        pipeline_state_dict: dict[str, dict[str, torch.Tensor]] = {}
-
-        for step_index, processor_step in enumerate(self.steps):
-            step_state_dict = processor_step.state_dict()
-            if not step_state_dict:
-                continue
-
-            registry_name = getattr(processor_step.__class__, "_registry_name", None)
-            state_filename = self._get_state_filename(
-                step_index=step_index,
-                registry_name=registry_name,
-                sanitized_name=sanitized_name,
-            )
-            state_key = self._get_state_key(state_filename)
-            pipeline_state_dict[state_key] = {
-                tensor_name: tensor.clone() for tensor_name, tensor in step_state_dict.items()
-            }
-
-        return pipeline_state_dict
-
-    def load_state_dict(
-        self,
-        state_dict: dict[str, dict[str, torch.Tensor]],
-    ) -> None:
-        """Load pipeline state tensors into the existing steps.
-
-        Args:
-            state_dict: A dictionary mapping suffixless state keys to step state dictionaries.
-
-        Raises:
-            KeyError: If loading finds missing expected state or unexpected extra state.
-        """
-        expected_state_filenames = self._get_state_filenames_for_loading()
-        used_state_keys: set[str] = set()
-
-        for step_index, (processor_step, state_filename) in enumerate(
-            zip(self.steps, expected_state_filenames, strict=True)
-        ):
-            if state_filename is None:
-                continue
-
-            state_key = self._get_state_key(state_filename)
-            if state_key not in state_dict:
-                raise KeyError(
-                    f"Missing state key '{state_key}' for processor step {step_index}. "
-                    f"Available state keys: {sorted(state_dict.keys())}"
-                )
-
-            processor_step.load_state_dict(state_dict[state_key])
-            used_state_keys.add(state_key)
-
-        unexpected_state_keys = set(state_dict) - used_state_keys
-        if unexpected_state_keys:
-            expected_state_key_set = {
-                self._get_state_key(state_filename)
-                for state_filename in expected_state_filenames
-                if state_filename is not None
-            }
-            raise KeyError(
-                f"Unexpected processor state keys: {sorted(unexpected_state_keys)}. "
-                f"Expected state keys: {sorted(expected_state_key_set)}"
-            )
-
-    def _save_pretrained(self, save_directory: Path, **kwargs) -> None:
-        """Internal method to comply with `HubMixin`'s saving mechanism.
-
-        This method does the actual saving work and is called by HubMixin.save_pretrained.
-        """
-        config_filename = kwargs.pop("config_filename", None)
-        sanitized_name = self._get_sanitized_name()
-
-        if config_filename is None:
-            config_filename = f"{sanitized_name}.json"
-
-        pipeline_config = self.get_config()
-        pipeline_state_dict = self.state_dict()
-
-        for state_key, step_state_dict in pipeline_state_dict.items():
-            state_filename = f"{state_key}.safetensors"
-            save_file(step_state_dict, save_directory / state_filename)
-
-        with open(save_directory / config_filename, "w") as file_pointer:
-            json.dump(pipeline_config, file_pointer, indent=2)
+        # Write the main configuration JSON file.
+        with open(os.path.join(str(save_directory), config_filename), "w") as file_pointer:
+            json.dump(config, file_pointer, indent=2)
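The state-file naming used by the added `_save_pretrained` lines can be sketched in isolation. This is only an illustration: `sanitize` and `state_filename` are hypothetical helper names that do not appear in the diff, and the pipeline and registry names in the usage lines are made up. The regex and the two f-string branches are copied from the added lines.

```python
from __future__ import annotations

import re


def sanitize(name: str) -> str:
    # Same expression as the diff: lower-case the name, then replace every
    # character outside [a-zA-Z0-9_] with an underscore.
    return re.sub(r"[^a-zA-Z0-9_]", "_", name.lower())


def state_filename(pipeline_name: str, step_index: int, registry_name: str | None) -> str:
    # Hypothetical helper mirroring the two f-string branches in the diff:
    # the registry name is appended only when the step has one.
    prefix = f"{sanitize(pipeline_name)}_step_{step_index}"
    if registry_name:
        return f"{prefix}_{registry_name}.safetensors"
    return f"{prefix}.safetensors"


# Illustrative names only; they are not taken from the repository.
print(state_filename("Policy Pre-Processor", 2, "normalizer_processor"))
print(state_filename("Policy Pre-Processor", 0, None))
```

Because the step index is part of the filename, two steps of the same class in one pipeline still get distinct state files.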

    def save_pretrained(
        self,
@@ -738,54 +577,12 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):
        cls._validate_overrides_used(validated_overrides, loaded_config)

        # 5. Construct and return the final pipeline instance
-        pipeline = cls(
+        return cls(
            steps=steps,
            name=loaded_config.get("name", "DataProcessorPipeline"),
            to_transition=to_transition or cast(Callable[[TInput], EnvTransition], batch_to_transition),
            to_output=to_output or cast(Callable[[EnvTransition], TOutput], transition_to_batch),
        )
-        pipeline._serialized_state_filenames = cls._get_state_filenames_from_config(loaded_config)
-        return pipeline
-
-    @classmethod
-    def from_config(
-        cls,
-        config: dict[str, Any],
-        *,
-        state_dict: dict[str, dict[str, torch.Tensor]] | None = None,
-        overrides: dict[str, Any] | None = None,
-        to_transition: Callable[[TInput], EnvTransition] | None = None,
-        to_output: Callable[[EnvTransition], TOutput] | None = None,
-    ) -> DataProcessorPipeline[TInput, TOutput]:
-        """Build a pipeline from an in-memory config and optional state tensors.
-
-        Args:
-            config: A config dictionary with the same structure as the saved processor JSON.
-            state_dict: Optional in-memory pipeline state grouped by suffixless state key.
-            overrides: Optional constructor overrides keyed by registry name or class name.
-            to_transition: Optional converter from input data to `EnvTransition`.
-            to_output: Optional converter from `EnvTransition` to output data.
-
-        Returns:
-            A processor pipeline built from the config and optional state.
-        """
-        cls._validate_loaded_config("<in-memory config>", config, "<in-memory config>")
-
-        steps, remaining_override_keys = cls._build_steps_from_config(config, overrides or {})
-        cls._validate_overrides_used(remaining_override_keys, config)
-
-        pipeline = cls(
-            steps=steps,
-            name=config.get("name", "DataProcessorPipeline"),
-            to_transition=to_transition or cast(Callable[[TInput], EnvTransition], batch_to_transition),
-            to_output=to_output or cast(Callable[[EnvTransition], TOutput], transition_to_batch),
-        )
-        pipeline._serialized_state_filenames = cls._get_state_filenames_from_config(config)
-
-        if state_dict is not None:
-            pipeline.load_state_dict(state_dict)
-
-        return pipeline

    @classmethod
    def _load_config(
@@ -869,7 +666,9 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):
                ) from e

    @classmethod
-    def _validate_loaded_config(cls, model_id: str, loaded_config: Any, config_filename: str) -> None:
+    def _validate_loaded_config(
+        cls, model_id: str, loaded_config: dict[str, Any], config_filename: str
+    ) -> None:
        """Validate that a config was loaded and is a valid processor config.

        This method validates processor config format with intelligent migration detection:
@@ -889,7 +688,7 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):

        Args:
            model_id: The model identifier (used for migration detection)
-            loaded_config: The loaded config value to validate (may be non-dict)
+            loaded_config: The loaded config dictionary (guaranteed non-None)
            config_filename: The config filename that was loaded (for error messages)

        Raises:
@@ -903,14 +702,9 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):
                    model_id,
                    f"Config file '{config_filename}' is not a valid processor configuration",
                )
-            loaded_config_description = (
-                list(loaded_config.keys())
-                if isinstance(loaded_config, dict)
-                else type(loaded_config).__name__
-            )
            raise ValueError(
                f"Config file '{config_filename}' is not a valid processor configuration. "
-                f"Expected a config with 'steps' field, but got: {loaded_config_description}"
+                f"Expected a config with 'steps' field, but got: {list(loaded_config.keys())}"
            )

    @classmethod
@@ -972,41 +766,26 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):
            ImportError: If a step class cannot be imported or found in registry
            ValueError: If a step cannot be instantiated with its configuration
        """
-        steps, remaining_override_keys = cls._build_steps_from_config(loaded_config, overrides)
-
-        for step_instance, step_entry in zip(steps, loaded_config["steps"], strict=True):
-            cls._load_step_state(step_instance, step_entry, model_id, base_path, hub_download_kwargs)
-
-        return steps, remaining_override_keys
-
-    @classmethod
-    def _build_steps_from_config(
-        cls,
-        loaded_config: dict[str, Any],
-        overrides: dict[str, Any],
-    ) -> tuple[list[ProcessorStep], set[str]]:
-        """Build processor steps from config without loading tensor state.
-
-        Args:
-            loaded_config: The loaded processor configuration.
-            overrides: User-provided constructor overrides keyed by step key.
-
-        Returns:
-            A tuple containing instantiated steps and override keys that did not match a step.
-        """
-        processor_steps: list[ProcessorStep] = []
-        remaining_override_keys = set(overrides.keys())
+        steps: list[ProcessorStep] = []
+        override_keys = set(overrides.keys())

        for step_entry in loaded_config["steps"]:
+            # 1. Get step class and key
            step_class, step_key = cls._resolve_step_class(step_entry)
-            processor_step = cls._instantiate_step(step_entry, step_class, step_key, overrides)

-            if step_key in remaining_override_keys:
-                remaining_override_keys.discard(step_key)
+            # 2. Instantiate step with overrides
+            step_instance = cls._instantiate_step(step_entry, step_class, step_key, overrides)

-            processor_steps.append(processor_step)
+            # 3. Load step state if available
+            cls._load_step_state(step_instance, step_entry, model_id, base_path, hub_download_kwargs)

-        return processor_steps, remaining_override_keys
+            # 4. Track used overrides
+            if step_key in override_keys:
+                override_keys.discard(step_key)
+
+            steps.append(step_instance)
+
+        return steps, override_keys

    @classmethod
    def _resolve_step_class(cls, step_entry: dict[str, Any]) -> tuple[type[ProcessorStep], str]:
@@ -1317,7 +1096,7 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):
        return True

    @classmethod
-    def _is_processor_config(cls, config: Any) -> bool:
+    def _is_processor_config(cls, config: dict) -> bool:
        """Check if config follows DataProcessorPipeline format.

        This method validates the processor configuration structure:
@@ -1368,9 +1147,6 @@ class DataProcessorPipeline[TInput, TOutput](HubMixin):
        Returns:
            True if config follows valid DataProcessorPipeline format, False otherwise
        """
-        if not isinstance(config, dict):
-            return False
-
        # Must have a "steps" field with a list of step configurations
        if not isinstance(config.get("steps"), list):
            return False
@@ -50,7 +50,17 @@ class RenderMessagesStep(ProcessorStep):
        events = complementary_data.get(LANGUAGE_EVENTS) or []

        if not persistent and not events:
-            return transition
+            rendered = _fallback_low_level_render(complementary_data.get("task"))
+            if rendered is None:
+                return transition
+            new_transition = transition.copy()
+            new_complementary_data = dict(new_transition.get(TransitionKey.COMPLEMENTARY_DATA) or {})
+            new_complementary_data.update(rendered)
+            new_transition[TransitionKey.COMPLEMENTARY_DATA] = new_complementary_data
+            return new_transition
+
+        if _is_batched_language(persistent) or _is_batched_language(events):
+            return self._call_batch(transition, complementary_data, persistent, events)

        timestamp = complementary_data.get("timestamp")
        if timestamp is None:
@@ -67,18 +77,147 @@ class RenderMessagesStep(ProcessorStep):
            dataset_ctx=self.dataset_ctx,
        )
        if rendered is None:
-            return None
+            rendered = _fallback_low_level_render(complementary_data.get("task"))
+            if rendered is None:
+                return None

        new_transition = transition.copy()
-        new_complementary_data = dict(complementary_data)
+        new_complementary_data = dict(new_transition.get(TransitionKey.COMPLEMENTARY_DATA) or {})
        new_complementary_data.pop(LANGUAGE_PERSISTENT, None)
        new_complementary_data.pop(LANGUAGE_EVENTS, None)
        new_complementary_data.update(rendered)
        new_transition[TransitionKey.COMPLEMENTARY_DATA] = new_complementary_data
        return new_transition

+    def _call_batch(
+        self,
+        transition: EnvTransition,
+        complementary_data: dict[str, Any],
+        persistent_batch: list,
+        events_batch: list,
+    ) -> EnvTransition | None:
+        timestamp = complementary_data.get("timestamp")
+        if timestamp is None:
+            raise KeyError("RenderMessagesStep requires sample timestamp in complementary data.")
+
+        batch_size = max(len(persistent_batch), len(events_batch))
+        messages: list[list[dict[str, Any]]] = []
+        message_streams: list[list[str | None]] = []
+        target_message_indices: list[list[int]] = []
+        keep_indices: list[int] = []
+
+        for i in range(batch_size):
+            rendered = render_sample(
+                recipe=self.recipe,
+                persistent=persistent_batch[i] if i < len(persistent_batch) else [],
+                events=events_batch[i] if i < len(events_batch) else [],
+                t=_batch_value(timestamp, i),
+                sample_idx=int(_batch_value(complementary_data.get("index", 0), i)),
+                task=_batch_value(complementary_data.get("task"), i),
+                dataset_ctx=self.dataset_ctx,
+            )
+            if rendered is None:
+                rendered = _fallback_low_level_render(_batch_value(complementary_data.get("task"), i))
+                if rendered is None:
+                    continue
+            keep_indices.append(i)
+            messages.append(rendered["messages"])
+            message_streams.append(rendered["message_streams"])
+            target_message_indices.append(rendered["target_message_indices"])
+
+        if not messages:
+            return None
+
+        new_transition = (
+            _select_batch_indices(transition, keep_indices)
+            if len(keep_indices) != batch_size
+            else transition.copy()
+        )
+        new_complementary_data = dict(new_transition.get(TransitionKey.COMPLEMENTARY_DATA) or {})
+        new_complementary_data.pop(LANGUAGE_PERSISTENT, None)
+        new_complementary_data.pop(LANGUAGE_EVENTS, None)
+        new_complementary_data["messages"] = messages
+        new_complementary_data["message_streams"] = message_streams
+        new_complementary_data["target_message_indices"] = target_message_indices
+        new_transition[TransitionKey.COMPLEMENTARY_DATA] = new_complementary_data
+        return new_transition
+
    def transform_features(
        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
        """Pass features through unchanged; rendering only touches complementary data."""
        return features
+
+
+def _scalar(value: Any) -> float | int:
+    """Unwrap a tensor/array/single-element list into a Python scalar."""
+    if hasattr(value, "item"):
+        return value.item()
+    if isinstance(value, list):
+        if len(value) != 1:
+            raise ValueError(f"Expected a scalar, got list of length {len(value)}: {value!r}")
+        return _scalar(value[0])
+    return value
+
+
+def _is_batched_language(value: Any) -> bool:
+    return isinstance(value, list) and bool(value) and isinstance(value[0], list)
+
+
+def _batch_value(value: Any, index: int) -> Any:
+    if value is None:
+        return None
+    if isinstance(value, list):
+        return value[index]
+    if hasattr(value, "ndim") and value.ndim > 0:
+        return _scalar(value[index])
+    return _scalar(value)
+
+
+def _select_batch_indices(transition: EnvTransition, indices: list[int]) -> EnvTransition:
+    selected = transition.copy()
+    for key in (TransitionKey.OBSERVATION, TransitionKey.COMPLEMENTARY_DATA):
+        data = selected.get(key)
+        if isinstance(data, dict):
+            selected[key] = {k: _select_value(v, indices) for k, v in data.items()}
+    action = selected.get(TransitionKey.ACTION)
+    if action is not None:
+        selected[TransitionKey.ACTION] = _select_value(action, indices)
+    return selected
+
+
+def _select_value(value: Any, indices: list[int]) -> Any:
+    if isinstance(value, list) and len(value) >= len(indices):
+        return [value[i] for i in indices]
+    if hasattr(value, "index_select") and hasattr(value, "new_tensor") and getattr(value, "ndim", 0) > 0:
+        return value.index_select(0, value.new_tensor(indices).long())
+    return value
+
+
+def _fallback_low_level_render(task: Any) -> dict[str, Any] | None:
+    """Keep action-only samples trainable when no recipe branch matches."""
+    if hasattr(task, "item"):
+        task = task.item()
+    if isinstance(task, list):
+        messages = []
+        message_streams = []
+        target_message_indices = []
+        for t in task:
+            rendered = _fallback_low_level_render(t)
+            if rendered is None:
+                return None
+            messages.append(rendered["messages"])
+            message_streams.append(rendered["message_streams"])
+            target_message_indices.append(rendered["target_message_indices"])
+        return {
+            "messages": messages,
+            "message_streams": message_streams,
+            "target_message_indices": target_message_indices,
+        }
+    if not isinstance(task, str) or not task:
+        return None
+    return {
+        "messages": [{"role": "user", "content": task}],
+        "message_streams": ["low_level"],
+        "target_message_indices": [],
+    }
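The two helpers that decide between the batched path and the fallback can be tried on plain Python values. This is a simplified sketch, not the code from the diff: the function names drop the leading underscore, and the fallback handles only the single-string case (the tensor and list-of-tasks branches are left out). The sample task strings are invented.

```python
from __future__ import annotations

from typing import Any


def is_batched_language(value: Any) -> bool:
    # Batched language data is a non-empty list whose first element is itself
    # a list (one list of rows per sample). A flat list of row dicts is one sample.
    return isinstance(value, list) and bool(value) and isinstance(value[0], list)


def fallback_low_level_render(task: Any) -> dict[str, Any] | None:
    # Simplified: only the plain-string branch of the diff's fallback.
    if not isinstance(task, str) or not task:
        return None
    return {
        "messages": [{"role": "user", "content": task}],
        "message_streams": ["low_level"],
        "target_message_indices": [],
    }


# One sample (flat list of rows) versus a batch of two samples.
print(is_batched_language([{"role": "user"}]))
print(is_batched_language([[{"role": "user"}], []]))

# A sample with a task string still yields a low-level prompt; no task yields None.
print(fallback_low_level_render("pick up the cube"))
print(fallback_low_level_render(None))
```

Note that the fallback produces an empty `target_message_indices`, so such a sample carries a low-level prompt but no text target.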
@@ -32,6 +32,7 @@ import torch
 from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
 from lerobot.types import EnvTransition, RobotObservation, TransitionKey
 from lerobot.utils.constants import (
+    ACTION_CODE_TOKEN_MASK,
    ACTION_TOKEN_MASK,
    ACTION_TOKENS,
    OBS_LANGUAGE_ATTENTION_MASK,
@@ -412,14 +413,15 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
            # During inference, no action is available, skip tokenization
            return new_transition

-        # Tokenize and get both tokens and mask
-        tokens, mask = self._tokenize_action(action)
+        # Tokenize and get masks for the full formatted sequence and the discrete action codes.
+        tokens, mask, code_mask = self._tokenize_action(action)

        # Store mask in complementary data
        complementary_data = new_transition.get(TransitionKey.COMPLEMENTARY_DATA, {})
        if complementary_data is None:
            complementary_data = {}
        complementary_data[ACTION_TOKEN_MASK] = mask
+        complementary_data[ACTION_CODE_TOKEN_MASK] = code_mask
        complementary_data[ACTION_TOKENS] = tokens
        new_transition[TransitionKey.COMPLEMENTARY_DATA] = complementary_data
        return new_transition
@@ -430,7 +432,7 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
        """
        return self._paligemma_tokenizer.vocab_size - 1 - self.fast_skip_tokens - tokens

-    def _tokenize_action(self, action: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+    def _tokenize_action(self, action: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Tokenizes the action tensor and creates a mask.

@@ -459,6 +461,7 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
        # The fast tokenizer expects action data and returns token IDs
        tokens_list = []
        masks_list = []
+        code_masks_list = []

        for i in range(batch_size):
            # Tokenize single action (move to CPU first as tokenizer uses scipy which requires numpy)
@@ -476,19 +479,26 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
            if tokens.dim() > 1:
                tokens = tokens.flatten()

+            action_code_tokens = self._act_tokens_to_paligemma_tokens(tokens)
            bos_id = self._paligemma_tokenizer.bos_token_id
-            # add bos
+            prompt_tokens = torch.tensor(
+                self._paligemma_tokenizer.encode("Action: ", add_special_tokens=False),
+                device=action.device,
+            )
+            end_tokens = torch.tensor(self._paligemma_tokenizer.encode("|"), device=action.device)
+
+            code_start = 1 + len(prompt_tokens)
+            code_end = code_start + len(action_code_tokens)
            tokens = torch.cat(
                [
                    torch.tensor([bos_id], device=action.device),
-                    torch.tensor(
-                        self._paligemma_tokenizer.encode("Action: ", add_special_tokens=False),
-                        device=action.device,
-                    ),
-                    self._act_tokens_to_paligemma_tokens(tokens),
-                    torch.tensor(self._paligemma_tokenizer.encode("|"), device=action.device),
+                    prompt_tokens,
+                    action_code_tokens,
+                    end_tokens,
                ]
            )
+            code_mask = torch.zeros(len(tokens), dtype=torch.bool, device=action.device)
+            code_mask[code_start:code_end] = True

            # Truncate or pad to max_action_tokens
            if len(tokens) > self.max_action_tokens:
@@ -497,44 +507,49 @@ class ActionTokenizerProcessorStep(ActionProcessorStep):
                    "Consider increasing the `max_action_tokens` in your model config if this happens frequently."
                )
                tokens = tokens[: self.max_action_tokens]
+                code_mask = code_mask[: self.max_action_tokens]
                mask = torch.ones(self.max_action_tokens, dtype=torch.bool, device=action.device)
            else:
+                pad_len = self.max_action_tokens - len(tokens)
                mask = torch.cat(
                    [
                        torch.ones(len(tokens), dtype=torch.bool, device=action.device),
-                        torch.zeros(
-                            self.max_action_tokens - len(tokens), dtype=torch.bool, device=action.device
-                        ),
+                        torch.zeros(pad_len, dtype=torch.bool, device=action.device),
                    ]
                )
+                code_mask = torch.nn.functional.pad(code_mask, (0, pad_len), value=False)
                # Pad tokens with zeros
-                tokens = torch.nn.functional.pad(tokens, (0, self.max_action_tokens - len(tokens)), value=0)
+                tokens = torch.nn.functional.pad(tokens, (0, pad_len), value=0)

            tokens_list.append(tokens)
            masks_list.append(mask)
+            code_masks_list.append(code_mask)

        # Stack into batched tensors
        tokens_batch = torch.stack(tokens_list, dim=0)  # (B, max_action_tokens)
        masks_batch = torch.stack(masks_list, dim=0)  # (B, max_action_tokens)
+        code_masks_batch = torch.stack(code_masks_list, dim=0)  # (B, max_action_tokens)

        # Remove batch dimension if input was single sample
        if single_sample:
            tokens_batch = tokens_batch.squeeze(0)
            masks_batch = masks_batch.squeeze(0)
+            code_masks_batch = code_masks_batch.squeeze(0)

        # Move to the same device as the input
        if device is not None:
            tokens_batch = tokens_batch.to(device)
            masks_batch = masks_batch.to(device)
+            code_masks_batch = code_masks_batch.to(device)

-        return tokens_batch, masks_batch
+        return tokens_batch, masks_batch, code_masks_batch

    def action(self, action: torch.Tensor) -> torch.Tensor:
        """
        This method is not used since we override __call__.
        Required by ActionProcessorStep ABC.
        """
-        tokens, _ = self._tokenize_action(action)
+        tokens, _, _ = self._tokenize_action(action)
        return tokens

    def get_config(self) -> dict[str, Any]:
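Review note on the hunk above: the new `code_mask` has to stay index-aligned with `tokens` through both the truncate and the pad branch. Below is a minimal sketch of that invariant, assuming plain Python lists in place of tensors. The helper name `pad_or_truncate` and the token ids are hypothetical and not part of this PR.

```python
def pad_or_truncate(tokens, code_mask, max_len, pad_id=0):
    """Return (tokens, attention_mask, code_mask), each of length max_len."""
    if len(tokens) > max_len:
        # Truncate tokens and code mask together so indices stay aligned.
        tokens = tokens[:max_len]
        code_mask = code_mask[:max_len]
        attention = [True] * max_len
    else:
        pad_len = max_len - len(tokens)
        attention = [True] * len(tokens) + [False] * pad_len
        # Padding positions are never action-code positions.
        code_mask = code_mask + [False] * pad_len
        tokens = tokens + [pad_id] * pad_len
    return tokens, attention, code_mask


# Hypothetical ids: BOS, two prompt tokens, three action-code tokens, one end token.
tokens = [2, 11, 12, 901, 902, 903, 99]
code_mask = [False, False, False, True, True, True, False]

padded = pad_or_truncate(tokens, code_mask, max_len=10)
truncated = pad_or_truncate(tokens, code_mask, max_len=5)
```

In the truncated case only the code positions that survive the cut remain marked, which is what the `code_mask[: self.max_action_tokens]` line in the diff is for.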
@@ -21,6 +21,8 @@ from lerobot.utils.import_utils import make_device_from_device_class
 from .config import RobotConfig
 from .robot import Robot

+logger = logging.getLogger(__name__)
+

 def make_robot_from_config(config: RobotConfig) -> Robot:
    # TODO(Steven): Consider just using the make_device_from_device_class for all types
@@ -118,7 +120,7 @@ def ensure_safe_goal_position(
            }

    if warnings_dict:
-        logging.warning(
+        logger.warning(
            "Relative goal position magnitude had to be clamped to be safe.\n"
            f"{pformat(warnings_dict, indent=4)}"
        )
@@ -175,17 +175,12 @@ def _push_to_hub(root: Path, cfg: AnnotationPipelineConfig) -> None:
        "repo_id": repo_id,
        "tag": version_tag,
        "repo_type": "dataset",
+        "exist_ok": True,
    }
    if revision is not None:
        tag_kwargs["revision"] = revision

    try:
-        from contextlib import suppress  # noqa: PLC0415
-
-        from huggingface_hub.errors import RevisionNotFoundError  # noqa: PLC0415
-
-        with suppress(RevisionNotFoundError):
-            api.delete_tag(repo_id, tag=version_tag, repo_type="dataset")
        api.create_tag(**tag_kwargs)
        print(f"[lerobot-annotate] tagged {repo_id} as {version_tag}", flush=True)
    except Exception as exc:  # noqa: BLE001
@@ -94,14 +94,6 @@ Merge multiple datasets from a list of local dataset paths:
        --operation.repo_ids "['pusht_train', 'pusht_val']" \
        --operation.roots "['/path/to/pusht_train', '/path/to/pusht_val']"

-Merge multiple datasets while keeping one file per source file (no video/data stitching):
-    lerobot-edit-dataset \
-        --new_repo_id lerobot/pusht_merged \
-        --operation.type merge \
-        --operation.repo_ids "['lerobot/pusht_train', 'lerobot/pusht_val']" \
-        --operation.concatenate_videos false \
-        --operation.concatenate_data false
-
 Remove camera feature:
    lerobot-edit-dataset \
        --repo_id lerobot/pusht \
@@ -265,9 +257,6 @@ class SplitConfig(OperationConfig):
 class MergeConfig(OperationConfig):
    repo_ids: list[str] | None = None
    roots: list[str] | None = None
-    # When False, keep one file per source file instead of packing into shards.
-    concatenate_videos: bool = True
-    concatenate_data: bool = True


@OperationConfig.register_subclass("remove_feature")
@@ -472,8 +461,6 @@ def handle_merge(cfg: EditDatasetConfig) -> None:
        datasets,
        output_repo_id=cfg.new_repo_id,
        output_dir=output_dir,
-        concatenate_videos=cfg.operation.concatenate_videos,
-        concatenate_data=cfg.operation.concatenate_data,
    )

    logging.info(f"Merged dataset saved to {output_dir}")
@@ -95,6 +95,67 @@ from lerobot.utils.utils import (
 )


+def _wrap_text_to_width(text: str, cv2, font, scale: float, thickness: int, max_width: int) -> list[str]:
+    """Greedy word-wrap using measured pixel width so text fits the frame."""
+    words = text.split()
+    lines: list[str] = []
+    current = ""
+    for word in words:
+        candidate = f"{current} {word}".strip()
+        (w, _), _ = cv2.getTextSize(candidate, font, scale, thickness)
+        if w > max_width and current:
+            lines.append(current)
+            current = word
+        else:
+            current = candidate
+    if current:
+        lines.append(current)
+    return lines or [""]
+
+
+def _annotate_eval_frames(
+    frames: np.ndarray, task: str | None, subtask: str | None
+) -> np.ndarray:
+    """Overlay the high-level task and predicted subtask onto rendered frames.
+
+    ``frames`` is ``(n_envs, H, W, C)`` uint8. Best-effort: if OpenCV isn't
+    available the frames are returned unchanged so eval never fails over a
+    visualization concern.
+    """
+    if frames.ndim != 4 or frames.shape[-1] != 3:
+        return frames
+    try:
+        import cv2  # noqa: PLC0415
+    except ImportError:
+        return frames
+
+    width = frames.shape[2]
+    font = cv2.FONT_HERSHEY_SIMPLEX
+    scale = 0.5
+    margin = 6
+    max_width = width - 2 * margin
+
+    lines: list[str] = []
+    if task:
+        lines += _wrap_text_to_width(f"Task: {task}", cv2, font, scale, 1, max_width)
+    if subtask:
+        lines += _wrap_text_to_width(f"Subtask: {subtask}", cv2, font, scale, 1, max_width)
+    if not lines:
+        return frames
+
+    out = frames.copy()
+    for i in range(out.shape[0]):
+        img = np.ascontiguousarray(out[i])
+        y = 18
+        for line in lines:
+            # Black outline then white fill so text stays legible on any scene.
+            cv2.putText(img, line, (margin, y), font, scale, (0, 0, 0), 3, cv2.LINE_AA)
+            cv2.putText(img, line, (margin, y), font, scale, (255, 255, 255), 1, cv2.LINE_AA)
+            y += 20
+        out[i] = img
+    return out
+
+
 def rollout(
    env: gym.vector.VectorEnv,
    policy: PreTrainedPolicy,
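Review note on `_wrap_text_to_width` in the hunk above: the greedy wrap can be exercised without OpenCV by swapping the pixel measurement for a stand-in. This is a sketch under that assumption. `wrap_to_width` and `fake_measure` (10 px per character) are hypothetical names, not part of this PR.

```python
def wrap_to_width(text, measure, max_width):
    """Greedy word-wrap: start a new line when the candidate gets too wide."""
    lines = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        # Only break if there is already something on the line; a single
        # over-long word is kept on its own line rather than dropped.
        if measure(candidate) > max_width and current:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines or [""]


def fake_measure(s):
    # Hypothetical stand-in for cv2.getTextSize: 10 px per character.
    return 10 * len(s)


wrapped = wrap_to_width("Task: put the red block in the bowl", fake_measure, 120)
long_word = wrap_to_width("supercalifragilistic", fake_measure, 50)
empty = wrap_to_width("", fake_measure, 120)
```

Because a word wider than `max_width` is never split, very long task strings without spaces can still overflow the frame.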
@@ -325,11 +386,42 @@ def eval_policy(
            return
        n_to_render_now = min(max_episodes_rendered - n_episodes_rendered, env.num_envs)
        if isinstance(env, gym.vector.SyncVectorEnv):
-            ep_frames.append(np.stack([env.envs[i].render() for i in range(n_to_render_now)]))  # noqa: B023
+            frames = np.stack([env.envs[i].render() for i in range(n_to_render_now)])  # noqa: B023
        elif hasattr(env, "call"):
            # Here we must render all frames and discard any we don't need.
            # Covers AsyncVectorEnv and _LazyAsyncVectorEnv (which wraps one).
-            ep_frames.append(np.stack(env.call("render")[:n_to_render_now]))
+            frames = np.stack(env.call("render")[:n_to_render_now])
+        else:
+            return
+
+        # Overlay the high-level task and (for hierarchical policies like
+        # pi052) the predicted low-level subtask onto each frame. Both are
+        # best-effort: missing values just skip that line.
+        try:
+            tasks = list(env.call("task_description"))
+        except (AttributeError, NotImplementedError):
+            try:
+                tasks = list(env.call("task"))
+            except (AttributeError, NotImplementedError):
+                tasks = None
+        # Per-env subtasks when available (batched hierarchical policies);
+        # fall back to the scalar last_subtask for single-env / other policies.
+        subtasks = getattr(policy, "last_subtasks", None)
+        subtask_scalar = getattr(policy, "last_subtask", None)
+        annotated = []
+        for i in range(frames.shape[0]):
+            if subtasks is not None and i < len(subtasks):
+                subtask_i = subtasks[i]
+            else:
+                subtask_i = subtask_scalar
+            annotated.append(
+                _annotate_eval_frames(
+                    frames[i : i + 1],
+                    tasks[i] if tasks is not None and i < len(tasks) else None,
+                    subtask_i,
+                )[0]
+            )
+        ep_frames.append(np.stack(annotated))

    if max_episodes_rendered > 0:
        video_paths: list[str] = []
@@ -20,6 +20,7 @@ Requires: pip install 'lerobot[training]'  (includes dataset + accelerate + wand

 import dataclasses
 import logging
+import os
 import time
 from contextlib import nullcontext
 from pprint import pformat
@@ -36,8 +37,6 @@ from tqdm import tqdm
 from lerobot.common.train_utils import (
    get_step_checkpoint_dir,
    get_step_identifier,
-    load_training_batch_size,
-    load_training_num_processes,
    load_training_state,
    save_checkpoint,
    update_last_checkpoint,
@@ -45,8 +44,7 @@ from lerobot.common.train_utils import (
 from lerobot.common.wandb_utils import WandBLogger
 from lerobot.configs import parser
 from lerobot.configs.train import TrainPipelineConfig
-from lerobot.datasets import EpisodeAwareSampler, compute_sampler_state
-from lerobot.datasets.factory import make_train_eval_datasets
+from lerobot.datasets import EpisodeAwareSampler, WeightedEpisodeAwareSampler, make_dataset
 from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
 from lerobot.optim.factory import make_optimizer_and_scheduler
 from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
@@ -102,9 +100,6 @@ def update_policy(
    start_time = time.perf_counter()
    policy.train()

-    if torch.cuda.is_available():
-        torch.cuda.reset_peak_memory_stats()
-
    # Compute sample weights if a weighter is provided
    sample_weights = None
    weight_stats = None
@@ -164,11 +159,173 @@ def update_policy(
    train_metrics.grad_norm = grad_norm.item()
    train_metrics.lr = optimizer.param_groups[0]["lr"]
    train_metrics.update_s = time.perf_counter() - start_time
-    if torch.cuda.is_available():
-        train_metrics.gpu_mem_gb = torch.cuda.max_memory_allocated() / (1024**3)
    return train_metrics, output_dict


+def _print_debug_text_predictions(
+    policy: Any, batch: dict[str, Any], step: int, n_samples: int = 5
+) -> None:
+    """Forward the current batch and print head-argmax vs label per supervised position.
+
+    Opt-in via ``LEROBOT_DEBUG_PREDS_EVERY=<step_interval>``. Only the
+    policy types that expose ``debug_text_predictions`` participate
+    (currently PI052); others are silently skipped. Pretty-prints up to
+    ``n_samples`` samples from the current batch, showing the prompt,
+    every supervised position's (label, prediction, ✓/✗), and a
+    per-sample token-accuracy summary — the cheapest "is text training
+    actually learning anything" signal.
+    """
+    # Accelerator/DDP wraps the policy in a ``module`` attribute and
+    # doesn't proxy custom methods through, so a naive
+    # ``hasattr(policy, "debug_text_predictions")`` returns False on the
+    # wrapper — and the helper would silently no-op. Walk through any
+    # ``.module`` indirection (DDP, FSDP, ``accelerator.prepare`` wrappers)
+    # to reach the raw policy that actually defines the method.
+    inner = policy
+    while hasattr(inner, "module") and not hasattr(inner, "debug_text_predictions"):
+        inner = inner.module
+    if not hasattr(inner, "debug_text_predictions"):
+        logging.warning(
+            "LEROBOT_DEBUG_PREDS_EVERY set but policy %s has no "
+            "debug_text_predictions method — skipping dump.",
+            type(inner).__name__,
+        )
+        return
+    try:
+        debug = inner.debug_text_predictions(batch, max_samples=n_samples)
+    except Exception as exc:  # noqa: BLE001
+        logging.warning("debug_text_predictions failed: %s", exc, exc_info=True)
+        return
+    if not debug:
+        logging.warning(
+            "debug_text_predictions returned no supervised samples — "
+            "current batch has no text labels."
+        )
+        return
+    policy = inner  # used below for select_message-style decoding parity
+
+    # Build a tokenizer for decoding — match training side exactly.
+    try:
+        from transformers import AutoTokenizer  # noqa: PLC0415
+
+        from lerobot.policies.pi052.text_processor_pi052 import (  # noqa: PLC0415
+            register_paligemma_loc_tokens,
+        )
+
+        tok_name = (
+            getattr(policy.config, "tokenizer_name", None) or "google/paligemma-3b-pt-224"
+        )
+        tokenizer = register_paligemma_loc_tokens(AutoTokenizer.from_pretrained(tok_name))
+    except Exception as exc:  # noqa: BLE001
+        logging.warning("debug preds: tokenizer load failed: %s", exc)
+        return
+
+    ids = debug["input_ids"]
+    labels = debug["labels"]
+    preds = debug["predictions"]
+    attn = debug["attention_mask"]
+
+    n = ids.shape[0]
+    print(
+        f"\n========== STEP {step} DEBUG PREDICTIONS ({n} samples) ==========",
+        flush=True,
+    )
+    for s in range(n):
+        a = attn[s].tolist()
+        real = sum(a)
+        sid = ids[s].tolist()
+        sl = labels[s].tolist()
+        sp = preds[s].tolist()
+        prompt = tokenizer.decode(sid[:real], skip_special_tokens=False)
+        print(f"\n  --- sample {s + 1}/{n} ---", flush=True)
+        print(f"  prompt: {prompt!r}", flush=True)
+
+        # Ground-truth target: the input ids at every supervised (label != -100) position.
+        sup_ids = [int(sid[i]) for i in range(real) if sl[i] != -100]
+        if sup_ids:
+            print(
+                f"  target  (ground truth)        : {tokenizer.decode(sup_ids, skip_special_tokens=False)!r}",
+                flush=True,
+            )
+
+        # Training-side teacher-forced argmax on the same prompt+target.
+        n_sup = n_ok = 0
+        teacher_ids: list[int] = []
+        for i in range(1, real):
+            label = sl[i]
+            if label == -100:
+                continue
+            n_sup += 1
+            pred = int(sp[i - 1])  # the prediction at position i-1 targets the token at i
+            teacher_ids.append(pred)
+            if label == pred:
+                n_ok += 1
+        teacher_text = (
+            tokenizer.decode(teacher_ids, skip_special_tokens=False) if teacher_ids else ""
+        )
+        acc = n_ok / max(n_sup, 1)
+        print(
+            f"  training argmax (teacher-fed) : {teacher_text!r}   acc={n_ok}/{n_sup}={acc:.1%}",
+            flush=True,
+        )
+    print("=" * 60 + "\n", flush=True)
+
+
+def _build_vqa_oversample_weights(dataset: Any, target_fraction: float) -> "torch.Tensor | None":
+    """Build per-frame sampling weights that oversample VQA-annotated frames.
+
+    Scans the dataset's ``language_events`` column for frames carrying a
+    ``vqa``-style annotation and returns a weight tensor (length == total
+    dataset frames) such that, under multinomial sampling, VQA frames make up
+    roughly ``target_fraction`` of the training stream.
+
+    Returns ``None`` (⇒ fall back to uniform episode-aware sampling) when VQA
+    frames cannot be detected or there are none.
+    """
+    if not 0.0 < target_fraction < 1.0:
+        logging.warning(
+            "vqa_target_fraction must be in (0, 1); got %s — VQA oversampling disabled.",
+            target_fraction,
+        )
+        return None
+    hf = getattr(dataset, "hf_dataset", None)
+    if hf is None or "language_events" not in getattr(hf, "column_names", []):
+        logging.warning(
+            "Dataset has no `language_events` column — VQA oversampling disabled."
+        )
+        return None
+
+    events_col = hf["language_events"]
+    n_frames = len(events_col)
+    is_vqa = torch.zeros(n_frames, dtype=torch.bool)
+    for i, rows in enumerate(events_col):
+        if rows and any((row or {}).get("style") == "vqa" for row in rows):
+            is_vqa[i] = True
+
+    n_vqa = int(is_vqa.sum())
+    if n_vqa == 0:
+        logging.warning("No `vqa` annotations found in the dataset — VQA oversampling disabled.")
+        return None
+    n_other = n_frames - n_vqa
+
+    # Solve target = (n_vqa·w) / (n_vqa·w + n_other) for the VQA weight w.
+    # Clamp to ≥ 1 so VQA frames are never *down*-weighted below uniform.
+    weight = (target_fraction * n_other) / ((1.0 - target_fraction) * max(n_vqa, 1))
+    weight = max(weight, 1.0)
+    weights = torch.ones(n_frames, dtype=torch.double)
+    weights[is_vqa] = weight
+    logging.info(
+        "VQA oversampling: %d/%d frames carry a `vqa` annotation (%.2f%%); "
+        "weighting them x%.2f to target ~%.0f%% of the training stream.",
+        n_vqa,
+        n_frames,
+        100.0 * n_vqa / n_frames,
+        weight,
+        100.0 * target_fraction,
+    )
+    return weights
+
+
@parser.wrap()
 def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    """
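Review note on `_build_vqa_oversample_weights` in the hunk above: the weight formula can be checked in isolation. This sketch copies the arithmetic from the diff and computes the fraction it implies under ideal multinomial sampling. The helper names and frame counts are hypothetical.

```python
def vqa_weight(n_vqa, n_other, target_fraction):
    """Solve target = (n_vqa*w) / (n_vqa*w + n_other) for w, clamped to >= 1."""
    w = (target_fraction * n_other) / ((1.0 - target_fraction) * max(n_vqa, 1))
    return max(w, 1.0)


def expected_vqa_fraction(n_vqa, n_other, w):
    """Share of draws that land on VQA frames when each VQA frame has weight w."""
    return (n_vqa * w) / (n_vqa * w + n_other)


# Rare VQA frames: 1% of the data, target 20% of the stream.
w_rare = vqa_weight(n_vqa=100, n_other=9900, target_fraction=0.2)
frac_rare = expected_vqa_fraction(100, 9900, w_rare)

# VQA frames already above target: the clamp keeps them at uniform weight.
w_common = vqa_weight(n_vqa=5000, n_other=5000, target_fraction=0.2)
frac_common = expected_vqa_fraction(5000, 5000, w_common)
```

Because of the clamp, `target_fraction` acts as a floor on the VQA share. A dataset that is already VQA-heavy keeps its natural share rather than being pulled down to the target.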
@@ -198,15 +355,26 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    # We set step_scheduler_with_optimizer=False to prevent accelerate from adjusting the lr_scheduler steps based on the num_processes
    # We set find_unused_parameters=True to handle models with conditional computation
    if accelerator is None:
-        from accelerate.utils import DistributedDataParallelKwargs
+        from datetime import timedelta
+
+        from accelerate.utils import DistributedDataParallelKwargs, InitProcessGroupKwargs

        ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
+        # Bump the c10d store-get / barrier timeout so the rank-0-only
+        # ``make_dataset`` block below doesn't trigger a barrier crash on
+        # large datasets. Default is 10 min (``store->get`` 600 s); a
+        # 32 k-episode v3 dataset (e.g. ``robocasa_pretrain_human300_v4``)
+        # spends >13 min on rank 0 building the episode/frame index
+        # while ranks 1-N idle at ``wait_for_everyone()`` and crash with
+        # ``DistBackendError: ... wait timeout after 600000ms``. 2 h is
+        # plenty of headroom; fast paths are unaffected.
+        ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
        # Accelerate auto-detects the device based on the available hardware and ignores the policy.device setting.
        # Force the device to be CPU when the active config's device is set to CPU (works for both policy and reward model training).
        force_cpu = cfg.trainable_config.device == "cpu"
        accelerator = Accelerator(
            step_scheduler_with_optimizer=False,
-            kwargs_handlers=[ddp_kwargs],
+            kwargs_handlers=[ddp_kwargs, ipg_kwargs],
            cpu=force_cpu,
        )

@@ -240,24 +408,22 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        torch.backends.cudnn.benchmark = True
    torch.backends.cuda.matmul.allow_tf32 = True

-    # Dataset loading synchronization: the global main process downloads once to the shared
-    # dataset root, then a barrier lets every other rank read the already-populated copy.
-    # LeRobotDataset skips its snapshot_download when try_load() succeeds, so no rank re-downloads.
+    # Dataset loading synchronization: main process downloads first to avoid race conditions
    if is_main_process:
        logging.info("Creating dataset")
-        dataset, eval_dataset = make_train_eval_datasets(cfg)
+        dataset = make_dataset(cfg)

    accelerator.wait_for_everyone()

-    # Other ranks read from the shared copy populated by the main process.
+    # Now all other processes can safely load the dataset
    if not is_main_process:
-        dataset, eval_dataset = make_train_eval_datasets(cfg)
+        dataset = make_dataset(cfg)

    # Create environment used for evaluating checkpoints during training on simulation data.
    # On real-world data, no need to create an environment as evaluations are done outside train.py,
    # using the eval.py instead, with gym_dora environment and dora-rs.
    eval_env = None
-    if cfg.env_eval_freq > 0 and cfg.env is not None and is_main_process:
+    if cfg.eval_freq > 0 and cfg.env is not None and is_main_process:
        logging.info("Creating env")
        eval_env = make_env(cfg.env, n_envs=cfg.eval.batch_size, use_async_envs=cfg.eval.use_async_envs)

@@ -302,6 +468,27 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):

    active_cfg = cfg.trainable_config
    processor_pretrained_path = active_cfg.pretrained_path
+    # pi052: even when loading pretrained weights, build the processors
+    # from the current pi052 config so the recipe text-label and FAST
+    # action-label steps are generated and not silently swapped for the
+    # checkpoint's older processor stack.
+    if cfg.policy.type == "pi052" and processor_pretrained_path is not None and not cfg.resume:
+        logging.warning(
+            "pi052 is loading pretrained weights from %s, but building processors from the current "
+            "pi052 config so recipe text labels and FAST action labels are generated.",
+            processor_pretrained_path,
+        )
+        processor_pretrained_path = None
+    if (
+        getattr(active_cfg, "use_relative_actions", False)
+        and processor_pretrained_path is not None
+        and not cfg.resume
+    ):
+        logging.warning(
+            "use_relative_actions=true with pretrained processors can skip relative transforms if "
+            "the checkpoint processors do not define them. Building processors from current policy config."
+        )
+        processor_pretrained_path = None

    processor_kwargs = {}
    if (processor_pretrained_path and not cfg.resume) or not processor_pretrained_path:
@@ -310,6 +497,14 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    if cfg.is_reward_model_training:
        processor_kwargs["dataset_meta"] = dataset.meta

+    # For pi052 (and any future policy that auto-fits part of its
+    # preprocessing per-dataset), pass the dataset repo id so the
+    # processor factory can locate/refresh dataset-specific artifacts
+    # (e.g. fitted FAST tokenizers per Pertsch et al. 2025 [64],
+    # π0.5 §III.C).
+    if cfg.policy.type == "pi052":
+        processor_kwargs["dataset_repo_id"] = cfg.dataset.repo_id
+
    if not cfg.is_reward_model_training and processor_pretrained_path is not None:
        preprocessor_overrides = {
            "device_processor": {"device": device.type},
@@ -394,47 +589,31 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        logging.info(f"{num_total_params=} ({format_big_number(num_total_params)})")

    # create dataloader for offline training
-    if not cfg.dataset.streaming:
-        # All non-streaming (map-style) datasets use EpisodeAwareSampler.
-        # The order is a pure function of (seed, epoch), so every rank independently produces the
-        # same permutation. accelerate then shards it disjointly across ranks via BatchSamplerShard
-        # without needing a `generator` attribute to synchronize an RNG, and resume is sample-exact.
+    if hasattr(active_cfg, "drop_n_last_frames"):
        shuffle = False
-        sampler = EpisodeAwareSampler(
-            dataset.meta.episodes["dataset_from_index"],
-            dataset.meta.episodes["dataset_to_index"],
-            episode_indices_to_use=dataset.episodes,
-            drop_n_last_frames=getattr(active_cfg, "drop_n_last_frames", 0),
-            shuffle=True,
-            seed=cfg.seed if cfg.seed is not None else 0,
-        )
-        if cfg.resume and step > 0:
-            # The resume offset depends on the (num_processes, batch_size) that produced `step`, so
-            # use the values recorded in the checkpoint (falling back to the current ones for older
-            # ckpts that did not store them).
-            saved_num_processes = load_training_num_processes(cfg.checkpoint_path)
-            saved_batch_size = load_training_batch_size(cfg.checkpoint_path)
-            ckpt_num_processes = saved_num_processes or accelerator.num_processes
-            ckpt_batch_size = saved_batch_size or cfg.batch_size
-            if is_main_process and saved_num_processes not in (None, accelerator.num_processes):
-                logging.warning(
-                    f"Resuming with num_processes={accelerator.num_processes} but the checkpoint was "
-                    f"written with num_processes={saved_num_processes}. The data order resumes at the "
-                    "right epoch/offset, but per-rank sample-exactness requires the same world size."
-                )
-            if is_main_process and saved_batch_size not in (None, cfg.batch_size):
-                logging.warning(
-                    f"Resuming with batch_size={cfg.batch_size} but the checkpoint was written with "
-                    f"batch_size={saved_batch_size}. The data order resumes at the right epoch/offset, "
-                    "but per-rank sample-exactness requires the same batch size."
-                )
-            sampler_state = compute_sampler_state(step, len(sampler), ckpt_batch_size, ckpt_num_processes)
-            sampler.load_state_dict(sampler_state)
-            if is_main_process:
-                logging.info(
-                    f"Resuming data order at epoch {sampler_state['epoch']}, "
-                    f"sample {sampler_state['start_index']}"
-                )
+        from_indices = dataset.meta.episodes["dataset_from_index"]
+        to_indices = dataset.meta.episodes["dataset_to_index"]
+        # When `vqa_target_fraction` is set, oversample VQA-annotated
+        # frames via a weighted sampler; otherwise plain episode-aware.
+        vqa_weights = None
+        if cfg.vqa_target_fraction is not None and not cfg.dataset.streaming:
+            vqa_weights = _build_vqa_oversample_weights(dataset, cfg.vqa_target_fraction)
+        if vqa_weights is not None:
+            sampler = WeightedEpisodeAwareSampler(
+                from_indices,
+                to_indices,
+                vqa_weights,
+                episode_indices_to_use=dataset.episodes,
+                drop_n_last_frames=active_cfg.drop_n_last_frames,
+            )
+        else:
+            sampler = EpisodeAwareSampler(
+                from_indices,
+                to_indices,
+                episode_indices_to_use=dataset.episodes,
+                drop_n_last_frames=active_cfg.drop_n_last_frames,
+                shuffle=True,
+            )
    else:
        shuffle = True
        sampler = None
@@ -456,33 +635,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        persistent_workers=cfg.persistent_workers and cfg.num_workers > 0,
    )

-    # Build eval dataloader if a held-out split exists
-    eval_dataloader = None
-    if eval_dataset is not None:
-        eval_ds = eval_dataset
-        if cfg.max_eval_samples > 0 and hasattr(eval_dataset, "hf_dataset"):
-            task_arr = eval_dataset.hf_dataset.data.column("task_index").to_numpy()
-            unique_tasks = sorted(set(task_arr.tolist()))
-            per_task = max(1, cfg.max_eval_samples // len(unique_tasks))
-            selected: list[int] = []
-            for t in unique_tasks:
-                frames = (task_arr == t).nonzero()[0][:per_task]
-                selected.extend(frames.tolist())
-            eval_ds = torch.utils.data.Subset(eval_dataset, selected)
-
-        eval_collate_fn = lerobot_collate_fn if dataset.meta.has_language_columns else None
-        eval_dataloader = torch.utils.data.DataLoader(
-            eval_ds,
-            batch_size=cfg.batch_size,
-            shuffle=False,
-            num_workers=cfg.num_workers,
-            pin_memory=device.type == "cuda",
-            drop_last=False,
-            collate_fn=eval_collate_fn,
-            prefetch_factor=cfg.prefetch_factor if cfg.num_workers > 0 else None,
-            persistent_workers=cfg.persistent_workers and cfg.num_workers > 0,
-        )
-
    # Prepare everything with accelerator
    accelerator.wait_for_everyone()
    policy, optimizer, dataloader, lr_scheduler = accelerator.prepare(
@@ -492,23 +644,61 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):

    policy.train()

+    # ------------------------------------------------------------------
+    # EMA setup
+    # ------------------------------------------------------------------
+    # Shadow copy of the trainable params for late-training averaging
+    # (Chi et al. 2023 Diffusion Policy §V.D; openpi JAX trainer ships
+    # this with decay=0.999 for pi05_libero; openpi PyTorch port and
+    # LeRobot main both skip it). Off by default; opt in with
+    # ``--ema.enable=true``. Implemented via ema-pytorch
+    # (https://github.com/lucidrains/ema-pytorch) — the standard PyTorch
+    # EMA library, also used by lucidrains' diffusion repos.
+    ema = None
+    if cfg.ema.enable and is_main_process:
+        from ema_pytorch import EMA  # noqa: PLC0415
+
+        ema = EMA(
+            accelerator.unwrap_model(policy),
+            beta=cfg.ema.decay,
+            update_after_step=cfg.ema.warmup_steps,
+            update_every=1,  # update on every ema.update() call
+            # Don't register the live model as an ema submodule — accelerator
+            # already owns its lifecycle, and double-registration would
+            # double-count its params in ``ema.state_dict()``.
+            include_online_model=False,
+        )
+        ema.to(accelerator.device)
+        logging.info(
+            "EMA enabled (ema-pytorch): beta=%g, update_after_step=%d, "
+            "use_for_eval=%s, use_for_wandb_examples=%s",
+            cfg.ema.decay,
+            cfg.ema.warmup_steps,
+            cfg.ema.use_for_eval,
+            cfg.ema.use_for_wandb_examples,
+        )
+
+        # Resume the EMA shadow if a previous run wrote one.
+        if cfg.checkpoint_path is not None:
+            ema_path = cfg.checkpoint_path / "training_state" / "ema_state.pt"
+            if ema_path.exists():
+                logging.info("Resuming EMA shadow from %s", ema_path)
+                try:
+                    ema.load_state_dict(torch.load(ema_path, map_location=accelerator.device))
+                except Exception as exc:  # noqa: BLE001
+                    logging.warning(
+                        "Failed to load EMA shadow (%s) — restarting EMA from "
+                        "current live weights",
+                        exc,
+                    )
+
    train_metrics = {
-        # Per-rank loss reflects only one shard of the global batch; mean recovers the loss DDP
-        # is actually optimizing. grad_norm and lr are already identical on every rank (post
-        # gradient sync / deterministic scheduler) so reducing them would be a no-op collective.
-        "loss": AverageMeter("loss", ":.3f", reduction="mean"),
+        "loss": AverageMeter("loss", ":.3f"),
        "grad_norm": AverageMeter("grdn", ":.3f"),
        "lr": AverageMeter("lr", ":0.1e"),
-        # Report the slowest rank for bottleneck-style timings so multi-GPU runs surface the
-        # true straggler instead of rank 0's view.
-        "update_s": AverageMeter("updt_s", ":.3f", reduction="max"),
-        "dataloading_s": AverageMeter("data_s", ":.3f", reduction="max"),
-        # Derived from the post-reduce max step time; set once per log window on the main rank.
-        "samples_per_s": AverageMeter("smp/s", ":.0f"),
+        "update_s": AverageMeter("updt_s", ":.3f"),
+        "dataloading_s": AverageMeter("data_s", ":.3f"),
    }
-    if torch.cuda.is_available():
-        # max() because headroom is gated by the worst-case rank.
-        train_metrics["gpu_mem_gb"] = AverageMeter("mem_gb", ":.2f", reduction="max")

    # Keep global batch size for logging; MetricsTracker handles world size internally.
    effective_batch_size = cfg.batch_size * accelerator.num_processes
@@ -554,58 +744,97 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
            sample_weighter=sample_weighter,
        )

+        # EMA update: pull one step of the live weights into the shadow.
+        # Runs only on the main process (the shadow lives there); other
+        # ranks rely on the live model staying in sync via accelerator.
+        # ``ema-pytorch`` holds an internal reference to the online model
+        # (set at construction), so ``ema.update()`` takes no args.
+        if ema is not None:
+            ema.update()
+
        # Note: eval and checkpoint happens *after* the `step`th training update has completed, so we
        # increment `step` here.
        step += 1
        if is_main_process:
            progbar.update(1)
        train_tracker.step()
-        is_log_step = cfg.log_freq > 0 and step % cfg.log_freq == 0
+        is_log_step = cfg.log_freq > 0 and step % cfg.log_freq == 0 and is_main_process
        is_saving_step = step % cfg.save_freq == 0 or step == cfg.steps
-        is_env_eval_step = cfg.env_eval_freq > 0 and step % cfg.env_eval_freq == 0
-        is_eval_step = cfg.eval_steps > 0 and eval_dataloader is not None and step % cfg.eval_steps == 0
+        is_eval_step = cfg.eval_freq > 0 and step % cfg.eval_freq == 0
+
+        # Optional periodic head-prediction dump for the LM head:
+        # ``LEROBOT_DEBUG_PREDS_EVERY=1000`` prints 5 samples + per-token
+        # (label, argmax, ✓/✗) every 1000 steps. Cheap diagnostic to see
+        # whether the text head is actually learning what we expect, vs
+        # collapsing to a fixed token. Refilling the recipe-sample dump
+        # budget at the same cadence also redumps the raw input shapes.
+        _debug_preds_every = int(os.environ.get("LEROBOT_DEBUG_PREDS_EVERY", "0"))
+        if (
+            _debug_preds_every > 0
+            and step % _debug_preds_every == 0
+            and is_main_process
+        ):
+            try:
+                from lerobot.policies.pi052 import text_processor_pi052 as _tp  # noqa: PLC0415
+
+                _tp._DUMPED_SO_FAR = 0
+                _tp._DUMP_BUDGET = max(_tp._DUMP_BUDGET, 5)
+            except Exception:  # noqa: BLE001
+                pass
+            _print_debug_text_predictions(policy, batch, step, n_samples=5)

        if is_log_step:
-            # Collective reduce must run on every rank, before the main-process gate below.
-            train_tracker.reduce_across_ranks()
-            if is_main_process:
-                # Cluster-wide throughput, derived from the already-reduced (max) step time so it
-                # reflects the slowest rank — which is what actually gates the next iteration.
-                step_time = train_tracker.update_s.avg + train_tracker.dataloading_s.avg
-                if step_time > 0:
-                    train_tracker.samples_per_s = effective_batch_size / step_time
-                logging.info(train_tracker)
-                if wandb_logger:
-                    wandb_log_dict = train_tracker.to_dict()
-                    if output_dict:
-                        wandb_log_dict.update(output_dict)
-                    # Log sample weighting statistics if enabled
-                    if sample_weighter is not None:
-                        weighter_stats = sample_weighter.get_stats()
-                        wandb_log_dict.update({f"sample_weighting/{k}": v for k, v in weighter_stats.items()})
-                    wandb_logger.log_dict(wandb_log_dict, step)
+            logging.info(train_tracker)
+            if wandb_logger:
+                wandb_log_dict = train_tracker.to_dict()
+                if output_dict:
+                    wandb_log_dict.update(output_dict)
+                # Log sample weighting statistics if enabled
+                if sample_weighter is not None:
+                    weighter_stats = sample_weighter.get_stats()
+                    wandb_log_dict.update({f"sample_weighting/{k}": v for k, v in weighter_stats.items()})
+                # EMA observability: ``ema.step`` is the count of
+                # ``ema.update()`` calls (= optimizer steps once EMA is
+                # enabled); ``ema.initted`` flips to True once we've
+                # crossed ``update_after_step``.
+                if ema is not None:
+                    wandb_log_dict["ema/step"] = int(ema.step.item())
+                    wandb_log_dict["ema/initted"] = float(ema.initted.item())
+                    wandb_log_dict["ema/beta"] = float(cfg.ema.decay)
+                wandb_logger.log_dict(wandb_log_dict, step)
            train_tracker.reset_averages()

-        if is_eval_step:
-            policy.eval()
-            eval_loss_sum = 0.0
-            n_eval_batches = 0
-            with torch.no_grad(), accelerator.autocast():
-                for eval_batch in eval_dataloader:
-                    for cam_key in dataset.meta.camera_keys:
-                        if cam_key in eval_batch and eval_batch[cam_key].dtype == torch.uint8:
-                            eval_batch[cam_key] = eval_batch[cam_key].to(dtype=torch.float32) / 255.0
-                    eval_batch = preprocessor(eval_batch)
-                    loss, _ = policy.forward(eval_batch)
-                    eval_loss_sum += loss.item()
-                    n_eval_batches += 1
-            eval_loss = eval_loss_sum / max(n_eval_batches, 1)
-            policy.train()
-
-            if is_main_process:
-                logging.info(f"step {step}: eval_loss={eval_loss:.4f}")
-                if wandb_logger:
-                    wandb_logger.log_dict({"eval_loss": eval_loss}, step=step, mode="eval")
+        # Periodic training-example dump to wandb (camera images + text
+        # fields + action endpoints). Opt-in via ``--wandb.log_examples_freq``;
+        # independent of ``--log_freq`` so you can keep scalar logs frequent
+        # and the heavier visual dump rare (e.g. every 5000 steps).
+        if (
+            wandb_logger is not None
+            and cfg.wandb.log_examples_freq > 0
+            and step % cfg.wandb.log_examples_freq == 0
+            and is_main_process
+        ):
+            try:
+                # Optionally use the EMA shadow model directly for the
+                # predicted-action columns (matches what eval / deployment
+                # would see). ``ema-pytorch`` exposes the shadow as a
+                # full ``nn.Module`` at ``ema.ema_model``, so we just
+                # pass that instead of swap-and-restore.
+                target_policy = (
+                    ema.ema_model
+                    if (ema is not None and cfg.ema.use_for_wandb_examples)
+                    else accelerator.unwrap_model(policy)
+                )
+                wandb_logger.log_training_examples(
+                    batch=batch,
+                    step=step,
+                    camera_keys=list(dataset.meta.camera_keys),
+                    n_samples=cfg.wandb.log_examples_n,
+                    policy=target_policy,
+                    predict_actions=cfg.wandb.log_examples_predict_actions,
+                )
+            except Exception as exc:  # noqa: BLE001
+                logging.warning("wandb log_training_examples failed: %s", exc)

        if cfg.save_checkpoint and is_saving_step:
            if is_main_process:
@@ -620,23 +849,43 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
                    scheduler=lr_scheduler,
                    preprocessor=preprocessor,
                    postprocessor=postprocessor,
-                    num_processes=accelerator.num_processes,
-                    batch_size=cfg.batch_size,
                )
                update_last_checkpoint(checkpoint_dir)
+                # Save the EMA shadow alongside the training state so a
+                # resumed run picks up exactly where the live EMA left off.
+                # ``ema-pytorch.state_dict()`` returns the full shadow
+                # nn.Module's state dict + step/initted buffers; saved as
+                # .pt (the rest of training_state mixes formats already).
+                if ema is not None:
+                    try:
+                        ema_path = checkpoint_dir / "training_state" / "ema_state.pt"
+                        ema_path.parent.mkdir(parents=True, exist_ok=True)
+                        torch.save(ema.state_dict(), ema_path)
+                    except Exception as exc:  # noqa: BLE001
+                        logging.warning("Failed to save EMA shadow: %s", exc)
                if wandb_logger:
                    wandb_logger.log_policy(checkpoint_dir)

            accelerator.wait_for_everyone()

-        if cfg.env and is_env_eval_step:
+        if cfg.env and is_eval_step:
            if is_main_process:
                step_id = get_step_identifier(step, cfg.steps)
                logging.info(f"Eval policy at step {step}")
+                # Use the EMA shadow model for eval when enabled —
+                # standard practice for diffusion-style policies (~1–3%
+                # lift on closed-loop success). ``ema.ema_model`` is a
+                # full nn.Module clone, so we just pass it through; no
+                # swap/restore on the live policy needed.
+                eval_target_policy = (
+                    ema.ema_model
+                    if (ema is not None and cfg.ema.use_for_eval)
+                    else accelerator.unwrap_model(policy)
+                )
                with torch.no_grad(), accelerator.autocast():
                    eval_info = eval_policy_all(
                        envs=eval_env,  # dict[suite][task_id] -> vec_env
-                        policy=accelerator.unwrap_model(policy),
+                        policy=eval_target_policy,
                        env_preprocessor=env_preprocessor,
                        env_postprocessor=env_postprocessor,
                        preprocessor=preprocessor,
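
Note on the EMA hunks above: the shadow that `ema.update()` advances and that eval optionally reads is an exponential moving average of the live weights. A minimal plain-Python sketch of that update rule follows. It assumes a fixed decay and omits the `warmup_steps` / `update_after_step` handling referenced in the diff; `ema_update`, the toy weights and `beta = 0.5` are hypothetical names and values chosen for illustration, and this is not the `ema-pytorch` implementation.

```python
# Illustrative only: a plain-Python EMA shadow with a fixed decay.


def ema_update(shadow: list[float], live: list[float], beta: float) -> list[float]:
    """Return the new shadow: beta * shadow + (1 - beta) * live, element-wise."""
    if len(shadow) != len(live):
        raise ValueError("shadow and live must have the same length")
    return [beta * s + (1.0 - beta) * w for s, w in zip(shadow, live)]


live_weights = [1.0, -2.0]  # stands in for the live policy parameters
shadow = [0.0, 0.0]  # stands in for the shadow copy kept for eval
beta = 0.5  # hypothetical decay, picked so the arithmetic is exact in floats

history = []
for _ in range(3):
    # One call per optimizer step, mirroring where the diff calls ema.update().
    shadow = ema_update(shadow, live_weights, beta)
    history.append(list(shadow))
```

With a decay close to 1 the shadow moves slowly, which is why the diff also persists it in `ema_state.pt`: restarting the average from the live weights on resume would discard that accumulated smoothing.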
@@ -13,213 +13,77 @@
 [SmolVLA](https://huggingface.co/papers/2506.01844) is a compact, efficient vision-language-action model that achieves competitive performance at reduced computational costs and can be deployed on consumer-grade hardware.
 {% elif model_name == "act" %}
 [Action Chunking with Transformers (ACT)](https://huggingface.co/papers/2304.13705) is an imitation-learning method that predicts short action chunks instead of single steps. It learns from teleoperated data and often achieves high success rates.
+{% elif model_name == "tdmpc" %}
+[TD-MPC](https://huggingface.co/papers/2203.04955) combines model-free and model-based approaches to improve sample efficiency and performance in continuous control tasks by using a learned latent dynamics model and terminal value function.
 {% elif model_name == "diffusion" %}
 [Diffusion Policy](https://huggingface.co/papers/2303.04137) treats visuomotor control as a generative diffusion process, producing smooth, multi-step action trajectories that excel at contact-rich manipulation.
+{% elif model_name == "vqbet" %}
+[VQ-BET](https://huggingface.co/papers/2403.03181) combines vector-quantised action tokens with Behaviour Transformers to discretise control and achieve data-efficient imitation across diverse skills.
 {% elif model_name == "pi0" %}
-[π₀ (Pi0)](https://www.physicalintelligence.company/blog/pi0) is a general-purpose robot foundation model from Physical Intelligence: a generalist Vision-Language-Action policy that understands visual inputs, interprets natural language instructions, and controls a variety of different robots across diverse tasks. The LeRobot implementation is adapted from their open-source OpenPI repository.
+**π₀ (Pi0)**
+
+π₀ is a Vision-Language-Action model for general robot control, from Physical Intelligence. The LeRobot implementation is adapted from their open source OpenPI repository.
+
+**Model Overview**
+
+π₀ represents a breakthrough in robotics as the first general-purpose robot foundation model developed by Physical Intelligence. Unlike traditional robots that are narrow specialists programmed for repetitive motions, π₀ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.
+
+For more details, see the [Physical Intelligence π₀ blog post](https://www.physicalintelligence.company/blog/pi0).
 {% elif model_name == "pi05" %}
-[π₀.₅ (Pi05)](https://www.physicalintelligence.company/blog/pi05) is a Vision-Language-Action model from Physical Intelligence designed for open-world generalization: it evolves π₀ to generalize to entirely new environments and situations that were never seen during training. The LeRobot implementation is adapted from their open-source OpenPI repository.
-{% elif model_name == "molmoact2" %}
-[MolmoAct2](https://allenai.org/blog/molmoact2) is an open robotics foundation model from the Allen Institute for AI (Ai2) that maps camera images and language instructions to robot action chunks. The LeRobot implementation supports training and evaluation of the regular MolmoAct2 model.
-{% elif model_name == "vla_jepa" %}
-[VLA-JEPA](https://arxiv.org/abs/2602.10098) is a Vision-Language-Action model that combines a Qwen3-VL language backbone with a self-supervised video world model (V-JEPA2) and a flow-matching DiT action head.
+**π₀.₅ (Pi05) Policy**
+
+π₀.₅ is a Vision-Language-Action model with open-world generalization, from Physical Intelligence. The LeRobot implementation is adapted from their open source OpenPI repository.
+
+**Model Overview**
+
+π₀.₅ represents a significant evolution from π₀, developed by Physical Intelligence to address a big challenge in robotics: open-world generalization. While robots can perform impressive tasks in controlled environments, π₀.₅ is designed to generalize to entirely new environments and situations that were never seen during training.
+
+For more details, see the [Physical Intelligence π₀.₅ blog post](https://www.physicalintelligence.company/blog/pi05).
 {% elif model_name == "gaussian_actor" %}
 This is a Gaussian Actor policy (Gaussian policy with a tanh squash) — the policy-side component used by [Soft Actor-Critic (SAC)](https://huggingface.co/papers/1801.01290) and related maximum-entropy continuous-control algorithms.
-{% elif model_name == "pi0_fast" %}
-[π₀-FAST (Pi0-FAST)](https://www.physicalintelligence.company/research/fast) is a Vision-Language-Action model for general robot control, from Physical Intelligence. It models continuous robot actions with autoregressive next-token prediction using FAST (Frequency-space Action Sequence Tokenization), training up to 5x faster than diffusion-based π₀.
-{% elif model_name == "eo1" %}
-[EO-1](https://huggingface.co/papers/2508.21112) is a Vision-Language-Action model for general robot control. It pairs a Qwen2.5-VL backbone for vision-language understanding with a continuous flow-matching action head that denoises action chunks.
-{% elif model_name == "groot" %}
-[GR00T N1.5](https://github.com/NVIDIA/Isaac-GR00T) is an open, cross-embodiment foundation model from NVIDIA for generalized humanoid robot reasoning and skills. It takes language and images as input and uses a flow-matching action transformer to predict actions conditioned on vision, language, and proprioception.
-{% elif model_name == "multi_task_dit" %}
-[Multi-Task Diffusion Transformer (DiT)](https://huggingface.co/papers/2507.05331) extends Diffusion Policy with a large Diffusion Transformer and text + vision conditioning for multi-task robot learning. It supports both diffusion and flow-matching objectives and reaches high dexterity with only ~450M parameters.
-{% elif model_name == "wall_x" %}
-[WALL-OSS](https://huggingface.co/papers/2509.11766) is an open-source foundation model for embodied intelligence from XSquare Robot. Built on Qwen2.5-VL, it uses a tightly-coupled multimodal architecture with flow matching to unify semantic reasoning and high-frequency action generation for cross-embodiment control.
-{% elif model_name == "xvla" %}
-[X-VLA](https://huggingface.co/papers/2510.10274) is a soft-prompted, flow-matching Vision-Language-Action framework that treats each robot or hardware setup as a "task" encoded with a small set of learnable Soft Prompt embeddings, letting a single model reconcile diverse robot morphologies, sensors, and action spaces.
 {% else %}
-This is a **{{ model_name }}** policy trained with [LeRobot](https://github.com/huggingface/lerobot).
+_Model type not recognized — please update this template._
 {% endif %}
-{% set diagrams = {
-  "smolvla": "https://cdn-uploads.huggingface.co/production/uploads/640e21ef3c82bd463ee5a76d/aooU0a3DMtYmy_1IWMaIM.png",
-  "pi0": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pi0%20(1).png",
-  "pi0_fast": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pifast.png",
-  "eo1": "https://huggingface.co/datasets/HaomingSong/lerobot-documentation-images/resolve/main/lerobot/eo_pipeline.png",
-  "groot": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-groot-paper1%20(1).png",
-  "wall_x": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/walloss-lerobot-paper.png",
-  "xvla": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/xvla-architecture.png"
-} %}
-{% if diagrams.get(model_name) %}
-<p align="center">
-  <img src="{{ diagrams[model_name] }}" alt="{{ model_name }} architecture" width="85%"/>
-</p>
-{% endif %}
-
-<!-- A short demo is worth more than any description! Record a GIF/video of the policy
-running on your robot, upload it to this repo, and embed it here:
-<p align="center">
-  <img src="https://huggingface.co/<hf_user>/<policy_repo_id>/resolve/main/demo.gif" width="60%"/>
-</p>
-->

 This policy has been trained and pushed to the Hub using [LeRobot](https://github.com/huggingface/lerobot).
-{% set policy_docs = {
-  "act": "act",
-  "smolvla": "smolvla",
-  "pi0": "pi0",
-  "pi0_fast": "pi0fast",
-  "pi05": "pi05",
-  "molmoact2": "molmoact2",
-  "vla_jepa": "vla_jepa",
-  "eo1": "eo1",
-  "groot": "groot",
-  "xvla": "xvla",
-  "multi_task_dit": "multi_task_dit",
-  "wall_x": "walloss"
-} %}
-{% if policy_docs.get(model_name) %}Learn how to train and run it in the [LeRobot {{ model_name }} guide](https://huggingface.co/docs/lerobot/main/en/{{ policy_docs[model_name] }}), or browse the [full documentation](https://huggingface.co/docs/lerobot/index).
-{% else %}See the [full LeRobot documentation](https://huggingface.co/docs/lerobot/index).
-{% endif %}
+See the full documentation at [LeRobot Docs](https://huggingface.co/docs/lerobot/index).
+
+---
+
+## How to Get Started with the Model
+
+For a complete walkthrough, see the [training guide](https://huggingface.co/docs/lerobot/il_robots#train-a-policy).
+Below is the short version of how to train and run inference/evaluation:
+
+### Train from scratch
+
+```bash
+lerobot-train \
+  --dataset.repo_id=${HF_USER}/<dataset> \
+  --policy.type=act \
+  --output_dir=outputs/train/<desired_policy_repo_id> \
+  --job_name=lerobot_training \
+  --policy.device=cuda \
+  --policy.repo_id=${HF_USER}/<desired_policy_repo_id> \
+  --wandb.enable=true
+```
+
+_Writes checkpoints to `outputs/train/<desired_policy_repo_id>/checkpoints/`._
+
+### Evaluate the policy/run inference
+
+```bash
+lerobot-record \
+  --robot.type=so100_follower \
+  --dataset.repo_id=<hf_user>/eval_<dataset> \
+  --policy.path=<hf_user>/<desired_policy_repo_id> \
+  --episodes=10
+```
+
+Prefix the dataset repo with **eval\_** and supply `--policy.path` pointing to a local or hub checkpoint.

 ---

 ## Model Details

 - **License:** {{ license | default("\[More Information Needed]", true) }}
-{% if base_model %}- **Fine-tuned from:** [{{ base_model }}](https://huggingface.co/{{ base_model }})
-{% endif %}{% if robot_type %}- **Robot type:** `{{ robot_type }}`
-{% endif %}{% if cameras %}- **Cameras:** {% for camera in cameras %}`{{ camera }}`{% if not loop.last %}, {% endif %}{% endfor %}
-{% endif %}
-{% if input_features or output_features %}
-## Inputs & Outputs
-
-The policy consumes these observation features and produces these action features.
-{% if input_features %}
-**Inputs**
-
-| Feature | Type | Shape |
-| --- | --- | --- |
-{% for name, feature in input_features.items() %}| `{{ name }}` | {{ feature.type.value }} | `{{ feature.shape }}` |
-{% endfor %}{% endif %}{% if output_features %}
-**Outputs**
-
-| Feature | Type | Shape |
-| --- | --- | --- |
-{% for name, feature in output_features.items() %}| `{{ name }}` | {{ feature.type.value }} | `{{ feature.shape }}` |
-{% endfor %}{% endif %}{% endif %}
-{% if dataset %}
-## Training Dataset
-
- **Repository:** [{{ dataset.repo_id }}](https://huggingface.co/datasets/{{ dataset.repo_id }})
- **Episodes:** {{ dataset.episodes }}
- **Frames:** {{ dataset.frames }}
- **Frame rate:** {{ dataset.fps }} FPS
-{% if dataset.tasks %}- **Task(s):** {% for task in dataset.tasks %}"{{ task }}"{% if not loop.last %}, {% endif %}{% endfor %}
-{% endif %}
-<a class="flex" href="https://huggingface.co/spaces/lerobot/visualize_dataset?path={{ dataset.repo_id }}">
-<img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/badges/resolve/main/visualize-this-dataset-xl.svg"/>
-<img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/badges/resolve/main/visualize-this-dataset-xl-dark.svg"/>
-</a>
-{% endif %}
-{% if training %}
-## Training Configuration
-
-| Setting | Value |
-| --- | --- |
-| Training steps | {{ training.steps }} |
-| Batch size | {{ training.batch_size }} |
-{% if training.optimizer %}| Optimizer | {{ training.optimizer }} |
-{% endif %}{% if training.lr %}| Learning rate | {{ training.lr }} |
-{% endif %}{% if training.seed is not none %}| Seed | {{ training.seed }} |
-{% endif %}| LeRobot version | {{ training.lerobot_version }} |
-{% endif %}
---
-
-## How to Get Started with the Model
-
-New to LeRobot? These guides cover the full workflow:
-
- **[Install LeRobot](https://huggingface.co/docs/lerobot/main/en/installation)** — set up the `lerobot` package.
- **[Hardware setup](https://huggingface.co/docs/lerobot/main/en/hardware_guide)** — assemble, wire, and calibrate your robot and cameras.
- **[Record data & train a policy](https://huggingface.co/docs/lerobot/en/il_robots)** — the end-to-end imitation-learning walkthrough.
- **[CLI cheat-sheet](https://huggingface.co/docs/lerobot/main/en/cheat-sheet)** — quick reference for the `lerobot-*` commands.
-
-The short version to run and train this policy:
-
-### Run the policy on your robot
-
-```bash
-lerobot-rollout \
-  --strategy.type=base \
-  --robot.type={{ robot_type | default("<your_robot_type>", true) }} \
-  --robot.port=<your_robot_port> \
-  --robot.cameras="{ <camera_1>: {type: opencv, index_or_path: <index_or_path>, width: 640, height: 480, fps: 30}, <camera_2>: {type: opencv, index_or_path: <index_or_path>, width: 640, height: 480, fps: 30}}" \
-  --policy.path={{ policy_repo_id | default("<hf_user>/<policy_repo_id>", true) }} \
-  --task="{% if dataset and dataset.tasks %}{{ dataset.tasks[0] }}{% else %}<your_task_description>{% endif %}" \
-  --duration=60
-```
-
-Replace the remaining `<...>` placeholders with your own values: `--robot.port` and the camera names/indices are specific to your machine, and the camera names must match the observation keys this policy was trained on.
-
-When `--strategy.type=base` is used the script doesn't record the episodes. Skipping duration will make the policy run indefinitely. For more information look at [rollout documentation](https://huggingface.co/docs/lerobot/main/en/inference).
-
-{% if base_model %}### Train your own policy
-
-This policy type is usually fine-tuned from the pretrained base model [{{ base_model }}](https://huggingface.co/{{ base_model }}):
-
-```bash
-lerobot-train \
-  --dataset.repo_id=${HF_USER}/<dataset> \
-  --policy.path={{ base_model }} \
-  --output_dir=outputs/train/<policy_repo_id> \
-  --job_name=lerobot_training \
-  --policy.device=cuda \
-  --policy.repo_id=${HF_USER}/<policy_repo_id> \
-  --wandb.enable=true
-```
-{% else %}### Train your own policy
-
-```bash
-lerobot-train \
-  --dataset.repo_id=${HF_USER}/<dataset> \
-  --policy.type={{ model_name }} \
-  --output_dir=outputs/train/<policy_repo_id> \
-  --job_name=lerobot_training \
-  --policy.device=cuda \
-  --policy.repo_id=${HF_USER}/<policy_repo_id> \
-  --wandb.enable=true
-```
-{% endif %}
-_Writes checkpoints to `outputs/train/<policy_repo_id>/checkpoints/`._
-
---
-
-## Evaluation
-
-<!-- Report real-robot results here: run the policy several times per task and count the
-successes. Delete the "No evaluation results" line and fill in this table instead:
-
-| Task | Trials | Successes | Success rate |
-| ---- | ------ | --------- | ------------ |
-| pick the lego brick | 10 | 8 | 80% |
-
-Also worth noting: anything that affects difficulty (new object positions, lighting,
-distractors, a different robot of the same type, ...).
-->
-
-_No evaluation results have been provided for this policy yet._
-
---
-
-## Citation
-
-If you use this policy, please cite the method linked in the description above, along with LeRobot:
-
-```bibtex
-@misc{cadene2024lerobot,
-    author = {Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Moss, Jess and Wolf, Thomas},
-    title = {LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch},
-    howpublished = "\url{https://github.com/huggingface/lerobot}",
-    year = {2024}
-}
-```
@@ -0,0 +1,29 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""LeRobot tool implementations.
+
+Storage of the tool catalog (``meta/info.json["tools"]``) and the
+``SAY_TOOL_SCHEMA`` constant live in PR 1
+(``lerobot.datasets.language``). This package holds the *runnable*
+implementations, one file per tool, plus the registry that maps tool
+names to classes.
+
+See ``docs/source/tools.mdx`` for the authoring guide.
+"""
+
+from .base import Tool
+from .registry import TOOL_REGISTRY, get_tools
+from .say import SayTool
+
+__all__ = ["Tool", "TOOL_REGISTRY", "get_tools", "SayTool"]
@@ -0,0 +1,58 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tool protocol — the contract every runnable tool implementation honors.
+
+Tools are the executable side of the OpenAI-style function-calling
+abstraction the v3.1 language schema (PR 1) carries on assistant
+messages: the schema describes *what can be called*, the tool
+implementation describes *how to call it*.
+
+Implementations live one-per-file under :mod:`lerobot.tools` (e.g.
+``say.py`` for ``SayTool``) and are registered in
+:mod:`lerobot.tools.registry`. The runtime instantiates them lazily so
+heavy dependencies (torch models, audio backends, network clients,
+hardware drivers) only load when the dataset actually declares the tool.
+"""
+
+from __future__ import annotations
+
+from typing import Any, Protocol, runtime_checkable
+
+
+@runtime_checkable
+class Tool(Protocol):
+    """Minimum surface every tool must expose."""
+
+    #: Name matching ``schema["function"]["name"]``. The runtime dispatcher
+    #: routes incoming ``tool_calls`` to the implementation by this key.
+    name: str
+
+    #: OpenAI-style function-call schema. Same dict the dataset stores in
+    #: ``meta/info.json["tools"]`` and the chat template renders into the
+    #: prompt.
+    schema: dict[str, Any]
+
+    def call(self, arguments: dict[str, Any]) -> Any:
+        """Execute the tool with the model-provided arguments.
+
+        ``arguments`` is the parsed dict from
+        ``tool_calls[i]["function"]["arguments"]``. When the model
+        follows the chat-template convention and emits a JSON string,
+        it has already been decoded. Implementations validate the dict
+        against their own schema; the runtime only routes by name.
+
+        Return value is implementation-defined — typically a tensor
+        (TTS audio), a Path (saved file), a dict (structured result), or
+        ``None`` (side-effect-only call).
+        """
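
Note on the `Tool` protocol above: because it is a `runtime_checkable` `Protocol`, a class conforms structurally and does not need to inherit from it. The sketch below restates the protocol so that it runs on its own and checks a toy implementation against it. `EchoTool` and `ECHO_SCHEMA` are hypothetical and not part of this PR.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Tool(Protocol):
    """Local restatement of the protocol from the diff, for a self-contained demo."""

    name: str
    schema: dict[str, Any]

    def call(self, arguments: dict[str, Any]) -> Any: ...


class EchoTool:
    """Hypothetical tool: returns its ``text`` argument unchanged."""

    name = "echo"

    def __init__(self, schema: dict[str, Any]) -> None:
        self.schema = schema

    def call(self, arguments: dict[str, Any]) -> Any:
        # Implementations validate their own arguments; the runtime only routes by name.
        text = arguments.get("text")
        if not isinstance(text, str):
            raise ValueError("'text' must be a string")
        return text


# Hypothetical OpenAI-style schema for the toy tool.
ECHO_SCHEMA = {
    "type": "function",
    "function": {
        "name": "echo",
        "parameters": {"type": "object", "properties": {"text": {"type": "string"}}},
    },
}

tool = EchoTool(schema=ECHO_SCHEMA)
conforms = isinstance(tool, Tool)  # structural check: name, schema and call exist
result = tool.call({"text": "hello"})
```

The `isinstance` check only confirms that the three members are present. It does not verify signatures or types, so argument validation still belongs inside `call`.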
@@ -0,0 +1,70 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Tool registry — name → implementation class.
+
+Adding a new tool:
+
+1. Drop a file under ``src/lerobot/tools/`` that defines a class
+   conforming to :class:`lerobot.tools.base.Tool` (must expose ``name``,
+   ``schema``, ``call(arguments)``).
+2. Register the class here under :data:`TOOL_REGISTRY`.
+3. (Optional) Pre-populate ``meta/info.json["tools"]`` on your dataset
+   to advertise the schema to the chat-template + policy. The PR 2
+   annotation pipeline preserves anything you put there.
+
+See ``docs/source/tools.mdx`` for the full authoring guide.
+"""
+
+from __future__ import annotations
+
+from typing import Any
+
+from .base import Tool
+from .say import SayTool
+
+#: Map from ``function.name`` to a class implementing :class:`Tool`.
+#: The runtime instantiates entries lazily — registering a tool here is
+#: essentially free (no model load happens until ``call`` runs).
+TOOL_REGISTRY: dict[str, type] = {
+    "say": SayTool,
+}
+
+
+def get_tools(meta: Any, **kwargs: Any) -> dict[str, Tool]:
+    """Build name → tool-instance dict from a dataset's declared catalog.
+
+    ``meta`` is anything with a ``.tools`` attribute returning the
+    OpenAI-style schema list — typically a
+    :class:`lerobot.datasets.dataset_metadata.LeRobotDatasetMetadata`.
+    Each entry whose ``function.name`` is registered here is
+    instantiated with the schema dict; tools whose name is unknown to
+    the registry are skipped (the schema still rides through the chat
+    template, the model just can't actually invoke that tool at
+    inference).
+
+    Extra keyword arguments are forwarded to every constructor — useful
+    for runtime defaults like ``output_dir=Path("./tts_log")``.
+    """
+    declared = list(meta.tools)
+    instances: dict[str, Tool] = {}
+    for schema in declared:
+        try:
+            name = schema["function"]["name"]
+        except (KeyError, TypeError):
+            continue
+        cls = TOOL_REGISTRY.get(name)
+        if cls is None:
+            continue
+        instances[name] = cls(schema=schema, **kwargs)
+    return instances
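The lookup in `get_tools` skips an entry in two cases: the schema is malformed, or its name is not in the registry. The following is a minimal, self-contained sketch of that behaviour, not the module itself. `EchoTool`, `REGISTRY`, and `build_tools` are hypothetical stand-ins for `SayTool`, `TOOL_REGISTRY`, and `get_tools`, and the `meta` object is a plain namespace rather than a real `LeRobotDatasetMetadata`.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any


@dataclass
class EchoTool:
    """Hypothetical stand-in for a registered tool such as SayTool."""

    schema: dict[str, Any]
    prefix: str = ""
    name: str = "echo"

    def call(self, arguments: dict[str, Any]) -> str:
        return self.prefix + arguments["text"]


# Stand-in for TOOL_REGISTRY (assumption: the real one maps "say" to SayTool).
REGISTRY: dict[str, type] = {"echo": EchoTool}


def build_tools(meta: Any, **kwargs: Any) -> dict[str, Any]:
    """Mirror of the get_tools loop: skip malformed or unregistered entries."""
    instances: dict[str, Any] = {}
    for schema in meta.tools:
        try:
            name = schema["function"]["name"]
        except (KeyError, TypeError):
            continue  # malformed entry, e.g. no "function" key or not a mapping
        cls = REGISTRY.get(name)
        if cls is None:
            continue  # declared by the dataset but not implemented here
        instances[name] = cls(schema=schema, **kwargs)
    return instances


meta = SimpleNamespace(
    tools=[
        {"type": "function", "function": {"name": "echo"}},
        {"type": "function", "function": {"name": "unknown_tool"}},
        {"type": "function"},  # malformed: missing "function"
        None,  # malformed: not a mapping
    ]
)
tools = build_tools(meta, prefix="> ")
result = tools["echo"].call({"text": "hello"})
```

Because extra keyword arguments go to every constructor, each registered class has to accept the same keywords. That constraint matters once a second tool is added.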
@@ -0,0 +1,169 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""``SayTool`` — text-to-speech tool wrapping Kyutai's pocket-tts.
+
+The first concrete tool implementation. PI052 and downstream runtime
+dispatchers consume this when the model emits an assistant message
+with ``tool_calls=[{function: {name: "say", arguments: {text: ...}}}]``.
+
+Why pocket-tts:
+
+- runs on CPU (no GPU dependency); ~6× real-time on a MacBook Air M4
+- ~100M parameters, ~200ms first-chunk latency
+- streamable, voice-cloneable
+- pip-installable, MIT-style permissive license
+
+The pocket-tts model is loaded **lazily** the first time ``call(...)``
+runs (or eagerly via ``preload()``). Loading takes a few seconds and
+several hundred MB of RAM, so we don't pay the cost when the tool is
+merely *registered* — only when it's *invoked*.
+
+Optional dependency. Install with::
+
+    pip install lerobot[tools]
+    # or directly:
+    pip install pocket-tts
+"""
+
+from __future__ import annotations
+
+import logging
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from lerobot.datasets.language import SAY_TOOL_SCHEMA
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class SayTool:
+    """Speak a short utterance via Kyutai's pocket-tts.
+
+    Parameters
+    ----------
+    schema:
+        Optional schema override; defaults to the canonical
+        ``SAY_TOOL_SCHEMA`` from PR 1. Custom voices or extended
+        argument shapes can pass in a modified schema, but the
+        implementation only reads ``arguments["text"]``.
+    voice:
+        One of the pocket-tts catalog voices (``alba``, ``marius``,
+        ``javert``, ``jean``, ``fantine``, ``cosette``, ``eponine``,
+        ``azelma``) or a path to a ``.wav`` / ``.safetensors`` voice
+        file for cloning. See the pocket-tts model card for licensing.
+    output_dir:
+        If set, every ``call(...)`` writes a ``say_<ts_ms>_<snippet>.wav``
+        file there in addition to returning the PCM tensor.
+        ``None`` (default) skips disk writes — useful for live
+        playback paths that hand the tensor directly to a sounddevice
+        / WebAudio sink.
+    """
+
+    schema: dict[str, Any] = field(default_factory=lambda: dict(SAY_TOOL_SCHEMA))
+    voice: str = "alba"
+    output_dir: Path | None = None
+
+    name: str = field(init=False, default="say")
+    _model: Any = field(init=False, default=None, repr=False)
+    _voice_state: Any = field(init=False, default=None, repr=False)
+    _sample_rate: int = field(init=False, default=24000, repr=False)
+
+    # ------------------------------------------------------------------
+    # Lazy model load
+    # ------------------------------------------------------------------
+
+    def preload(self) -> None:
+        """Load the pocket-tts model + voice state into memory.
+
+        Optional — ``call(...)`` triggers this automatically on first
+        invocation. Useful when you want the multi-second load to
+        happen at startup rather than on the first ``say`` the policy
+        emits.
+        """
+        if self._model is not None and self._voice_state is not None:
+            return
+        try:
+            from pocket_tts import TTSModel  # noqa: PLC0415  (optional dep)
+        except ImportError as exc:  # pragma: no cover (env-dependent)
+            raise ImportError(
+                "SayTool requires pocket-tts. Install with `pip install "
+                "lerobot[tools]` or `pip install pocket-tts`."
+            ) from exc
+        logger.info("SayTool: loading pocket-tts model + voice=%r", self.voice)
+        self._model = TTSModel.load_model()
+        self._voice_state = self._model.get_state_for_audio_prompt(self.voice)
+        self._sample_rate = int(getattr(self._model, "sample_rate", 24000))
+
+    # ------------------------------------------------------------------
+    # Tool protocol
+    # ------------------------------------------------------------------
+
+    def call(self, arguments: dict[str, Any]) -> Any:
+        """Speak ``arguments["text"]`` and return the PCM tensor.
+
+        Also writes ``<output_dir>/say_<ts_ms>_<snippet>.wav`` when
+        ``self.output_dir`` is set. The returned tensor is a 1-D
+        ``torch.Tensor`` of float32 PCM samples at
+        ``self.sample_rate`` Hz — directly playable by
+        ``sounddevice.play(audio.numpy(), self.sample_rate)`` or
+        encodable by ``scipy.io.wavfile.write``.
+        """
+        text = arguments.get("text")
+        if not isinstance(text, str) or not text.strip():
+            raise ValueError(
+                f"SayTool.call expects arguments={{'text': str}}, got {arguments!r}"
+            )
+        self.preload()
+
+        audio = self._model.generate_audio(self._voice_state, text)
+
+        if self.output_dir is not None:
+            self._write_wav(audio, text)
+
+        return audio
+
+    @property
+    def sample_rate(self) -> int:
+        """PCM sample rate of the returned tensor (Hz)."""
+        return self._sample_rate
+
+    # ------------------------------------------------------------------
+    # Helpers
+    # ------------------------------------------------------------------
+
+    def _write_wav(self, audio: Any, text: str) -> Path:
+        """Write a ``.wav`` under ``output_dir`` for offline inspection."""
+        import time as _time  # noqa: PLC0415
+
+        try:
+            import scipy.io.wavfile  # noqa: PLC0415
+        except ImportError as exc:  # pragma: no cover
+            raise ImportError(
+                "SayTool.output_dir requires scipy. `pip install scipy`."
+            ) from exc
+
+        out_dir = Path(self.output_dir)
+        out_dir.mkdir(parents=True, exist_ok=True)
+        # One file per call; suffix with a millisecond timestamp + a
+        # short text snippet so a directory listing is informative.
+        snippet = "".join(c if c.isalnum() else "_" for c in text[:32]).strip("_")
+        ts_ms = int(_time.time() * 1000)
+        path = out_dir / f"say_{ts_ms}_{snippet}.wav"
+
+        # ``audio`` is a torch tensor; ``detach().cpu()`` is a no-op for
+        # pocket-tts's CPU output and keeps ``.numpy()`` safe otherwise.
+        scipy.io.wavfile.write(path, self.sample_rate, audio.detach().cpu().numpy())
+        return path
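The naming scheme in `_write_wav` can be tried in isolation. The sketch below reuses the same snippet expression from the hunk above. `wav_filename` is a hypothetical helper name introduced only for illustration, and the timestamp is passed in so the result is deterministic.

```python
def wav_filename(text: str, ts_ms: int) -> str:
    """Build the file name the way _write_wav does (hypothetical helper)."""
    # Keep the first 32 characters, replace anything non-alphanumeric with
    # "_", then trim leading and trailing underscores.
    snippet = "".join(c if c.isalnum() else "_" for c in text[:32]).strip("_")
    return f"say_{ts_ms}_{snippet}.wav"


normal = wav_filename("Hello, world!", 1700000000000)
punctuation_only = wav_filename("!!!", 1)
```

Text made only of punctuation yields an empty snippet, so the name ends in `_.wav`. The millisecond timestamp still keeps such files distinct unless two calls land in the same millisecond.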
@@ -22,7 +22,7 @@ from torch.utils.data._utils.collate import default_collate

 from lerobot.datasets.language import LANGUAGE_COLUMNS

-_PYTHON_LIST_KEYS = {"messages", "message_streams", "target_message_indices"}
+_PYTHON_LIST_KEYS = {"messages", "message_streams", "target_message_indices", *LANGUAGE_COLUMNS}


 def lerobot_collate_fn(batch: list[dict[str, Any] | None]) -> dict[str, Any] | None:
@@ -34,6 +34,7 @@ ACTION = "action"
 ACTION_PREFIX = ACTION + "."
 ACTION_TOKENS = ACTION + ".tokens"
 ACTION_TOKEN_MASK = ACTION + ".token_mask"
+ACTION_CODE_TOKEN_MASK = ACTION + ".code_token_mask"
 REWARD = "next.reward"
 TRUNCATED = "next.truncated"
 DONE = "next.done"
@@ -13,39 +13,21 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from collections import defaultdict
 from collections.abc import Callable
 from typing import Any

-import torch
-
 from .utils import format_big_number

-_VALID_REDUCTIONS = ("none", "max", "mean", "sum")
-

 class AverageMeter:
    """
    Computes and stores the average and current value
    Adapted from https://github.com/pytorch/examples/blob/main/imagenet/main.py
-
-    Args:
-        name: Display name of the metric.
-        fmt: Format string used when rendering the metric.
-        reduction: Cross-process reduction applied by :meth:`MetricsTracker.reduce_across_ranks`
-            before logging. One of ``"none"`` (per-rank value, default), ``"max"``, ``"mean"``,
-            or ``"sum"``. Use ``"max"`` for bottleneck-style metrics (e.g. dataloading or
-            update wall time) so multi-GPU runs report the slowest rank rather than rank 0.
    """

-    def __init__(self, name: str, fmt: str = ":f", reduction: str = "none"):
-        if reduction not in _VALID_REDUCTIONS:
-            raise ValueError(
-                f"Invalid reduction {reduction!r} for AverageMeter; expected one of {_VALID_REDUCTIONS}."
-            )
+    def __init__(self, name: str, fmt: str = ":f"):
        self.name = name
        self.fmt = fmt
-        self.reduction = reduction
        self.reset()

    def reset(self) -> None:
@@ -156,37 +138,6 @@ class MetricsTracker:
        self.episodes = self.samples / self._avg_samples_per_ep
        self.epochs = self.samples / self._num_frames

-    def reduce_across_ranks(self) -> None:
-        """
-        Synchronises the running averages of every metric whose ``reduction`` is not ``"none"``
-        across all distributed processes (in-place).
-
-        This is a collective operation and MUST be invoked on every rank — typically just before
-        logging. With no accelerator or in single-process runs it is a no-op. Without it, metrics
-        reported by the main process only reflect rank 0; for bottleneck-style timings
-        (``dataloading_s``, ``update_s``, ...) that means the slowest worker's stall is invisible.
-        """
-        if self.accelerator is None or self.accelerator.num_processes <= 1:
-            return
-
-        buckets: dict[str, list[str]] = defaultdict(list)
-        for name, meter in self.metrics.items():
-            if meter.reduction != "none":
-                buckets[meter.reduction].append(name)
-        if not buckets:
-            return
-
-        device = self.accelerator.device
-        for reduction, names in buckets.items():
-            tensor = torch.tensor([self.metrics[n].avg for n in names], dtype=torch.float32, device=device)
-            reduced = self.accelerator.reduce(tensor, reduction=reduction)
-            for name, value in zip(names, reduced.tolist(), strict=True):
-                meter = self.metrics[name]
-                # Preserve avg == sum / count so a later .update() on this meter accumulates
-                # against the cluster view, not the stale per-rank history.
-                meter.avg = value
-                meter.sum = value * meter.count
-
    def __str__(self) -> str:
        display_list = [
            f"step:{format_big_number(self.steps)}",
@@ -38,20 +38,19 @@ import torch

 pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")

-from lerobot.annotations.steerable_pipeline.frames import VideoFrameProvider  # noqa: E402
+from lerobot.annotations.steerable_pipeline.frames import (  # noqa: E402
+    VideoFrameProvider,
+    _decode_frames_av,
+    _decode_frames_ffmpeg,
+)


 class _FakeMeta:
    """Minimal metadata stub exposing ``video_keys`` / ``camera_keys``."""

-    def __init__(self, video_keys: list[str], image_keys: list[str], video_path: Path | None = None) -> None:
+    def __init__(self, video_keys: list[str], image_keys: list[str]) -> None:
        self.video_keys = video_keys
        self.camera_keys = [*video_keys, *image_keys]
-        self._video_path = video_path
-        self.episodes = {0: {f"videos/{key}/from_timestamp": 0.0 for key in video_keys}}
-
-    def get_video_file_path(self, episode_index: int, camera_key: str) -> Path:
-        return self._video_path


 def test_default_camera_key_skips_image_only_cameras(tmp_path: Path, monkeypatch) -> None:
@@ -125,24 +124,15 @@ def sample_video(tmp_path: Path) -> Path:
    return out


-def _provider_for_video(tmp_path: Path, video: Path, monkeypatch) -> VideoFrameProvider:
-    """A provider whose single camera resolves to ``video`` via fake metadata."""
-    fake = _FakeMeta(video_keys=["observation.images.cam"], image_keys=[], video_path=video)
-    import lerobot.datasets.dataset_metadata as meta_mod
+def test_decode_frames_av_returns_one_uint8_frame_per_timestamp(sample_video: Path) -> None:
+    """``_decode_frames_av`` decodes via PyAV directly — no torchcodec/torchvision.

-    monkeypatch.setattr(meta_mod, "LeRobotDatasetMetadata", lambda *a, **k: fake, raising=True)
-    return VideoFrameProvider(root=tmp_path, tolerance_s=0.2)
-
-
-def test_decode_returns_one_uint8_frame_per_timestamp(
-    sample_video: Path, tmp_path: Path, monkeypatch
-) -> None:
-    """``_decode`` routes through ``decode_video_frames`` (torchcodec when
-    available, PyAV otherwise) — no subprocess fallback.
+    This is the always-available fallback: torchcodec is unusable in some
+    containers and lerobot's ``pyav`` backend routes through the removed
+    ``torchvision.io.VideoReader``.
    """
-    provider = _provider_for_video(tmp_path, sample_video, monkeypatch)
    timestamps = [0.0, 1.0, 2.5]
-    frames = provider._decode(0, timestamps, "observation.images.cam")
+    frames = _decode_frames_av(sample_video, timestamps)

    assert len(frames) == len(timestamps)
    for frame in frames:
@@ -151,96 +141,39 @@ def test_decode_returns_one_uint8_frame_per_timestamp(
        assert frame.shape == (3, 120, 160)


-def test_frames_at_snaps_mid_frame_grid_to_real_frames(
-    sample_video: Path, tmp_path: Path, monkeypatch
-) -> None:
-    """Uniform sampling grids land mid-frame; ``frames_at`` must snap them to
-    real frame timestamps before decoding.
-
-    Regression: ``decode_video_frames`` rejects queries farther than
-    ``tolerance_s`` (default 10 ms) from a decodable frame, so un-snapped
-    mid-frame queries raised ``FrameTimestampError`` wholesale and the plan
-    module silently lost its contact sheets for most episodes.
-    """
-    from types import SimpleNamespace
-
-    fake = _FakeMeta(video_keys=["observation.images.cam"], image_keys=[], video_path=sample_video)
-    import lerobot.datasets.dataset_metadata as meta_mod
-
-    monkeypatch.setattr(meta_mod, "LeRobotDatasetMetadata", lambda *a, **k: fake, raising=True)
-    provider = VideoFrameProvider(root=tmp_path)  # default 10 ms tolerance
-    # 10 fps fixture -> frames at 0.0, 0.1, ...; queries sit mid-frame.
-    record = SimpleNamespace(episode_index=0, frame_timestamps=[i / 10 for i in range(30)])
-
-    frames = provider.frames_at(record, [0.149, 1.234, 2.04], camera_key="observation.images.cam")
+def test_decode_frames_av_picks_nearest_frame(sample_video: Path) -> None:
+    """Repeated and out-of-order timestamps each resolve to the nearest frame."""
+    frames = _decode_frames_av(sample_video, [2.0, 0.0, 2.0])

    assert len(frames) == 3
+    assert torch.equal(frames[0], frames[2])
+    assert not torch.equal(frames[0], frames[1])
+
+
+def test_decode_frames_av_raises_on_missing_file(tmp_path: Path) -> None:
+    """A missing video surfaces as an exception the caller can fall back on."""
+    with pytest.raises(Exception):  # noqa: B017, PT011
+        _decode_frames_av(tmp_path / "does_not_exist.mp4", [0.0])
+
+
+def test_decode_frames_ffmpeg_returns_one_uint8_frame_per_timestamp(sample_video: Path) -> None:
+    """``_decode_frames_ffmpeg`` shells out to the ffmpeg CLI — the always-
+    available fallback that decodes AV1 and isolates crashes to a child
+    process.
+    """
+    timestamps = [0.0, 1.0, 2.5]
+    frames = _decode_frames_ffmpeg(sample_video, timestamps)
+
+    assert len(frames) == len(timestamps)
    for frame in frames:
        assert isinstance(frame, torch.Tensor)
+        assert frame.dtype == torch.uint8
        assert frame.shape == (3, 120, 160)


-def test_decode_returns_empty_list_on_missing_file(tmp_path: Path, monkeypatch) -> None:
-    """A missing video is a recoverable no-frames condition, never a crash."""
-    provider = _provider_for_video(tmp_path, tmp_path / "does_not_exist.mp4", monkeypatch)
-    assert provider._decode(0, [0.0], "observation.images.cam") == []
-
-
-def test_episode_clip_path_trims_via_reencode_video(tmp_path: Path, monkeypatch) -> None:
-    """Clip extraction delegates to ``video_utils.reencode_video`` with the
-    episode's ``[from_timestamp, to_timestamp)`` trim window — no subprocess.
-    """
-    from types import SimpleNamespace
-
-    import lerobot.annotations.steerable_pipeline.frames as frames_mod
-
-    src = tmp_path / "src.mp4"
-    src.write_bytes(b"src")
-    fake = _FakeMeta(video_keys=["observation.images.cam"], image_keys=[], video_path=src)
-    fake.episodes[0]["videos/observation.images.cam/from_timestamp"] = 1.5
-    fake.episodes[0]["videos/observation.images.cam/to_timestamp"] = 4.0
-    import lerobot.datasets.dataset_metadata as meta_mod
-
-    monkeypatch.setattr(meta_mod, "LeRobotDatasetMetadata", lambda *a, **k: fake, raising=True)
-
-    captured = {}
-
-    def fake_reencode(
-        input_video_path,
-        output_video_path,
-        camera_encoder=None,
-        overwrite=False,
-        start_time_s=None,
-        end_time_s=None,
-    ):
-        captured.update(
-            src=Path(input_video_path),
-            encoder=camera_encoder,
-            start_time_s=start_time_s,
-            end_time_s=end_time_s,
-        )
-        Path(output_video_path).write_bytes(b"clip")
-
-    monkeypatch.setattr(frames_mod, "reencode_video", fake_reencode, raising=True)
-    provider = VideoFrameProvider(root=tmp_path)
-    record = SimpleNamespace(episode_index=0, frame_timestamps=[0.0, 1.0])
-
-    out = provider.episode_clip_path(record, tmp_path / "clips")
-
-    assert out == tmp_path / "clips" / "ep_000000.mp4"
-    assert captured["src"] == src
-    assert captured["start_time_s"] == 1.5
-    assert captured["end_time_s"] == 4.0
-    # H.264 so the clip is decodable by vllm's libav build (sources are often AV1).
-    assert captured["encoder"].vcodec == "h264"
-
-
-def test_videoframeprovider_serializes_decodes_with_a_lock() -> None:
-    """torchcodec's cached per-file decoder is single-threaded; the provider
-    must own a dedicated lock that ``_decode`` holds around the decoder call.
-    """
-    import threading
-
-    lock_field = VideoFrameProvider.__dataclass_fields__.get("_decode_lock")
-    assert lock_field is not None
-    assert lock_field.default_factory is threading.Lock
+def test_decode_frames_ffmpeg_raises_on_missing_file(tmp_path: Path) -> None:
+    """A missing video raises (non-zero ffmpeg exit), never crashes the job."""
+    if shutil.which("ffmpeg") is None:
+        pytest.skip("ffmpeg not available")
+    with pytest.raises(Exception):  # noqa: B017, PT011
+        _decode_frames_ffmpeg(tmp_path / "does_not_exist.mp4", [0.0])
@@ -22,7 +22,6 @@ from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any

-import PIL.Image
 import pytest

 # ``lerobot.annotations`` imports pull in ``lerobot.datasets`` (-> the HF
@@ -52,10 +51,7 @@ from ._helpers import make_canned_responder  # noqa: E402
 class _StubFrameProvider:
    """Returns one sentinel object per requested timestamp."""

-    # A real (tiny) PIL image so the contact-sheet builder, which resizes and
-    # tiles frames, has something to draw. VQA still passes it through by
-    # identity via ``to_image_blocks``.
-    sentinel: Any = field(default_factory=lambda: PIL.Image.new("RGB", (32, 24)))
+    sentinel: Any = field(default_factory=object)
    cameras: tuple[str, ...] = ("observation.images.top",)
    calls: list[tuple[int, tuple[float, ...], str | None]] = field(default_factory=list)
    video_calls: list[tuple[int, int, str | None]] = field(default_factory=list)
@@ -119,34 +115,6 @@ def test_module1_plan_memory_subtask_smoke(fixture_dataset_root: Path, tmp_path:
    assert len(plan_rows[-1]["content"].splitlines()) == 1


-def test_module1_emit_memory_false_skips_memory_keeps_subtasks_and_plan(
-    fixture_dataset_root: Path, tmp_path: Path
-) -> None:
-    """``emit_memory=False`` drops ``memory`` rows (and their VLM calls) while
-    leaving subtask + plan generation intact — symmetric to ``emit_plan``."""
-    vlm = make_canned_responder(
-        {
-            "atomic subtasks": {
-                "subtasks": [
-                    {"text": "grasp the handle of the sponge", "start": 0.0, "end": 0.4},
-                    {"text": "wipe the counter from left to right", "start": 0.4, "end": 0.8},
-                    {"text": "place the sponge into the sink", "start": 0.8, "end": 1.1},
-                ]
-            },
-            "compressed semantic memory": {"memory": "wiped the counter once"},
-        },
-    )
-    module = PlanSubtasksMemoryModule(vlm=vlm, config=PlanConfig(emit_memory=False))
-    record = next(iter_episodes(fixture_dataset_root))
-    staging = EpisodeStaging(tmp_path / "stage", record.episode_index)
-    module.run_episode(record, staging)
-    rows = staging.read("plan")
-
-    styles = {r["style"] for r in rows}
-    assert "memory" not in styles
-    assert {"subtask", "plan"}.issubset(styles)
-
-
 def test_module2_at_t0_emits_speech_only_no_interjection(fixture_dataset_root: Path, tmp_path: Path) -> None:
    vlm = make_canned_responder(
        {"acknowledgement the robot": {"text": "Sure, on it."}},
@@ -268,10 +236,8 @@ def test_module3_vqa_unique_per_frame_and_camera(single_episode_root: Path, tmp_
        assert ts in frame_set


-def test_module1_attaches_contact_sheets_to_subtask_prompt(
-    fixture_dataset_root: Path, tmp_path: Path
-) -> None:
-    """Module 1 sends timestamped contact-sheet image blocks (not a raw video block)."""
+def test_module1_attaches_video_block_to_subtask_prompt(fixture_dataset_root: Path, tmp_path: Path) -> None:
+    """Module 1 sends one ``type=video`` block covering the whole episode."""
    captured: list[list[dict[str, Any]]] = []
    payload = {
        "subtasks": [
@@ -299,7 +265,7 @@ def test_module1_attaches_contact_sheets_to_subtask_prompt(
        # call is the subtask one — keeps the assertions below focused on
        # ``_generate_subtasks`` rather than fighting the order of unrelated
        # text-only Module-1 sub-prompts.
-        config=PlanConfig(frames_per_second=2.0, max_frames_per_prompt=60, n_task_rephrasings=0),
+        config=PlanConfig(max_video_frames=5, frames_per_second=10.0, n_task_rephrasings=0),
        frame_provider=provider,
    )
    record = next(iter_episodes(fixture_dataset_root))
@@ -324,14 +290,16 @@ def test_module1_attaches_contact_sheets_to_subtask_prompt(
    video_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "video"]
    image_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "image"]
    text_blocks = [b for b in content if isinstance(b, dict) and b.get("type") == "text"]
-    assert video_blocks == [], "contact-sheet mode must not emit a raw video block"
-    assert len(image_blocks) >= 1, f"expected >=1 contact-sheet image block, got {content}"
-    assert all(isinstance(b["image"], PIL.Image.Image) for b in image_blocks)
+    assert len(video_blocks) == 1, f"expected exactly 1 video block, got {content}"
+    assert image_blocks == [], "subtask prompt must not mix image blocks with the video block"
    assert len(text_blocks) == 1
-    # the prompt is prefixed with the contact-sheet reading instructions
-    assert text_blocks[0]["text"].startswith("CONTACT SHEETS")
-    # frames were decoded for this episode at episode-relative timestamps
-    assert provider.calls and provider.calls[0][0] == record.episode_index
+    # video block must wrap a list of frames covering the episode
+    assert isinstance(video_blocks[0]["video"], list)
+    assert len(video_blocks[0]["video"]) <= 5
+    # provider is called with target_count = min(duration * fps, max). With
+    # fps=10 on a ~1s episode that requests >max, so max=5 wins.
+    assert provider.video_calls and provider.video_calls[0][0] == record.episode_index
+    assert provider.video_calls[0][1] <= 5


 def test_module3_attaches_frame_image_block_to_prompt(single_episode_root: Path, tmp_path: Path) -> None:
@@ -1,41 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Unit tests for ``vlm_client`` helpers."""
-
-from __future__ import annotations
-
-import pytest
-
-pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
-
-from lerobot.annotations.steerable_pipeline.vlm_client import _bind_serve_port  # noqa: E402
-
-
-def test_bind_serve_port_substitutes_placeholder() -> None:
-    # The {port} placeholder is replaced everywhere it appears, regardless of
-    # parallel vs single server — the bug was the single-server path passing
-    # it through unsubstituted.
-    cmd = "vllm serve M --max-model-len 32768 --port {port}"
-    assert _bind_serve_port(cmd, 8000) == "vllm serve M --max-model-len 32768 --port 8000"
-
-
-def test_bind_serve_port_appends_when_missing() -> None:
-    assert _bind_serve_port("vllm serve M", 8001) == "vllm serve M --port 8001"
-
-
-def test_bind_serve_port_leaves_explicit_port_untouched() -> None:
-    cmd = "vllm serve M --port 9000"
-    assert _bind_serve_port(cmd, 8000) == cmd
@@ -29,6 +29,15 @@ def test_message_recipe_validates_unknown_binding():
        )


+def test_canonical_recipe_loads():
+    """The canonical PI052 blend YAML loads + validates."""
+    recipe = TrainingRecipe.from_yaml(
+        Path("src/lerobot/configs/recipes/subtask_mem_vqa_speech.yaml")
+    )
+    assert recipe.blend is not None
+    assert sum(c.weight for c in recipe.blend.values()) == pytest.approx(1.0)
+
+
 def test_message_turn_requires_a_stream():
    """Every turn must declare a stream — None is rejected at construction.

@@ -289,52 +289,6 @@ def test_aggregate_datasets(tmp_path, lerobot_dataset_factory):
    assert_dataset_iteration_works(aggr_ds)


-def test_aggregate_datasets_without_concatenation(tmp_path, lerobot_dataset_factory):
-    """With concatenation disabled, each source file is kept as its own destination file."""
-    ds_0 = lerobot_dataset_factory(
-        root=tmp_path / "no_stitch_0",
-        repo_id=f"{DUMMY_REPO_ID}_no_stitch_0",
-        total_episodes=3,
-        total_frames=60,
-    )
-    ds_1 = lerobot_dataset_factory(
-        root=tmp_path / "no_stitch_1",
-        repo_id=f"{DUMMY_REPO_ID}_no_stitch_1",
-        total_episodes=4,
-        total_frames=80,
-    )
-
-    aggr_root = tmp_path / "no_stitch_aggr"
-    aggregate_datasets(
-        repo_ids=[ds_0.repo_id, ds_1.repo_id],
-        roots=[ds_0.root, ds_1.root],
-        aggr_repo_id=f"{DUMMY_REPO_ID}_no_stitch_aggr",
-        aggr_root=aggr_root,
-        concatenate_videos=False,
-        concatenate_data=False,
-    )
-
-    with (
-        patch("lerobot.datasets.dataset_metadata.get_safe_version") as mock_get_safe_version,
-        patch("lerobot.datasets.dataset_metadata.snapshot_download") as mock_snapshot_download,
-    ):
-        mock_get_safe_version.return_value = "v3.0"
-        mock_snapshot_download.return_value = str(aggr_root)
-        aggr_ds = LeRobotDataset(f"{DUMMY_REPO_ID}_no_stitch_aggr", root=aggr_root)
-
-    assert_episode_and_frame_counts(
-        aggr_ds, ds_0.num_episodes + ds_1.num_episodes, ds_0.num_frames + ds_1.num_frames
-    )
-    assert_dataset_iteration_works(aggr_ds)
-    assert_video_timestamps_within_bounds(aggr_ds)
-
-    # Two single-file sources stay as two files each, instead of being packed together.
-    assert len(list((aggr_root / "data").rglob("*.parquet"))) == 2
-    assert aggr_ds.meta.video_keys, "Test fixture should produce at least one video feature"
-    for key in aggr_ds.meta.video_keys:
-        assert len(list((aggr_root / "videos" / key).rglob("*.mp4"))) == 2
-
-
@pytest.mark.parametrize("mutation", ["mismatched_value", "missing_key"])
 def test_aggregate_incomplete_video_encoder_info_warns_and_nuls_encoders(
    tmp_path, lerobot_dataset_factory, caplog, mutation
@@ -83,29 +83,6 @@ def test_get_feature_stats_images():
    assert stats["min"].shape == stats["max"].shape == stats["mean"].shape == stats["std"].shape


-def test_get_feature_stats_uint8_images_preserves_std():
-    data = np.array(
-        [
-            [
-                [[0, 64], [128, 255]],
-                [[255, 128], [64, 0]],
-                [[32, 96], [160, 224]],
-            ],
-            [
-                [[16, 80], [144, 240]],
-                [[240, 144], [80, 16]],
-                [[48, 112], [176, 208]],
-            ],
-        ],
-        dtype=np.uint8,
-    )
-
-    stats = get_feature_stats(data, axis=(0, 2, 3), keepdims=True)
-
-    expected_std = data.transpose(0, 2, 3, 1).reshape(-1, 3).std(axis=0).reshape(1, 3, 1, 1)
-    np.testing.assert_allclose(stats["std"], expected_std)
-
-
 def test_get_feature_stats_axis_0_keepdims(sample_array):
    expected = {
        "min": np.array([[1, 2, 3]]),
@@ -343,6 +343,84 @@ def test_resolve_task_explicit_override_beats_rephrasings():
    assert rendered["messages"][0]["content"] == "explicit override wins"


+def test_flow_only_low_level_recipe_renders_without_target():
+    """Regression: a flow-only ``low_level`` recipe has no ``target`` turn —
+    its supervision is the action-expert flow loss, not text-CE. It must
+    still render (not ``None``), otherwise every blend draw of it is dropped
+    and the action expert never receives a flow loss."""
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(
+                role="user",
+                content="${subtask}",
+                stream="low_level",
+                if_present="subtask",
+            ),
+        ],
+        bindings={"subtask": "active_at(t, style=subtask)"},
+    )
+
+    rendered = render_sample(
+        recipe=recipe,
+        persistent=PERSISTENT,
+        events=[],
+        t=0.5,
+        sample_idx=0,
+        task="clean kitchen",
+    )
+
+    assert rendered is not None
+    assert rendered["messages"] == [{"role": "user", "content": "subtask 0"}]
+    assert rendered["message_streams"] == ["low_level"]
+    assert rendered["target_message_indices"] == []
+
+
+def test_vqa_frame_is_consumed_over_the_weighted_blend():
+    """A frame carrying a VQA annotation renders the ``ask_vqa*`` sub-recipe
+    even when its blend weight is tiny — VQA annotations are sparse and must
+    never be wasted on a subtask/action draw."""
+    recipe = TrainingRecipe(
+        blend={
+            "high_level_subtask": TrainingRecipe(
+                weight=0.99,
+                messages=[
+                    MessageTurn(role="user", content="${task}", stream="high_level"),
+                    MessageTurn(role="assistant", content="a subtask", stream="high_level", target=True),
+                ],
+            ),
+            "ask_vqa_top": TrainingRecipe(
+                weight=0.01,
+                bindings={
+                    "vqa_query": "emitted_at(t, style=vqa, role=user, camera=observation.images.top)",
+                    "vqa": "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)",
+                },
+                messages=[
+                    MessageTurn(
+                        role="user", content="${vqa_query}", stream="high_level", if_present="vqa_query"
+                    ),
+                    MessageTurn(
+                        role="assistant",
+                        content="${vqa}",
+                        stream="high_level",
+                        target=True,
+                        if_present="vqa",
+                    ),
+                ],
+            ),
+        }
+    )
+    # A frame WITH a vqa event renders VQA on every sample_idx, despite the
+    # ask_vqa weight being only 0.01.
+    for sample_idx in range(20):
+        rendered = render_sample(
+            recipe=recipe, persistent=PERSISTENT, events=EVENTS_AT_1, t=1.0, sample_idx=sample_idx, task="x"
+        )
+        assert rendered["messages"][-1]["content"] == '{"count": 2}', sample_idx
+    # A frame WITHOUT a vqa event falls back to the normal weighted blend.
+    rendered = render_sample(recipe=recipe, persistent=PERSISTENT, events=[], t=1.0, sample_idx=0, task="x")
+    assert rendered["messages"][-1]["content"] == "a subtask"
+
+
 def test_emitted_at_persistent_tolerates_small_timestamp_drift():
    """Persistent ``emitted_at`` should match within EMITTED_AT_TOLERANCE_S
    so callers that derive ``t`` arithmetically (``frame_idx / fps``) still
@@ -25,7 +25,7 @@ from datasets import Dataset  # noqa: E402
 from lerobot.datasets.io_utils import (
    hf_transform_to_torch,
 )
-from lerobot.datasets.sampler import EpisodeAwareSampler
+from lerobot.datasets.sampler import EpisodeAwareSampler, WeightedEpisodeAwareSampler


 def calculate_episode_data_index(hf_dataset: Dataset) -> dict[str, torch.Tensor]:
@@ -114,19 +114,6 @@ def test_shuffle():
    assert set(sampler) == {0, 1, 2, 3, 4, 5}


-def test_shuffle_is_reproducible_across_instances():
-    # The order is a pure function of (seed, epoch), so two fresh samplers (e.g. two ranks)
-    # produce the same permutation without any generator synchronization.
-    sampler_a = EpisodeAwareSampler([0], [6], shuffle=True, seed=42)
-    sampler_b = EpisodeAwareSampler([0], [6], shuffle=True, seed=42)
-    epoch_0 = list(sampler_a)
-    assert list(sampler_b) == epoch_0
-    # Desyncing the global RNG must not affect the permutation.
-    sampler_c = EpisodeAwareSampler([0], [6], shuffle=True, seed=42)
-    torch.randperm(1000)  # consume global RNG, as rank-asymmetric code (e.g. eval) would
-    assert list(sampler_c) == epoch_0
-
-
 def test_negative_drop_first_frames_raises():
    with pytest.raises(ValueError, match="drop_n_first_frames must be >= 0"):
        EpisodeAwareSampler([0], [10], drop_n_first_frames=-1)
@@ -152,85 +139,47 @@ def test_partial_episode_drop_warns(caplog):
    assert "Episode 0" in caplog.text


-# --- seeded (seed, epoch) shuffling, resume, and state ---
-
-from lerobot.datasets.sampler import compute_sampler_state  # noqa: E402
-
-EPISODE_BOUNDS = ([0, 2, 3], [2, 3, 6])  # episodes of 2, 1 and 3 frames
+# --- WeightedEpisodeAwareSampler --------------------------------------------


-@pytest.mark.parametrize("num_frames", [1, 2, 3, 37, 64, 100])
-def test_deterministic_sampler_shuffle_is_permutation(num_frames):
-    for seed in (0, 1, 1234):
-        sampler = EpisodeAwareSampler([0], [num_frames], shuffle=True, seed=seed)
-        assert sorted(sampler) == list(range(num_frames))
+def test_weighted_sampler_respects_episode_drop_and_length():
+    """The episode-boundary frame filtering is applied before weighting,
+    and one epoch still yields ``len(indices)`` samples."""
+    # One episode, 10 frames; drop the last 2.
+    sampler = WeightedEpisodeAwareSampler([0], [10], frame_weights=torch.ones(10), drop_n_last_frames=2)
+    assert sampler.indices == list(range(8))
+    assert len(sampler) == 8
+    draws = list(sampler)
+    assert len(draws) == 8
+    # Dropped frames 8 and 9 must never be sampled.
+    assert set(draws) <= set(range(8))


-def test_deterministic_sampler_epochs_reproduce_and_differ():
-    sampler_a = EpisodeAwareSampler([0], [100], shuffle=True, seed=42)
-    sampler_b = EpisodeAwareSampler([0], [100], shuffle=True, seed=42)
-    epoch_0 = list(sampler_a)
-    assert list(sampler_b) == epoch_0  # same (seed, epoch) -> same order on any process
-    epoch_1 = list(sampler_a)  # __iter__ auto-advances the epoch
-    assert epoch_1 != epoch_0
-    assert sorted(epoch_1) == sorted(epoch_0)
-    sampler_a.set_epoch(0)
-    assert list(sampler_a) == epoch_0
-    assert list(EpisodeAwareSampler([0], [100], shuffle=True, seed=7)) != epoch_0
+def test_weighted_sampler_oversamples_high_weight_frames():
+    """A heavily-weighted frame dominates the draws."""
+    torch.manual_seed(0)
+    # 100 frames, frame 7 is weighted 1000x.
+    weights = torch.ones(100)
+    weights[7] = 1000.0
+    sampler = WeightedEpisodeAwareSampler([0], [100], frame_weights=weights)
+    counts = {}
+    for _ in range(20):  # 20 epochs
+        for d in sampler:
+            counts[d] = counts.get(d, 0) + 1
+    total = sum(counts.values())
+    # Frame 7 should be the overwhelming majority of the 2000 draws.
+    assert counts.get(7, 0) / total > 0.9


-def test_deterministic_sampler_resume_mid_epoch():
-    reference = EpisodeAwareSampler(*EPISODE_BOUNDS, shuffle=True, seed=42)
-    epoch_0 = list(reference)
-    epoch_1 = list(reference)
-    for start in (0, 1, 4, len(epoch_0)):
-        resumed = EpisodeAwareSampler(*EPISODE_BOUNDS, shuffle=True, seed=42)
-        resumed.load_state_dict({"epoch": 0, "start_index": start})
-        assert list(resumed) == epoch_0[start:]
-        # the resumed sampler continues into the same epoch 1 as the uninterrupted one
-        assert list(resumed) == epoch_1
+def test_weighted_sampler_zero_weights_fall_back_to_uniform():
+    """If every surviving frame has zero weight, sampling is uniform
+    rather than crashing."""
+    sampler = WeightedEpisodeAwareSampler([0], [6], frame_weights=torch.zeros(6))
+    draws = set(sampler)
+    assert draws.issubset(set(range(6)))
+    assert len(list(sampler)) == 6


-def test_deterministic_sampler_construction_stores_only_boundaries():
-    # Construction is O(num_episodes), not O(num_frames): a million-frame single episode
-    # instantiates from just its boundaries without materializing a per-frame index list.
-    num_frames = 1_000_000
-    sampler = EpisodeAwareSampler([0], [num_frames], shuffle=True, seed=0)
-    assert len(sampler) == num_frames
-    assert sampler._starts.shape == (1,) and sampler._cum_lengths.shape == (1,)
-
-
-def test_deterministic_sampler_resume_is_exact_at_scale():
-    # Seeded randperm makes resume sample-exact at non-trivial sizes: regenerating the epoch's
-    # permutation and slicing from the saved offset reproduces the remaining order exactly.
-    num_frames = 100_000
-    reference = EpisodeAwareSampler([0], [num_frames], shuffle=True, seed=0)
-    epoch_0 = list(reference)
-    assert sorted(epoch_0) == list(range(num_frames))
-    start = num_frames - 5
-    resumed = EpisodeAwareSampler([0], [num_frames], shuffle=True, seed=0)
-    resumed.load_state_dict({"epoch": 0, "start_index": start})
-    assert list(resumed) == epoch_0[start:]
-
-
-def test_compute_sampler_state():
-    # 100 frames, batch 10, 2 ranks -> 10 underlying batches, 5 per rank per epoch.
-    assert compute_sampler_state(step=0, num_frames=100, batch_size=10, num_processes=2) == {
-        "epoch": 0,
-        "start_index": 0,
-    }
-    # step 7 -> epoch 1, 2 per-rank batches in = 2 * 10 * 2 = 40 samples in
-    assert compute_sampler_state(step=7, num_frames=100, batch_size=10, num_processes=2) == {
-        "epoch": 1,
-        "start_index": 40,
-    }
-    # uneven epoch: 95 frames -> 10 underlying batches (last short), still 5 per rank
-    assert compute_sampler_state(step=12, num_frames=95, batch_size=10, num_processes=2) == {
-        "epoch": 2,
-        "start_index": 40,
-    }
-    # uneven sharding: 105 frames -> 11 underlying batches, 6 per rank (even_batches pads)
-    assert compute_sampler_state(step=11, num_frames=105, batch_size=10, num_processes=2) == {
-        "epoch": 1,
-        "start_index": 100,
-    }
+def test_weighted_sampler_rejects_short_weight_vector():
+    with pytest.raises(ValueError, match="frame_weights"):
+        WeightedEpisodeAwareSampler([0], [10], frame_weights=torch.ones(5))
@@ -504,19 +504,6 @@ class TestReencodeVideo:
        assert info["video.g"] == 6
        assert info["video.crf"] == 23

-    @require_h264
-    def test_reencode_video_trim_window(self, tmp_path):
-        src = TEST_ARTIFACTS_DIR / "clip_6frames.mp4"
-        out = tmp_path / "trim_window.mp4"
-        cfg = VideoEncoderConfig(vcodec="h264")
-        reencode_video(src, out, camera_encoder=cfg, start_time_s=0.05, end_time_s=0.12, overwrite=True)
-
-        with av.open(str(out)) as container:
-            frames = list(container.decode(video=0))
-        # Only the frames at 0.067 and 0.1 s fall inside [0.05, 0.12).
-        assert len(frames) == 2
-        assert frames[0].time == pytest.approx(0.0, abs=1e-3)
-

 class TestConcatenateVideoFiles:
    def test_two_clips_frame_count(self, tmp_path):
@@ -0,0 +1,167 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Attention-masking tests for the PI052 (π0.5 v2) text head.
+
+Regression coverage for the text-CE collapse bug: PaliGemma's
+``embed_prefix`` flags every language token ``att=0``, which
+``make_att_2d_masks`` turns into one fully *bidirectional* block. Under
+that mask the text cross-entropy degenerates into a copy task — a
+supervised target token attends to the tokens it is trained to predict —
+and the LM head never learns causal generation, so ``select_message``
+collapses at inference.
+
+``_mark_target_span_causal`` sets ``att=1`` on the supervised target
+language positions so each target token attends causally among the
+targets while staying bidirectional to images + the user prompt. These
+tests pin that behaviour for the PaliGemma prefix layout.
+"""
+
+import pytest
+import torch
+
+# modeling_pi052 / modeling_pi05 import transformers transitively.
+pytest.importorskip("transformers")
+
+from lerobot.policies.pi05.modeling_pi05 import make_att_2d_masks  # noqa: E402
+from lerobot.policies.pi052.modeling_pi052 import (  # noqa: E402
+    _mark_target_span_causal,
+    _shifted_lin_ce,
+)
+
+
+def _shifted_ce(logits, labels):
+    """Adapter: ``_shifted_lin_ce`` is Liger-fused (hidden @ lm_head_weightᵀ).
+
+    An identity ``lm_head_weight`` makes the computed logits equal ``logits``.
+    Liger's Triton kernel is GPU-only, so inputs run on CUDA; the loss is
+    returned on CPU so grad still flows back to the CPU ``logits`` leaf.
+    """
+    if not torch.cuda.is_available():
+        pytest.skip("Liger fused CE requires CUDA")
+    vocab_size = logits.shape[-1]
+    eye = torch.eye(vocab_size, dtype=logits.dtype, device="cuda")
+    return _shifted_lin_ce(logits.cuda(), eye, labels.cuda()).cpu()
+
+# ---------------------------------------------------------------------------
+# A synthetic PI052 prefix layout: [images, prompt-lang, target-lang]
+#
+#   indices 0-1  : 2 image tokens          (att = 0)
+#   indices 2-4  : 3 user-prompt lang      (att = 0)
+#   indices 5-8  : 4 supervised target lang(att = 0 from embed_prefix)
+#
+# ``text_labels`` covers the 7 language tokens; -100 on the prompt span,
+# real ids on the 4-token target span. PaliGemma's prefix has no state
+# token (unlike SmolVLA), so the lang span ends at the prefix end.
+# ---------------------------------------------------------------------------
+N_IMAGE = 2
+N_PROMPT = 3
+N_TARGET = 4
+LANG_START = N_IMAGE
+LANG_END = N_IMAGE + N_PROMPT + N_TARGET  # = prefix length
+PREFIX_LEN = LANG_END
+
+
+def _embed_prefix_att_masks() -> torch.Tensor:
+    """Mimic PaliGemma ``embed_prefix``: images + lang all att=0."""
+    return torch.zeros(1, PREFIX_LEN, dtype=torch.bool)
+
+
+def _text_labels() -> torch.Tensor:
+    """-100 over the prompt span, real ids over the target span."""
+    labels = torch.full((1, N_PROMPT + N_TARGET), -100, dtype=torch.long)
+    labels[0, N_PROMPT:] = torch.arange(10, 10 + N_TARGET)
+    return labels
+
+
+def _attends(prefix_att_masks: torch.Tensor) -> torch.Tensor:
+    """2D boolean attendance matrix; ``[i, j]`` True ⇒ i attends to j."""
+    pad = torch.ones(1, PREFIX_LEN, dtype=torch.bool)
+    return make_att_2d_masks(pad, prefix_att_masks)[0]
+
+
+def test_mark_sets_att_on_targets_only():
+    """Only the supervised target language positions flip to att=1."""
+    marked = _mark_target_span_causal(
+        _embed_prefix_att_masks(), _text_labels(), LANG_START, LANG_END
+    )
+    expected = [False] * PREFIX_LEN
+    for i in range(LANG_START + N_PROMPT, LANG_END):  # target span
+        expected[i] = True
+    assert marked[0].tolist() == expected
+
+
+def test_target_tokens_attend_causally_among_themselves():
+    """A target token must NOT attend to later targets, but must attend
+    to earlier ones — genuine causal next-token prediction."""
+    marked = _mark_target_span_causal(
+        _embed_prefix_att_masks(), _text_labels(), LANG_START, LANG_END
+    )
+    attends = _attends(marked)
+    tgt = range(LANG_START + N_PROMPT, LANG_END)
+    for i in tgt:
+        for j in tgt:
+            if j > i:
+                assert not attends[i, j], f"target {i} must not see future target {j}"
+            else:
+                assert attends[i, j], f"target {i} must see earlier/self target {j}"
+
+
+def test_target_tokens_attend_prompt_and_images_bidirectionally():
+    """Targets keep full visibility of images + the user prompt."""
+    marked = _mark_target_span_causal(
+        _embed_prefix_att_masks(), _text_labels(), LANG_START, LANG_END
+    )
+    attends = _attends(marked)
+    context = list(range(0, LANG_START + N_PROMPT))  # images + prompt
+    for i in range(LANG_START + N_PROMPT, LANG_END):
+        for j in context:
+            assert attends[i, j], f"target {i} must attend context {j}"
+
+
+def test_non_target_subtask_stays_bidirectional():
+    """A flow-only / non-target language span (all -100 labels) leaves the
+    mask untouched — the action expert reads it bidirectionally."""
+    all_ignored = torch.full((1, N_PROMPT + N_TARGET), -100, dtype=torch.long)
+    marked = _mark_target_span_causal(
+        _embed_prefix_att_masks(), all_ignored, LANG_START, LANG_END
+    )
+    assert torch.equal(marked, _embed_prefix_att_masks())
+
+
+def test_unmarked_mask_is_bidirectional_the_bug():
+    """Documents the bug the fix prevents: without ``_mark_target_span_causal``
+    a target token attends *bidirectionally* to later targets — the
+    text-CE can copy the answer it is trained to predict."""
+    attends = _attends(_embed_prefix_att_masks())
+    first_tgt = LANG_START + N_PROMPT
+    last_tgt = LANG_END - 1
+    assert attends[first_tgt, last_tgt], (
+        "raw embed_prefix mask is bidirectional over language — the first "
+        "target token can see the last, which is the collapse bug"
+    )
+
+
+def test_shifted_ce_returns_zero_when_no_text_positions_are_supervised():
+    pytest.importorskip("liger_kernel")
+    logits = torch.randn(2, 4, 8, requires_grad=True)
+    labels = torch.full((2, 4), -100, dtype=torch.long)
+
+    loss = _shifted_ce(logits, labels)
+
+    assert loss.item() == 0
+    loss.backward()
+    assert logits.grad is not None
@@ -0,0 +1,114 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Regression tests for PI052 FAST action-code supervision."""
+
+import pytest
+import torch
+from torch.nn import functional as F
+
+pytest.importorskip("transformers")
+pytest.importorskip("liger_kernel")
+
+from lerobot.policies.pi052.modeling_pi052 import _fast_lin_ce  # noqa: E402
+
+
+def _fast_ce(logits, action_tokens, action_code_mask, predict_actions_t):
+    """Adapter: ``_fast_lin_ce`` is Liger-fused (hidden @ lm_head_weightᵀ).
+
+    Feeding an identity ``lm_head_weight`` makes the computed logits equal the
+    provided ``logits``, so these regression tests exercise the masking/gating
+    logic exactly as before the fused-CE refactor. Liger's Triton kernel is
+    GPU-only, so inputs are moved to CUDA and the loss is returned on CPU
+    (keeping grad flowing back to the CPU ``logits`` leaf).
+    """
+    if not torch.cuda.is_available():
+        pytest.skip("Liger fused CE requires CUDA")
+    vocab_size = logits.shape[-1]
+    eye = torch.eye(vocab_size, dtype=logits.dtype, device="cuda")
+    predict = predict_actions_t.cuda() if predict_actions_t is not None else None
+    loss = _fast_lin_ce(
+        logits.cuda(), eye, action_tokens.cuda(), action_code_mask.cuda(), predict
+    )
+    return loss.cpu()
+
+
+def test_fast_ce_supervises_only_discrete_action_codes():
+    """Wrapper tokens can be wrong without affecting the FAST action-code loss."""
+    vocab_size = 8
+    action_tokens = torch.tensor([[1, 2, 3, 4, 5, 0]])
+    action_code_mask = torch.tensor([[False, False, True, True, False, False]])
+
+    logits = torch.zeros(1, action_tokens.shape[1], vocab_size)
+    # Deliberately bad wrapper-token predictions. These should be ignored.
+    logits[0, 0, 7] = 10.0  # target would be token 2
+    logits[0, 3, 7] = 10.0  # target would be delimiter token 5
+    # Correct action-code predictions: hidden t predicts target t + 1.
+    logits[0, 1, 3] = 10.0
+    logits[0, 2, 4] = 10.0
+
+    loss = _fast_ce(logits, action_tokens, action_code_mask, predict_actions_t=None)
+    expected = F.cross_entropy(
+        torch.stack([logits[0, 1], logits[0, 2]]),
+        torch.tensor([3, 4]),
+        reduction="mean",
+    )
+
+    # Looser tolerance: the fused Triton kernel (GPU) differs from CPU eager
+    # F.cross_entropy at the ~1e-7 level, which exceeds the default rtol on
+    # these very small (~1e-4) losses.
+    assert torch.allclose(loss, expected, atol=1e-5, rtol=1e-3)
+
+
+def test_fast_ce_masks_non_action_samples():
+    """Recipe samples with predict_actions=False do not contribute FAST loss."""
+    vocab_size = 8
+    action_tokens = torch.tensor([[1, 2, 3, 4], [1, 2, 5, 6]])
+    action_code_mask = torch.tensor(
+        [[False, False, True, True], [False, False, True, True]]
+    )
+    predict_actions = torch.tensor([True, False])
+
+    logits = torch.zeros(2, action_tokens.shape[1], vocab_size)
+    logits[0, 1, 3] = 10.0
+    logits[0, 2, 4] = 10.0
+    # Bad predictions in the masked sample should not matter.
+    logits[1, 1, 7] = 10.0
+    logits[1, 2, 7] = 10.0
+
+    loss = _fast_ce(logits, action_tokens, action_code_mask, predict_actions)
+    expected = F.cross_entropy(
+        torch.stack([logits[0, 1], logits[0, 2]]),
+        torch.tensor([3, 4]),
+        reduction="mean",
+    )
+
+    # Looser tolerance: the fused Triton kernel (GPU) differs from CPU eager
+    # F.cross_entropy at the ~1e-7 level, which exceeds the default rtol on
+    # these very small (~1e-4) losses.
+    assert torch.allclose(loss, expected, atol=1e-5, rtol=1e-3)
+
+
+def test_fast_ce_returns_zero_when_no_action_code_positions_are_valid():
+    logits = torch.randn(2, 4, 8, requires_grad=True)
+    action_tokens = torch.tensor([[1, 2, 3, 4], [1, 2, 5, 6]])
+    action_code_mask = torch.zeros_like(action_tokens, dtype=torch.bool)
+
+    loss = _fast_ce(logits, action_tokens, action_code_mask, predict_actions_t=None)
+
+    assert loss.item() == 0
+    loss.backward()
+    assert logits.grad is not None
@@ -0,0 +1,153 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Numerical-parity tests for the SDPA attention port.
+
+``pi05`` / ``pi052`` replaced the per-layer call from
+``modeling_gemma.eager_attention_forward`` with
+``sdpa_attention_forward`` (PyTorch SDPA + GQA repeat). The forward
+output must match numerically (within dtype tolerance) on the masks
+this model actually uses — block-bidirectional with an arbitrary
+additive bias — otherwise we silently change training behaviour.
+"""
+
+from types import SimpleNamespace
+
+import pytest
+import torch
+
+pytest.importorskip("transformers")
+
+from transformers.models.gemma import modeling_gemma  # noqa: E402
+
+from lerobot.policies.pi052.modeling_pi052 import make_att_2d_masks  # noqa: E402
+from lerobot.policies.pi_gemma import sdpa_attention_forward  # noqa: E402
+from lerobot.utils.constants import OPENPI_ATTENTION_MASK_VALUE  # noqa: E402
+
+
+def _mock_self_attn(num_kv_groups: int, training: bool = False):
+    """Bare module surface that both forwards read."""
+    return SimpleNamespace(
+        num_key_value_groups=num_kv_groups,
+        training=training,
+    )
+
+
+def _build_inputs(
+    bsize: int,
+    num_heads: int,
+    num_kv_heads: int,
+    seq_len: int,
+    head_dim: int,
+    dtype: torch.dtype,
+    seed: int = 0,
+):
+    g = torch.Generator(device="cpu").manual_seed(seed)
+    q = torch.randn(bsize, num_heads, seq_len, head_dim, dtype=dtype, generator=g)
+    k = torch.randn(bsize, num_kv_heads, seq_len, head_dim, dtype=dtype, generator=g)
+    v = torch.randn(bsize, num_kv_heads, seq_len, head_dim, dtype=dtype, generator=g)
+    return q, k, v
+
+
+def _block_bidirectional_mask(
+    bsize: int, seq_len: int, block_sizes: list[int], dtype: torch.dtype
+) -> torch.Tensor:
+    """Mimic ``_prepare_attention_masks_4d`` on a block layout that
+    matches ``[images, language, suffix]`` from ``embed_prefix`` +
+    ``embed_suffix``: every block bidirectional internally, earlier
+    blocks visible to later ones via the cumulative-block rule.
+    """
+    assert sum(block_sizes) == seq_len
+    att_marks = []
+    for i, n in enumerate(block_sizes):
+        att_marks += [1 if i > 0 else 0] + [0] * (n - 1)
+    pad = torch.ones(bsize, seq_len, dtype=torch.bool)
+    att = torch.tensor(att_marks, dtype=torch.bool)[None].expand(bsize, seq_len)
+    att_2d = make_att_2d_masks(pad, att)
+    bias = torch.where(
+        att_2d[:, None, :, :],
+        torch.zeros((), dtype=dtype),
+        torch.tensor(OPENPI_ATTENTION_MASK_VALUE, dtype=dtype),
+    )
+    return bias
+
+
+@pytest.mark.parametrize(
+    "num_heads,num_kv_heads,head_dim",
+    [
+        (8, 1, 256),  # gemma_2b / paligemma config
+        (8, 8, 64),   # MHA control (no GQA repeat)
+    ],
+)
+def test_sdpa_parity_with_eager_block_bidirectional(num_heads, num_kv_heads, head_dim):
+    """SDPA forward output matches the eager softmax(QK^T)@V on the
+    block-bidirectional mask layout pi05 actually uses."""
+    bsize, seq_len = 2, 13
+    block_sizes = [4, 5, 4]  # images, language, suffix-style blocks
+    dtype = torch.float32   # cpu math kernel — keep fp32 for tight tol
+    scaling = head_dim ** -0.5
+
+    q, k, v = _build_inputs(bsize, num_heads, num_kv_heads, seq_len, head_dim, dtype)
+    mask = _block_bidirectional_mask(bsize, seq_len, block_sizes, dtype)
+
+    module = _mock_self_attn(num_heads // num_kv_heads)
+
+    out_eager, _ = modeling_gemma.eager_attention_forward(
+        module, q, k, v, mask, scaling
+    )
+    out_sdpa, _ = sdpa_attention_forward(
+        module, q, k, v, mask, scaling
+    )
+    assert out_eager.shape == out_sdpa.shape
+    torch.testing.assert_close(out_sdpa, out_eager, atol=1e-5, rtol=1e-4)
+
+
+def test_sdpa_parity_bf16():
+    """bf16 path — looser tolerance, must still match eager."""
+    bsize, num_heads, num_kv_heads, seq_len, head_dim = 2, 8, 1, 17, 256
+    scaling = head_dim ** -0.5
+    q, k, v = _build_inputs(bsize, num_heads, num_kv_heads, seq_len, head_dim, torch.bfloat16)
+    mask = _block_bidirectional_mask(bsize, seq_len, [5, 6, 6], torch.bfloat16)
+    module = _mock_self_attn(num_heads // num_kv_heads)
+
+    out_eager, _ = modeling_gemma.eager_attention_forward(
+        module, q, k, v, mask, scaling
+    )
+    out_sdpa, _ = sdpa_attention_forward(
+        module, q, k, v, mask, scaling
+    )
+    torch.testing.assert_close(out_sdpa, out_eager, atol=2e-2, rtol=2e-2)
+
+
+def test_sdpa_parity_backward():
+    """Gradients flow through SDPA and match the eager path within
+    fp32 tolerance, which any training-side parity claim depends on."""
+    bsize, num_heads, num_kv_heads, seq_len, head_dim = 1, 4, 2, 9, 32
+    scaling = head_dim ** -0.5
+    q, k, v = _build_inputs(bsize, num_heads, num_kv_heads, seq_len, head_dim, torch.float32)
+    q, k, v = (t.requires_grad_(True) for t in (q, k, v))
+    mask = _block_bidirectional_mask(bsize, seq_len, [3, 3, 3], torch.float32)
+    module = _mock_self_attn(num_heads // num_kv_heads)
+
+    out_e, _ = modeling_gemma.eager_attention_forward(module, q, k, v, mask, scaling)
+    g_q_e, g_k_e, g_v_e = torch.autograd.grad(out_e.sum(), [q, k, v])
+
+    out_s, _ = sdpa_attention_forward(module, q, k, v, mask, scaling)
+    g_q_s, g_k_s, g_v_s = torch.autograd.grad(out_s.sum(), [q, k, v])
+
+    torch.testing.assert_close(g_q_s, g_q_e, atol=1e-5, rtol=1e-4)
+    torch.testing.assert_close(g_k_s, g_k_e, atol=1e-5, rtol=1e-4)
+    torch.testing.assert_close(g_v_s, g_v_e, atol=1e-5, rtol=1e-4)
@@ -0,0 +1,196 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for PI052's text tokenizer.
+
+Covers ``say`` tool-call flattening (PaliGemma's flat prompt has no
+structured tool calls, so a ``say`` call must be serialized into a
+``<say>...</say>`` text marker) and EOS-termination supervision (the
+supervised target span must end with an EOS token so the LM head learns
+to stop instead of rambling to ``max_length`` at inference).
+"""
+
+import torch
+
+from lerobot.policies.pi052.text_processor_pi052 import (
+    PI052TextTokenizerStep,
+    _flatten_say_tool_calls,
+    _format_messages,
+)
+from lerobot.types import TransitionKey
+from lerobot.utils.constants import OBS_LANGUAGE_ATTENTION_MASK, OBS_LANGUAGE_TOKENS
+
+
+def _say_call(text):
+    return {"type": "function", "function": {"name": "say", "arguments": {"text": text}}}
+
+
+def test_flatten_appends_say_marker_and_drops_tool_calls():
+    msg = {"role": "assistant", "content": "Heading to the cube.", "tool_calls": [_say_call("On it!")]}
+    out = _flatten_say_tool_calls(msg)
+    assert "tool_calls" not in out
+    assert out["content"] == "Heading to the cube.\n<say>On it!</say>"
+
+
+def test_flatten_marker_only_when_content_empty_or_none():
+    out = _flatten_say_tool_calls({"role": "assistant", "tool_calls": [_say_call("hi")]})
+    assert out["content"] == "<say>hi</say>"
+
+
+def test_flatten_accepts_json_string_arguments():
+    call = {"type": "function", "function": {"name": "say", "arguments": '{"text": "hello there"}'}}
+    out = _flatten_say_tool_calls({"role": "assistant", "content": "p", "tool_calls": [call]})
+    assert out["content"] == "p\n<say>hello there</say>"
+
+
+def test_flatten_leaves_messages_without_tool_calls_untouched():
+    msg = {"role": "assistant", "content": "just a plan"}
+    assert _flatten_say_tool_calls(msg) == msg
+
+
+def test_flatten_drops_non_say_tool_calls_but_keeps_content():
+    weather = {"type": "function", "function": {"name": "check_weather", "arguments": {}}}
+    out = _flatten_say_tool_calls(
+        {"role": "assistant", "content": "plan only", "tool_calls": [weather]}
+    )
+    assert out["content"] == "plan only"
+    assert "tool_calls" not in out
+
+
+# ---------------------------------------------------------------------------
+# EOS-termination supervision
+# ---------------------------------------------------------------------------
+
+
+def test_format_messages_appends_eos_to_target_turns_only():
+    msgs = [
+        {"role": "user", "content": "pick cube"},
+        {"role": "assistant", "content": "move to cube"},
+    ]
+    prompt, spans = _format_messages(msgs, target_indices=[1], eos_token="<eos>")
+    # EOS is appended to the supervised target (assistant) turn only.
+    assert prompt == "User: pick cube\nAssistant: move to cube<eos>\n"
+    # The user span is unchanged; the target span covers content + EOS.
+    assert prompt[spans[0][0] : spans[0][1]] == "pick cube"
+    assert prompt[spans[1][0] : spans[1][1]] == "move to cube<eos>"
+
+
+def test_format_messages_without_eos_args_is_unchanged():
+    """Inference callers omit target_indices / eos_token — no EOS baked in."""
+    prompt, spans = _format_messages([{"role": "user", "content": "hi"}])
+    assert prompt == "User: hi\n"
+    assert prompt[spans[0][0] : spans[0][1]] == "hi"
+
+
+def _eos_char_id() -> int:
+    """Token id _CharTokenizer assigns to its 1-char EOS."""
+    return ord(_CharTokenizer.eos_token) % 251 + 1
+
+
+def test_pi052_text_tokenizer_supervises_eos_at_target_end():
+    """The appended EOS is the last supervised label on a target turn —
+    that's the signal that teaches the LM head to stop. The trailing
+    newline right after it stays unsupervised (-100)."""
+    step = PI052TextTokenizerStep(max_length=64)
+    step._tokenizer = _CharTokenizer()
+    transition = {
+        TransitionKey.OBSERVATION: {},
+        TransitionKey.COMPLEMENTARY_DATA: {
+            "messages": [
+                {"role": "user", "content": "pick cube"},
+                {"role": "assistant", "content": "move to cube"},
+            ],
+            "target_message_indices": [1],
+            "message_streams": ["high_level", "high_level"],
+            "index": torch.tensor(10),
+        },
+    }
+    out = step(transition)
+    ids = out[TransitionKey.OBSERVATION][OBS_LANGUAGE_TOKENS][0]
+    labels = out[TransitionKey.COMPLEMENTARY_DATA]["text_labels"][0]
+
+    supervised = (labels != -100).nonzero().flatten().tolist()
+    assert supervised, "target turn produced no supervised labels"
+    last = supervised[-1]
+    # The last supervised token is the appended EOS.
+    assert int(ids[last]) == _eos_char_id()
+    assert int(labels[last]) == _eos_char_id()
+    # The token right after the EOS (the trailing newline) is NOT supervised.
+    assert int(labels[last + 1]) == -100
+
+
+class _CharTokenizer:
+    pad_token_id = 0
+    eos_token = "\x1f"  # unit separator — a 1-char "EOS" for testing
+
+    def __call__(
+        self,
+        text,
+        max_length,
+        padding,
+        truncation,
+        return_tensors,
+        return_offsets_mapping,
+        padding_side,
+    ):
+        ids = [ord(c) % 251 + 1 for c in text[:max_length]]
+        offsets = [(i, i + 1) for i in range(len(ids))]
+        attention = [1] * len(ids)
+        if padding == "max_length" and len(ids) < max_length:
+            pad = max_length - len(ids)
+            ids += [self.pad_token_id] * pad
+            offsets += [(0, 0)] * pad
+            attention += [0] * pad
+        return {
+            "input_ids": torch.tensor([ids], dtype=torch.long),
+            "attention_mask": torch.tensor([attention], dtype=torch.long),
+            "offset_mapping": torch.tensor([offsets], dtype=torch.long),
+        }
+
+    def decode(self, token_ids, skip_special_tokens=False):
+        return "".join(chr(max(int(i) - 1, 0)) for i in token_ids if int(i) != self.pad_token_id)
+
+
+def test_pi052_text_tokenizer_handles_batched_rendered_messages():
+    step = PI052TextTokenizerStep(max_length=64)
+    step._tokenizer = _CharTokenizer()
+
+    transition = {
+        TransitionKey.OBSERVATION: {},
+        TransitionKey.COMPLEMENTARY_DATA: {
+            "messages": [
+                [
+                    {"role": "user", "content": "pick cube"},
+                    {"role": "assistant", "content": "move to cube"},
+                ],
+                [{"role": "user", "content": "open drawer"}],
+            ],
+            "target_message_indices": [[1], []],
+            "message_streams": [["high_level", "high_level"], ["low_level"]],
+            "index": torch.tensor([10, 11]),
+        },
+    }
+
+    out = step(transition)
+    obs = out[TransitionKey.OBSERVATION]
+    comp = out[TransitionKey.COMPLEMENTARY_DATA]
+
+    assert obs[OBS_LANGUAGE_TOKENS].shape == (2, 64)
+    assert obs[OBS_LANGUAGE_ATTENTION_MASK].shape == (2, 64)
+    assert comp["text_labels"].shape == (2, 64)
+    assert comp["predict_actions"].tolist() == [False, True]
+    assert (comp["text_labels"][0] != -100).any()
+    assert not (comp["text_labels"][1] != -100).any()
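Reviewer note on the EOS-termination tests in this file: the behaviour they pin can be sketched in plain Python. This is a hypothetical reimplementation inferred only from the assertions above (the `"Role: content\n"` layout, EOS appended to target turns, and a trailing newline that stays at `-100`). The names `format_messages_sketch` and `char_labels_sketch` are illustrative and are not the real `_format_messages` or `PI052TextTokenizerStep` code.

```python
IGNORE_INDEX = -100  # label value the tests treat as "not supervised"


def format_messages_sketch(messages, target_indices=(), eos_token=None):
    """Render messages as "Role: content\\n" and return (prompt, content spans).

    Assumption: EOS is appended only to turns listed in target_indices,
    mirroring what the tests assert. Spans are character offsets that cover
    the content (plus EOS when present) but not the trailing newline.
    """
    targets = set(target_indices)
    parts, spans, cursor = [], [], 0
    for i, msg in enumerate(messages):
        prefix = f"{msg['role'].capitalize()}: "
        content = msg["content"]
        if eos_token is not None and i in targets:
            content += eos_token
        start = cursor + len(prefix)
        end = start + len(content)
        spans.append((start, end))
        parts.append(prefix + content + "\n")
        cursor = end + 1  # skip the newline that closes this turn
    return "".join(parts), spans


def char_labels_sketch(prompt, spans, target_indices):
    """Character-level labels: supervised inside target spans, -100 elsewhere."""
    labels = [IGNORE_INDEX] * len(prompt)
    for i in target_indices:
        start, end = spans[i]
        for pos in range(start, end):
            labels[pos] = ord(prompt[pos])
    return labels
```

Because the span ends right after the EOS, the last supervised position is the EOS itself and the newline that follows it is left unsupervised, which is the stopping signal the tests check for.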
@@ -0,0 +1,187 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Training-side conversion of VQA answers to PaliGemma ``<loc>`` text.
+
+PI052 trains spatial VQA answers (``bbox`` / ``keypoint``) in
+PaliGemma's native ``<locNNNN>`` detection vocabulary so the LM head
+reuses the detection prior instead of fighting it (the ``<loc>``-salad
+bug). The dataset stores Qwen2.5-VL's grounding output — **0–1000
+normalized** coordinates, *not* pixels. (Verified empirically on the
+published datasets: x and y both span 0..1000 with ~30% of values
+exceeding the camera's pixel dimensions.) The conversion is therefore
+camera-resolution-independent. The dataset stays backbone-agnostic
+JSON; the conversion lives in PI052's tokenizer. These tests pin the
+JSON → ``<loc>`` rewrite.
+"""
+
+import pytest
+
+pytest.importorskip("transformers")
+
+from lerobot.policies.pi052.text_processor_pi052 import (  # noqa: E402
+    _loc_token,
+    _messages_vqa_to_loc,
+    _vqa_answer_to_loc,
+    register_paligemma_loc_tokens,
+)
+
+
+class _FakeTokenizer:
+    """Tracks ``add_tokens`` calls; mimics the bits ``register_paligemma_loc_tokens`` reads."""
+
+    def __init__(self, prepopulate: bool = False):
+        self.added_tokens_encoder: dict[str, int] = {}
+        self.calls: list[list[str]] = []
+        if prepopulate:
+            self.added_tokens_encoder["<loc0000>"] = 256000
+
+    def add_tokens(self, tokens: list[str]) -> int:
+        self.calls.append(list(tokens))
+        for t in tokens:
+            self.added_tokens_encoder.setdefault(t, len(self.added_tokens_encoder) + 256000)
+        return len(tokens)
+
+
+def test_register_loc_tokens_adds_full_1024_range():
+    tok = _FakeTokenizer()
+    out = register_paligemma_loc_tokens(tok)
+    assert out is tok  # returns same instance
+    assert len(tok.calls) == 1
+    added = tok.calls[0]
+    assert len(added) == 1024
+    assert added[0] == "<loc0000>"
+    assert added[-1] == "<loc1023>"
+    # Spot check a few in the middle.
+    assert added[162] == "<loc0162>"
+    assert added[759] == "<loc0759>"
+
+
+def test_register_loc_tokens_is_idempotent():
+    """If the loc tokens are already present we skip re-adding them."""
+    tok = _FakeTokenizer(prepopulate=True)
+    register_paligemma_loc_tokens(tok)
+    register_paligemma_loc_tokens(tok)
+    assert tok.calls == []  # never called add_tokens
+
+
+def test_loc_token_normalizes_and_clamps():
+    # Default scale is the 0–1000 Qwen convention.
+    assert _loc_token(0) == "<loc0000>"
+    assert _loc_token(1000) == "<loc1023>"
+    assert _loc_token(500) == f"<loc{round(500 / 1000 * 1023):04d}>"
+    # out-of-range coordinates clamp into [0, 1023]
+    assert _loc_token(9999) == "<loc1023>"
+    assert _loc_token(-5) == "<loc0000>"
+
+
+def test_vqa_answer_to_loc_keypoint_normalized():
+    # Label-first: avoids the "Assistant: → <loc>" attractor at training.
+    answer = {"label": "blue cube", "point_format": "xy", "point": [500, 500]}
+    assert _vqa_answer_to_loc(answer) == "blue cube <loc0512><loc0512>"
+
+
+def test_vqa_answer_to_loc_bbox_normalized():
+    answer = {
+        "detections": [{"label": "cube", "bbox_format": "xyxy", "bbox": [0, 0, 1000, 1000]}]
+    }
+    assert _vqa_answer_to_loc(answer) == "cube <loc0000><loc0000><loc1023><loc1023>"
+
+
+def test_vqa_answer_to_loc_multiple_detections_separator():
+    answer = {
+        "detections": [
+            {"label": "blue", "bbox_format": "xyxy", "bbox": [0, 0, 500, 500]},
+            {"label": "yellow", "bbox_format": "xyxy", "bbox": [500, 500, 1000, 1000]},
+        ]
+    }
+    out = _vqa_answer_to_loc(answer)
+    # Each segment is "label <locs>", joined by " ; "
+    assert out == (
+        "blue <loc0000><loc0000><loc0512><loc0512> ; "
+        "yellow <loc0512><loc0512><loc1023><loc1023>"
+    )
+
+
+def test_vqa_answer_to_loc_returns_none_for_non_spatial():
+    assert _vqa_answer_to_loc({"label": "cubes", "count": 2}) is None
+    assert _vqa_answer_to_loc({"weird": "payload"}) is None
+
+
+def test_messages_vqa_to_loc_rewrites_target_turn():
+    messages = [
+        {"role": "user", "content": [{"type": "text", "text": "where is the cube?"}]},
+        {
+            "role": "assistant",
+            "content": '{"label": "cube", "point_format": "xy", "point": [500, 500]}',
+        },
+    ]
+    out = _messages_vqa_to_loc(messages, target_indices=[1])
+    assert out[1]["content"] == "cube <loc0512><loc0512>"
+    # input messages are not mutated
+    assert messages[1]["content"].startswith("{")
+
+
+def test_messages_vqa_to_loc_leaves_plain_text_targets_untouched():
+    messages = [
+        {"role": "user", "content": "pick the cube"},
+        {"role": "assistant", "content": "pick up the cube"},
+    ]
+    out = _messages_vqa_to_loc(messages, target_indices=[1])
+    assert out[1]["content"] == "pick up the cube"
+
+
+def test_messages_vqa_to_loc_noop_without_target_indices():
+    messages = [
+        {"role": "assistant", "content": '{"label": "c", "point_format": "xy", "point": [1, 2]}'}
+    ]
+    assert _messages_vqa_to_loc(messages, []) is messages
+
+
+# ---------------------------------------------------------------------------
+# Round-trip: training-side JSON -> <loc> -> runtime-side parse back
+#
+# Pins that the conversion preserves coordinate *order* (JSON is x-first,
+# PaliGemma <loc> is y-first) and the 0–1000 → [0, 1023] scaling. The
+# only loss is quantization to the 1024-bucket <loc> grid, so a coord
+# survives within half a bucket (~1000/2046 ≈ 0.49 on the 0–1000 scale).
+# ---------------------------------------------------------------------------
+
+
+def test_loc_round_trip_keypoint_preserves_normalized_coords():
+    from lerobot.policies.pi052.inference.vqa import parse_vqa_answer
+
+    answer = {"label": "blue cube", "point_format": "xy", "point": [640, 480]}
+    loc = _vqa_answer_to_loc(answer)
+    parsed = parse_vqa_answer(loc)
+    nx, ny = parsed["payload"]["point"]
+    # parse_vqa_answer returns [0, 1] normalized; rescale back to 0–1000.
+    assert abs(nx * 1000.0 - 640) <= 1000.0 / 2046 + 1e-6
+    assert abs(ny * 1000.0 - 480) <= 1000.0 / 2046 + 1e-6
+    assert parsed["payload"]["label"] == "blue cube"
+
+
+def test_loc_round_trip_bbox_preserves_order_and_scale():
+    from lerobot.policies.pi052.inference.vqa import parse_vqa_answer
+
+    answer = {
+        "detections": [{"label": "cube", "bbox_format": "xyxy", "bbox": [100, 200, 800, 900]}]
+    }
+    loc = _vqa_answer_to_loc(answer)
+    parsed = parse_vqa_answer(loc)
+    x1, y1, x2, y2 = parsed["payload"]["detections"][0]["bbox"]
+    for got, want in ((x1, 100), (y1, 200), (x2, 800), (y2, 900)):
+        assert abs(got * 1000.0 - want) <= 1000.0 / 2046 + 1e-6
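Reviewer note on the round-trip tolerance used in these tests: the half-bucket bound `1000.0 / 2046` follows from rounding a 0–1000 coordinate onto the 1024-bucket `<loc>` grid. The sketch below is a hypothetical stand-in written only from the test assertions; `loc_token_sketch` and `loc_to_scale_sketch` are illustrative names, not the real `_loc_token` or `parse_vqa_answer`, and the inverse mapping `bucket / 1023` is an assumption.

```python
import re

LOC_BUCKETS = 1024  # <loc0000> .. <loc1023>


def loc_token_sketch(value: float, scale: float = 1000.0) -> str:
    """Map a coordinate on [0, scale] to a <locNNNN> token, clamping out-of-range input."""
    bucket = round(value / scale * (LOC_BUCKETS - 1))
    bucket = max(0, min(LOC_BUCKETS - 1, bucket))
    return f"<loc{bucket:04d}>"


def loc_to_scale_sketch(token: str, scale: float = 1000.0) -> float:
    """Inverse mapping (assumed): bucket index back to the [0, scale] range."""
    match = re.fullmatch(r"<loc(\d{4})>", token)
    if match is None:
        raise ValueError(f"not a <loc> token: {token!r}")
    return int(match.group(1)) / (LOC_BUCKETS - 1) * scale


# Each bucket spans scale / 1023 units, so rounding moves an in-range
# coordinate by at most half of that: 1000 / (2 * 1023) = 1000 / 2046.
HALF_BUCKET = 1000.0 / 2046
```

Under these assumptions every in-range coordinate survives the round trip to within `HALF_BUCKET`, with 500 landing on the bound itself (it sits halfway between buckets 511 and 512).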
@@ -24,7 +24,6 @@ from typing import Any
 import pytest
 import torch
 import torch.nn as nn
-from safetensors.torch import load_file

 pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")

@@ -175,53 +174,6 @@ class MockStepWithTensorState(ProcessorStep):
        return features


-class MockLazyTensorStateStep(ProcessorStep):
-    """Mock step whose tensor state is not present in constructor config."""
-
-    def __init__(
-        self, name: str = "lazy_tensor_step", scale: float = 1.0, initial_value: float | None = None
-    ):
-        self.name = name
-        self.scale = scale
-        self.tensor_state: torch.Tensor | None = None
-
-        if initial_value is not None:
-            self.tensor_state = torch.tensor([initial_value], dtype=torch.float32)
-
-    def __call__(self, transition: EnvTransition) -> EnvTransition:
-        """Return the transition unchanged."""
-        return transition
-
-    def get_config(self) -> dict[str, Any]:
-        """Return constructor config while intentionally omitting tensor state."""
-        return {
-            "name": self.name,
-            "scale": self.scale,
-        }
-
-    def state_dict(self) -> dict[str, torch.Tensor]:
-        """Return tensor state only after it has been initialized or loaded."""
-        if self.tensor_state is None:
-            return {}
-
-        return {"tensor_state": self.tensor_state}
-
-    def load_state_dict(self, state: dict[str, torch.Tensor]) -> None:
-        """Load tensor state."""
-        self.tensor_state = state["tensor_state"].clone()
-
-    def transform_features(
-        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
-    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
-        """Return features unchanged."""
-        return features
-
-
-@ProcessorStepRegistry.register("registered_lazy_tensor_state_step")
-class RegisteredLazyTensorStateStep(MockLazyTensorStateStep):
-    """Registered lazy tensor state step for registry-based serialization tests."""
-
-
 def test_empty_pipeline():
    """Test pipeline with no steps."""
    pipeline = DataProcessorPipeline([], to_transition=identity_transition, to_output=identity_transition)
@@ -668,178 +620,6 @@ def test_mixed_json_and_tensor_state():
        assert torch.allclose(loaded_step.running_mean, step.running_mean)


-def test_get_config_matches_saved_json():
-    """Test that in-memory config matches the config written by save_pretrained."""
-    stateless_step = MockStep(name="stateless")
-    stateful_step = MockLazyTensorStateStep(name="stateful", initial_value=4.0)
-    pipeline = DataProcessorPipeline([stateless_step, stateful_step], name="Memory Pipeline")
-
-    in_memory_config = pipeline.get_config()
-
-    assert pipeline.get_config() == in_memory_config
-
-    with tempfile.TemporaryDirectory() as tmp_dir:
-        pipeline.save_pretrained(tmp_dir)
-
-        config_path = Path(tmp_dir) / "memory_pipeline.json"
-        with open(config_path) as file_pointer:
-            saved_config = json.load(file_pointer)
-
-    assert in_memory_config == saved_config
-    assert "state_file" not in in_memory_config["steps"][0]
-    assert in_memory_config["steps"][1]["state_file"] == "memory_pipeline_step_1.safetensors"
-
-
-def test_state_dict_matches_saved_safetensors():
-    """Test that in-memory state matches the safetensors written by save_pretrained."""
-    stateful_step = MockLazyTensorStateStep(initial_value=7.0)
-    pipeline = DataProcessorPipeline([stateful_step], name="Stateful Pipeline")
-
-    in_memory_state_dict = pipeline.state_dict()
-    state_filename = "stateful_pipeline_step_0.safetensors"
-    state_key = "stateful_pipeline_step_0"
-
-    assert set(in_memory_state_dict) == {state_key}
-    assert set(in_memory_state_dict[state_key]) == {"tensor_state"}
-
-    in_memory_state_dict[state_key]["tensor_state"].add_(1)
-    assert stateful_step.tensor_state is not None
-    assert torch.equal(stateful_step.tensor_state, torch.tensor([7.0]))
-
-    with tempfile.TemporaryDirectory() as tmp_dir:
-        pipeline.save_pretrained(tmp_dir)
-        saved_state_dict = load_file(Path(tmp_dir) / state_filename)
-
-    torch.testing.assert_close(saved_state_dict["tensor_state"], torch.tensor([7.0]))
-
-
-def test_save_pretrained_still_writes_expected_serialization_files():
-    """Test that save_pretrained keeps the existing config and state filenames."""
-    stateful_step = MockLazyTensorStateStep(initial_value=3.0)
-    pipeline = DataProcessorPipeline([stateful_step], name="Policy Preprocessor")
-
-    with tempfile.TemporaryDirectory() as tmp_dir:
-        pipeline.save_pretrained(tmp_dir)
-
-        save_path = Path(tmp_dir)
-        assert (save_path / "policy_preprocessor.json").exists()
-        assert (save_path / "policy_preprocessor_step_0.safetensors").exists()
-
-
-def test_from_config_round_trips_stateful_pipeline():
-    """Test that from_config rebuilds a stateful pipeline from in-memory artifacts."""
-    stateful_step = MockLazyTensorStateStep(name="roundtrip", initial_value=11.0)
-    pipeline = DataProcessorPipeline([stateful_step], name="Roundtrip Pipeline")
-    config = pipeline.get_config()
-    pipeline_state_dict = pipeline.state_dict()
-
-    loaded_pipeline = DataProcessorPipeline.from_config(config, state_dict=pipeline_state_dict)
-    loaded_step = loaded_pipeline.steps[0]
-
-    assert len(loaded_pipeline) == 1
-    assert isinstance(loaded_step, MockLazyTensorStateStep)
-    torch.testing.assert_close(loaded_step.tensor_state, torch.tensor([11.0]))
-
-
-def test_from_config_round_trips_registered_stateful_pipeline():
-    """Test that from_config resolves registry steps and loads their named tensor state."""
-    stateful_step = RegisteredLazyTensorStateStep(name="registered", initial_value=29.0)
-    pipeline = DataProcessorPipeline([stateful_step], name="Registry Pipeline")
-    config = pipeline.get_config()
-    pipeline_state_dict = pipeline.state_dict()
-    state_filename = "registry_pipeline_step_0_registered_lazy_tensor_state_step.safetensors"
-    state_key = "registry_pipeline_step_0_registered_lazy_tensor_state_step"
-
-    assert config["steps"][0]["registry_name"] == "registered_lazy_tensor_state_step"
-    assert config["steps"][0]["state_file"] == state_filename
-    assert set(pipeline_state_dict) == {state_key}
-
-    loaded_pipeline = DataProcessorPipeline.from_config(config, state_dict=pipeline_state_dict)
-    loaded_step = loaded_pipeline.steps[0]
-
-    assert isinstance(loaded_step, RegisteredLazyTensorStateStep)
-    assert loaded_step.tensor_state is not None
-    torch.testing.assert_close(loaded_step.tensor_state, torch.tensor([29.0]))
-
-
-def test_from_config_preserves_state_metadata_for_empty_initial_state():
-    """Test in-memory loading when rebuilt steps start without tensor state."""
-    stateful_step = MockLazyTensorStateStep(name="lazy", initial_value=13.0)
-    pipeline = DataProcessorPipeline([stateful_step], name="Lazy Pipeline")
-    config = pipeline.get_config()
-    pipeline_state_dict = pipeline.state_dict()
-
-    loaded_pipeline = DataProcessorPipeline.from_config(config)
-    loaded_step = loaded_pipeline.steps[0]
-
-    assert isinstance(loaded_step, MockLazyTensorStateStep)
-    assert loaded_step.state_dict() == {}
-    assert "state_file" not in loaded_pipeline.get_config()["steps"][0]
-
-    loaded_pipeline.load_state_dict(pipeline_state_dict)
-
-    torch.testing.assert_close(loaded_step.tensor_state, torch.tensor([13.0]))
-
-
-def test_from_config_applies_overrides_before_state_loading():
-    """Test that constructor overrides and tensor state loading are separate operations."""
-    stateful_step = MockLazyTensorStateStep(name="override", scale=1.0, initial_value=17.0)
-    pipeline = DataProcessorPipeline([stateful_step], name="Override Pipeline")
-    config = pipeline.get_config()
-    pipeline_state_dict = pipeline.state_dict()
-
-    loaded_pipeline = DataProcessorPipeline.from_config(
-        config,
-        state_dict=pipeline_state_dict,
-        overrides={"MockLazyTensorStateStep": {"scale": 5.0}},
-    )
-    loaded_step = loaded_pipeline.steps[0]
-
-    assert isinstance(loaded_step, MockLazyTensorStateStep)
-    assert loaded_step.scale == 5.0
-    torch.testing.assert_close(loaded_step.tensor_state, torch.tensor([17.0]))
-
-
-def test_load_state_dict_raises_on_missing_expected_state():
-    """Test loading raises when serialized config expects missing state."""
-    stateful_step = MockLazyTensorStateStep(initial_value=19.0)
-    pipeline = DataProcessorPipeline([stateful_step], name="Missing Pipeline")
-    loaded_pipeline = DataProcessorPipeline.from_config(pipeline.get_config())
-
-    with pytest.raises(KeyError, match="missing_pipeline_step_0"):
-        loaded_pipeline.load_state_dict({})
-
-
-def test_load_state_dict_raises_on_unexpected_extra_state():
-    """Test loading raises on unexpected top-level state keys."""
-    pipeline = DataProcessorPipeline([MockStep(name="stateless")], name="Unexpected Pipeline")
-
-    with pytest.raises(KeyError, match="extra"):
-        pipeline.load_state_dict({"extra": {"tensor_state": torch.tensor([1.0])}})
-
-
-def test_stateless_pipeline_in_memory_serialization_returns_empty_state():
-    """Test stateless in-memory serialization and loading."""
-    pipeline = DataProcessorPipeline([MockStep(name="stateless")], name="Stateless Pipeline")
-    config = pipeline.get_config()
-    config_without_name = {"steps": config["steps"]}
-
-    assert pipeline.state_dict() == {}
-    assert all("state_file" not in step_entry for step_entry in config["steps"])
-
-    loaded_pipeline = DataProcessorPipeline.from_config(config_without_name, state_dict={})
-
-    assert loaded_pipeline.name == "DataProcessorPipeline"
-    assert loaded_pipeline.state_dict() == {}
-
-
-@pytest.mark.parametrize("invalid_config", [None, [], "not config"])
-def test_from_config_rejects_non_dict_config(invalid_config):
-    """Test from_config reports invalid top-level config values cleanly."""
-    with pytest.raises(ValueError, match="not a valid processor configuration"):
-        DataProcessorPipeline.from_config(invalid_config)  # type: ignore[arg-type]
-
-
 class MockModuleStep(ProcessorStep, nn.Module):
    """Mock step that inherits from nn.Module to test state_dict handling of module parameters."""

@@ -58,3 +58,70 @@ def test_render_messages_step_renders_and_drops_raw_language():
    assert data["messages"][-1]["content"] == "reach carefully"
    assert data["message_streams"] == ["high_level", "low_level"]
    assert data["target_message_indices"] == [1]
+
+
+def test_render_messages_step_falls_back_to_low_level_task_when_recipe_misses():
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(
+                role="assistant",
+                content="${subtask}",
+                stream="high_level",
+                target=True,
+                if_present="subtask",
+            ),
+        ]
+    )
+    transition = create_transition(
+        complementary_data={
+            "task": "pick the cube",
+            "timestamp": torch.tensor(0.0),
+            "index": torch.tensor(7),
+            "language_persistent": [],
+            "language_events": [{"style": "unmatched", "timestamp": 0.0}],
+        }
+    )
+
+    out = RenderMessagesStep(recipe)(transition)
+    data = out[TransitionKey.COMPLEMENTARY_DATA]
+
+    assert data["messages"] == [{"role": "user", "content": "pick the cube"}]
+    assert data["message_streams"] == ["low_level"]
+    assert data["target_message_indices"] == []
+
+
+def test_render_messages_step_falls_back_per_sample_in_batched_language():
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(
+                role="assistant",
+                content="${subtask}",
+                stream="high_level",
+                target=True,
+                if_present="subtask",
+            ),
+        ]
+    )
+    transition = create_transition(
+        action=torch.arange(4).reshape(2, 2),
+        complementary_data={
+            "task": ["pick the cube", "open the drawer"],
+            "timestamp": torch.tensor([0.0, 1.0]),
+            "index": torch.tensor([7, 8]),
+            "language_persistent": [[], []],
+            "language_events": [
+                [{"style": "unmatched", "timestamp": 0.0}],
+                [{"style": "unmatched", "timestamp": 1.0}],
+            ],
+        },
+    )
+
+    out = RenderMessagesStep(recipe)(transition)
+    data = out[TransitionKey.COMPLEMENTARY_DATA]
+
+    assert data["messages"] == [
+        [{"role": "user", "content": "pick the cube"}],
+        [{"role": "user", "content": "open the drawer"}],
+    ]
+    assert data["message_streams"] == [["low_level"], ["low_level"]]
+    assert data["target_message_indices"] == [[], []]
@@ -66,20 +66,6 @@ class TestOperationTypeParsing:
        with pytest.raises(ValueError, match="--new_repo_id is required for merge"):
            _validate_config(cfg)

-    @pytest.mark.parametrize("flag", ["concatenate_videos", "concatenate_data"])
-    def test_merge_concatenate_flag_defaults_true(self, flag):
-        cfg = parse_cfg(["--new_repo_id", "test/merged", "--operation.type", "merge"])
-        assert isinstance(cfg.operation, MergeConfig)
-        assert getattr(cfg.operation, flag) is True
-
-    @pytest.mark.parametrize("flag", ["concatenate_videos", "concatenate_data"])
-    def test_merge_concatenate_flag_can_be_disabled(self, flag):
-        cfg = parse_cfg(
-            ["--new_repo_id", "test/merged", "--operation.type", "merge", f"--operation.{flag}", "false"]
-        )
-        assert isinstance(cfg.operation, MergeConfig)
-        assert getattr(cfg.operation, flag) is False
-
    def test_non_merge_requires_repo_id(self):
        cfg = parse_cfg(["--operation.type", "delete_episodes"])
        with pytest.raises(ValueError, match="--repo_id is required for delete_episodes"):
@@ -1,19 +1,5 @@
 #!/usr/bin/env python

-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 import json
 from types import SimpleNamespace

@@ -42,14 +28,6 @@ def test_push_to_hub_tags_uploaded_dataset_revision(tmp_path, monkeypatch):
            calls["upload_folder"] = kwargs
            return SimpleNamespace(oid="abc123")

-        def delete_tag(self, repo_id, **kwargs):
-            import requests
-            from huggingface_hub.errors import RevisionNotFoundError
-
-            calls["delete_tag"] = {"repo_id": repo_id, **kwargs}
-            # Simulate the common case: no stale tag to delete.
-            raise RevisionNotFoundError("no such tag", response=requests.Response())
-
        def create_tag(self, **kwargs):
            calls["create_tag"] = kwargs

@@ -71,16 +49,10 @@ def test_push_to_hub_tags_uploaded_dataset_revision(tmp_path, monkeypatch):
        "exist_ok": True,
    }
    assert calls["upload_folder"]["repo_id"] == "annotated/dataset"
-    # A stale tag (e.g. from a previous annotation run) is deleted first so
-    # the new tag always points at the upload we just made.
-    assert calls["delete_tag"] == {
-        "repo_id": "annotated/dataset",
-        "tag": "v3.0",
-        "repo_type": "dataset",
-    }
    assert calls["create_tag"] == {
        "repo_id": "annotated/dataset",
        "tag": "v3.0",
        "repo_type": "dataset",
+        "exist_ok": True,
        "revision": "abc123",
    }
@@ -134,7 +134,7 @@ class TestMultiGPUTraining:
                f"--output_dir={output_dir}",
                "--batch_size=4",
                "--steps=10",
-                "--env_eval_freq=-1",
+                "--eval_freq=-1",
                "--log_freq=5",
                "--save_freq=10",
                "--seed=42",
@@ -177,7 +177,7 @@ class TestMultiGPUTraining:
                f"--output_dir={output_dir}",
                "--batch_size=4",
                "--steps=20",
-                "--env_eval_freq=-1",
+                "--eval_freq=-1",
                "--log_freq=5",
                "--save_freq=10",
                "--seed=42",
@@ -15,7 +15,6 @@
 # limitations under the License.

 import pytest
-import torch

 from lerobot.utils.logging_utils import AverageMeter, MetricsTracker

@@ -26,16 +25,8 @@ def mock_metrics():


 class MockAccelerator:
-    def __init__(self, num_processes: int, reduce_fn=None):
+    def __init__(self, num_processes: int):
        self.num_processes = num_processes
-        self.device = torch.device("cpu")
-        self._reduce_fn = reduce_fn
-
-    def reduce(self, tensor, reduction="mean"):
-        # In single-process tests we just want a deterministic stand-in for accelerate's reduce.
-        if self._reduce_fn is not None:
-            return self._reduce_fn(tensor, reduction)
-        return tensor


 def test_average_meter_initialization():
@@ -166,70 +157,3 @@ def test_metrics_tracker_reset_averages(mock_metrics):
    tracker.reset_averages()
    assert tracker.loss.avg == 0.0
    assert tracker.accuracy.avg == 0.0
-
-
-def test_average_meter_invalid_reduction():
-    with pytest.raises(ValueError):
-        AverageMeter("loss", reduction="median")
-
-
-def test_average_meter_reduction_stored():
-    meter = AverageMeter("updt_s", reduction="max")
-    assert meter.reduction == "max"
-
-
-def test_metrics_tracker_reduce_across_ranks_no_accelerator():
-    metrics = {"update_s": AverageMeter("update_s", reduction="max")}
-    tracker = MetricsTracker(batch_size=32, num_frames=1000, num_episodes=50, metrics=metrics)
-    tracker.update_s = 0.5
-    tracker.reduce_across_ranks()  # no-op without accelerator
-    assert tracker.update_s.avg == 0.5
-
-
-def test_metrics_tracker_reduce_across_ranks_single_process():
-    metrics = {"update_s": AverageMeter("update_s", reduction="max")}
-    tracker = MetricsTracker(
-        batch_size=32,
-        num_frames=1000,
-        num_episodes=50,
-        metrics=metrics,
-        accelerator=MockAccelerator(num_processes=1),
-    )
-    tracker.update_s = 0.5
-    tracker.reduce_across_ranks()  # no-op when world size is 1
-    assert tracker.update_s.avg == 0.5
-
-
-def test_metrics_tracker_reduce_across_ranks_invokes_reduce():
-    captured = {}
-
-    def fake_reduce(tensor, reduction):
-        captured["reduction"] = reduction
-        captured["values"] = tensor.clone()
-        # Pretend the slowest rank reported 0.9 instead of this rank's 0.4.
-        return torch.tensor([0.9], dtype=tensor.dtype, device=tensor.device)
-
-    metrics = {
-        "loss": AverageMeter("loss"),  # reduction="none" -> not touched
-        "update_s": AverageMeter("update_s", reduction="max"),
-    }
-    tracker = MetricsTracker(
-        batch_size=32,
-        num_frames=1000,
-        num_episodes=50,
-        metrics=metrics,
-        accelerator=MockAccelerator(num_processes=4, reduce_fn=fake_reduce),
-    )
-    tracker.loss = 1.0
-    tracker.update_s = 0.4
-    tracker.reduce_across_ranks()
-
-    assert captured["reduction"] == "max"
-    assert torch.allclose(captured["values"], torch.tensor([0.4]))
-    assert tracker.update_s.avg == pytest.approx(0.9)
-    # Metrics without a reduction stay untouched.
-    assert tracker.loss.avg == 1.0
-    # Invariant: avg == sum / count must hold after reduce, so subsequent .update() calls
-    # accumulate against the cluster view rather than the stale per-rank sum.
-    meter = tracker.update_s
-    assert meter.sum / meter.count == pytest.approx(meter.avg)
@@ -20,8 +20,6 @@ from unittest.mock import Mock, patch
 from lerobot.common.train_utils import (
    get_step_checkpoint_dir,
    get_step_identifier,
-    load_training_batch_size,
-    load_training_num_processes,
    load_training_state,
    load_training_step,
    save_checkpoint,
@@ -65,28 +63,6 @@ def test_load_training_step(tmp_path):
    assert loaded_step == step


-def test_save_training_state_records_num_processes(tmp_path, optimizer, scheduler):
-    save_training_state(tmp_path, 10, optimizer, scheduler, num_processes=4)
-    assert load_training_num_processes(tmp_path) == 4
-
-
-def test_load_training_num_processes_absent_returns_none(tmp_path, optimizer, scheduler):
-    # Checkpoints written before the world size was recorded must still load (back-compat).
-    save_training_state(tmp_path, 10, optimizer, scheduler)
-    assert load_training_num_processes(tmp_path) is None
-
-
-def test_save_training_state_records_batch_size(tmp_path, optimizer, scheduler):
-    save_training_state(tmp_path, 10, optimizer, scheduler, batch_size=32)
-    assert load_training_batch_size(tmp_path) == 32
-
-
-def test_load_training_batch_size_absent_returns_none(tmp_path, optimizer, scheduler):
-    # Checkpoints written before the batch size was recorded must still load (back-compat).
-    save_training_state(tmp_path, 10, optimizer, scheduler)
-    assert load_training_batch_size(tmp_path) is None
-
-
 def test_update_last_checkpoint(tmp_path):
    checkpoint = tmp_path / "0005"
    checkpoint.mkdir()