From 5ebbdf3d0573e66bce7cbb28fb43757fc04a802f Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Mon, 18 May 2026 14:51:26 +0200
Subject: [PATCH 01/17] Mention the new Lance LeRobotDataset implementation in
 the docs (#3609)

* Enhance documentation with Lance format details

Added information about Lance format and `lerobot-lancedb` package for multimodal AI datasets.

Signed-off-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
---
 docs/source/lerobot-dataset-v3.mdx | 37 ++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/docs/source/lerobot-dataset-v3.mdx b/docs/source/lerobot-dataset-v3.mdx
index 6f3e6d948..c23677d8c 100644
--- a/docs/source/lerobot-dataset-v3.mdx
+++ b/docs/source/lerobot-dataset-v3.mdx
@@ -10,6 +10,7 @@ This docs will guide you to:
 - Stream datasets without downloading using `StreamingLeRobotDataset`
 - Apply image transforms for data augmentation during training
 - Migrate existing `v2.1` datasets to `v3.0`
+- Experiment with other `LeRobotDataset` formats and implementations like Lance
 
 ## What’s new in `v3`
 
@@ -315,3 +316,39 @@ Dataset v3.0 uses incremental parquet writing with buffered metadata for efficie
 - Ensures the dataset is valid for loading
 
 Without calling `finalize()`, your parquet files will be incomplete and the dataset won't load properly.
+
+## Other formats and implementations
+
+### Lance
+
+Lance is a format suited to multimodal AI datasets, especially large-scale training that requires high-performance I/O and random access.
+
+The `lerobot-lancedb` package implements `LeRobotLanceDataset` (for JPEG images) and `LeRobotLanceVideoDataset` (for MP4 videos).
+Both classes subclass `LeRobotDataset` and can speed up data loading.
+
+`LeRobotLanceDataset` is a drop-in replacement for `LeRobotDataset`:
+
+```python
+from lerobot.datasets import LeRobotDatasetMetadata
+from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig
+from lerobot_lancedb import LeRobotLanceDataset, LeRobotLanceVideoDataset
+
+cfg = DiffusionConfig(...)
+meta = LeRobotDatasetMetadata(root=local_dataset_path)  # or use repo_id=... to load metadata from the Hub
+delta_timestamps = {...}
+
+# Use LeRobotLanceDataset for image datasets
+dataset = LeRobotLanceDataset(
+    root=local_dataset_path,                            # or use repo_id=... to stream from the Hub
+    delta_timestamps=delta_timestamps,
+    return_uint8=True,
+)
+# Or use LeRobotLanceVideoDataset for video datasets:
+dataset = LeRobotLanceVideoDataset(
+    root=local_dataset_path,                            # or use repo_id=... to stream from the Hub
+    delta_timestamps=delta_timestamps,
+    return_uint8=True,
+)
+```
+
+Join the discussion on [GitHub](https://github.com/huggingface/lerobot/issues/3608) and explore the [`lerobot-lancedb` documentation](https://lancedb.github.io/lerobot-lancedb/).

From 3c15fd8537c7e84c0465d07d4aa8c6de2912f640 Mon Sep 17 00:00:00 2001
From: Pepijn <138571049+pkooij@users.noreply.github.com>
Date: Mon, 18 May 2026 19:49:21 +0200
Subject: [PATCH 02/17] feat(robots): natively integrate Seeed Studio reBot
 B601-DM arm (#3624)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* feat(robots): natively integrate Seeed Studio reBot B601-DM arm

Add first-class LeRobot support for the Seeed Studio reBot arm, replacing
the out-of-tree `lerobot-robot-seeed-b601` / `lerobot-teleoperator-rebot-arm-102`
plugin packages.

New devices:
- robot `rebot_b601_follower` — single-arm B601-DM follower (6-DOF + gripper,
  Damiao CAN motors via `motorbridge`)
- robot `bi_rebot_b601_follower` — bimanual follower composing two single arms
- teleoperator `rebot_102_leader` — single-arm StarArm102 / reBot Arm 102 leader
  (FashionStar UART servos via `motorbridge-smart-servo`)
- teleoperator `bi_rebot_102_leader` — bimanual leader composing two single arms

The bimanual variants reuse the single-arm classes and namespace each arm's
observation/action keys with `left_` / `right_` prefixes, so a bimanual
StarArm102 leader can teleoperate a bimanual reBot B601 follower.

Optional SDK imports are guarded; a `rebot` extra installs `motorbridge` and
`motorbridge-smart-servo`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: add reBot B601-DM calibration & dual-arm teleoperation guide

Add docs/source/rebot_b601.mdx covering single-arm and bimanual
calibration and teleoperation for the reBot B601-DM follower and
reBot Arm 102 leader, with zero-position reference images from the
Seeed Studio wiki. Register the page in the docs toctree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: fix reBot B601 MDX build (move JSON example out of <Tip>)

The doc-builder parses `{...}` inside MDX component children as a
Svelte expression, so the joint_directions JSON example broke the
build. Move it into a top-level fenced code block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: apply prettier formatting to reBot B601 page

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: remove duplicate colocated reBot B601 page

docs/source/rebot_b601.mdx is the canonical, toctree-registered page;
the colocated rebot_b601.md was a redundant thinner copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: clarify 6-DOF leader fallback comment in reBot B601 follower

Explain that holding wrist_yaw at zero is what lets a 6-DOF leader
(e.g. so100_leader / so101_leader) teleoperate the 7-DOF follower.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: address Caroline's PR review on reBot B601 integration

- leader: remove _validate_config (no other lerobot device validates its
  config; a key mismatch now surfaces as a plain KeyError)
- leader: simplify _round_to_valid_range to direct modular arithmetic
  instead of a bidirectional search loop
- leader: inline the single-use _clamp helper
- follower & leader: write MotorCalibration range_min/range_max from the
  configured joint_limits / joint_ranges instead of a fixed [-90, 90]
- docs: add a "Find the USB ports" section (lerobot-find-port) and move
  the brltty/permissions tip there; link the OpenArm page for SocketCAN
  adapter configuration

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/source/_toctree.yml                      |   2 +
 docs/source/rebot_b601.mdx                    | 186 +++++++++++
 pyproject.toml                                |   6 +
 .../robots/bi_rebot_b601_follower/__init__.py |  20 ++
 .../bi_rebot_b601_follower.py                 | 150 +++++++++
 .../config_bi_rebot_b601_follower.py          |  29 ++
 .../robots/rebot_b601_follower/__init__.py    |  20 ++
 .../config_rebot_b601_follower.py             |  94 ++++++
 .../rebot_b601_follower.py                    | 289 ++++++++++++++++++
 src/lerobot/robots/utils.py                   |   8 +
 src/lerobot/scripts/lerobot_calibrate.py      |   4 +
 .../scripts/lerobot_find_joint_limits.py      |   4 +
 src/lerobot/scripts/lerobot_record.py         |   4 +
 src/lerobot/scripts/lerobot_replay.py         |   2 +
 src/lerobot/scripts/lerobot_rollout.py        |   4 +
 src/lerobot/scripts/lerobot_setup_motors.py   |   4 +
 src/lerobot/scripts/lerobot_teleoperate.py    |   4 +
 .../bi_rebot_102_leader/__init__.py           |  20 ++
 .../bi_rebot_102_leader.py                    | 113 +++++++
 .../config_bi_rebot_102_leader.py             |  29 ++
 .../rebot_102_leader/__init__.py              |  20 ++
 .../config_rebot_102_leader.py                |  83 +++++
 .../rebot_102_leader/rebot_102_leader.py      | 207 +++++++++++++
 src/lerobot/teleoperators/utils.py            |   8 +
 src/lerobot/utils/import_utils.py             |   4 +
 tests/robots/test_rebot_b601_follower.py      | 116 +++++++
 tests/teleoperators/test_rebot_102_leader.py  | 102 +++++++
 uv.lock                                       |  52 +++-
 28 files changed, 1581 insertions(+), 3 deletions(-)
 create mode 100644 docs/source/rebot_b601.mdx
 create mode 100644 src/lerobot/robots/bi_rebot_b601_follower/__init__.py
 create mode 100644 src/lerobot/robots/bi_rebot_b601_follower/bi_rebot_b601_follower.py
 create mode 100644 src/lerobot/robots/bi_rebot_b601_follower/config_bi_rebot_b601_follower.py
 create mode 100644 src/lerobot/robots/rebot_b601_follower/__init__.py
 create mode 100644 src/lerobot/robots/rebot_b601_follower/config_rebot_b601_follower.py
 create mode 100644 src/lerobot/robots/rebot_b601_follower/rebot_b601_follower.py
 create mode 100644 src/lerobot/teleoperators/bi_rebot_102_leader/__init__.py
 create mode 100644 src/lerobot/teleoperators/bi_rebot_102_leader/bi_rebot_102_leader.py
 create mode 100644 src/lerobot/teleoperators/bi_rebot_102_leader/config_bi_rebot_102_leader.py
 create mode 100644 src/lerobot/teleoperators/rebot_102_leader/__init__.py
 create mode 100644 src/lerobot/teleoperators/rebot_102_leader/config_rebot_102_leader.py
 create mode 100644 src/lerobot/teleoperators/rebot_102_leader/rebot_102_leader.py
 create mode 100644 tests/robots/test_rebot_b601_follower.py
 create mode 100644 tests/teleoperators/test_rebot_102_leader.py

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index f1dfe9aae..470319c48 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -143,6 +143,8 @@
     title: OMX
   - local: openarm
     title: OpenArm
+  - local: rebot_b601
+    title: reBot B601-DM
   title: "Robots"
 - sections:
   - local: phone_teleop
diff --git a/docs/source/rebot_b601.mdx b/docs/source/rebot_b601.mdx
new file mode 100644
index 000000000..adb751560
--- /dev/null
+++ b/docs/source/rebot_b601.mdx
@@ -0,0 +1,186 @@
+# reBot B601-DM
+
+[reBot B601-DM](https://wiki.seeedstudio.com/rebot_arm_b601_dm_lerobot/) is an open-source, low-cost robot arm from Seeed Studio for embodied-AI and imitation learning. It comes as a **follower** arm (the `B601-DM`, a 6-DOF arm plus gripper driven by Damiao CAN motors) and a **leader** arm (the `StarArm102` / `reBot Arm 102`, driven by FashionStar UART smart servos) used to teleoperate it.
+
+This page covers **calibration** and **teleoperation** for both single-arm and bimanual (dual-arm) setups.
+
+<div style="display: flex; align-items: center; gap: 10px;">
+  <img
+    src="https://files.seeedstudio.com/wiki/robotics/projects/lerobot/b601dm_zeroposition.jpg"
+    alt="reBot B601-DM follower arm at its zero position"
+    width="48%"
+  />
+  <img
+    src="https://files.seeedstudio.com/wiki/robotics/projects/lerobot/102_zeroposition.jpg"
+    alt="reBot Arm 102 leader arm at its zero position"
+    width="48%"
+  />
+</div>
+
+_Left: the B601-DM follower at its zero position. Right: the reBot Arm 102 leader at its zero position. Images courtesy of [Seeed Studio](https://wiki.seeedstudio.com/rebot_arm_b601_dm_lerobot/)._
+
+## Install LeRobot 🤗
+
+Follow our [Installation Guide](./installation), then install the reBot support:
+
+```bash
+pip install -e ".[rebot]"
+```
+
+This pulls in `motorbridge` (CAN motor control for the B601-DM follower) and `motorbridge-smart-servo` (FashionStar UART servos for the reBot Arm 102 leader).
+
+## Registered device types
+
+| Type                     | Kind                                         |
+| ------------------------ | -------------------------------------------- |
+| `rebot_b601_follower`    | single-arm B601-DM follower robot            |
+| `bi_rebot_b601_follower` | bimanual (dual-arm) follower robot           |
+| `rebot_102_leader`       | single-arm reBot Arm 102 leader teleoperator |
+| `bi_rebot_102_leader`    | bimanual (dual-arm) leader teleoperator      |
+
+The bimanual types compose two single-arm instances and namespace each arm's
+observation/action keys with a `left_` / `right_` prefix. Per-arm settings are
+passed through nested `left_arm_config.*` / `right_arm_config.*` arguments.
+
+## Find the USB ports
+
+For each device, find the USB port associated with its motor bus using:
+
+```bash
+lerobot-find-port
+```
+
+<Tip warning={true}>
+  On Linux, remove `brltty` (`sudo apt remove brltty`) so it does not hold the
+  leader's USB serial port. You may also need to grant access to the serial
+  devices: `sudo chmod 666 /dev/ttyACM* /dev/ttyUSB*`.
+</Tip>
+
+## Calibration
+
+Neither arm stores a persistent hardware calibration: every time it connects, the motors are re-zeroed against the pose the arm is physically holding. Calibration simply records that zero pose. When prompted, **manually move the arm to its zero position** (the default sit-down pose shown above, gripper fully closed) and press <kbd>ENTER</kbd>.
+
+### Follower (B601-DM)
+
+<hfoptions id="calibrate-follower">
+<hfoption id="Single arm">
+
+```bash
+lerobot-calibrate \
+    --robot.type=rebot_b601_follower \
+    --robot.port=/dev/ttyACM0 \
+    --robot.id=follower \
+    --robot.can_adapter=damiao
+```
+
+</hfoption>
+<hfoption id="Dual arm">
+
+Connect the bimanual follower; calibration runs for the left arm, then the right arm.
+
+```bash
+lerobot-calibrate \
+    --robot.type=bi_rebot_b601_follower \
+    --robot.id=bi_follower \
+    --robot.left_arm_config.port=/dev/ttyACM0 \
+    --robot.left_arm_config.can_adapter=damiao \
+    --robot.right_arm_config.port=/dev/ttyACM1 \
+    --robot.right_arm_config.can_adapter=damiao
+```
+
+Per-arm calibration files are saved with `_left` / `_right` suffixes on the id.
+
+</hfoption>
+</hfoptions>
+
+### Leader (reBot Arm 102)
+
+<hfoptions id="calibrate-leader">
+<hfoption id="Single arm">
+
+```bash
+lerobot-calibrate \
+    --teleop.type=rebot_102_leader \
+    --teleop.port=/dev/ttyUSB0 \
+    --teleop.id=leader
+```
+
+</hfoption>
+<hfoption id="Dual arm">
+
+```bash
+lerobot-calibrate \
+    --teleop.type=bi_rebot_102_leader \
+    --teleop.id=bi_leader \
+    --teleop.left_arm_config.port=/dev/ttyUSB0 \
+    --teleop.right_arm_config.port=/dev/ttyUSB1
+```
+
+</hfoption>
+</hfoptions>
+
+## Teleoperation
+
+Once both arms are calibrated, drive the follower with the leader. The follower talks to its CAN bus through a Damiao serial bridge (`can_adapter=damiao`, the default) or a SocketCAN adapter (`can_adapter=socketcan`). See the [OpenArm page](./openarm) for more details on the SocketCAN adapter configuration.
+
+<hfoptions id="teleoperate">
+<hfoption id="Single arm">
+
+```bash
+lerobot-teleoperate \
+    --robot.type=rebot_b601_follower \
+    --robot.port=/dev/ttyACM0 \
+    --robot.id=follower \
+    --robot.can_adapter=damiao \
+    --teleop.type=rebot_102_leader \
+    --teleop.port=/dev/ttyUSB0 \
+    --teleop.id=leader
+```
+
+</hfoption>
+<hfoption id="Dual arm">
+
+The bimanual leader and follower reuse the single-arm classes; each arm is
+configured through nested `left_arm_config.*` / `right_arm_config.*` arguments,
+so a bimanual reBot Arm 102 leader drives a bimanual B601-DM follower.
+
+```bash
+lerobot-teleoperate \
+    --robot.type=bi_rebot_b601_follower \
+    --robot.id=bi_follower \
+    --robot.left_arm_config.port=/dev/ttyACM0 \
+    --robot.left_arm_config.can_adapter=damiao \
+    --robot.right_arm_config.port=/dev/ttyACM1 \
+    --robot.right_arm_config.can_adapter=damiao \
+    --teleop.type=bi_rebot_102_leader \
+    --teleop.id=bi_leader \
+    --teleop.left_arm_config.port=/dev/ttyUSB0 \
+    --teleop.right_arm_config.port=/dev/ttyUSB1
+```
+
+</hfoption>
+</hfoptions>
+
+<Tip>
+  The leader and follower share the same joint names (`shoulder_pan`,
+  `shoulder_lift`, `elbow_flex`, `wrist_flex`, `wrist_yaw`, `wrist_roll`,
+  `gripper`), so leader actions map directly onto the follower.
+</Tip>
+
+If a joint moves in the wrong direction, flip its sign in the leader's `joint_directions`. The gripper entry also carries a scale factor (`-6` in the command below) that widens the leader's gripper range to match the follower's:
+
+```bash
+lerobot-teleoperate \
+    --robot.type=rebot_b601_follower \
+    --robot.port=/dev/ttyACM0 \
+    --robot.can_adapter=damiao \
+    --teleop.type=rebot_102_leader \
+    --teleop.port=/dev/ttyUSB0 \
+    --teleop.joint_directions='{"shoulder_pan":-1,"shoulder_lift":-1,"elbow_flex":1,"wrist_flex":1,"wrist_yaw":1,"wrist_roll":-1,"gripper":-6}'
+```
+
+## Recording datasets
+
+Swap `lerobot-teleoperate` for `lerobot-record` (with the same `--robot.*` / `--teleop.*` arguments, plus `--dataset.*`) to record demonstrations for training. See [Imitation Learning for Robots](./il_robots) for the full workflow.
+
+For hardware assembly and wiring, see the [Seeed Studio reBot wiki](https://wiki.seeedstudio.com/rebot_arm_b601_dm_lerobot/).
diff --git a/pyproject.toml b/pyproject.toml
index f983134ab..93953cd57 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -151,6 +151,8 @@ pyserial-dep = ["pyserial>=3.5,<4.0"]
 deepdiff-dep = ["deepdiff>=7.0.1,<9.0.0"]
 pynput-dep = ["pynput>=1.7.8,<1.9.0"]
 pyzmq-dep = ["pyzmq>=26.2.1,<28.0.0"]
+motorbridge-dep = ["motorbridge>=0.3.2,<0.4.0"]
+motorbridge-smart-servo-dep = ["motorbridge-smart-servo>=0.0.4,<0.1.0"]
 
 # Motors
 feetech = ["feetech-servo-sdk>=1.0.0,<2.0.0", "lerobot[pyserial-dep]", "lerobot[deepdiff-dep]"]
@@ -174,6 +176,9 @@ unitree_g1 = [
     "lerobot[pygame-dep]",
 ]
 reachy2 = ["reachy2_sdk>=1.0.15,<1.1.0"]
+# Seeed Studio reBot B601-DM follower (motorbridge / CAN) + StarArm102 / reBot Arm 102
+# leader (motorbridge-smart-servo / FashionStar UART servos).
+rebot = ["lerobot[motorbridge-dep]", "lerobot[motorbridge-smart-servo-dep]"]
 kinematics = ["lerobot[placo-dep]"]
 intelrealsense = [
     "pyrealsense2>=2.55.1.6486,<2.57.0 ; sys_platform != 'darwin'",
@@ -260,6 +265,7 @@ all = [
     "lerobot[lekiwi]",
     "lerobot[openarms]",
     "lerobot[reachy2]",
+    "lerobot[rebot]",
     "lerobot[kinematics]",
     "lerobot[intelrealsense]",
     "lerobot[diffusion]",
diff --git a/src/lerobot/robots/bi_rebot_b601_follower/__init__.py b/src/lerobot/robots/bi_rebot_b601_follower/__init__.py
new file mode 100644
index 000000000..8ef454f45
--- /dev/null
+++ b/src/lerobot/robots/bi_rebot_b601_follower/__init__.py
@@ -0,0 +1,20 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .bi_rebot_b601_follower import BiRebotB601Follower
+from .config_bi_rebot_b601_follower import BiRebotB601FollowerConfig
+
+__all__ = ["BiRebotB601Follower", "BiRebotB601FollowerConfig"]
diff --git a/src/lerobot/robots/bi_rebot_b601_follower/bi_rebot_b601_follower.py b/src/lerobot/robots/bi_rebot_b601_follower/bi_rebot_b601_follower.py
new file mode 100644
index 000000000..bd19f1b62
--- /dev/null
+++ b/src/lerobot/robots/bi_rebot_b601_follower/bi_rebot_b601_follower.py
@@ -0,0 +1,150 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+from functools import cached_property
+
+from lerobot.types import RobotAction, RobotObservation
+from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
+
+from ..rebot_b601_follower import RebotB601Follower, RebotB601FollowerRobotConfig
+from ..robot import Robot
+from .config_bi_rebot_b601_follower import BiRebotB601FollowerConfig
+
+logger = logging.getLogger(__name__)
+
+
+class BiRebotB601Follower(Robot):
+    """Bimanual Seeed Studio reBot B601-DM follower.
+
+    Composes two single-arm :class:`RebotB601Follower` instances. Observation and
+    action keys of each arm are namespaced with a ``left_`` / ``right_`` prefix.
+    """
+
+    config_class = BiRebotB601FollowerConfig
+    name = "bi_rebot_b601_follower"
+
+    def __init__(self, config: BiRebotB601FollowerConfig):
+        super().__init__(config)
+        self.config = config
+
+        left_arm_config = RebotB601FollowerRobotConfig(
+            id=f"{config.id}_left" if config.id else None,
+            calibration_dir=config.calibration_dir,
+            port=config.left_arm_config.port,
+            can_adapter=config.left_arm_config.can_adapter,
+            dm_serial_baud=config.left_arm_config.dm_serial_baud,
+            disable_torque_on_disconnect=config.left_arm_config.disable_torque_on_disconnect,
+            max_relative_target=config.left_arm_config.max_relative_target,
+            cameras=config.left_arm_config.cameras,
+            motor_can_ids=config.left_arm_config.motor_can_ids,
+            pos_vel_velocity=config.left_arm_config.pos_vel_velocity,
+            gripper_torque_ratio=config.left_arm_config.gripper_torque_ratio,
+            joint_limits=config.left_arm_config.joint_limits,
+        )
+
+        right_arm_config = RebotB601FollowerRobotConfig(
+            id=f"{config.id}_right" if config.id else None,
+            calibration_dir=config.calibration_dir,
+            port=config.right_arm_config.port,
+            can_adapter=config.right_arm_config.can_adapter,
+            dm_serial_baud=config.right_arm_config.dm_serial_baud,
+            disable_torque_on_disconnect=config.right_arm_config.disable_torque_on_disconnect,
+            max_relative_target=config.right_arm_config.max_relative_target,
+            cameras=config.right_arm_config.cameras,
+            motor_can_ids=config.right_arm_config.motor_can_ids,
+            pos_vel_velocity=config.right_arm_config.pos_vel_velocity,
+            gripper_torque_ratio=config.right_arm_config.gripper_torque_ratio,
+            joint_limits=config.right_arm_config.joint_limits,
+        )
+
+        self.left_arm = RebotB601Follower(left_arm_config)
+        self.right_arm = RebotB601Follower(right_arm_config)
+
+        # Only for compatibility with parts of the codebase that expect `robot.cameras`.
+        self.cameras = {**self.left_arm.cameras, **self.right_arm.cameras}
+
+    @property
+    def _motors_ft(self) -> dict[str, type]:
+        return {
+            **{f"left_{k}": v for k, v in self.left_arm._motors_ft.items()},
+            **{f"right_{k}": v for k, v in self.right_arm._motors_ft.items()},
+        }
+
+    @property
+    def _cameras_ft(self) -> dict[str, tuple]:
+        return {
+            **{f"left_{k}": v for k, v in self.left_arm._cameras_ft.items()},
+            **{f"right_{k}": v for k, v in self.right_arm._cameras_ft.items()},
+        }
+
+    @cached_property
+    def observation_features(self) -> dict[str, type | tuple]:
+        return {**self._motors_ft, **self._cameras_ft}
+
+    @cached_property
+    def action_features(self) -> dict[str, type]:
+        return self._motors_ft
+
+    @property
+    def is_connected(self) -> bool:
+        return self.left_arm.is_connected and self.right_arm.is_connected
+
+    @check_if_already_connected
+    def connect(self, calibrate: bool = True) -> None:
+        self.left_arm.connect(calibrate)
+        self.right_arm.connect(calibrate)
+
+    @property
+    def is_calibrated(self) -> bool:
+        return self.left_arm.is_calibrated and self.right_arm.is_calibrated
+
+    def calibrate(self) -> None:
+        self.left_arm.calibrate()
+        self.right_arm.calibrate()
+
+    def configure(self) -> None:
+        self.left_arm.configure()
+        self.right_arm.configure()
+
+    @check_if_not_connected
+    def get_observation(self) -> RobotObservation:
+        obs_dict = {}
+        obs_dict.update({f"left_{k}": v for k, v in self.left_arm.get_observation().items()})
+        obs_dict.update({f"right_{k}": v for k, v in self.right_arm.get_observation().items()})
+        return obs_dict
+
+    @check_if_not_connected
+    def send_action(self, action: RobotAction) -> RobotAction:
+        left_action = {
+            key.removeprefix("left_"): value for key, value in action.items() if key.startswith("left_")
+        }
+        right_action = {
+            key.removeprefix("right_"): value for key, value in action.items() if key.startswith("right_")
+        }
+
+        sent_action_left = self.left_arm.send_action(left_action)
+        sent_action_right = self.right_arm.send_action(right_action)
+
+        return {
+            **{f"left_{k}": v for k, v in sent_action_left.items()},
+            **{f"right_{k}": v for k, v in sent_action_right.items()},
+        }
+
+    @check_if_not_connected
+    def disconnect(self) -> None:
+        self.left_arm.disconnect()
+        self.right_arm.disconnect()
diff --git a/src/lerobot/robots/bi_rebot_b601_follower/config_bi_rebot_b601_follower.py b/src/lerobot/robots/bi_rebot_b601_follower/config_bi_rebot_b601_follower.py
new file mode 100644
index 000000000..079b7a355
--- /dev/null
+++ b/src/lerobot/robots/bi_rebot_b601_follower/config_bi_rebot_b601_follower.py
@@ -0,0 +1,29 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass
+
+from ..config import RobotConfig
+from ..rebot_b601_follower import RebotB601FollowerConfig
+
+
+@RobotConfig.register_subclass("bi_rebot_b601_follower")
+@dataclass
+class BiRebotB601FollowerConfig(RobotConfig):
+    """Configuration class for the bimanual reBot B601-DM follower robot."""
+
+    left_arm_config: RebotB601FollowerConfig
+    right_arm_config: RebotB601FollowerConfig
diff --git a/src/lerobot/robots/rebot_b601_follower/__init__.py b/src/lerobot/robots/rebot_b601_follower/__init__.py
new file mode 100644
index 000000000..43fcbb769
--- /dev/null
+++ b/src/lerobot/robots/rebot_b601_follower/__init__.py
@@ -0,0 +1,20 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .config_rebot_b601_follower import RebotB601FollowerConfig, RebotB601FollowerRobotConfig
+from .rebot_b601_follower import RebotB601Follower
+
+__all__ = ["RebotB601Follower", "RebotB601FollowerConfig", "RebotB601FollowerRobotConfig"]
diff --git a/src/lerobot/robots/rebot_b601_follower/config_rebot_b601_follower.py b/src/lerobot/robots/rebot_b601_follower/config_rebot_b601_follower.py
new file mode 100644
index 000000000..096548afb
--- /dev/null
+++ b/src/lerobot/robots/rebot_b601_follower/config_rebot_b601_follower.py
@@ -0,0 +1,94 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass, field
+
+from lerobot.cameras import CameraConfig
+
+from ..config import RobotConfig
+
+
+@dataclass
+class RebotB601FollowerConfig:
+    """Base configuration class for the Seeed Studio reBot B601-DM follower arm.
+
+    The B601-DM is a 6-DOF arm plus gripper driven by Damiao CAN motors. Motor
+    communication goes through the ``motorbridge`` package.
+    """
+
+    # Communication port. For ``can_adapter="damiao"`` this is the Damiao serial
+    # bridge device (e.g. "/dev/ttyACM0"); for ``can_adapter="socketcan"`` it is
+    # the CAN channel name (e.g. "can0").
+    port: str
+
+    # CAN adapter type:
+    #   "damiao"    - Damiao dedicated serial bridge (default)
+    #   "socketcan" - SocketCAN based adapters (PCAN, slcan, embedded controllers, ...)
+    can_adapter: str = "damiao"
+
+    # Baud rate for the Damiao serial bridge (only used when can_adapter="damiao").
+    dm_serial_baud: int = 921600
+
+    disable_torque_on_disconnect: bool = True
+
+    # `max_relative_target` limits the magnitude of the relative positional target
+    # vector for safety purposes (in degrees). Set to a positive scalar to apply the
+    # same value to all motors, or to a dict mapping motor names to per-motor values.
+    max_relative_target: float | dict[str, float] | None = None
+
+    # cameras
+    cameras: dict[str, CameraConfig] = field(default_factory=dict)
+
+    # Maps motor names to their (send_can_id, recv_can_id) pair.
+    motor_can_ids: dict[str, tuple[int, int]] = field(
+        default_factory=lambda: {
+            "shoulder_pan": (0x01, 0x11),
+            "shoulder_lift": (0x02, 0x12),
+            "elbow_flex": (0x03, 0x13),
+            "wrist_flex": (0x04, 0x14),
+            "wrist_yaw": (0x05, 0x15),
+            "wrist_roll": (0x06, 0x16),
+            "gripper": (0x07, 0x17),
+        }
+    )
+
+    # Target velocity for joints running in POS_VEL mode, in degrees/s. A scalar is
+    # applied to every joint; a list provides one value per joint (in motor order).
+    pos_vel_velocity: float | list[float] = field(default_factory=lambda: [150.0] * 7)
+
+    # Torque/current ratio for the gripper's FORCE_POS mode, in range [0, 1].
+    gripper_torque_ratio: float = 0.1
+
+    # Soft joint limits (degrees). Every action is clipped to these limits.
+    joint_limits: dict[str, tuple[float, float]] = field(
+        default_factory=lambda: {
+            "shoulder_pan": (-145.0, 145.0),
+            "shoulder_lift": (-170.0, 1.0),
+            "elbow_flex": (-200.0, 1.0),
+            "wrist_flex": (-80.0, 90.0),
+            "wrist_yaw": (-90.0, 90.0),
+            "wrist_roll": (-90.0, 90.0),
+            "gripper": (-270.0, 0.0),
+        }
+    )
+
+
+@RobotConfig.register_subclass("rebot_b601_follower")
+@dataclass
+class RebotB601FollowerRobotConfig(RobotConfig, RebotB601FollowerConfig):
+    """Registered configuration for the reBot B601-DM follower robot."""
+
+    pass
diff --git a/src/lerobot/robots/rebot_b601_follower/rebot_b601_follower.py b/src/lerobot/robots/rebot_b601_follower/rebot_b601_follower.py
new file mode 100644
index 000000000..ec00f4aa9
--- /dev/null
+++ b/src/lerobot/robots/rebot_b601_follower/rebot_b601_follower.py
@@ -0,0 +1,289 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+import math
+import time
+from functools import cached_property
+from typing import TYPE_CHECKING
+
+from lerobot.cameras import make_cameras_from_configs
+from lerobot.motors import MotorCalibration
+from lerobot.types import RobotAction, RobotObservation
+from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
+from lerobot.utils.import_utils import _motorbridge_available, require_package
+
+from ..robot import Robot
+from ..utils import ensure_safe_goal_position
+from .config_rebot_b601_follower import RebotB601FollowerRobotConfig
+
+if TYPE_CHECKING or _motorbridge_available:
+    from motorbridge import Controller as MotorBridgeController, Mode as MotorBridgeMode
+else:
+    MotorBridgeController = None
+    MotorBridgeMode = None
+
+logger = logging.getLogger(__name__)
+
+# Joint controlled in FORCE_POS mode; every other joint runs in POS_VEL mode.
+GRIPPER_MOTOR = "gripper"
+# Per-joint Damiao motor models for the B601-DM (passed to motorbridge).
+MOTOR_MODELS = {
+    "shoulder_pan": "4340P",
+    "shoulder_lift": "4340P",
+    "elbow_flex": "4340P",
+    "wrist_flex": "4310",
+    "wrist_yaw": "4310",
+    "wrist_roll": "4310",
+    "gripper": "4310",
+}
+_ENSURE_MODE_RETRIES = 9
+_SETTLE_SEC = 0.01
+_ZERO_SETTLE_SEC = 0.1
+
+
+class RebotB601Follower(Robot):
+    """Seeed Studio reBot B601-DM follower arm (6-DOF + gripper, Damiao CAN motors).
+
+    Motor communication is handled by the ``motorbridge`` package over a CAN bus,
+    reached either through a Damiao serial bridge or a SocketCAN adapter.
+    """
+
+    config_class = RebotB601FollowerRobotConfig
+    name = "rebot_b601_follower"
+
+    def __init__(self, config: RebotB601FollowerRobotConfig):
+        require_package("motorbridge", extra="rebot")
+        super().__init__(config)
+        self.config = config
+        self.bus: MotorBridgeController | None = None
+        self.motors: dict = {}
+        self.motor_names = list(config.motor_can_ids.keys())
+        self.cameras = make_cameras_from_configs(config.cameras)
+
+    @property
+    def _motors_ft(self) -> dict[str, type]:
+        return {f"{motor}.pos": float for motor in self.motor_names}
+
+    @property
+    def _cameras_ft(self) -> dict[str, tuple]:
+        return {
+            cam: (self.config.cameras[cam].height, self.config.cameras[cam].width, 3) for cam in self.cameras
+        }
+
+    @cached_property
+    def observation_features(self) -> dict[str, type | tuple]:
+        return {**self._motors_ft, **self._cameras_ft}
+
+    @cached_property
+    def action_features(self) -> dict[str, type]:
+        return self._motors_ft
+
+    @property
+    def is_connected(self) -> bool:
+        return self.bus is not None and all(cam.is_connected for cam in self.cameras.values())
+
+    @check_if_already_connected
+    def connect(self, calibrate: bool = True) -> None:
+        logger.info(f"Connecting {self} on {self.config.port} (adapter={self.config.can_adapter})...")
+        if self.config.can_adapter == "damiao":
+            self.bus = MotorBridgeController.from_dm_serial(
+                serial_port=self.config.port,
+                baud=self.config.dm_serial_baud,
+            )
+        elif self.config.can_adapter == "socketcan":
+            self.bus = MotorBridgeController(channel=self.config.port)
+        else:
+            raise ValueError(
+                f"Unsupported can_adapter '{self.config.can_adapter}'. Use 'damiao' or 'socketcan'."
+            )
+
+        for motor_name, (send_id, recv_id) in self.config.motor_can_ids.items():
+            self.motors[motor_name] = self.bus.add_damiao_motor(send_id, recv_id, MOTOR_MODELS[motor_name])
+
+        if not self.is_calibrated and calibrate:
+            logger.info(
+                "No calibration file found (or it is empty); running calibration."
+            )
+            self.calibrate()
+
+        for cam in self.cameras.values():
+            cam.connect()
+
+        self.configure()
+        logger.info(f"{self} connected.")
+
+    @property
+    def is_calibrated(self) -> bool:
+        return bool(self.calibration)
+
+    def calibrate(self) -> None:
+        if self.calibration:
+            user_input = input(
+                f"Press ENTER to use provided calibration file associated with the id {self.id}, "
+                "or type 'c' and press ENTER to run calibration: "
+            )
+            if user_input.strip().lower() != "c":
+                logger.info(f"Using calibration file associated with the id {self.id}")
+                return
+
+        logger.info(f"\nRunning calibration of {self}")
+        self.bus.disable_all()
+        print(
+            "\nCalibration: set zero position.\n"
+            "Manually move the reBot B601 to its ZERO POSITION and close the gripper.\n"
+            "See the B601 manual for the zero pose (the default sit-down position).\n"
+        )
+        input("Press ENTER when ready...")
+
+        for motor in self.motors.values():
+            motor.set_zero_position()
+            time.sleep(_ZERO_SETTLE_SEC)
+        logger.info("Arm zero position set.")
+
+        self.calibration = {}
+        for motor_name, (send_id, _recv_id) in self.config.motor_can_ids.items():
+            range_min, range_max = self.config.joint_limits[motor_name]
+            self.calibration[motor_name] = MotorCalibration(
+                id=send_id,
+                drive_mode=0,
+                homing_offset=0,
+                range_min=int(range_min),
+                range_max=int(range_max),
+            )
+
+        self._save_calibration()
+        print(f"Calibration saved to {self.calibration_fpath}")
+
+    def configure(self) -> None:
+        self.bus.enable_all()
+        for motor_name, motor in self.motors.items():
+            target_mode = (
+                MotorBridgeMode.FORCE_POS if motor_name == GRIPPER_MOTOR else MotorBridgeMode.POS_VEL
+            )
+            for attempt in range(_ENSURE_MODE_RETRIES + 1):
+                try:
+                    motor.ensure_mode(target_mode)
+                    break
+                except Exception:
+                    if attempt == _ENSURE_MODE_RETRIES:
+                        raise
+                    time.sleep(_SETTLE_SEC)
+            logger.debug(f"{motor_name} mode set to {target_mode}")
+
+    @check_if_not_connected
+    def disable_torque(self) -> None:
+        """Disable motor torque so the arm can be moved by hand (read-only debugging)."""
+        self.bus.disable_all()
+        logger.info(f"{self} torque disabled.")
+
+    def _present_pos(self) -> dict[str, float]:
+        """Read present joint positions in degrees."""
+        for motor in self.motors.values():
+            motor.request_feedback()
+        try:
+            self.bus.poll_feedback_once()
+        except Exception:
+            logger.warning("CAN bus poll feedback failed.")
+
+        present_pos = {}
+        for motor_name, motor in self.motors.items():
+            state = motor.get_state()
+            present_pos[motor_name] = math.degrees(state.pos) if state is not None else 0.0
+        return present_pos
+
+    @check_if_not_connected
+    def get_observation(self) -> RobotObservation:
+        start = time.perf_counter()
+        obs_dict = {f"{motor}.pos": pos for motor, pos in self._present_pos().items()}
+        dt_ms = (time.perf_counter() - start) * 1e3
+        logger.debug(f"{self} read state: {dt_ms:.1f}ms")
+
+        for cam_key, cam in self.cameras.items():
+            start = time.perf_counter()
+            obs_dict[cam_key] = cam.read_latest()
+            dt_ms = (time.perf_counter() - start) * 1e3
+            logger.debug(f"{self} read {cam_key}: {dt_ms:.1f}ms")
+
+        return obs_dict
+
+    @check_if_not_connected
+    def send_action(self, action: RobotAction) -> RobotAction:
+        """Command the arm to a target joint configuration.
+
+        Positions are expressed in degrees. Targets are clipped to the soft joint
+        limits and, if `max_relative_target` is set, capped relative to the present
+        position, so the action actually sent is always returned.
+        """
+        goal_pos = {key.removesuffix(".pos"): val for key, val in action.items() if key.endswith(".pos")}
+
+        # Clip against soft joint limits.
+        for motor_name in list(goal_pos):
+            if motor_name in self.config.joint_limits:
+                min_limit, max_limit = self.config.joint_limits[motor_name]
+                clipped = max(min_limit, min(max_limit, goal_pos[motor_name]))
+                if clipped != goal_pos[motor_name]:
+                    logger.debug(f"Clipped {motor_name} from {goal_pos[motor_name]:.2f} to {clipped:.2f}")
+                goal_pos[motor_name] = clipped
+
+        # Tolerate leaders that have no wrist_yaw joint by holding it at zero.
+        # This is intentional: it lets a 6-motor leader such as the SO-100 / SO-101
+        # (so100_leader / so101_leader) teleoperate this 7-motor follower. The missing
+        # wrist_yaw command is simply treated as 0.0 instead of raising.
+        if "wrist_yaw" not in goal_pos:
+            goal_pos["wrist_yaw"] = 0.0
+
+        # Cap relative target when too far from the present position.
+        if self.config.max_relative_target is not None:
+            present_pos = self._present_pos()
+            goal_present_pos = {key: (g, present_pos.get(key, g)) for key, g in goal_pos.items()}
+            goal_pos = ensure_safe_goal_position(goal_present_pos, self.config.max_relative_target)
+
+        for motor_name, position_deg in goal_pos.items():
+            motor = self.motors.get(motor_name)
+            if motor is None:
+                continue
+            idx = self.motor_names.index(motor_name)
+            vel_deg_s = (
+                self.config.pos_vel_velocity[idx]
+                if isinstance(self.config.pos_vel_velocity, list)
+                else self.config.pos_vel_velocity
+            )
+            pos_rad = math.radians(position_deg)
+            vel_rad = math.radians(vel_deg_s)
+            if motor_name == GRIPPER_MOTOR:
+                motor.send_force_pos(pos_rad, vel_rad, self.config.gripper_torque_ratio)
+            else:
+                motor.send_pos_vel(pos_rad, vel_rad)
+
+        return {f"{motor}.pos": val for motor, val in goal_pos.items()}
+
+    @check_if_not_connected
+    def disconnect(self) -> None:
+        for motor in self.motors.values():
+            if self.config.disable_torque_on_disconnect:
+                motor.disable()
+            motor.clear_error()
+            motor.close()
+
+        self.bus.close()
+        self.bus = None
+        self.motors = {}
+
+        for cam in self.cameras.values():
+            cam.disconnect()
+
+        logger.info(f"{self} disconnected.")
diff --git a/src/lerobot/robots/utils.py b/src/lerobot/robots/utils.py
index 92da597f1..f897a560e 100644
--- a/src/lerobot/robots/utils.py
+++ b/src/lerobot/robots/utils.py
@@ -68,6 +68,14 @@ def make_robot_from_config(config: RobotConfig) -> Robot:
         from .bi_openarm_follower import BiOpenArmFollower
 
         return BiOpenArmFollower(config)
+    elif config.type == "rebot_b601_follower":
+        from .rebot_b601_follower import RebotB601Follower
+
+        return RebotB601Follower(config)
+    elif config.type == "bi_rebot_b601_follower":
+        from .bi_rebot_b601_follower import BiRebotB601Follower
+
+        return BiRebotB601Follower(config)
     elif config.type == "mock_robot":
         from tests.mocks.mock_robot import MockRobot
 
diff --git a/src/lerobot/scripts/lerobot_calibrate.py b/src/lerobot/scripts/lerobot_calibrate.py
index e68d7438b..e43736954 100644
--- a/src/lerobot/scripts/lerobot_calibrate.py
+++ b/src/lerobot/scripts/lerobot_calibrate.py
@@ -39,6 +39,7 @@ from lerobot.robots import (  # noqa: F401
     Robot,
     RobotConfig,
     bi_openarm_follower,
+    bi_rebot_b601_follower,
     bi_so_follower,
     hope_jr,
     koch_follower,
@@ -46,12 +47,14 @@ from lerobot.robots import (  # noqa: F401
     make_robot_from_config,
     omx_follower,
     openarm_follower,
+    rebot_b601_follower,
     so_follower,
 )
 from lerobot.teleoperators import (  # noqa: F401
     Teleoperator,
     TeleoperatorConfig,
     bi_openarm_leader,
+    bi_rebot_102_leader,
     bi_so_leader,
     homunculus,
     koch_leader,
@@ -59,6 +62,7 @@ from lerobot.teleoperators import (  # noqa: F401
     omx_leader,
     openarm_leader,
     openarm_mini,
+    rebot_102_leader,
     so_leader,
     unitree_g1,
 )
diff --git a/src/lerobot/scripts/lerobot_find_joint_limits.py b/src/lerobot/scripts/lerobot_find_joint_limits.py
index c4f867631..5b9166a2e 100644
--- a/src/lerobot/scripts/lerobot_find_joint_limits.py
+++ b/src/lerobot/scripts/lerobot_find_joint_limits.py
@@ -45,16 +45,19 @@ from lerobot.model import RobotKinematics
 from lerobot.robots import (  # noqa: F401
     RobotConfig,
     bi_openarm_follower,
+    bi_rebot_b601_follower,
     bi_so_follower,
     koch_follower,
     make_robot_from_config,
     omx_follower,
     openarm_follower,
+    rebot_b601_follower,
     so_follower,
 )
 from lerobot.teleoperators import (  # noqa: F401
     TeleoperatorConfig,
     bi_openarm_leader,
+    bi_rebot_102_leader,
     bi_so_leader,
     gamepad,
     koch_leader,
@@ -62,6 +65,7 @@ from lerobot.teleoperators import (  # noqa: F401
     omx_leader,
     openarm_leader,
     openarm_mini,
+    rebot_102_leader,
     so_leader,
 )
 from lerobot.utils.robot_utils import precise_sleep
diff --git a/src/lerobot/scripts/lerobot_record.py b/src/lerobot/scripts/lerobot_record.py
index c8419cb14..c411ebf9e 100644
--- a/src/lerobot/scripts/lerobot_record.py
+++ b/src/lerobot/scripts/lerobot_record.py
@@ -120,6 +120,7 @@ from lerobot.robots import (  # noqa: F401
     Robot,
     RobotConfig,
     bi_openarm_follower,
+    bi_rebot_b601_follower,
     bi_so_follower,
     earthrover_mini_plus,
     hope_jr,
@@ -128,6 +129,7 @@ from lerobot.robots import (  # noqa: F401
     omx_follower,
     openarm_follower,
     reachy2,
+    rebot_b601_follower,
     so_follower,
     unitree_g1 as unitree_g1_robot,
 )
@@ -135,6 +137,7 @@ from lerobot.teleoperators import (  # noqa: F401
     Teleoperator,
     TeleoperatorConfig,
     bi_openarm_leader,
+    bi_rebot_102_leader,
     bi_so_leader,
     homunculus,
     koch_leader,
@@ -143,6 +146,7 @@ from lerobot.teleoperators import (  # noqa: F401
     openarm_leader,
     openarm_mini,
     reachy2_teleoperator,
+    rebot_102_leader,
     so_leader,
     unitree_g1,
 )
diff --git a/src/lerobot/scripts/lerobot_replay.py b/src/lerobot/scripts/lerobot_replay.py
index 41d2926cc..1851f7c2b 100644
--- a/src/lerobot/scripts/lerobot_replay.py
+++ b/src/lerobot/scripts/lerobot_replay.py
@@ -56,6 +56,7 @@ from lerobot.robots import (  # noqa: F401
     Robot,
     RobotConfig,
     bi_openarm_follower,
+    bi_rebot_b601_follower,
     bi_so_follower,
     earthrover_mini_plus,
     hope_jr,
@@ -64,6 +65,7 @@ from lerobot.robots import (  # noqa: F401
     omx_follower,
     openarm_follower,
     reachy2,
+    rebot_b601_follower,
     so_follower,
     unitree_g1,
 )
diff --git a/src/lerobot/scripts/lerobot_rollout.py b/src/lerobot/scripts/lerobot_rollout.py
index 7015e707c..3378b6de4 100644
--- a/src/lerobot/scripts/lerobot_rollout.py
+++ b/src/lerobot/scripts/lerobot_rollout.py
@@ -144,6 +144,7 @@ from lerobot.robots import (  # noqa: F401
     Robot,
     RobotConfig,
     bi_openarm_follower,
+    bi_rebot_b601_follower,
     bi_so_follower,
     earthrover_mini_plus,
     hope_jr,
@@ -151,6 +152,7 @@ from lerobot.robots import (  # noqa: F401
     omx_follower,
     openarm_follower,
     reachy2,
+    rebot_b601_follower,
     so_follower,
     unitree_g1 as unitree_g1_robot,
 )
@@ -159,6 +161,7 @@ from lerobot.teleoperators import (  # noqa: F401
     Teleoperator,
     TeleoperatorConfig,
     bi_openarm_leader,
+    bi_rebot_102_leader,
     bi_so_leader,
     homunculus,
     koch_leader,
@@ -166,6 +169,7 @@ from lerobot.teleoperators import (  # noqa: F401
     openarm_leader,
     openarm_mini,
     reachy2_teleoperator,
+    rebot_102_leader,
     so_leader,
     unitree_g1,
 )
diff --git a/src/lerobot/scripts/lerobot_setup_motors.py b/src/lerobot/scripts/lerobot_setup_motors.py
index 2c962a6e2..69ebcf5fa 100644
--- a/src/lerobot/scripts/lerobot_setup_motors.py
+++ b/src/lerobot/scripts/lerobot_setup_motors.py
@@ -30,20 +30,24 @@ import draccus
 
 from lerobot.robots import (  # noqa: F401
     RobotConfig,
+    bi_rebot_b601_follower,
     bi_so_follower,
     koch_follower,
     lekiwi,
     make_robot_from_config,
     omx_follower,
+    rebot_b601_follower,
     so_follower,
 )
 from lerobot.teleoperators import (  # noqa: F401
     TeleoperatorConfig,
+    bi_rebot_102_leader,
     bi_so_leader,
     koch_leader,
     make_teleoperator_from_config,
     omx_leader,
     openarm_mini,
+    rebot_102_leader,
     so_leader,
 )
 
diff --git a/src/lerobot/scripts/lerobot_teleoperate.py b/src/lerobot/scripts/lerobot_teleoperate.py
index 76157595e..2ff02bda0 100644
--- a/src/lerobot/scripts/lerobot_teleoperate.py
+++ b/src/lerobot/scripts/lerobot_teleoperate.py
@@ -72,6 +72,7 @@ from lerobot.robots import (  # noqa: F401
     Robot,
     RobotConfig,
     bi_openarm_follower,
+    bi_rebot_b601_follower,
     bi_so_follower,
     earthrover_mini_plus,
     hope_jr,
@@ -80,6 +81,7 @@ from lerobot.robots import (  # noqa: F401
     omx_follower,
     openarm_follower,
     reachy2,
+    rebot_b601_follower,
     so_follower,
     unitree_g1 as unitree_g1_robot,
 )
@@ -87,6 +89,7 @@ from lerobot.teleoperators import (  # noqa: F401
     Teleoperator,
     TeleoperatorConfig,
     bi_openarm_leader,
+    bi_rebot_102_leader,
     bi_so_leader,
     gamepad,
     homunculus,
@@ -97,6 +100,7 @@ from lerobot.teleoperators import (  # noqa: F401
     openarm_leader,
     openarm_mini,
     reachy2_teleoperator,
+    rebot_102_leader,
     so_leader,
     unitree_g1,
 )
diff --git a/src/lerobot/teleoperators/bi_rebot_102_leader/__init__.py b/src/lerobot/teleoperators/bi_rebot_102_leader/__init__.py
new file mode 100644
index 000000000..c15cf76d8
--- /dev/null
+++ b/src/lerobot/teleoperators/bi_rebot_102_leader/__init__.py
@@ -0,0 +1,20 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .bi_rebot_102_leader import BiRebotArm102Leader
+from .config_bi_rebot_102_leader import BiRebotArm102LeaderConfig
+
+__all__ = ["BiRebotArm102Leader", "BiRebotArm102LeaderConfig"]
diff --git a/src/lerobot/teleoperators/bi_rebot_102_leader/bi_rebot_102_leader.py b/src/lerobot/teleoperators/bi_rebot_102_leader/bi_rebot_102_leader.py
new file mode 100644
index 000000000..a4e5fd8c6
--- /dev/null
+++ b/src/lerobot/teleoperators/bi_rebot_102_leader/bi_rebot_102_leader.py
@@ -0,0 +1,113 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+from functools import cached_property
+
+from lerobot.types import RobotAction
+from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
+
+from ..rebot_102_leader import RebotArm102Leader, RebotArm102LeaderTeleopConfig
+from ..teleoperator import Teleoperator
+from .config_bi_rebot_102_leader import BiRebotArm102LeaderConfig
+
+logger = logging.getLogger(__name__)
+
+
+class BiRebotArm102Leader(Teleoperator):
+    """Bimanual Seeed Studio StarArm102 / reBot Arm 102 leader.
+
+    Composes two single-arm :class:`RebotArm102Leader` instances. Action keys of
+    each arm are namespaced with a ``left_`` / ``right_`` prefix, so a bimanual
+    leader can teleoperate a bimanual reBot B601 follower.
+    """
+
+    config_class = BiRebotArm102LeaderConfig
+    name = "bi_rebot_102_leader"
+
+    def __init__(self, config: BiRebotArm102LeaderConfig):
+        super().__init__(config)
+        self.config = config
+
+        left_arm_config = RebotArm102LeaderTeleopConfig(
+            id=f"{config.id}_left" if config.id else None,
+            calibration_dir=config.calibration_dir,
+            port=config.left_arm_config.port,
+            baudrate=config.left_arm_config.baudrate,
+            joint_ids=config.left_arm_config.joint_ids,
+            joint_directions=config.left_arm_config.joint_directions,
+            joint_ranges=config.left_arm_config.joint_ranges,
+        )
+
+        right_arm_config = RebotArm102LeaderTeleopConfig(
+            id=f"{config.id}_right" if config.id else None,
+            calibration_dir=config.calibration_dir,
+            port=config.right_arm_config.port,
+            baudrate=config.right_arm_config.baudrate,
+            joint_ids=config.right_arm_config.joint_ids,
+            joint_directions=config.right_arm_config.joint_directions,
+            joint_ranges=config.right_arm_config.joint_ranges,
+        )
+
+        self.left_arm = RebotArm102Leader(left_arm_config)
+        self.right_arm = RebotArm102Leader(right_arm_config)
+
+    @cached_property
+    def action_features(self) -> dict[str, type]:
+        return {
+            **{f"left_{k}": v for k, v in self.left_arm.action_features.items()},
+            **{f"right_{k}": v for k, v in self.right_arm.action_features.items()},
+        }
+
+    @cached_property
+    def feedback_features(self) -> dict[str, type]:
+        return {}
+
+    @property
+    def is_connected(self) -> bool:
+        return self.left_arm.is_connected and self.right_arm.is_connected
+
+    @check_if_already_connected
+    def connect(self, calibrate: bool = True) -> None:
+        self.left_arm.connect(calibrate)
+        self.right_arm.connect(calibrate)
+
+    @property
+    def is_calibrated(self) -> bool:
+        return self.left_arm.is_calibrated and self.right_arm.is_calibrated
+
+    def calibrate(self) -> None:
+        self.left_arm.calibrate()
+        self.right_arm.calibrate()
+
+    def configure(self) -> None:
+        self.left_arm.configure()
+        self.right_arm.configure()
+
+    @check_if_not_connected
+    def get_action(self) -> RobotAction:
+        action_dict = {}
+        action_dict.update({f"left_{k}": v for k, v in self.left_arm.get_action().items()})
+        action_dict.update({f"right_{k}": v for k, v in self.right_arm.get_action().items()})
+        return action_dict
+
+    def send_feedback(self, feedback: dict[str, float]) -> None:
+        raise NotImplementedError("Feedback is not implemented for the reBot Arm 102 leader.")
+
+    @check_if_not_connected
+    def disconnect(self) -> None:
+        self.left_arm.disconnect()
+        self.right_arm.disconnect()
diff --git a/src/lerobot/teleoperators/bi_rebot_102_leader/config_bi_rebot_102_leader.py b/src/lerobot/teleoperators/bi_rebot_102_leader/config_bi_rebot_102_leader.py
new file mode 100644
index 000000000..265ae26c1
--- /dev/null
+++ b/src/lerobot/teleoperators/bi_rebot_102_leader/config_bi_rebot_102_leader.py
@@ -0,0 +1,29 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass
+
+from ..config import TeleoperatorConfig
+from ..rebot_102_leader import RebotArm102LeaderConfig
+
+
+@TeleoperatorConfig.register_subclass("bi_rebot_102_leader")
+@dataclass
+class BiRebotArm102LeaderConfig(TeleoperatorConfig):
+    """Configuration class for the bimanual reBot Arm 102 leader teleoperator."""
+
+    left_arm_config: RebotArm102LeaderConfig
+    right_arm_config: RebotArm102LeaderConfig
diff --git a/src/lerobot/teleoperators/rebot_102_leader/__init__.py b/src/lerobot/teleoperators/rebot_102_leader/__init__.py
new file mode 100644
index 000000000..a13524707
--- /dev/null
+++ b/src/lerobot/teleoperators/rebot_102_leader/__init__.py
@@ -0,0 +1,20 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .config_rebot_102_leader import RebotArm102LeaderConfig, RebotArm102LeaderTeleopConfig
+from .rebot_102_leader import RebotArm102Leader
+
+__all__ = ["RebotArm102Leader", "RebotArm102LeaderConfig", "RebotArm102LeaderTeleopConfig"]
diff --git a/src/lerobot/teleoperators/rebot_102_leader/config_rebot_102_leader.py b/src/lerobot/teleoperators/rebot_102_leader/config_rebot_102_leader.py
new file mode 100644
index 000000000..d1beea2ed
--- /dev/null
+++ b/src/lerobot/teleoperators/rebot_102_leader/config_rebot_102_leader.py
@@ -0,0 +1,83 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dataclasses import dataclass, field
+
+from ..config import TeleoperatorConfig
+
+
+@dataclass
+class RebotArm102LeaderConfig:
+    """Base configuration class for the Seeed Studio StarArm102 / reBot Arm 102 leader.
+
+    The reBot Arm 102 is a 7-joint (incl. gripper) leader arm driven by FashionStar
+    UART smart servos. Servo communication goes through ``motorbridge-smart-servo``.
+    """
+
+    # USB-to-UART device the leader arm is connected to (e.g. "/dev/ttyUSB0").
+    port: str
+
+    baudrate: int = 1_000_000
+
+    # Servo id of each joint on the UART bus.
+    joint_ids: dict[str, int] = field(
+        default_factory=lambda: {
+            "shoulder_pan": 0,
+            "shoulder_lift": 1,
+            "elbow_flex": 2,
+            "wrist_flex": 3,
+            "wrist_yaw": 4,
+            "wrist_roll": 5,
+            "gripper": 6,
+        }
+    )
+
+    # Per-joint sign applied to raw servo angles so the leader matches the follower
+    # convention. The gripper additionally carries a scale (e.g. -6) to widen its
+    # range to the reBot B601 follower's gripper travel.
+    joint_directions: dict[str, int] = field(
+        default_factory=lambda: {
+            "shoulder_pan": -1,
+            "shoulder_lift": -1,
+            "elbow_flex": 1,
+            "wrist_flex": 1,
+            "wrist_yaw": 1,
+            "wrist_roll": -1,
+            "gripper": -6,
+        }
+    )
+
+    # Per-joint [min, max] output range in degrees, keyed like the reBot B601 follower's
+    # joint limits. Only shoulder_pan differs (±150 here vs ±145); the follower clips it.
+    joint_ranges: dict[str, list[int]] = field(
+        default_factory=lambda: {
+            "shoulder_pan": [-150, 150],
+            "shoulder_lift": [-170, 1],
+            "elbow_flex": [-200, 1],
+            "wrist_flex": [-80, 90],
+            "wrist_yaw": [-90, 90],
+            "wrist_roll": [-90, 90],
+            "gripper": [-270, 0],
+        }
+    )
+
+
+@TeleoperatorConfig.register_subclass("rebot_102_leader")
+@dataclass
+class RebotArm102LeaderTeleopConfig(TeleoperatorConfig, RebotArm102LeaderConfig):
+    """Registered configuration for the reBot Arm 102 leader teleoperator."""
+
+    pass
diff --git a/src/lerobot/teleoperators/rebot_102_leader/rebot_102_leader.py b/src/lerobot/teleoperators/rebot_102_leader/rebot_102_leader.py
new file mode 100644
index 000000000..f9f10ed69
--- /dev/null
+++ b/src/lerobot/teleoperators/rebot_102_leader/rebot_102_leader.py
@@ -0,0 +1,207 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+import time
+from typing import TYPE_CHECKING
+
+from lerobot.motors import MotorCalibration
+from lerobot.types import RobotAction
+from lerobot.utils.decorators import check_if_already_connected, check_if_not_connected
+from lerobot.utils.import_utils import _motorbridge_smart_servo_available, require_package
+
+from ..teleoperator import Teleoperator
+from .config_rebot_102_leader import RebotArm102LeaderTeleopConfig
+
+if TYPE_CHECKING or _motorbridge_smart_servo_available:
+    from motorbridge_smart_servo import FashionStarServo, ServoMonitor
+else:
+    FashionStarServo = None
+    ServoMonitor = None
+
+logger = logging.getLogger(__name__)
+
+_SETTLE_SEC = 0.01
+
+
+class RebotArm102Leader(Teleoperator):
+    """Seeed Studio StarArm102 / reBot Arm 102 leader arm.
+
+    A 7-joint (incl. gripper) leader built on FashionStar UART smart servos. Servo
+    communication is handled by the ``motorbridge-smart-servo`` package; this class
+    only reads joint angles, so it produces actions but accepts no feedback.
+    """
+
+    config_class = RebotArm102LeaderTeleopConfig
+    name = "rebot_102_leader"
+
+    def __init__(self, config: RebotArm102LeaderTeleopConfig):
+        require_package("motorbridge-smart-servo", extra="rebot", import_name="motorbridge_smart_servo")
+        super().__init__(config)
+        self.config = config
+        self.bus: FashionStarServo | None = None
+        self.motor_names = list(config.joint_ids.keys())
+        self._last_raw_positions: dict[str, float] = {}
+
+    @property
+    def action_features(self) -> dict[str, type]:
+        return {f"{motor}.pos": float for motor in self.motor_names}
+
+    @property
+    def feedback_features(self) -> dict[str, type]:
+        return {}
+
+    @property
+    def is_connected(self) -> bool:
+        return self.bus is not None
+
+    @check_if_already_connected
+    def connect(self, calibrate: bool = True) -> None:
+        logger.info(f"Connecting {self} on {self.config.port}...")
+        bus = FashionStarServo(self.config.port, baudrate=self.config.baudrate)
+        try:
+            for motor_name, motor_id in self.config.joint_ids.items():
+                if not bus.ping(motor_id):
+                    raise RuntimeError(f"Servo not found for {motor_name} (id={motor_id}).")
+                self._last_raw_positions[motor_name] = 0.0
+            self.bus = bus
+
+            if not self.is_calibrated and calibrate:
+                logger.info(
+                    "No calibration found, or the calibration file does not cover every configured joint."
+                )
+                self.calibrate()
+
+            self.configure()
+        except Exception:
+            bus.close()
+            self.bus = None
+            raise
+
+        logger.info(f"{self} connected.")
+
+    @property
+    def is_calibrated(self) -> bool:
+        return bool(self.calibration) and set(self.calibration) == set(self.motor_names)
+
+    def calibrate(self) -> None:
+        if self.calibration:
+            user_input = input(
+                f"Press ENTER to use provided calibration file associated with the id {self.id}, "
+                "or type 'c' and press ENTER to run calibration: "
+            )
+            if user_input.strip().lower() != "c":
+                logger.info(f"Using calibration file associated with the id {self.id}")
+                return
+
+        logger.info(f"\nRunning calibration of {self}")
+        input(
+            "\nCalibration: set zero position.\n"
+            "Manually move the reBot Arm 102 to its zero pose and close the gripper.\n"
+            "Press ENTER when ready..."
+        )
+
+        self.calibration = {}
+        for motor_name, motor_id in self.config.joint_ids.items():
+            self.bus.unlock(motor_id)
+            time.sleep(_SETTLE_SEC)
+            self.bus.set_origin_point(motor_id)
+            range_min, range_max = self.config.joint_ranges[motor_name]
+            self.calibration[motor_name] = MotorCalibration(
+                id=motor_id,
+                drive_mode=0,
+                homing_offset=0,
+                range_min=int(range_min),
+                range_max=int(range_max),
+            )
+
+        self._save_calibration()
+        logger.info(f"Calibration saved to {self.calibration_fpath}")
+
+    def configure(self) -> None:
+        for motor_id in self.config.joint_ids.values():
+            self.bus.unlock(motor_id)
+            time.sleep(_SETTLE_SEC)
+        # Reset the multi-turn counter of each servo individually.
+        for motor_id in self.config.joint_ids.values():
+            self.bus.reset_multi_turn(motor_id)
+
+    def _read_raw_positions(self) -> dict[str, float]:
+        result: dict[int, ServoMonitor | None] = self.bus.sync_monitor(list(self.config.joint_ids.values()))
+        id_to_name = {v: k for k, v in self.config.joint_ids.items()}
+        raw_positions: dict[str, float] = {}
+        for motor_id, monitor in result.items():
+            motor_name = id_to_name[motor_id]
+            if monitor is None:
+                raise RuntimeError(f"Servo {motor_name} (id={motor_id}) has never responded.")
+            raw_positions[motor_name] = monitor.angle_deg
+        return raw_positions
+
+    @staticmethod
+    def _round_to_valid_range(value: float, min_value: float, max_value: float) -> tuple[float, int]:
+        """Unwrap a multi-turn angle into the ±180° window centred on (min+max)/2.
+
+        The servo may report an angle that has accumulated extra full rotations
+        (value = true_angle + N*360). Subtract the nearest whole number of turns
+        to bring it back into [center-180, center+180]. Returns the unwrapped
+        angle and the absolute (non-negative) number of turns removed.
+        """
+        center = (min_value + max_value) / 2.0
+        turns = round((value - center) / 360.0)
+        return value - turns * 360.0, abs(turns)
+
+    @check_if_not_connected
+    def get_action(self) -> RobotAction:
+        start = time.perf_counter()
+        try:
+            raw_positions = self._read_raw_positions()
+            self._last_raw_positions = raw_positions
+        except Exception as e:
+            logger.error(f"Failed to read raw positions: {e}")
+            logger.warning("[EMERGENCY STOP] Hold the follower arm and cut off the main power to the arms.")
+            logger.warning(
+                "[EMERGENCY STOP] Break the teleoperation session and check the leader USB connection or power."
+            )
+            raw_positions = self._last_raw_positions
+
+        action_dict: dict[str, float] = {}
+        for motor_name in self.motor_names:
+            range_min, range_max = self.config.joint_ranges[motor_name]
+            direction = self.config.joint_directions[motor_name]
+            sign = 1.0 if direction >= 0 else -1.0
+            unwrapped, k = self._round_to_valid_range(
+                raw_positions[motor_name], range_min * sign, range_max * sign
+            )
+            position = unwrapped * direction
+            if k > 0:
+                logger.debug(
+                    f"Servo {motor_name} (id={self.config.joint_ids[motor_name]}) wrapped {k} * 360°. "
+                    f"Unwrapped pos: {unwrapped:.1f}° (raw: {raw_positions[motor_name]:.1f}°)"
+                )
+            action_dict[f"{motor_name}.pos"] = max(float(range_min), min(float(range_max), position))
+
+        dt_ms = (time.perf_counter() - start) * 1e3
+        logger.debug(f"{self} read action: {dt_ms:.1f}ms")
+        return action_dict
+
+    def send_feedback(self, feedback: dict[str, float]) -> None:
+        raise NotImplementedError("Feedback is not implemented for the reBot Arm 102 leader.")
+
+    @check_if_not_connected
+    def disconnect(self) -> None:
+        self.bus.close()
+        self.bus = None
+        logger.info(f"{self} disconnected.")
diff --git a/src/lerobot/teleoperators/utils.py b/src/lerobot/teleoperators/utils.py
index db685f396..5a6d4ecde 100644
--- a/src/lerobot/teleoperators/utils.py
+++ b/src/lerobot/teleoperators/utils.py
@@ -99,6 +99,14 @@ def make_teleoperator_from_config(config: TeleoperatorConfig) -> "Teleoperator":
         from .openarm_mini import OpenArmMini
 
         return OpenArmMini(config)
+    elif config.type == "rebot_102_leader":
+        from .rebot_102_leader import RebotArm102Leader
+
+        return RebotArm102Leader(config)
+    elif config.type == "bi_rebot_102_leader":
+        from .bi_rebot_102_leader import BiRebotArm102Leader
+
+        return BiRebotArm102Leader(config)
     else:
         try:
             return cast("Teleoperator", make_device_from_device_class(config))
diff --git a/src/lerobot/utils/import_utils.py b/src/lerobot/utils/import_utils.py
index ef03367eb..5dbce2c5b 100644
--- a/src/lerobot/utils/import_utils.py
+++ b/src/lerobot/utils/import_utils.py
@@ -114,6 +114,10 @@ _dynamixel_sdk_available = is_package_available("dynamixel-sdk", import_name="dy
 _feetech_sdk_available = is_package_available("feetech-servo-sdk", import_name="scservo_sdk")
 _reachy2_sdk_available = is_package_available("reachy2_sdk")
 _can_available = is_package_available("python-can", "can")
+_motorbridge_available = is_package_available("motorbridge")
+_motorbridge_smart_servo_available = is_package_available(
+    "motorbridge-smart-servo", import_name="motorbridge_smart_servo"
+)
 _unitree_sdk_available = is_package_available("unitree-sdk2py", "unitree_sdk2py")
 _pyrealsense2_available = is_package_available("pyrealsense2") or is_package_available(
     "pyrealsense2-macosx", import_name="pyrealsense2"
diff --git a/tests/robots/test_rebot_b601_follower.py b/tests/robots/test_rebot_b601_follower.py
new file mode 100644
index 000000000..553675be0
--- /dev/null
+++ b/tests/robots/test_rebot_b601_follower.py
@@ -0,0 +1,116 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import math
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from lerobot.robots.bi_rebot_b601_follower import BiRebotB601Follower, BiRebotB601FollowerConfig
+from lerobot.robots.rebot_b601_follower import (
+    RebotB601Follower,
+    RebotB601FollowerConfig,
+    RebotB601FollowerRobotConfig,
+)
+
+_MODULE = "lerobot.robots.rebot_b601_follower.rebot_b601_follower"
+
+
+def _make_motor_mock(position_rad: float = 0.0) -> MagicMock:
+    motor = MagicMock(name="MotorMock")
+    state = MagicMock()
+    state.pos = position_rad
+    motor.get_state.return_value = state
+    return motor
+
+
+def _make_bus_mock() -> MagicMock:
+    bus = MagicMock(name="MotorBridgeControllerMock")
+    # add_damiao_motor returns a fresh motor mock; position encodes the call order.
+    bus._motor_count = 0
+
+    def _add_motor(_send_id, _recv_id, _model):
+        bus._motor_count += 1
+        return _make_motor_mock(position_rad=math.radians(bus._motor_count))
+
+    bus.add_damiao_motor.side_effect = _add_motor
+    return bus
+
+
+@pytest.fixture
+def follower():
+    bus_mock = _make_bus_mock()
+    with (
+        patch(f"{_MODULE}.require_package", lambda *a, **kw: None),
+        patch(f"{_MODULE}.MotorBridgeController") as controller_cls,
+        patch(f"{_MODULE}.MotorBridgeMode", MagicMock()),
+    ):
+        controller_cls.from_dm_serial.return_value = bus_mock
+        cfg = RebotB601FollowerRobotConfig(port="/dev/null")
+        robot = RebotB601Follower(cfg)
+        robot.connect(calibrate=False)
+        yield robot
+        if robot.is_connected:
+            robot.disconnect()
+
+
+def test_features_match_joints():
+    with patch(f"{_MODULE}.require_package", lambda *a, **kw: None):
+        robot = RebotB601Follower(RebotB601FollowerRobotConfig(port="/dev/null"))
+    expected = {f"{m}.pos" for m in robot.motor_names}
+    assert set(robot.action_features) == expected
+    assert set(robot.observation_features) == expected
+    assert "gripper.pos" in expected
+
+
+def test_connect_disconnect(follower):
+    assert follower.is_connected
+    follower.disconnect()
+    assert not follower.is_connected
+
+
+def test_get_observation_converts_to_degrees(follower):
+    obs = follower.get_observation()
+    assert set(obs) == {f"{m}.pos" for m in follower.motor_names}
+    # The bus mock places motor number idx (1-indexed creation order) at idx degrees, stored in radians.
+    for idx, motor in enumerate(follower.motor_names, 1):
+        assert obs[f"{motor}.pos"] == pytest.approx(math.degrees(math.radians(idx)))
+
+
+def test_send_action_clips_to_joint_limits(follower):
+    # shoulder_pan limit is (-145, 145); request beyond the upper bound.
+    returned = follower.send_action({"shoulder_pan.pos": 999.0})
+    assert returned["shoulder_pan.pos"] == 145.0
+    follower.motors["shoulder_pan"].send_pos_vel.assert_called_once()
+
+
+def test_send_action_routes_gripper_to_force_pos(follower):
+    follower.send_action({"gripper.pos": -10.0})
+    follower.motors["gripper"].send_force_pos.assert_called_once()
+    follower.motors["gripper"].send_pos_vel.assert_not_called()
+
+
+def test_bimanual_prefixes_features():
+    with patch(f"{_MODULE}.require_package", lambda *a, **kw: None):
+        cfg = BiRebotB601FollowerConfig(
+            left_arm_config=RebotB601FollowerConfig(port="/dev/null0"),
+            right_arm_config=RebotB601FollowerConfig(port="/dev/null1"),
+        )
+        robot = BiRebotB601Follower(cfg)
+    assert any(k.startswith("left_") for k in robot.action_features)
+    assert any(k.startswith("right_") for k in robot.action_features)
+    assert "left_gripper.pos" in robot.action_features
+    assert "right_gripper.pos" in robot.action_features
diff --git a/tests/teleoperators/test_rebot_102_leader.py b/tests/teleoperators/test_rebot_102_leader.py
new file mode 100644
index 000000000..bea10e131
--- /dev/null
+++ b/tests/teleoperators/test_rebot_102_leader.py
@@ -0,0 +1,102 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from lerobot.teleoperators.bi_rebot_102_leader import BiRebotArm102Leader, BiRebotArm102LeaderConfig
+from lerobot.teleoperators.rebot_102_leader import (
+    RebotArm102Leader,
+    RebotArm102LeaderConfig,
+    RebotArm102LeaderTeleopConfig,
+)
+
+_MODULE = "lerobot.teleoperators.rebot_102_leader.rebot_102_leader"
+
+
+def _make_bus_mock(joint_ids: dict[str, int]) -> MagicMock:
+    bus = MagicMock(name="FashionStarServoMock")
+    bus.ping.return_value = True
+
+    def _sync_monitor(ids):
+        # Report each servo at 5 degrees raw.
+        monitors = {}
+        for servo_id in ids:
+            monitor = MagicMock()
+            monitor.angle_deg = 5.0
+            monitors[servo_id] = monitor
+        return monitors
+
+    bus.sync_monitor.side_effect = _sync_monitor
+    return bus
+
+
+@pytest.fixture
+def leader():
+    cfg = RebotArm102LeaderTeleopConfig(port="/dev/null")
+    bus_mock = _make_bus_mock(cfg.joint_ids)
+    with (
+        patch(f"{_MODULE}.require_package", lambda *a, **kw: None),
+        patch(f"{_MODULE}.FashionStarServo", return_value=bus_mock),
+    ):
+        teleop = RebotArm102Leader(cfg)
+        teleop.connect(calibrate=False)
+        yield teleop
+        if teleop.is_connected:
+            teleop.disconnect()
+
+
+def test_action_features_match_joints():
+    with patch(f"{_MODULE}.require_package", lambda *a, **kw: None):
+        teleop = RebotArm102Leader(RebotArm102LeaderTeleopConfig(port="/dev/null"))
+    assert set(teleop.action_features) == {f"{m}.pos" for m in teleop.motor_names}
+    assert teleop.feedback_features == {}
+
+
+def test_connect_disconnect(leader):
+    assert leader.is_connected
+    leader.disconnect()
+    assert not leader.is_connected
+
+
+def test_get_action_applies_direction_and_clamp(leader):
+    action = leader.get_action()
+    assert set(action) == {f"{m}.pos" for m in leader.motor_names}
+    # shoulder_pan has direction -1, so a +5deg raw reading flips to -5deg.
+    assert action["shoulder_pan.pos"] == pytest.approx(-5.0)
+    # Every joint stays within its configured range.
+    for motor, value in action.items():
+        lo, hi = leader.config.joint_ranges[motor.removesuffix(".pos")]
+        assert lo <= value <= hi
+
+
+def test_send_feedback_not_implemented(leader):
+    with pytest.raises(NotImplementedError):
+        leader.send_feedback({})
+
+
+def test_bimanual_prefixes_features():
+    with patch(f"{_MODULE}.require_package", lambda *a, **kw: None):
+        cfg = BiRebotArm102LeaderConfig(
+            left_arm_config=RebotArm102LeaderConfig(port="/dev/null0"),
+            right_arm_config=RebotArm102LeaderConfig(port="/dev/null1"),
+        )
+        teleop = BiRebotArm102Leader(cfg)
+    assert any(k.startswith("left_") for k in teleop.action_features)
+    assert any(k.startswith("right_") for k in teleop.action_features)
+    assert "left_gripper.pos" in teleop.action_features
+    assert "right_gripper.pos" in teleop.action_features
diff --git a/uv.lock b/uv.lock
index 408a9a351..692029986 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1,5 +1,5 @@
 version = 1
-revision = 2
+revision = 3
 requires-python = ">=3.12"
 resolution-markers = [
     "(python_full_version >= '3.15' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'x86_64' and sys_platform == 'linux')",
@@ -1142,7 +1142,7 @@ name = "decord"
 version = "0.6.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
-    { name = "numpy", marker = "(platform_machine != 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or (sys_platform != 'darwin' and sys_platform != 'linux')" },
+    { name = "numpy", marker = "(platform_machine != 'arm64' and platform_machine != 's390x' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or (platform_machine != 's390x' and sys_platform != 'darwin' and sys_platform != 'linux')" },
 ]
 wheels = [
     { url = "https://files.pythonhosted.org/packages/11/79/936af42edf90a7bd4e41a6cac89c913d4b47fa48a26b042d5129a9242ee3/decord-0.6.0-py3-none-manylinux2010_x86_64.whl", hash = "sha256:51997f20be8958e23b7c4061ba45d0efcd86bffd5fe81c695d0befee0d442976", size = 13602299, upload-time = "2021-06-14T21:30:55.486Z" },
@@ -2710,6 +2710,8 @@ all = [
     { name = "matplotlib" },
     { name = "metaworld" },
     { name = "mock-serial", marker = "sys_platform != 'win32'" },
+    { name = "motorbridge" },
+    { name = "motorbridge-smart-servo" },
     { name = "mypy" },
     { name = "num2words" },
     { name = "pandas" },
@@ -2913,6 +2915,12 @@ metaworld = [
     { name = "scipy" },
     { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" },
 ]
+motorbridge-dep = [
+    { name = "motorbridge" },
+]
+motorbridge-smart-servo-dep = [
+    { name = "motorbridge-smart-servo" },
+]
 multi-task-dit = [
     { name = "diffusers" },
     { name = "transformers" },
@@ -2972,6 +2980,10 @@ qwen-vl-utils-dep = [
 reachy2 = [
     { name = "reachy2-sdk" },
 ]
+rebot = [
+    { name = "motorbridge" },
+    { name = "motorbridge-smart-servo" },
+]
 robstride = [
     { name = "python-can" },
 ]
@@ -3116,6 +3128,8 @@ requires-dist = [
     { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'sarm'" },
     { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'unitree-g1'" },
     { name = "lerobot", extras = ["metaworld"], marker = "extra == 'all'" },
+    { name = "lerobot", extras = ["motorbridge-dep"], marker = "extra == 'rebot'" },
+    { name = "lerobot", extras = ["motorbridge-smart-servo-dep"], marker = "extra == 'rebot'" },
     { name = "lerobot", extras = ["multi-task-dit"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["notebook"], marker = "extra == 'dev'" },
     { name = "lerobot", extras = ["openarms"], marker = "extra == 'all'" },
@@ -3142,6 +3156,7 @@ requires-dist = [
     { name = "lerobot", extras = ["qwen-vl-utils-dep"], marker = "extra == 'sarm'" },
     { name = "lerobot", extras = ["qwen-vl-utils-dep"], marker = "extra == 'wallx'" },
     { name = "lerobot", extras = ["reachy2"], marker = "extra == 'all'" },
+    { name = "lerobot", extras = ["rebot"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["robstride"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["sarm"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'aloha'" },
@@ -3174,6 +3189,8 @@ requires-dist = [
     { name = "meshcat", marker = "extra == 'unitree-g1'", specifier = ">=0.3.0,<0.4.0" },
     { name = "metaworld", marker = "extra == 'metaworld'", specifier = "==3.0.0" },
     { name = "mock-serial", marker = "sys_platform != 'win32' and extra == 'test'", specifier = ">=0.0.1,<0.1.0" },
+    { name = "motorbridge", marker = "extra == 'motorbridge-dep'", specifier = ">=0.3.2,<0.4.0" },
+    { name = "motorbridge-smart-servo", marker = "extra == 'motorbridge-smart-servo-dep'", specifier = ">=0.0.4,<0.1.0" },
     { name = "mypy", marker = "extra == 'dev'", specifier = ">=1.19.1" },
     { name = "ninja", marker = "extra == 'groot'", specifier = ">=1.11.1,<2.0.0" },
     { name = "num2words", marker = "extra == 'smolvla'", specifier = ">=0.5.14,<0.6.0" },
@@ -3227,7 +3244,7 @@ requires-dist = [
     { name = "transformers", marker = "extra == 'transformers-dep'", specifier = ">=5.4.0,<5.6.0" },
     { name = "wandb", marker = "extra == 'training'", specifier = ">=0.24.0,<0.25.0" },
 ]
-provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "smolvla", "multi-task-dit", "groot", "sarm", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
+provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "smolvla", "multi-task-dit", "groot", "sarm", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
 
 [[package]]
 name = "librt"
@@ -3653,6 +3670,35 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/98/c2/8c1e6bf77cf62a10203a107179e34e0965fc5369386e0b7034a247ed054d/mock_serial-0.0.1-py3-none-any.whl", hash = "sha256:b6b8cc10c302354bf3ca270a3d4d6bf199c4bbe41478c65046db8f30ea967675", size = 6080, upload-time = "2021-11-23T09:34:51.108Z" },
 ]
 
+[[package]]
+name = "motorbridge"
+version = "0.3.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/58/f2/b824ac4d611c71020dccdb72fc50606e543c77c68455ea824b26d9a6de03/motorbridge-0.3.2.tar.gz", hash = "sha256:5cf85dd22c46c7f3c5e6981e90b1034af2deb1bc4e7d74c13074d1d4a7b75ceb", size = 30158, upload-time = "2026-05-18T07:13:17.239Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/2c/1a/7d367039a8325c0e2796c14a1503dfc563e7b244c815b26e079114244b4b/motorbridge-0.3.2-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:8ad158928e93fafd2a7814eaffe8e6ecbec4686f64c2df85f80d7979dfc82532", size = 1108065, upload-time = "2026-05-18T07:13:04.669Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/d6/fafa2b8a3635a6fe7f6e8129e140a68d30f4d6438350a86e51b8198b7834/motorbridge-0.3.2-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2adde5f26ea4e37d05da6b41b03b637efa6c80db4676bc6dbdb91ac6e811e54a", size = 1184657, upload-time = "2026-05-18T07:13:06.081Z" },
+    { url = "https://files.pythonhosted.org/packages/d8/30/aca01e81ec523d37b98a1ce6e41688d31827625eb15ecf0cf0485d91d62c/motorbridge-0.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a03b6dc0be80db7b47d3f190f8c6f4fc43b0b4089235283f53763153a6d4e58c", size = 1201394, upload-time = "2026-05-18T07:13:07.476Z" },
+    { url = "https://files.pythonhosted.org/packages/70/eb/97b2f93682a1ce67bad50e9b598af889be4a3156ebcec129ebb41fa44e5b/motorbridge-0.3.2-cp312-cp312-win_amd64.whl", hash = "sha256:b0657d47aa94f8535d0663538be4a86c46e314303fba513122d17612b584c6e6", size = 839087, upload-time = "2026-05-18T07:13:08.664Z" },
+    { url = "https://files.pythonhosted.org/packages/6e/b0/03246c25ae67c2b33bd19b5d11bae668bb8baa7d9cbd75b035a8bef61d62/motorbridge-0.3.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:f305a69c7c3c91dca19c43084beb4cd30a93fd85ff35c712cc3fb0ae33a5c7d3", size = 1108065, upload-time = "2026-05-18T07:13:10.032Z" },
+    { url = "https://files.pythonhosted.org/packages/a9/40/b82d86fbfcc6b18946567f15a7d76d1c673d43bc0c8d268b668506811981/motorbridge-0.3.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:778fdde2b12df20184fb8c8f4c7665919d969bd582589a267c7956d4c57336ad", size = 1184657, upload-time = "2026-05-18T07:13:11.812Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/3e/90e41d798814db89605d9a021e0c182608aec3d40eef2be211427e2bb863/motorbridge-0.3.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:eac3a2d27ca387e8d537ec148bea0c28b9517ff4fb9ea0b12f6e78c1e9a7faa4", size = 1201393, upload-time = "2026-05-18T07:13:13.396Z" },
+    { url = "https://files.pythonhosted.org/packages/34/75/3c9ba7514fd0ec330c1fe0b4d76dedfd221abc1b750fe063b6e3f9a88075/motorbridge-0.3.2-cp313-cp313-win_amd64.whl", hash = "sha256:d7d1eb76ae29e8673a320fd1a86b944fb0869129fd4114f0983e43cd48f67372", size = 839087, upload-time = "2026-05-18T07:13:14.555Z" },
+    { url = "https://files.pythonhosted.org/packages/87/33/6787dd22914291a640c2821f175abc7cbb9a1e0fe6c1143f92d7ac362903/motorbridge-0.3.2-cp314-cp314-win_amd64.whl", hash = "sha256:c5f05e36c6607d2145f38fb6f1f11090bb01dbd1012e8251b0d2ae4d60fa4f50", size = 870167, upload-time = "2026-05-18T07:13:15.898Z" },
+]
+
+[[package]]
+name = "motorbridge-smart-servo"
+version = "0.0.4"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/e6/56/45af87189dc49abbe46157b792b7c71f502a5f819f04e7485de0cfa52d9b/motorbridge_smart_servo-0.0.4.tar.gz", hash = "sha256:fb65f3f6e765e6b1915071c255caaf112fad3796fa1761aeee0132d15b8a0989", size = 20415, upload-time = "2026-05-08T09:24:57.563Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e9/ee/bec4b3acf55cd18e7db83a6d951caccf699533dbd038c1f0b5f2d16d5208/motorbridge_smart_servo-0.0.4-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:8bc1f034fa9f96e23229a834db6e7cfe1368dba7b9a2a6f6dbd316448c4390dc", size = 304384, upload-time = "2026-05-08T09:24:52.619Z" },
+    { url = "https://files.pythonhosted.org/packages/3f/d2/71c87063b826433553ce8869b99df3e4f191b107710dd5c905e637512b10/motorbridge_smart_servo-0.0.4-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:348cef6a647e5c7f9cc8e8ce1f3c806af4522e1087172bac2f8a1a0daa3592b6", size = 345668, upload-time = "2026-05-08T09:24:53.735Z" },
+    { url = "https://files.pythonhosted.org/packages/9b/6b/e65e7227a510236c6334cf054c501d3de2cbd463f4c594e42c6e965d5143/motorbridge_smart_servo-0.0.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:8c1982643c496c9f425fa9238f9a92ba601d77f4f2279df68c6868e7b997cbe1", size = 348123, upload-time = "2026-05-08T09:24:55.191Z" },
+    { url = "https://files.pythonhosted.org/packages/2d/fa/539ea123a5660c22c5e5cdad62d7bc5e931c816a0ffd402ae6e4623ab45b/motorbridge_smart_servo-0.0.4-cp39-abi3-win_amd64.whl", hash = "sha256:ea3baa9ba25bcec5541f3d86d73a3406ba2fcffe5dbf900c22e058638fc31ab0", size = 194130, upload-time = "2026-05-08T09:24:56.369Z" },
+]
+
 [[package]]
 name = "mpmath"
 version = "1.3.0"
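
Note on the leader's angle handling in `rebot_102_leader.py` (earlier in this patch): the unwrap, direction flip and clamp done in `get_action` can be checked in isolation. The sketch below is a standalone restatement, not the shipped code. The names `unwrap_to_range` and `joint_to_action` are hypothetical, the `shoulder_pan` direction of -1 is taken from the test comment, and the `wrist_flex` direction of +1 is an assumption made only for illustration.

```python
def unwrap_to_range(value: float, min_value: float, max_value: float) -> tuple[float, int]:
    """Remove whole 360-degree turns so the angle lands within 180 degrees of the range centre."""
    center = (min_value + max_value) / 2.0
    turns = round((value - center) / 360.0)
    return value - turns * 360.0, abs(turns)


def joint_to_action(raw_deg: float, range_min: float, range_max: float, direction: float) -> float:
    """Unwrap a raw servo angle, apply the joint direction, then clamp to the joint range."""
    sign = 1.0 if direction >= 0 else -1.0
    # The window is expressed in the raw (pre-direction) frame, hence the sign flip.
    unwrapped, _ = unwrap_to_range(raw_deg, range_min * sign, range_max * sign)
    position = unwrapped * direction
    return max(float(range_min), min(float(range_max), position))


# shoulder_pan: range [-150, 150], direction -1 (per the test comment).
print(unwrap_to_range(365.0, -150.0, 150.0))        # one full turn removed
print(joint_to_action(5.0, -150.0, 150.0, -1.0))    # +5 deg raw flips to -5 deg
# wrist_flex: range [-80, 90]; direction +1 is assumed here.
print(joint_to_action(200.0, -80.0, 90.0, 1.0))     # unwraps to -160 deg, then clamps
```

The unwrap step relies on Python's built-in `round`, which rounds exact halves to the nearest even integer, so a reading that sits exactly 180° from the window centre can resolve to either edge.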

From ca8c60a0ed0778fc6af7bc511f0b63e28a3a4612 Mon Sep 17 00:00:00 2001
From: von Neumann 101 <2961978672@qq.com>
Date: Tue, 19 May 2026 20:06:41 +0800
Subject: [PATCH 03/17] Set OpenCV fourcc after size and fps (#3620)

* Set OpenCV fourcc after size and fps

* Set OpenCV fourcc last on Windows

* Add comment explaining DSHOW fourcc ordering
---
 src/lerobot/cameras/opencv/camera_opencv.py | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/lerobot/cameras/opencv/camera_opencv.py b/src/lerobot/cameras/opencv/camera_opencv.py
index f3289ddc7..3e92eaf06 100644
--- a/src/lerobot/cameras/opencv/camera_opencv.py
+++ b/src/lerobot/cameras/opencv/camera_opencv.py
@@ -199,12 +199,13 @@ class OpenCVCamera(Camera):
             DeviceNotConnectedError: If the camera is not connected.
         """
 
-        # Set FOURCC first (if specified) as it can affect available FPS/resolution options
-        if self.config.fourcc is not None:
-            self._validate_fourcc()
         if self.videocapture is None:
             raise DeviceNotConnectedError(f"{self} videocapture is not initialized")
 
+        set_fourcc_after_size_and_fps = platform.system() == "Windows"
+        if self.config.fourcc is not None and not set_fourcc_after_size_and_fps:
+            self._validate_fourcc()
+
         default_width = int(round(self.videocapture.get(cv2.CAP_PROP_FRAME_WIDTH)))
         default_height = int(round(self.videocapture.get(cv2.CAP_PROP_FRAME_HEIGHT)))
 
@@ -222,6 +223,11 @@ class OpenCVCamera(Camera):
         else:
             self._validate_fps()
 
+        if self.config.fourcc is not None and set_fourcc_after_size_and_fps:
+            # On Windows with DSHOW, changing the resolution can silently override the FOURCC setting.
+            # Set FOURCC last to make sure the requested pixel format is actually enforced.
+            self._validate_fourcc()
+
     def _validate_fps(self) -> None:
         """Validates and sets the camera's frames per second (FPS)."""
 

From 7ab4936b1bca9e349250f6a7bb4e24fcd9667cca Mon Sep 17 00:00:00 2001
From: Pepijn <138571049+pkooij@users.noreply.github.com>
Date: Tue, 19 May 2026 14:46:11 +0200
Subject: [PATCH 04/17] Add extensive language support (#3467)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Add extensive language support

* Address review: split persistent/event schemas, drop event timestamps

- recipe.py: derive _VALID_ROLES/_VALID_STREAMS from MessageRole/MessageStream Literals
- dataset_metadata.py: keep CODEBASE_VERSION at v3.0
- language.py: remove RESERVED_STYLES; split arrow/feature schemas into
  persistent (with timestamp) and event (without timestamp); add docstrings
- language_render.py: events use frame-row timestamp implicitly; no
  per-event timestamp filtering or sorting
- converters.py: drop unused subtask_key passthrough
- add docstrings to new public APIs (recipe, render_messages_processor, collate)
- update tests for split schemas; revert uv.lock

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add docstrings to all new helpers; revert uv.lock

Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(language): add motion (persistent) and trace (event-only) styles

Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(language): per-camera tagging on view-dependent styles

Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.

- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
  to both Arrow struct types and the HF datasets feature mappings;
  introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
  `is_view_dependent_style` and `validate_camera_field` helpers (camera
  required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
  resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
  `_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
  VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
  Without a `camera` filter, multi-row matches keep raising the existing
  ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
  `ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
  the matching image block), keeping the original 0.20 budget and
  documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
  `is_view_dependent_style`, `validate_camera_field` (both required and
  forbidden directions), per-camera `emitted_at` filtering, and the
  ambiguity error when two cameras emit `(vqa, assistant)` at the same
  timestamp without a `camera=` filter. RenderMessagesStep + dataset
  passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
  the per-camera resolver pattern, and the canonical recipe convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(language): drop motion from VIEW_DEPENDENT_STYLES

Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.

The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.
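
For reference, the rule after this change (camera required iff the style is
view-dependent) can be sketched as below. This is an illustrative stand-in,
not the code in `language.py`: the helper name `check_camera_field` and the
error messages are assumptions.

```python
from __future__ import annotations

# Hypothetical stand-in for the rule described above; not the code from language.py.
VIEW_DEPENDENT_STYLES = frozenset({"vqa", "trace"})


def check_camera_field(style: str | None, camera: str | None) -> None:
    """Raise ValueError unless camera is set exactly when style is view-dependent."""
    view_dependent = style in VIEW_DEPENDENT_STYLES
    if view_dependent and camera is None:
        raise ValueError(f"style={style!r} is view-dependent and requires a camera key")
    if not view_dependent and camera is not None:
        raise ValueError(f"style={style!r} is camera-agnostic; camera must be None, got {camera!r}")


check_camera_field("vqa", "observation.images.top")  # accepted: view-dependent with a camera
check_camera_field("motion", None)  # accepted: robot-frame style, no camera
```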

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(language): task_aug style + automatic ${task} rephrasing rotation

Adds task-prompt diversity (Xiao 2022 / CAST) without touching
``meta/tasks.parquet`` or forcing recipes to opt in. The plan reserved
``task_aug`` as a future style; this lands it now.

- ``language.py``: add ``task_aug`` to ``CORE_STYLES`` and
  ``PERSISTENT_STYLES``. ``column_for_style("task_aug")`` returns
  ``language_persistent`` so PR 2 writers route it correctly.

- ``language_render.py``: ``_resolve_task`` now consults the persistent
  slice for rows of ``style="task_aug", role="user"``. When any exist
  it picks one deterministically by ``sample_idx`` (blake2b-keyed, not
  Python's randomized hash) so an epoch sees every rephrasing of every
  episode while the same sample still resolves identically across
  reruns. Falls back to the canonical ``meta/tasks.parquet`` task when
  no rephrasings are present, so existing datasets and unannotated runs
  keep their behaviour. Explicit ``task=`` overrides still win.

- Tests: rephrasing coverage across samples, determinism on repeat
  ``sample_idx``, fallback when persistent has no ``task_aug`` rows,
  and explicit override priority.

Recipes get this for free: any ``${task}`` placeholder rotates through
the available rephrasings. Recipes that want the literal canonical task
can override the binding.
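
A minimal sketch of the deterministic pick, assuming the index is hashed with
unkeyed `hashlib.blake2b` and reduced modulo the number of rephrasings. The
helper name `pick_rephrasing` is hypothetical and the actual `_resolve_task`
may derive the digest differently.

```python
import hashlib


def pick_rephrasing(rephrasings: list[str], sample_idx: int) -> str:
    """Pick one rephrasing from sample_idx, stable across processes and reruns."""
    if not rephrasings:
        raise ValueError("no rephrasings to choose from")
    # blake2b is stable across interpreter runs, unlike Python's salted hash().
    digest = hashlib.blake2b(str(sample_idx).encode("utf-8"), digest_size=8).digest()
    return rephrasings[int.from_bytes(digest, "big") % len(rephrasings)]


options = ["pick up the cube", "grab the cube", "lift the cube"]
choice = pick_rephrasing(options, sample_idx=42)
```

The same `sample_idx` always maps to the same string, while different indices
spread across the available rephrasings.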

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(language): tool catalog in meta/info.json + LeRobotDatasetMetadata.tools

Stores OpenAI-style function schemas at ``meta/info.json["tools"]`` so
datasets can declare which tools are available (today: just ``say``;
tomorrow: per-dataset extensions). The ``DEFAULT_TOOLS`` constant
fills in for unannotated datasets so chat-template consumers don't
have to special-case anything.

Three pieces:

- ``language.py``: ``SAY_TOOL_SCHEMA`` and ``DEFAULT_TOOLS``
  constants. Single source of truth — PR 2's writer and PR 3's
  runtime tool registry will both import from here instead of
  duplicating the dict.
- ``dataset_metadata.py``: ``LeRobotDatasetMetadata.tools`` property
  reads ``info.json["tools"]`` and falls back to ``DEFAULT_TOOLS``.
  Returns deep-copied dicts so callers can mutate the result safely.
- ``docs/source/tools.mdx``: spec page covering the catalog, per-row
  invocations, and the three-step "how to add a new tool" workflow
  (declare schema, implement, register). Linked from the docs
  toctree under the Datasets section.

This lays the groundwork for PR 2's pipeline writing the catalog out
during annotation, and PR 3's ``src/lerobot/tools/`` package shipping
runnable implementations (one file per tool — first up:
``say.py`` wrapping Kyutai's pocket-tts).
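
The fallback-and-copy behaviour can be sketched as follows. Assumptions: the
function takes `info.json` as a plain dict, `resolve_tools` is a hypothetical
name, and the catalog entry is a placeholder rather than the real
``SAY_TOOL_SCHEMA``.

```python
import copy

# Placeholder catalog: the real SAY_TOOL_SCHEMA fields are not reproduced here.
DEFAULT_TOOLS = [{"type": "function", "function": {"name": "say"}}]


def resolve_tools(info: dict) -> list[dict]:
    """Return the declared tool catalog, or a copy of the defaults when none is declared."""
    tools = info.get("tools")
    # Deep-copy so callers can mutate the result without touching shared state.
    return copy.deepcopy(tools if tools is not None else DEFAULT_TOOLS)


tools = resolve_tools({})  # unannotated dataset: falls back to the defaults
tools[0]["function"]["name"] = "changed"  # does not affect DEFAULT_TOOLS
```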

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Apply ruff and prettier formatting after merge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(language): unify resolver dispatch and prune redundant test scaffolding

* Drop the unused `events` kwarg from `active_at`/`nth_prev`/`nth_next`;
  only `emitted_at` actually consults events. The dispatcher in
  `_resolve_spec` now passes events conditionally.
* Replace the dual `_persistent_sort_key`/`_event_sort_key` pair with a
  single `_row_sort_key` and drop the `sort_key` parameter from
  `_select_one`. Event rows lack `timestamp` (it is implicit in the
  frame) and now default to `0.0` for sort purposes — the
  `(style, role)` tiebreaker is unchanged.
* Inline `_select_latest` into `active_at` (its only caller).
* Collapse `emitted_at`'s dual-branch into one `_select_one` call.
* Tighten `_validate_persistent_resolver` to a single
  `column_for_style(style) != LANGUAGE_PERSISTENT` check.
* Parameterize `test_per_camera_blend_renders_both_views` over the two
  cameras and factor the sub-recipe builder into `_vqa_subrecipe` so
  the test no longer hand-rolls two near-identical recipe blocks.

Net -98 LOC; behavior, public resolver names, and test expectations
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(language): always raise on ambiguous resolver matches

`_select_one` previously skipped its ambiguity check whenever any of
`role`/`tool_name`/`camera` was set, on the assumption that the caller
had already pinned down a unique row. That left a real ambiguity hole
for VQA: with two cameras emitting `(vqa, assistant)` at the same
frame, `emitted_at(..., role="assistant")` silently picked the first
sorted row instead of telling the recipe to add `camera=...`. The
existing `test_emitted_at_raises_on_ambiguous_per_camera_vqa` test
already encoded the desired behavior.

Tighten the check: any time `len(rows) > 1` we now raise with the
selectors echoed back, so users see exactly which fields they passed
and that more is needed to disambiguate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: fix CI — collapse short ValueError to one line, refresh uv.lock

* `ruff format` on CI (newer version) wants the short `camera=None`
  ValueError on a single line.
* `uv.lock` was stale relative to `pyproject.toml`'s `datasets>=4.7.0`
  pin (and picked up upstream `s390x` marker fixes for cuda packages).
  CI runs `uv sync --locked` which rejected the divergence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(language): keep base install green — drop processor re-export, gate dataset-extra tests

`lerobot.processor` re-exported `RenderMessagesStep` at the package
level, so importing anything from `lerobot.processor` pulled in
`lerobot.datasets.language` → `lerobot.datasets/__init__.py` →
`require_package("datasets")`, which fails in the Tier 1 base install
that intentionally omits the `[dataset]` extra. The chain bricked
collection for unrelated suites (`tests/policies/pi0_pi05/...`,
`tests/envs/...`, etc.).

* Stop re-exporting `RenderMessagesStep` from `lerobot.processor`. The
  only consumer (the test) already imports from the submodule.
  Document the deliberate omission in the module docstring.
* Add `pytest.importorskip("datasets", ...)` (and `pandas` where
  needed) at the top of the four PR-added tests that exercise the
  language stack:
  - tests/datasets/test_language.py
  - tests/datasets/test_language_render.py
  - tests/processor/test_render_messages_processor.py
  - tests/utils/test_collate.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(language): address review — tools accessor, motion docs, conditional collate

* **`meta.tools` actually reads `info.json["tools"]`.** `DatasetInfo`
  had no `tools` field, so `from_dict` silently dropped the key (it
  warned about unknown fields then discarded them) and the property
  always returned `DEFAULT_TOOLS`. Added `tools: list[dict] | None`
  to the dataclass; `to_dict()` drops it when unset so existing
  datasets keep a clean `info.json`. Fixed the accessor to read
  `self.info.tools` (the previous `.get(...)` would have raised
  AttributeError on the dataclass anyway). Added regression tests:
  fallback when absent, round-trip from disk, and round-trip
  through `DatasetInfo.from_dict` / `to_dict`.

* **`motion` is not view-dependent — fix the docs.** The mdx claimed
  rows of style `motion` must carry `camera`, but `VIEW_DEPENDENT_STYLES
  = {"vqa", "trace"}` and the validator agrees: motion primitives are
  joint/Cartesian-frame, not pixel-space. Updated both call-out
  paragraphs in `language_and_recipes.mdx`.

* **Conditional `collate_fn` swap.** Added `meta.has_language_columns`
  and gate the `lerobot_collate_fn` swap in `lerobot_train.py` on it,
  so non-language datasets keep PyTorch's `default_collate`. Also
  added a pass-through test in `test_collate.py` that asserts on a
  plain tensor batch the custom collate matches `default_collate`
  key-for-key, plus a test for the `None`-sample drop path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: dedupe regex, centralize column names, harden collate, more tests

* **#2 — dedupe `_PLACEHOLDER_RE`.** The same regex was compiled in
  `recipe.py` and `language_render.py`. Promote to module-level
  `PLACEHOLDER_RE` in `recipe.py` (its primary owner — declares
  template syntax) and import from `language_render.py`.
* **#3 — centralize language column names.** `io_utils.py` had
  hardcoded `{"language_persistent", "language_events"}` literals at
  two sites. Replace with `LANGUAGE_COLUMNS` import so a future column
  rename can't silently desync.
* **#4 — defensive collate preserved-keys.** `lerobot_collate_fn`
  silently filtered language fields from samples that didn't have
  them, which would hand downstream consumers a preserved list
  shorter than the tensor batch. Now: if any sample carries a key,
  every sample in the batch must carry it; otherwise raise a
  `ValueError` so the upstream rendering bug surfaces at the boundary.
* **#5 — `_scalar` rejects non-singleton lists.** Previously a zero-
  or multi-element list fell through and triggered confusing
  `float([])` errors downstream. Now raises `ValueError` with the
  actual length.
* **#6 — refactor `_extract_complementary_data`.** Replace 11 lines
  of `key = {... if ... else {}}` plus an 11-line splat dict with a
  single `_COMPLEMENTARY_KEYS` tuple iterated once.
* **#7 — document `EXTENDED_STYLES`.** Was an empty `set()` with no
  comment. Add a docstring explaining it's an intentional extension
  point: downstream modules append project-local styles before
  `column_for_style` is called.
* **#9 — `tools.mdx` notes the runtime layer is future work.** The
  page referenced `src/lerobot/tools/`, `registry.py`, and
  `get_tools(meta)` — none exist in this PR. Added a callout at the
  start of "How to add your own tool" plus a note on the
  implementations paragraph.
* **#10 — tests for YAML round-trip, malformed rows, blend
  validation.** `test_recipe.py` grew from 1 case to 12 covering:
  blend-or-messages exclusivity, target-turn requirement, blend
  emptiness, weight presence/positivity, nested-blend rejection,
  `from_dict` with nested blends, `from_yaml` / `load_recipe`
  agreement, top-level non-mapping rejection. Added a malformed-row
  test for `_normalize_rows` that asserts non-dict entries raise
  `TypeError`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: emitted_at uses 0.1s tolerance; MessageTurn requires stream at construction

* **Float tolerance in `emitted_at` for persistent styles.** The
  ``_timestamp(row) == t`` exact-equality check silently missed any
  caller that derived ``t`` arithmetically (e.g. ``frame_idx / fps``)
  even though the parquet timestamp would only differ by ULPs. Added
  ``EMITTED_AT_TOLERANCE_S = 0.1`` and check ``abs(...) <= tolerance``
  instead, with a docstring explaining why exact equality wasn't
  enough and why 0.1 s is safe at typical 30–100 Hz control rates.
  Test asserts the new behavior at half-window (matches) and
  double-window (no match) using the constant so it stays in sync.

* **`MessageTurn.stream` is required at construction.** It was typed
  ``MessageStream | None = None`` so YAML could omit ``stream:`` and
  pass the dataclass invariant — but ``_validate_rendered`` rejected
  ``None`` streams later, surfacing the error at the first sample
  instead of at recipe load. Now ``__post_init__`` raises
  ``ValueError`` if ``stream`` is ``None``, with the list of valid
  streams in the message. The redundant late-stage check in
  ``_validate_rendered`` is replaced with a one-line comment that
  cites the upstream invariant. Test pins the new construction-time
  rejection.
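
To illustrate the first point, a small sketch of why exact equality misses and
how the tolerance check behaves. Only the constant name and value come from
this change; `matches_emitted_at` and `as_float32` are hypothetical helpers.

```python
import struct

EMITTED_AT_TOLERANCE_S = 0.1


def as_float32(x: float) -> float:
    """Round a Python float to float32 precision, as a stored timestamp would be."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matches_emitted_at(row_timestamp: float, t: float) -> bool:
    """Tolerance-based comparison replacing the exact `row_timestamp == t` check."""
    return abs(row_timestamp - t) <= EMITTED_AT_TOLERANCE_S


stored = as_float32(7 / 30)  # timestamp as read back from storage
t = 7 / 30  # caller derives t as frame_idx / fps
exact = stored == t  # differs in the low bits, so equality fails
tolerant = matches_emitted_at(stored, t)  # within the window, so it matches
```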

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(tools): drop follow-up-PR references

Reword the two callouts in `tools.mdx` to describe the runtime layer
in present tense ("not part of the catalog layer shipped today",
"those modules don't yet exist in the tree") instead of pointing at a
specific follow-up PR. Keeps the doc honest about what works now
without coupling it to a particular release order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review: address CarolinePascal feedback

- language timestamps: float64 -> float32 to match LeRobotDataset frame
  timestamps (Arrow struct + HF feature)
- dataset_metadata: hoist `.language` imports to module top — language.py
  has no lerobot imports, so there is no circular-import risk
- dataset_metadata: add a `meta.tools` setter that persists the catalog to
  info.json and reloads `meta.info`
- feature_utils: validate the `language` dtype instead of returning "" —
  warn (non-fatal) when a non-empty value is written at record time
- centralize the scalar-unwrap helper as `lerobot.utils.utils.unwrap_scalar`,
  shared by render_messages_processor and language_render
- docs: move `## Layer 2 — recipe anatomy` ahead of the resolver sections,
  which describe recipe bindings rather than dataset layout
- language_render: note in EMITTED_AT_TOLERANCE_S that persistent rows change
  on a human-action timescale, not the camera frame rate

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/source/_toctree.yml                      |   6 +-
 docs/source/dataset_subtask.mdx               | 277 ---------
 docs/source/language_and_recipes.mdx          | 147 +++++
 docs/source/tools.mdx                         | 210 +++++++
 pyproject.toml                                |   2 +-
 src/lerobot/configs/__init__.py               |   4 +
 src/lerobot/configs/recipe.py                 | 206 +++++++
 src/lerobot/datasets/__init__.py              |  14 +
 src/lerobot/datasets/compute_stats.py         |   2 +-
 src/lerobot/datasets/dataset_metadata.py      |  47 +-
 src/lerobot/datasets/dataset_reader.py        |   5 -
 src/lerobot/datasets/feature_utils.py         |  41 +-
 src/lerobot/datasets/io_utils.py              |  17 +-
 src/lerobot/datasets/language.py              | 242 ++++++++
 src/lerobot/datasets/language_render.py       | 545 ++++++++++++++++++
 src/lerobot/datasets/utils.py                 |   8 +-
 src/lerobot/processor/__init__.py             |   7 +
 src/lerobot/processor/batch_processor.py      |  18 +
 src/lerobot/processor/converters.py           |  36 +-
 .../processor/render_messages_processor.py    |  84 +++
 src/lerobot/scripts/lerobot_train.py          |   6 +
 src/lerobot/utils/collate.py                  |  65 +++
 src/lerobot/utils/utils.py                    |  19 +
 tests/configs/test_recipe.py                  | 168 ++++++
 tests/datasets/test_dataset_metadata.py       | 137 +++++
 tests/datasets/test_language.py               | 173 ++++++
 tests/datasets/test_language_render.py        | 417 ++++++++++++++
 tests/datasets/test_subtask_dataset.py        | 193 -------
 .../test_render_messages_processor.py         |  60 ++
 tests/utils/test_collate.py                   |  84 +++
 uv.lock                                       |   2 +-
 31 files changed, 2730 insertions(+), 512 deletions(-)
 delete mode 100644 docs/source/dataset_subtask.mdx
 create mode 100644 docs/source/language_and_recipes.mdx
 create mode 100644 docs/source/tools.mdx
 create mode 100644 src/lerobot/configs/recipe.py
 create mode 100644 src/lerobot/datasets/language.py
 create mode 100644 src/lerobot/datasets/language_render.py
 create mode 100644 src/lerobot/processor/render_messages_processor.py
 create mode 100644 src/lerobot/utils/collate.py
 create mode 100644 tests/configs/test_recipe.py
 create mode 100644 tests/datasets/test_language.py
 create mode 100644 tests/datasets/test_language_render.py
 delete mode 100644 tests/datasets/test_subtask_dataset.py
 create mode 100644 tests/processor/test_render_messages_processor.py
 create mode 100644 tests/utils/test_collate.py

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 470319c48..412386e2d 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -39,8 +39,10 @@
     title: Porting Large Datasets
   - local: using_dataset_tools
     title: Using the Dataset Tools
-  - local: dataset_subtask
-    title: Using Subtasks in the Dataset
+  - local: language_and_recipes
+    title: Language Columns and Recipes
+  - local: tools
+    title: Tools
   - local: video_encoding_parameters
     title: Video encoding parameters
   - local: streaming_video_encoding
diff --git a/docs/source/dataset_subtask.mdx b/docs/source/dataset_subtask.mdx
deleted file mode 100644
index 6264aca22..000000000
--- a/docs/source/dataset_subtask.mdx
+++ /dev/null
@@ -1,277 +0,0 @@
-# Using Subtasks in LeRobot Datasets
-
-Subtask support in robotics datasets has proven effective in improving robot reasoning and understanding. Subtasks are particularly useful for:
-
-- **Hierarchical policies**: Building policies that include subtask predictions to visualize robot reasoning in real time
-- **Reward modeling**: Helping reward models understand task progression (e.g., SARM-style stage-aware reward models)
-- **Task decomposition**: Breaking down complex manipulation tasks into atomic, interpretable steps
-
-LeRobotDataset now supports subtasks as part of its dataset structure, alongside tasks.
-
-## What are Subtasks?
-
-While a **task** describes the overall goal (e.g., "Pick up the apple and place it in the basket"), **subtasks** break down the execution into finer-grained steps:
-
-1. "Approach the apple"
-2. "Grasp the apple"
-3. "Lift the apple"
-4. "Move to basket"
-5. "Release the apple"
-
-Each frame in the dataset can be annotated with its corresponding subtask, enabling models to learn and predict these intermediate stages.
-
-<img
-  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/subtask-asset.png"
-  alt="An overview of subtask annotation showing how frames are labeled with intermediate subtask stages"
-  width="80%"
-/>
-
-<p>
-  <em>Figure: Overview of subtask annotation.</em>
-</p>
-
-**Reference:** _Subtask-learning based for robot self-assembly in flexible collaborative assembly in manufacturing_, Original Article, Published: 19 April 2022.
-
-## Dataset Structure
-
-Subtask information is stored in the dataset metadata:
-
-```
-my-dataset/
-├── data/
-│   └── ...
-├── meta/
-│   ├── info.json
-│   ├── stats.json
-│   ├── tasks.parquet
-│   ├── subtasks.parquet      # Subtask index → subtask string mapping
-│   └── episodes/
-│       └── ...
-└── videos/
-    └── ...
-```
-
-### Subtasks Parquet File
-
-The `meta/subtasks.parquet` file maps subtask indices to their natural language descriptions:
-
-| subtask_index | subtask (index column) |
-| ------------- | ---------------------- |
-| 0             | "Approach the apple"   |
-| 1             | "Grasp the apple"      |
-| 2             | "Lift the apple"       |
-| ...           | ...                    |
-
-### Frame-Level Annotations
-
-Each frame in the dataset can include a `subtask_index` field that references the subtasks parquet file:
-
-```python
-# Example frame data in the parquet file
-{
-    "index": 42,
-    "timestamp": 1.4,
-    "episode_index": 0,
-    "task_index": 0,
-    "subtask_index": 2,  # References "Lift the apple"
-    "observation.state": [...],
-    "action": [...],
-}
-```
-
-## Annotating Datasets with Subtasks
-
-We provide a HuggingFace Space for easily annotating any LeRobotDataset with subtasks:
-
-**[https://huggingface.co/spaces/lerobot/annotate](https://huggingface.co/spaces/lerobot/annotate)**
-
-After completing your annotation:
-
-1. Click "Push to Hub" to upload your annotated dataset
-2. You can also run the annotation space locally by following the instructions at [github.com/huggingface/lerobot-annotate](https://github.com/huggingface/lerobot-annotate)
-
-## Loading Datasets with Subtasks
-
-When you load a dataset with subtask annotations, the subtask information is automatically available:
-
-```python
-from lerobot.datasets import LeRobotDataset
-
-# Load a dataset with subtask annotations
-dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
-
-# Access a sample
-sample = dataset[100]
-
-# The sample includes both task and subtask information
-print(sample["task"])        # "Collect the fruit"
-print(sample["subtask"])     # "Grasp the apple"
-print(sample["task_index"])  # tensor(0)
-print(sample["subtask_index"])  # tensor(2)
-```
-
-### Checking for Subtask Support
-
-You can check if a dataset has subtask annotations:
-
-```python
-# Check if subtasks are available
-has_subtasks = (
-    "subtask_index" in dataset.features
-    and dataset.meta.subtasks is not None
-)
-
-if has_subtasks:
-    print(f"Dataset has {len(dataset.meta.subtasks)} unique subtasks")
-    print("Subtasks:", list(dataset.meta.subtasks.index))
-```
-
-## Using Subtasks for Training
-
-### With the Tokenizer Processor
-
-The `TokenizerProcessor` automatically handles subtask tokenization for Vision-Language Action (VLA) models:
-
-```python
-from lerobot.processor import TokenizerProcessorStep
-
-# Create a tokenizer processor step
-tokenizer_processor = TokenizerProcessorStep(
-    tokenizer_name_or_path="google/paligemma-3b-pt-224",
-    padding="max_length",
-    max_length=64,
-)
-
-# The processor will automatically tokenize subtasks if present in the batch
-# and add them to the observation under:
-# - "observation.subtask.tokens"
-# - "observation.subtask.attention_mask"
-```
-
-When subtasks are available in the batch, the tokenizer processor adds:
-
-- `observation.subtask.tokens`: Tokenized subtask text
-- `observation.subtask.attention_mask`: Attention mask for the subtask tokens
-
-### DataLoader with Subtasks
-
-```python
-import torch
-from lerobot.datasets import LeRobotDataset
-
-dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
-
-dataloader = torch.utils.data.DataLoader(
-    dataset,
-    batch_size=16,
-    shuffle=True,
-)
-
-for batch in dataloader:
-    # Access subtask information in the batch
-    subtasks = batch["subtask"]  # List of subtask strings
-    subtask_indices = batch["subtask_index"]  # Tensor of subtask indices
-
-    # Use for training hierarchical policies or reward models
-    print(f"Batch subtasks: {set(subtasks)}")
-```
-
-## Example Datasets with Subtask Annotations
-
-Try loading a dataset with subtask annotations:
-
-```python
-from lerobot.datasets import LeRobotDataset
-
-# Example dataset with subtask annotations
-dataset = LeRobotDataset("jadechoghari/collect-fruit-annotated")
-
-# Explore the subtasks
-print("Available subtasks:")
-for subtask_name in dataset.meta.subtasks.index:
-    print(f"  - {subtask_name}")
-
-# Get subtask distribution
-subtask_counts = {}
-for i in range(len(dataset)):
-    sample = dataset[i]
-    subtask = sample["subtask"]
-    subtask_counts[subtask] = subtask_counts.get(subtask, 0) + 1
-
-print("\nSubtask distribution:")
-for subtask, count in sorted(subtask_counts.items(), key=lambda x: -x[1]):
-    print(f"  {subtask}: {count} frames")
-```
-
-## Use Cases
-
-### 1. Hierarchical Policy Training
-
-Train policies that predict both actions and current subtask:
-
-```python
-class HierarchicalPolicy(nn.Module):
-    def __init__(self, num_subtasks):
-        super().__init__()
-        self.action_head = nn.Linear(hidden_dim, action_dim)
-        self.subtask_head = nn.Linear(hidden_dim, num_subtasks)
-
-    def forward(self, observations):
-        features = self.encoder(observations)
-        actions = self.action_head(features)
-        subtask_logits = self.subtask_head(features)
-        return actions, subtask_logits
-```
-
-### 2. Stage-Aware Reward Modeling (SARM)
-
-Build reward models that understand task progression:
-
-```python
-# SARM predicts:
-# - Stage: Which subtask is being executed (discrete)
-# - Progress: How far along the subtask (continuous 0-1)
-
-class SARMRewardModel(nn.Module):
-    def forward(self, observations):
-        features = self.encoder(observations)
-        stage_logits = self.stage_classifier(features)
-        progress = self.progress_regressor(features)
-        return stage_logits, progress
-```
-
-### 3. Progress Visualization
-
-Monitor robot execution by tracking subtask progression:
-
-```python
-def visualize_execution(model, observations):
-    for t, obs in enumerate(observations):
-        action, subtask_logits = model(obs)
-        predicted_subtask = subtask_names[subtask_logits.argmax()]
-        print(f"t={t}: Executing '{predicted_subtask}'")
-```
-
-## API Reference
-
-### LeRobotDataset Properties
-
-| Property                    | Type                   | Description                                |
-| --------------------------- | ---------------------- | ------------------------------------------ |
-| `meta.subtasks`             | `pd.DataFrame \| None` | DataFrame mapping subtask names to indices |
-| `features["subtask_index"]` | `dict`                 | Feature spec for subtask_index if present  |
-
-### Sample Keys
-
-When subtasks are available, each sample includes:
-
-| Key             | Type           | Description                          |
-| --------------- | -------------- | ------------------------------------ |
-| `subtask_index` | `torch.Tensor` | Integer index of the current subtask |
-| `subtask`       | `str`          | Natural language subtask description |
-
-## Related Resources
-
-- [SARM Paper](https://arxiv.org/pdf/2509.25358) - Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
-- [LeRobot Annotate Space](https://huggingface.co/spaces/lerobot/annotate) - Interactive annotation tool
-- [LeRobotDataset v3.0](./lerobot-dataset-v3) - Dataset format documentation
diff --git a/docs/source/language_and_recipes.mdx b/docs/source/language_and_recipes.mdx
new file mode 100644
index 000000000..4181dbe34
--- /dev/null
+++ b/docs/source/language_and_recipes.mdx
@@ -0,0 +1,147 @@
+# Language columns and recipes
+
+Most LeRobot datasets ship with a single `task` string per episode — fine for
+short, single-instruction skills, but not enough for the longer-horizon,
+multi-modal robot policies the field is moving toward (high-level planning,
+memory, interjections, VQA, tool use). To support those policies without
+forking the dataset format, LeRobot extends `LeRobotDataset` with two optional
+language columns and a small recipe layer that turns those rows into
+chat-style training samples on the fly.
+
+The design splits cleanly into three layers:
+
+1. **Data in the dataset** — language annotations stored next to frames in
+   `data/chunk-*/file-*.parquet` as two optional columns (`language_persistent`
+   and `language_events`). Datasets without these columns keep their existing
+   behavior.
+2. **Recipe** — a YAML file that declares which annotation rows to bind and
+   how to lay them out as chat turns (`role`, `content`, optional images,
+   optional tool calls). Recipes are pure config; no Python required to add a
+   new one.
+3. **Training format** — at sample time, `RenderMessagesStep` resolves the
+   recipe against the per-frame annotations and emits HF-style `messages` plus
+   LeRobot-specific sidecars (`message_streams`, `target_message_indices`)
+   that policy processors consume.
+
+This page describes each layer in turn.
+
+## Layer 1 — language columns in the dataset
+
+The two optional columns live next to frame data in
+`data/chunk-*/file-*.parquet`:
+
+- `language_persistent`: a list of rows broadcast across every frame in an episode for state that remains active, such as `subtask`, `plan`, and `memory`.
+- `language_events`: a list of rows only on the exact frame where an event was emitted, such as `interjection`, `vqa`, and speech tool calls.
+
+Both columns share the same row shape (event rows omit `timestamp` because the
+frame the row sits on already provides it):
+
+```text
+role: string
+content: string | null
+style: string | null
+timestamp: float32        # persistent rows only
+camera: string | null     # observation.images.* feature key, view-dependent rows only
+tool_calls: list[Json] | null
+```
+
+The `camera` field tags rows whose `content` is grounded in a specific camera
+view. Rows of view-dependent styles (`vqa` and `trace`) MUST set `camera` to
+the matching `observation.images.*` feature key. Rows of every other style —
+including `motion`, which describes robot-frame primitives in joint / Cartesian
+terms — MUST leave `camera` as `null`. Pipeline writers and the validator
+enforce this via `validate_camera_field(style, camera)`.
+
+`meta/tasks.parquet` remains the canonical source for the task. The special `${task}` recipe binding always reads that task string and does not depend on language annotations.
+
+### Architecture
+
+The language stack is implemented by three internal components:
+
+1. `lerobot.datasets.language` defines the schema, style registry, and `column_for_style`.
+2. `lerobot.datasets.language_render` resolves rows and renders messages.
+3. `RenderMessagesStep` turns dataset samples into `messages`, `message_streams`, and `target_message_indices`.
+
+`LeRobotDataset` stays recipe-agnostic. It passes `language_persistent` and `language_events` through when present, and unannotated datasets keep their existing behavior.
+
+## Layer 2 — recipe anatomy
+
+Recipes are YAML files backed by `TrainingRecipe` and `MessageTurn`. They
+declare which annotation rows to pull (via `bindings`) and how to compose them
+into chat turns (`messages`).
+
+```yaml
+messages:
+  - { role: user, content: "${task}", stream: high_level }
+  - { role: assistant, content: "${subtask}", stream: low_level, target: true }
+```
+
+A recipe can also branch into a weighted **blend** of sub-recipes. At sample
+time, exactly one branch is selected deterministically from the sample index,
+so different frames train different objectives (e.g. memory updates vs.
+low-level execution vs. VQA) without any Python wiring.
+
+### Temporal semantics
+
+Persistent styles are active after emission until replaced:
+
+- `active_at(t, style=subtask)`
+- `nth_prev(style=memory, offset=1)`
+- `nth_next(style=subtask, offset=1)`
+
+Event styles only exist on their exact timestamp:
+
+- `emitted_at(t, style=interjection)`
+- `emitted_at(t, style=vqa, role=user, camera=observation.images.top)`
+- `emitted_at(t, role=assistant, tool_name=say)`
+
+Exact event matching has no tolerance window, so writers must place each event row on the exact frame whose parquet timestamp matches the event.
+
+### View-dependent resolution
+
+For view-dependent styles (`vqa` and `trace`), the resolver gains a
+`camera=` filter parallel to `role=` and `tool_name=`. Datasets with multiple
+cameras typically emit one (`vqa`, `user`) + (`vqa`, `assistant`) pair per
+camera at the same timestamp; without `camera=`, those resolvers see one
+match per camera and raise an ambiguity error. Recipes consume each camera
+through its own binding plus a matching image block, e.g.
+
+```yaml
+ask_vqa_top:
+  bindings:
+    vqa_query: "emitted_at(t, style=vqa, role=user, camera=observation.images.top)"
+    vqa: "emitted_at(t, style=vqa, role=assistant, camera=observation.images.top)"
+  messages:
+    - role: user
+      stream: high_level
+      if_present: vqa_query
+      content:
+        - { type: image, feature: observation.images.top }
+        - { type: text, text: "${vqa_query}" }
+    - {
+        role: assistant,
+        content: "${vqa}",
+        stream: high_level,
+        target: true,
+        if_present: vqa,
+      }
+```
+
+Add one such sub-recipe per camera the dataset records.
+
+## Layer 3 — training format
+
+Rendered samples use HF-style chat messages plus LeRobot sidecars:
+
+```python
+sample["messages"]
+sample["message_streams"]
+sample["target_message_indices"]
+```
+
+The renderer does not apply a tokenizer chat template. Policy processors decide how to serialize the messages for their backbone, which keeps the same dataset usable across SmolVLA, Pi0.5, and any future VLM that expects OpenAI-style chat messages.
+
+## Graceful absence
+
+If both language columns are missing, `None`, or empty, `RenderMessagesStep` is a no-op.
+If an event-scoped branch is selected on a frame without the required event row, rendering returns `None`, allowing a loader to retry another sample.
diff --git a/docs/source/tools.mdx b/docs/source/tools.mdx
new file mode 100644
index 000000000..d88881184
--- /dev/null
+++ b/docs/source/tools.mdx
@@ -0,0 +1,210 @@
+# Tools
+
+LeRobot v3.1 supports **tool calls** in policies — assistant messages can
+emit structured invocations like `say(text="OK, starting now")` that the
+runtime dispatches to a real implementation (TTS, controller, logger, …).
+
+This page covers:
+
+1. Where the tool catalog lives.
+2. How the annotation pipeline produces tool-call atoms.
+3. How to add your own tool.
+
+## Where tools are declared
+
+Two layers.
+
+**The catalog** — a list of OpenAI-style function schemas — lives at
+`meta/info.json["tools"]` on each dataset. Example:
+
+```json
+{
+  "features": { "...": "..." },
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "say",
+        "description": "Speak a short utterance to the user via the TTS executor.",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "text": {
+              "type": "string",
+              "description": "The verbatim text to speak."
+            }
+          },
+          "required": ["text"]
+        }
+      }
+    }
+  ]
+}
+```
+
+Read it via the dataset metadata accessor:
+
+```python
+from lerobot.datasets.dataset_metadata import LeRobotDatasetMetadata
+
+meta = LeRobotDatasetMetadata(repo_id="pepijn/super_poulain_final_annotations")
+tools = meta.tools     # list[dict] — OpenAI tool schemas
+```
+
+If the dataset's `info.json` doesn't declare any tools, `meta.tools`
+returns `DEFAULT_TOOLS` from `lerobot.datasets.language` — currently a
+single-entry list with the canonical `say` schema. So unannotated
+datasets and chat-template consumers keep working without any
+configuration:
+
+```python
+prompt_str = tokenizer.apply_chat_template(
+    sample["messages"],
+    tools=meta.tools,                 # works either way
+    add_generation_prompt=False,
+    tokenize=False,
+)
+```
+
+**The implementations** — runnable Python — will live under
+`src/lerobot/tools/`, one file per tool. The runtime dispatcher and
+the canonical `say` implementation (wrapping Kyutai's pocket-tts) are
+not part of the catalog layer described here; today this layer ships
+only the schema storage and the `DEFAULT_TOOLS` fallback constant.
+
+## Per-row tool _invocations_
+
+The catalog above describes _what can be called_. The actual _call_ — the
+function name plus the argument values — is stored per-row, on the
+assistant atoms in `language_events`:
+
+```json
+{
+  "role": "assistant",
+  "content": null,
+  "style": null,
+  "timestamp": 12.4,
+  "camera": null,
+  "tool_calls": [
+    { "type": "function",
+      "function": { "name": "say", "arguments": { "text": "On it." } } }
+  ]
+}
+```
+
+Recipes splice these into rendered messages via `tool_calls_from`:
+
+```yaml
+user_interjection_response:
+  bindings:
+    speech: "emitted_at(t, role=assistant, tool_name=say)"
+  messages:
+    - { role: user, content: "${task}", stream: high_level }
+    - {
+        role: assistant,
+        content: "${plan}",
+        stream: high_level,
+        target: true,
+        tool_calls_from: speech,
+      }
+```
+
+The model's training target is one assistant turn that carries both the
+plan text _and_ the `say` tool call. At inference, the runtime parses
+the generated text back into structured `tool_calls` and dispatches to
+the matching implementation.
+
+## How to add your own tool
+
+> **Note:** Steps 2 and 3 below describe the runtime layer
+> (`src/lerobot/tools/`, the `Tool` protocol, `TOOL_REGISTRY`,
+> `get_tools(meta)`) which is not part of the catalog layer shipped
+> today — those modules don't yet exist in the tree. Step 1 alone is
+> enough to make the tool visible to the chat template via
+> `meta.tools` so the model can learn to _generate_ the call;
+> executing the call at inference requires the runtime layer.
+
+Three steps. Concrete example: a `record_observation` tool the policy
+can call to capture an extra observation outside the regular control
+loop.
+
+### Step 1 — declare the schema
+
+Add an entry under `meta/info.json["tools"]`. Either edit the file
+directly on disk _before_ running the annotation pipeline (it'll be
+preserved) or hand it to `lerobot-annotate` via a config flag.
+
+```json
+{
+  "tools": [
+    { "type": "function", "function": { "name": "say", "...": "..." } },
+    {
+      "type": "function",
+      "function": {
+        "name": "record_observation",
+        "description": "Capture a high-resolution still image for the user.",
+        "parameters": {
+          "type": "object",
+          "properties": {
+            "label": {
+              "type": "string",
+              "description": "Short label for the saved image."
+            }
+          },
+          "required": ["label"]
+        }
+      }
+    }
+  ]
+}
+```
+
+The schema follows OpenAI's function-calling convention exactly, so the
+chat template can render it natively.
+
+### Step 2 — implement the call
+
+Create `src/lerobot/tools/record_observation.py`:
+
+```python
+from .base import Tool
+from typing import Any
+
+RECORD_OBSERVATION_SCHEMA: dict[str, Any] = { "...": "..." }   # mirrors the JSON above
+
+
+class RecordObservationTool:
+    name = "record_observation"
+    schema = RECORD_OBSERVATION_SCHEMA
+
+    def __init__(self, schema: dict | None = None, output_dir: str = "."):
+        self.output_dir = output_dir
+
+    def call(self, arguments: dict) -> str:
+        label = arguments["label"]
+        # ... save the latest camera frame to <output_dir>/<label>.png ...
+        return f"saved {label}.png"
+```
+
+One file per tool keeps dependencies isolated — `record_observation`
+might pull `pillow`, while `say` pulls `pocket-tts`. Users installing
+only the tools they need avoid heavy transitive deps.
+
+### Step 3 — register it
+
+Add to `src/lerobot/tools/registry.py`:
+
+```python
+from .record_observation import RecordObservationTool
+
+TOOL_REGISTRY["record_observation"] = RecordObservationTool
+```
+
+That's it. At runtime `get_tools(meta)` looks up each schema in
+`meta.tools`, instantiates the matching registered class, and returns
+a name → instance dict the dispatcher can route into.
+
+If you want to use a tool _without_ writing an implementation (e.g. for
+training-time chat-template formatting only), step 1 alone is enough —
+the model still learns to _generate_ the call. Steps 2 and 3 are only
+needed to actually _execute_ it at inference.
diff --git a/pyproject.toml b/pyproject.toml
index 93953cd57..ca6248c95 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -95,7 +95,7 @@ dependencies = [
 
 # ── Feature-scoped extras ──────────────────────────────────
 dataset = [
-    "datasets>=4.0.0,<5.0.0",
+    "datasets>=4.7.0,<5.0.0",
     "pandas>=2.0.0,<3.0.0", # NOTE: Transitive dependency of datasets
     "pyarrow>=21.0.0,<30.0.0", # NOTE: Transitive dependency of datasets
     "lerobot[av-dep]",
diff --git a/src/lerobot/configs/__init__.py b/src/lerobot/configs/__init__.py
index c3fe246cd..be4491811 100644
--- a/src/lerobot/configs/__init__.py
+++ b/src/lerobot/configs/__init__.py
@@ -24,6 +24,7 @@ Import them directly: ``from lerobot.configs.train import TrainPipelineConfig``
 from .dataset import DatasetRecordConfig
 from .default import DatasetConfig, EvalConfig, PeftConfig, WandBConfig
 from .policies import PreTrainedConfig
+from .recipe import MessageTurn, TrainingRecipe, load_recipe
 from .types import (
     FeatureType,
     NormalizationMode,
@@ -49,9 +50,12 @@ __all__ = [
     "DatasetRecordConfig",
     "DatasetConfig",
     "EvalConfig",
+    "MessageTurn",
     "PeftConfig",
     "PreTrainedConfig",
+    "TrainingRecipe",
     "WandBConfig",
+    "load_recipe",
     "VideoEncoderConfig",
     # Defaults
     "camera_encoder_defaults",
diff --git a/src/lerobot/configs/recipe.py b/src/lerobot/configs/recipe.py
new file mode 100644
index 000000000..28e5a0db3
--- /dev/null
+++ b/src/lerobot/configs/recipe.py
@@ -0,0 +1,206 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import re
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Literal, get_args
+
+MessageRole = Literal["user", "assistant", "system", "tool"]
+MessageStream = Literal["high_level", "low_level"]
+
+DEFAULT_BINDINGS = {
+    "subtask": "active_at(t, style=subtask)",
+    "memory": "active_at(t, style=memory)",
+    "plan": "active_at(t, style=plan)",
+    "speech": "emitted_at(t, role=assistant, tool_name=say)",
+    "interjection": "emitted_at(t, style=interjection)",
+    "vqa": "emitted_at(t, style=vqa, role=assistant)",
+    "vqa_query": "emitted_at(t, style=vqa, role=user)",
+}
+
+PLACEHOLDER_RE = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
+"""``${name}`` placeholder pattern used by both recipe binding-reference
+discovery (here) and rendered-message substitution (in ``language_render``)."""
+
+_VALID_ROLES = frozenset(get_args(MessageRole))
+_VALID_STREAMS = frozenset(get_args(MessageStream))
+
+
+@dataclass
+class MessageTurn:
+    """A single chat-style turn in a recipe template.
+
+    ``content`` may be a plain string, a list of HF-style multimodal blocks, or
+    ``None`` when ``tool_calls_from`` supplies tool-call payloads instead.
+    ``stream`` tags the turn for downstream filtering, ``target`` flags it as a
+    training target, and ``if_present`` skips the turn when the named binding
+    resolves to ``None``.
+    """
+
+    role: MessageRole
+    content: str | list[dict[str, Any]] | None = None
+    stream: MessageStream | None = None
+    target: bool = False
+    if_present: str | None = None
+    tool_calls_from: str | None = None
+
+    def __post_init__(self) -> None:
+        """Validate role, stream, and content after dataclass construction."""
+        if self.role not in _VALID_ROLES:
+            raise ValueError(f"Unsupported message role: {self.role!r}")
+        # ``stream`` is typed Optional only so the dataclass can keep its
+        # field ordering, but recipes must always tag every turn with a
+        # stream — the renderer's ``_validate_rendered`` would reject
+        # ``None`` later on. Fail at construction so the bad recipe is
+        # caught at YAML load time rather than at the first sample.
+        if self.stream is None:
+            raise ValueError(
+                f"MessageTurn(role={self.role!r}) is missing a stream — "
+                f"every turn must declare one of {sorted(_VALID_STREAMS)}."
+            )
+        if self.stream not in _VALID_STREAMS:
+            raise ValueError(f"Unsupported message stream: {self.stream!r}")
+        if self.content is None and self.tool_calls_from is None:
+            raise ValueError("MessageTurn.content is required unless tool_calls_from is set.")
+        if self.content is not None and not isinstance(self.content, (str, list)):
+            raise TypeError("MessageTurn.content must be a string, a list of HF-style blocks, or None.")
+        if isinstance(self.content, list):
+            for block in self.content:
+                if not isinstance(block, dict) or "type" not in block:
+                    raise ValueError(
+                        "Multimodal content blocks must be HF-style dictionaries with a type key."
+                    )
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> MessageTurn:
+        """Construct a :class:`MessageTurn` from a plain dictionary."""
+        return cls(**data)
+
+
+@dataclass
+class TrainingRecipe:
+    """A recipe describing how to render training samples from language rows.
+
+    A recipe is either a *message recipe* (``messages`` plus optional
+    ``bindings``) or a *blend recipe* (``blend`` mapping names to weighted
+    sub-recipes). ``weight`` is only meaningful inside a blend.
+    """
+
+    messages: list[MessageTurn] | None = None
+    bindings: dict[str, str] | None = None
+    blend: dict[str, TrainingRecipe] | None = None
+    weight: float | None = None
+
+    def __post_init__(self) -> None:
+        """Validate that exactly one of ``messages`` or ``blend`` is set."""
+        if self.messages is not None and self.blend is not None:
+            raise ValueError("TrainingRecipe must set only one of messages or blend.")
+        if self.messages is None and self.blend is None:
+            raise ValueError("TrainingRecipe must set one of messages or blend.")
+
+        if self.messages is not None:
+            self._validate_message_recipe()
+        if self.blend is not None:
+            self._validate_blend_recipe()
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> TrainingRecipe:
+        """Construct a :class:`TrainingRecipe` from a nested dictionary."""
+        data = dict(data)
+        if data.get("messages") is not None:
+            data["messages"] = [
+                turn if isinstance(turn, MessageTurn) else MessageTurn.from_dict(turn)
+                for turn in data["messages"]
+            ]
+        if data.get("blend") is not None:
+            data["blend"] = {
+                name: recipe if isinstance(recipe, TrainingRecipe) else cls.from_dict(recipe)
+                for name, recipe in data["blend"].items()
+            }
+        return cls(**data)
+
+    @classmethod
+    def from_yaml(cls, path: str | Path) -> TrainingRecipe:
+        """Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
+        import yaml  # type: ignore[import-untyped]
+
+        with open(path) as f:
+            data = yaml.safe_load(f)
+        if not isinstance(data, dict):
+            raise ValueError(f"Recipe YAML must contain a mapping at the top level: {path}")
+        return cls.from_dict(data)
+
+    def _validate_message_recipe(self) -> None:
+        """Ensure every templated binding is known and at least one turn is a target."""
+        assert self.messages is not None
+        known_bindings = set(DEFAULT_BINDINGS) | set(self.bindings or {}) | {"task"}
+
+        for turn in self.messages:
+            missing = self._referenced_bindings(turn) - known_bindings
+            if missing:
+                raise ValueError(f"MessageTurn references unknown binding(s): {sorted(missing)}")
+
+        if not any(turn.target for turn in self.messages):
+            raise ValueError("Message recipes must contain at least one target turn.")
+
+    def _validate_blend_recipe(self) -> None:
+        """Ensure each blend component is a non-empty, weighted message recipe."""
+        assert self.blend is not None
+        if not self.blend:
+            raise ValueError("Blend recipes must contain at least one component.")
+
+        for name, recipe in self.blend.items():
+            if recipe.blend is not None:
+                raise ValueError(f"Blend component {name!r} cannot itself define a blend.")
+            if recipe.messages is None:
+                raise ValueError(f"Blend component {name!r} must define messages.")
+            if recipe.weight is None:
+                raise ValueError(f"Blend component {name!r} must define weight.")
+            if recipe.weight <= 0:
+                raise ValueError(f"Blend component {name!r} must have a positive weight.")
+
+    def _referenced_bindings(self, turn: MessageTurn) -> set[str]:
+        """Return the binding names that ``turn`` references via placeholders or attributes."""
+        names: set[str] = set()
+        if turn.if_present is not None:
+            names.add(turn.if_present)
+        if turn.tool_calls_from is not None:
+            names.add(turn.tool_calls_from)
+        names.update(_placeholders_in_content(turn.content))
+        return names
+
+
+def _placeholders_in_content(content: str | list[dict[str, Any]] | None) -> set[str]:
+    """Return the set of ``${name}`` placeholders found anywhere in ``content``."""
+    if content is None:
+        return set()
+    if isinstance(content, str):
+        return set(PLACEHOLDER_RE.findall(content))
+
+    names: set[str] = set()
+    for block in content:
+        for value in block.values():
+            if isinstance(value, str):
+                names.update(PLACEHOLDER_RE.findall(value))
+    return names
+
+
+def load_recipe(path: str | Path) -> TrainingRecipe:
+    """Load a :class:`TrainingRecipe` from a YAML file at ``path``."""
+    return TrainingRecipe.from_yaml(path)
diff --git a/src/lerobot/datasets/__init__.py b/src/lerobot/datasets/__init__.py
index b51ef0222..e4e3ccdf6 100644
--- a/src/lerobot/datasets/__init__.py
+++ b/src/lerobot/datasets/__init__.py
@@ -37,6 +37,14 @@ from .dataset_tools import (
 from .factory import make_dataset, resolve_delta_timestamps
 from .image_writer import safe_stop_image_writer
 from .io_utils import load_episodes, write_stats
+from .language import (
+    EVENT_ONLY_STYLES,
+    LANGUAGE_EVENTS,
+    LANGUAGE_PERSISTENT,
+    PERSISTENT_STYLES,
+    STYLE_REGISTRY,
+    column_for_style,
+)
 from .lerobot_dataset import LeRobotDataset
 from .multi_dataset import MultiLeRobotDataset
 from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
@@ -54,10 +62,15 @@ __all__ = [
     "CODEBASE_VERSION",
     "DEFAULT_EPISODES_PATH",
     "DEFAULT_QUANTILES",
+    "EVENT_ONLY_STYLES",
     "EpisodeAwareSampler",
+    "LANGUAGE_EVENTS",
+    "LANGUAGE_PERSISTENT",
     "LeRobotDataset",
     "LeRobotDatasetMetadata",
     "MultiLeRobotDataset",
+    "PERSISTENT_STYLES",
+    "STYLE_REGISTRY",
     "StreamingLeRobotDataset",
     "VideoEncodingManager",
     "check_video_encoder_parameters_pyav",
@@ -69,6 +82,7 @@ __all__ = [
     "convert_image_to_video_dataset",
     "create_initial_features",
     "create_lerobot_dataset_card",
+    "column_for_style",
     "delete_episodes",
     "get_feature_stats",
     "load_episodes",
diff --git a/src/lerobot/datasets/compute_stats.py b/src/lerobot/datasets/compute_stats.py
index f489c84a7..438ac7fba 100644
--- a/src/lerobot/datasets/compute_stats.py
+++ b/src/lerobot/datasets/compute_stats.py
@@ -512,7 +512,7 @@ def compute_episode_stats(
 
     ep_stats = {}
     for key, data in episode_data.items():
-        if features[key]["dtype"] == "string":
+        if features[key]["dtype"] in {"string", "language"}:
             continue
 
         if features[key]["dtype"] in ["image", "video"]:
diff --git a/src/lerobot/datasets/dataset_metadata.py b/src/lerobot/datasets/dataset_metadata.py
index 3c58774c3..39a1b6d2b 100644
--- a/src/lerobot/datasets/dataset_metadata.py
+++ b/src/lerobot/datasets/dataset_metadata.py
@@ -36,12 +36,12 @@ from .io_utils import (
     load_episodes,
     load_info,
     load_stats,
-    load_subtasks,
     load_tasks,
     write_info,
     write_stats,
     write_tasks,
 )
+from .language import DEFAULT_TOOLS, LANGUAGE_COLUMNS
 from .utils import (
     DEFAULT_EPISODES_PATH,
     check_version_compatibility,
@@ -177,7 +177,6 @@ class LeRobotDatasetMetadata:
         self.info = load_info(self.root)
         check_version_compatibility(self.repo_id, self._version, CODEBASE_VERSION)
         self.tasks = load_tasks(self.root)
-        self.subtasks = load_subtasks(self.root)
         self.episodes = load_episodes(self.root)
         self.stats = load_stats(self.root)
 
@@ -343,6 +342,49 @@ class LeRobotDatasetMetadata:
         """Keys to access visual modalities (regardless of their storage method)."""
         return [key for key, ft in self.features.items() if ft["dtype"] in ["video", "image"]]
 
+    @property
+    def has_language_columns(self) -> bool:
+        """Return ``True`` if the dataset declares any language column.
+
+        Used to gate language-aware code paths (collate, render step) so
+        unannotated datasets keep PyTorch's default collate behavior.
+        """
+        return any(col in self.features for col in LANGUAGE_COLUMNS)
+
+    @property
+    def tools(self) -> list[dict]:
+        """OpenAI-style tool schemas declared by this dataset.
+
+        Read from ``meta/info.json["tools"]``. Returns shallow copies, so
+        nested dicts are still shared with the stored catalog. Falls back to
+        :data:`lerobot.datasets.language.DEFAULT_TOOLS` (the canonical
+        ``say`` schema) when the dataset doesn't declare any — that way
+        unannotated datasets and chat-template consumers
+        (``apply_chat_template(messages, tools=meta.tools)``) keep
+        working out of the box.
+
+        Implementations live under :mod:`lerobot.tools` (one file per
+        tool); see ``docs/source/tools.mdx`` for the authoring guide.
+        """
+        declared = self.info.tools
+        if declared:
+            return [dict(t) for t in declared]
+        return [dict(t) for t in DEFAULT_TOOLS]
+
+    @tools.setter
+    def tools(self, value: list[dict] | None) -> None:
+        """Persist a tool catalog to ``meta/info.json`` and reload metadata.
+
+        Writes ``value`` into the on-disk ``info.json`` (or clears the
+        ``tools`` key when ``value`` is ``None`` or empty), then reloads
+        ``self.info`` so the in-memory metadata matches what's on disk.
+        Saves callers from hand-editing ``info.json`` and re-instantiating
+        the metadata object.
+        """
+        self.info.tools = [dict(t) for t in value] if value else None
+        write_info(self.info, self.root)
+        self.info = load_info(self.root)
+
     @property
     def names(self) -> dict[str, list | dict]:
         """Names of the various dimensions of vector modalities."""
@@ -671,7 +713,6 @@ class LeRobotDatasetMetadata:
         _validate_feature_names(features)
 
         obj.tasks = None
-        obj.subtasks = None
         obj.episodes = None
         obj.stats = None
         obj.info = create_empty_dataset_info(
diff --git a/src/lerobot/datasets/dataset_reader.py b/src/lerobot/datasets/dataset_reader.py
index bd1298590..59aaa40e5 100644
--- a/src/lerobot/datasets/dataset_reader.py
+++ b/src/lerobot/datasets/dataset_reader.py
@@ -295,9 +295,4 @@ class DatasetReader:
         task_idx = item["task_index"].item()
         item["task"] = self._meta.tasks.iloc[task_idx].name
 
-        # add subtask information if available
-        if "subtask_index" in self._meta.features and self._meta.subtasks is not None:
-            subtask_idx = item["subtask_index"].item()
-            item["subtask"] = self._meta.subtasks.iloc[subtask_idx].name
-
         return item
diff --git a/src/lerobot/datasets/feature_utils.py b/src/lerobot/datasets/feature_utils.py
index d5a550a4c..56264408f 100644
--- a/src/lerobot/datasets/feature_utils.py
+++ b/src/lerobot/datasets/feature_utils.py
@@ -13,6 +13,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import logging
 from pprint import pformat
 
 import datasets
@@ -23,6 +24,12 @@ from lerobot.configs import VIDEO_ENCODER_INFO_KEYS
 from lerobot.utils.constants import DEFAULT_FEATURES
 from lerobot.utils.utils import is_valid_numpy_dtype_string
 
+from .language import (
+    LANGUAGE_PERSISTENT,
+    is_language_column,
+    language_events_column_feature,
+    language_persistent_column_feature,
+)
 from .utils import (
     DEFAULT_CHUNK_SIZE,
     DEFAULT_DATA_FILE_SIZE_IN_MB,
@@ -47,7 +54,13 @@ def get_hf_features_from_features(features: dict) -> datasets.Features:
     """
     hf_features = {}
     for key, ft in features.items():
-        if ft["dtype"] == "video":
+        if is_language_column(key):
+            hf_features[key] = (
+                language_persistent_column_feature()
+                if key == LANGUAGE_PERSISTENT
+                else language_events_column_feature()
+            )
+        elif ft["dtype"] == "video":
             continue
         elif ft["dtype"] == "image":
             hf_features[key] = datasets.Image()
@@ -278,6 +291,8 @@ def validate_feature_dtype_and_shape(
         return validate_feature_image_or_video(name, expected_shape, value)
     elif expected_dtype == "string":
         return validate_feature_string(name, value)
+    elif expected_dtype == "language":
+        return validate_feature_language(name, value)
     else:
         raise NotImplementedError(f"The feature dtype '{expected_dtype}' is not implemented yet.")
 
@@ -357,6 +372,30 @@ def validate_feature_string(name: str, value: str) -> str:
     return ""
 
 
+def validate_feature_language(name: str, value) -> str:
+    """Validate a feature that is expected to hold language annotations.
+
+    Language columns (``language_persistent`` / ``language_events``) are
+    populated after recording by the annotation pipeline, not at record time.
+    Any value supplied here is dropped before the frame is written, so a
+    non-``None`` value almost certainly signals a mistake. We warn rather than
+    fail to keep recording resilient.
+
+    Args:
+        name (str): The name of the feature.
+        value: The value to validate.
+
+    Returns:
+        str: Always an empty string — language values are non-fatal.
+    """
+    if value is not None:
+        logging.warning(
+            f"The feature '{name}' is a 'language' column populated by the annotation pipeline, "
+            f"not at record time. The provided value will be dropped."
+        )
+    return ""
+
+
 def validate_episode_buffer(episode_buffer: dict, total_episodes: int, features: dict) -> None:
     """Validate the episode buffer before it's written to disk.
 
diff --git a/src/lerobot/datasets/io_utils.py b/src/lerobot/datasets/io_utils.py
index f5681c7c0..a41f34704 100644
--- a/src/lerobot/datasets/io_utils.py
+++ b/src/lerobot/datasets/io_utils.py
@@ -31,10 +31,10 @@ from torchvision import transforms
 from lerobot.utils.io_utils import load_json, write_json
 from lerobot.utils.utils import SuppressProgressBars, flatten_dict, unflatten_dict
 
+from .language import LANGUAGE_COLUMNS
 from .utils import (
     DEFAULT_DATA_FILE_SIZE_IN_MB,
     DEFAULT_EPISODES_PATH,
-    DEFAULT_SUBTASKS_PATH,
     DEFAULT_TASKS_PATH,
     EPISODES_DIR,
     INFO_PATH,
@@ -186,14 +186,6 @@ def load_tasks(local_dir: Path) -> pandas.DataFrame:
     return tasks
 
 
-def load_subtasks(local_dir: Path) -> pandas.DataFrame | None:
-    """Load subtasks from subtasks.parquet if it exists."""
-    subtasks_path = local_dir / DEFAULT_SUBTASKS_PATH
-    if subtasks_path.exists():
-        return pd.read_parquet(subtasks_path)
-    return None
-
-
 def write_episodes(episodes: Dataset, local_dir: Path) -> None:
     """Write episode metadata to a parquet file in the LeRobot v3.0 format.
     This function writes episode-level metadata to a single parquet file.
@@ -265,11 +257,13 @@ def hf_transform_to_torch(items_dict: dict[str, list[Any]]) -> dict[str, list[to
         dict: The batch with items converted to torch tensors.
     """
     for key in items_dict:
+        if key in LANGUAGE_COLUMNS:
+            continue
         first_item = items_dict[key][0]
         if isinstance(first_item, PILImage.Image):
             to_tensor = transforms.ToTensor()
             items_dict[key] = [to_tensor(img) for img in items_dict[key]]
-        elif first_item is None:
+        elif first_item is None or isinstance(first_item, dict):
             pass
         else:
             items_dict[key] = [x if isinstance(x, str) else torch.tensor(x) for x in items_dict[key]]
@@ -304,8 +298,9 @@ def item_to_torch(item: dict) -> dict:
     Returns:
         dict: Dictionary with all tensor-like items converted to torch.Tensor.
     """
+    skip_keys = {"task", *LANGUAGE_COLUMNS}
     for key, val in item.items():
-        if isinstance(val, (np.ndarray | list)) and key not in ["task"]:
+        if isinstance(val, (np.ndarray | list)) and key not in skip_keys:
             # Convert numpy arrays and lists to torch tensors
             item[key] = torch.tensor(val)
     return item
diff --git a/src/lerobot/datasets/language.py b/src/lerobot/datasets/language.py
new file mode 100644
index 000000000..124c25221
--- /dev/null
+++ b/src/lerobot/datasets/language.py
@@ -0,0 +1,242 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from typing import Literal
+
+import datasets
+import pyarrow as pa
+
+LANGUAGE_PERSISTENT = "language_persistent"
+LANGUAGE_EVENTS = "language_events"
+LANGUAGE_COLUMNS = (LANGUAGE_PERSISTENT, LANGUAGE_EVENTS)
+PERSISTENT_ROW_FIELDS = ("role", "content", "style", "timestamp", "camera", "tool_calls")
+EVENT_ROW_FIELDS = ("role", "content", "style", "camera", "tool_calls")
+
+CORE_STYLES = {
+    "subtask",
+    "plan",
+    "memory",
+    "motion",
+    "interjection",
+    "vqa",
+    "trace",
+    "task_aug",
+}
+# Project-local styles can be registered at import time by adding them to
+# ``EXTENDED_STYLES`` before ``column_for_style`` is called; they are then
+# treated as known styles alongside ``CORE_STYLES`` for resolver validation.
+# Empty by default. A downstream module that adds a style must also extend
+# ``PERSISTENT_STYLES`` or ``EVENT_ONLY_STYLES`` to declare its column.
+# ``STYLE_REGISTRY`` is a snapshot built at import; later additions are not reflected.
+EXTENDED_STYLES: set[str] = set()
+STYLE_REGISTRY = CORE_STYLES | EXTENDED_STYLES
+
+PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug"}
+EVENT_ONLY_STYLES = {"interjection", "vqa", "trace"}
+
+# Styles whose ``content`` is grounded in a specific camera view. Rows of these
+# styles MUST carry a non-null ``camera`` referencing an ``observation.images.*``
+# feature key. Rows of every other style MUST have ``camera=None``. ``motion``
+# is intentionally NOT in this set: motion primitives are described in
+# robot-frame (joint / Cartesian) terms, not pixel space, so they are
+# camera-agnostic. ``trace`` is the pixel-trajectory event style and IS
+# view-dependent. The ``camera`` field nevertheless lives on
+# ``PERSISTENT_ROW_FIELDS`` too so the schema, validator, and resolver
+# behave symmetrically across the two columns; persistent rows simply
+# always have ``camera=None`` in practice today.
+VIEW_DEPENDENT_STYLES = {"vqa", "trace"}
+
+LanguageColumn = Literal["language_persistent", "language_events"]
+
+
+def _json_arrow_type() -> pa.DataType:
+    """Return the Arrow JSON type, falling back to ``string`` on older pyarrow."""
+    return pa.json_() if hasattr(pa, "json_") else pa.string()
+
+
+def _json_feature() -> object:
+    """Return the HF ``datasets`` JSON feature, falling back to a string value."""
+    return datasets.Json() if hasattr(datasets, "Json") else datasets.Value("string")
+
+
+def language_persistent_row_arrow_type() -> pa.StructType:
+    """Return the Arrow struct type for a single persistent language row.
+
+    Persistent rows carry their own ``timestamp`` because they represent a state
+    that became active at a specific moment and remains active until superseded.
+    ``timestamp`` is ``float32`` to match the timestamp dtype LeRobotDataset
+    uses for frame data.
+    """
+    return pa.struct(
+        [
+            pa.field("role", pa.string(), nullable=False),
+            pa.field("content", pa.string(), nullable=True),
+            pa.field("style", pa.string(), nullable=True),
+            pa.field("timestamp", pa.float32(), nullable=False),
+            pa.field("camera", pa.string(), nullable=True),
+            pa.field("tool_calls", pa.list_(_json_arrow_type()), nullable=True),
+        ]
+    )
+
+
+def language_event_row_arrow_type() -> pa.StructType:
+    """Return the Arrow struct type for a single event language row.
+
+    Event rows have no ``timestamp`` field: each event is stored on the dataset
+    row whose frame timestamp is the event's firing time.
+    """
+    return pa.struct(
+        [
+            pa.field("role", pa.string(), nullable=False),
+            pa.field("content", pa.string(), nullable=True),
+            pa.field("style", pa.string(), nullable=True),
+            pa.field("camera", pa.string(), nullable=True),
+            pa.field("tool_calls", pa.list_(_json_arrow_type()), nullable=True),
+        ]
+    )
+
+
+def language_persistent_arrow_type() -> pa.ListType:
+    """Return the Arrow list type for the ``language_persistent`` column."""
+    return pa.list_(language_persistent_row_arrow_type())
+
+
+def language_events_arrow_type() -> pa.ListType:
+    """Return the Arrow list type for the ``language_events`` column."""
+    return pa.list_(language_event_row_arrow_type())
+
+
+def language_persistent_row_feature() -> dict[str, object]:
+    """Return the HF ``datasets`` feature mapping for a persistent language row."""
+    return {
+        "role": datasets.Value("string"),
+        "content": datasets.Value("string"),
+        "style": datasets.Value("string"),
+        "timestamp": datasets.Value("float32"),
+        "camera": datasets.Value("string"),
+        "tool_calls": datasets.List(_json_feature()),
+    }
+
+
+def language_event_row_feature() -> dict[str, object]:
+    """Return the HF ``datasets`` feature mapping for an event language row."""
+    return {
+        "role": datasets.Value("string"),
+        "content": datasets.Value("string"),
+        "style": datasets.Value("string"),
+        "camera": datasets.Value("string"),
+        "tool_calls": datasets.List(_json_feature()),
+    }
+
+
+def language_persistent_column_feature() -> datasets.List:
+    """Return the HF ``datasets`` feature for the ``language_persistent`` column."""
+    return datasets.List(language_persistent_row_feature())
+
+
+def language_events_column_feature() -> datasets.List:
+    """Return the HF ``datasets`` feature for the ``language_events`` column."""
+    return datasets.List(language_event_row_feature())
+
+
+def language_feature_info() -> dict[str, dict]:
+    """Return the ``info["features"]`` entries for both language columns."""
+    return {
+        LANGUAGE_PERSISTENT: {"dtype": "language", "shape": (1,), "names": None},
+        LANGUAGE_EVENTS: {"dtype": "language", "shape": (1,), "names": None},
+    }
+
+
+def is_language_column(key: str) -> bool:
+    """Return ``True`` if ``key`` is one of the dataset's language column names."""
+    return key in LANGUAGE_COLUMNS
+
+
+def is_view_dependent_style(style: str | None) -> bool:
+    """Return ``True`` if rows of ``style`` must be tagged with a ``camera`` key."""
+    return style in VIEW_DEPENDENT_STYLES
+
+
+def validate_camera_field(style: str | None, camera: str | None) -> None:
+    """Enforce the ``camera`` invariant: required iff ``style`` is view-dependent.
+
+    Raises ``ValueError`` if a view-dependent style is missing ``camera`` or if
+    a non-view-dependent style carries one. Pipeline writers and the validator
+    should call this on every emitted row.
+    """
+    if is_view_dependent_style(style):
+        if not camera:
+            raise ValueError(
+                f"Rows of view-dependent style {style!r} require a non-empty 'camera' "
+                f"field referencing an 'observation.images.*' feature key."
+            )
+    elif camera is not None:
+        raise ValueError(f"Rows of style {style!r} must have camera=None; got camera={camera!r}.")
+
+
+# --- Tool registry --------------------------------------------------------
+# Tools declared on a dataset live in ``meta/info.json["tools"]`` as a list
+# of OpenAI-style function schemas. The runtime / training stack reads them
+# through :class:`LeRobotDatasetMetadata.tools` (with these constants as
+# fallback when the dataset doesn't declare any). Implementations live
+# under :mod:`lerobot.tools` (one file per tool); see
+# ``docs/source/tools.mdx`` for the authoring guide.
+
+SAY_TOOL_SCHEMA: dict = {
+    "type": "function",
+    "function": {
+        "name": "say",
+        "description": "Speak a short utterance to the user via the TTS executor.",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "text": {
+                    "type": "string",
+                    "description": "The verbatim text to speak.",
+                }
+            },
+            "required": ["text"],
+        },
+    },
+}
+"""Canonical schema for the ``say`` tool emitted by the steerable
+annotation pipeline (PR 2 Module 2). Single source of truth — PR 2's
+writer, PR 3's runtime tool registry, and the dataset visualizer all
+import this constant rather than duplicating the dict."""
+
+DEFAULT_TOOLS: list[dict] = [SAY_TOOL_SCHEMA]
+"""Fallback tools list. Returned by ``LeRobotDatasetMetadata.tools``
+when ``meta/info.json["tools"]`` is unset, so unannotated datasets and
+chat-template consumers (``apply_chat_template(messages, tools=...)``)
+keep working out of the box."""
+
+
+def column_for_style(style: str | None) -> LanguageColumn:
+    """Map a language style to the column where rows of that style are stored.
+
+    Styles in :data:`PERSISTENT_STYLES` route to :data:`LANGUAGE_PERSISTENT`.
+    Styles in :data:`EVENT_ONLY_STYLES` and the implicit ``None`` style route
+    to :data:`LANGUAGE_EVENTS`.
+    """
+    if style is None:
+        return LANGUAGE_EVENTS
+    if style in PERSISTENT_STYLES:
+        return LANGUAGE_PERSISTENT
+    if style in EVENT_ONLY_STYLES:
+        return LANGUAGE_EVENTS
+    raise ValueError(f"Unknown language style: {style!r}")
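Reviewer note on `language.py`: the routing and camera rules above can be exercised in isolation. The following is a minimal standalone sketch, not the module itself. It copies the style sets from this patch and re-implements `column_for_style` and `validate_camera_field` in simplified form so it runs without `lerobot`, `datasets`, or `pyarrow` installed. `route_row` and the sample rows are hypothetical and exist only for illustration.

```python
# Standalone sketch: mirrors the routing and camera rules from language.py
# without importing lerobot. The style sets are copied from the patch.
PERSISTENT_STYLES = {"subtask", "plan", "memory", "motion", "task_aug"}
EVENT_ONLY_STYLES = {"interjection", "vqa", "trace"}
VIEW_DEPENDENT_STYLES = {"vqa", "trace"}


def column_for_style(style):
    """Route a style to its column; ``None`` goes to the events column."""
    if style is None or style in EVENT_ONLY_STYLES:
        return "language_events"
    if style in PERSISTENT_STYLES:
        return "language_persistent"
    raise ValueError(f"Unknown language style: {style!r}")


def validate_camera_field(style, camera):
    """Require ``camera`` if and only if ``style`` is view-dependent."""
    if style in VIEW_DEPENDENT_STYLES:
        if not camera:
            raise ValueError(f"style {style!r} requires a camera")
    elif camera is not None:
        raise ValueError(f"style {style!r} must have camera=None")


def route_row(row):
    """Hypothetical helper: validate a row, then return its target column."""
    validate_camera_field(row.get("style"), row.get("camera"))
    return column_for_style(row.get("style"))


# Hypothetical rows: one persistent subtask, one camera-grounded VQA event.
rows = [
    {"role": "assistant", "content": "pick up the cup", "style": "subtask", "camera": None},
    {"role": "user", "content": "where is the cup?", "style": "vqa", "camera": "observation.images.top"},
]
routed = [route_row(row) for row in rows]
```

Validating before routing means a `vqa` row without a camera fails early with a `ValueError` instead of being written to the events column in an unresolvable state.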
diff --git a/src/lerobot/datasets/language_render.py b/src/lerobot/datasets/language_render.py
new file mode 100644
index 000000000..999fa19ad
--- /dev/null
+++ b/src/lerobot/datasets/language_render.py
@@ -0,0 +1,545 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import copy
+import hashlib
+import re
+from collections.abc import Sequence
+from typing import Any
+
+from lerobot.configs.recipe import DEFAULT_BINDINGS, PLACEHOLDER_RE, TrainingRecipe
+from lerobot.utils.utils import unwrap_scalar
+
+from .language import LANGUAGE_PERSISTENT, column_for_style
+
+LanguageRow = dict[str, Any]
+RenderedMessages = dict[str, list[Any]]
+
+_RESOLVER_RE = re.compile(r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*)\((?P<args>.*)\)$")
+
+
+def active_at(
+    t: float,
+    *,
+    persistent: Sequence[LanguageRow],
+    style: str | None = None,
+    role: str | None = None,
+    tool_name: str | None = None,
+    camera: str | None = None,
+) -> LanguageRow | None:
+    """Return the persistent row of ``style`` that is active at time ``t``.
+
+    A persistent row is "active" at ``t`` when its own ``timestamp`` is the
+    most recent one ``<= t`` for the given ``style``/``role``/``tool_name``/
+    ``camera`` selector. Only valid for persistent styles.
+    """
+    _validate_persistent_resolver("active_at", style)
+    matches = [
+        row
+        for row in _matching_rows(persistent, style=style, role=role, tool_name=tool_name, camera=camera)
+        if _timestamp(row) <= t
+    ]
+    if not matches:
+        return None
+    latest_ts = max(_timestamp(row) for row in matches)
+    return _select_one(
+        [row for row in matches if _timestamp(row) == latest_ts],
+        style=style,
+        role=role,
+        tool_name=tool_name,
+        camera=camera,
+    )
+
+
+EMITTED_AT_TOLERANCE_S = 0.1
+"""Half-window for matching persistent rows to a frame timestamp in
+``emitted_at``. Persistent timestamps come from parquet (float32) and ``t``
+is also a float32 from parquet, so in the ideal hot path an exact match
+would suffice. Any caller that derives ``t`` arithmetically (e.g.
+``frame_idx / fps``) breaks bit-equality, though. A 0.1 s tolerance covers
+that drift while spanning only about 3 frames at 30 Hz and 10 frames at
+100 Hz on either side of ``t``. It also means two persistent rows of the
+same selector emitted within 0.1 s of each other cannot be told apart by
+``emitted_at``, which is acceptable because persistent annotations
+(subtask / plan / memory transitions) change on a human-action timescale,
+not at the camera frame rate."""
+
+
+def emitted_at(
+    t: float,
+    *,
+    persistent: Sequence[LanguageRow],
+    events: Sequence[LanguageRow],
+    style: str | None = None,
+    role: str | None = None,
+    tool_name: str | None = None,
+    camera: str | None = None,
+) -> LanguageRow | None:
+    """Return the row of ``style`` emitted at time ``t``.
+
+    For persistent styles, this matches persistent rows whose own ``timestamp``
+    is within ``EMITTED_AT_TOLERANCE_S`` of ``t`` (see that constant for why
+    we use a tolerance instead of bit-equality). For event styles, the
+    ``events`` list is assumed to come from the dataset row at frame ``t``
+    (event rows carry no timestamp of their own), so all matching event rows
+    are considered emitted at ``t``. ``camera`` filters by the row's
+    ``camera`` field — required to disambiguate when multiple view-dependent
+    rows share ``(t, role)`` across cameras.
+    """
+    if column_for_style(style) == LANGUAGE_PERSISTENT:
+        matches = [
+            row
+            for row in _matching_rows(persistent, style=style, role=role, tool_name=tool_name, camera=camera)
+            if abs(_timestamp(row) - t) <= EMITTED_AT_TOLERANCE_S
+        ]
+    else:
+        matches = _matching_rows(events, style=style, role=role, tool_name=tool_name, camera=camera)
+    return _select_one(matches, style=style, role=role, tool_name=tool_name, camera=camera)
+
+
+def nth_prev(
+    t: float,
+    *,
+    persistent: Sequence[LanguageRow],
+    style: str | None = None,
+    offset: int = 1,
+    role: str | None = None,
+    tool_name: str | None = None,
+    camera: str | None = None,
+) -> LanguageRow | None:
+    """Return the persistent row that was active ``offset`` steps before ``t``.
+
+    Walks back through chronologically sorted persistent rows of ``style``
+    (filtered by optional ``role``/``tool_name``/``camera``) and returns the
+    one ``offset`` positions before the row active at ``t``. Only valid for
+    persistent styles.
+    """
+    return _nth_relative("nth_prev", t, persistent, style, -offset, role, tool_name, camera)
+
+
+def nth_next(
+    t: float,
+    *,
+    persistent: Sequence[LanguageRow],
+    style: str | None = None,
+    offset: int = 1,
+    role: str | None = None,
+    tool_name: str | None = None,
+    camera: str | None = None,
+) -> LanguageRow | None:
+    """Return the persistent row that becomes active ``offset`` steps after ``t``.
+
+    Walks forward through chronologically sorted persistent rows of ``style``
+    (filtered by optional ``role``/``tool_name``/``camera``) and returns the
+    one ``offset`` positions after the row active at ``t``. Only valid for
+    persistent styles.
+    """
+    return _nth_relative("nth_next", t, persistent, style, offset, role, tool_name, camera)
+
+
+def render_sample(
+    *,
+    recipe: TrainingRecipe,
+    persistent: Sequence[LanguageRow] | None,
+    events: Sequence[LanguageRow] | None,
+    t: float,
+    sample_idx: int,
+    task: str | None = None,
+    dataset_ctx: Any | None = None,
+) -> RenderedMessages | None:
+    """Render the chat-style messages for a single dataset sample.
+
+    Resolves the recipe's bindings against ``persistent`` and ``events`` rows
+    at frame timestamp ``t``, then expands the recipe's message templates.
+    Returns ``None`` if the resolved sample contains no target message.
+    """
+    persistent_rows = _normalize_rows(persistent or [])
+    event_rows = _normalize_rows(events or [])
+    selected_recipe = _select_recipe(recipe, sample_idx)
+    bindings = _resolve_bindings(
+        selected_recipe,
+        persistent=persistent_rows,
+        events=event_rows,
+        t=t,
+        sample_idx=sample_idx,
+        task=task,
+        dataset_ctx=dataset_ctx,
+    )
+    return _render_message_recipe(selected_recipe, bindings)
+
+
+def _select_recipe(recipe: TrainingRecipe, sample_idx: int) -> TrainingRecipe:
+    """Pick a deterministic blend component for ``sample_idx`` (or return ``recipe``)."""
+    if recipe.blend is None:
+        return recipe
+
+    total_weight = sum(component.weight or 0.0 for component in recipe.blend.values())
+    if total_weight <= 0:
+        raise ValueError("Blend weights must sum to a positive value.")
+
+    digest = hashlib.blake2b(str(sample_idx).encode(), digest_size=8).digest()
+    draw = int.from_bytes(digest, "big") / 2**64 * total_weight
+    cumulative = 0.0
+    last_component: TrainingRecipe | None = None
+    for component in recipe.blend.values():
+        last_component = component
+        cumulative += component.weight or 0.0
+        if draw < cumulative:
+            return component
+    assert last_component is not None
+    return last_component
+
+
+def _resolve_bindings(
+    recipe: TrainingRecipe,
+    *,
+    persistent: Sequence[LanguageRow],
+    events: Sequence[LanguageRow],
+    t: float,
+    sample_idx: int,
+    task: str | None,
+    dataset_ctx: Any | None,
+) -> dict[str, LanguageRow | str | None]:
+    """Resolve every binding in ``recipe`` (plus ``task``) at time ``t``."""
+    bindings: dict[str, LanguageRow | str | None] = {
+        "task": _resolve_task(task, dataset_ctx, persistent=persistent, sample_idx=sample_idx),
+    }
+    specs = {**DEFAULT_BINDINGS, **(recipe.bindings or {})}
+    for name, spec in specs.items():
+        bindings[name] = _resolve_spec(spec, persistent=persistent, events=events, t=t)
+    return bindings
+
+
+def _resolve_task(
+    task: str | None,
+    dataset_ctx: Any | None,
+    *,
+    persistent: Sequence[LanguageRow] = (),
+    sample_idx: int = 0,
+) -> str | None:
+    """Return the task string for ``sample_idx``.
+
+    Resolution order:
+
+    1. Explicit ``task`` override (caller-supplied) wins.
+    2. If ``persistent`` contains rows of style ``task_aug`` (role=user),
+       deterministically pick one by hashing ``sample_idx``, so frames are
+       spread across the rephrasings and each index maps to the same one.
+       This realizes Xiao 2022 / CAST-style task-prompt diversity without
+       changing ``meta/tasks.parquet`` and without forcing recipes to opt
+       in: ``${task}`` automatically picks a rephrasing when one exists,
+       and falls back to the canonical task otherwise. Recipes that want
+       the literal canonical task can override the binding.
+    3. Otherwise read the canonical task from ``dataset_ctx`` (which is
+       backed by ``meta/tasks.parquet``).
+    """
+    if task is not None:
+        return task
+
+    aug_rows = [r for r in persistent if r.get("style") == "task_aug" and r.get("role") == "user"]
+    if aug_rows:
+        # Deterministic, blake2b-based pick keyed on sample_idx so the
+        # rotation is reproducible across runs (Python's built-in ``hash``
+        # is process-randomized).
+        digest = hashlib.blake2b(f"task_aug:{sample_idx}".encode(), digest_size=8).digest()
+        idx = int.from_bytes(digest, "big") % len(aug_rows)
+        chosen = aug_rows[idx].get("content")
+        if chosen:
+            return str(chosen)
+
+    if dataset_ctx is None:
+        return None
+    if isinstance(dataset_ctx, dict):
+        return dataset_ctx.get("task")
+    return getattr(dataset_ctx, "task", None)
+
+
+def _resolve_spec(
+    spec: str,
+    *,
+    persistent: Sequence[LanguageRow],
+    events: Sequence[LanguageRow],
+    t: float,
+) -> LanguageRow | None:
+    """Parse a single binding's resolver expression and dispatch to its function."""
+    match = _RESOLVER_RE.match(spec.strip())
+    if match is None:
+        raise ValueError(f"Invalid resolver expression: {spec!r}")
+    name = match.group("name")
+    kwargs = _parse_resolver_args(match.group("args"))
+    kwargs.pop("t_arg", None)
+
+    if name == "emitted_at":
+        return emitted_at(t, persistent=persistent, events=events, **kwargs)
+    if name == "active_at":
+        return active_at(t, persistent=persistent, **kwargs)
+    if name == "nth_prev":
+        return nth_prev(t, persistent=persistent, **kwargs)
+    if name == "nth_next":
+        return nth_next(t, persistent=persistent, **kwargs)
+    raise ValueError(f"Unknown language resolver: {name!r}")
+
+
+def _parse_resolver_args(args: str) -> dict[str, Any]:
+    """Parse a comma-separated resolver argument list into a kwargs dict."""
+    kwargs: dict[str, Any] = {}
+    if not args.strip():
+        return kwargs
+
+    parts = [part.strip() for part in args.split(",") if part.strip()]
+    for part in parts:
+        if part == "t":
+            kwargs["t_arg"] = True
+            continue
+        if "=" not in part:
+            raise ValueError(f"Invalid resolver argument: {part!r}")
+        key, value = (item.strip() for item in part.split("=", 1))
+        if key == "offset":
+            kwargs[key] = int(value)
+        else:
+            kwargs[key] = value.strip("\"'")
+    return kwargs
+
+
+def _render_message_recipe(
+    recipe: TrainingRecipe,
+    bindings: dict[str, LanguageRow | str | None],
+) -> RenderedMessages | None:
+    """Expand ``recipe.messages`` into rendered chat messages using ``bindings``."""
+    assert recipe.messages is not None
+    messages: list[dict[str, Any]] = []
+    streams: list[str | None] = []
+    target_indices: list[int] = []
+
+    for turn in recipe.messages:
+        if turn.if_present is not None and bindings.get(turn.if_present) is None:
+            continue
+
+        message = {"role": turn.role}
+        if turn.content is not None:
+            message["content"] = _render_content(turn.content, bindings)
+
+        if turn.tool_calls_from is not None:
+            row = bindings.get(turn.tool_calls_from)
+            tool_calls = row.get("tool_calls") if isinstance(row, dict) else None
+            if tool_calls:
+                message["tool_calls"] = copy.deepcopy(tool_calls)
+
+        message_idx = len(messages)
+        messages.append(message)
+        streams.append(turn.stream)
+        if turn.target:
+            target_indices.append(message_idx)
+
+    if not target_indices:
+        return None
+
+    rendered = {
+        "messages": messages,
+        "message_streams": streams,
+        "target_message_indices": target_indices,
+    }
+    _validate_rendered(rendered)
+    return rendered
+
+
+def _render_content(
+    content: str | list[dict[str, Any]],
+    bindings: dict[str, LanguageRow | str | None],
+) -> str | list[dict[str, Any]]:
+    """Substitute bindings into a string or each string field of multimodal blocks."""
+    if isinstance(content, str):
+        return _substitute(content, bindings)
+
+    rendered_blocks = []
+    for block in content:
+        rendered_block = copy.deepcopy(block)
+        for key, value in rendered_block.items():
+            if isinstance(value, str):
+                rendered_block[key] = _substitute(value, bindings)
+        rendered_blocks.append(rendered_block)
+    return rendered_blocks
+
+
+def _substitute(template: str, bindings: dict[str, LanguageRow | str | None]) -> str:
+    """Replace ``${name}`` placeholders in ``template`` with their bound values."""
+
+    def replace(match: re.Match[str]) -> str:
+        """Resolve a single ``${name}`` match to its bound string value."""
+        name = match.group(1)
+        if name not in bindings:
+            raise ValueError(f"Unknown template binding: {name!r}")
+        value = bindings[name]
+        if value is None:
+            return ""
+        if isinstance(value, dict):
+            content = value.get("content")
+            return "" if content is None else str(content)
+        return str(value)
+
+    return PLACEHOLDER_RE.sub(replace, template)
+
+
+def _validate_rendered(rendered: RenderedMessages) -> None:
+    """Sanity-check the rendered output for stream/target alignment."""
+    messages = rendered["messages"]
+    streams = rendered["message_streams"]
+    target_indices = rendered["target_message_indices"]
+
+    if len(streams) != len(messages):
+        raise ValueError("message_streams must be aligned with messages.")
+    if not target_indices:
+        raise ValueError("Rendered samples must contain at least one target message.")
+    for idx in target_indices:
+        if idx < 0 or idx >= len(messages):
+            raise ValueError(f"Target message index {idx} is out of bounds.")
+    # ``stream`` is enforced non-None at MessageTurn construction time
+    # (see ``MessageTurn.__post_init__``), so a missing stream here would
+    # mean the dataclass invariant was bypassed; no need to re-check.
+
+
+def _nth_relative(
+    name: str,
+    t: float,
+    persistent: Sequence[LanguageRow],
+    style: str | None,
+    offset: int,
+    role: str | None,
+    tool_name: str | None,
+    camera: str | None,
+) -> LanguageRow | None:
+    """Shared body for ``nth_prev`` / ``nth_next`` with signed ``offset``."""
+    _validate_persistent_resolver(name, style)
+    if abs(offset) < 1:
+        raise ValueError(f"{name} offset must be non-zero.")
+
+    rows = sorted(
+        _matching_rows(persistent, style=style, role=role, tool_name=tool_name, camera=camera),
+        key=_row_sort_key,
+    )
+    if not rows:
+        return None
+
+    anchor_idx = None
+    for idx, row in enumerate(rows):
+        if _timestamp(row) <= t:
+            anchor_idx = idx
+        else:
+            break
+
+    target_idx = (offset - 1 if offset > 0 else None) if anchor_idx is None else anchor_idx + offset
+
+    if target_idx is None or target_idx < 0 or target_idx >= len(rows):
+        return None
+    return rows[target_idx]
+
+
+def _validate_persistent_resolver(name: str, style: str | None) -> None:
+    """Reject calls with missing or event-only ``style`` for persistent resolvers."""
+    if style is None:
+        raise ValueError(f"{name} requires a persistent style.")
+    if column_for_style(style) != LANGUAGE_PERSISTENT:
+        raise ValueError(f"{name} cannot be used with event-only style {style!r}.")
+
+
+def _matching_rows(
+    rows: Sequence[LanguageRow],
+    *,
+    style: str | None,
+    role: str | None,
+    tool_name: str | None,
+    camera: str | None,
+) -> list[LanguageRow]:
+    """Return ``rows`` filtered by optional ``style``/``role``/``tool_name``/``camera`` selectors."""
+    return [
+        row
+        for row in rows
+        if (style is None or row.get("style") == style)
+        and (role is None or row.get("role") == role)
+        and (tool_name is None or _row_has_tool_name(row, tool_name))
+        and (camera is None or row.get("camera") == camera)
+    ]
+
+
+def _select_one(
+    rows: Sequence[LanguageRow],
+    *,
+    style: str | None,
+    role: str | None,
+    tool_name: str | None,
+    camera: str | None,
+) -> LanguageRow | None:
+    """Return the single matching row, or raise if the resolver is ambiguous.
+
+    Multiple matches always raise — even when the caller already passed
+    some selectors — because remaining ambiguity means the data has
+    several rows that look identical to the resolver and the caller
+    needs to pin down a specific one (e.g. add ``camera=...`` for VQA
+    rows shared across cameras).
+    """
+    if not rows:
+        return None
+    if len(rows) > 1:
+        raise ValueError(
+            f"Ambiguous resolver for style={style!r} role={role!r} "
+            f"tool_name={tool_name!r} camera={camera!r}: {len(rows)} matching rows. "
+            f"Add a selector that distinguishes them."
+        )
+    return rows[0]
+
+
+def _row_sort_key(row: LanguageRow) -> tuple[float, str, str]:
+    """Stable sort key for both persistent and event rows.
+
+    Event rows lack ``timestamp`` (it is implicit in the frame), so default
+    to ``0.0`` — within a single frame all event rows share the same sort
+    bucket and are tiebroken by ``(style, role)``.
+    """
+    timestamp = row.get("timestamp")
+    ts = float(unwrap_scalar(timestamp)) if timestamp is not None else 0.0
+    return (ts, row.get("style") or "", row.get("role") or "")
+
+
+def _timestamp(row: LanguageRow) -> float:
+    """Extract a row's ``timestamp`` as a Python float (unwrapping numpy scalars)."""
+    return float(unwrap_scalar(row["timestamp"]))
+
+
+def _row_has_tool_name(row: LanguageRow, tool_name: str) -> bool:
+    """Return ``True`` if any of the row's tool calls invokes ``tool_name``."""
+    for tool_call in row.get("tool_calls") or []:
+        if isinstance(tool_call, str):
+            continue
+        function = tool_call.get("function") if isinstance(tool_call, dict) else None
+        if isinstance(function, dict) and function.get("name") == tool_name:
+            return True
+    return False
+
+
+def _normalize_rows(rows: Sequence[Any]) -> list[LanguageRow]:
+    """Convert pyarrow scalars / mappings into a fresh list of plain dict rows."""
+    normalized = []
+    for row in rows:
+        if row is None:
+            continue
+        if hasattr(row, "as_py"):
+            row = row.as_py()
+        if not isinstance(row, dict):
+            raise TypeError(f"Language rows must be dictionaries, got {type(row).__name__}.")
+        normalized.append(dict(row))
+    return normalized
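Reviewer note on `language_render.py`: the `active_at` and `nth_prev` / `nth_next` semantics are easiest to check on a tiny timeline. The sketch below is a simplified standalone re-implementation under stated assumptions. It filters on `style` only, works on plain dict rows, and omits the ambiguity check that `_select_one` performs in the patch. The `subtasks` rows are made-up illustration data.

```python
def active_at(t, rows, style):
    """Return the latest row of ``style`` whose timestamp is <= t, else None."""
    matches = [r for r in rows if r["style"] == style and r["timestamp"] <= t]
    if not matches:
        return None
    return max(matches, key=lambda r: r["timestamp"])


def nth_relative(t, rows, style, offset):
    """Return the row ``offset`` positions away from the one active at ``t``."""
    if offset == 0:
        raise ValueError("offset must be non-zero")
    ordered = sorted(
        (r for r in rows if r["style"] == style), key=lambda r: r["timestamp"]
    )
    # Index of the row active at t, or None when t precedes every row.
    anchor = None
    for idx, row in enumerate(ordered):
        if row["timestamp"] <= t:
            anchor = idx
        else:
            break
    if anchor is None:
        # Before the first row, only forward lookups can resolve.
        target = offset - 1 if offset > 0 else None
    else:
        target = anchor + offset
    if target is None or not 0 <= target < len(ordered):
        return None
    return ordered[target]


# Hypothetical subtask timeline for one episode.
subtasks = [
    {"style": "subtask", "timestamp": 0.0, "content": "reach"},
    {"style": "subtask", "timestamp": 2.0, "content": "grasp"},
    {"style": "subtask", "timestamp": 5.0, "content": "lift"},
]
current = active_at(3.0, subtasks, "subtask")          # row active at t=3.0
previous = nth_relative(3.0, subtasks, "subtask", -1)  # nth_prev with offset=1
upcoming = nth_relative(3.0, subtasks, "subtask", 1)   # nth_next with offset=1
```

With `t` before the first row there is no anchor, so a forward lookup with offset 1 resolves to the first row and any backward lookup resolves to `None`, matching the `target_idx` expression in `_nth_relative`.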
diff --git a/src/lerobot/datasets/utils.py b/src/lerobot/datasets/utils.py
index 715bd2f9b..de91978ea 100644
--- a/src/lerobot/datasets/utils.py
+++ b/src/lerobot/datasets/utils.py
@@ -88,7 +88,6 @@ VIDEO_DIR = "videos"
 
 CHUNK_FILE_PATTERN = "chunk-{chunk_index:03d}/file-{file_index:03d}"
 DEFAULT_TASKS_PATH = "meta/tasks.parquet"
-DEFAULT_SUBTASKS_PATH = "meta/subtasks.parquet"
 DEFAULT_EPISODES_PATH = EPISODES_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
 DEFAULT_DATA_PATH = DATA_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
 DEFAULT_VIDEO_PATH = VIDEO_DIR + "/{video_key}/" + CHUNK_FILE_PATTERN + ".mp4"
@@ -130,6 +129,9 @@ class DatasetInfo:
     # Optional metadata
     robot_type: str | None = None
     splits: dict[str, str] = field(default_factory=dict)
+    # OpenAI-style tool schemas declared by the dataset. ``None`` means the
+    # dataset doesn't declare any — readers fall back to ``DEFAULT_TOOLS``.
+    tools: list[dict] | None = None
 
     def __post_init__(self) -> None:
         # Coerce feature shapes from list to tuple — JSON deserialisation
@@ -151,11 +153,15 @@ class DatasetInfo:
         """Return a JSON-serialisable dict.
 
         Converts tuple shapes back to lists so ``json.dump`` can handle them.
+        Drops ``tools`` when unset so existing datasets keep a clean
+        ``info.json``.
         """
         d = dataclasses.asdict(self)
         for ft in d["features"].values():
             if isinstance(ft.get("shape"), tuple):
                 ft["shape"] = list(ft["shape"])
+        if d.get("tools") is None:
+            d.pop("tools", None)
         return d
 
     @classmethod
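
Note on the `to_dict` change above: the sketch below shows the intended effect on the serialised output. `MiniInfo` is a hypothetical stand-in, not the real `DatasetInfo`; it mirrors only the tuple-to-list conversion and the rule that `tools` is omitted when unset.

```python
from __future__ import annotations

import dataclasses
import json
from dataclasses import dataclass, field


@dataclass
class MiniInfo:
    """Hypothetical stand-in for DatasetInfo (illustration only)."""

    fps: int
    features: dict[str, dict] = field(default_factory=dict)
    tools: list[dict] | None = None

    def to_dict(self) -> dict:
        d = dataclasses.asdict(self)
        # asdict keeps tuples as tuples; convert shapes back to lists.
        for ft in d["features"].values():
            if isinstance(ft.get("shape"), tuple):
                ft["shape"] = list(ft["shape"])
        # Omit the key entirely when no tools are declared.
        if d.get("tools") is None:
            d.pop("tools", None)
        return d


plain = MiniInfo(fps=30, features={"state": {"dtype": "float32", "shape": (2,)}})
with_tools = MiniInfo(fps=30, tools=[{"type": "function", "function": {"name": "say"}}])

print(json.dumps(plain.to_dict()))
print(json.dumps(with_tools.to_dict()))
```

Only the second object serialises a `tools` key, which is what keeps `info.json` unchanged for datasets that never declare tools.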
diff --git a/src/lerobot/processor/__init__.py b/src/lerobot/processor/__init__.py
index 3688a4b8c..fe35af4b4 100644
--- a/src/lerobot/processor/__init__.py
+++ b/src/lerobot/processor/__init__.py
@@ -95,6 +95,13 @@ from .relative_action_processor import (
 from .rename_processor import RenameObservationsProcessorStep, rename_stats
 from .tokenizer_processor import ActionTokenizerProcessorStep, TokenizerProcessorStep
 
+# RenderMessagesStep is intentionally NOT re-exported here: it pulls in
+# `lerobot.datasets.language`, which requires the `[dataset]` extra
+# (`datasets`, `pyarrow`). Importing it from the processor package would
+# break every base-install consumer of `lerobot.processor`. Users that
+# need it import directly:
+#   from lerobot.processor.render_messages_processor import RenderMessagesStep
+
 __all__ = [
     "ActionProcessorStep",
     "AddTeleopActionAsComplimentaryDataStep",
diff --git a/src/lerobot/processor/batch_processor.py b/src/lerobot/processor/batch_processor.py
index eb7db255a..669c68a0a 100644
--- a/src/lerobot/processor/batch_processor.py
+++ b/src/lerobot/processor/batch_processor.py
@@ -174,6 +174,24 @@ class AddBatchDimensionComplementaryDataStep(ComplementaryDataProcessorStep):
             task_index_value = complementary_data["task_index"]
             if isinstance(task_index_value, Tensor) and task_index_value.dim() == 0:
                 complementary_data["task_index"] = task_index_value.unsqueeze(0)
+
+        complementary_data.pop("language_persistent", None)
+        complementary_data.pop("language_events", None)
+
+        if "messages" in complementary_data:
+            messages = complementary_data["messages"]
+            if isinstance(messages, list) and (not messages or isinstance(messages[0], dict)):
+                complementary_data["messages"] = [messages]
+
+        if "message_streams" in complementary_data:
+            streams = complementary_data["message_streams"]
+            if isinstance(streams, list) and (not streams or isinstance(streams[0], str)):
+                complementary_data["message_streams"] = [streams]
+
+        if "target_message_indices" in complementary_data:
+            indices = complementary_data["target_message_indices"]
+            if isinstance(indices, list) and (not indices or isinstance(indices[0], int)):
+                complementary_data["target_message_indices"] = [indices]
         return complementary_data
 
     def transform_features(
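
Note on the batch-dimension hunk above: the three new branches follow one pattern, sketched here as a single loop. `add_batch_dim_to_messages` is a hypothetical helper written for illustration, not part of the processor. It wraps a value only when it looks unbatched (an empty list, or a list whose first element has the per-sample element type).

```python
from typing import Any

# Element type expected inside an unbatched value of each key.
ELEMENT_TYPES = {"messages": dict, "message_streams": str, "target_message_indices": int}


def add_batch_dim_to_messages(complementary_data: dict[str, Any]) -> dict[str, Any]:
    """Wrap unbatched message fields in an outer list of length 1."""
    out = dict(complementary_data)
    for key, element_type in ELEMENT_TYPES.items():
        value = out.get(key)
        if isinstance(value, list) and (not value or isinstance(value[0], element_type)):
            out[key] = [value]
    return out


single = {
    "messages": [{"role": "user", "content": "tidy"}],
    "message_streams": ["high_level"],
    "target_message_indices": [],
}
batched = add_batch_dim_to_messages(single)
print(batched)
```

Applying the helper a second time leaves `batched` unchanged, because the first element of each value is then a list rather than a dict, str, or int.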
diff --git a/src/lerobot/processor/converters.py b/src/lerobot/processor/converters.py
index ffdf0098c..faa4d5cd9 100644
--- a/src/lerobot/processor/converters.py
+++ b/src/lerobot/processor/converters.py
@@ -153,26 +153,30 @@ def from_tensor_to_numpy(x: torch.Tensor | Any) -> np.ndarray | float | int | An
     return x
 
 
+_COMPLEMENTARY_KEYS = (
+    "task",
+    "index",
+    "task_index",
+    "episode_index",
+    "timestamp",
+    "language_persistent",
+    "language_events",
+    "messages",
+    "message_streams",
+    "target_message_indices",
+)
+
+
 def _extract_complementary_data(batch: dict[str, Any]) -> dict[str, Any]:
-    """
-    Extract complementary data from a batch dictionary.
+    """Extract complementary data from a batch dictionary.
 
-    This includes padding flags, task description, and indices.
-
-    Args:
-        batch: The batch dictionary.
-
-    Returns:
-        A dictionary with the extracted complementary data.
+    Includes padding flags (any key containing ``_is_pad``) plus the fixed
+    set of metadata / language keys defined in ``_COMPLEMENTARY_KEYS`` —
+    each only when present in ``batch``.
     """
     pad_keys = {k: v for k, v in batch.items() if "_is_pad" in k}
-    task_key = {"task": batch["task"]} if "task" in batch else {}
-    subtask_key = {"subtask": batch["subtask"]} if "subtask" in batch else {}
-    index_key = {"index": batch["index"]} if "index" in batch else {}
-    task_index_key = {"task_index": batch["task_index"]} if "task_index" in batch else {}
-    episode_index_key = {"episode_index": batch["episode_index"]} if "episode_index" in batch else {}
-
-    return {**pad_keys, **task_key, **subtask_key, **index_key, **task_index_key, **episode_index_key}
+    extras = {k: batch[k] for k in _COMPLEMENTARY_KEYS if k in batch}
+    return {**pad_keys, **extras}
 
 
 def create_transition(
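
Note on `_extract_complementary_data` above: a stand-alone sketch of the new behaviour, with the key tuple copied from the hunk and a made-up batch. It shows the two visible consequences of the refactor: `timestamp` and the language keys are now carried over, and `subtask` no longer is.

```python
from typing import Any

# Copied from the hunk above for illustration.
COMPLEMENTARY_KEYS = (
    "task",
    "index",
    "task_index",
    "episode_index",
    "timestamp",
    "language_persistent",
    "language_events",
    "messages",
    "message_streams",
    "target_message_indices",
)


def extract_complementary_data(batch: dict[str, Any]) -> dict[str, Any]:
    """Keep padding flags plus whichever fixed keys are present."""
    pad_keys = {k: v for k, v in batch.items() if "_is_pad" in k}
    extras = {k: batch[k] for k in COMPLEMENTARY_KEYS if k in batch}
    return {**pad_keys, **extras}


# Hypothetical batch; "subtask" and "action" are deliberately not extracted.
batch = {
    "action": [0.1],
    "action_is_pad": False,
    "task": "tidy",
    "subtask": "reach",
    "timestamp": 0.5,
}
print(extract_complementary_data(batch))
```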
diff --git a/src/lerobot/processor/render_messages_processor.py b/src/lerobot/processor/render_messages_processor.py
new file mode 100644
index 000000000..140592f0e
--- /dev/null
+++ b/src/lerobot/processor/render_messages_processor.py
@@ -0,0 +1,84 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any
+
+from lerobot.configs import PipelineFeatureType, PolicyFeature
+from lerobot.configs.recipe import TrainingRecipe
+from lerobot.datasets.language import LANGUAGE_EVENTS, LANGUAGE_PERSISTENT
+from lerobot.datasets.language_render import render_sample
+from lerobot.types import EnvTransition, TransitionKey
+from lerobot.utils.utils import unwrap_scalar
+
+from .pipeline import ProcessorStep, ProcessorStepRegistry
+
+
+@dataclass
+@ProcessorStepRegistry.register(name="render_messages_processor")
+class RenderMessagesStep(ProcessorStep):
+    """Processor step that turns raw language columns into rendered chat messages.
+
+    Reads ``language_persistent`` and ``language_events`` from the transition's
+    complementary data, renders them through ``recipe`` at the sample timestamp,
+    and replaces the raw columns with the resulting ``messages`` /
+    ``message_streams`` / ``target_message_indices`` keys.
+    """
+
+    recipe: TrainingRecipe
+    dataset_ctx: Any | None = None
+
+    def __call__(self, transition: EnvTransition) -> EnvTransition | None:
+        """Render messages for a single transition; return ``None`` to drop it."""
+        complementary_data = transition.get(TransitionKey.COMPLEMENTARY_DATA) or {}
+        persistent = complementary_data.get(LANGUAGE_PERSISTENT) or []
+        events = complementary_data.get(LANGUAGE_EVENTS) or []
+
+        if not persistent and not events:
+            return transition
+
+        timestamp = complementary_data.get("timestamp")
+        if timestamp is None:
+            raise KeyError("RenderMessagesStep requires sample timestamp in complementary data.")
+
+        sample_idx = complementary_data.get("index", 0)
+        rendered = render_sample(
+            recipe=self.recipe,
+            persistent=persistent,
+            events=events,
+            t=unwrap_scalar(timestamp),
+            sample_idx=int(unwrap_scalar(sample_idx)),
+            task=complementary_data.get("task"),
+            dataset_ctx=self.dataset_ctx,
+        )
+        if rendered is None:
+            return None
+
+        new_transition = transition.copy()
+        new_complementary_data = dict(complementary_data)
+        new_complementary_data.pop(LANGUAGE_PERSISTENT, None)
+        new_complementary_data.pop(LANGUAGE_EVENTS, None)
+        new_complementary_data.update(rendered)
+        new_transition[TransitionKey.COMPLEMENTARY_DATA] = new_complementary_data
+        return new_transition
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        """Pass features through unchanged; rendering only touches complementary data."""
+        return features
diff --git a/src/lerobot/scripts/lerobot_train.py b/src/lerobot/scripts/lerobot_train.py
index 55a8cc935..463668eb2 100644
--- a/src/lerobot/scripts/lerobot_train.py
+++ b/src/lerobot/scripts/lerobot_train.py
@@ -48,6 +48,7 @@ from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
 from lerobot.optim.factory import make_optimizer_and_scheduler
 from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
 from lerobot.rewards import make_reward_pre_post_processors
+from lerobot.utils.collate import lerobot_collate_fn
 from lerobot.utils.import_utils import register_third_party_plugins
 from lerobot.utils.logging_utils import AverageMeter, MetricsTracker
 from lerobot.utils.random_utils import set_seed
@@ -401,6 +402,10 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
         shuffle = True
         sampler = None
 
+    # Only swap in the language-aware collate when the dataset actually
+    # declares language columns; otherwise stay on PyTorch's default
+    # collate so non-language training runs are unaffected.
+    collate_fn = lerobot_collate_fn if dataset.meta.has_language_columns else None
     dataloader = torch.utils.data.DataLoader(
         dataset,
         num_workers=cfg.num_workers,
@@ -409,6 +414,7 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
         sampler=sampler,
         pin_memory=device.type == "cuda",
         drop_last=False,
+        collate_fn=collate_fn,
         prefetch_factor=cfg.prefetch_factor if cfg.num_workers > 0 else None,
         persistent_workers=cfg.persistent_workers and cfg.num_workers > 0,
     )
diff --git a/src/lerobot/utils/collate.py b/src/lerobot/utils/collate.py
new file mode 100644
index 000000000..fce7e6b42
--- /dev/null
+++ b/src/lerobot/utils/collate.py
@@ -0,0 +1,65 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from typing import Any
+
+from torch.utils.data._utils.collate import default_collate
+
+from lerobot.datasets.language import LANGUAGE_COLUMNS
+
+_PYTHON_LIST_KEYS = {"messages", "message_streams", "target_message_indices"}
+
+
+def lerobot_collate_fn(batch: list[dict[str, Any] | None]) -> dict[str, Any] | None:
+    """Collate function that preserves rendered-message fields as Python lists.
+
+    Drops ``None`` samples (e.g. recipes that yielded no target message), keeps
+    rendered-message fields as plain Python lists, discards any raw language
+    columns, and delegates every other key to PyTorch's ``default_collate``.
+    """
+    batch = [sample for sample in batch if sample is not None]
+    if not batch:
+        return None
+
+    # All-or-nothing per key: a partial-presence batch (e.g. half the samples
+    # carry `messages` and half don't) is a real bug in the upstream
+    # rendering step — silently filtering would hand downstream consumers a
+    # preserved list shorter than the tensor batch. Raise instead so the
+    # mismatch surfaces at the boundary.
+    preserved: dict[str, list[Any]] = {}
+    for key in _PYTHON_LIST_KEYS:
+        presence = [key in sample for sample in batch]
+        if not any(presence):
+            continue
+        if not all(presence):
+            raise ValueError(
+                f"Inconsistent batch: {sum(presence)}/{len(batch)} samples carry {key!r}; "
+                f"every sample in a batch must agree."
+            )
+        preserved[key] = [sample[key] for sample in batch]
+    tensorizable = [
+        {
+            key: value
+            for key, value in sample.items()
+            if key not in _PYTHON_LIST_KEYS and key not in LANGUAGE_COLUMNS
+        }
+        for sample in batch
+    ]
+    collated = default_collate(tensorizable)
+    collated.update(preserved)
+    return collated
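
Note on the all-or-nothing check in `lerobot_collate_fn` above: a torch-free sketch of just the presence logic, so the failure mode can be seen in isolation. `split_preserved` is a hypothetical helper; the real function additionally filters `LANGUAGE_COLUMNS` and hands the remainder to `default_collate`.

```python
from typing import Any

PYTHON_LIST_KEYS = ("messages", "message_streams", "target_message_indices")


def split_preserved(batch: list[dict[str, Any]]) -> tuple[dict[str, list[Any]], list[dict[str, Any]]]:
    """Return (per-key lists kept as Python objects, samples without those keys)."""
    preserved: dict[str, list[Any]] = {}
    for key in PYTHON_LIST_KEYS:
        presence = [key in sample for sample in batch]
        if not any(presence):
            continue
        if not all(presence):
            # A partially present key would yield a list shorter than the batch.
            raise ValueError(f"Inconsistent batch: {sum(presence)}/{len(batch)} samples carry {key!r}")
        preserved[key] = [sample[key] for sample in batch]
    remainder = [{k: v for k, v in sample.items() if k not in PYTHON_LIST_KEYS} for sample in batch]
    return preserved, remainder


good = [{"state": 0.0, "messages": ["a"]}, {"state": 1.0, "messages": ["b"]}]
preserved, remainder = split_preserved(good)
print(preserved, remainder)

bad = [{"state": 0.0, "messages": ["a"]}, {"state": 1.0}]
try:
    split_preserved(bad)
except ValueError as err:
    print(err)
```

Raising on the mixed batch keeps the preserved lists the same length as the tensor batch, which is the invariant downstream consumers rely on.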
diff --git a/src/lerobot/utils/utils.py b/src/lerobot/utils/utils.py
index 2574f1fa3..6aad0c503 100644
--- a/src/lerobot/utils/utils.py
+++ b/src/lerobot/utils/utils.py
@@ -160,6 +160,25 @@ def has_method(cls: object, method_name: str) -> bool:
     return hasattr(cls, method_name) and callable(getattr(cls, method_name))
 
 
+def unwrap_scalar(value: Any) -> Any:
+    """Unwrap a tensor / numpy scalar / single-element list into a Python scalar.
+
+    Tensors and numpy scalars expose ``.item()``; single-element lists are
+    unwrapped recursively. Anything else is returned unchanged. Centralized
+    here so the language renderer and processor steps share one definition.
+
+    Raises:
+        ValueError: If ``value`` is a list with zero or multiple elements.
+    """
+    if hasattr(value, "item"):
+        return value.item()
+    if isinstance(value, list):
+        if len(value) != 1:
+            raise ValueError(f"Expected a scalar, got list of length {len(value)}: {value!r}")
+        return unwrap_scalar(value[0])
+    return value
+
+
 def is_valid_numpy_dtype_string(dtype_str: str) -> bool:
     """
     Return True if a given string can be converted to a numpy dtype.
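
Note on `unwrap_scalar` in the hunk above: a short usage sketch with the function body copied from the patch. `FakeScalar` is a hypothetical stand-in for any object exposing `.item()` (such as a tensor or numpy scalar), used here so the example needs no third-party imports.

```python
from typing import Any


def unwrap_scalar(value: Any) -> Any:
    """Copy of the helper from the patch, for illustration."""
    if hasattr(value, "item"):
        return value.item()
    if isinstance(value, list):
        if len(value) != 1:
            raise ValueError(f"Expected a scalar, got list of length {len(value)}: {value!r}")
        return unwrap_scalar(value[0])
    return value


class FakeScalar:
    """Hypothetical object with an ``item()`` method, like a 0-d tensor."""

    def __init__(self, value: float) -> None:
        self._value = value

    def item(self) -> float:
        return self._value


print(unwrap_scalar(FakeScalar(0.5)))    # value returned by .item()
print(unwrap_scalar([[3]]))              # nested single-element lists unwrap recursively
print(unwrap_scalar("tidy"))             # anything else passes through unchanged

try:
    unwrap_scalar([])
except ValueError as err:
    print(err)
```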
diff --git a/tests/configs/test_recipe.py b/tests/configs/test_recipe.py
new file mode 100644
index 000000000..b4954efbf
--- /dev/null
+++ b/tests/configs/test_recipe.py
@@ -0,0 +1,168 @@
+#!/usr/bin/env python
+
+from pathlib import Path
+from textwrap import dedent
+
+import pytest
+
+from lerobot.configs.recipe import MessageTurn, TrainingRecipe, load_recipe
+
+
+def _minimal_message_turn(content: str = "${task}") -> MessageTurn:
+    return MessageTurn(role="user", content=content, stream="high_level")
+
+
+def _minimal_target_turn() -> MessageTurn:
+    return MessageTurn(role="assistant", content="ok", stream="high_level", target=True)
+
+
+# ── Message-recipe validation ────────────────────────────────────────
+
+
+def test_message_recipe_validates_unknown_binding():
+    with pytest.raises(ValueError, match="unknown binding"):
+        TrainingRecipe(
+            messages=[
+                MessageTurn(role="user", content="${missing}", stream="high_level"),
+                _minimal_target_turn(),
+            ]
+        )
+
+
+def test_message_turn_requires_a_stream():
+    """Every turn must declare a stream — None is rejected at construction.
+
+    Previously this only failed at render time (``_validate_rendered``);
+    catching it here means a malformed recipe YAML errors at load instead
+    of at the first training sample.
+    """
+    with pytest.raises(ValueError, match="missing a stream"):
+        MessageTurn(role="user", content="${task}")
+
+
+def test_message_recipe_requires_at_least_one_target():
+    with pytest.raises(ValueError, match="target"):
+        TrainingRecipe(
+            messages=[
+                _minimal_message_turn(),
+                MessageTurn(role="assistant", content="no target", stream="high_level"),
+            ]
+        )
+
+
+def test_recipe_rejects_both_messages_and_blend():
+    with pytest.raises(ValueError, match="only one"):
+        TrainingRecipe(
+            messages=[_minimal_message_turn(), _minimal_target_turn()],
+            blend={"a": TrainingRecipe(weight=1.0, messages=[_minimal_target_turn()])},
+        )
+
+
+def test_recipe_rejects_neither_messages_nor_blend():
+    with pytest.raises(ValueError, match="must set one"):
+        TrainingRecipe()
+
+
+# ── Blend validation ─────────────────────────────────────────────────
+
+
+def test_blend_must_be_non_empty():
+    with pytest.raises(ValueError, match="at least one component"):
+        TrainingRecipe(blend={})
+
+
+def test_blend_component_must_define_weight():
+    with pytest.raises(ValueError, match="weight"):
+        TrainingRecipe(blend={"a": TrainingRecipe(messages=[_minimal_target_turn()])})
+
+
+def test_blend_component_weight_must_be_positive():
+    with pytest.raises(ValueError, match="positive weight"):
+        TrainingRecipe(blend={"a": TrainingRecipe(weight=0.0, messages=[_minimal_target_turn()])})
+
+
+def test_blend_component_must_define_messages():
+    # A bare TrainingRecipe(weight=1.0) would itself raise; build it without
+    # going through __post_init__ to exercise the blend-level validator.
+    bad = TrainingRecipe.__new__(TrainingRecipe)
+    bad.messages = None
+    bad.bindings = None
+    bad.blend = None
+    bad.weight = 1.0
+    with pytest.raises(ValueError, match="must define messages"):
+        TrainingRecipe(blend={"a": bad})
+
+
+def test_blend_components_cannot_themselves_define_a_blend():
+    inner = TrainingRecipe(blend={"x": TrainingRecipe(weight=1.0, messages=[_minimal_target_turn()])})
+    # Force-bypass the inner component's normal validation so the test
+    # exercises the outer blend's "no nested blends" rule directly.
+    nested = TrainingRecipe.__new__(TrainingRecipe)
+    nested.messages = None
+    nested.bindings = None
+    nested.blend = inner.blend
+    nested.weight = 1.0
+    with pytest.raises(ValueError, match="cannot itself define a blend"):
+        TrainingRecipe(blend={"outer": nested})
+
+
+# ── from_dict / from_yaml round-trips ────────────────────────────────
+
+
+def test_from_dict_with_nested_blend():
+    recipe = TrainingRecipe.from_dict(
+        {
+            "blend": {
+                "a": {
+                    "weight": 1.0,
+                    "messages": [
+                        {"role": "user", "content": "${task}", "stream": "high_level"},
+                        {"role": "assistant", "content": "a", "stream": "high_level", "target": True},
+                    ],
+                },
+                "b": {
+                    "weight": 2.0,
+                    "messages": [
+                        {"role": "user", "content": "${task}", "stream": "high_level"},
+                        {"role": "assistant", "content": "b", "stream": "high_level", "target": True},
+                    ],
+                },
+            }
+        }
+    )
+    assert recipe.blend is not None
+    assert set(recipe.blend) == {"a", "b"}
+    assert recipe.blend["b"].weight == 2.0
+    # Inner messages were promoted to MessageTurn instances.
+    assert isinstance(recipe.blend["a"].messages[0], MessageTurn)
+
+
+def test_from_yaml_round_trips_through_load_recipe(tmp_path: Path):
+    yaml_text = dedent(
+        """
+        bindings:
+          custom: "active_at(t, style=subtask)"
+        messages:
+          - {role: user, content: "${task}: ${custom}", stream: high_level}
+          - {role: assistant, content: "ok", stream: high_level, target: true}
+        """
+    ).strip()
+    path = tmp_path / "recipe.yaml"
+    path.write_text(yaml_text)
+
+    via_classmethod = TrainingRecipe.from_yaml(path)
+    via_helper = load_recipe(path)
+
+    assert via_classmethod.bindings == {"custom": "active_at(t, style=subtask)"}
+    assert via_classmethod.messages[1].target is True
+    # ``load_recipe`` is just a wrapper, but assert the two paths agree
+    # on the structural result so a future divergence is caught here.
+    assert via_helper.bindings == via_classmethod.bindings
+    assert len(via_helper.messages) == len(via_classmethod.messages)
+
+
+def test_from_yaml_rejects_non_mapping(tmp_path: Path):
+    path = tmp_path / "bad.yaml"
+    path.write_text("- just\n- a\n- list\n")
+    with pytest.raises(ValueError, match="mapping at the top level"):
+        TrainingRecipe.from_yaml(path)
diff --git a/tests/datasets/test_dataset_metadata.py b/tests/datasets/test_dataset_metadata.py
index 6c784c90b..171d8af8b 100644
--- a/tests/datasets/test_dataset_metadata.py
+++ b/tests/datasets/test_dataset_metadata.py
@@ -385,3 +385,140 @@ def test_finalize_flushes_buffered_metadata(tmp_path):
     assert episodes_dir.exists()
     parquet_files = list(episodes_dir.rglob("*.parquet"))
     assert len(parquet_files) > 0
+
+
+# ── Tools accessor ───────────────────────────────────────────────────
+
+
+def test_tools_falls_back_to_default_when_info_has_no_tools_field(tmp_path):
+    """meta.tools returns DEFAULT_TOOLS when info.json doesn't declare any."""
+    from lerobot.datasets.language import DEFAULT_TOOLS
+
+    root = tmp_path / "no_tools"
+    meta = LeRobotDatasetMetadata.create(
+        repo_id="test/no_tools",
+        fps=DEFAULT_FPS,
+        features=SIMPLE_FEATURES,
+        root=root,
+        use_videos=False,
+    )
+
+    assert meta.tools == DEFAULT_TOOLS
+    # info.json on disk should NOT include a `tools` key for clean datasets
+    with open(root / INFO_PATH) as f:
+        info_on_disk = json.load(f)
+    assert "tools" not in info_on_disk
+
+
+def test_tools_reads_declared_tools_from_info_json(tmp_path):
+    """A `tools` list written into info.json survives load → meta.tools.
+
+    Regression test for the bug where ``DatasetInfo.from_dict`` silently
+    dropped the ``tools`` key (no matching dataclass field), so
+    ``meta.tools`` always returned ``DEFAULT_TOOLS`` regardless of
+    what was on disk.
+    """
+    from lerobot.datasets.io_utils import load_info
+
+    root = tmp_path / "with_tools"
+    meta = LeRobotDatasetMetadata.create(
+        repo_id="test/with_tools",
+        fps=DEFAULT_FPS,
+        features=SIMPLE_FEATURES,
+        root=root,
+        use_videos=False,
+    )
+
+    custom_tool = {
+        "type": "function",
+        "function": {
+            "name": "record_observation",
+            "description": "Capture a still image.",
+            "parameters": {
+                "type": "object",
+                "properties": {"label": {"type": "string"}},
+                "required": ["label"],
+            },
+        },
+    }
+    info_path = root / INFO_PATH
+    with open(info_path) as f:
+        raw = json.load(f)
+    raw["tools"] = [custom_tool]
+    with open(info_path, "w") as f:
+        json.dump(raw, f)
+
+    # Reload info from disk and rebind it on the metadata object
+    meta.info = load_info(root)
+    assert meta.tools == [custom_tool]
+
+
+def test_tools_round_trip_through_dataset_info(tmp_path):
+    """A `tools` list survives DatasetInfo.from_dict / to_dict."""
+    from lerobot.datasets.utils import DatasetInfo
+
+    raw = {
+        "codebase_version": "v3.1",
+        "fps": 30,
+        "features": SIMPLE_FEATURES,
+        "tools": [{"type": "function", "function": {"name": "say"}}],
+    }
+    info = DatasetInfo.from_dict(raw)
+    assert info.tools == raw["tools"]
+    assert info.to_dict()["tools"] == raw["tools"]
+
+
+def test_tools_setter_persists_to_info_json_and_reloads(tmp_path):
+    """Assigning meta.tools writes info.json and reloads meta.info."""
+    from lerobot.datasets.io_utils import load_info
+
+    root = tmp_path / "set_tools"
+    meta = LeRobotDatasetMetadata.create(
+        repo_id="test/set_tools",
+        fps=DEFAULT_FPS,
+        features=SIMPLE_FEATURES,
+        root=root,
+        use_videos=False,
+    )
+
+    custom_tool = {
+        "type": "function",
+        "function": {
+            "name": "record_observation",
+            "description": "Capture a still image.",
+            "parameters": {
+                "type": "object",
+                "properties": {"label": {"type": "string"}},
+                "required": ["label"],
+            },
+        },
+    }
+    meta.tools = [custom_tool]
+
+    # In-memory metadata reflects the new catalog ...
+    assert meta.tools == [custom_tool]
+    assert meta.info.tools == [custom_tool]
+    # ... and a fresh read from disk agrees.
+    assert load_info(root).tools == [custom_tool]
+
+
+def test_tools_setter_clears_key_when_set_to_none(tmp_path):
+    """Setting meta.tools back to None drops the key and restores the default."""
+    from lerobot.datasets.language import DEFAULT_TOOLS
+
+    root = tmp_path / "clear_tools"
+    meta = LeRobotDatasetMetadata.create(
+        repo_id="test/clear_tools",
+        fps=DEFAULT_FPS,
+        features=SIMPLE_FEATURES,
+        root=root,
+        use_videos=False,
+    )
+
+    meta.tools = [{"type": "function", "function": {"name": "say"}}]
+    meta.tools = None
+
+    assert meta.tools == DEFAULT_TOOLS
+    with open(root / INFO_PATH) as f:
+        info_on_disk = json.load(f)
+    assert "tools" not in info_on_disk
diff --git a/tests/datasets/test_language.py b/tests/datasets/test_language.py
new file mode 100644
index 000000000..52c7b3708
--- /dev/null
+++ b/tests/datasets/test_language.py
@@ -0,0 +1,173 @@
+#!/usr/bin/env python
+
+import pytest
+
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+pytest.importorskip("pandas", reason="pandas is required (install lerobot[dataset])")
+
+import numpy as np  # noqa: E402
+import pandas as pd  # noqa: E402
+import pyarrow as pa  # noqa: E402
+
+from lerobot.datasets import LeRobotDataset  # noqa: E402
+from lerobot.datasets.io_utils import write_info  # noqa: E402
+from lerobot.datasets.language import (  # noqa: E402
+    EVENT_ONLY_STYLES,
+    LANGUAGE_EVENTS,
+    LANGUAGE_PERSISTENT,
+    PERSISTENT_STYLES,
+    STYLE_REGISTRY,
+    VIEW_DEPENDENT_STYLES,
+    column_for_style,
+    is_view_dependent_style,
+    language_events_arrow_type,
+    language_feature_info,
+    language_persistent_arrow_type,
+    validate_camera_field,
+)
+from lerobot.datasets.utils import DEFAULT_DATA_PATH  # noqa: E402
+
+
+def test_language_arrow_schema_has_expected_fields():
+    persistent_row_type = language_persistent_arrow_type().value_type
+    event_row_type = language_events_arrow_type().value_type
+
+    assert isinstance(persistent_row_type, pa.StructType)
+    assert persistent_row_type.names == [
+        "role",
+        "content",
+        "style",
+        "timestamp",
+        "camera",
+        "tool_calls",
+    ]
+
+    assert isinstance(event_row_type, pa.StructType)
+    assert event_row_type.names == ["role", "content", "style", "camera", "tool_calls"]
+
+    # Persistent-row timestamps use float32, matching LeRobotDataset frame timestamps.
+    assert persistent_row_type.field("timestamp").type == pa.float32()
+
+
+def test_validate_feature_language_warns_only_on_non_empty_value(caplog):
+    from lerobot.datasets.feature_utils import validate_feature_language
+
+    # None (the expected record-time value) is silent and non-fatal.
+    with caplog.at_level("WARNING"):
+        assert validate_feature_language("language_persistent", None) == ""
+    assert caplog.records == []
+
+    # A stray non-empty value is dropped later, so we warn rather than fail.
+    with caplog.at_level("WARNING"):
+        assert validate_feature_language("language_persistent", [{"role": "user"}]) == ""
+    assert any("language_persistent" in r.message for r in caplog.records)
+
+
+def test_style_registry_routes_columns():
+    assert {"subtask", "plan", "memory", "motion", "task_aug"} == PERSISTENT_STYLES
+    assert {"interjection", "vqa", "trace"} == EVENT_ONLY_STYLES
+    assert PERSISTENT_STYLES | EVENT_ONLY_STYLES <= STYLE_REGISTRY
+
+    assert column_for_style("subtask") == LANGUAGE_PERSISTENT
+    assert column_for_style("plan") == LANGUAGE_PERSISTENT
+    assert column_for_style("memory") == LANGUAGE_PERSISTENT
+    assert column_for_style("motion") == LANGUAGE_PERSISTENT
+    assert column_for_style("task_aug") == LANGUAGE_PERSISTENT
+    assert column_for_style("interjection") == LANGUAGE_EVENTS
+    assert column_for_style("vqa") == LANGUAGE_EVENTS
+    assert column_for_style("trace") == LANGUAGE_EVENTS
+    assert column_for_style(None) == LANGUAGE_EVENTS
+
+
+def test_view_dependent_styles():
+    # motion lives in PERSISTENT_STYLES and is described in robot-frame
+    # (joint / Cartesian) terms, so it is NOT view-dependent. Only vqa
+    # (event) and trace (event, pixel-trajectory) carry a camera tag.
+    assert {"vqa", "trace"} == VIEW_DEPENDENT_STYLES
+    assert is_view_dependent_style("vqa")
+    assert is_view_dependent_style("trace")
+    assert not is_view_dependent_style("motion")
+    assert not is_view_dependent_style("subtask")
+    assert not is_view_dependent_style("plan")
+    assert not is_view_dependent_style("interjection")
+    assert not is_view_dependent_style(None)
+
+
+def test_validate_camera_field_requires_camera_for_view_dependent_styles():
+    validate_camera_field("vqa", "observation.images.top")
+    validate_camera_field("trace", "observation.images.front")
+    with pytest.raises(ValueError, match="view-dependent"):
+        validate_camera_field("vqa", None)
+    with pytest.raises(ValueError, match="view-dependent"):
+        validate_camera_field("trace", "")
+
+
+def test_validate_camera_field_rejects_camera_on_non_view_dependent_styles():
+    validate_camera_field("subtask", None)
+    validate_camera_field("plan", None)
+    validate_camera_field("memory", None)
+    validate_camera_field("motion", None)
+    validate_camera_field("interjection", None)
+    validate_camera_field(None, None)
+    with pytest.raises(ValueError, match="must have camera=None"):
+        validate_camera_field("subtask", "observation.images.top")
+    with pytest.raises(ValueError, match="must have camera=None"):
+        validate_camera_field("motion", "observation.images.top")
+    with pytest.raises(ValueError, match="must have camera=None"):
+        validate_camera_field("interjection", "observation.images.top")
+    with pytest.raises(ValueError, match="must have camera=None"):
+        validate_camera_field(None, "observation.images.top")
+
+
+def test_unknown_style_rejected():
+    with pytest.raises(ValueError, match="Unknown language style"):
+        column_for_style("surprise")
+
+
+def test_lerobot_dataset_passes_language_columns_through(tmp_path, empty_lerobot_dataset_factory):
+    root = tmp_path / "language_dataset"
+    dataset = empty_lerobot_dataset_factory(
+        root=root,
+        features={"state": {"dtype": "float32", "shape": (2,), "names": None}},
+        use_videos=False,
+    )
+    dataset.add_frame({"state": np.array([0.0, 1.0], dtype=np.float32), "task": "tidy"})
+    dataset.add_frame({"state": np.array([1.0, 2.0], dtype=np.float32), "task": "tidy"})
+    dataset.save_episode()
+    dataset.finalize()
+
+    persistent = [
+        {
+            "role": "assistant",
+            "content": "reach for the cup",
+            "style": "subtask",
+            "timestamp": 0.0,
+            "camera": None,
+            "tool_calls": None,
+        }
+    ]
+    event = {
+        "role": "user",
+        "content": "what is visible?",
+        "style": "vqa",
+        "camera": "observation.images.top",
+        "tool_calls": None,
+    }
+    data_path = root / DEFAULT_DATA_PATH.format(chunk_index=0, file_index=0)
+    df = pd.read_parquet(data_path)
+    df[LANGUAGE_PERSISTENT] = [persistent, persistent]
+    df[LANGUAGE_EVENTS] = [[event], []]
+    df.to_parquet(data_path)
+
+    info = dataset.meta.info
+    info["features"].update(language_feature_info())
+    write_info(info, root)
+
+    reloaded = LeRobotDataset(repo_id=dataset.repo_id, root=root)
+
+    first = reloaded[0]
+    second = reloaded[1]
+    assert first[LANGUAGE_PERSISTENT] == persistent
+    assert first[LANGUAGE_EVENTS] == [event]
+    assert second[LANGUAGE_PERSISTENT] == persistent
+    assert second[LANGUAGE_EVENTS] == []
diff --git a/tests/datasets/test_language_render.py b/tests/datasets/test_language_render.py
new file mode 100644
index 000000000..fcef41fd8
--- /dev/null
+++ b/tests/datasets/test_language_render.py
@@ -0,0 +1,417 @@
+#!/usr/bin/env python
+
+import pytest
+
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+
+from lerobot.configs.recipe import MessageTurn, TrainingRecipe  # noqa: E402
+from lerobot.datasets.language_render import (  # noqa: E402
+    EMITTED_AT_TOLERANCE_S,
+    active_at,
+    emitted_at,
+    nth_next,
+    nth_prev,
+    render_sample,
+)
+
+
+def persistent_row(role, content, style, timestamp, tool_calls=None, camera=None):
+    return {
+        "role": role,
+        "content": content,
+        "style": style,
+        "timestamp": timestamp,
+        "camera": camera,
+        "tool_calls": tool_calls,
+    }
+
+
+def event_row(role, content, style, tool_calls=None, camera=None):
+    return {
+        "role": role,
+        "content": content,
+        "style": style,
+        "camera": camera,
+        "tool_calls": tool_calls,
+    }
+
+
+PERSISTENT = [
+    persistent_row("assistant", "plan 0", "plan", 0.0),
+    persistent_row("assistant", "memory 0", "memory", 0.0),
+    persistent_row("assistant", "subtask 0", "subtask", 0.0),
+    persistent_row("assistant", "memory 1", "memory", 1.0),
+    persistent_row("assistant", "subtask 1", "subtask", 1.0),
+]
+EVENTS_AT_1 = [
+    event_row("user", "what is visible?", "vqa", camera="observation.images.top"),
+    event_row("assistant", '{"count": 2}', "vqa", camera="observation.images.top"),
+]
+EVENTS_AT_2 = [
+    event_row("user", "skip wiping", "interjection"),
+    event_row(
+        "assistant",
+        None,
+        None,
+        [{"type": "function", "function": {"name": "say", "arguments": {"text": "Skipping wiping."}}}],
+    ),
+]
+# Same emission tick, two cameras: triggers per-camera disambiguation in
+# resolvers, mirroring how Module 3 of the annotation pipeline writes one
+# (vqa, user) + (vqa, assistant) pair per camera.
+EVENTS_AT_3_TWO_CAMERAS = [
+    event_row("user", "how many cups (top)?", "vqa", camera="observation.images.top"),
+    event_row("assistant", '{"count": 3}', "vqa", camera="observation.images.top"),
+    event_row("user", "how many cups (wrist)?", "vqa", camera="observation.images.wrist"),
+    event_row("assistant", '{"count": 1}', "vqa", camera="observation.images.wrist"),
+]
+
+
+def test_resolver_temporal_semantics():
+    assert active_at(0.5, persistent=PERSISTENT, style="subtask")["content"] == "subtask 0"
+    assert active_at(1.0, persistent=PERSISTENT, style="subtask")["content"] == "subtask 1"
+    assert emitted_at(0.5, persistent=PERSISTENT, events=[], style="vqa", role="assistant") is None
+    assert (
+        emitted_at(1.0, persistent=PERSISTENT, events=EVENTS_AT_1, style="vqa", role="assistant")["content"]
+        == '{"count": 2}'
+    )
+
+
+def test_persistent_relative_resolvers_reject_event_styles():
+    with pytest.raises(ValueError, match="event-only"):
+        active_at(1.0, persistent=PERSISTENT, style="vqa")
+    with pytest.raises(ValueError, match="event-only"):
+        nth_prev(1.0, persistent=PERSISTENT, style="interjection")
+
+
+def test_nth_prev_and_next():
+    assert nth_prev(1.0, persistent=PERSISTENT, style="subtask", offset=1)["content"] == "subtask 0"
+    assert nth_next(0.0, persistent=PERSISTENT, style="subtask", offset=1)["content"] == "subtask 1"
+
+
+def test_substitution_if_present_multimodal_and_tool_calls():
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(
+                role="user",
+                content=[
+                    {"type": "image", "feature": "observation.images.top"},
+                    {"type": "text", "text": "${task}: ${interjection}"},
+                ],
+                stream="high_level",
+                if_present="interjection",
+            ),
+            MessageTurn(
+                role="assistant",
+                content="${plan}",
+                stream="high_level",
+                target=True,
+                tool_calls_from="speech",
+            ),
+        ],
+        bindings={"plan": "active_at(t, style=plan)"},
+    )
+
+    rendered = render_sample(
+        recipe=recipe,
+        persistent=PERSISTENT,
+        events=EVENTS_AT_2,
+        t=2.0,
+        sample_idx=0,
+        task="clean kitchen",
+    )
+
+    assert rendered["messages"][0]["content"][1]["text"] == "clean kitchen: skip wiping"
+    assert rendered["messages"][1]["content"] == "plan 0"
+    assert rendered["messages"][1]["tool_calls"][0]["function"]["name"] == "say"
+    assert rendered["message_streams"] == ["high_level", "high_level"]
+    assert rendered["target_message_indices"] == [1]
+
+
+def test_exact_event_miss_returns_none_when_target_skips():
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(role="user", content="${vqa_query}", stream="high_level", if_present="vqa_query"),
+            MessageTurn(
+                role="assistant",
+                content="${vqa}",
+                stream="high_level",
+                target=True,
+                if_present="vqa",
+            ),
+        ]
+    )
+
+    assert (
+        render_sample(recipe=recipe, persistent=PERSISTENT, events=EVENTS_AT_2, t=0.0, sample_idx=0) is None
+    )
+
+
+def test_deterministic_blend_sampling():
+    recipe = TrainingRecipe(
+        blend={
+            "a": TrainingRecipe(
+                weight=1.0,
+                messages=[
+                    MessageTurn(role="user", content="${task}", stream="high_level"),
+                    MessageTurn(role="assistant", content="a", stream="high_level", target=True),
+                ],
+            ),
+            "b": TrainingRecipe(
+                weight=1.0,
+                messages=[
+                    MessageTurn(role="user", content="${task}", stream="high_level"),
+                    MessageTurn(role="assistant", content="b", stream="high_level", target=True),
+                ],
+            ),
+        }
+    )
+
+    first = render_sample(
+        recipe=recipe, persistent=PERSISTENT, events=EVENTS_AT_2, t=0.0, sample_idx=123, task="x"
+    )
+    second = render_sample(
+        recipe=recipe, persistent=PERSISTENT, events=EVENTS_AT_2, t=0.0, sample_idx=123, task="x"
+    )
+    assert first == second
+
+
+def test_emitted_at_filters_vqa_by_camera():
+    top = emitted_at(
+        3.0,
+        persistent=PERSISTENT,
+        events=EVENTS_AT_3_TWO_CAMERAS,
+        style="vqa",
+        role="assistant",
+        camera="observation.images.top",
+    )
+    wrist = emitted_at(
+        3.0,
+        persistent=PERSISTENT,
+        events=EVENTS_AT_3_TWO_CAMERAS,
+        style="vqa",
+        role="assistant",
+        camera="observation.images.wrist",
+    )
+    assert top["content"] == '{"count": 3}'
+    assert wrist["content"] == '{"count": 1}'
+
+
+def test_emitted_at_raises_on_ambiguous_per_camera_vqa():
+    with pytest.raises(ValueError, match="Ambiguous resolver"):
+        emitted_at(
+            3.0,
+            persistent=PERSISTENT,
+            events=EVENTS_AT_3_TWO_CAMERAS,
+            style="vqa",
+            role="assistant",
+        )
+
+
+def _vqa_subrecipe(camera: str) -> TrainingRecipe:
+    return TrainingRecipe(
+        weight=1.0,
+        bindings={
+            "vqa_query": f"emitted_at(t, style=vqa, role=user, camera={camera})",
+            "vqa": f"emitted_at(t, style=vqa, role=assistant, camera={camera})",
+        },
+        messages=[
+            MessageTurn(
+                role="user",
+                content=[{"type": "image", "feature": camera}, {"type": "text", "text": "${vqa_query}"}],
+                stream="high_level",
+                if_present="vqa_query",
+            ),
+            MessageTurn(
+                role="assistant",
+                content="${vqa}",
+                stream="high_level",
+                target=True,
+                if_present="vqa",
+            ),
+        ],
+    )
+
+
+@pytest.mark.parametrize(
+    ("camera", "expected_query", "expected_answer"),
+    [
+        ("observation.images.top", "how many cups (top)?", '{"count": 3}'),
+        ("observation.images.wrist", "how many cups (wrist)?", '{"count": 1}'),
+    ],
+)
+def test_per_camera_blend_renders_both_views(camera, expected_query, expected_answer):
+    rendered = render_sample(
+        recipe=_vqa_subrecipe(camera),
+        persistent=PERSISTENT,
+        events=EVENTS_AT_3_TWO_CAMERAS,
+        t=3.0,
+        sample_idx=0,
+    )
+
+    assert rendered["messages"][0]["content"][0]["feature"] == camera
+    assert rendered["messages"][0]["content"][1]["text"] == expected_query
+    assert rendered["messages"][1]["content"] == expected_answer
+
+
+def test_resolve_task_picks_rephrasing_deterministically_per_sample():
+    rephrasings = [
+        persistent_row("user", "tidy the kitchen", "task_aug", 0.0),
+        persistent_row("user", "please clean up the kitchen", "task_aug", 0.0),
+        persistent_row("user", "kitchen needs tidying", "task_aug", 0.0),
+        persistent_row("user", "make the kitchen clean", "task_aug", 0.0),
+    ]
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(role="user", content="${task}", stream="high_level"),
+            MessageTurn(role="assistant", content="ok", stream="high_level", target=True),
+        ]
+    )
+
+    # No explicit task override → resolver consults persistent rows.
+    seen: set[str] = set()
+    for sample_idx in range(64):
+        rendered = render_sample(
+            recipe=recipe,
+            persistent=rephrasings,
+            events=[],
+            t=0.0,
+            sample_idx=sample_idx,
+            dataset_ctx={"task": "canonical kitchen task"},
+        )
+        seen.add(rendered["messages"][0]["content"])
+    # Every rephrasing should be reachable across enough samples.
+    assert seen == {r["content"] for r in rephrasings}
+    # Same sample_idx → same pick (determinism).
+    a = render_sample(
+        recipe=recipe,
+        persistent=rephrasings,
+        events=[],
+        t=0.0,
+        sample_idx=42,
+        dataset_ctx={"task": "canonical"},
+    )
+    b = render_sample(
+        recipe=recipe,
+        persistent=rephrasings,
+        events=[],
+        t=0.0,
+        sample_idx=42,
+        dataset_ctx={"task": "canonical"},
+    )
+    assert a["messages"][0]["content"] == b["messages"][0]["content"]
+
+
+def test_resolve_task_falls_back_to_canonical_without_rephrasings():
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(role="user", content="${task}", stream="high_level"),
+            MessageTurn(role="assistant", content="ok", stream="high_level", target=True),
+        ]
+    )
+    rendered = render_sample(
+        recipe=recipe,
+        persistent=PERSISTENT,  # no task_aug rows
+        events=[],
+        t=0.0,
+        sample_idx=0,
+        dataset_ctx={"task": "clean the kitchen"},
+    )
+    assert rendered["messages"][0]["content"] == "clean the kitchen"
+
+
+def test_resolve_task_explicit_override_beats_rephrasings():
+    rephrasings = [
+        persistent_row("user", "rephrased one", "task_aug", 0.0),
+        persistent_row("user", "rephrased two", "task_aug", 0.0),
+    ]
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(role="user", content="${task}", stream="high_level"),
+            MessageTurn(role="assistant", content="ok", stream="high_level", target=True),
+        ]
+    )
+    rendered = render_sample(
+        recipe=recipe,
+        persistent=rephrasings,
+        events=[],
+        t=0.0,
+        sample_idx=0,
+        task="explicit override wins",
+        dataset_ctx={"task": "canonical"},
+    )
+    assert rendered["messages"][0]["content"] == "explicit override wins"
+
+
+def test_emitted_at_persistent_tolerates_small_timestamp_drift():
+    """Persistent ``emitted_at`` should match within EMITTED_AT_TOLERANCE_S
+    so callers that derive ``t`` arithmetically (``frame_idx / fps``) still
+    line up with the parquet-stored timestamp.
+    """
+    rows = [persistent_row("assistant", "memo", "memory", 1.0)]
+    # Half a tolerance window — bit-different float, comfortably inside
+    inside = emitted_at(1.0 + EMITTED_AT_TOLERANCE_S / 2, persistent=rows, events=[], style="memory")
+    assert inside is not None and inside["content"] == "memo"
+
+    # Twice the tolerance, so clearly outside the window: no match
+    outside = emitted_at(1.0 + EMITTED_AT_TOLERANCE_S * 2, persistent=rows, events=[], style="memory")
+    assert outside is None
+
+
+def test_render_sample_rejects_non_dict_language_rows():
+    """``_normalize_rows`` must surface malformed inputs as TypeError.
+
+    A pipeline that hands the renderer a non-dict (e.g. a stray string)
+    is a real upstream bug — silent skipping would let it propagate.
+    """
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(role="user", content="${task}", stream="high_level"),
+            MessageTurn(role="assistant", content="ok", stream="high_level", target=True),
+        ]
+    )
+    with pytest.raises(TypeError, match="must be dictionaries"):
+        render_sample(
+            recipe=recipe,
+            persistent=["not a dict"],
+            events=[],
+            t=0.0,
+            sample_idx=0,
+            task="x",
+        )
+
+
+def test_low_level_branch_renders_active_subtask():
+    low_level = TrainingRecipe(
+        blend={
+            "low": TrainingRecipe(
+                weight=1.0,
+                messages=[
+                    MessageTurn(
+                        role="user",
+                        content="${task}\nPlan: ${plan}\nMemory: ${memory}",
+                        stream="high_level",
+                    ),
+                    MessageTurn(
+                        role="assistant",
+                        content="${subtask}",
+                        stream="low_level",
+                        target=True,
+                    ),
+                ],
+            )
+        }
+    )
+
+    rendered = render_sample(
+        recipe=low_level,
+        persistent=PERSISTENT,
+        events=[],
+        t=0.5,
+        sample_idx=0,
+        task="clean kitchen",
+    )
+
+    assert rendered["messages"][-1] == {"role": "assistant", "content": "subtask 0"}
+    assert rendered["message_streams"][-1] == "low_level"
+    assert rendered["target_message_indices"] == [1]
diff --git a/tests/datasets/test_subtask_dataset.py b/tests/datasets/test_subtask_dataset.py
deleted file mode 100644
index bb77b77d1..000000000
--- a/tests/datasets/test_subtask_dataset.py
+++ /dev/null
@@ -1,193 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""
-Tests for subtask functionality in LeRobotDataset.
-
-These tests verify that:
-- Subtask information is correctly loaded from datasets that have subtask data
-- The __getitem__ method correctly adds subtask strings to returned items
-- Subtask handling gracefully handles missing data
-"""
-
-import pytest
-
-pytest.importorskip("pandas", reason="pandas is required (install lerobot[dataset])")
-
-import pandas as pd  # noqa: E402
-import torch
-
-from lerobot.datasets.lerobot_dataset import LeRobotDataset
-
-
-class TestSubtaskDataset:
-    """Tests for subtask handling in LeRobotDataset."""
-
-    @pytest.fixture
-    def subtask_dataset(self):
-        """Load the test subtask dataset from the hub."""
-        # Use lerobot/pusht-subtask dataset with episode 1
-        return LeRobotDataset(
-            repo_id="lerobot/pusht-subtask",
-            episodes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
-        )
-
-    def test_subtask_dataset_loads(self, subtask_dataset):
-        """Test that the subtask dataset loads successfully."""
-        assert subtask_dataset is not None
-        assert len(subtask_dataset) > 0
-
-    def test_subtask_metadata_loaded(self, subtask_dataset):
-        """Test that subtask metadata is loaded when present in dataset."""
-        # The dataset should have subtasks metadata loaded
-        assert subtask_dataset.meta.subtasks is not None
-        assert isinstance(subtask_dataset.meta.subtasks, pd.DataFrame)
-
-    def test_subtask_index_in_features(self, subtask_dataset):
-        """Test that subtask_index is a feature when dataset has subtasks."""
-        assert "subtask_index" in subtask_dataset.features
-
-    def test_getitem_returns_subtask_string(self, subtask_dataset):
-        """Test that __getitem__ correctly adds subtask string to returned item."""
-        item = subtask_dataset[0]
-
-        # Subtask should be present in the returned item
-        assert "subtask" in item
-        assert isinstance(item["subtask"], str)
-        assert len(item["subtask"]) > 0  # Should not be empty
-
-    def test_getitem_has_subtask_index(self, subtask_dataset):
-        """Test that __getitem__ includes subtask_index."""
-        item = subtask_dataset[0]
-
-        assert "subtask_index" in item
-        assert isinstance(item["subtask_index"], torch.Tensor)
-
-    def test_subtask_index_maps_to_valid_subtask(self, subtask_dataset):
-        """Test that subtask_index correctly maps to a subtask in metadata."""
-        item = subtask_dataset[0]
-
-        subtask_idx = item["subtask_index"].item()
-        subtask_from_metadata = subtask_dataset.meta.subtasks.iloc[subtask_idx].name
-
-        assert item["subtask"] == subtask_from_metadata
-
-    def test_all_items_have_subtask(self, subtask_dataset):
-        """Test that all items in the dataset have subtask information."""
-        for i in range(min(len(subtask_dataset), 5)):  # Check first 5 items
-            item = subtask_dataset[i]
-            assert "subtask" in item
-            assert isinstance(item["subtask"], str)
-
-    def test_task_and_subtask_coexist(self, subtask_dataset):
-        """Test that both task and subtask are present in returned items."""
-        item = subtask_dataset[0]
-
-        # Both task and subtask should be present
-        assert "task" in item
-        assert "subtask" in item
-        assert isinstance(item["task"], str)
-        assert isinstance(item["subtask"], str)
-
-
-class TestSubtaskDatasetMissing:
-    """Tests for graceful handling when subtask data is missing."""
-
-    @pytest.fixture
-    def dataset_without_subtasks(self, tmp_path, empty_lerobot_dataset_factory):
-        """Create a dataset without subtask information."""
-        features = {"state": {"dtype": "float32", "shape": (2,), "names": None}}
-        dataset = empty_lerobot_dataset_factory(root=tmp_path / "no_subtask", features=features)
-
-        # Add some frames and save
-        for _ in range(5):
-            dataset.add_frame({"state": torch.randn(2), "task": "Test task"})
-        dataset.save_episode()
-        dataset.finalize()
-
-        # Reload the dataset
-        return LeRobotDataset(dataset.repo_id, root=dataset.root)
-
-    def test_no_subtask_in_features(self, dataset_without_subtasks):
-        """Test that subtask_index is not in features when not provided."""
-        assert "subtask_index" not in dataset_without_subtasks.features
-
-    def test_getitem_without_subtask(self, dataset_without_subtasks):
-        """Test that __getitem__ works when subtask is not present."""
-        item = dataset_without_subtasks[0]
-
-        # Item should still be retrievable
-        assert item is not None
-        assert "state" in item
-        assert "task" in item
-
-        # Subtask should NOT be present
-        assert "subtask" not in item
-
-    def test_subtasks_metadata_is_none(self, dataset_without_subtasks):
-        """Test that subtasks metadata is None when not present."""
-        assert dataset_without_subtasks.meta.subtasks is None
-
-
-class TestSubtaskEdgeCases:
-    """Edge case tests for subtask handling."""
-
-    def test_subtask_with_multiple_episodes(self):
-        """Test subtask handling with multiple episodes if available."""
-        try:
-            dataset = LeRobotDataset(
-                repo_id="lerobot/pusht-subtask",
-                episodes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
-            )
-        except Exception:
-            pytest.skip("Could not load test-subtask dataset")
-
-        # Check first and last items have valid subtasks
-        first_item = dataset[0]
-        last_item = dataset[len(dataset) - 1]
-
-        assert "subtask" in first_item
-        assert "subtask" in last_item
-        assert isinstance(first_item["subtask"], str)
-        assert isinstance(last_item["subtask"], str)
-
-    def test_subtask_index_consistency(self):
-        """Test that same subtask_index returns same subtask string."""
-        try:
-            dataset = LeRobotDataset(
-                repo_id="lerobot/pusht-subtask",
-                episodes=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
-            )
-        except Exception:
-            pytest.skip("Could not load test-subtask dataset")
-
-        if len(dataset) < 2:
-            pytest.skip("Dataset too small for this test")
-
-        # Collect subtask_index to subtask mappings
-        subtask_map = {}
-        for i in range(min(len(dataset), 10)):
-            item = dataset[i]
-            idx = item["subtask_index"].item()
-            subtask = item["subtask"]
-
-            if idx in subtask_map:
-                # Same index should always return same subtask
-                assert subtask_map[idx] == subtask, (
-                    f"Inconsistent subtask for index {idx}: '{subtask_map[idx]}' vs '{subtask}'"
-                )
-            else:
-                subtask_map[idx] = subtask
diff --git a/tests/processor/test_render_messages_processor.py b/tests/processor/test_render_messages_processor.py
new file mode 100644
index 000000000..f96e3c0ab
--- /dev/null
+++ b/tests/processor/test_render_messages_processor.py
@@ -0,0 +1,60 @@
+#!/usr/bin/env python
+
+import pytest
+
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+
+import torch  # noqa: E402
+
+from lerobot.configs.recipe import MessageTurn, TrainingRecipe  # noqa: E402
+from lerobot.processor.converters import create_transition  # noqa: E402
+from lerobot.processor.render_messages_processor import RenderMessagesStep  # noqa: E402
+from lerobot.types import TransitionKey  # noqa: E402
+
+
+def test_render_messages_step_noops_without_language_columns():
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(role="user", content="${task}", stream="high_level"),
+            MessageTurn(role="assistant", content="${subtask}", stream="low_level", target=True),
+        ]
+    )
+    transition = create_transition(complementary_data={"task": "do it"})
+
+    assert RenderMessagesStep(recipe)(transition) == transition
+
+
+def test_render_messages_step_renders_and_drops_raw_language():
+    recipe = TrainingRecipe(
+        messages=[
+            MessageTurn(role="user", content="${task}", stream="high_level"),
+            MessageTurn(role="assistant", content="${subtask}", stream="low_level", target=True),
+        ]
+    )
+    transition = create_transition(
+        complementary_data={
+            "task": "do it",
+            "timestamp": torch.tensor(0.0),
+            "index": torch.tensor(7),
+            "language_persistent": [
+                {
+                    "role": "assistant",
+                    "content": "reach carefully",
+                    "style": "subtask",
+                    "timestamp": 0.0,
+                    "camera": None,
+                    "tool_calls": None,
+                }
+            ],
+            "language_events": [],
+        }
+    )
+
+    out = RenderMessagesStep(recipe)(transition)
+    data = out[TransitionKey.COMPLEMENTARY_DATA]
+
+    assert "language_persistent" not in data
+    assert "language_events" not in data
+    assert data["messages"][-1]["content"] == "reach carefully"
+    assert data["message_streams"] == ["high_level", "low_level"]
+    assert data["target_message_indices"] == [1]
diff --git a/tests/utils/test_collate.py b/tests/utils/test_collate.py
new file mode 100644
index 000000000..2b23b3180
--- /dev/null
+++ b/tests/utils/test_collate.py
@@ -0,0 +1,84 @@
+#!/usr/bin/env python
+
+import pytest
+
+pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
+
+import torch  # noqa: E402
+
+from lerobot.utils.collate import lerobot_collate_fn  # noqa: E402
+
+
+def test_lerobot_collate_preserves_messages_and_drops_raw_language():
+    batch = [
+        {
+            "index": torch.tensor(0),
+            "messages": [{"role": "assistant", "content": "a"}],
+            "message_streams": ["low_level"],
+            "target_message_indices": [0],
+            "language_persistent": [{"content": "raw"}],
+            "language_events": [],
+        },
+        {
+            "index": torch.tensor(1),
+            "messages": [{"role": "assistant", "content": "b"}],
+            "message_streams": ["low_level"],
+            "target_message_indices": [0],
+            "language_persistent": [{"content": "raw"}],
+            "language_events": [],
+        },
+    ]
+
+    out = lerobot_collate_fn(batch)
+
+    assert out["index"].tolist() == [0, 1]
+    assert out["messages"][0][0]["content"] == "a"
+    assert out["messages"][1][0]["content"] == "b"
+    assert out["message_streams"] == [["low_level"], ["low_level"]]
+    assert out["target_message_indices"] == [[0], [0]]
+    assert "language_persistent" not in out
+    assert "language_events" not in out
+
+
+def test_lerobot_collate_passes_through_standard_batch():
+    """On a non-language batch, the collate must match ``default_collate``.
+
+    Guards against silent regressions: ``lerobot_train.py`` only opts into
+    ``lerobot_collate_fn`` when the dataset declares language columns, but
+    if a future change ever wires it in unconditionally we want the
+    behavior to remain a transparent pass-through for ordinary tensor
+    batches.
+    """
+    from torch.utils.data._utils.collate import default_collate
+
+    batch = [
+        {
+            "observation.image": torch.zeros(3, 4, 4),
+            "action": torch.tensor([0.0, 1.0]),
+            "index": torch.tensor(0),
+        },
+        {
+            "observation.image": torch.ones(3, 4, 4),
+            "action": torch.tensor([2.0, 3.0]),
+            "index": torch.tensor(1),
+        },
+    ]
+
+    custom = lerobot_collate_fn(batch)
+    expected = default_collate(batch)
+
+    assert custom.keys() == expected.keys()
+    for key in expected:
+        assert torch.equal(custom[key], expected[key]), f"key={key} diverged"
+
+
+def test_lerobot_collate_drops_none_samples():
+    """Recipes that yielded no target message return ``None`` — those samples
+    must be filtered out, and an entirely-``None`` batch must collapse to ``None``.
+    """
+    batch = [None, {"index": torch.tensor(0)}, None]
+    out = lerobot_collate_fn(batch)
+    assert out is not None
+    assert out["index"].tolist() == [0]
+
+    assert lerobot_collate_fn([None, None]) is None
diff --git a/uv.lock b/uv.lock
index 692029986..7092f780a 100644
--- a/uv.lock
+++ b/uv.lock
@@ -3057,7 +3057,7 @@ requires-dist = [
     { name = "av", marker = "extra == 'av-dep'", specifier = ">=15.0.0,<16.0.0" },
     { name = "cmake", specifier = ">=3.29.0.1,<4.2.0" },
     { name = "contourpy", marker = "extra == 'matplotlib-dep'", specifier = ">=1.3.0,<2.0.0" },
-    { name = "datasets", marker = "extra == 'dataset'", specifier = ">=4.0.0,<5.0.0" },
+    { name = "datasets", marker = "extra == 'dataset'", specifier = ">=4.7.0,<5.0.0" },
     { name = "debugpy", marker = "extra == 'dev'", specifier = ">=1.8.1,<1.9.0" },
     { name = "decord", marker = "(platform_machine == 'AMD64' and extra == 'groot') or (platform_machine == 'x86_64' and extra == 'groot')", specifier = ">=0.6.0,<1.0.0" },
     { name = "deepdiff", marker = "extra == 'deepdiff-dep'", specifier = ">=7.0.1,<9.0.0" },

From d38eb89f7117d0ea945e1a0d686f03501b49508e Mon Sep 17 00:00:00 2001
From: Caroline Pascal <caroline8.pascal@gmail.com>
Date: Tue, 19 May 2026 14:46:14 +0200
Subject: [PATCH 05/17] feat(video re-encoding): Adding utility and dataset
 editing tool for video re-encoding (#3611)

* feat(utility): adding video re-encode utility

* feat(edit): adding a new lerobot-edit-dataset tool to re-encode all the videos of a dataset

* chore(format): formatting code

* chore(review): fix Claude reviews

* test(reencode dataset): adding missing test for reencode dataset
---
 src/lerobot/datasets/__init__.py            |  2 +
 src/lerobot/datasets/dataset_tools.py       | 84 ++++++++++++++++++-
 src/lerobot/datasets/video_utils.py         | 86 ++++++++++++++++++++
 src/lerobot/scripts/lerobot_edit_dataset.py | 89 +++++++++++++++++++++
 tests/datasets/test_dataset_tools.py        | 42 ++++++++++
 tests/datasets/test_video_encoding.py       | 49 +++++++++---
 6 files changed, 342 insertions(+), 10 deletions(-)

diff --git a/src/lerobot/datasets/__init__.py b/src/lerobot/datasets/__init__.py
index e4e3ccdf6..2a67858d2 100644
--- a/src/lerobot/datasets/__init__.py
+++ b/src/lerobot/datasets/__init__.py
@@ -31,6 +31,7 @@ from .dataset_tools import (
     modify_features,
     modify_tasks,
     recompute_stats,
+    reencode_dataset,
     remove_feature,
     split_dataset,
 )
@@ -91,6 +92,7 @@ __all__ = [
     "modify_features",
     "modify_tasks",
     "recompute_stats",
+    "reencode_dataset",
     "remove_feature",
     "resolve_delta_timestamps",
     "safe_stop_image_writer",
diff --git a/src/lerobot/datasets/dataset_tools.py b/src/lerobot/datasets/dataset_tools.py
index 489914fbc..adbb841c4 100644
--- a/src/lerobot/datasets/dataset_tools.py
+++ b/src/lerobot/datasets/dataset_tools.py
@@ -26,7 +26,7 @@ This module provides utilities for:
 import logging
 import shutil
 from collections.abc import Callable
-from concurrent.futures import ThreadPoolExecutor, as_completed
+from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
 from pathlib import Path
 
 import datasets
@@ -61,11 +61,13 @@ from .utils import (
     DEFAULT_DATA_FILE_SIZE_IN_MB,
     DEFAULT_DATA_PATH,
     DEFAULT_EPISODES_PATH,
+    VIDEO_DIR,
     update_chunk_file_indices,
 )
 from .video_utils import (
     encode_video_frames,
     get_video_info,
+    reencode_video,
 )
 
 
@@ -1884,3 +1886,83 @@ def convert_image_to_video_dataset(
 
     # Return new dataset
     return LeRobotDataset(repo_id=repo_id, root=output_dir)
+
+
+def _reencode_video_worker(args: tuple) -> Path:
+    """Picklable worker for :func:`reencode_dataset`'s process pool."""
+    video_path, camera_encoder, encoder_threads = args
+    reencode_video(
+        input_video_path=video_path,
+        output_video_path=video_path,
+        camera_encoder=camera_encoder,
+        encoder_threads=encoder_threads,
+        overwrite=True,
+    )
+    return video_path
+
+
+def reencode_dataset(
+    dataset: LeRobotDataset,
+    camera_encoder: VideoEncoderConfig,
+    encoder_threads: int | None = None,
+    num_workers: int | None = None,
+) -> LeRobotDataset:
+    """Re-encode every video in a dataset with a new set of encoding parameters.
+
+    Videos are re-encoded in-place and the video information in ``info.json`` is refreshed.
+
+    Args:
+        dataset: An existing :class:`LeRobotDataset` whose videos will be
+            re-encoded.
+        camera_encoder: Target encoder configuration applied to every video
+            file.
+        encoder_threads: Per-encoder thread count forwarded to
+            :func:`reencode_video`. ``None`` lets the codec decide.
+        num_workers: Number of parallel processes. ``None``, ``0`` or ``1`` means
+            sequential (no multiprocessing); ``2+`` spawns a
+            :class:`~concurrent.futures.ProcessPoolExecutor`.
+
+    Returns:
+        The same :class:`LeRobotDataset` instance with its metadata updated
+        on disk.
+    """
+    meta = dataset.meta
+    video_paths_list = []
+
+    # Only re-encode if the videos are not already encoded with the given video encoding parameters
+    for video_key in meta.video_keys:
+        current_info = meta.info.features[video_key].get("info", {})
+        current_encoder = VideoEncoderConfig.from_video_info(current_info)
+        if current_encoder != camera_encoder:
+            video_paths_list.extend((meta.root / VIDEO_DIR / video_key).rglob("*.mp4"))
+        else:
+            logging.info(f"{video_key} videos are already encoded with {camera_encoder}. Nothing to do.")
+
+    if len(video_paths_list) == 0:
+        logging.warning("Dataset has no videos to re-encode.")
+        return dataset
+    logging.info(f"Re-encoding {len(video_paths_list)} video file(s) with {camera_encoder}")
+
+    worker_args = [(vp, camera_encoder, encoder_threads) for vp in video_paths_list]
+    if num_workers and num_workers > 1:
+        with ProcessPoolExecutor(max_workers=num_workers) as pool:
+            futures = [pool.submit(_reencode_video_worker, args) for args in worker_args]
+            for future in tqdm(
+                as_completed(futures),
+                total=len(futures),
+                desc="Re-encoding videos",
+            ):
+                future.result()
+    else:
+        for args in tqdm(worker_args, desc="Re-encoding videos"):
+            _reencode_video_worker(args)
+
+    # Refresh video info in metadata for every video key.
+    for vid_key in meta.video_keys:
+        video_path = meta.root / meta.get_video_file_path(0, vid_key)
+        meta.info.features[vid_key]["info"] = get_video_info(video_path, camera_encoder=camera_encoder)
+
+    write_info(meta.info, meta.root)
+    logging.info("Dataset metadata updated.")
+
+    return dataset
diff --git a/src/lerobot/datasets/video_utils.py b/src/lerobot/datasets/video_utils.py
index e823a406c..99122381a 100644
--- a/src/lerobot/datasets/video_utils.py
+++ b/src/lerobot/datasets/video_utils.py
@@ -403,6 +403,92 @@ def encode_video_frames(
         raise OSError(f"Video encoding did not work. File not found: {video_path}.")
 
 
+def reencode_video(
+    input_video_path: Path | str,
+    output_video_path: Path | str,
+    camera_encoder: VideoEncoderConfig | None = None,
+    encoder_threads: int | None = None,
+    log_level: int | None = av.logging.WARNING,
+    overwrite: bool = False,
+) -> None:
+    """Re-encode a video file using the given encoder configuration.
+
+    Args:
+        input_video_path: Existing video file to read.
+        output_video_path: Path for the re-encoded file.
+        camera_encoder: Encoder configuration. Defaults to :func:`camera_encoder_defaults`.
+        encoder_threads: Optional thread count forwarded to :meth:`VideoEncoderConfig.get_codec_options`.
+        log_level: libav log level while encoding, or ``None`` to leave logging unchanged. Defaults to WARNING.
+        overwrite: When ``False`` and ``output_video_path`` already exists, skip and log a warning.
+    """
+
+    camera_encoder = camera_encoder or camera_encoder_defaults()
+
+    output_video_path = Path(output_video_path)
+
+    if output_video_path.exists() and not overwrite:
+        logger.warning(f"Video file already exists: {output_video_path}. Skipping re-encode.")
+        return
+
+    output_video_path.parent.mkdir(parents=True, exist_ok=True)
+
+    video_options = camera_encoder.get_codec_options(encoder_threads, as_strings=True)
+    vcodec = camera_encoder.vcodec
+    pix_fmt = camera_encoder.pix_fmt
+
+    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_named_file:
+        tmp_output_video_path = tmp_named_file.name
+
+    if log_level is not None:
+        logging.getLogger("libav").setLevel(log_level)
+
+    try:
+        with av.open(input_video_path, mode="r") as src:
+            try:
+                in_stream = src.streams.video[0]
+            except IndexError as e:
+                raise ValueError(f"No video stream in {input_video_path}") from e
+
+            fps = (
+                in_stream.base_rate
+            )  # We allow fractional fps though LeRobotDataset only supports integer fps
+            width = int(in_stream.width)
+            height = int(in_stream.height)
+
+            with av.open(
+                tmp_output_video_path,
+                mode="w",
+                options={
+                    "movflags": "faststart"
+                },  # faststart is to move the metadata to the beginning of the file to speed up loading
+            ) as dst:
+                out_stream = dst.add_stream(vcodec, fps, options=video_options)
+                out_stream.pix_fmt = pix_fmt
+                out_stream.width = width
+                out_stream.height = height
+
+                for frame in src.decode(in_stream):
+                    frame = frame.reformat(width=width, height=height, format=pix_fmt)
+                    packet = out_stream.encode(frame)
+                    if packet:
+                        dst.mux(packet)
+
+                packet = out_stream.encode()
+                if packet:
+                    dst.mux(packet)
+
+        shutil.move(tmp_output_video_path, output_video_path)
+    except Exception:
+        Path(tmp_output_video_path).unlink(missing_ok=True)
+        raise
+    finally:
+        if log_level is not None:
+            av.logging.restore_default_callback()
+
+    if not output_video_path.exists():
+        raise OSError(f"Video re-encoding did not work. File not found: {output_video_path}.")
+
+
 def concatenate_video_files(
     input_video_paths: list[Path | str],
     output_video_path: Path,
diff --git a/src/lerobot/scripts/lerobot_edit_dataset.py b/src/lerobot/scripts/lerobot_edit_dataset.py
index eb6a57870..3c1edbb31 100644
--- a/src/lerobot/scripts/lerobot_edit_dataset.py
+++ b/src/lerobot/scripts/lerobot_edit_dataset.py
@@ -178,6 +178,31 @@ Recompute stats for relative actions and push to hub:
         --operation.num_workers 4 \
         --push_to_hub true
 
+Re-encode all videos in a dataset (saves to lerobot/pusht_reencoded by default):
+    lerobot-edit-dataset \
+        --repo_id lerobot/pusht \
+        --operation.type reencode_videos \
+        --operation.camera_encoder.vcodec h264 \
+        --operation.camera_encoder.pix_fmt yuv420p \
+        --operation.camera_encoder.crf 23
+
+Re-encode videos into a new dataset using 4 parallel processes:
+    lerobot-edit-dataset \
+        --repo_id lerobot/pusht \
+        --new_repo_id lerobot/pusht_h264 \
+        --operation.type reencode_videos \
+        --operation.camera_encoder.vcodec h264 \
+        --operation.camera_encoder.crf 23 \
+        --operation.num_workers 4
+
+Re-encode videos in-place (overwrites original dataset):
+    lerobot-edit-dataset \
+        --repo_id lerobot/pusht \
+        --new_repo_id lerobot/pusht \
+        --operation.type reencode_videos \
+        --operation.camera_encoder.vcodec h264 \
+        --operation.overwrite true
+
 Using JSON config file:
     lerobot-edit-dataset \
         --config_path path/to/edit_config.json
@@ -200,6 +225,7 @@ from lerobot.datasets import (
     merge_datasets,
     modify_tasks,
     recompute_stats,
+    reencode_dataset,
     remove_feature,
     split_dataset,
 )
@@ -268,6 +294,15 @@ class RecomputeStatsConfig(OperationConfig):
     overwrite: bool = False
 
 
+@OperationConfig.register_subclass("reencode_videos")
+@dataclass
+class ReencodeVideosConfig(OperationConfig):
+    camera_encoder: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
+    num_workers: int = 0
+    encoder_threads: int | None = None
+    overwrite: bool = False
+
+
 @OperationConfig.register_subclass("info")
 @dataclass
 class InfoConfig(OperationConfig):
@@ -634,6 +669,58 @@ def handle_recompute_stats(cfg: EditDatasetConfig) -> None:
         dataset.push_to_hub()
 
 
+def handle_reencode_videos(cfg: EditDatasetConfig) -> None:
+    if not isinstance(cfg.operation, ReencodeVideosConfig):
+        raise ValueError("Operation config must be ReencodeVideosConfig")
+
+    output_repo_id, input_root, output_root = _resolve_io_paths(
+        cfg.repo_id,
+        cfg.new_repo_id,
+        cfg.root,
+        cfg.new_root,
+        default_new_repo_id=f"{cfg.repo_id}_reencoded",
+    )
+    in_place = output_root == input_root
+
+    if in_place and not cfg.operation.overwrite:
+        raise ValueError(
+            f"reencode_videos would overwrite the dataset in-place at {input_root}. "
+            "Pass --operation.overwrite true to allow in-place modification, "
+            "or use --new_repo_id / --new_root to write to a different location. "
+            f"Default output repo_id when neither is set: '{cfg.repo_id}_reencoded'."
+        )
+
+    if in_place:
+        logging.warning(
+            f"Overwriting dataset videos in-place at {input_root}. The original videos will be lost."
+        )
+        dataset = LeRobotDataset(cfg.repo_id, root=input_root)
+    else:
+        logging.info(f"Copying dataset from {input_root} to {output_root}")
+        if output_root.exists():
+            backup_path = output_root.with_name(output_root.name + "_old")
+            logging.warning(f"Output directory {output_root} already exists. Moving to {backup_path}")
+            if backup_path.exists():
+                shutil.rmtree(backup_path)
+            shutil.move(output_root, backup_path)
+        shutil.copytree(input_root, output_root)
+        dataset = LeRobotDataset(output_repo_id, root=output_root)
+
+    logging.info(f"Re-encoding videos in {output_repo_id} with {cfg.operation.camera_encoder}")
+    reencode_dataset(
+        dataset,
+        camera_encoder=cfg.operation.camera_encoder,
+        encoder_threads=cfg.operation.encoder_threads,
+        num_workers=cfg.operation.num_workers,
+    )
+
+    logging.info(f"All videos re-encoded at {dataset.root}")
+
+    if cfg.push_to_hub:
+        logging.info(f"Pushing to hub as {output_repo_id}...")
+        dataset.push_to_hub()
+
+
 def _get_dataset_size(repo_path):
     import os
 
@@ -707,6 +794,8 @@ def edit_dataset(cfg: EditDatasetConfig) -> None:
         handle_convert_image_to_video(cfg)
     elif operation_type == "recompute_stats":
         handle_recompute_stats(cfg)
+    elif operation_type == "reencode_videos":
+        handle_reencode_videos(cfg)
     elif operation_type == "info":
         handle_info(cfg)
     else:
diff --git a/tests/datasets/test_dataset_tools.py b/tests/datasets/test_dataset_tools.py
index 032fd4f7c..d36312920 100644
--- a/tests/datasets/test_dataset_tools.py
+++ b/tests/datasets/test_dataset_tools.py
@@ -23,6 +23,7 @@ import torch
 
 pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
 
+
 from lerobot.configs import VideoEncoderConfig
 from lerobot.datasets.dataset_tools import (
     add_features,
@@ -31,9 +32,12 @@ from lerobot.datasets.dataset_tools import (
     merge_datasets,
     modify_features,
     modify_tasks,
+    reencode_dataset,
     remove_feature,
     split_dataset,
 )
+from lerobot.datasets.io_utils import load_info
+from tests.datasets.test_video_encoding import _add_frames, require_h264, require_libsvtav1
 
 
 @pytest.fixture
@@ -1326,3 +1330,41 @@ def test_convert_image_to_video_dataset_subset_episodes(tmp_path):
 
         if output_dir.exists():
             shutil.rmtree(output_dir)
+
+
+# ─── reencode_dataset ─────────────────────────────────────────────────
+
+
+@require_libsvtav1
+@require_h264
+def test_reencode_dataset_multi_key_multiprocessing(
+    tmp_path, empty_lerobot_dataset_factory, features_factory
+):
+    """Re-encode a two-camera dataset with num_workers=2 and verify metadata refresh."""
+    features = features_factory(use_videos=True)
+    initial_cfg = VideoEncoderConfig(vcodec="libsvtav1", g=2, crf=30, preset=12)
+    dataset = empty_lerobot_dataset_factory(
+        root=tmp_path / "ds",
+        features=features,
+        use_videos=True,
+        camera_encoder=initial_cfg,
+    )
+
+    _add_frames(dataset, num_frames=4)
+    dataset.save_episode()
+    _add_frames(dataset, num_frames=4)
+    dataset.save_episode()
+    dataset.finalize()
+
+    assert len(dataset.meta.video_keys) == 2
+
+    target_cfg = VideoEncoderConfig(vcodec="h264", g=6, crf=23, pix_fmt="yuv420p")
+
+    result = reencode_dataset(dataset, camera_encoder=target_cfg, num_workers=2)
+
+    assert result is dataset
+
+    persisted_info = load_info(dataset.root)
+    for vk in dataset.meta.video_keys:
+        persisted_encoder = VideoEncoderConfig.from_video_info(persisted_info.features[vk].get("info", {}))
+        assert persisted_encoder == target_cfg
diff --git a/tests/datasets/test_video_encoding.py b/tests/datasets/test_video_encoding.py
index 224f2405b..1af61e9f9 100644
--- a/tests/datasets/test_video_encoding.py
+++ b/tests/datasets/test_video_encoding.py
@@ -35,6 +35,7 @@ from lerobot.datasets.video_utils import (
     concatenate_video_files,
     encode_video_frames,
     get_video_info,
+    reencode_video,
 )
 from tests.fixtures.constants import DUMMY_VIDEO_INFO
 
@@ -347,16 +348,22 @@ def _read_feature_info(dataset: LeRobotDataset) -> dict:
     return info["features"][VIDEO_KEY]["info"]
 
 
-def _add_frames(dataset: LeRobotDataset, num_frames: int) -> None:
-    shape = dataset.meta.features[VIDEO_KEY]["shape"]
+def _add_frames(dataset: LeRobotDataset, num_frames: int, video_keys: list[str] | None = None) -> None:
+    from lerobot.utils.constants import DEFAULT_FEATURES
+
+    if video_keys is None:
+        video_keys = dataset.meta.video_keys
     for _ in range(num_frames):
-        dataset.add_frame(
-            {
-                VIDEO_KEY: np.random.randint(0, 256, shape, dtype=np.uint8),
-                "action": np.zeros(2, dtype=np.float32),
-                "task": "test",
-            }
-        )
+        frame: dict = {"task": "test"}
+        for key, ft in dataset.meta.features.items():
+            if key in DEFAULT_FEATURES:
+                continue
+            shape = ft["shape"]
+            if key in video_keys:
+                frame[key] = np.random.randint(0, 256, shape, dtype=np.uint8)
+            else:
+                frame[key] = np.zeros(shape, dtype=np.float32)
+        dataset.add_frame(frame)
 
 
 class TestGetVideoInfo:
@@ -474,6 +481,30 @@ class TestEncodeVideoFrames:
         assert info["video.extra_options"] == {}
 
 
+class TestReencodeVideo:
+    @require_libsvtav1
+    @require_h264
+    def test_reencode_video(self, tmp_path):
+        src = TEST_ARTIFACTS_DIR / "clip_4frames.mp4"
+        out = tmp_path / "reencoded.mp4"
+        cfg = VideoEncoderConfig(vcodec="h264", g=6, crf=23, pix_fmt="yuv444p")
+        reencode_video(src, out, camera_encoder=cfg, overwrite=True)
+
+        assert out.exists()
+        with av.open(str(out)) as container:
+            n_frames = sum(1 for _ in container.decode(video=0))
+        assert n_frames == 4
+
+        info = get_video_info(out, camera_encoder=cfg)
+        assert info["video.codec"] == "h264"
+        assert info["video.pix_fmt"] == "yuv444p"
+        assert info["video.height"] == 64
+        assert info["video.width"] == 96
+        assert info["video.fps"] == 30
+        assert info["video.g"] == 6
+        assert info["video.crf"] == 23
+
+
 class TestConcatenateVideoFiles:
     def test_two_clips_frame_count(self, tmp_path):
         """Output frame count equals the sum of the two input frame counts."""

From 6a8878a6391d8d1343c47632c831e10a8e7b2d54 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E5=9B=9B=E4=B8=83?=
 <41624527+SevenFo@users.noreply.github.com>
Date: Tue, 19 May 2026 22:53:19 +0800
Subject: [PATCH 06/17] fix(datasets): normalize shape=(1,) numeric values
 before HF encoding (#3344)

* fix(datasets): normalize shape=(1,) numeric values before save

* test(datasets): cover shape=(1,) int/bool and finalize

Co-authored-by: Copilot <copilot@github.com>
---
 src/lerobot/datasets/dataset_writer.py |  9 ++++++-
 tests/datasets/test_datasets.py        | 36 ++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/src/lerobot/datasets/dataset_writer.py b/src/lerobot/datasets/dataset_writer.py
index 6be63194f..633c00c1a 100644
--- a/src/lerobot/datasets/dataset_writer.py
+++ b/src/lerobot/datasets/dataset_writer.py
@@ -250,7 +250,14 @@ class DatasetWriter:
         for key, ft in self._meta.features.items():
             if key in ["index", "episode_index", "task_index"] or ft["dtype"] in ["image", "video"]:
                 continue
-            episode_buffer[key] = np.stack(episode_buffer[key])
+            stacked_values = np.stack(episode_buffer[key])
+
+            # `shape=(1,)` numeric features are serialized as `datasets.Value`, which expects scalars.
+            # Normalizing to `(N,)` keeps save semantics stable across dependency versions.
+            if tuple(ft["shape"]) == (1,) and ft["dtype"] != "string":
+                stacked_values = stacked_values.reshape(episode_length)
+
+            episode_buffer[key] = stacked_values
 
         # Wait for image writer to end, so that episode stats over images can be computed
         self._wait_image_writer()
diff --git a/tests/datasets/test_datasets.py b/tests/datasets/test_datasets.py
index ba9b64812..19c314fd6 100644
--- a/tests/datasets/test_datasets.py
+++ b/tests/datasets/test_datasets.py
@@ -24,6 +24,7 @@ import torch
 
 pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
 
+import datasets
 from huggingface_hub import HfApi
 from PIL import Image
 from safetensors.torch import load_file
@@ -360,6 +361,41 @@ def test_add_frame_image_pil(image_dataset):
     assert dataset[0]["image"].shape == torch.Size(DUMMY_CHW)
 
 
+@pytest.mark.parametrize(
+    "dtype,np_dtype,values,assert_fn",
+    [
+        ("float32", np.float32, [1.0, 2.0], np.testing.assert_allclose),
+        ("int64", np.int64, [1, 2], np.testing.assert_array_equal),
+        ("bool", np.bool_, [True, False], np.testing.assert_array_equal),
+    ],
+    ids=["float32", "int64", "bool"],
+)
+def test_save_episode_shape_1_scalar_is_scalarized_before_hf_encoding(
+    tmp_path, empty_lerobot_dataset_factory, monkeypatch, dtype, np_dtype, values, assert_fn
+):
+    features = {"state": {"dtype": dtype, "shape": (1,), "names": None}}
+    dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
+    dataset.add_frame({"state": np.array([values[0]], dtype=np_dtype), "task": "Dummy task"})
+    dataset.add_frame({"state": np.array([values[1]], dtype=np_dtype), "task": "Dummy task"})
+
+    captured = {}
+    original_from_dict = datasets.Dataset.from_dict
+
+    def _from_dict_spy(cls, mapping, *args, **kwargs):
+        captured["state"] = mapping["state"]
+        return original_from_dict(mapping, *args, **kwargs)
+
+    monkeypatch.setattr(datasets.Dataset, "from_dict", classmethod(_from_dict_spy))
+
+    dataset.save_episode()
+    dataset.finalize()
+
+    assert "state" in captured
+    assert isinstance(captured["state"], np.ndarray)
+    assert captured["state"].shape == (2,)
+    assert_fn(captured["state"], np.array(values, dtype=np_dtype))
+
+
 def test_set_image_transforms_applies_transparently(image_dataset):
     dataset = image_dataset
     dataset.add_frame({"image": np.random.rand(*DUMMY_CHW), "task": "Dummy task"})

From dfdc48a7f131c89ade51e322f5c11180b8509c72 Mon Sep 17 00:00:00 2001
From: "Roham Z. Nobari" <rzninvo@gmail.com>
Date: Tue, 19 May 2026 16:54:25 +0200
Subject: [PATCH 07/17] fix(datasets): bound VideoDecoderCache to prevent OOM
 on large datasets (#3614)

VideoDecoderCache used an unbounded dict keyed on absolute path, with no
eviction in the standard LeRobotDataset path. With shuffled iteration over
datasets that have many distinct mp4 files, every DataLoader worker
accumulated one cached (VideoDecoder, fsspec file handle) pair per distinct
path it had ever touched. Per-entry cost is ~3-5 MB of host RAM plus one
open FD; at ~8 k entries this is roughly 30 GB per worker.

This was hit in the wild during a SmolVLA training run on a 4,195-episode
SO-101 dataset (8,390 mp4s, two cameras per episode). dmesg showed
anon-rss climbing to 34.9 GB on a single pt_data_worker before the OOM
killer fired ~30 min into training; with --num_workers=8 the per-worker
peak halved to 17.9 GB, which is the expected inverse-scaling signature
when the leak is per-decode and the workload is split across workers. The
working workaround on the affected platform was --dataset.video_backend=pyav,
because the pyav path opens/closes per call and never touches this cache.

Switch the backing store to an OrderedDict and evict LRU entries when the
cap is reached, closing the evicted file handle inside the lock so we do
not leak FDs either. Default cap is DEFAULT_DECODER_CACHE_SIZE = 100,
overridable via LEROBOT_VIDEO_DECODER_CACHE_SIZE or by passing max_size=
to the constructor; max_size=None restores the legacy unbounded behaviour
for callers that need it.

Validation on the original failing workload (decode_video_frames_torchcodec
called over real mp4s from the affected SO-101 dataset):

  unbounded:    300 files  ->  +1087 MB host RSS,  cache=300, still climbing
  cap=50:       500 files  ->   +266 MB host RSS,  cache=50,  stable
  cap=50:      2000 calls  ->   +312 MB host RSS,  cache=50,  stable
  cap=100:     1000 calls  ->   +470 MB host RSS,  cache=100, stable

Three independent seeded runs at cap=50 agreed to within 1% (263 / 266 /
265 MB delta), and the 2000-call multi-pass run shows RSS plateaus after
the cap is reached instead of drifting.

Tests in tests/datasets/test_video_decoder_cache.py cover:
default-is-bounded, size cap, LRU ordering, FD close on eviction, FD close
on clear(), cache-hit invariance, max_size=None fallback, and env-var
override. No regressions in test_video_encoding.py, test_streaming.py, or
test_dataset_reader.py (73 prior tests still pass alongside the 8 new ones).
---
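
The eviction policy described in the commit message can be sketched without
torchcodec or fsspec. In the sketch below, `BoundedHandleCache` and its `opener`
argument are hypothetical stand-ins: the real `VideoDecoderCache` stores
`(decoder, file_handle)` pairs and opens the files itself. Only the
`OrderedDict` bookkeeping is meant to match.

```python
import contextlib
import io
from collections import OrderedDict
from threading import Lock


class BoundedHandleCache:
    """Thread-safe LRU cache of closable handles (illustrative stand-in only)."""

    def __init__(self, opener, max_size=100):
        if max_size is not None and max_size <= 0:
            raise ValueError(f"max_size must be positive or None; got {max_size}")
        self._opener = opener  # callable: key -> object with a close() method
        self.max_size = max_size  # None disables eviction
        self._cache = OrderedDict()
        self._lock = Lock()

    def __contains__(self, key):
        with self._lock:
            return key in self._cache

    def __len__(self):
        with self._lock:
            return len(self._cache)

    def get(self, key):
        with self._lock:
            if key in self._cache:
                # Cache hit: mark the entry as most recently used.
                self._cache.move_to_end(key)
                return self._cache[key]
            handle = self._opener(key)
            self._cache[key] = handle
            if self.max_size is not None:
                # Pop from the front (least recently used) until under the cap,
                # closing each evicted handle so no descriptor is leaked.
                while len(self._cache) > self.max_size:
                    _, evicted = self._cache.popitem(last=False)
                    with contextlib.suppress(Exception):
                        evicted.close()
            return handle


if __name__ == "__main__":
    cache = BoundedHandleCache(opener=io.StringIO, max_size=2)
    for key in ["a", "b", "a", "c"]:
        cache.get(key)
    print(sorted(key for key in "abc" if key in cache))
```

Doing the eviction while the lock is held keeps the size check and the close
atomic with respect to other threads calling `get`.
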
 src/lerobot/datasets/video_utils.py        | 103 ++++++++++++---
 tests/datasets/test_video_decoder_cache.py | 140 +++++++++++++++++++++
 2 files changed, 227 insertions(+), 16 deletions(-)
 create mode 100644 tests/datasets/test_video_decoder_cache.py

diff --git a/src/lerobot/datasets/video_utils.py b/src/lerobot/datasets/video_utils.py
index 99122381a..84ab56e08 100644
--- a/src/lerobot/datasets/video_utils.py
+++ b/src/lerobot/datasets/video_utils.py
@@ -17,11 +17,13 @@ import contextlib
 import glob
 import importlib
 import logging
+import os
 import queue
 import shutil
 import tempfile
 import threading
 import warnings
+from collections import OrderedDict
 from dataclasses import asdict, dataclass, field
 from fractions import Fraction
 from pathlib import Path
@@ -191,15 +193,70 @@ def decode_video_frames_pyav(
     return closest_frames
 
 
-class VideoDecoderCache:
-    """Thread-safe cache for video decoders to avoid expensive re-initialization."""
+DEFAULT_DECODER_CACHE_SIZE = 100
+"""Default LRU capacity for :class:`VideoDecoderCache`.
 
-    def __init__(self):
-        self._cache: dict[str, tuple[Any, Any]] = {}
+Sized to comfortably hold a small rolling window of episodes' worth of decoders
+(typical recipes: 2-4 cameras per episode × tens of episodes in flight) while
+bounding host RAM. Each cached entry retains a torchcodec ``VideoDecoder`` plus
+an open ``fsspec`` file handle — on the order of a few MB per entry. Override
+via the ``LEROBOT_VIDEO_DECODER_CACHE_SIZE`` env var or by passing ``max_size``
+to the constructor (``None`` restores the legacy unbounded behaviour).
+"""
+
+
+def _default_max_cache_size() -> int | None:
+    raw = os.environ.get("LEROBOT_VIDEO_DECODER_CACHE_SIZE")
+    if raw is None:
+        return DEFAULT_DECODER_CACHE_SIZE
+    raw = raw.strip().lower()
+    if raw in ("", "none", "unbounded", "-1"):
+        return None
+    try:
+        value = int(raw)
+    except ValueError as e:
+        raise ValueError(
+            f"LEROBOT_VIDEO_DECODER_CACHE_SIZE must be a positive integer, 'none', 'unbounded', or '-1'; got {raw!r}"
+        ) from e
+    if value <= 0:
+        raise ValueError(f"LEROBOT_VIDEO_DECODER_CACHE_SIZE must be positive; got {value}")
+    return value
+
+
+class VideoDecoderCache:
+    """Thread-safe LRU cache for torchcodec ``VideoDecoder`` instances.
+
+    Cached entries hold a ``VideoDecoder`` plus the open ``fsspec`` file handle
+    backing it. When the cache is full and a new path is requested, the
+    least-recently-used entry is evicted and its file handle is closed. This
+    bounds host-RAM growth when iterating over datasets with many distinct
+    video files (otherwise each ``DataLoader`` worker pins every decoder it has
+    ever opened until the process exits).
+
+    Args:
+        max_size: Maximum number of decoders to retain. ``None`` disables
+            eviction and restores legacy unbounded behaviour. Defaults to the
+            value of ``LEROBOT_VIDEO_DECODER_CACHE_SIZE`` if set, otherwise
+            :data:`DEFAULT_DECODER_CACHE_SIZE`.
+    """
+
+    _SENTINEL: ClassVar[object] = object()
+
+    def __init__(self, max_size: int | None | object = _SENTINEL):
+        if max_size is VideoDecoderCache._SENTINEL:
+            max_size = _default_max_cache_size()
+        if max_size is not None and max_size <= 0:
+            raise ValueError(f"max_size must be positive or None; got {max_size}")
+        self.max_size: int | None = max_size  # type: ignore[assignment]
+        self._cache: OrderedDict[str, tuple[Any, Any]] = OrderedDict()
         self._lock = Lock()
 
+    def __contains__(self, video_path: object) -> bool:
+        with self._lock:
+            return str(video_path) in self._cache
+
     def get_decoder(self, video_path: str):
-        """Get a cached decoder or create a new one."""
+        """Get a cached decoder or create a new one, evicting LRU if at capacity."""
         if importlib.util.find_spec("torchcodec"):
             from torchcodec.decoders import VideoDecoder
         else:
@@ -211,22 +268,36 @@ class VideoDecoderCache:
         video_path = str(video_path)
 
         with self._lock:
-            if video_path not in self._cache:
-                file_handle = fsspec.open(video_path).__enter__()
-                try:
-                    decoder = VideoDecoder(file_handle, seek_mode="approximate")
-                except Exception:
-                    file_handle.close()
-                    raise
-                self._cache[video_path] = (decoder, file_handle)
+            entry = self._cache.get(video_path)
+            if entry is not None:
+                self._cache.move_to_end(video_path)
+                return entry[0]
 
-            return self._cache[video_path][0]
+            file_handle = fsspec.open(video_path).__enter__()
+            try:
+                decoder = VideoDecoder(file_handle, seek_mode="approximate")
+            except Exception:
+                file_handle.close()
+                raise
+            self._cache[video_path] = (decoder, file_handle)
+
+            # Evict LRU entries until we are back under the cap. We close
+            # evicted file handles immediately; the associated ``VideoDecoder``
+            # is released to the GC when its last reference goes away.
+            if self.max_size is not None:
+                while len(self._cache) > self.max_size:
+                    _evicted_path, (_evicted_decoder, evicted_handle) = self._cache.popitem(last=False)
+                    with contextlib.suppress(Exception):
+                        evicted_handle.close()
+
+            return decoder
 
     def clear(self):
-        """Clear the cache and close file handles."""
+        """Clear the cache and close all file handles."""
         with self._lock:
             for _, file_handle in self._cache.values():
-                file_handle.close()
+                with contextlib.suppress(Exception):
+                    file_handle.close()
             self._cache.clear()
 
     def size(self) -> int:
diff --git a/tests/datasets/test_video_decoder_cache.py b/tests/datasets/test_video_decoder_cache.py
new file mode 100644
index 000000000..6e69f8403
--- /dev/null
+++ b/tests/datasets/test_video_decoder_cache.py
@@ -0,0 +1,140 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for ``lerobot.datasets.video_utils.VideoDecoderCache``.
+
+These cover the LRU bounding + file-handle release behaviour added to prevent
+unbounded growth when iterating over datasets with many distinct video files
+(observed: ~35 GB anon-rss per DataLoader worker on an 8 k-file dataset).
+"""
+
+import shutil
+from pathlib import Path
+
+import pytest
+
+pytest.importorskip("torchcodec", reason="torchcodec is required (install lerobot[dataset])")
+
+from lerobot.datasets.video_utils import VideoDecoderCache  # noqa: E402
+
+TEST_ARTIFACTS_DIR = Path(__file__).resolve().parent.parent / "artifacts" / "encoded_videos"
+SRC_CLIP = TEST_ARTIFACTS_DIR / "clip_4frames.mp4"
+
+
+def _make_distinct_clips(tmp_path: Path, n: int) -> list[Path]:
+    """Copy the small reference mp4 to ``n`` distinct paths.
+
+    The cache keys on absolute path, so distinct paths force distinct cache entries
+    even though the file contents are identical.
+    """
+    assert SRC_CLIP.exists(), f"missing test artifact {SRC_CLIP}"
+    paths = []
+    for i in range(n):
+        dst = tmp_path / f"clip_{i:04d}.mp4"
+        shutil.copyfile(SRC_CLIP, dst)
+        paths.append(dst)
+    return paths
+
+
+class TestVideoDecoderCacheBounded:
+    def test_default_cache_is_bounded(self):
+        """The default cache must have a finite ``max_size`` to bound RSS growth."""
+        cache = VideoDecoderCache()
+        assert cache.max_size is not None, "default cache must be bounded"
+        assert cache.max_size > 0
+
+    def test_size_capped_at_max_size(self, tmp_path):
+        """``get_decoder`` for >``max_size`` distinct paths must NOT grow without bound."""
+        paths = _make_distinct_clips(tmp_path, n=5)
+        cache = VideoDecoderCache(max_size=2)
+        for p in paths:
+            cache.get_decoder(p)
+        assert cache.size() == 2
+
+    def test_evicts_least_recently_used(self, tmp_path):
+        """Re-accessing an entry must promote it; the LRU entry is the one evicted."""
+        paths = _make_distinct_clips(tmp_path, n=3)
+        cache = VideoDecoderCache(max_size=2)
+
+        cache.get_decoder(paths[0])
+        cache.get_decoder(paths[1])
+        cache.get_decoder(paths[0])  # promote paths[0] to MRU; paths[1] is now LRU
+        cache.get_decoder(paths[2])  # should evict paths[1]
+
+        assert str(paths[0]) in cache  # MRU stays
+        assert str(paths[1]) not in cache  # LRU evicted
+        assert str(paths[2]) in cache  # newest stays
+
+    def test_eviction_closes_file_handle(self, tmp_path):
+        """Evicting an entry must close its fsspec file handle (otherwise we leak FDs)."""
+        paths = _make_distinct_clips(tmp_path, n=2)
+        cache = VideoDecoderCache(max_size=1)
+
+        cache.get_decoder(paths[0])
+        # Reach into the cache to capture the handle before it is evicted. This is
+        # the only assertion in the suite that touches a private attribute, and it
+        # is the most direct way to prove the file descriptor is actually released.
+        evicted_handle = cache._cache[str(paths[0])][1]
+        assert evicted_handle.closed is False
+
+        cache.get_decoder(paths[1])  # forces eviction of paths[0]
+
+        assert evicted_handle.closed is True
+
+    def test_clear_closes_all_file_handles(self, tmp_path):
+        """``clear()`` must close every cached file handle."""
+        paths = _make_distinct_clips(tmp_path, n=3)
+        cache = VideoDecoderCache(max_size=10)
+
+        for p in paths:
+            cache.get_decoder(p)
+        handles = [entry[1] for entry in cache._cache.values()]
+        assert all(not h.closed for h in handles)
+
+        cache.clear()
+
+        assert cache.size() == 0
+        assert all(h.closed for h in handles)
+
+    def test_hit_does_not_reopen_or_evict(self, tmp_path):
+        """A cache hit must return the same decoder instance without touching the cap."""
+        paths = _make_distinct_clips(tmp_path, n=1)
+        cache = VideoDecoderCache(max_size=2)
+
+        first = cache.get_decoder(paths[0])
+        second = cache.get_decoder(paths[0])
+
+        assert first is second
+        assert cache.size() == 1
+
+    def test_unbounded_when_max_size_none(self, tmp_path):
+        """``max_size=None`` preserves the legacy unbounded behaviour."""
+        paths = _make_distinct_clips(tmp_path, n=4)
+        cache = VideoDecoderCache(max_size=None)
+        for p in paths:
+            cache.get_decoder(p)
+        assert cache.size() == 4
+
+    def test_env_var_overrides_default(self, tmp_path, monkeypatch):
+        """``LEROBOT_VIDEO_DECODER_CACHE_SIZE`` env var sets the default ``max_size``."""
+        monkeypatch.setenv("LEROBOT_VIDEO_DECODER_CACHE_SIZE", "3")
+        cache = VideoDecoderCache()
+        assert cache.max_size == 3
+
+        paths = _make_distinct_clips(tmp_path, n=5)
+        for p in paths:
+            cache.get_decoder(p)
+        assert cache.size() == 3

From f4b834844e34f7dcf8245e17818e6bdd9ebdcaff Mon Sep 17 00:00:00 2001
From: Virgileboat <116651491+Virgileboat@users.noreply.github.com>
Date: Thu, 21 May 2026 11:44:04 +0200
Subject: [PATCH 08/17] Feat/clean can bus (#3526)

* change timeout for handshake

* enforce last state read when query

* change import order

* fix(motors): flush stale robstride RX and harden feedback drain

* robstride: remove redundant timeout and max_messages casts

* bugfix + %-style

* update exception catch
---
 src/lerobot/motors/robstride/robstride.py | 118 ++++++++++++++++++----
 src/lerobot/motors/robstride/tables.py    |   3 +-
 2 files changed, 102 insertions(+), 19 deletions(-)
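
Reviewer note: the drain-until-quiet idea in this patch can be sketched in isolation as below. This is a minimal illustration, not the code in `robstride.py`. `FakeBus`, `drain_until_quiet`, and `map_by_first_byte` are hypothetical names, frames are modelled as plain `bytes` payloads instead of `can.Message` objects, and the only behaviour assumed of the bus is that `recv(timeout=...)` returns `None` when nothing arrives in time.

```python
from collections import deque


class FakeBus:
    """Hypothetical stand-in for a CAN bus: recv() returns None when idle."""

    def __init__(self, frames):
        self._frames = deque(frames)

    def recv(self, timeout=None):
        # A real bus would block for up to `timeout` seconds; the fake one never blocks.
        return self._frames.popleft() if self._frames else None


def drain_until_quiet(bus, timeout=0.003, max_messages=4096):
    """Collect frames until one recv() call times out or the safety cap is hit."""
    out = []
    max_messages = max(1, max_messages)
    timeout = max(0.0, timeout)
    while len(out) < max_messages:
        msg = bus.recv(timeout=timeout)
        if msg is None:  # quiet gap: nothing arrived within the poll timeout
            break
        out.append(msg)
    return out


def map_by_first_byte(frames, recv_id_to_motor):
    """Keep the latest payload per motor, keyed on payload byte 0.

    Empty frames and frames whose first byte is not a known recv_id are skipped.
    """
    latest = {}
    for data in frames:
        if len(data) < 1:
            continue
        name = recv_id_to_motor.get(data[0])
        if name is not None:
            latest[name] = data  # a later frame replaces an earlier one
    return latest
```

Because every frame is decoded after the drain, interleaved replies from different motors are all kept, and a motor missing from the result is the one to report as a packet drop.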

diff --git a/src/lerobot/motors/robstride/robstride.py b/src/lerobot/motors/robstride/robstride.py
index ecde01e9a..359fc9385 100644
--- a/src/lerobot/motors/robstride/robstride.py
+++ b/src/lerobot/motors/robstride/robstride.py
@@ -43,6 +43,7 @@ from .tables import (
     CAN_CMD_SET_ZERO,
     DEFAULT_BAUDRATE,
     DEFAULT_TIMEOUT_MS,
+    HANDSHAKE_TIMEOUT_S,
     MODEL_RESOLUTION,
     MOTOR_LIMIT_PARAMS,
     NORMALIZED_DATA,
@@ -215,14 +216,16 @@ class RobstrideMotorsBus(MotorsBusBase):
             self._is_connected = False
             raise ConnectionError(f"Failed to connect to CAN bus: {e}") from e
 
-    def _query_status_via_clear_fault(self, motor: NameOrID) -> tuple[bool, can.Message | None]:
+    def _query_status_via_clear_fault(
+        self, motor: NameOrID, timeout: float = RUNNING_TIMEOUT
+    ) -> tuple[bool, can.Message | None]:
         motor_name = self._get_motor_name(motor)
         motor_id = self._get_motor_id(motor_name)
         recv_id = self._get_motor_recv_id(motor_name)
         data = [0xFF] * 7 + [CAN_CMD_CLEAR_FAULT]
         msg = can.Message(arbitration_id=motor_id, data=data, is_extended_id=False)
         self._bus().send(msg)
-        return self._recv_status_via_clear_fault(expected_recv_id=recv_id)
+        return self._recv_status_via_clear_fault(expected_recv_id=recv_id, timeout=timeout)
 
     def _recv_status_via_clear_fault(
         self, expected_recv_id: int | None = None, timeout: float = RUNNING_TIMEOUT
@@ -280,7 +283,7 @@ class RobstrideMotorsBus(MotorsBusBase):
         faulted_motors = []
 
         for motor_name in self.motors:
-            has_fault, msg = self._query_status_via_clear_fault(motor_name)
+            has_fault, msg = self._query_status_via_clear_fault(motor_name, timeout=HANDSHAKE_TIMEOUT_S)
             if msg is None:
                 missing_motors.append(motor_name)
             elif has_fault:
@@ -505,6 +508,87 @@ class RobstrideMotorsBus(MotorsBusBase):
 
         return responses
 
+    def _recv_all_messages_until_quiet(
+        self,
+        *,
+        timeout: float = RUNNING_TIMEOUT,
+        max_messages: int = 4096,
+    ) -> list[can.Message]:
+        """
+        Receive frames until the bus goes quiet.
+
+        Args:
+            timeout: Poll timeout used for each recv() call. Collection stops
+                when one recv() times out (quiet gap).
+            max_messages: Safety cap to prevent unbounded loops.
+        """
+        out: list[can.Message] = []
+        max_messages = max(1, max_messages)
+        timeout = max(0.0, timeout)
+
+        try:
+            while len(out) < max_messages:
+                msg = self._bus().recv(timeout=timeout)
+                if msg is None:
+                    break
+                out.append(msg)
+        except (can.CanError, OSError) as e:
+            logger.debug(f"Error draining CAN RX queue on {self.port}: {e}")
+
+        return out
+
+    def _process_feedback_messages(self, messages: list[can.Message]) -> set[int]:
+        """
+        Decode all received feedback frames and update cached motor states.
+
+        Returns:
+            Set of payload recv_ids that were successfully mapped to motors.
+        """
+        processed_recv_ids: set[int] = set()
+        for msg in messages:
+            if len(msg.data) < 1:
+                logger.debug(
+                    f"Dropping short CAN frame on {self.port} "
+                    f"(arb=0x{int(msg.arbitration_id):02X}, data={bytes(msg.data).hex()})"
+                )
+                continue
+
+            recv_id = int(msg.data[0])
+            motor_name = self._recv_id_to_motor.get(recv_id)
+            if motor_name is None:
+                logger.debug(
+                    f"Unmapped CAN frame on {self.port} "
+                    f"(arb=0x{int(msg.arbitration_id):02X}, recv_id=0x{recv_id:02X}, data={bytes(msg.data).hex()})"
+                )
+                continue
+
+            self._process_response(motor_name, msg)
+            processed_recv_ids.add(recv_id)
+
+        return processed_recv_ids
+
+    def flush_rx_queue(self, poll_timeout_s: float = 0.0005, max_messages: int = 4096) -> int:
+        """
+        Drain pending RX frames from the CAN interface.
+
+        This is used by higher-level controllers to drop stale feedback before issuing
+        a fresh read cycle, so subsequent state reads are based on the most recent replies.
+        It should also be called once when a controller instance is created/connected,
+        to clear residual frames left on the interface from previous sessions.
+        """
+        drained = 0
+        poll_timeout_s = max(0.0, poll_timeout_s)
+        max_messages = max(1, max_messages)
+        try:
+            while drained < max_messages:
+                msg = self._bus().recv(timeout=poll_timeout_s)
+                if msg is None:
+                    break
+                drained += 1
+        except (can.CanError, OSError) as e:
+            logger.debug(f"Failed to flush CAN RX queue on {self.port}: {e}")
+        return drained
+
     def _speed_control(
         self,
         motor: NameOrID,
@@ -644,11 +728,14 @@ class RobstrideMotorsBus(MotorsBusBase):
             msg = can.Message(arbitration_id=motor_id, data=data, is_extended_id=False)
             self._bus().send(msg)
             recv_id_to_motor[self._get_motor_recv_id(motor)] = motor_name
+        # Read every feedback frame until RX goes quiet, then decode all of them.
+        # This avoids dropping useful frames when responses from different motors interleave.
+        messages = self._recv_all_messages_until_quiet()
+        processed_recv_ids = self._process_feedback_messages(messages)
 
-        responses = self._recv_all_responses(list(recv_id_to_motor.keys()), timeout=RUNNING_TIMEOUT)
         for recv_id, motor_name in recv_id_to_motor.items():
-            if msg := responses.get(recv_id):
-                self._process_response(motor_name, msg)
+            if recv_id not in processed_recv_ids:
+                logger.warning(f"Packet drop: {motor_name} (ID: 0x{recv_id:02X}). Using last known state.")
 
     def _float_to_uint(self, x: float, x_min: float, x_max: float, bits: int) -> int:
         """Convert float to unsigned integer for CAN transmission."""
@@ -711,7 +798,10 @@ class RobstrideMotorsBus(MotorsBusBase):
         try:
             self._decode_motor_state(msg.data)
         except Exception as e:
-            logger.warning(f"Failed to decode response from {motor}: {e}")
+            logger.warning(
+                f"Failed to decode response from {motor} "
+                f"(arb=0x{int(msg.arbitration_id):02X}, data={bytes(msg.data).hex()}): {e}"
+            )
 
     def _get_cached_value(self, motor: str, data_name: str) -> Value:
         """Retrieve a specific value from the state cache."""
@@ -848,20 +938,12 @@ class RobstrideMotorsBus(MotorsBusBase):
             self._bus().send(msg)
             updated_motors.append(motor)
 
-        expected_recv_ids = [self._get_motor_recv_id(motor) for motor in updated_motors]
-        responses = self._recv_all_responses(expected_recv_ids, timeout=RUNNING_TIMEOUT)
-
-        for response in responses.values():
-            payload_motor_name = self._recv_id_to_motor.get(response.data[0])
-            if payload_motor_name is not None:
-                self._process_response(payload_motor_name, response)
-            else:
-                # Fallback: still attempt to decode based on payload byte0 mapping.
-                self._decode_motor_state(response.data)
+        messages = self._recv_all_messages_until_quiet()
+        processed_recv_ids = self._process_feedback_messages(messages)
 
         for motor in updated_motors:
             recv_id = self._get_motor_recv_id(motor)
-            if recv_id not in responses:
+            if recv_id not in processed_recv_ids:
                 logger.warning(f"Packet drop: {motor} (ID: 0x{recv_id:02X}). Using last known state.")
 
     def read_calibration(self) -> dict[str, MotorCalibration]:
diff --git a/src/lerobot/motors/robstride/tables.py b/src/lerobot/motors/robstride/tables.py
index 2fc1a97b0..06b90df3a 100644
--- a/src/lerobot/motors/robstride/tables.py
+++ b/src/lerobot/motors/robstride/tables.py
@@ -114,7 +114,8 @@ CAN_CMD_SAVE_PARAM = 0xAA
 CAN_PARAM_ID = 0x7FF
 
 
-RUNNING_TIMEOUT = 0.001
+RUNNING_TIMEOUT = 0.003
+HANDSHAKE_TIMEOUT_S = 0.05
 PARAM_TIMEOUT = 0.01
 
 STATE_CACHE_TTL_S = 0.02

From bac4f61eae9bf465d79ebb6f03e9fba52304b1c2 Mon Sep 17 00:00:00 2001
From: Khalil Meftah <khalil.meftah@huggingface.co>
Date: Thu, 21 May 2026 14:32:10 +0200
Subject: [PATCH 09/17] refactor: support custom progress parquet overlays
 (#3640)

---
 examples/dataset/create_progress_videos.py | 52 +++++++++++++++-------
 1 file changed, 37 insertions(+), 15 deletions(-)
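
Reviewer note: a small sketch of how the new `--progress-file` option could flow from the command line to a path lookup. It only assumes what the diff shows (the flag name and its default). `build_parser` and `resolve_progress_path` are hypothetical helpers written for this note and are not functions in `create_progress_videos.py`.

```python
import argparse
from pathlib import Path


def build_parser():
    """Parser exposing only the option added in this patch (illustrative)."""
    parser = argparse.ArgumentParser(description="progress overlay sketch")
    parser.add_argument(
        "--progress-file",
        type=str,
        default="sarm_progress.parquet",
        help="Filename of the per-frame progress parquet inside the dataset repo.",
    )
    return parser


def resolve_progress_path(local_path, progress_file):
    """Return the parquet path under the snapshot root, or None if it is missing."""
    candidate = Path(local_path) / progress_file
    return candidate if candidate.exists() else None
```

argparse turns `--progress-file` into the attribute `progress_file`, which matches the keyword argument threaded through `process_dataset` in the diff.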

diff --git a/examples/dataset/create_progress_videos.py b/examples/dataset/create_progress_videos.py
index 5f98d2cea..cb85a9d3a 100644
--- a/examples/dataset/create_progress_videos.py
+++ b/examples/dataset/create_progress_videos.py
@@ -15,10 +15,12 @@
 # limitations under the License.
 
 """
-Create MP4 (or GIF) videos with sarm_progress overlay for specified episodes.
+Create MP4 (or GIF) videos with per-frame progress overlay for specified episodes.
 
 Downloads datasets from HuggingFace, seeks directly into the episode segment
 of the source video, draws a progress line on each frame, and writes the result.
+The progress data is read from a parquet file that lives alongside the dataset
+(configurable via ``--progress-file``).
 
 Usage:
     python examples/dataset/create_progress_videos.py \
@@ -56,22 +58,26 @@ SCORE_FONT_SCALE = 0.8
 TASK_FONT_SCALE = 0.55
 
 
-def download_episode_metadata(repo_id: str, episode: int) -> Path:
-    """Download only the metadata and sarm_progress files for a dataset.
+def download_episode_metadata(
+    repo_id: str, episode: int, progress_file: str = "sarm_progress.parquet"
+) -> Path:
+    """Download only the metadata and per-frame progress file for a dataset.
 
     Args:
         repo_id: HuggingFace dataset repository ID.
         episode: Episode index (used for logging only; all meta is fetched).
+        progress_file: Filename of the per-frame progress parquet inside the
+            dataset repo.
 
     Returns:
         Local cache path for the downloaded snapshot.
     """
-    logging.info("[1/4] Downloading metadata for %s (episode %d) ...", repo_id, episode)
+    logging.info("[1/4] Downloading metadata + %s for %s (episode %d) ...", progress_file, repo_id, episode)
     local_path = Path(
         snapshot_download(
             repo_id=repo_id,
             repo_type="dataset",
-            allow_patterns=["meta/**", "sarm_progress.parquet"],
+            allow_patterns=["meta/**", progress_file],
             ignore_patterns=["*.mp4"],
         )
     )
@@ -215,25 +221,28 @@ def download_video_file(repo_id: str, local_path: Path, video_rel: str) -> Path:
     return video_path
 
 
-def load_progress_data(local_path: Path, episode: int) -> np.ndarray | None:
-    """Load sarm_progress values for an episode.
+def load_progress_data(
+    local_path: Path, episode: int, progress_file: str = "sarm_progress.parquet"
+) -> np.ndarray | None:
+    """Load per-frame progress values for an episode.
 
     Args:
         local_path: Dataset cache root.
         episode: Episode index.
+        progress_file: Filename of the per-frame progress parquet.
 
     Returns:
         Sorted (N, 2) array of (frame_index, progress), or None if unavailable.
     """
-    parquet_path = local_path / "sarm_progress.parquet"
+    parquet_path = local_path / progress_file
     if not parquet_path.exists():
-        logging.warning("sarm_progress.parquet not found")
+        logging.warning("%s not found", progress_file)
         return None
     df = pd.read_parquet(parquet_path)
-    logging.info("   sarm_progress.parquet columns: %s", list(df.columns))
+    logging.info("   %s columns: %s", progress_file, list(df.columns))
     episode_df = df[df["episode_index"] == episode].copy()
     if episode_df.empty:
-        logging.warning("No sarm_progress rows for episode %d", episode)
+        logging.warning("No progress rows for episode %d in %s", episode, progress_file)
         return None
     episode_df = episode_df.sort_values("frame_index")
 
@@ -576,6 +585,7 @@ def process_dataset(
     camera_key: str | None,
     output_dir: Path,
     create_gif: bool = False,
+    progress_file: str = "sarm_progress.parquet",
 ) -> Path | None:
     """Full pipeline: download, extract metadata, composite progress, write output.
 
@@ -585,6 +595,8 @@ def process_dataset(
         camera_key: Camera key to use, or None for auto-selection.
         output_dir: Directory to write output files.
         create_gif: If True, also generate a GIF from the MP4.
+        progress_file: Filename of the per-frame progress parquet inside the
+            dataset repo.
 
     Returns:
         Path to the final output file, or None on failure.
@@ -592,7 +604,7 @@ def process_dataset(
     safe_name = repo_id.replace("/", "_")
     logging.info("Processing: %s  |  episode %d", repo_id, episode)
 
-    local_path = download_episode_metadata(repo_id, episode)
+    local_path = download_episode_metadata(repo_id, episode, progress_file)
     logging.info("   Local cache: %s", local_path)
 
     episode_meta = load_episode_meta(local_path, episode, camera_key)
@@ -600,9 +612,9 @@ def process_dataset(
 
     video_path = download_video_file(repo_id, local_path, episode_meta["video_rel"])
 
-    progress_data = load_progress_data(local_path, episode)
+    progress_data = load_progress_data(local_path, episode, progress_file)
     if progress_data is None:
-        logging.error("Could not load sarm_progress data. Skipping overlay.")
+        logging.error("Could not load progress data from %s. Skipping overlay.", progress_file)
         return None
 
     logging.info("   Progress frames: %d", len(progress_data))
@@ -627,7 +639,7 @@ def process_dataset(
 
 def main() -> None:
     parser = argparse.ArgumentParser(
-        description="Create MP4/GIF videos with sarm_progress overlay for dataset episodes."
+        description="Create MP4/GIF videos with per-frame progress overlay for dataset episodes."
     )
     parser.add_argument(
         "--repo-id",
@@ -658,6 +670,15 @@ def main() -> None:
         action="store_true",
         help="Also generate a GIF from the MP4 output.",
     )
+    parser.add_argument(
+        "--progress-file",
+        type=str,
+        default="sarm_progress.parquet",
+        help=(
+            "Filename of the per-frame progress parquet inside the dataset repo "
+            "(default: 'sarm_progress.parquet')."
+        ),
+    )
     args = parser.parse_args()
 
     logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
@@ -670,6 +691,7 @@ def main() -> None:
         camera_key=args.camera_key,
         output_dir=args.output_dir,
         create_gif=args.gif,
+        progress_file=args.progress_file,
     )
 
     if result:

From c0a2e9814df2e78c8c1201ac86a9806e8bf51b8e Mon Sep 17 00:00:00 2001
From: Nikodem Bartnik <39432165+NikodemBartnik@users.noreply.github.com>
Date: Thu, 21 May 2026 22:14:07 +0200
Subject: [PATCH 10/17] fix examples (#3623)

- Fixed broken API examples in the LeRobot Imitation Learning documentation
- Teleoperation with cameras improved by adding a fixed frequency in the loop (without it the camera feed gets very slow)
- Wrapped record example script in main() to avoid problems on Mac
- Previously, the teleoperation example used SO-ARM and the teleoperation-with-cameras example used Koch. All of the examples now use SO-ARM.
- Added section on how to train with HF Jobs - CLI and Python examples
- Replaced lerobot-record with lerobot-rollout in policies examples
---
 docs/source/act.mdx       |  16 +-
 docs/source/groot.mdx     |  10 +-
 docs/source/il_robots.mdx | 317 +++++++++++++++++++++++++-------------
 docs/source/smolvla.mdx   |  16 +-
 4 files changed, 228 insertions(+), 131 deletions(-)
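
Reviewer note: the fixed-frequency loop added to the teleoperation example can be factored into a small helper, sketched below. `run_at_fixed_rate` is a hypothetical name and not a LeRobot API. The `clock` and `sleep` parameters are injected only so the timing logic can be exercised without real hardware or real waiting.

```python
import time


def run_at_fixed_rate(step, target_hz, num_iterations, clock=time.perf_counter, sleep=time.sleep):
    """Call step() at most target_hz times per second for num_iterations iterations."""
    period = 1.0 / target_hz
    for _ in range(num_iterations):
        start = clock()
        step()  # e.g. read observation, read leader action, send action
        remaining = period - (clock() - start)
        if remaining > 0:
            # Sleep only for the unused part of the period; a slow step is not delayed further.
            sleep(remaining)
```

With `target_hz=30`, a step that takes 10 ms leaves roughly 23 ms of sleep per iteration, while a step slower than the period runs back to back with no sleep.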

diff --git a/docs/source/act.mdx b/docs/source/act.mdx
index 8e91edcf9..f64246d7a 100644
--- a/docs/source/act.mdx
+++ b/docs/source/act.mdx
@@ -79,17 +79,13 @@ If your local computer doesn't have a powerful GPU, you can utilize Google Colab
 Once training is complete, you can evaluate your ACT policy using the `lerobot-record` command with your trained policy. This will run inference and record evaluation episodes:
 
 ```bash
-lerobot-record \
-  --robot.type=so100_follower \
+lerobot-rollout \
+  --strategy.type=base \
+  --policy.path=${HF_USER}/act_policy \
+  --robot.type=so101_follower \
   --robot.port=/dev/ttyACM0 \
-  --robot.id=my_robot \
   --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
   --display_data=true \
-  --dataset.repo_id=${HF_USER}/eval_act_your_dataset \
-  --dataset.num_episodes=10 \
-  --dataset.single_task="Your task description" \
-  --dataset.streaming_encoding=true \
-  --dataset.encoder_threads=2 \
-  # --dataset.camera_encoder.vcodec=auto \
-  --policy.path=${HF_USER}/act_policy
+  --duration=60 \
+  --task="Your task description"  # can be skipped for ACT
 ```
diff --git a/docs/source/groot.mdx b/docs/source/groot.mdx
index d69d10a57..a10b5e369 100644
--- a/docs/source/groot.mdx
+++ b/docs/source/groot.mdx
@@ -105,10 +105,12 @@ These results demonstrate GR00T's strong generalization capabilities across dive
 
 ### Evaluate in your hardware setup
 
-Once you have trained your model using your parameters you can run inference in your downstream task. Follow the instructions in [Imitation Learning for Robots](./il_robots). For example:
+Once you have trained your model with your own parameters, you can run inference on your downstream task. Follow the instructions in [Policy Deployment (lerobot-rollout)](./inference). For example:
 
 ```bash
-lerobot-record \
+lerobot-rollout \
+  --strategy.type=sentry \
+  --strategy.upload_every_n_episodes=5 \
   --robot.type=bi_so_follower \
   --robot.left_arm_port=/dev/ttyACM1 \
   --robot.right_arm_port=/dev/ttyACM0 \
@@ -119,14 +121,12 @@ lerobot-record \
   }' \
   --display_data=true \
   --dataset.repo_id=<user>/eval_groot-bimanual  \
-  --dataset.num_episodes=10 \
   --dataset.single_task="Grab and handover the red cube to the other arm" \
   --dataset.streaming_encoding=true \
   --dataset.encoder_threads=2 \
   # --dataset.camera_encoder.vcodec=auto \
   --policy.path=<user>/groot-bimanual \ # your trained model
-  --dataset.episode_time_s=30 \
-  --dataset.reset_time_s=10
+  --duration=600
 ```
 
 ## License
diff --git a/docs/source/il_robots.mdx b/docs/source/il_robots.mdx
index 07789225a..dc2e02737 100644
--- a/docs/source/il_robots.mdx
+++ b/docs/source/il_robots.mdx
@@ -68,13 +68,13 @@ from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
 from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
 
 robot_config = SO101FollowerConfig(
-    port="/dev/tty.usbmodem58760431541",
-    id="my_red_robot_arm",
+    port="/dev/tty.usbmodem5AB90687491",
+    id="my_follower_arm",
 )
 
 teleop_config = SO101LeaderConfig(
-    port="/dev/tty.usbmodem58760431551",
-    id="my_blue_leader_arm",
+    port="/dev/tty.usbmodem5AB90689011",
+    id="my_leader_arm",
 )
 
 robot = SO101Follower(robot_config)
@@ -108,13 +108,13 @@ With `rerun`, you can teleoperate again while simultaneously visualizing the cam
 <hfoption id="Command">
 ```bash
 lerobot-teleoperate \
-    --robot.type=koch_follower \
-    --robot.port=/dev/tty.usbmodem58760431541 \
-    --robot.id=my_awesome_follower_arm \
-    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
-    --teleop.type=koch_leader \
-    --teleop.port=/dev/tty.usbmodem58760431551 \
-    --teleop.id=my_awesome_leader_arm \
+    --robot.type=so101_follower \
+    --robot.port=/dev/tty.usbmodem5AB90687491 \
+    --robot.id=my_follower_arm \
+    --robot.cameras="{front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
+    --teleop.type=so101_leader \
+    --teleop.port=/dev/tty.usbmodem5AB90689011 \
+    --teleop.id=my_leader_arm \
     --display_data=true
 ```
 </hfoption>
@@ -122,34 +122,48 @@ lerobot-teleoperate \
 
 <!-- prettier-ignore-start -->
 ```python
+import time
+from lerobot.teleoperators.so_leader import SO101Leader, SO101LeaderConfig
+from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
 from lerobot.cameras.opencv import OpenCVCameraConfig
-from lerobot.teleoperators.koch_leader import KochLeader, KochLeaderConfig
-from lerobot.robots.koch_follower import KochFollower, KochFollowerConfig
+from lerobot.utils.visualization_utils import init_rerun, log_rerun_data, shutdown_rerun
 
-camera_config = {
-    "front": OpenCVCameraConfig(index_or_path=0, width=1920, height=1080, fps=30)
-}
-
-robot_config = KochFollowerConfig(
-    port="/dev/tty.usbmodem585A0076841",
-    id="my_red_robot_arm",
-    cameras=camera_config
+robot_config = SO101FollowerConfig(
+    port="/dev/tty.usbmodem5AB90687491",
+    id="my_follower_arm",
+    cameras={
+        "wrist": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=30),
+        "top": OpenCVCameraConfig(index_or_path=1, width=640, height=480, fps=30)
+    }
 )
 
-teleop_config = KochLeaderConfig(
-    port="/dev/tty.usbmodem58760431551",
-    id="my_blue_leader_arm",
+teleop_config = SO101LeaderConfig(
+    port="/dev/tty.usbmodem5AB90689011",
+    id="my_leader_arm",
 )
 
-robot = KochFollower(robot_config)
-teleop_device = KochLeader(teleop_config)
+init_rerun(session_name="teleoperation")
+
+robot = SO101Follower(robot_config)
+teleop_device = SO101Leader(teleop_config)
 robot.connect()
 teleop_device.connect()
 
+TARGET_HZ = 30
+TIME_PER_FRAME = 1.0 / TARGET_HZ
+
 while True:
+    start_time = time.perf_counter()
+
     observation = robot.get_observation()
     action = teleop_device.get_action()
     robot.send_action(action)
+    log_rerun_data(observation=observation, action=action)
+
+    elapsed_time = time.perf_counter() - start_time
+    sleep_time = TIME_PER_FRAME - elapsed_time
+    if sleep_time > 0:
+        time.sleep(sleep_time)
 ```
 <!-- prettier-ignore-end -->
 
@@ -202,10 +216,11 @@ lerobot-record \
 <!-- prettier-ignore-start -->
 ```python
 from lerobot.cameras.opencv import OpenCVCameraConfig
-from lerobot.datasets import LeRobotDataset
+from lerobot.datasets.lerobot_dataset import LeRobotDataset
 from lerobot.utils.feature_utils import hw_to_dataset_features
-from lerobot.robots.so_follower import SO100Follower, SO100FollowerConfig
-from lerobot.teleoperators.so_leader import SO100Leader, SO100LeaderConfig
+from lerobot.robots.so_follower import SO101Follower, SO101FollowerConfig
+from lerobot.teleoperators.so_leader.config_so_leader import SO101LeaderConfig
+from lerobot.teleoperators.so_leader.so_leader import SO101Leader
 from lerobot.common.control_utils import init_keyboard_listener
 from lerobot.utils.utils import log_say
 from lerobot.utils.visualization_utils import init_rerun
@@ -218,71 +233,56 @@ EPISODE_TIME_SEC = 60
 RESET_TIME_SEC = 10
 TASK_DESCRIPTION = "My task description"
 
-# Create robot configuration
-robot_config = SO100FollowerConfig(
-    id="my_awesome_follower_arm",
-    cameras={
-        "front": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=FPS) # Optional: fourcc="MJPG" for troubleshooting OpenCV async error.
-    },
-    port="/dev/tty.usbmodem58760434471",
-)
-
-teleop_config = SO100LeaderConfig(
-    id="my_awesome_leader_arm",
-    port="/dev/tty.usbmodem585A0077581",
-)
-
-# Initialize the robot and teleoperator
-robot = SO100Follower(robot_config)
-teleop = SO100Leader(teleop_config)
-
-# Configure the dataset features
-action_features = hw_to_dataset_features(robot.action_features, "action")
-obs_features = hw_to_dataset_features(robot.observation_features, "observation")
-dataset_features = {**action_features, **obs_features}
-
-# Create the dataset
-dataset = LeRobotDataset.create(
-    repo_id="<hf_username>/<dataset_repo_id>",
-    fps=FPS,
-    features=dataset_features,
-    robot_type=robot.name,
-    use_videos=True,
-    image_writer_threads=4,
-)
-
-# Initialize the keyboard listener and rerun visualization
-_, events = init_keyboard_listener()
-init_rerun(session_name="recording")
-
-# Connect the robot and teleoperator
-robot.connect()
-teleop.connect()
-
-# Create the required processors
-teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()
-
-episode_idx = 0
-while episode_idx < NUM_EPISODES and not events["stop_recording"]:
-    log_say(f"Recording episode {episode_idx + 1} of {NUM_EPISODES}")
-
-    record_loop(
-        robot=robot,
-        events=events,
-        fps=FPS,
-        teleop_action_processor=teleop_action_processor,
-        robot_action_processor=robot_action_processor,
-        robot_observation_processor=robot_observation_processor,
-        teleop=teleop,
-        dataset=dataset,
-        control_time_s=EPISODE_TIME_SEC,
-        single_task=TASK_DESCRIPTION,
-        display_data=True,
+def main():
+    # Create robot configuration
+    robot_config = SO101FollowerConfig(
+        port="/dev/tty.usbmodem5AB90687491",
+        id="my_follower_arm",
+        cameras={
+            "wrist": OpenCVCameraConfig(index_or_path=0, width=640, height=480, fps=30),
+            "top": OpenCVCameraConfig(index_or_path=1, width=640, height=480, fps=30)
+        }
     )
 
-    # Reset the environment if not stopping or re-recording
-    if not events["stop_recording"] and (episode_idx < NUM_EPISODES - 1 or events["rerecord_episode"]):
-        log_say("Reset the environment")
+    teleop_config = SO101LeaderConfig(
+        port="/dev/tty.usbmodem5AB90689011",
+        id="my_leader_arm",
+    )
+
+    # Initialize the robot and teleoperator
+    robot = SO101Follower(robot_config)
+    teleop = SO101Leader(teleop_config)
+
+    # Configure the dataset features
+    action_features = hw_to_dataset_features(robot.action_features, "action")
+    obs_features = hw_to_dataset_features(robot.observation_features, "observation")
+    dataset_features = {**action_features, **obs_features}
+
+    # Create the dataset
+    dataset = LeRobotDataset.create(
+        repo_id="<hf_username>/<dataset_repo_id>",
+        fps=FPS,
+        features=dataset_features,
+        robot_type=robot.name,
+        use_videos=True,
+        image_writer_threads=4,
+    )
+
+    # Initialize the keyboard listener and rerun visualization
+    _, events = init_keyboard_listener()
+    init_rerun(session_name="recording")
+
+    # Connect the robot and teleoperator
+    robot.connect()
+    teleop.connect()
+
+    # Create the required processors
+    teleop_action_processor, robot_action_processor, robot_observation_processor = make_default_processors()
+
+    episode_idx = 0
+    while episode_idx < NUM_EPISODES and not events["stop_recording"]:
+        log_say(f"Recording episode {episode_idx + 1} of {NUM_EPISODES}")
+
         record_loop(
             robot=robot,
             events=events,
@@ -291,26 +291,50 @@ while episode_idx < NUM_EPISODES and not events["stop_recording"]:
             robot_action_processor=robot_action_processor,
             robot_observation_processor=robot_observation_processor,
             teleop=teleop,
-            control_time_s=RESET_TIME_SEC,
+            dataset=dataset,
+            control_time_s=EPISODE_TIME_SEC,
             single_task=TASK_DESCRIPTION,
             display_data=True,
         )
 
-    if events["rerecord_episode"]:
-        log_say("Re-recording episode")
-        events["rerecord_episode"] = False
-        events["exit_early"] = False
-        dataset.clear_episode_buffer()
-        continue
+        # Reset the environment if not stopping or re-recording
+        if not events["stop_recording"] and (episode_idx < NUM_EPISODES - 1 or events["rerecord_episode"]):
+            log_say("Reset the environment")
+            record_loop(
+                robot=robot,
+                events=events,
+                fps=FPS,
+                teleop_action_processor=teleop_action_processor,
+                robot_action_processor=robot_action_processor,
+                robot_observation_processor=robot_observation_processor,
+                teleop=teleop,
+                control_time_s=RESET_TIME_SEC,
+                single_task=TASK_DESCRIPTION,
+                display_data=True,
+            )
 
-    dataset.save_episode()
-    episode_idx += 1
+        if events["rerecord_episode"]:
+            log_say("Re-recording episode")
+            events["rerecord_episode"] = False
+            events["exit_early"] = False
+            dataset.clear_episode_buffer()
+            continue
 
-# Clean up
-log_say("Stop recording")
-robot.disconnect()
-teleop.disconnect()
-dataset.push_to_hub()
+        dataset.save_episode()
+        episode_idx += 1
+
+    # Finalize the dataset
+    log_say("Finalizing dataset...")
+    dataset.finalize()
+    # Clean up
+    log_say("Stop recording")
+    robot.disconnect()
+    teleop.disconnect()
+    dataset.push_to_hub()
+
+
+if __name__ == "__main__":
+    main()
 ```
 <!-- prettier-ignore-end -->
 
@@ -348,7 +372,7 @@ The `record` function provides a suite of tools for capturing and managing data
 ##### 2. Checkpointing and Resuming
 
 - Checkpoints are automatically created during recording.
-- If an issue occurs, you can resume by re-running the same command with `--resume=true`. When resuming a recording, `--dataset.num_episodes` must be set to the **number of additional episodes to be recorded**, and not to the targeted total number of episodes in the dataset !
+- If an issue occurs or you want to record additional episodes in the same dataset, you can resume by re-running the same command with `--resume=true`. When resuming a recording, `--dataset.num_episodes` must be set to the **number of additional episodes to be recorded**, not to the targeted total number of episodes in the dataset! Also set `--dataset.root="local_path"`: this local path is where the new part of the dataset is saved, and it is required to resume.
 - To start recording from scratch, **manually delete** the dataset directory.
 
 ##### 3. Recording Parameters
@@ -422,7 +446,7 @@ from lerobot.utils.utils import log_say
 
 episode_idx = 0
 
-robot_config = SO100FollowerConfig(port="/dev/tty.usbmodem58760434471", id="my_awesome_follower_arm")
+robot_config = SO100FollowerConfig(port="/dev/tty.usbmodem5AB90687491", id="my_follower_arm")
 
 robot = SO100Follower(robot_config)
 robot.connect()
@@ -490,6 +514,83 @@ Additionally you can provide extra `tags` or specify a `license` for your model
 
 If your local computer doesn't have a powerful GPU you could utilize Google Colab to train your model by following the [ACT training notebook](./notebooks#training-act).
 
+#### Train using Hugging Face Jobs
+
+Hugging Face Jobs lets you select hardware and run the training in the cloud. If you don't have a powerful GPU, need more VRAM, or just want to train a model faster, use HF Jobs! It's pay-as-you-go, billed per second of use; see the pricing and additional information [here](https://huggingface.co/docs/hub/jobs).
+
+To run the training, use this command:
+
+<hfoptions id="train_with_hf_jobs">
+<hfoption id="Command">
+```bash
+hf jobs run \
+  --flavor a10g-small \
+  --timeout 4h \
+  --secrets HF_TOKEN \
+  huggingface/lerobot-gpu:latest \
+  -- \
+  python -m lerobot.scripts.lerobot_train \
+    --dataset.repo_id=username/dataset \
+    --policy.type=act \
+    --steps=5000 \
+    --batch_size=16 \
+    --policy.device=cuda \
+    --policy.repo_id=username/your_policy \
+    --log_freq=100
+```
+</hfoption>
+<hfoption id="API example">
+
+<!-- prettier-ignore-start -->
+```python
+from huggingface_hub import run_job, get_token
+
+run_name = "act_so101_hf_jobs"
+dataset_id = "username/dataset"
+user_hub_id = "username"
+
+command_args = [
+    "python", "-m", "lerobot.scripts.lerobot_train",
+    "--dataset.repo_id", dataset_id,
+    "--policy.type", "act",
+    "--steps", "5000",
+    "--batch_size", "16",
+    "--num_workers", "4",
+    "--policy.device", "cuda",
+    "--log_freq", "100",
+    "--save_freq", "1000",
+    "--save_checkpoint", "true",
+    "--wandb.enable", "false",
+    "--policy.repo_id", f"{user_hub_id}/{run_name}"
+]
+
+print(f"Submitting job '{run_name}' to Hugging Face Infrastructure...")
+
+job_info = run_job(
+    image="huggingface/lerobot-gpu:latest",
+    command=command_args,
+    flavor="a10g-small",
+    timeout="4h",
+    secrets={"HF_TOKEN": get_token()}
+)
+
+print("\n🚀 Job successfully launched!")
+print(f"🔹 Job ID: {job_info.id}")
+print(f"🔗 Live UI Dashboard & Logs: {job_info.url}")
+```
+<!-- prettier-ignore-end -->
+
+</hfoption>
+</hfoptions>
+
+You can modify the `--flavor` to use different hardware, for example: `t4-small`, `a100-large`, `h200`. Use `hf jobs hardware` to see the full list with pricing.
+Depending on the model you want to train and the hardware you selected, you can also adjust `--batch_size` and `--num_workers`.
+For longer training sessions, increase the `--timeout`.
+
+Once the training has started, you can go to [Jobs](https://huggingface.co/settings/jobs) to check whether your job is running and to see all of its output. Scheduling a job sometimes takes a few minutes, so be patient.
+
+After training, the model is pushed to the Hub and you can use it like any other model with LeRobot.
+
 #### Upload policy checkpoints
 
 Once training is done, upload the latest checkpoint with:
diff --git a/docs/source/smolvla.mdx b/docs/source/smolvla.mdx
index 6c63c5d11..e28270c9b 100644
--- a/docs/source/smolvla.mdx
+++ b/docs/source/smolvla.mdx
@@ -97,22 +97,22 @@ Similarly for when recording an episode, it is recommended that you are logged i
 Once you are logged in, you can run inference in your setup by doing:
 
 ```bash
-lerobot-record \
+lerobot-rollout \
+  --strategy.type=base \
   --robot.type=so101_follower \
   --robot.port=/dev/ttyACM0 \ # <- Use your port
   --robot.id=my_blue_follower_arm \ # <- Use your robot id
   --robot.cameras="{ front: {type: opencv, index_or_path: 8, width: 640, height: 480, fps: 30}}" \ # <- Use your cameras
-  --dataset.single_task="Grasp a lego block and put it in the bin." \ # <- Use the same task description you used in your dataset recording
-  --dataset.repo_id=${HF_USER}/eval_DATASET_NAME_test \  # <- This will be the dataset name on HF Hub
-  --dataset.episode_time_s=50 \
-  --dataset.num_episodes=10 \
-  --dataset.streaming_encoding=true \
-  --dataset.encoder_threads=2 \
-  # --dataset.camera_encoder.vcodec=auto \
+  --task="Grasp a lego block and put it in the bin." \ # <- Use the same task description you used in your dataset recording
+  # <- RTC optional, use when running on low power hardware \
+  # --inference.type=rtc \
+  # --inference.rtc.execution_horizon=10 \
+  # --inference.rtc.max_guidance_weight=10.0 \
   # <- Teleop optional if you want to teleoperate in between episodes \
   # --teleop.type=so100_leader \
   # --teleop.port=/dev/ttyACM0 \
   # --teleop.id=my_red_leader_arm \
+  # --display_data=true #optional use if you want to see the camera stream \
   --policy.path=HF_USER/FINETUNE_MODEL_NAME # <- Use your fine-tuned model
 ```
 

From b74a551d38f6cf33ddde8e55b0c6f5a9b0c42e85 Mon Sep 17 00:00:00 2001
From: Haoming Song <haomingsong24@gmail.com>
Date: Fri, 22 May 2026 16:29:34 +0800
Subject: [PATCH 11/17] fix(pi0, pi05): stabilize torch.compile and expand test
 coverage (#3610)

* chore(gr00t): sync with #3606 for fixing gr00t config crash

* fix(pi0&pi05): fix graph break caused by deepcopy of past_key_values in sample_actions

* fix(pi0&pi05): fix frequent recompile caused by compute_layer_complete

* feat(test): add compile test and benchmark for pi0 and pi05

* feat(test): add comprehensive testing for pi0 and pi05. Including processor, forward, sample action, etc.
---
 src/lerobot/policies/groot/groot_n1.py        |  24 +-
 src/lerobot/policies/pi0/modeling_pi0.py      |  56 +-
 src/lerobot/policies/pi05/modeling_pi05.py    |  56 +-
 .../pi0_pi05/openpi_pytorch/__init__.py       |   1 +
 .../policies/pi0_pi05/openpi_pytorch/gemma.py |  22 +
 .../pi0_pi05/openpi_pytorch/gemma_pytorch.py  | 300 +++++++++++
 .../pi0_pi05/openpi_pytorch/image_tools.py    |  79 +++
 .../pi0_pi05/openpi_pytorch/pi0_pytorch.py    | 471 +++++++++++++++++
 .../openpi_pytorch/preprocessing_pytorch.py   | 179 +++++++
 tests/policies/pi0_pi05/test_pi05_compile.py  | 101 ++++
 .../pi0_pi05/test_pi05_original_vs_lerobot.py | 485 ++++++------------
 tests/policies/pi0_pi05/test_pi0_compile.py   |  99 ++++
 .../pi0_pi05/test_pi0_original_vs_lerobot.py  | 479 ++++++-----------
 tests/policies/pi0_pi05/utils/__init__.py     |   1 +
 .../policies/pi0_pi05/utils/openpi_parity.py  | 291 +++++++++++
 .../policies/pi0_pi05/utils/torch_compile.py  | 207 ++++++++
 tests/processor/test_pi05_processor.py        | 155 ++++++
 tests/processor/test_pi0_processor.py         | 156 ++++++
 18 files changed, 2463 insertions(+), 699 deletions(-)
 create mode 100644 tests/policies/pi0_pi05/openpi_pytorch/__init__.py
 create mode 100644 tests/policies/pi0_pi05/openpi_pytorch/gemma.py
 create mode 100644 tests/policies/pi0_pi05/openpi_pytorch/gemma_pytorch.py
 create mode 100644 tests/policies/pi0_pi05/openpi_pytorch/image_tools.py
 create mode 100644 tests/policies/pi0_pi05/openpi_pytorch/pi0_pytorch.py
 create mode 100644 tests/policies/pi0_pi05/openpi_pytorch/preprocessing_pytorch.py
 create mode 100644 tests/policies/pi0_pi05/test_pi05_compile.py
 create mode 100644 tests/policies/pi0_pi05/test_pi0_compile.py
 create mode 100644 tests/policies/pi0_pi05/utils/__init__.py
 create mode 100644 tests/policies/pi0_pi05/utils/openpi_parity.py
 create mode 100644 tests/policies/pi0_pi05/utils/torch_compile.py
 create mode 100644 tests/processor/test_pi05_processor.py
 create mode 100644 tests/processor/test_pi0_processor.py

diff --git a/src/lerobot/policies/groot/groot_n1.py b/src/lerobot/policies/groot/groot_n1.py
index 381c5fbd6..c9110301f 100644
--- a/src/lerobot/policies/groot/groot_n1.py
+++ b/src/lerobot/policies/groot/groot_n1.py
@@ -14,7 +14,7 @@
 # limitations under the License.
 
 from pathlib import Path
-from typing import TYPE_CHECKING
+from typing import TYPE_CHECKING, Any
 
 import numpy as np
 import torch
@@ -26,9 +26,14 @@ from lerobot.utils.import_utils import _transformers_available
 
 # Conditional import for type checking and lazy loading
 if TYPE_CHECKING or _transformers_available:
+    from huggingface_hub.dataclasses import strict
     from transformers import AutoConfig, AutoModel, PretrainedConfig, PreTrainedModel
     from transformers.feature_extraction_utils import BatchFeature
 else:
+
+    def strict(cls):
+        return cls
+
     AutoConfig = None
     AutoModel = None
     PretrainedConfig = object
@@ -173,19 +178,20 @@ N_COLOR_CHANNELS = 3
 
 
 # config
+@strict
 class GR00TN15Config(PretrainedConfig):
     model_type = "gr00t_n1_5"
 
-    backbone_cfg: dict
-    action_head_cfg: dict
-    action_horizon: int
-    action_dim: int
+    backbone_cfg: dict[str, Any] | None = None
+    action_head_cfg: dict[str, Any] | None = None
+    action_horizon: int = 0
+    action_dim: int = 0
     compute_dtype: str = "float32"
 
-    def __init__(self, **kwargs):
-        super().__init__(**kwargs)
-        for key, value in kwargs.items():
-            setattr(self, key, value)
+    def __post_init__(self, **kwargs):
+        self.backbone_cfg = {} if self.backbone_cfg is None else self.backbone_cfg
+        self.action_head_cfg = {} if self.action_head_cfg is None else self.action_head_cfg
+        super().__post_init__(**kwargs)
 
 
 # real model
diff --git a/src/lerobot/policies/pi0/modeling_pi0.py b/src/lerobot/policies/pi0/modeling_pi0.py
index 510af0796..f6f4212fb 100644
--- a/src/lerobot/policies/pi0/modeling_pi0.py
+++ b/src/lerobot/policies/pi0/modeling_pi0.py
@@ -15,7 +15,6 @@
 # limitations under the License.
 
 import builtins
-import copy
 import logging
 import math
 from collections import deque
@@ -30,6 +29,7 @@ from lerobot.utils.import_utils import _transformers_available, require_package
 
 # Conditional import for type checking and lazy loading
 if TYPE_CHECKING or _transformers_available:
+    from transformers.cache_utils import DynamicCache
     from transformers.models.auto import CONFIG_MAPPING
     from transformers.models.gemma import modeling_gemma
 
@@ -41,6 +41,7 @@ if TYPE_CHECKING or _transformers_available:
     )
 else:
     CONFIG_MAPPING = None
+    DynamicCache = None
     modeling_gemma = None
     PiGemmaForCausalLM = None
     _gated_residual = None
@@ -141,6 +142,15 @@ def make_att_2d_masks(pad_masks, att_masks):  # see openpi `make_att_2d_masks` (
     return att_2d_masks & pad_2d_masks
 
 
+def clone_past_key_values(past_key_values):
+    """Clone the DynamicCache returned by prefix prefill for compiled denoising."""
+    return DynamicCache(
+        tuple(
+            (keys.clone(), values.clone(), sliding_window) for keys, values, sliding_window in past_key_values
+        )
+    )
+
+
 def pad_vector(vector, new_dim):
     """Pad the last dimension of a vector to new_dim with zeros.
 
@@ -227,16 +237,13 @@ def resize_with_pad_torch(  # see openpi `resize_with_pad_torch` (exact copy)
 
 
 # Define the complete layer computation function for gradient checkpointing
-def compute_layer_complete(
-    layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond, paligemma, gemma_expert
-):
-    models = [paligemma.model.language_model, gemma_expert.model]
+def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_cond, layers, rotary_emb):
     query_states = []
     key_states = []
     value_states = []
     gates = []
     for i, hidden_states in enumerate(inputs_embeds):
-        layer = models[i].layers[layer_idx]
+        layer = layers[i]
         hidden_states, gate = layernorm_forward(layer.input_layernorm, hidden_states, adarms_cond[i])
         gates.append(gate)
         input_shape = hidden_states.shape[:-1]
@@ -258,15 +265,16 @@ def compute_layer_complete(
         device=query_states.device,
         dtype=query_states.dtype,
     )
-    cos, sin = paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
+    cos, sin = rotary_emb(dummy_tensor, position_ids)
     query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
         query_states, key_states, cos, sin, unsqueeze_dim=1
     )
     batch_size = query_states.shape[0]
-    scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
+    paligemma_layer = layers[0]
+    scaling = paligemma_layer.self_attn.scaling
     # Attention computation
     att_output, _ = modeling_gemma.eager_attention_forward(
-        paligemma.model.language_model.layers[layer_idx].self_attn,
+        paligemma_layer.self_attn,
         query_states,
         key_states,
         value_states,
@@ -274,13 +282,13 @@ def compute_layer_complete(
         scaling,
     )
     # Get head_dim from the current layer, not from the model
-    head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
+    head_dim = paligemma_layer.self_attn.head_dim
     att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
     # Process layer outputs
     outputs_embeds = []
     start_pos = 0
     for i, hidden_states in enumerate(inputs_embeds):
-        layer = models[i].layers[layer_idx]
+        layer = layers[i]
         end_pos = start_pos + hidden_states.shape[1]
         if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
             att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
@@ -488,8 +496,9 @@ class PaliGemmaWithExpertModel(
             prefix_output = None
             prefix_past_key_values = None
         else:
-            models = [self.paligemma.model.language_model, self.gemma_expert.model]
-            num_layers = self.paligemma.config.text_config.num_hidden_layers
+            paligemma_layers = self.paligemma.model.language_model.layers
+            gemma_expert_layers = self.gemma_expert.model.layers
+            rotary_emb = self.paligemma.model.language_model.rotary_emb
 
             # Check if gradient checkpointing is enabled for any of the models
             use_gradient_checkpointing = (
@@ -499,36 +508,39 @@ class PaliGemmaWithExpertModel(
             ) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)
 
             # Process all layers with gradient checkpointing if enabled
-            for layer_idx in range(num_layers):
+            for layers in zip(paligemma_layers, gemma_expert_layers, strict=True):
                 if use_gradient_checkpointing:
                     inputs_embeds = torch.utils.checkpoint.checkpoint(
                         compute_layer_complete,
-                        layer_idx,
                         inputs_embeds,
                         attention_mask,
                         position_ids,
                         adarms_cond,
                         use_reentrant=False,
                         preserve_rng_state=False,
-                        paligemma=self.paligemma,
-                        gemma_expert=self.gemma_expert,
+                        layers=layers,
+                        rotary_emb=rotary_emb,
                     )
                 else:
                     inputs_embeds = compute_layer_complete(
-                        layer_idx,
                         inputs_embeds,
                         attention_mask,
                         position_ids,
                         adarms_cond,
-                        paligemma=self.paligemma,
-                        gemma_expert=self.gemma_expert,
+                        layers=layers,
+                        rotary_emb=rotary_emb,
                     )
 
             # final norm
+            final_norms = (
+                self.paligemma.model.language_model.norm,
+                self.gemma_expert.model.norm,
+            )
+
             def compute_final_norms(inputs_embeds, adarms_cond):
                 outputs_embeds = []
                 for i, hidden_states in enumerate(inputs_embeds):
-                    out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
+                    out_emb, _ = layernorm_forward(final_norms[i], hidden_states, adarms_cond[i])
                     outputs_embeds.append(out_emb)
                 return outputs_embeds
 
@@ -907,7 +919,7 @@ class PI0Pytorch(nn.Module):  # see openpi `PI0Pytorch`
         full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
         self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager"  # noqa: SLF001
 
-        past_key_values = copy.deepcopy(past_key_values)
+        past_key_values = clone_past_key_values(past_key_values)
         outputs_embeds, _ = self.paligemma_with_expert.forward(
             attention_mask=full_att_2d_masks_4d,
             position_ids=position_ids,
diff --git a/src/lerobot/policies/pi05/modeling_pi05.py b/src/lerobot/policies/pi05/modeling_pi05.py
index bdaf01f2c..aabd04c6f 100644
--- a/src/lerobot/policies/pi05/modeling_pi05.py
+++ b/src/lerobot/policies/pi05/modeling_pi05.py
@@ -15,7 +15,6 @@
 # limitations under the License.
 
 import builtins
-import copy
 import logging
 import math
 from collections import deque
@@ -30,6 +29,7 @@ from lerobot.utils.import_utils import _transformers_available, require_package
 
 # Conditional import for type checking and lazy loading
 if TYPE_CHECKING or _transformers_available:
+    from transformers.cache_utils import DynamicCache
     from transformers.models.auto import CONFIG_MAPPING
     from transformers.models.gemma import modeling_gemma
 
@@ -41,6 +41,7 @@ if TYPE_CHECKING or _transformers_available:
     )
 else:
     CONFIG_MAPPING = None
+    DynamicCache = None
     modeling_gemma = None
     PiGemmaForCausalLM = None
     _gated_residual = None
@@ -138,6 +139,15 @@ def make_att_2d_masks(pad_masks, att_masks):  # see openpi `make_att_2d_masks` (
     return att_2d_masks & pad_2d_masks
 
 
+def clone_past_key_values(past_key_values):
+    """Clone the DynamicCache returned by prefix prefill for compiled denoising."""
+    return DynamicCache(
+        tuple(
+            (keys.clone(), values.clone(), sliding_window) for keys, values, sliding_window in past_key_values
+        )
+    )
+
+
 def pad_vector(vector, new_dim):
     """Pad the last dimension of a vector to new_dim with zeros.
 
@@ -224,16 +234,13 @@ def resize_with_pad_torch(  # see openpi `resize_with_pad_torch` (exact copy)
 
 
 # Define the complete layer computation function for gradient checkpointing
-def compute_layer_complete(
-    layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond, paligemma, gemma_expert
-):
-    models = [paligemma.model.language_model, gemma_expert.model]
+def compute_layer_complete(inputs_embeds, attention_mask, position_ids, adarms_cond, layers, rotary_emb):
     query_states = []
     key_states = []
     value_states = []
     gates = []
     for i, hidden_states in enumerate(inputs_embeds):
-        layer = models[i].layers[layer_idx]
+        layer = layers[i]
         hidden_states, gate = layernorm_forward(layer.input_layernorm, hidden_states, adarms_cond[i])
         gates.append(gate)
         input_shape = hidden_states.shape[:-1]
@@ -255,15 +262,16 @@ def compute_layer_complete(
         device=query_states.device,
         dtype=query_states.dtype,
     )
-    cos, sin = paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
+    cos, sin = rotary_emb(dummy_tensor, position_ids)
     query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
         query_states, key_states, cos, sin, unsqueeze_dim=1
     )
     batch_size = query_states.shape[0]
-    scaling = paligemma.model.language_model.layers[layer_idx].self_attn.scaling
+    paligemma_layer = layers[0]
+    scaling = paligemma_layer.self_attn.scaling
     # Attention computation
     att_output, _ = modeling_gemma.eager_attention_forward(
-        paligemma.model.language_model.layers[layer_idx].self_attn,
+        paligemma_layer.self_attn,
         query_states,
         key_states,
         value_states,
@@ -271,13 +279,13 @@ def compute_layer_complete(
         scaling,
     )
     # Get head_dim from the current layer, not from the model
-    head_dim = paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
+    head_dim = paligemma_layer.self_attn.head_dim
     att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
     # Process layer outputs
     outputs_embeds = []
     start_pos = 0
     for i, hidden_states in enumerate(inputs_embeds):
-        layer = models[i].layers[layer_idx]
+        layer = layers[i]
         end_pos = start_pos + hidden_states.shape[1]
         if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
             att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
@@ -485,8 +493,9 @@ class PaliGemmaWithExpertModel(
             prefix_output = None
             prefix_past_key_values = None
         else:
-            models = [self.paligemma.model.language_model, self.gemma_expert.model]
-            num_layers = self.paligemma.config.text_config.num_hidden_layers
+            paligemma_layers = self.paligemma.model.language_model.layers
+            gemma_expert_layers = self.gemma_expert.model.layers
+            rotary_emb = self.paligemma.model.language_model.rotary_emb
 
             # Check if gradient checkpointing is enabled for any of the models
             use_gradient_checkpointing = (
@@ -496,36 +505,39 @@ class PaliGemmaWithExpertModel(
             ) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)
 
             # Process all layers with gradient checkpointing if enabled
-            for layer_idx in range(num_layers):
+            for layers in zip(paligemma_layers, gemma_expert_layers, strict=True):
                 if use_gradient_checkpointing:
                     inputs_embeds = torch.utils.checkpoint.checkpoint(
                         compute_layer_complete,
-                        layer_idx,
                         inputs_embeds,
                         attention_mask,
                         position_ids,
                         adarms_cond,
                         use_reentrant=False,
                         preserve_rng_state=False,
-                        paligemma=self.paligemma,
-                        gemma_expert=self.gemma_expert,
+                        layers=layers,
+                        rotary_emb=rotary_emb,
                     )
                 else:
                     inputs_embeds = compute_layer_complete(
-                        layer_idx,
                         inputs_embeds,
                         attention_mask,
                         position_ids,
                         adarms_cond,
-                        paligemma=self.paligemma,
-                        gemma_expert=self.gemma_expert,
+                        layers=layers,
+                        rotary_emb=rotary_emb,
                     )
 
             # final norm
+            final_norms = (
+                self.paligemma.model.language_model.norm,
+                self.gemma_expert.model.norm,
+            )
+
             def compute_final_norms(inputs_embeds, adarms_cond):
                 outputs_embeds = []
                 for i, hidden_states in enumerate(inputs_embeds):
-                    out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
+                    out_emb, _ = layernorm_forward(final_norms[i], hidden_states, adarms_cond[i])
                     outputs_embeds.append(out_emb)
                 return outputs_embeds
 
@@ -880,7 +892,7 @@ class PI05Pytorch(nn.Module):  # see openpi `PI0Pytorch`
         full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
         self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager"  # noqa: SLF001
 
-        past_key_values = copy.deepcopy(past_key_values)
+        past_key_values = clone_past_key_values(past_key_values)
         outputs_embeds, _ = self.paligemma_with_expert.forward(
             attention_mask=full_att_2d_masks_4d,
             position_ids=position_ids,
diff --git a/tests/policies/pi0_pi05/openpi_pytorch/__init__.py b/tests/policies/pi0_pi05/openpi_pytorch/__init__.py
new file mode 100644
index 000000000..e1cdcf3fc
--- /dev/null
+++ b/tests/policies/pi0_pi05/openpi_pytorch/__init__.py
@@ -0,0 +1 @@
+"""Lightweight vendored OpenPI PyTorch modules for PI0/PI05 parity tests."""
diff --git a/tests/policies/pi0_pi05/openpi_pytorch/gemma.py b/tests/policies/pi0_pi05/openpi_pytorch/gemma.py
new file mode 100644
index 000000000..2210f5c01
--- /dev/null
+++ b/tests/policies/pi0_pi05/openpi_pytorch/gemma.py
@@ -0,0 +1,22 @@
+from dataclasses import dataclass
+
+
+@dataclass
+class Config:
+    width: int
+    depth: int
+    mlp_dim: int
+    num_heads: int
+    num_kv_heads: int
+    head_dim: int
+
+
+def get_config(variant: str) -> Config:
+    """Return the Gemma shape config needed by the OpenPI PyTorch model."""
+    if variant == "dummy":
+        return Config(width=64, depth=4, mlp_dim=128, num_heads=8, num_kv_heads=1, head_dim=16)
+    if variant == "gemma_300m":
+        return Config(width=1024, depth=18, mlp_dim=4096, num_heads=8, num_kv_heads=1, head_dim=256)
+    if variant == "gemma_2b":
+        return Config(width=2048, depth=18, mlp_dim=16_384, num_heads=8, num_kv_heads=1, head_dim=256)
+    raise ValueError(f"Unknown variant: {variant}")
diff --git a/tests/policies/pi0_pi05/openpi_pytorch/gemma_pytorch.py b/tests/policies/pi0_pi05/openpi_pytorch/gemma_pytorch.py
new file mode 100644
index 000000000..48f07cd35
--- /dev/null
+++ b/tests/policies/pi0_pi05/openpi_pytorch/gemma_pytorch.py
@@ -0,0 +1,300 @@
+from typing import Literal
+
+import torch
+from torch import nn
+from transformers.models.auto import CONFIG_MAPPING
+from transformers.models.gemma import modeling_gemma
+
+from lerobot.policies.pi_gemma import (
+    PaliGemmaForConditionalGenerationWithPiGemma,
+    PiGemmaForCausalLM,
+    _gated_residual,
+    layernorm_forward,
+)
+
+
+class PaliGemmaWithExpertModel(nn.Module):
+    def __init__(
+        self,
+        vlm_config,
+        action_expert_config,
+        use_adarms=None,
+        precision: Literal["bfloat16", "float32"] = "bfloat16",
+    ):
+        if use_adarms is None:
+            use_adarms = [False, False]
+        super().__init__()
+
+        vlm_config_hf = CONFIG_MAPPING["paligemma"]()
+        vlm_config_hf._vocab_size = 257152  # noqa: SLF001
+        vlm_config_hf.image_token_index = 257152
+        vlm_config_hf.text_config.hidden_size = vlm_config.width
+        vlm_config_hf.text_config.intermediate_size = vlm_config.mlp_dim
+        vlm_config_hf.text_config.num_attention_heads = vlm_config.num_heads
+        vlm_config_hf.text_config.head_dim = vlm_config.head_dim
+        vlm_config_hf.text_config.num_hidden_layers = vlm_config.depth
+        vlm_config_hf.text_config.num_key_value_heads = vlm_config.num_kv_heads
+        vlm_config_hf.text_config.hidden_activation = "gelu_pytorch_tanh"
+        vlm_config_hf.text_config.dtype = "float32"
+        vlm_config_hf.text_config.vocab_size = 257152
+        vlm_config_hf.text_config.use_adarms = use_adarms[0]
+        vlm_config_hf.text_config.adarms_cond_dim = vlm_config.width if use_adarms[0] else None
+        vlm_config_hf.vision_config.intermediate_size = 4304
+        vlm_config_hf.vision_config.projection_dim = 2048
+        vlm_config_hf.vision_config.projector_hidden_act = "gelu_fast"
+        vlm_config_hf.vision_config.dtype = "float32"
+
+        action_expert_config_hf = CONFIG_MAPPING["gemma"](
+            head_dim=action_expert_config.head_dim,
+            hidden_size=action_expert_config.width,
+            intermediate_size=action_expert_config.mlp_dim,
+            num_attention_heads=action_expert_config.num_heads,
+            num_hidden_layers=action_expert_config.depth,
+            num_key_value_heads=action_expert_config.num_kv_heads,
+            vocab_size=257152,
+            hidden_activation="gelu_pytorch_tanh",
+            dtype="float32",
+            use_adarms=use_adarms[1],
+            adarms_cond_dim=action_expert_config.width if use_adarms[1] else None,
+        )
+
+        self.paligemma = PaliGemmaForConditionalGenerationWithPiGemma(config=vlm_config_hf)
+        self.gemma_expert = PiGemmaForCausalLM(config=action_expert_config_hf)
+        self.gemma_expert.model.embed_tokens = None
+
+        self.to_bfloat16_for_selected_params(precision)
+
+    def to_bfloat16_for_selected_params(self, precision: Literal["bfloat16", "float32"] = "bfloat16"):
+        if precision == "bfloat16":
+            self.to(dtype=torch.bfloat16)
+        elif precision == "float32":
+            self.to(dtype=torch.float32)
+            return
+        else:
+            raise ValueError(f"Invalid precision: {precision}")
+
+        params_to_keep_float32 = [
+            "vision_tower",
+            "multi_modal_projector",
+            "input_layernorm",
+            "post_attention_layernorm",
+            "model.norm",
+        ]
+
+        for name, param in self.named_parameters():
+            if any(selector in name for selector in params_to_keep_float32):
+                param.data = param.data.to(dtype=torch.float32)
+
+    def embed_image(self, image: torch.Tensor):
+        # Transformers 5.4 no longer divides PaliGemma image features by sqrt(hidden_size),
+        # so the upstream helper now matches OpenPI's patched PaliGemma image-scale semantics.
+        # See https://github.com/huggingface/transformers/pull/44432/changes#diff-c916907e7e52ac85ee1a1527560eae4656cd6c76141ceb1fe3da61bd5f697d2a
+        out_dtype = image.dtype
+        if image.dtype != torch.float32:
+            image = image.to(torch.float32)
+        image_outputs = self.paligemma.model.get_image_features(image)
+        features = image_outputs.pooler_output
+        if features.dtype != out_dtype:
+            features = features.to(out_dtype)
+        return features
+
+    def embed_language_tokens(self, tokens: torch.Tensor):
+        return self.paligemma.model.language_model.get_input_embeddings()(tokens)
+
+    def forward(
+        self,
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_values: list[torch.FloatTensor] | None = None,
+        inputs_embeds: list[torch.FloatTensor] | None = None,
+        use_cache: bool | None = None,
+        adarms_cond: list[torch.Tensor] | None = None,
+    ):
+        if adarms_cond is None:
+            adarms_cond = [None, None]
+        if inputs_embeds[1] is None:
+            prefix_output = self.paligemma.model.language_model.forward(
+                inputs_embeds=inputs_embeds[0],
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                adarms_cond=adarms_cond[0],
+            )
+            prefix_past_key_values = prefix_output.past_key_values
+            prefix_output = prefix_output.last_hidden_state
+            suffix_output = None
+        elif inputs_embeds[0] is None:
+            suffix_output = self.gemma_expert.model.forward(
+                inputs_embeds=inputs_embeds[1],
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                use_cache=use_cache,
+                adarms_cond=adarms_cond[1],
+            )
+            suffix_output = suffix_output.last_hidden_state
+            prefix_output = None
+            prefix_past_key_values = None
+        else:
+            models = [self.paligemma.model.language_model, self.gemma_expert.model]
+            num_layers = self.paligemma.config.text_config.num_hidden_layers
+
+            # Check if gradient checkpointing is enabled for any of the models
+            use_gradient_checkpointing = (
+                hasattr(self.gemma_expert.model, "gradient_checkpointing")
+                and self.gemma_expert.model.gradient_checkpointing
+                and self.training
+            ) or (hasattr(self, "gradient_checkpointing") and self.gradient_checkpointing and self.training)
+
+            # Force enable gradient checkpointing if we're in training mode and the model supports it
+            if self.training and hasattr(self.gemma_expert.model, "gradient_checkpointing"):
+                if not self.gemma_expert.model.gradient_checkpointing:
+                    print("Forcing gradient checkpointing to be enabled for Gemma expert model")
+                    self.gemma_expert.model.gradient_checkpointing = True
+                use_gradient_checkpointing = True
+
+            # Debug gradient checkpointing status
+            if hasattr(self, "_debug_gc_printed") and not self._debug_gc_printed:
+                print(f"Gemma expert model gradient checkpointing: {use_gradient_checkpointing}")
+                print(f"Model training mode: {self.training}")
+                print(
+                    f"Gemma expert model has gradient_checkpointing attr: {hasattr(self.gemma_expert.model, 'gradient_checkpointing')}"
+                )
+                if hasattr(self.gemma_expert.model, "gradient_checkpointing"):
+                    print(
+                        f"Gemma expert model gradient_checkpointing value: {self.gemma_expert.model.gradient_checkpointing}"
+                    )
+                self._debug_gc_printed = True
+
+            # Define the complete layer computation function for gradient checkpointing
+            def compute_layer_complete(layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond):
+                models = [self.paligemma.model.language_model, self.gemma_expert.model]
+
+                query_states = []
+                key_states = []
+                value_states = []
+                gates = []
+                for i, hidden_states in enumerate(inputs_embeds):
+                    layer = models[i].layers[layer_idx]
+                    hidden_states, gate = layernorm_forward(
+                        layer.input_layernorm, hidden_states, adarms_cond[i]
+                    )
+                    gates.append(gate)
+
+                    input_shape = hidden_states.shape[:-1]
+                    hidden_shape = (*input_shape, -1, layer.self_attn.head_dim)
+                    query_state = layer.self_attn.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+                    key_state = layer.self_attn.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+                    value_state = layer.self_attn.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)
+
+                    query_states.append(query_state)
+                    key_states.append(key_state)
+                    value_states.append(value_state)
+
+                # Concatenate and process attention
+                query_states = torch.cat(query_states, dim=2)
+                key_states = torch.cat(key_states, dim=2)
+                value_states = torch.cat(value_states, dim=2)
+
+                dummy_tensor = torch.zeros(
+                    query_states.shape[0],
+                    query_states.shape[2],
+                    query_states.shape[-1],
+                    device=query_states.device,
+                    dtype=query_states.dtype,
+                )
+                cos, sin = self.paligemma.model.language_model.rotary_emb(dummy_tensor, position_ids)
+                query_states, key_states = modeling_gemma.apply_rotary_pos_emb(
+                    query_states, key_states, cos, sin, unsqueeze_dim=1
+                )
+
+                batch_size = query_states.shape[0]
+                scaling = self.paligemma.model.language_model.layers[layer_idx].self_attn.scaling
+
+                # Attention computation
+                att_output, _ = modeling_gemma.eager_attention_forward(
+                    self.paligemma.model.language_model.layers[layer_idx].self_attn,
+                    query_states,
+                    key_states,
+                    value_states,
+                    attention_mask,
+                    scaling,
+                )
+                # Get head_dim from the current layer, not from the model
+                head_dim = self.paligemma.model.language_model.layers[layer_idx].self_attn.head_dim
+                att_output = att_output.reshape(batch_size, -1, 1 * 8 * head_dim)
+
+                # Process layer outputs
+                outputs_embeds = []
+                start_pos = 0
+                for i, hidden_states in enumerate(inputs_embeds):
+                    layer = models[i].layers[layer_idx]
+                    end_pos = start_pos + hidden_states.shape[1]
+
+                    if att_output.dtype != layer.self_attn.o_proj.weight.dtype:
+                        att_output = att_output.to(layer.self_attn.o_proj.weight.dtype)
+                    out_emb = layer.self_attn.o_proj(att_output[:, start_pos:end_pos])
+
+                    # first residual
+                    out_emb = _gated_residual(hidden_states, out_emb, gates[i])
+                    after_first_residual = out_emb.clone()
+                    out_emb, gate = layernorm_forward(layer.post_attention_layernorm, out_emb, adarms_cond[i])
+                    # Convert to bfloat16 if the next layer (mlp) uses bfloat16
+                    if layer.mlp.up_proj.weight.dtype == torch.bfloat16:
+                        out_emb = out_emb.to(dtype=torch.bfloat16)
+
+                    out_emb = layer.mlp(out_emb)
+                    # second residual
+                    out_emb = _gated_residual(after_first_residual, out_emb, gate)
+                    outputs_embeds.append(out_emb)
+                    start_pos = end_pos
+
+                return outputs_embeds
+
+            # Process all layers with gradient checkpointing if enabled
+            for layer_idx in range(num_layers):
+                if use_gradient_checkpointing:
+                    inputs_embeds = torch.utils.checkpoint.checkpoint(
+                        compute_layer_complete,
+                        layer_idx,
+                        inputs_embeds,
+                        attention_mask,
+                        position_ids,
+                        adarms_cond,
+                        use_reentrant=False,
+                        preserve_rng_state=False,
+                    )
+                else:
+                    inputs_embeds = compute_layer_complete(
+                        layer_idx, inputs_embeds, attention_mask, position_ids, adarms_cond
+                    )
+
+                # Old code removed - now using compute_layer_complete function above
+
+            # final norm
+            # Define final norm computation function for gradient checkpointing
+            def compute_final_norms(inputs_embeds, adarms_cond):
+                outputs_embeds = []
+                for i, hidden_states in enumerate(inputs_embeds):
+                    out_emb, _ = layernorm_forward(models[i].norm, hidden_states, adarms_cond[i])
+                    outputs_embeds.append(out_emb)
+                return outputs_embeds
+
+            # Apply gradient checkpointing to final norm if enabled
+            if use_gradient_checkpointing:
+                outputs_embeds = torch.utils.checkpoint.checkpoint(
+                    compute_final_norms,
+                    inputs_embeds,
+                    adarms_cond,
+                    use_reentrant=False,
+                    preserve_rng_state=False,
+                )
+            else:
+                outputs_embeds = compute_final_norms(inputs_embeds, adarms_cond)
+
+            prefix_output = outputs_embeds[0]
+            suffix_output = outputs_embeds[1]
+            prefix_past_key_values = None
+
+        return [prefix_output, suffix_output], prefix_past_key_values
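Reviewer note on `to_bfloat16_for_selected_params` in the file above: the float32 exceptions are picked by substring match on parameter names, so the `"model.norm"` selector also matches names ending in `language_model.norm`. The sketch below reproduces only that selection rule in plain Python. The parameter names in `EXAMPLE_NAMES` are made up for illustration and are not read from a real checkpoint.

```python
# Selectors copied from to_bfloat16_for_selected_params.
PARAMS_TO_KEEP_FLOAT32 = [
    "vision_tower",
    "multi_modal_projector",
    "input_layernorm",
    "post_attention_layernorm",
    "model.norm",
]


def keeps_float32(param_name: str) -> bool:
    """Return True if any selector occurs as a substring of the parameter name."""
    return any(selector in param_name for selector in PARAMS_TO_KEEP_FLOAT32)


# Hypothetical parameter names, chosen only to exercise the rule.
EXAMPLE_NAMES = [
    "paligemma.model.vision_tower.encoder.weight",
    "paligemma.model.language_model.norm.weight",
    "gemma_expert.model.layers.0.input_layernorm.weight",
    "gemma_expert.model.layers.0.self_attn.q_proj.weight",
    "gemma_expert.model.layers.0.mlp.up_proj.weight",
]

kept = {name: keeps_float32(name) for name in EXAMPLE_NAMES}
for name, stays_float32 in kept.items():
    print(f"{name}: {'float32' if stays_float32 else 'bfloat16'}")
```

Because the match is by substring rather than by module path, renaming a module could silently change which parameters stay in float32.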
diff --git a/tests/policies/pi0_pi05/openpi_pytorch/image_tools.py b/tests/policies/pi0_pi05/openpi_pytorch/image_tools.py
new file mode 100644
index 000000000..a459f7859
--- /dev/null
+++ b/tests/policies/pi0_pi05/openpi_pytorch/image_tools.py
@@ -0,0 +1,79 @@
+import torch
+import torch.nn.functional as F  # noqa: N812
+
+
+def resize_with_pad_torch(
+    images: torch.Tensor,
+    height: int,
+    width: int,
+    mode: str = "bilinear",
+) -> torch.Tensor:
+    """PyTorch version of resize_with_pad. Resizes an image to a target height and width without distortion
+    by padding with black. If the image is float32, it must be in the range [-1, 1].
+
+    Args:
+        images: Tensor of shape [*b, h, w, c] or [*b, c, h, w]
+        height: Target height
+        width: Target width
+        mode: Interpolation mode ('bilinear', 'nearest', etc.)
+
+    Returns:
+        Resized and padded tensor with same shape format as input
+    """
+    # Check if input is in channels-last format [*b, h, w, c] or channels-first [*b, c, h, w]
+    if images.shape[-1] <= 4:  # Assume channels-last format
+        channels_last = True
+        # Convert to channels-first for torch operations
+        if images.dim() == 3:
+            images = images.unsqueeze(0)  # Add batch dimension
+        images = images.permute(0, 3, 1, 2)  # [b, h, w, c] -> [b, c, h, w]
+    else:
+        channels_last = False
+        if images.dim() == 3:
+            images = images.unsqueeze(0)  # Add batch dimension
+
+    batch_size, channels, cur_height, cur_width = images.shape
+
+    # Calculate resize ratio
+    ratio = max(cur_width / width, cur_height / height)
+    resized_height = int(cur_height / ratio)
+    resized_width = int(cur_width / ratio)
+
+    # Resize
+    resized_images = F.interpolate(
+        images,
+        size=(resized_height, resized_width),
+        mode=mode,
+        align_corners=False if mode == "bilinear" else None,
+    )
+
+    # Handle dtype-specific clipping
+    if images.dtype == torch.uint8:
+        resized_images = torch.round(resized_images).clamp(0, 255).to(torch.uint8)
+    elif images.dtype == torch.float32:
+        resized_images = resized_images.clamp(-1.0, 1.0)
+    else:
+        raise ValueError(f"Unsupported image dtype: {images.dtype}")
+
+    # Calculate padding
+    pad_h0, remainder_h = divmod(height - resized_height, 2)
+    pad_h1 = pad_h0 + remainder_h
+    pad_w0, remainder_w = divmod(width - resized_width, 2)
+    pad_w1 = pad_w0 + remainder_w
+
+    # Pad
+    constant_value = 0 if images.dtype == torch.uint8 else -1.0
+    padded_images = F.pad(
+        resized_images,
+        (pad_w0, pad_w1, pad_h0, pad_h1),  # left, right, top, bottom
+        mode="constant",
+        value=constant_value,
+    )
+
+    # Convert back to original format if needed
+    if channels_last:
+        padded_images = padded_images.permute(0, 2, 3, 1)  # [b, c, h, w] -> [b, h, w, c]
+        if batch_size == 1 and images.shape[0] == 1:
+            padded_images = padded_images.squeeze(0)  # Drop the batch dimension when the batch size is 1
+
+    return padded_images
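Reviewer note on `resize_with_pad_torch` in the file above: the resize and padding sizes depend only on the input and target dimensions, so the geometry can be checked without tensors. The sketch below repeats that arithmetic in plain Python. The helper name `resize_with_pad_geometry` is hypothetical and is not part of the patch.

```python
def resize_with_pad_geometry(cur_height: int, cur_width: int, height: int, width: int):
    """Return the resized (height, width) and the (left, right, top, bottom) padding."""
    # Scale by the larger ratio so both resized dimensions fit inside the target.
    ratio = max(cur_width / width, cur_height / height)
    resized_height = int(cur_height / ratio)
    resized_width = int(cur_width / ratio)

    # Split the leftover space evenly; any odd pixel goes to the bottom/right side.
    pad_h0, remainder_h = divmod(height - resized_height, 2)
    pad_h1 = pad_h0 + remainder_h
    pad_w0, remainder_w = divmod(width - resized_width, 2)
    pad_w1 = pad_w0 + remainder_w

    return (resized_height, resized_width), (pad_w0, pad_w1, pad_h0, pad_h1)


# A 200x400 (height x width) image fitted into a 100x100 target.
size, pads = resize_with_pad_geometry(200, 400, 100, 100)
print(size, pads)
```

For this input the wide image is scaled by 4 and padded only at the top and bottom, so the aspect ratio is preserved.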
diff --git a/tests/policies/pi0_pi05/openpi_pytorch/pi0_pytorch.py b/tests/policies/pi0_pi05/openpi_pytorch/pi0_pytorch.py
new file mode 100644
index 000000000..77f1342d9
--- /dev/null
+++ b/tests/policies/pi0_pi05/openpi_pytorch/pi0_pytorch.py
@@ -0,0 +1,471 @@
+import copy
+import logging
+import math
+
+import torch
+import torch.nn.functional as F  # noqa: N812
+from torch import Tensor, nn
+
+import tests.policies.pi0_pi05.openpi_pytorch.gemma as _gemma
+from tests.policies.pi0_pi05.openpi_pytorch import preprocessing_pytorch as _preprocessing
+from tests.policies.pi0_pi05.openpi_pytorch.gemma_pytorch import PaliGemmaWithExpertModel
+
+
+def get_safe_dtype(target_dtype, device_type):
+    """Get a safe dtype for the given device type."""
+    if device_type == "cpu":
+        # bfloat16 kernels are limited on CPU, so fall back to float32
+        if target_dtype == torch.bfloat16:
+            return torch.float32
+        if target_dtype == torch.float64:
+            return torch.float64
+    return target_dtype
+
+
+def create_sinusoidal_pos_embedding(
+    time: Tensor, dimension: int, min_period: float, max_period: float, device="cpu"
+) -> Tensor:
+    """Computes sine-cosine positional embedding vectors for scalar positions."""
+    if dimension % 2 != 0:
+        raise ValueError(f"dimension ({dimension}) must be divisible by 2")
+
+    if time.ndim != 1:
+        raise ValueError("The time tensor is expected to be of shape `(batch_size, )`.")
+
+    dtype = get_safe_dtype(torch.float64, torch.device(device).type)
+    fraction = torch.linspace(0.0, 1.0, dimension // 2, dtype=dtype, device=device)
+    period = min_period * (max_period / min_period) ** fraction
+
+    # Compute the outer product
+    scaling_factor = 1.0 / period * 2 * math.pi
+    sin_input = scaling_factor[None, :] * time[:, None]
+    return torch.cat([torch.sin(sin_input), torch.cos(sin_input)], dim=1)
+
+
+def sample_beta(alpha, beta, bsize, device):
+    alpha_t = torch.as_tensor(alpha, dtype=torch.float32, device=device)
+    beta_t = torch.as_tensor(beta, dtype=torch.float32, device=device)
+    dist = torch.distributions.Beta(alpha_t, beta_t)
+    return dist.sample((bsize,))
+
+
+def make_att_2d_masks(pad_masks, att_masks):
+    """Copied from big_vision.
+
+    Tokens can attend to valid input tokens which have a cumulative `att_masks`
+    value smaller than or equal to theirs. This way `att_masks` int[B, N] can be used to
+    set up several types of attention, for example:
+
+      [[1 1 1 1 1 1]]: pure causal attention.
+
+      [[0 0 0 1 1 1]]: prefix-lm attention. The first 3 tokens can attend between
+          themselves and the last 3 tokens have a causal attention. The first
+          entry could also be a 1 without changing behaviour.
+
+      [[1 0 1 0 1 0 0 1 0 0]]: causal attention between 4 blocks. Tokens of a
+          block can attend all previous blocks and all tokens on the same block.
+
+    Args:
+      pad_masks: bool[B, N] true if it's part of the input, false if padding.
+      att_masks: int32[B, N] mask that's 1 where previous tokens cannot depend on
+        it and 0 where it shares the same attention mask as the previous token.
+    """
+    if att_masks.ndim != 2:
+        raise ValueError(att_masks.ndim)
+    if pad_masks.ndim != 2:
+        raise ValueError(pad_masks.ndim)
+
+    cumsum = torch.cumsum(att_masks, dim=1)
+    att_2d_masks = cumsum[:, None, :] <= cumsum[:, :, None]
+    pad_2d_masks = pad_masks[:, None, :] * pad_masks[:, :, None]
+    return att_2d_masks & pad_2d_masks
+
+
+class PI0Pytorch(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.pi05 = config.pi05
+
+        paligemma_config = _gemma.get_config(config.paligemma_variant)
+        action_expert_config = _gemma.get_config(config.action_expert_variant)
+
+        self.paligemma_with_expert = PaliGemmaWithExpertModel(
+            paligemma_config,
+            action_expert_config,
+            use_adarms=[False, True] if self.pi05 else [False, False],
+            precision=config.dtype,
+        )
+
+        self.action_in_proj = nn.Linear(config.action_dim, action_expert_config.width)
+        self.action_out_proj = nn.Linear(action_expert_config.width, config.action_dim)
+
+        if self.pi05:
+            self.time_mlp_in = nn.Linear(action_expert_config.width, action_expert_config.width)
+            self.time_mlp_out = nn.Linear(action_expert_config.width, action_expert_config.width)
+        else:
+            self.state_proj = nn.Linear(config.action_dim, action_expert_config.width)
+            self.action_time_mlp_in = nn.Linear(2 * action_expert_config.width, action_expert_config.width)
+            self.action_time_mlp_out = nn.Linear(action_expert_config.width, action_expert_config.width)
+
+        torch.set_float32_matmul_precision("high")
+        if config.pytorch_compile_mode is not None:
+            self.sample_actions = torch.compile(self.sample_actions, mode=config.pytorch_compile_mode)
+
+        # Initialize gradient checkpointing flag
+        self.gradient_checkpointing_enabled = False
+
+        # The upstream OpenPI module verifies a site-package Transformers patch here.
+        # This vendored test copy instead routes through LeRobot's local PiGemma compatibility layer.
+
+    def gradient_checkpointing_enable(self):
+        """Enable gradient checkpointing for memory optimization."""
+        self.gradient_checkpointing_enabled = True
+        self.paligemma_with_expert.paligemma.model.language_model.gradient_checkpointing = True
+        self.paligemma_with_expert.paligemma.model.vision_tower.gradient_checkpointing = True
+        self.paligemma_with_expert.gemma_expert.model.gradient_checkpointing = True
+
+        logging.info("Enabled gradient checkpointing for PI0Pytorch model")
+
+    def gradient_checkpointing_disable(self):
+        """Disable gradient checkpointing."""
+        self.gradient_checkpointing_enabled = False
+        self.paligemma_with_expert.paligemma.model.language_model.gradient_checkpointing = False
+        self.paligemma_with_expert.paligemma.model.vision_tower.gradient_checkpointing = False
+        self.paligemma_with_expert.gemma_expert.model.gradient_checkpointing = False
+
+        logging.info("Disabled gradient checkpointing for PI0Pytorch model")
+
+    def is_gradient_checkpointing_enabled(self):
+        """Check if gradient checkpointing is enabled."""
+        return self.gradient_checkpointing_enabled
+
+    def _apply_checkpoint(self, func, *args, **kwargs):
+        """Helper method to apply gradient checkpointing if enabled."""
+        if self.gradient_checkpointing_enabled and self.training:
+            return torch.utils.checkpoint.checkpoint(
+                func, *args, use_reentrant=False, preserve_rng_state=False, **kwargs
+            )
+        return func(*args, **kwargs)
+
+    def _prepare_attention_masks_4d(self, att_2d_masks):
+        """Helper method to prepare 4D attention masks for transformer."""
+        att_2d_masks_4d = att_2d_masks[:, None, :, :]
+        return torch.where(att_2d_masks_4d, 0.0, -2.3819763e38)
+
+    def _preprocess_observation(self, observation, *, train=True):
+        """Helper method to preprocess observation."""
+        observation = _preprocessing.preprocess_observation_pytorch(observation, train=train)
+        return (
+            list(observation.images.values()),
+            list(observation.image_masks.values()),
+            observation.tokenized_prompt,
+            observation.tokenized_prompt_mask,
+            observation.state,
+        )
+
+    def sample_noise(self, shape, device):
+        return torch.normal(
+            mean=0.0,
+            std=1.0,
+            size=shape,
+            dtype=torch.float32,
+            device=device,
+        )
+
+    def sample_time(self, bsize, device):
+        time_beta = sample_beta(1.5, 1.0, bsize, device)
+        time = time_beta * 0.999 + 0.001
+        return time.to(dtype=torch.float32, device=device)
+
+    def embed_prefix(
+        self, images, img_masks, lang_tokens, lang_masks
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        """Embed images with SigLIP and language tokens with embedding layer to prepare
+        for PaliGemma transformer processing.
+        """
+        embs = []
+        pad_masks = []
+        att_masks = []
+
+        # Process images
+        for img, img_mask in zip(images, img_masks, strict=True):
+
+            def image_embed_func(img):
+                return self.paligemma_with_expert.embed_image(img)
+
+            img_emb = self._apply_checkpoint(image_embed_func, img)
+
+            bsize, num_img_embs = img_emb.shape[:2]
+
+            embs.append(img_emb)
+            pad_masks.append(img_mask[:, None].expand(bsize, num_img_embs))
+
+            # Create attention masks so that image tokens attend to each other
+            att_masks += [0] * num_img_embs
+
+        # Process language tokens
+        def lang_embed_func(lang_tokens):
+            lang_emb = self.paligemma_with_expert.embed_language_tokens(lang_tokens)
+            # Transformers > 5.4 scales Gemma token embeddings inside embed_tokens, matching
+            # OpenPI's former explicit sqrt(hidden_size) multiply without applying it twice.
+            # See https://github.com/huggingface/transformers/pull/44432/changes#diff-5f76eac6f18f4b491521314c318a9692318feb4d19228e9576cce7bde4240834
+            return lang_emb
+
+        lang_emb = self._apply_checkpoint(lang_embed_func, lang_tokens)
+
+        embs.append(lang_emb)
+        pad_masks.append(lang_masks)
+
+        # full attention between image and language inputs
+        num_lang_embs = lang_emb.shape[1]
+        att_masks += [0] * num_lang_embs
+
+        embs = torch.cat(embs, dim=1)
+        pad_masks = torch.cat(pad_masks, dim=1)
+        att_masks = torch.tensor(att_masks, dtype=torch.bool, device=pad_masks.device)
+
+        # Get batch size from the first dimension of the concatenated tensors
+        bsize = pad_masks.shape[0]
+        att_masks = att_masks[None, :].expand(bsize, len(att_masks))
+
+        return embs, pad_masks, att_masks
+
+    def embed_suffix(self, state, noisy_actions, timestep):
+        """Embed state, noisy_actions, timestep to prepare for Expert Gemma processing."""
+        embs = []
+        pad_masks = []
+        att_masks = []
+
+        if not self.pi05:
+            if self.state_proj.weight.dtype == torch.float32:
+                state = state.to(torch.float32)
+
+            # Embed state
+            def state_proj_func(state):
+                return self.state_proj(state)
+
+            state_emb = self._apply_checkpoint(state_proj_func, state)
+
+            embs.append(state_emb[:, None, :])
+            bsize = state_emb.shape[0]
+            device = state_emb.device
+
+            state_mask = torch.ones(bsize, 1, dtype=torch.bool, device=device)
+            pad_masks.append(state_mask)
+
+            # Set attention masks so that image and language inputs do not attend to state or actions
+            att_masks += [1]
+
+        # Embed timestep using sine-cosine positional encoding with sensitivity in the range [0, 1]
+        time_emb = create_sinusoidal_pos_embedding(
+            timestep,
+            self.action_in_proj.out_features,
+            min_period=4e-3,
+            max_period=4.0,
+            device=timestep.device,
+        )
+        time_emb = time_emb.type(dtype=timestep.dtype)
+
+        # Fuse timestep + action information using an MLP
+        def action_proj_func(noisy_actions):
+            return self.action_in_proj(noisy_actions)
+
+        action_emb = self._apply_checkpoint(action_proj_func, noisy_actions)
+
+        if not self.pi05:
+            time_emb = time_emb[:, None, :].expand_as(action_emb)
+            action_time_emb = torch.cat([action_emb, time_emb], dim=2)
+
+            # Apply MLP layers
+            def mlp_func(action_time_emb):
+                x = self.action_time_mlp_in(action_time_emb)
+                x = F.silu(x)  # swish == silu
+                return self.action_time_mlp_out(x)
+
+            action_time_emb = self._apply_checkpoint(mlp_func, action_time_emb)
+            adarms_cond = None
+        else:
+            # time MLP (for adaRMS)
+            def time_mlp_func(time_emb):
+                x = self.time_mlp_in(time_emb)
+                x = F.silu(x)  # swish == silu
+                x = self.time_mlp_out(x)
+                return F.silu(x)
+
+            time_emb = self._apply_checkpoint(time_mlp_func, time_emb)
+            action_time_emb = action_emb
+            adarms_cond = time_emb
+
+        # Add to input tokens
+        embs.append(action_time_emb)
+
+        bsize, action_time_dim = action_time_emb.shape[:2]
+        action_time_mask = torch.ones(bsize, action_time_dim, dtype=torch.bool, device=timestep.device)
+        pad_masks.append(action_time_mask)
+
+        # Set attention masks so that image, language and state inputs do not attend to action tokens
+        att_masks += [1] + ([0] * (self.config.action_horizon - 1))
+
+        embs = torch.cat(embs, dim=1)
+        pad_masks = torch.cat(pad_masks, dim=1)
+        att_masks = torch.tensor(att_masks, dtype=embs.dtype, device=embs.device)
+        att_masks = att_masks[None, :].expand(bsize, len(att_masks))
+
+        return embs, pad_masks, att_masks, adarms_cond
+
+    def forward(self, observation, actions, noise=None, time=None) -> Tensor:
+        """Do a full training forward pass and compute the loss (batch_size x num_steps x num_motors)"""
+        images, img_masks, lang_tokens, lang_masks, state = self._preprocess_observation(
+            observation, train=True
+        )
+
+        if noise is None:
+            noise = self.sample_noise(actions.shape, actions.device)
+
+        if time is None:
+            time = self.sample_time(actions.shape[0], actions.device)
+
+        time_expanded = time[:, None, None]
+        x_t = time_expanded * noise + (1 - time_expanded) * actions
+        u_t = noise - actions
+
+        prefix_embs, prefix_pad_masks, prefix_att_masks = self.embed_prefix(
+            images, img_masks, lang_tokens, lang_masks
+        )
+        suffix_embs, suffix_pad_masks, suffix_att_masks, adarms_cond = self.embed_suffix(state, x_t, time)
+        if (
+            self.paligemma_with_expert.paligemma.model.language_model.layers[0].self_attn.q_proj.weight.dtype
+            == torch.bfloat16
+        ):
+            suffix_embs = suffix_embs.to(dtype=torch.bfloat16)
+            prefix_embs = prefix_embs.to(dtype=torch.bfloat16)
+
+        pad_masks = torch.cat([prefix_pad_masks, suffix_pad_masks], dim=1)
+        att_masks = torch.cat([prefix_att_masks, suffix_att_masks], dim=1)
+
+        att_2d_masks = make_att_2d_masks(pad_masks, att_masks)
+        position_ids = torch.cumsum(pad_masks, dim=1) - 1
+
+        # Prepare attention masks
+        att_2d_masks_4d = self._prepare_attention_masks_4d(att_2d_masks)
+
+        # Apply gradient checkpointing if enabled
+        def forward_func(prefix_embs, suffix_embs, att_2d_masks_4d, position_ids, adarms_cond):
+            (_, suffix_out), _ = self.paligemma_with_expert.forward(
+                attention_mask=att_2d_masks_4d,
+                position_ids=position_ids,
+                past_key_values=None,
+                inputs_embeds=[prefix_embs, suffix_embs],
+                use_cache=False,
+                adarms_cond=[None, adarms_cond],
+            )
+            return suffix_out
+
+        suffix_out = self._apply_checkpoint(
+            forward_func, prefix_embs, suffix_embs, att_2d_masks_4d, position_ids, adarms_cond
+        )
+
+        suffix_out = suffix_out[:, -self.config.action_horizon :]
+        suffix_out = suffix_out.to(dtype=torch.float32)
+
+        # Apply gradient checkpointing to final action projection if enabled
+        def action_out_proj_func(suffix_out):
+            return self.action_out_proj(suffix_out)
+
+        v_t = self._apply_checkpoint(action_out_proj_func, suffix_out)
+
+        return F.mse_loss(u_t, v_t, reduction="none")
+
+    @torch.no_grad()
+    def sample_actions(self, device, observation, noise=None, num_steps=10) -> Tensor:
+        """Do a full inference forward and compute the action (batch_size x num_steps x num_motors)"""
+        bsize = observation.state.shape[0]
+        if noise is None:
+            actions_shape = (bsize, self.config.action_horizon, self.config.action_dim)
+            noise = self.sample_noise(actions_shape, device)
+
+        images, img_masks, lang_tokens, lang_masks, state = self._preprocess_observation(
+            observation, train=False
+        )
+
+        prefix_embs, prefix_pad_masks, prefix_att_masks = self.embed_prefix(
+            images, img_masks, lang_tokens, lang_masks
+        )
+        prefix_att_2d_masks = make_att_2d_masks(prefix_pad_masks, prefix_att_masks)
+        prefix_position_ids = torch.cumsum(prefix_pad_masks, dim=1) - 1
+
+        # Compute image and language key value cache
+        prefix_att_2d_masks_4d = self._prepare_attention_masks_4d(prefix_att_2d_masks)
+        self.paligemma_with_expert.paligemma.model.language_model.config._attn_implementation = "eager"  # noqa: SLF001
+
+        _, past_key_values = self.paligemma_with_expert.forward(
+            attention_mask=prefix_att_2d_masks_4d,
+            position_ids=prefix_position_ids,
+            past_key_values=None,
+            inputs_embeds=[prefix_embs, None],
+            use_cache=True,
+        )
+
+        dt = -1.0 / num_steps
+        dt = torch.tensor(dt, dtype=torch.float32, device=device)
+
+        x_t = noise
+        time = torch.tensor(1.0, dtype=torch.float32, device=device)
+        while time >= -dt / 2:
+            expanded_time = time.expand(bsize)
+            v_t = self.denoise_step(
+                state,
+                prefix_pad_masks,
+                past_key_values,
+                x_t,
+                expanded_time,
+            )
+
+            # Euler step - use new tensor assignment instead of in-place operation
+            x_t = x_t + dt * v_t
+            time += dt
+        return x_t
+
+    def denoise_step(
+        self,
+        state,
+        prefix_pad_masks,
+        past_key_values,
+        x_t,
+        timestep,
+    ):
+        """Apply one denoising step of the noise `x_t` at a given timestep."""
+        suffix_embs, suffix_pad_masks, suffix_att_masks, adarms_cond = self.embed_suffix(state, x_t, timestep)
+
+        suffix_len = suffix_pad_masks.shape[1]
+        batch_size = prefix_pad_masks.shape[0]
+        prefix_len = prefix_pad_masks.shape[1]
+
+        prefix_pad_2d_masks = prefix_pad_masks[:, None, :].expand(batch_size, suffix_len, prefix_len)
+
+        suffix_att_2d_masks = make_att_2d_masks(suffix_pad_masks, suffix_att_masks)
+
+        full_att_2d_masks = torch.cat([prefix_pad_2d_masks, suffix_att_2d_masks], dim=2)
+
+        prefix_offsets = torch.sum(prefix_pad_masks, dim=-1)[:, None]
+        position_ids = prefix_offsets + torch.cumsum(suffix_pad_masks, dim=1) - 1
+
+        # Prepare attention masks
+        full_att_2d_masks_4d = self._prepare_attention_masks_4d(full_att_2d_masks)
+        self.paligemma_with_expert.gemma_expert.model.config._attn_implementation = "eager"  # noqa: SLF001
+
+        past_key_values = copy.deepcopy(past_key_values)
+        outputs_embeds, _ = self.paligemma_with_expert.forward(
+            attention_mask=full_att_2d_masks_4d,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=[None, suffix_embs],
+            use_cache=False,
+            adarms_cond=[None, adarms_cond],
+        )
+
+        suffix_out = outputs_embeds[1]
+        suffix_out = suffix_out[:, -self.config.action_horizon :]
+        suffix_out = suffix_out.to(dtype=torch.float32)
+        return self.action_out_proj(suffix_out)
diff --git a/tests/policies/pi0_pi05/openpi_pytorch/preprocessing_pytorch.py b/tests/policies/pi0_pi05/openpi_pytorch/preprocessing_pytorch.py
new file mode 100644
index 000000000..40df8d947
--- /dev/null
+++ b/tests/policies/pi0_pi05/openpi_pytorch/preprocessing_pytorch.py
@@ -0,0 +1,179 @@
+import logging
+from collections.abc import Sequence
+
+import torch
+
+from tests.policies.pi0_pi05.openpi_pytorch import image_tools
+
+logger = logging.getLogger("openpi")
+
+# Constants moved from model.py
+IMAGE_KEYS = (
+    "base_0_rgb",
+    "left_wrist_0_rgb",
+    "right_wrist_0_rgb",
+)
+
+IMAGE_RESOLUTION = (224, 224)
+
+
+def preprocess_observation_pytorch(
+    observation,
+    *,
+    train: bool = False,
+    image_keys: Sequence[str] = IMAGE_KEYS,
+    image_resolution: tuple[int, int] = IMAGE_RESOLUTION,
+):
+    """Torch.compile-compatible version of preprocess_observation_pytorch with simplified type annotations.
+
+    This function avoids complex type annotations that can cause torch.compile issues.
+    """
+    if not set(image_keys).issubset(observation.images):
+        raise ValueError(f"images dict missing keys: expected {image_keys}, got {list(observation.images)}")
+
+    batch_shape = observation.state.shape[:-1]
+
+    out_images = {}
+    for key in image_keys:
+        image = observation.images[key]
+
+        # TODO: This is a hack to handle both [B, C, H, W] and [B, H, W, C] formats.
+        # It assumes that a size-3 dimension 1 means the image is channels-first.
+        is_channels_first = image.shape[1] == 3  # Check if channels are in dimension 1
+
+        if is_channels_first:
+            # Convert [B, C, H, W] to [B, H, W, C] for processing
+            image = image.permute(0, 2, 3, 1)
+
+        if image.shape[1:3] != image_resolution:
+            logger.info(f"Resizing image {key} from {image.shape[1:3]} to {image_resolution}")
+            image = image_tools.resize_with_pad_torch(image, *image_resolution)
+
+        if train:
+            # Convert from [-1, 1] to [0, 1] for PyTorch augmentations
+            image = image / 2.0 + 0.5
+
+            # Apply PyTorch-based augmentations
+            if "wrist" not in key:
+                # Geometric augmentations for non-wrist cameras
+                height, width = image.shape[1:3]
+
+                # Random crop and resize
+                crop_height = int(height * 0.95)
+                crop_width = int(width * 0.95)
+
+                # Random crop
+                max_h = height - crop_height
+                max_w = width - crop_width
+                if max_h > 0 and max_w > 0:
+                    # Use tensor operations instead of .item() for torch.compile compatibility
+                    start_h = torch.randint(0, max_h + 1, (1,), device=image.device)
+                    start_w = torch.randint(0, max_w + 1, (1,), device=image.device)
+                    image = image[:, start_h : start_h + crop_height, start_w : start_w + crop_width, :]
+
+                # Resize back to original size
+                image = torch.nn.functional.interpolate(
+                    image.permute(0, 3, 1, 2),  # [b, h, w, c] -> [b, c, h, w]
+                    size=(height, width),
+                    mode="bilinear",
+                    align_corners=False,
+                ).permute(0, 2, 3, 1)  # [b, c, h, w] -> [b, h, w, c]
+
+                # Random rotation (small angles)
+                # Use tensor operations instead of .item() for torch.compile compatibility
+                angle = torch.rand(1, device=image.device) * 10 - 5  # Random angle between -5 and 5 degrees
+                if torch.abs(angle) > 0.1:  # Only rotate if angle is significant
+                    # Convert to radians
+                    angle_rad = angle * torch.pi / 180.0
+
+                    # Create rotation matrix
+                    cos_a = torch.cos(angle_rad)
+                    sin_a = torch.sin(angle_rad)
+
+                    # Apply rotation using grid_sample
+                    grid_x = torch.linspace(-1, 1, width, device=image.device)
+                    grid_y = torch.linspace(-1, 1, height, device=image.device)
+
+                    # Create meshgrid
+                    grid_y, grid_x = torch.meshgrid(grid_y, grid_x, indexing="ij")
+
+                    # Expand to batch dimension
+                    grid_x = grid_x.unsqueeze(0).expand(image.shape[0], -1, -1)
+                    grid_y = grid_y.unsqueeze(0).expand(image.shape[0], -1, -1)
+
+                    # Apply rotation transformation
+                    grid_x_rot = grid_x * cos_a - grid_y * sin_a
+                    grid_y_rot = grid_x * sin_a + grid_y * cos_a
+
+                    # Stack and reshape for grid_sample
+                    grid = torch.stack([grid_x_rot, grid_y_rot], dim=-1)
+
+                    image = torch.nn.functional.grid_sample(
+                        image.permute(0, 3, 1, 2),  # [b, h, w, c] -> [b, c, h, w]
+                        grid,
+                        mode="bilinear",
+                        padding_mode="zeros",
+                        align_corners=False,
+                    ).permute(0, 2, 3, 1)  # [b, c, h, w] -> [b, h, w, c]
+
+            # Color augmentations for all cameras
+            # Random brightness
+            # Use tensor operations instead of .item() for torch.compile compatibility
+            brightness_factor = (
+                0.7 + torch.rand(1, device=image.device) * 0.6
+            )  # Random factor between 0.7 and 1.3
+            image = image * brightness_factor
+
+            # Random contrast
+            # Use tensor operations instead of .item() for torch.compile compatibility
+            contrast_factor = (
+                0.6 + torch.rand(1, device=image.device) * 0.8
+            )  # Random factor between 0.6 and 1.4
+            mean = image.mean(dim=[1, 2, 3], keepdim=True)
+            image = (image - mean) * contrast_factor + mean
+
+            # Random saturation. Instead of a full HSV round trip, scale each pixel's
+            # deviation from its channel-mean gray value by a random factor.
+            # Use tensor operations instead of .item() for torch.compile compatibility
+            saturation_factor = (
+                0.5 + torch.rand(1, device=image.device) * 1.0
+            )  # Random factor between 0.5 and 1.5
+            gray = image.mean(dim=-1, keepdim=True)
+            image = gray + (image - gray) * saturation_factor
+
+            # Clamp values to [0, 1]
+            image = torch.clamp(image, 0, 1)
+
+            # Back to [-1, 1]
+            image = image * 2.0 - 1.0
+
+        # Convert back to [B, C, H, W] format if it was originally channels-first
+        if is_channels_first:
+            image = image.permute(0, 3, 1, 2)  # [B, H, W, C] -> [B, C, H, W]
+
+        out_images[key] = image
+
+    # obtain mask
+    out_masks = {}
+    for key in out_images:
+        if key not in observation.image_masks:
+            # do not mask by default
+            out_masks[key] = torch.ones(batch_shape, dtype=torch.bool, device=observation.state.device)
+        else:
+            out_masks[key] = observation.image_masks[key]
+
+    # Create a simple object with the required attributes instead of using the complex Observation class
+    class SimpleProcessedObservation:
+        def __init__(self, **kwargs):
+            for key, value in kwargs.items():
+                setattr(self, key, value)
+
+    return SimpleProcessedObservation(
+        images=out_images,
+        image_masks=out_masks,
+        state=observation.state,
+        tokenized_prompt=observation.tokenized_prompt,
+        tokenized_prompt_mask=observation.tokenized_prompt_mask,
+        token_ar_mask=observation.token_ar_mask,
+        token_loss_mask=observation.token_loss_mask,
+    )
diff --git a/tests/policies/pi0_pi05/test_pi05_compile.py b/tests/policies/pi0_pi05/test_pi05_compile.py
new file mode 100644
index 000000000..ce940e998
--- /dev/null
+++ b/tests/policies/pi0_pi05/test_pi05_compile.py
@@ -0,0 +1,101 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+import pytest
+import torch
+
+pytest.importorskip("transformers")
+
+from lerobot.policies.pi05 import PI05Config  # noqa: E402
+from lerobot.policies.pi05.modeling_pi05 import PI05Pytorch  # noqa: E402
+from tests.policies.pi0_pi05.utils.torch_compile import (  # noqa: E402
+    assert_cache_stability,
+    assert_compiled_output_matches_eager,
+    assert_explain_has_no_graph_breaks,
+    benchmark_runtime,
+    make_compile_config,
+    reset_compile_state,
+)
+from tests.utils import require_cuda  # noqa: E402
+
+pytestmark = pytest.mark.skipif(
+    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
+    reason="torch.compile benchmark is too slow for CI; run manually on GPU nodes",
+)
+
+
+def _make_model(*, compile_model):
+    return PI05Pytorch(make_compile_config(PI05Config, compile_model=compile_model)).cuda().eval()
+
+
+def _make_dummy_inputs(config):
+    device = torch.device("cuda")
+    common = {
+        "images": [torch.randn(1, 3, *config.image_resolution, device=device)],
+        "img_masks": [torch.ones(1, dtype=torch.bool, device=device)],
+        "tokens": torch.randint(0, 1024, (1, 5), dtype=torch.long, device=device),
+        "masks": torch.ones(1, 5, dtype=torch.bool, device=device),
+    }
+    forward_kwargs = {
+        **common,
+        "actions": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
+        "noise": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
+        "time": torch.rand(1, device=device),
+    }
+    sample_kwargs = {
+        **common,
+        "noise": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
+        "num_steps": config.num_inference_steps,
+    }
+    return forward_kwargs, sample_kwargs
+
+
+@require_cuda
+def test_pi05_torch_compile_forward_and_sample_actions():
+    if not hasattr(torch, "compile"):
+        pytest.skip("torch.compile is not available")
+    if not torch._dynamo.is_dynamo_supported():
+        pytest.skip("torch._dynamo is not supported on this platform")
+
+    torch.manual_seed(0)
+    eager_model = _make_model(compile_model=False)
+    torch.manual_seed(0)
+    compiled_model = _make_model(compile_model=True)
+    forward_kwargs, sample_kwargs = _make_dummy_inputs(compiled_model.config)
+
+    try:
+        assert_compiled_output_matches_eager(eager_model, compiled_model, forward_kwargs, sample_kwargs)
+
+        assert_explain_has_no_graph_breaks(eager_model.forward, forward_kwargs, "pi05.forward")
+        assert_explain_has_no_graph_breaks(eager_model.sample_actions, sample_kwargs, "pi05.sample_actions")
+
+        assert_cache_stability(compiled_model.forward, forward_kwargs, "pi05.forward")
+        assert_cache_stability(compiled_model.sample_actions, sample_kwargs, "pi05.sample_actions")
+
+        benchmark_runtime(eager_model.forward, compiled_model.forward, forward_kwargs, "pi05.forward")
+        benchmark_runtime(
+            eager_model.sample_actions,
+            compiled_model.sample_actions,
+            sample_kwargs,
+            "pi05.sample_actions",
+        )
+    finally:
+        reset_compile_state()
+        del eager_model
+        del compiled_model
+        torch.cuda.empty_cache()
diff --git a/tests/policies/pi0_pi05/test_pi05_original_vs_lerobot.py b/tests/policies/pi0_pi05/test_pi05_original_vs_lerobot.py
index a965132b0..2ab5f1b94 100644
--- a/tests/policies/pi0_pi05/test_pi05_original_vs_lerobot.py
+++ b/tests/policies/pi0_pi05/test_pi05_original_vs_lerobot.py
@@ -14,52 +14,56 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-"""Test script to verify PI0OpenPI policy integration with LeRobot vs the original implementation"""
+"""Compare LeRobot PI0.5 against the vendored OpenPI PyTorch reference."""
 
+import gc
 import os
-from copy import deepcopy
-from typing import Any
 
-import numpy as np
 import pytest
 import torch
 
-# Skip if openpi or transformers is not available
-pytest.importorskip("openpi")
 pytest.importorskip("transformers")
 
-# Skip this entire module in CI
-pytestmark = pytest.mark.skipif(
-    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="This test requires local OpenPI installation and is not meant for CI",
+from lerobot.configs import PreTrainedConfig  # noqa: E402
+from lerobot.policies.pi05 import PI05Policy  # noqa: E402
+from lerobot.policies.pi05.processor_pi05 import make_pi05_pre_post_processors  # noqa: E402
+from lerobot.utils.constants import ACTION, OBS_STATE  # noqa: E402
+from tests.policies.pi0_pi05.openpi_pytorch.pi0_pytorch import PI0Pytorch  # noqa: E402
+from tests.policies.pi0_pi05.utils.openpi_parity import (  # noqa: E402
+    assert_processor_inputs_match_lerobot,
+    clone_batch,
+    deterministic_openpi_forward_preprocess,
+    fix_reference_state_dict,
+    fixed_flow_sampling,
+    load_openpi_reference_state_dict,
+    make_openpi_observation_from_raw,
+    openpi_model_actions_from_raw,
 )
 
-from openpi.models_pytorch import preprocessing_pytorch as openpi_preprocessing  # noqa: E402
+pytestmark = pytest.mark.skipif(
+    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
+    reason="OpenPI parity and torch.compile checks are too slow for CI; run manually on GPU nodes",
+)
 
-# NOTE: Assumes PYTHONPATH is set to include OpenPI src as per instructions.
-from openpi.models_pytorch.pi0_pytorch import PI0Pytorch  # noqa: E402
-from transformers import AutoTokenizer  # noqa: E402
-
-from lerobot.policies.pi05 import PI05Config, PI05Policy  # noqa: E402
-from lerobot.policies.pi05.processor_pi05 import make_pi05_pre_post_processors  # noqa: E402
-from lerobot.processor import PolicyProcessorPipeline  # noqa: E402
-from lerobot.types import PolicyAction  # noqa: E402
-
-# TODO: ADDING DEFAULT IMAGES_FEATURES TO CONFIG
 DUMMY_ACTION_DIM = 32
 DUMMY_STATE_DIM = 32
 DUMMY_ACTION_HORIZON = 50
 DUMMY_MAX_TOKEN_LEN = 200
-DEVICE = "cpu"  # Use CPU to avoid memory issues for testing
+DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+COMPILE_MODE = "default"
+FORWARD_RTOL = 1e-4
+FORWARD_ATOL = 1e-4
+SAMPLE_RTOL = 1e-2
+SAMPLE_ATOL = 5e-3
 
 DUMMY_DATASET_STATS = {
-    "observation.state": {
+    OBS_STATE: {
         "mean": torch.zeros(DUMMY_STATE_DIM),
         "std": torch.ones(DUMMY_STATE_DIM),
         "q01": torch.zeros(DUMMY_STATE_DIM),
         "q99": torch.ones(DUMMY_STATE_DIM),
     },
-    "action": {
+    ACTION: {
         "mean": torch.zeros(DUMMY_ACTION_DIM),
         "std": torch.ones(DUMMY_ACTION_DIM),
         "q01": torch.zeros(DUMMY_ACTION_DIM),
@@ -88,6 +92,15 @@ DUMMY_DATASET_STATS = {
 }
 
 
+@pytest.fixture(autouse=True)
+def cleanup_cuda_after_test():
+    yield
+    gc.collect()
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+        torch.cuda.ipc_collect()
+
+
 class PI05BaseOriginalConfig:
     action_dim: int = DUMMY_ACTION_DIM
     action_horizon: int = DUMMY_ACTION_HORIZON
@@ -96,341 +109,163 @@ class PI05BaseOriginalConfig:
     precision: str = "float32"
     pi05: bool = True
     dtype: str = "float32"
+    pytorch_compile_mode: str | None = None
 
 
-def instantiate_lerobot_pi05(
-    from_pretrained: bool = False,
-) -> tuple[
-    PI05Policy,
-    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
-    PolicyProcessorPipeline[PolicyAction, PolicyAction],
-]:
-    if from_pretrained:
-        # Load the policy first
-        policy = PI05Policy.from_pretrained(pretrained_name_or_path="lerobot/pi05_base", strict=True)
-    else:
-        config = PI05Config(max_action_dim=DUMMY_ACTION_DIM, max_state_dim=DUMMY_STATE_DIM, dtype="float32")
-        policy = PI05Policy(config)
+def instantiate_lerobot_pi05(*, compile_model: bool = False, gradient_checkpointing: bool = False):
+    config = PreTrainedConfig.from_pretrained("lerobot/pi05_base")
+    config.device = str(DEVICE)
+    config.dtype = "float32"
+    config.compile_model = compile_model
+    config.compile_mode = COMPILE_MODE
+    config.gradient_checkpointing = gradient_checkpointing
 
+    policy = PI05Policy.from_pretrained("lerobot/pi05_base", config=config, strict=True)
     policy.to(DEVICE)
-    policy.config.device = DEVICE
-    preprocessor, postprocessor = make_pi05_pre_post_processors(
-        config=policy.config, dataset_stats=DUMMY_DATASET_STATS
-    )
-    return (policy, preprocessor, postprocessor)
+    policy.config.device = str(DEVICE)
+    preprocessor, _ = make_pi05_pre_post_processors(config=policy.config, dataset_stats=DUMMY_DATASET_STATS)
+    return policy, preprocessor
 
 
-def instantiate_original_pi05(from_pretrained: bool = False, model_path: str | None = None):
-    config = PI05BaseOriginalConfig()
-    policy = PI0Pytorch(config)
+def instantiate_original_pi05():
+    policy = PI0Pytorch(PI05BaseOriginalConfig()).to(DEVICE)
 
-    if from_pretrained:
-        try:
-            print("Loading converted PyTorch weights from HuggingFace Hub (lerobot/pi05_base)...")
-
-            # Download the model from HuggingFace Hub
-            import safetensors.torch
-            from huggingface_hub import snapshot_download
-
-            # Download the entire repository
-            if model_path and os.path.exists(model_path):
-                cache_dir = model_path
-                print(f"Using cached model from: {cache_dir}")
-            else:
-                cache_dir = snapshot_download(repo_id="lerobot/pi05_base", repo_type="model")
-                print(f"Downloaded model to: {cache_dir}")
-
-            # Try to load safetensors format first
-            model_file = os.path.join(cache_dir, "model.safetensors")
-            if os.path.exists(model_file):
-                state_dict = safetensors.torch.load_file(model_file)
-                print(f"Loaded {len(state_dict)} parameters from safetensors")
-            else:
-                raise FileNotFoundError(f"No safetensors file found in {cache_dir}")
-
-            # Load the state dict into the model
-            missing_keys, unexpected_keys = policy.load_state_dict(state_dict, strict=False)
-
-            if missing_keys:
-                print(f"Missing keys: {len(missing_keys)}")
-                if len(missing_keys) <= 5:
-                    for key in missing_keys:
-                        print(f"    - {key}")
-                else:
-                    for key in missing_keys[:5]:
-                        print(f"    - {key}")
-                    print(f"    ... and {len(missing_keys) - 5} more")
-
-            if unexpected_keys:
-                print(f"Unexpected keys: {len(unexpected_keys)}")
-                if len(unexpected_keys) <= 5:
-                    for key in unexpected_keys:
-                        print(f"    - {key}")
-                else:
-                    for key in unexpected_keys[:5]:
-                        print(f"    - {key}")
-                    print(f"    ... and {len(unexpected_keys) - 5} more")
-
-            if not missing_keys and not unexpected_keys:
-                print("All pretrained weights loaded successfully!")
-            else:
-                print("Pretrained weights loaded with some missing/unexpected keys (this may be normal)")
-
-        except Exception as e:
-            print(f"Failed to load pretrained weights: {e}")
-            print("   Using randomly initialized weights...")
-            import traceback
-
-            traceback.print_exc()
-
-    policy.to(DEVICE)
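+    # NOTE: As with PI0, the LeRobot loader for `lerobot/pi05_base` converts keys for compatibility
+    # before the strict load, so no missing_keys or unexpected_keys are expected. The vendored reference
+    # is a bare `nn.Module`, so the test must bridge the small checkpoint-vs-module naming differences.
+    # NOTE: `lm_head.weight` is the saved name of PaliGemma's tied embedding. LeRobot's from_pretrained
+    # maps it to the internal `embed_tokens.weight`; the reference model has no such loader, so the same
+    # tensor is reused manually here to avoid mistaking a weight-alias difference for a model difference.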
+    state_dict = fix_reference_state_dict(load_openpi_reference_state_dict("lerobot/pi05_base"))
+    missing_keys, unexpected_keys = policy.load_state_dict(state_dict, strict=False)
+    assert missing_keys == []
+    assert unexpected_keys == []
     return policy
 
 
 def create_dummy_data():
-    batch_size = 2  # Reduce batch size for testing
-    device = DEVICE
-
-    # Use the exact same prompt for both implementations
+    batch_size = 2
     prompt = "Pick up the red block and place it in the bin"
-
-    batch = {
-        "observation.state": torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=device),
-        "action": torch.randn(
-            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=device
+    return {
+        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=DEVICE),
+        ACTION: torch.randn(
+            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=DEVICE
         ),
-        # Create images in [0, 1] range as expected by LeRobot (will be converted to [-1, 1] internally)
         "observation.images.base_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=device
+            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
         ),
         "observation.images.left_wrist_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=device
+            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
         ),
         "observation.images.right_wrist_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=device
+            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
         ),
-        # Add the task prompt for LeRobot - provide as list with single element to trigger expansion
         "task": [prompt for _ in range(batch_size)],
     }
-    return batch
 
 
-def extract_lerobot_processed_inputs(lerobot_pi0, batch):
-    """Extract the exact same processed inputs that LeRobot uses internally."""
-    # Get the tokenized language from LeRobot's internal method
-    lang_tokens, lang_masks = lerobot_pi0._tokenize_language(batch)
-
-    # Get the preprocessed images from LeRobot's internal method
-    images, img_masks = lerobot_pi0._preprocess_images(batch, train=False)
-
-    # Create dummy token_ar_mask and token_loss_mask for original implementation
-    token_ar_mask = torch.zeros_like(lang_tokens, dtype=torch.int32)
-    token_loss_mask = torch.ones_like(lang_masks, dtype=torch.bool)
-
-    return images, img_masks, lang_tokens, lang_masks, token_ar_mask, token_loss_mask
+def prepare_parity_inputs(lerobot_pi05, lerobot_preprocessor):
+    torch.manual_seed(0)
+    raw_batch = create_dummy_data()
+    lerobot_batch = lerobot_preprocessor(clone_batch(raw_batch))
+    openpi_observation = make_openpi_observation_from_raw(
+        raw_batch,
+        action_dim=DUMMY_ACTION_DIM,
+        max_token_len=DUMMY_MAX_TOKEN_LEN,
+        dataset_stats=DUMMY_DATASET_STATS,
+        pi05=True,
+    )
+    openpi_actions = openpi_model_actions_from_raw(
+        raw_batch,
+        action_dim=DUMMY_ACTION_DIM,
+        dataset_stats=DUMMY_DATASET_STATS,
+        pi05=True,
+    )
+    assert_processor_inputs_match_lerobot(
+        lerobot_pi05,
+        lerobot_batch,
+        openpi_observation,
+        compare_state=False,
+    )
+    batch_size = raw_batch[OBS_STATE].shape[0]
+    noise = torch.randn(
+        batch_size,
+        DUMMY_ACTION_HORIZON,
+        DUMMY_ACTION_DIM,
+        dtype=torch.float32,
+        device=DEVICE,
+    )
+    time = torch.linspace(0.2, 0.8, batch_size, dtype=torch.float32, device=DEVICE)
+    return lerobot_batch, openpi_observation, openpi_actions, noise, time
 
 
-class PI05Observation:
-    """Observation class that matches the original OpenPI format."""
-
-    def __init__(
-        self,
-        state,
-        images,
-        image_masks,
-        tokenized_prompt,
-        tokenized_prompt_mask,
-        token_ar_mask,
-        token_loss_mask,
-    ):
-        self.state = state
-        self.images = images
-        self.image_masks = image_masks
-        self.tokenized_prompt = tokenized_prompt
-        self.tokenized_prompt_mask = tokenized_prompt_mask
-        self.token_ar_mask = token_ar_mask
-        self.token_loss_mask = token_loss_mask
-
-
-def create_original_observation_with_openpi_preprocessing(batch):
-    """Create observation object for OpenPI using OpenPI's own preprocessing with pi05 state tokenizer."""
-    batch_size = batch["observation.state"].shape[0]
-    device = batch["observation.state"].device
-
-    # Create tokenizer for OpenPI (same as LeRobot uses)
-    tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
-
-    # Get task description (pi05 processor handles all text formatting)
-    tasks = batch.get("task", ["Pick up the object"] * batch_size)
-    if isinstance(tasks, str):
-        tasks = [tasks] * batch_size
-    elif len(tasks) == 1:
-        tasks = tasks * batch_size
-
-    # Use pi05 state and input tokenizer logic (same as Pi05PrepareStateTokenizerProcessorStep)
-    state = batch["observation.state"]
-    state = deepcopy(state)
-
-    # Prepare state (pad to max_state_dim)
-    from lerobot.policies.pi05.modeling_pi05 import pad_vector
-
-    state = pad_vector(state, DUMMY_STATE_DIM)
-
-    # Normalize state to [-1, 1] range if needed (assuming it's already normalized from normalize_inputs)
-    # Discretize into 256 bins (see openpi `PaligemmaTokenizer.tokenize()`)
-    state_np = state.cpu().numpy()
-    discretized_states = np.digitize(state_np, bins=np.linspace(-1, 1, 256 + 1)[:-1]) - 1
-
-    # Create pi05-formatted prompts that include state information
-    full_prompts = []
-    for i, task in enumerate(tasks):
-        cleaned_text = task.strip().replace("_", " ").replace("\n", " ")
-        state_str = " ".join(map(str, discretized_states[i]))
-        full_prompt = f"Task: {cleaned_text}, State: {state_str};\nAction: "
-        full_prompts.append(full_prompt)
-
-    # Tokenize with max_length padding to match OpenPI's expected format
-    tokenized = tokenizer(
-        full_prompts,
-        padding="max_length",
-        padding_side="right",
-        truncation=True,
-        max_length=DUMMY_MAX_TOKEN_LEN,
-        return_tensors="pt",
+def assert_forward_matches(*, compile_model: bool = False, gradient_checkpointing: bool = False):
+    lerobot_pi05, lerobot_preprocessor = instantiate_lerobot_pi05(
+        compile_model=compile_model,
+        gradient_checkpointing=gradient_checkpointing,
+    )
+    original_pi05 = instantiate_original_pi05()
+    lerobot_batch, openpi_observation, openpi_actions, noise, time = prepare_parity_inputs(
+        lerobot_pi05,
+        lerobot_preprocessor,
     )
 
-    lang_tokens = tokenized["input_ids"].to(device)
-    lang_masks = tokenized["attention_mask"].to(device, dtype=torch.bool)
+    if gradient_checkpointing:
+        lerobot_pi05.train()
+    else:
+        lerobot_pi05.eval()
+    original_pi05.eval()
 
-    # Create dummy token_ar_mask and token_loss_mask for OpenPI
-    token_ar_mask = torch.zeros_like(lang_tokens, dtype=torch.int32)
-    token_loss_mask = torch.ones_like(lang_masks, dtype=torch.bool)
+    with fixed_flow_sampling(lerobot_pi05.model, noise=noise, time=time):
+        lerobot_loss, _ = lerobot_pi05(lerobot_batch, reduction="none")
+    with deterministic_openpi_forward_preprocess(original_pi05):
+        openpi_losses = original_pi05(openpi_observation, openpi_actions, noise=noise, time=time)
+    openpi_loss = openpi_losses.mean(dim=(1, 2))
 
-    # Convert LeRobot images format to OpenPI format (convert [0,1] to [-1,1] range)
-    image_dict = {
-        "base_0_rgb": batch["observation.images.base_0_rgb"] * 2.0 - 1.0,
-        "left_wrist_0_rgb": batch["observation.images.left_wrist_0_rgb"] * 2.0 - 1.0,
-        "right_wrist_0_rgb": batch["observation.images.right_wrist_0_rgb"] * 2.0 - 1.0,
-    }
+    torch.testing.assert_close(lerobot_loss, openpi_loss, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
 
-    # Create image masks (all ones for real images)
-    image_masks_dict = {}
-    for key in image_dict:
-        image_masks_dict[key] = torch.ones(batch_size, dtype=torch.bool, device=device)
 
-    # Create raw observation object (before preprocessing)
-    raw_observation = PI05Observation(
-        state=batch["observation.state"],
-        images=image_dict,
-        image_masks=image_masks_dict,
-        tokenized_prompt=lang_tokens,
-        tokenized_prompt_mask=lang_masks,
-        token_ar_mask=token_ar_mask,
-        token_loss_mask=token_loss_mask,
+def assert_sample_actions_match_openpi(*, compile_model: bool = False):
+    lerobot_pi05, lerobot_preprocessor = instantiate_lerobot_pi05(compile_model=compile_model)
+    original_pi05 = instantiate_original_pi05()
+    lerobot_batch, openpi_observation, _openpi_actions, noise, _time = prepare_parity_inputs(
+        lerobot_pi05,
+        lerobot_preprocessor,
     )
 
-    # Now use OpenPI's preprocessing
-    processed_obs = openpi_preprocessing.preprocess_observation_pytorch(raw_observation, train=False)
-
-    return processed_obs
-
-
-def create_original_observation_from_lerobot(lerobot_pi0, batch):
-    """Create observation object compatible with original OpenPI using the exact same inputs as LeRobot."""
-    _batch_size = batch["observation.state"].shape[0]
-    _device = batch["observation.state"].device
-
-    # Extract the exact same processed inputs that LeRobot uses
-    images, img_masks, lang_tokens, lang_masks, token_ar_mask, token_loss_mask = (
-        extract_lerobot_processed_inputs(lerobot_pi0, batch)
-    )
-
-    # Convert images list to dict with original OpenPI keys
-    image_dict = {
-        "base_0_rgb": images[0],
-        "left_wrist_0_rgb": images[1],
-        "right_wrist_0_rgb": images[2],
-    }
-
-    # Convert image masks list to dict with original OpenPI keys
-    image_masks_dict = {
-        "base_0_rgb": img_masks[0],
-        "left_wrist_0_rgb": img_masks[1],
-        "right_wrist_0_rgb": img_masks[2],
-    }
-
-    return PI05Observation(
-        state=batch["observation.state"],
-        images=image_dict,
-        image_masks=image_masks_dict,
-        tokenized_prompt=lang_tokens,
-        tokenized_prompt_mask=lang_masks,
-        token_ar_mask=token_ar_mask,
-        token_loss_mask=token_loss_mask,
-    )
-
-
-def test_pi05_original_vs_lerobot():
-    """Test PI05 original implementation vs LeRobot implementation."""
-    print("Initializing models...")
-    lerobot_pi05, lerobot_preprocessor, lerobot_postprocessor = instantiate_lerobot_pi05(
-        from_pretrained=True
-    )  # Load pretrained LeRobot model
-    original_pi0 = instantiate_original_pi05(
-        from_pretrained=True
-    )  # Load pretrained OpenPI model from HuggingFace Hub
-
-    print("Creating dummy data...")
-    batch = create_dummy_data()
-    batch_lerobot = deepcopy(batch)
-
-    # Test each model with its own preprocessing (more realistic end-to-end test)
-    print("\nTest each model with its own preprocessing")
-    print("Creating observation for OpenPI using OpenPI's own preprocessing...")
-    pi0_obs_openpi = create_original_observation_with_openpi_preprocessing(batch)
-
-    print(f"Task prompt: '{batch['task'][0]}'")
-    print(f"OpenPI tokenized prompt shape: {pi0_obs_openpi.tokenized_prompt.shape}")
-    print(f"OpenPI image shapes: {[img.shape for img in pi0_obs_openpi.images.values()]}")
-    print(f"OpenPI state shape: {pi0_obs_openpi.state.shape}")
-
-    print("Testing OpenPI with own preprocessing...")
-    original_pi0.eval()
-    torch.manual_seed(42)  # Set seed for reproducibility
-    batch_size = batch["observation.state"].shape[0]
-    noise_shape = (batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM)
-    fixed_noise = torch.randn(noise_shape, dtype=torch.float32, device=DEVICE)
-
-    with torch.no_grad():
-        openpi_actions = original_pi0.sample_actions(
-            device=DEVICE, observation=pi0_obs_openpi, noise=fixed_noise, num_steps=10
-        )
-        openpi_actions_unit = openpi_actions[:, 0, :]
-    print(f"OpenPI (own preprocessing) Actions shape: {openpi_actions.shape}")
-    print(f"OpenPI (own preprocessing) Actions unit shape: {openpi_actions_unit.shape}")
-    print(f"OpenPI (own preprocessing) Actions mean: {openpi_actions.mean().item():.6f}")
-    print(f"OpenPI (own preprocessing) Actions std: {openpi_actions.std().item():.6f}")
-
-    print("Testing LeRobot with own preprocessing...")
     lerobot_pi05.eval()
-    torch.manual_seed(42)  # Set the same seed
-
-    batch_lerobot_processed = lerobot_preprocessor(batch_lerobot)
+    original_pi05.eval()
     with torch.no_grad():
-        lerobot_actions_own = lerobot_pi05.predict_action_chunk(
-            batch_lerobot_processed
-        )  # batch_size, n_action_steps, action_dim
-        lerobot_actions_unit = lerobot_actions_own[:, 0, :]
-    print(f"LeRobot (own preprocessing) Actions shape: {lerobot_actions_own.shape}")
-    print(f"LeRobot (own preprocessing) Actions unit shape: {lerobot_actions_unit.shape}")
-    print(f"LeRobot (own preprocessing) Actions mean: {lerobot_actions_own.mean().item():.6f}")
-    print(f"LeRobot (own preprocessing) Actions std: {lerobot_actions_own.std().item():.6f}")
+        lerobot_actions = lerobot_pi05.predict_action_chunk(lerobot_batch, noise=noise, num_steps=10)
+        openpi_actions = original_pi05.sample_actions(
+            device=DEVICE,
+            observation=openpi_observation,
+            noise=noise,
+            num_steps=10,
+        )
 
-    print("\nComparing end-to-end implementations:")
-    print(f"Actions close (atol=1e-4): {torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-4)}")
-    print(f"Actions close (atol=1e-2): {torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-2)}")
-    print(f"Max absolute difference: {torch.abs(lerobot_actions_own - openpi_actions).max().item():.6f}")
+    torch.testing.assert_close(lerobot_actions, openpi_actions, rtol=SAMPLE_RTOL, atol=SAMPLE_ATOL)
 
-    assert torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-4)
-    assert torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-2)
-    assert torch.abs(lerobot_actions_own - openpi_actions).max().item() < 1e-4
+
+def test_pi05_forward_matches_openpi():
+    assert_forward_matches()
+
+
+def test_pi05_sample_actions_match_openpi():
+    assert_sample_actions_match_openpi()
+
+
+def test_pi05_gradient_checkpointing_forward_matches_openpi():
+    assert_forward_matches(gradient_checkpointing=True)
+
+
+def test_pi05_compile_forward_matches_openpi():
+    assert_forward_matches(compile_model=True)
+
+
+def test_pi05_compile_sample_actions_match_openpi():
+    assert_sample_actions_match_openpi(compile_model=True)
+
+
+def test_pi05_compile_gradient_checkpointing_forward_matches_openpi():
+    assert_forward_matches(compile_model=True, gradient_checkpointing=True)
diff --git a/tests/policies/pi0_pi05/test_pi0_compile.py b/tests/policies/pi0_pi05/test_pi0_compile.py
new file mode 100644
index 000000000..4c8f55d4c
--- /dev/null
+++ b/tests/policies/pi0_pi05/test_pi0_compile.py
@@ -0,0 +1,99 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+import pytest
+import torch
+
+pytest.importorskip("transformers")
+
+from lerobot.policies.pi0 import PI0Config  # noqa: E402
+from lerobot.policies.pi0.modeling_pi0 import PI0Pytorch  # noqa: E402
+from tests.policies.pi0_pi05.utils.torch_compile import (  # noqa: E402
+    assert_cache_stability,
+    assert_compiled_output_matches_eager,
+    assert_explain_has_no_graph_breaks,
+    benchmark_runtime,
+    make_compile_config,
+    reset_compile_state,
+)
+from tests.utils import require_cuda  # noqa: E402
+
+pytestmark = pytest.mark.skipif(
+    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
+    reason="torch.compile benchmark is too slow for CI; run manually on GPU nodes",
+)
+
+
+def _make_model(*, compile_model):
+    return PI0Pytorch(make_compile_config(PI0Config, compile_model=compile_model)).cuda().eval()
+
+
+def _make_dummy_inputs(config):
+    device = torch.device("cuda")
+    common = {
+        "images": [torch.randn(1, 3, *config.image_resolution, device=device)],
+        "img_masks": [torch.ones(1, dtype=torch.bool, device=device)],
+        "lang_tokens": torch.randint(0, 1024, (1, 5), dtype=torch.long, device=device),
+        "lang_masks": torch.ones(1, 5, dtype=torch.bool, device=device),
+        "state": torch.randn(1, config.max_state_dim, device=device),
+    }
+    forward_kwargs = {
+        **common,
+        "actions": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
+        "noise": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
+        "time": torch.rand(1, device=device),
+    }
+    sample_kwargs = {
+        **common,
+        "noise": torch.randn(1, config.chunk_size, config.max_action_dim, device=device),
+        "num_steps": config.num_inference_steps,
+    }
+    return forward_kwargs, sample_kwargs
+
+
+@require_cuda
+def test_pi0_torch_compile_forward_and_sample_actions():
+    if not hasattr(torch, "compile"):
+        pytest.skip("torch.compile is not available")
+    if not torch._dynamo.is_dynamo_supported():
+        pytest.skip("torch._dynamo is not supported on this platform")
+
+    torch.manual_seed(0)
+    eager_model = _make_model(compile_model=False)
+    torch.manual_seed(0)
+    compiled_model = _make_model(compile_model=True)
+    forward_kwargs, sample_kwargs = _make_dummy_inputs(compiled_model.config)
+
+    try:
+        assert_compiled_output_matches_eager(eager_model, compiled_model, forward_kwargs, sample_kwargs)
+
+        assert_explain_has_no_graph_breaks(eager_model.forward, forward_kwargs, "pi0.forward")
+        assert_explain_has_no_graph_breaks(eager_model.sample_actions, sample_kwargs, "pi0.sample_actions")
+
+        assert_cache_stability(compiled_model.forward, forward_kwargs, "pi0.forward")
+        assert_cache_stability(compiled_model.sample_actions, sample_kwargs, "pi0.sample_actions")
+
+        benchmark_runtime(eager_model.forward, compiled_model.forward, forward_kwargs, "pi0.forward")
+        benchmark_runtime(
+            eager_model.sample_actions, compiled_model.sample_actions, sample_kwargs, "pi0.sample_actions"
+        )
+    finally:
+        reset_compile_state()
+        del eager_model
+        del compiled_model
+        torch.cuda.empty_cache()
diff --git a/tests/policies/pi0_pi05/test_pi0_original_vs_lerobot.py b/tests/policies/pi0_pi05/test_pi0_original_vs_lerobot.py
index 62e34b70d..9dad90d60 100644
--- a/tests/policies/pi0_pi05/test_pi0_original_vs_lerobot.py
+++ b/tests/policies/pi0_pi05/test_pi0_original_vs_lerobot.py
@@ -14,51 +14,56 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-"""Test script to verify PI0 policy integration with LeRobot vs the original implementation"""
+"""Compare LeRobot PI0 against the vendored OpenPI PyTorch reference."""
 
+import gc
 import os
-from copy import deepcopy
-from typing import Any
 
 import pytest
 import torch
 
-# Skip if openpi or transformers is not available
-pytest.importorskip("openpi")
 pytest.importorskip("transformers")
 
-# Skip this entire module in CI
-pytestmark = pytest.mark.skipif(
-    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
-    reason="This test requires local OpenPI installation and is not meant for CI",
+from lerobot.configs import PreTrainedConfig  # noqa: E402
+from lerobot.policies.pi0 import PI0Policy  # noqa: E402
+from lerobot.policies.pi0.processor_pi0 import make_pi0_pre_post_processors  # noqa: E402
+from lerobot.utils.constants import ACTION, OBS_STATE  # noqa: E402
+from tests.policies.pi0_pi05.openpi_pytorch.pi0_pytorch import PI0Pytorch  # noqa: E402
+from tests.policies.pi0_pi05.utils.openpi_parity import (  # noqa: E402
+    assert_processor_inputs_match_lerobot,
+    clone_batch,
+    deterministic_openpi_forward_preprocess,
+    fix_reference_state_dict,
+    fixed_flow_sampling,
+    load_openpi_reference_state_dict,
+    make_openpi_observation_from_raw,
+    openpi_model_actions_from_raw,
 )
 
-from openpi.models_pytorch import preprocessing_pytorch as openpi_preprocessing  # noqa: E402
+pytestmark = pytest.mark.skipif(
+    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
+    reason="OpenPI parity and torch.compile checks are too slow for CI; run manually on GPU nodes",
+)
 
-# NOTE: Assumes PYTHONPATH is set to include OpenPI src as per instructions.
-from openpi.models_pytorch.pi0_pytorch import PI0Pytorch  # noqa: E402
-from transformers import AutoTokenizer  # noqa: E402
-
-from lerobot.policies.pi0 import PI0Config, PI0Policy  # noqa: E402
-from lerobot.policies.pi0.processor_pi0 import make_pi0_pre_post_processors  # noqa: E402
-from lerobot.processor import PolicyProcessorPipeline  # noqa: E402
-from lerobot.types import PolicyAction  # noqa: E402
-
-# TODO: ADDING DEFAULT IMAGES_FEATURES TO CONFIG
 DUMMY_ACTION_DIM = 32
 DUMMY_STATE_DIM = 32
 DUMMY_ACTION_HORIZON = 50
-DUMMY_MAX_TOKEN_LEN = 48  # Default for PI0 (non-pi05)
-DEVICE = "cpu"  # Use CPU to avoid memory issues for testing
+DUMMY_MAX_TOKEN_LEN = 48
+DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+COMPILE_MODE = "default"
+FORWARD_RTOL = 1e-4
+FORWARD_ATOL = 1e-4
+SAMPLE_RTOL = 1e-2
+SAMPLE_ATOL = 5e-3
 
 DUMMY_DATASET_STATS = {
-    "observation.state": {
+    OBS_STATE: {
         "mean": torch.zeros(DUMMY_STATE_DIM),
         "std": torch.ones(DUMMY_STATE_DIM),
         "q01": torch.zeros(DUMMY_STATE_DIM),
         "q99": torch.ones(DUMMY_STATE_DIM),
     },
-    "action": {
+    ACTION: {
         "mean": torch.zeros(DUMMY_ACTION_DIM),
         "std": torch.ones(DUMMY_ACTION_DIM),
         "q01": torch.zeros(DUMMY_ACTION_DIM),
@@ -87,6 +92,15 @@ DUMMY_DATASET_STATS = {
 }
 
 
+@pytest.fixture(autouse=True)
+def cleanup_cuda_after_test():
+    yield
+    gc.collect()
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+        torch.cuda.ipc_collect()
+
+
 class PI0BaseOriginalConfig:
     action_dim: int = DUMMY_ACTION_DIM
     action_horizon: int = DUMMY_ACTION_HORIZON
@@ -95,333 +109,156 @@ class PI0BaseOriginalConfig:
     precision: str = "float32"
     pi05: bool = False
     dtype: str = "float32"
+    pytorch_compile_mode: str | None = None
 
 
-def instantiate_lerobot_pi0(
-    from_pretrained: bool = False,
-) -> tuple[
-    PI0Policy,
-    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
-    PolicyProcessorPipeline[PolicyAction, PolicyAction],
-]:
-    if from_pretrained:
-        # Load the policy first
-        policy = PI0Policy.from_pretrained(pretrained_name_or_path="lerobot/pi0_base", strict=True)
-    else:
-        config = PI0Config(max_action_dim=DUMMY_ACTION_DIM, max_state_dim=DUMMY_STATE_DIM, dtype="float32")
-        policy = PI0Policy(config)
+def instantiate_lerobot_pi0(*, compile_model: bool = False, gradient_checkpointing: bool = False):
+    config = PreTrainedConfig.from_pretrained("lerobot/pi0_base")
+    config.device = str(DEVICE)
+    config.dtype = "float32"
+    config.compile_model = compile_model
+    config.compile_mode = COMPILE_MODE
+    config.gradient_checkpointing = gradient_checkpointing
 
+    policy = PI0Policy.from_pretrained("lerobot/pi0_base", config=config, strict=True)
     policy.to(DEVICE)
-    policy.config.device = DEVICE
-    preprocessor, postprocessor = make_pi0_pre_post_processors(
-        config=policy.config, dataset_stats=DUMMY_DATASET_STATS
-    )
-    return (policy, preprocessor, postprocessor)
+    policy.config.device = str(DEVICE)
+    preprocessor, _ = make_pi0_pre_post_processors(config=policy.config, dataset_stats=DUMMY_DATASET_STATS)
+    return policy, preprocessor
 
 
-def instantiate_original_pi0(from_pretrained: bool = False, model_path: str = None):
-    config = PI0BaseOriginalConfig()
-    policy = PI0Pytorch(config)
-
-    if from_pretrained:
-        try:
-            print("Loading converted PyTorch weights from HuggingFace Hub (lerobot/pi0_base)...")
-
-            # Download the model from HuggingFace Hub
-            import safetensors.torch
-            from huggingface_hub import snapshot_download
-
-            # Download the entire repository
-            if model_path and os.path.exists(model_path):
-                cache_dir = model_path
-                print(f"Using cached model from: {cache_dir}")
-            else:
-                cache_dir = snapshot_download(repo_id="lerobot/pi0_base", repo_type="model")
-                print(f"Downloaded model to: {cache_dir}")
-
-            # Try to load safetensors format first
-            model_file = os.path.join(cache_dir, "model.safetensors")
-            if os.path.exists(model_file):
-                state_dict = safetensors.torch.load_file(model_file)
-                print(f"Loaded {len(state_dict)} parameters from safetensors")
-            else:
-                raise FileNotFoundError(f"No safetensors file found in {cache_dir}")
-
-            # Load the state dict into the model
-            missing_keys, unexpected_keys = policy.load_state_dict(state_dict, strict=False)
-
-            if missing_keys:
-                print(f"Missing keys: {len(missing_keys)}")
-                if len(missing_keys) <= 5:
-                    for key in missing_keys:
-                        print(f"    - {key}")
-                else:
-                    for key in missing_keys[:5]:
-                        print(f"    - {key}")
-                    print(f"    ... and {len(missing_keys) - 5} more")
-
-            if unexpected_keys:
-                print(f"Unexpected keys: {len(unexpected_keys)}")
-                if len(unexpected_keys) <= 5:
-                    for key in unexpected_keys:
-                        print(f"    - {key}")
-                else:
-                    for key in unexpected_keys[:5]:
-                        print(f"    - {key}")
-                    print(f"    ... and {len(unexpected_keys) - 5} more")
-
-            if not missing_keys and not unexpected_keys:
-                print("All pretrained weights loaded successfully!")
-            else:
-                print("Pretrained weights loaded with some missing/unexpected keys (this may be normal)")
-
-        except Exception as e:
-            print(f"Failed to load pretrained weights: {e}")
-            print("   Using randomly initialized weights...")
-            import traceback
-
-            traceback.print_exc()
-
-    policy.to(DEVICE)
+def instantiate_original_pi0():
+    policy = PI0Pytorch(PI0BaseOriginalConfig()).to(DEVICE)
+    state_dict = fix_reference_state_dict(load_openpi_reference_state_dict("lerobot/pi0_base"))
+    missing_keys, unexpected_keys = policy.load_state_dict(state_dict, strict=False)
+    assert missing_keys == []
+    assert unexpected_keys == []
     return policy
 
 
 def create_dummy_data():
-    batch_size = 2  # Reduce batch size for testing
-    device = DEVICE
-
-    # Use the exact same prompt for both implementations
+    batch_size = 2
     prompt = "Pick up the red block and place it in the bin"
-
-    batch = {
-        "observation.state": torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=device),
-        "action": torch.randn(
-            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=device
+    return {
+        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=DEVICE),
+        ACTION: torch.randn(
+            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=DEVICE
         ),
-        # Create images in [0, 1] range as expected by LeRobot (will be converted to [-1, 1] internally)
         "observation.images.base_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=device
+            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
         ),
         "observation.images.left_wrist_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=device
+            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
         ),
         "observation.images.right_wrist_0_rgb": torch.rand(
-            batch_size, 3, 224, 224, dtype=torch.float32, device=device
+            batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
         ),
-        # Add the task prompt for LeRobot - provide as list with single element to trigger expansion
         "task": [prompt for _ in range(batch_size)],
     }
-    return batch
 
 
-def extract_lerobot_processed_inputs(lerobot_pi0, batch):
-    """Extract the exact same processed inputs that LeRobot uses internally."""
-    # Get the tokenized language from LeRobot's internal method
-    lang_tokens, lang_masks = lerobot_pi0._tokenize_language(batch)
-
-    # Get the preprocessed images from LeRobot's internal method
-    images, img_masks = lerobot_pi0._preprocess_images(batch, train=False)
-
-    # Create dummy token_ar_mask and token_loss_mask for original implementation
-    token_ar_mask = torch.zeros_like(lang_tokens, dtype=torch.int32)
-    token_loss_mask = torch.ones_like(lang_masks, dtype=torch.bool)
-
-    return images, img_masks, lang_tokens, lang_masks, token_ar_mask, token_loss_mask
+def prepare_parity_inputs(lerobot_pi0, lerobot_preprocessor):
+    torch.manual_seed(0)
+    raw_batch = create_dummy_data()
+    lerobot_batch = lerobot_preprocessor(clone_batch(raw_batch))
+    openpi_observation = make_openpi_observation_from_raw(
+        raw_batch,
+        action_dim=DUMMY_ACTION_DIM,
+        max_token_len=DUMMY_MAX_TOKEN_LEN,
+        dataset_stats=DUMMY_DATASET_STATS,
+        pi05=False,
+    )
+    openpi_actions = openpi_model_actions_from_raw(
+        raw_batch,
+        action_dim=DUMMY_ACTION_DIM,
+        dataset_stats=DUMMY_DATASET_STATS,
+        pi05=False,
+    )
+    assert_processor_inputs_match_lerobot(
+        lerobot_pi0,
+        lerobot_batch,
+        openpi_observation,
+        compare_state=True,
+    )
+    batch_size = raw_batch[OBS_STATE].shape[0]
+    noise = torch.randn(
+        batch_size,
+        DUMMY_ACTION_HORIZON,
+        DUMMY_ACTION_DIM,
+        dtype=torch.float32,
+        device=DEVICE,
+    )
+    time = torch.linspace(0.2, 0.8, batch_size, dtype=torch.float32, device=DEVICE)
+    return lerobot_batch, openpi_observation, openpi_actions, noise, time
 
 
-class PI0Observation:
-    """Observation class that matches the original OpenPI format."""
+def assert_forward_matches(*, compile_model: bool = False, gradient_checkpointing: bool = False):
+    lerobot_pi0, lerobot_preprocessor = instantiate_lerobot_pi0(
+        compile_model=compile_model,
+        gradient_checkpointing=gradient_checkpointing,
+    )
+    original_pi0 = instantiate_original_pi0()
+    lerobot_batch, openpi_observation, openpi_actions, noise, time = prepare_parity_inputs(
+        lerobot_pi0,
+        lerobot_preprocessor,
+    )
 
-    def __init__(
-        self,
-        state,
-        images,
-        image_masks,
-        tokenized_prompt,
-        tokenized_prompt_mask,
-        token_ar_mask,
-        token_loss_mask,
-    ):
-        self.state = state
-        self.images = images
-        self.image_masks = image_masks
-        self.tokenized_prompt = tokenized_prompt
-        self.tokenized_prompt_mask = tokenized_prompt_mask
-        self.token_ar_mask = token_ar_mask
-        self.token_loss_mask = token_loss_mask
-
-
-def create_original_observation_with_openpi_preprocessing(batch):
-    """Create observation object for OpenPI using OpenPI's own preprocessing."""
-    batch_size = batch["observation.state"].shape[0]
-    device = batch["observation.state"].device
-
-    # Create tokenizer for OpenPI (same as LeRobot uses)
-    tokenizer = AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")
-
-    # Get task description
-    if "task" in batch:
-        tasks = batch["task"]
-        if isinstance(tasks, str):
-            # Single string: add newline if not present, then convert to list
-            if not tasks.endswith("\n"):
-                tasks = f"{tasks}\n"
-            tasks = [tasks]
-        elif isinstance(tasks, list) and all(isinstance(t, str) for t in tasks):
-            # List of strings: add newline to each if not present
-            tasks = [t if t.endswith("\n") else f"{t}\n" for t in tasks]
-            if len(tasks) == 1:
-                # Expand to batch size
-                tasks = tasks * batch_size
-                if len(tasks) != batch_size:
-                    raise ValueError(f"Expected batch size {batch_size}, got {len(tasks)}")
-        # If task is neither string nor list of strings, leave unchanged
+    if gradient_checkpointing:
+        lerobot_pi0.train()
     else:
-        # Default task if not provided
-        tasks = ["Pick up the object\n"] * batch_size
-
-    # Tokenize with max_length padding to match OpenPI's expected format
-    tokenized = tokenizer(
-        tasks,
-        padding="max_length",
-        padding_side="right",
-        truncation=True,
-        max_length=DUMMY_MAX_TOKEN_LEN,
-        return_tensors="pt",
-    )
-
-    lang_tokens = tokenized["input_ids"].to(device)
-    lang_masks = tokenized["attention_mask"].to(device, dtype=torch.bool)
-
-    # Create dummy token_ar_mask and token_loss_mask for OpenPI
-    token_ar_mask = torch.zeros_like(lang_tokens, dtype=torch.int32)
-    token_loss_mask = torch.ones_like(lang_masks, dtype=torch.bool)
-
-    # Convert LeRobot images format to OpenPI format (convert [0,1] to [-1,1] range)
-    image_dict = {
-        "base_0_rgb": batch["observation.images.base_0_rgb"] * 2.0 - 1.0,
-        "left_wrist_0_rgb": batch["observation.images.left_wrist_0_rgb"] * 2.0 - 1.0,
-        "right_wrist_0_rgb": batch["observation.images.right_wrist_0_rgb"] * 2.0 - 1.0,
-    }
-
-    # Create image masks (all ones for real images)
-    image_masks_dict = {}
-    for key in image_dict:
-        image_masks_dict[key] = torch.ones(batch_size, dtype=torch.bool, device=device)
-
-    # Create raw observation object (before preprocessing)
-    raw_observation = PI0Observation(
-        state=batch["observation.state"],
-        images=image_dict,
-        image_masks=image_masks_dict,
-        tokenized_prompt=lang_tokens,
-        tokenized_prompt_mask=lang_masks,
-        token_ar_mask=token_ar_mask,
-        token_loss_mask=token_loss_mask,
-    )
-
-    # Now use OpenPI's preprocessing
-    processed_obs = openpi_preprocessing.preprocess_observation_pytorch(raw_observation, train=False)
-
-    return processed_obs
-
-
-def create_original_observation_from_lerobot(lerobot_pi0, batch):
-    """Create observation object compatible with original OpenPI using the exact same inputs as LeRobot."""
-    _batch_size = batch["observation.state"].shape[0]
-    _device = batch["observation.state"].device
-
-    # Extract the exact same processed inputs that LeRobot uses
-    images, img_masks, lang_tokens, lang_masks, token_ar_mask, token_loss_mask = (
-        extract_lerobot_processed_inputs(lerobot_pi0, batch)
-    )
-
-    # Convert images list to dict with original OpenPI keys
-    image_dict = {
-        "base_0_rgb": images[0],
-        "left_wrist_0_rgb": images[1],
-        "right_wrist_0_rgb": images[2],
-    }
-
-    # Convert image masks list to dict with original OpenPI keys
-    image_masks_dict = {
-        "base_0_rgb": img_masks[0],
-        "left_wrist_0_rgb": img_masks[1],
-        "right_wrist_0_rgb": img_masks[2],
-    }
-
-    return PI0Observation(
-        state=batch["observation.state"],
-        images=image_dict,
-        image_masks=image_masks_dict,
-        tokenized_prompt=lang_tokens,
-        tokenized_prompt_mask=lang_masks,
-        token_ar_mask=token_ar_mask,
-        token_loss_mask=token_loss_mask,
-    )
-
-
-def test_pi0_original_vs_lerobot():
-    """Test PI0 original implementation vs LeRobot implementation."""
-    print("Initializing models...")
-    lerobot_pi0, lerobot_preprocessor, lerobot_postprocessor = instantiate_lerobot_pi0(
-        from_pretrained=True
-    )  # Load pretrained LeRobot model
-    original_pi0 = instantiate_original_pi0(
-        from_pretrained=True
-    )  # Load pretrained OpenPI model from HuggingFace Hub
-
-    print("Creating dummy data...")
-    batch = create_dummy_data()
-    batch_lerobot = deepcopy(batch)
-
-    # Test each model with its own preprocessing (more realistic end-to-end test)
-    print("\nTest each model with its own preprocessing")
-    print("Creating observation for OpenPI using OpenPI's own preprocessing...")
-    pi0_obs_openpi = create_original_observation_with_openpi_preprocessing(batch)
-
-    print(f"Task prompt: '{batch['task'][0]}'")
-    print(f"OpenPI tokenized prompt shape: {pi0_obs_openpi.tokenized_prompt.shape}")
-    print(f"OpenPI image shapes: {[img.shape for img in pi0_obs_openpi.images.values()]}")
-    print(f"OpenPI state shape: {pi0_obs_openpi.state.shape}")
-
-    print("Testing OpenPI with own preprocessing...")
+        lerobot_pi0.eval()
     original_pi0.eval()
-    torch.manual_seed(42)  # Set seed for reproducibility
-    batch_size = batch["observation.state"].shape[0]
-    noise_shape = (batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM)
-    fixed_noise = torch.randn(noise_shape, dtype=torch.float32, device=DEVICE)
 
-    with torch.no_grad():
-        openpi_actions = original_pi0.sample_actions(
-            device=DEVICE, observation=pi0_obs_openpi, noise=fixed_noise, num_steps=10
-        )
-        openpi_actions_unit = openpi_actions[:, 0, :]
-    print(f"OpenPI (own preprocessing) Actions shape: {openpi_actions.shape}")
-    print(f"OpenPI (own preprocessing) Actions unit shape: {openpi_actions_unit.shape}")
-    print(f"OpenPI (own preprocessing) Actions mean: {openpi_actions.mean().item():.6f}")
-    print(f"OpenPI (own preprocessing) Actions std: {openpi_actions.std().item():.6f}")
+    with fixed_flow_sampling(lerobot_pi0.model, noise=noise, time=time):
+        lerobot_loss, _ = lerobot_pi0(lerobot_batch, reduction="none")
+    with deterministic_openpi_forward_preprocess(original_pi0):
+        openpi_losses = original_pi0(openpi_observation, openpi_actions, noise=noise, time=time)
+    openpi_loss = openpi_losses.mean(dim=(1, 2))
+
+    torch.testing.assert_close(lerobot_loss, openpi_loss, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
+
+
+def assert_sample_actions_match_openpi(*, compile_model: bool = False):
+    lerobot_pi0, lerobot_preprocessor = instantiate_lerobot_pi0(compile_model=compile_model)
+    original_pi0 = instantiate_original_pi0()
+    lerobot_batch, openpi_observation, _openpi_actions, noise, _time = prepare_parity_inputs(
+        lerobot_pi0,
+        lerobot_preprocessor,
+    )
 
-    print("Testing LeRobot with own preprocessing...")
     lerobot_pi0.eval()
-    torch.manual_seed(42)  # Set the same seed
-
-    batch_lerobot_processed = lerobot_preprocessor(batch_lerobot)
+    original_pi0.eval()
     with torch.no_grad():
-        lerobot_actions_own = lerobot_pi0.predict_action_chunk(
-            batch_lerobot_processed
-        )  # batch_size, n_action_steps, action_dim
-        lerobot_actions_unit = lerobot_actions_own[:, 0, :]
-    print(f"LeRobot (own preprocessing) Actions shape: {lerobot_actions_own.shape}")
-    print(f"LeRobot (own preprocessing) Actions unit shape: {lerobot_actions_unit.shape}")
-    print(f"LeRobot (own preprocessing) Actions mean: {lerobot_actions_own.mean().item():.6f}")
-    print(f"LeRobot (own preprocessing) Actions std: {lerobot_actions_own.std().item():.6f}")
+        lerobot_actions = lerobot_pi0.predict_action_chunk(lerobot_batch, noise=noise, num_steps=10)
+        openpi_actions = original_pi0.sample_actions(
+            device=DEVICE,
+            observation=openpi_observation,
+            noise=noise,
+            num_steps=10,
+        )
 
-    print("\nComparing end-to-end implementations:")
-    print(f"Actions close (atol=1e-4): {torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-4)}")
-    print(f"Actions close (atol=1e-2): {torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-2)}")
-    print(f"Max absolute difference: {torch.abs(lerobot_actions_own - openpi_actions).max().item():.6f}")
+    torch.testing.assert_close(lerobot_actions, openpi_actions, rtol=SAMPLE_RTOL, atol=SAMPLE_ATOL)
 
-    assert torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-4)
-    assert torch.allclose(lerobot_actions_own, openpi_actions, atol=1e-2)
-    assert torch.abs(lerobot_actions_own - openpi_actions).max().item() < 1e-4
+
+def test_pi0_forward_matches_openpi():
+    assert_forward_matches()
+
+
+def test_pi0_sample_actions_match_openpi():
+    assert_sample_actions_match_openpi()
+
+
+def test_pi0_gradient_checkpointing_forward_matches_openpi():
+    assert_forward_matches(gradient_checkpointing=True)
+
+
+def test_pi0_compile_forward_matches_openpi():
+    assert_forward_matches(compile_model=True)
+
+
+def test_pi0_compile_sample_actions_match_openpi():
+    assert_sample_actions_match_openpi(compile_model=True)
+
+
+def test_pi0_compile_gradient_checkpointing_forward_matches_openpi():
+    assert_forward_matches(compile_model=True, gradient_checkpointing=True)
diff --git a/tests/policies/pi0_pi05/utils/__init__.py b/tests/policies/pi0_pi05/utils/__init__.py
new file mode 100644
index 000000000..9a7a15a09
--- /dev/null
+++ b/tests/policies/pi0_pi05/utils/__init__.py
@@ -0,0 +1 @@
+"""Utilities shared by PI0/PI05 policy tests."""
diff --git a/tests/policies/pi0_pi05/utils/openpi_parity.py b/tests/policies/pi0_pi05/utils/openpi_parity.py
new file mode 100644
index 000000000..f66e4e473
--- /dev/null
+++ b/tests/policies/pi0_pi05/utils/openpi_parity.py
@@ -0,0 +1,291 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from collections.abc import Iterator
+from contextlib import contextmanager
+from dataclasses import dataclass
+from functools import lru_cache
+from pathlib import Path
+
+import numpy as np
+import safetensors.torch
+import torch
+import torch.nn.functional as F  # noqa: N812
+from huggingface_hub import snapshot_download
+from transformers import AutoTokenizer
+
+from lerobot.utils.constants import (
+    ACTION,
+    OBS_LANGUAGE_ATTENTION_MASK,
+    OBS_LANGUAGE_TOKENS,
+    OBS_STATE,
+)
+from tests.policies.pi0_pi05.openpi_pytorch import preprocessing_pytorch as openpi_preprocessing
+
+IMAGE_KEYS = ("base_0_rgb", "left_wrist_0_rgb", "right_wrist_0_rgb")
+TOKENIZER_NAME = "google/paligemma-3b-pt-224"
+
+
+@dataclass
+class OpenPIObservation:
+    state: torch.Tensor
+    images: dict[str, torch.Tensor]
+    image_masks: dict[str, torch.Tensor]
+    tokenized_prompt: torch.Tensor
+    tokenized_prompt_mask: torch.Tensor
+    token_ar_mask: torch.Tensor
+    token_loss_mask: torch.Tensor
+
+
+@lru_cache(maxsize=1)
+def paligemma_tokenizer():
+    return AutoTokenizer.from_pretrained(TOKENIZER_NAME)
+
+
+def clone_batch(batch: dict) -> dict:
+    return {
+        key: value.clone() if isinstance(value, torch.Tensor) else list(value) for key, value in batch.items()
+    }
+
+
+def pad_last_dim(tensor: torch.Tensor, target_dim: int) -> torch.Tensor:
+    if tensor.shape[-1] > target_dim:
+        raise ValueError(f"Cannot pad last dimension {tensor.shape[-1]} to smaller size {target_dim}")
+    return F.pad(tensor, (0, target_dim - tensor.shape[-1]))
+
+
+def mean_std_normalize(tensor: torch.Tensor, stats: dict[str, torch.Tensor]) -> torch.Tensor:
+    mean = stats["mean"].to(device=tensor.device, dtype=tensor.dtype)
+    std = stats["std"].to(device=tensor.device, dtype=tensor.dtype)
+    return (tensor - mean) / (std + 1e-8)
+
+
+def quantile_normalize(tensor: torch.Tensor, stats: dict[str, torch.Tensor]) -> torch.Tensor:
+    q01 = stats["q01"].to(device=tensor.device, dtype=tensor.dtype)
+    q99 = stats["q99"].to(device=tensor.device, dtype=tensor.dtype)
+    denom = torch.where(q99 == q01, torch.full_like(q99, 1e-8), q99 - q01)
+    return 2.0 * (tensor - q01) / denom - 1.0
+
+
+def openpi_model_state_from_raw(
+    batch: dict[str, torch.Tensor],
+    *,
+    action_dim: int,
+    dataset_stats: dict[str, dict[str, torch.Tensor]],
+    pi05: bool,
+) -> torch.Tensor:
+    state = batch[OBS_STATE].to(dtype=torch.float32)
+    if pi05:
+        state = quantile_normalize(state, dataset_stats[OBS_STATE])
+    else:
+        state = mean_std_normalize(state, dataset_stats[OBS_STATE])
+    return pad_last_dim(state, action_dim)
+
+
+def openpi_model_actions_from_raw(
+    batch: dict[str, torch.Tensor],
+    *,
+    action_dim: int,
+    dataset_stats: dict[str, dict[str, torch.Tensor]],
+    pi05: bool,
+) -> torch.Tensor:
+    actions = batch[ACTION].to(dtype=torch.float32)
+    if pi05:
+        actions = quantile_normalize(actions, dataset_stats[ACTION])
+    else:
+        actions = mean_std_normalize(actions, dataset_stats[ACTION])
+    return pad_last_dim(actions, action_dim)
+
+
+def _tasks_from_raw(batch: dict, batch_size: int) -> list[str]:
+    tasks = batch.get("task")
+    if tasks is None:
+        raise ValueError("The parity batch must include a task prompt.")
+    if isinstance(tasks, str):
+        return [tasks] * batch_size
+    if len(tasks) == 1:
+        return [tasks[0]] * batch_size
+    if len(tasks) != batch_size:
+        raise ValueError(f"Expected {batch_size} task prompts, got {len(tasks)}")
+    return list(tasks)
+
+
+def _format_pi0_prompts(tasks: list[str]) -> list[str]:
+    return [f"{task.strip().replace('_', ' ').replace(chr(10), ' ')}\n" for task in tasks]
+
+
+def _format_pi05_prompts(tasks: list[str], normalized_state: torch.Tensor) -> list[str]:
+    state_np = normalized_state.detach().cpu().numpy()
+    discretized_states = np.digitize(state_np, bins=np.linspace(-1, 1, 256 + 1)[:-1]) - 1
+    prompts = []
+    for task, state in zip(tasks, discretized_states, strict=True):
+        cleaned_text = task.strip().replace("_", " ").replace("\n", " ")
+        state_str = " ".join(map(str, state))
+        prompts.append(f"Task: {cleaned_text}, State: {state_str};\nAction: ")
+    return prompts
+
+
+def _tokenize_prompts(prompts: list[str], *, max_token_len: int, device: torch.device | str):
+    tokenized = paligemma_tokenizer()(
+        prompts,
+        padding="max_length",
+        padding_side="right",
+        truncation=True,
+        max_length=max_token_len,
+        return_tensors="pt",
+    )
+    tokens = tokenized["input_ids"].to(device)
+    masks = tokenized["attention_mask"].to(device=device, dtype=torch.bool)
+    return tokens, masks
+
+
+def make_openpi_observation_from_raw(
+    batch: dict[str, torch.Tensor],
+    *,
+    action_dim: int,
+    max_token_len: int,
+    dataset_stats: dict[str, dict[str, torch.Tensor]],
+    pi05: bool,
+) -> OpenPIObservation:
+    batch_size = batch[OBS_STATE].shape[0]
+    device = batch[OBS_STATE].device
+    state = openpi_model_state_from_raw(
+        batch,
+        action_dim=action_dim,
+        dataset_stats=dataset_stats,
+        pi05=pi05,
+    )
+
+    tasks = _tasks_from_raw(batch, batch_size)
+    prompts = _format_pi05_prompts(tasks, state) if pi05 else _format_pi0_prompts(tasks)
+    tokens, masks = _tokenize_prompts(prompts, max_token_len=max_token_len, device=device)
+
+    images = {
+        key: batch[f"observation.images.{key}"].to(device=device, dtype=torch.float32) * 2.0 - 1.0
+        for key in IMAGE_KEYS
+    }
+    image_masks = {key: torch.ones(batch_size, dtype=torch.bool, device=device) for key in IMAGE_KEYS}
+
+    return OpenPIObservation(
+        state=state,
+        images=images,
+        image_masks=image_masks,
+        tokenized_prompt=tokens,
+        tokenized_prompt_mask=masks,
+        token_ar_mask=torch.zeros_like(tokens, dtype=torch.int32),
+        token_loss_mask=torch.ones_like(masks, dtype=torch.bool),
+    )
+
+
+def assert_processor_inputs_match_lerobot(
+    lerobot_policy,
+    lerobot_batch: dict[str, torch.Tensor],
+    openpi_observation: OpenPIObservation,
+    *,
+    compare_state: bool,
+):
+    openpi_processed = openpi_preprocessing.preprocess_observation_pytorch(openpi_observation, train=False)
+    lerobot_images, lerobot_image_masks = lerobot_policy._preprocess_images(lerobot_batch)
+
+    # Token IDs, token masks, images, image masks, and PI0 state are intentionally built from the same
+    # raw batch through independent LeRobot/OpenPI-style processor logic. They must be bitwise equal.
+    torch.testing.assert_close(
+        openpi_observation.tokenized_prompt, lerobot_batch[OBS_LANGUAGE_TOKENS], rtol=0, atol=0
+    )
+    torch.testing.assert_close(
+        openpi_observation.tokenized_prompt_mask,
+        lerobot_batch[OBS_LANGUAGE_ATTENTION_MASK],
+        rtol=0,
+        atol=0,
+    )
+
+    for openpi_image, lerobot_image in zip(openpi_processed.images.values(), lerobot_images, strict=True):
+        torch.testing.assert_close(openpi_image, lerobot_image, rtol=0, atol=0)
+
+    for openpi_mask, lerobot_mask in zip(
+        openpi_processed.image_masks.values(), lerobot_image_masks, strict=True
+    ):
+        torch.testing.assert_close(openpi_mask, lerobot_mask, rtol=0, atol=0)
+
+    if compare_state:
+        torch.testing.assert_close(
+            openpi_processed.state, lerobot_policy.prepare_state(lerobot_batch), rtol=0, atol=0
+        )
+
+
+def load_openpi_reference_state_dict(repo_id: str) -> dict[str, torch.Tensor]:
+    cache_dir = Path(snapshot_download(repo_id=repo_id, repo_type="model"))
+    return safetensors.torch.load_file(cache_dir / "model.safetensors")
+
+
+def fix_reference_state_dict(state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
+    fixed_state_dict = dict(state_dict)
+    lm_head_key = "paligemma_with_expert.paligemma.lm_head.weight"
+    embed_tokens_key = "paligemma_with_expert.paligemma.model.language_model.embed_tokens.weight"
+    if lm_head_key in fixed_state_dict and embed_tokens_key not in fixed_state_dict:
+        fixed_state_dict[embed_tokens_key] = fixed_state_dict[lm_head_key].clone()
+    return fixed_state_dict
+
+
+@contextmanager
+def fixed_flow_sampling(model, *, noise: torch.Tensor, time: torch.Tensor) -> Iterator[None]:
+    original_sample_noise = model.sample_noise
+    original_sample_time = model.sample_time
+
+    def sample_noise(shape, device):
+        if tuple(shape) != tuple(noise.shape):
+            raise ValueError(f"Expected noise shape {tuple(noise.shape)}, got {tuple(shape)}")
+        return noise.to(device=device)
+
+    def sample_time(batch_size, device):
+        if batch_size != time.shape[0]:
+            raise ValueError(f"Expected time batch size {time.shape[0]}, got {batch_size}")
+        return time.to(device=device)
+
+    model.sample_noise = sample_noise
+    model.sample_time = sample_time
+    try:
+        yield
+    finally:
+        model.sample_noise = original_sample_noise
+        model.sample_time = original_sample_time
+
+
+@contextmanager
+def deterministic_openpi_forward_preprocess(openpi_policy) -> Iterator[None]:
+    """Disable OpenPI's training-time image augmentation only inside a parity forward block.
+
+    OpenPI's `forward()` calls `_preprocess_observation(..., train=True)`, which can apply stochastic
+    image augmentation. LeRobot's policy forward path does not apply that augmentation, so parity would
+    otherwise compare two different image tensors rather than two model implementations. The context manager
+    keeps the public `openpi_policy.forward(observation, ...)` call while making preprocessing deterministic.
+
+    `yield` marks the body of the caller's `with` block. The `try/finally` restores the original method even
+    if the assertion inside the block fails, so the temporary monkeypatch cannot leak into later tests.
+    """
+
+    original_preprocess_observation = openpi_policy._preprocess_observation
+
+    def preprocess_observation(observation, *, train=True):
+        return original_preprocess_observation(observation, train=False)
+
+    openpi_policy._preprocess_observation = preprocess_observation
+    try:
+        yield
+    finally:
+        openpi_policy._preprocess_observation = original_preprocess_observation
diff --git a/tests/policies/pi0_pi05/utils/torch_compile.py b/tests/policies/pi0_pi05/utils/torch_compile.py
new file mode 100644
index 000000000..2e71d15bb
--- /dev/null
+++ b/tests/policies/pi0_pi05/utils/torch_compile.py
@@ -0,0 +1,207 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import time
+from collections.abc import Callable
+
+import torch
+from torch._dynamo.utils import counters, guard_failures
+from torch.profiler import ProfilerActivity
+
+FORWARD_RTOL = 1e-5
+FORWARD_ATOL = 5e-2
+SAMPLE_RTOL = 1e-5
+SAMPLE_ATOL = 1e-2
+COMPILE_MODE = "max-autotune"
+STEADY_STATE_WARMUPS = 3
+STEADY_STATE_REPEATS = 3
+
+
+def make_compile_config(config_cls, *, compile_model):
+    return config_cls(device="cuda", compile_model=compile_model, compile_mode=COMPILE_MODE)
+
+
+def counter_total(name):
+    return sum(counters.get(name, {}).values())
+
+
+def compile_snapshot():
+    return {
+        "graph_breaks": counter_total("graph_break"),
+        "recompiles": counter_total("recompiles"),
+        "recompile_limits": counter_total("recompile_limit"),
+        "unique_graphs": counters["stats"].get("unique_graphs", 0),
+    }
+
+
+def reset_compile_state():
+    torch._dynamo.reset()
+    counters.clear()
+    guard_failures.clear()
+
+
+def clone_cuda_graph_output(output):
+    if torch.is_tensor(output):
+        return output.clone()
+    if isinstance(output, tuple):
+        return tuple(clone_cuda_graph_output(item) for item in output)
+    if isinstance(output, list):
+        return [clone_cuda_graph_output(item) for item in output]
+    if isinstance(output, dict):
+        return {key: clone_cuda_graph_output(value) for key, value in output.items()}
+    return output
+
+
+def run_model_step(fn: Callable, kwargs: dict):
+    if hasattr(torch.compiler, "cudagraph_mark_step_begin"):
+        torch.compiler.cudagraph_mark_step_begin()
+    return fn(**kwargs)
+
+
+def assert_explain_has_no_graph_breaks(fn: Callable, kwargs: dict, label: str):
+    reset_compile_state()
+    explanation = torch._dynamo.explain(fn)(**kwargs)
+
+    assert explanation.graph_count > 0, f"{label} was not captured by Dynamo"
+    assert explanation.graph_break_count == 0, (
+        f"{label} has {explanation.graph_break_count} graph break(s): {explanation.break_reasons}"
+    )
+    assert not explanation.break_reasons, f"{label} graph break reasons: {explanation.break_reasons}"
+
+    print(
+        f"{label} capture: graphs={explanation.graph_count}, "
+        f"graph_breaks={explanation.graph_break_count}, ops={explanation.op_count}, "
+        f"guards={len(explanation.out_guards or [])}"
+    )
+    return explanation
+
+
+@torch.no_grad()
+def assert_compiled_output_matches_eager(eager_model, compiled_model, forward_kwargs, sample_kwargs):
+    eager_forward = eager_model.forward(**forward_kwargs)
+    compiled_forward = compiled_model.forward(**forward_kwargs)
+    torch.testing.assert_close(compiled_forward, eager_forward, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
+
+    eager_actions = eager_model.sample_actions(**sample_kwargs)
+    compiled_actions = compiled_model.sample_actions(**sample_kwargs)
+    torch.testing.assert_close(compiled_actions, eager_actions, rtol=SAMPLE_RTOL, atol=SAMPLE_ATOL)
+
+
+@torch.no_grad()
+def assert_cache_stability(fn: Callable, kwargs: dict, label: str):
+    reset_compile_state()
+
+    first_output = clone_cuda_graph_output(run_model_step(fn, kwargs))
+    first_snapshot = compile_snapshot()
+    second_output = clone_cuda_graph_output(run_model_step(fn, kwargs))
+    second_snapshot = compile_snapshot()
+    third_output = clone_cuda_graph_output(run_model_step(fn, kwargs))
+    third_snapshot = compile_snapshot()
+
+    torch.testing.assert_close(second_output, first_output, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
+    torch.testing.assert_close(third_output, first_output, rtol=FORWARD_RTOL, atol=FORWARD_ATOL)
+    assert first_snapshot["unique_graphs"] > 0, f"{label} did not compile any graph"
+    assert third_snapshot["graph_breaks"] == 0, f"{label} graph breaks: {third_snapshot}"
+    assert third_snapshot["recompiles"] == 0, f"{label} recompiled: {third_snapshot}"
+    assert third_snapshot["recompile_limits"] == 0, f"{label} hit recompile limit: {third_snapshot}"
+    assert second_snapshot["unique_graphs"] == first_snapshot["unique_graphs"], (
+        f"{label} compiled new graph on second call: first={first_snapshot}, second={second_snapshot}"
+    )
+    assert third_snapshot["unique_graphs"] == first_snapshot["unique_graphs"], (
+        f"{label} compiled new graph on third call: first={first_snapshot}, third={third_snapshot}"
+    )
+    assert not guard_failures, f"{label} guard failures: {dict(guard_failures)}"
+
+    print(f"{label} cache: first={first_snapshot}, third={third_snapshot}")
+
+
+@torch.no_grad()
+def benchmark_runtime(eager_fn: Callable, compiled_fn: Callable, kwargs: dict, label: str):
+    run_warmups(eager_fn, kwargs)
+    run_warmups(compiled_fn, kwargs)
+    torch.cuda.synchronize()
+
+    eager_metrics = profile_callable(eager_fn, kwargs)
+    compiled_metrics = profile_callable(compiled_fn, kwargs)
+    speedup = eager_metrics["cuda_event_ms"] / compiled_metrics["cuda_event_ms"]
+
+    print(
+        f"{label} runtime: eager_cuda={eager_metrics['cuda_event_ms']:.3f} ms, "
+        f"compiled_cuda={compiled_metrics['cuda_event_ms']:.3f} ms, speedup={speedup:.3f}x, "
+        f"host_wall_ms eager/compiled={eager_metrics['host_wall_ms']:.3f}/"
+        f"{compiled_metrics['host_wall_ms']:.3f}, "
+        f"cpu_self_time_ms eager/compiled={eager_metrics['cpu_self_time_ms']:.3f}/"
+        f"{compiled_metrics['cpu_self_time_ms']:.3f}, "
+        f"cuda_launches eager/compiled={eager_metrics['cuda_launch_count']}/"
+        f"{compiled_metrics['cuda_launch_count']}, "
+        f"profiler_events eager/compiled={eager_metrics['profiler_event_count']}/"
+        f"{compiled_metrics['profiler_event_count']}, "
+        f"peak_mem_mib eager/compiled={eager_metrics['peak_mem_mib']:.1f}/"
+        f"{compiled_metrics['peak_mem_mib']:.1f}"
+    )
+
+    assert eager_metrics["cuda_event_ms"] > 0
+    assert compiled_metrics["cuda_event_ms"] > 0
+    assert eager_metrics["profiler_event_count"] > 0
+    assert compiled_metrics["profiler_event_count"] > 0
+    return eager_metrics, compiled_metrics
+
+
+def run_warmups(fn: Callable, kwargs: dict):
+    for _ in range(STEADY_STATE_WARMUPS):
+        run_model_step(fn, kwargs)
+    torch.cuda.synchronize()
+
+
+def profile_callable(fn: Callable, kwargs: dict):
+    torch.cuda.synchronize()
+    torch.cuda.reset_peak_memory_stats()
+
+    start_event = torch.cuda.Event(enable_timing=True)
+    end_event = torch.cuda.Event(enable_timing=True)
+    host_start = time.perf_counter()
+    start_event.record()
+    for _ in range(STEADY_STATE_REPEATS):
+        run_model_step(fn, kwargs)
+    end_event.record()
+    torch.cuda.synchronize()
+    cuda_event_ms = start_event.elapsed_time(end_event) / STEADY_STATE_REPEATS
+    host_wall_ms = (time.perf_counter() - host_start) * 1000 / STEADY_STATE_REPEATS
+    peak_mem_mib = torch.cuda.max_memory_allocated() / 1024**2
+
+    with torch.profiler.profile(
+        activities=[ProfilerActivity.CPU],
+    ) as profiler:
+        run_model_step(fn, kwargs)
+        torch.cuda.synchronize()
+
+    key_averages = profiler.key_averages()
+    cpu_self_time_ms = sum(event.self_cpu_time_total for event in key_averages) / 1000
+    cuda_launch_count = sum(
+        event.count
+        for event in key_averages
+        if event.key in {"cudaLaunchKernel", "cudaGraphLaunch", "cudaLaunchKernelExC"}
+    )
+    profiler_event_count = sum(event.count for event in key_averages)
+
+    return {
+        "cuda_event_ms": cuda_event_ms,
+        "host_wall_ms": host_wall_ms,
+        "cpu_self_time_ms": cpu_self_time_ms,
+        "cuda_launch_count": cuda_launch_count,
+        "profiler_event_count": profiler_event_count,
+        "peak_mem_mib": peak_mem_mib,
+    }
diff --git a/tests/processor/test_pi05_processor.py b/tests/processor/test_pi05_processor.py
new file mode 100644
index 000000000..b3dd85f45
--- /dev/null
+++ b/tests/processor/test_pi05_processor.py
@@ -0,0 +1,155 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Compare the PI0.5 processor pipeline against the vendored OpenPI reference processors."""
+
+import os
+
+import pytest
+import torch
+
+pytest.importorskip("transformers")
+
+from lerobot.configs import FeatureType, PolicyFeature  # noqa: E402
+from lerobot.policies.pi05 import PI05Policy  # noqa: E402
+from lerobot.policies.pi05.configuration_pi05 import PI05Config  # noqa: E402
+from lerobot.policies.pi05.processor_pi05 import make_pi05_pre_post_processors  # noqa: E402
+from lerobot.utils.constants import ACTION, OBS_STATE  # noqa: E402
+from tests.policies.pi0_pi05.utils.openpi_parity import (  # noqa: E402
+    IMAGE_KEYS,
+    assert_processor_inputs_match_lerobot,
+    clone_batch,
+    make_openpi_observation_from_raw,
+    openpi_model_actions_from_raw,
+)
+
+pytestmark = pytest.mark.skipif(
+    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
+    reason="OpenPI processor parity uses the PaliGemma tokenizer; run manually outside CI.",
+)
+
+DUMMY_ACTION_DIM = 32
+DUMMY_STATE_DIM = 32
+DUMMY_ACTION_HORIZON = 50
+DUMMY_MAX_TOKEN_LEN = 200
+DEVICE = torch.device("cpu")
+
+DUMMY_DATASET_STATS = {
+    OBS_STATE: {
+        "mean": torch.zeros(DUMMY_STATE_DIM),
+        "std": torch.ones(DUMMY_STATE_DIM),
+        "q01": torch.zeros(DUMMY_STATE_DIM),
+        "q99": torch.ones(DUMMY_STATE_DIM),
+    },
+    ACTION: {
+        "mean": torch.zeros(DUMMY_ACTION_DIM),
+        "std": torch.ones(DUMMY_ACTION_DIM),
+        "q01": torch.zeros(DUMMY_ACTION_DIM),
+        "q99": torch.ones(DUMMY_ACTION_DIM),
+    },
+    "images": {
+        key: {
+            "mean": torch.zeros(3, 224, 224),
+            "std": torch.ones(3, 224, 224),
+            "q01": torch.zeros(3, 224, 224),
+            "q99": torch.ones(3, 224, 224),
+        }
+        for key in IMAGE_KEYS
+    },
+}
+
+
+class PI05PolicyInputAdapter(torch.nn.Module):
+    """Minimal adapter exposing PI0.5 policy image preparation without loading model weights."""
+
+    _preprocess_images = PI05Policy._preprocess_images
+
+    def __init__(self, config: PI05Config) -> None:
+        super().__init__()
+        self.config = config
+        self._device_anchor = torch.nn.Parameter(torch.empty((), device=config.device), requires_grad=False)
+
+
+def create_pi05_config() -> PI05Config:
+    config = PI05Config(device=str(DEVICE))
+    config.max_state_dim = DUMMY_STATE_DIM
+    config.max_action_dim = DUMMY_ACTION_DIM
+    config.chunk_size = DUMMY_ACTION_HORIZON
+    config.n_action_steps = DUMMY_ACTION_HORIZON
+    config.tokenizer_max_length = DUMMY_MAX_TOKEN_LEN
+    config.input_features = {
+        OBS_STATE: PolicyFeature(type=FeatureType.STATE, shape=(DUMMY_STATE_DIM,)),
+        **{
+            f"observation.images.{key}": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 224, 224))
+            for key in IMAGE_KEYS
+        },
+    }
+    config.output_features = {
+        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(DUMMY_ACTION_DIM,)),
+    }
+    return config
+
+
+def create_dummy_data() -> dict:
+    batch_size = 2
+    prompt = "Pick up the red block and place it in the bin"
+    return {
+        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=DEVICE),
+        ACTION: torch.randn(
+            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=DEVICE
+        ),
+        **{
+            f"observation.images.{key}": torch.rand(
+                batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
+            )
+            for key in IMAGE_KEYS
+        },
+        "task": [prompt for _ in range(batch_size)],
+    }
+
+
+def test_pi05_processor_inputs_match_openpi_reference():
+    torch.manual_seed(0)
+    config = create_pi05_config()
+    preprocessor, _ = make_pi05_pre_post_processors(config=config, dataset_stats=DUMMY_DATASET_STATS)
+
+    raw_batch = create_dummy_data()
+    lerobot_batch = preprocessor(clone_batch(raw_batch))
+    openpi_observation = make_openpi_observation_from_raw(
+        raw_batch,
+        action_dim=DUMMY_ACTION_DIM,
+        max_token_len=DUMMY_MAX_TOKEN_LEN,
+        dataset_stats=DUMMY_DATASET_STATS,
+        pi05=True,
+    )
+
+    assert_processor_inputs_match_lerobot(
+        PI05PolicyInputAdapter(config),
+        lerobot_batch,
+        openpi_observation,
+        compare_state=False,
+    )
+    torch.testing.assert_close(
+        lerobot_batch[ACTION],
+        openpi_model_actions_from_raw(
+            raw_batch,
+            action_dim=DUMMY_ACTION_DIM,
+            dataset_stats=DUMMY_DATASET_STATS,
+            pi05=True,
+        ),
+        rtol=0,
+        atol=0,
+    )
diff --git a/tests/processor/test_pi0_processor.py b/tests/processor/test_pi0_processor.py
new file mode 100644
index 000000000..e9d5b4a37
--- /dev/null
+++ b/tests/processor/test_pi0_processor.py
@@ -0,0 +1,156 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Compare the PI0 processor pipeline against the vendored OpenPI reference processors."""
+
+import os
+
+import pytest
+import torch
+
+pytest.importorskip("transformers")
+
+from lerobot.configs import FeatureType, PolicyFeature  # noqa: E402
+from lerobot.policies.pi0 import PI0Policy  # noqa: E402
+from lerobot.policies.pi0.configuration_pi0 import PI0Config  # noqa: E402
+from lerobot.policies.pi0.processor_pi0 import make_pi0_pre_post_processors  # noqa: E402
+from lerobot.utils.constants import ACTION, OBS_STATE  # noqa: E402
+from tests.policies.pi0_pi05.utils.openpi_parity import (  # noqa: E402
+    IMAGE_KEYS,
+    assert_processor_inputs_match_lerobot,
+    clone_batch,
+    make_openpi_observation_from_raw,
+    openpi_model_actions_from_raw,
+)
+
+pytestmark = pytest.mark.skipif(
+    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
+    reason="OpenPI processor parity uses the PaliGemma tokenizer; run manually outside CI.",
+)
+
+DUMMY_ACTION_DIM = 32
+DUMMY_STATE_DIM = 32
+DUMMY_ACTION_HORIZON = 50
+DUMMY_MAX_TOKEN_LEN = 48
+DEVICE = torch.device("cpu")
+
+DUMMY_DATASET_STATS = {
+    OBS_STATE: {
+        "mean": torch.zeros(DUMMY_STATE_DIM),
+        "std": torch.ones(DUMMY_STATE_DIM),
+        "q01": torch.zeros(DUMMY_STATE_DIM),
+        "q99": torch.ones(DUMMY_STATE_DIM),
+    },
+    ACTION: {
+        "mean": torch.zeros(DUMMY_ACTION_DIM),
+        "std": torch.ones(DUMMY_ACTION_DIM),
+        "q01": torch.zeros(DUMMY_ACTION_DIM),
+        "q99": torch.ones(DUMMY_ACTION_DIM),
+    },
+    "images": {
+        key: {
+            "mean": torch.zeros(3, 224, 224),
+            "std": torch.ones(3, 224, 224),
+            "q01": torch.zeros(3, 224, 224),
+            "q99": torch.ones(3, 224, 224),
+        }
+        for key in IMAGE_KEYS
+    },
+}
+
+
+class PI0PolicyInputAdapter(torch.nn.Module):
+    """Minimal adapter exposing PI0 policy input-preparation helpers without loading model weights."""
+
+    _preprocess_images = PI0Policy._preprocess_images
+    prepare_state = PI0Policy.prepare_state
+
+    def __init__(self, config: PI0Config) -> None:
+        super().__init__()
+        self.config = config
+        self._device_anchor = torch.nn.Parameter(torch.empty((), device=config.device), requires_grad=False)
+
+
+def create_pi0_config() -> PI0Config:
+    config = PI0Config(device=str(DEVICE))
+    config.max_state_dim = DUMMY_STATE_DIM
+    config.max_action_dim = DUMMY_ACTION_DIM
+    config.chunk_size = DUMMY_ACTION_HORIZON
+    config.n_action_steps = DUMMY_ACTION_HORIZON
+    config.tokenizer_max_length = DUMMY_MAX_TOKEN_LEN
+    config.input_features = {
+        OBS_STATE: PolicyFeature(type=FeatureType.STATE, shape=(DUMMY_STATE_DIM,)),
+        **{
+            f"observation.images.{key}": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 224, 224))
+            for key in IMAGE_KEYS
+        },
+    }
+    config.output_features = {
+        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(DUMMY_ACTION_DIM,)),
+    }
+    return config
+
+
+def create_dummy_data() -> dict:
+    batch_size = 2
+    prompt = "Pick up the red block and place it in the bin"
+    return {
+        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=DEVICE),
+        ACTION: torch.randn(
+            batch_size, DUMMY_ACTION_HORIZON, DUMMY_ACTION_DIM, dtype=torch.float32, device=DEVICE
+        ),
+        **{
+            f"observation.images.{key}": torch.rand(
+                batch_size, 3, 224, 224, dtype=torch.float32, device=DEVICE
+            )
+            for key in IMAGE_KEYS
+        },
+        "task": [prompt for _ in range(batch_size)],
+    }
+
+
+def test_pi0_processor_inputs_match_openpi_reference():
+    torch.manual_seed(0)
+    config = create_pi0_config()
+    preprocessor, _ = make_pi0_pre_post_processors(config=config, dataset_stats=DUMMY_DATASET_STATS)
+
+    raw_batch = create_dummy_data()
+    lerobot_batch = preprocessor(clone_batch(raw_batch))
+    openpi_observation = make_openpi_observation_from_raw(
+        raw_batch,
+        action_dim=DUMMY_ACTION_DIM,
+        max_token_len=DUMMY_MAX_TOKEN_LEN,
+        dataset_stats=DUMMY_DATASET_STATS,
+        pi05=False,
+    )
+
+    assert_processor_inputs_match_lerobot(
+        PI0PolicyInputAdapter(config),
+        lerobot_batch,
+        openpi_observation,
+        compare_state=True,
+    )
+    torch.testing.assert_close(
+        lerobot_batch[ACTION],
+        openpi_model_actions_from_raw(
+            raw_batch,
+            action_dim=DUMMY_ACTION_DIM,
+            dataset_stats=DUMMY_DATASET_STATS,
+            pi05=False,
+        ),
+        rtol=0,
+        atol=0,
+    )

From 9f437d86b6d74982c26b1ef499fa32d71cbe115f Mon Sep 17 00:00:00 2001
From: Haoming Song <haomingsong24@gmail.com>
Date: Fri, 22 May 2026 16:31:04 +0800
Subject: [PATCH 12/17] fix(groot): align GR00TN15Config with transformers
 config dataclasses (#3606)

* fix(gr00t): fix gr00t config dataclass init TypeError

* fix(groot): guard strict config decorator without transformers for passing CI

---------

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>

From 8194897994be649cc85405fe54eb9ede751886b1 Mon Sep 17 00:00:00 2001
From: Pepijn <138571049+pkooij@users.noreply.github.com>
Date: Fri, 22 May 2026 12:03:07 +0200
Subject: [PATCH 13/17] fix(deps): cap placo below 0.9.16 and harden kinematics
 import (#3647)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* fix(deps): cap placo below 0.9.16 and harden kinematics import

placo 0.9.16 links against liburdfdom_sensor.so.4, which is unavailable
on Ubuntu 24.04 (noble ships urdfdom 3.x). Importing placo on that base
crashes with:

  ImportError: liburdfdom_sensor.so.4.0: cannot open shared object file

This broke nightly Latest Deps tests (CPU and GPU) when the lockfile
upgrade picked placo 0.9.16, since lerobot.model.kinematics
unconditionally imports placo when _placo_available is true, and that
check (importlib.util.find_spec) cannot detect dlopen failures of
transitive shared libraries — so unrelated subsystems (RL actor,
gym_manipulator) became unimportable.

Two changes:

1. Pin placo to <0.9.16 in pyproject.toml + regenerate uv.lock
   (0.9.16 → 0.9.15). Short-term unblock for nightly CI until system
   urdfdom 4.x is broadly available.

2. Harden the import guard in src/lerobot/model/kinematics.py:
   wrap 'import placo' in try/except ImportError so a missing
   transitive .so no longer crashes module import. RobotKinematics
   instantiation now raises an informative ImportError citing the
   underlying dlopen failure via _raise_if_placo_unusable().
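
For illustration only, the guard in change 2 can be sketched outside of
lerobot roughly as follows. The helper names load_optional and
require_usable are hypothetical and do not exist in the codebase; the
sketch only assumes the standard importlib API. It shows why a
find_spec check alone is not enough: the spec can be found while the
import itself still fails.

```python
import importlib
import importlib.util


def load_optional(name):
    """Try to import an optional top-level package.

    Returns (module, error):
      - (module, None) when the import succeeds,
      - (None, None)   when the package is not installed at all,
      - (None, error)  when it is installed but importing it fails,
                       e.g. because a transitive shared library is missing.
    """
    if importlib.util.find_spec(name) is None:
        return None, None
    try:
        return importlib.import_module(name), None
    except ImportError as err:
        return None, err


def require_usable(name, module, error):
    """Raise an informative ImportError only for the installed-but-broken case."""
    if module is None and error is not None:
        raise ImportError(f"{name} is installed but failed to import: {error!s}") from error
```

Deferring the raise to the point of use keeps unrelated modules
importable, which is the behaviour described above for RobotKinematics.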

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(kinematics): hoist _placo_runtime_error to module scope for mypy

Mypy walks the TYPE_CHECKING branch in which the runtime else-block is
not executed, so _placo_runtime_error was only defined at runtime and
mypy reported 'Name "_placo_runtime_error" is not defined' on the
three references inside _raise_if_placo_unusable. Declare the symbol
unconditionally at module scope with a default of None; the runtime
import-failure branch still assigns to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style(kinematics): drop verbose comments around placo import guard

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 pyproject.toml                  |  4 +++-
 src/lerobot/model/kinematics.py | 20 +++++++++++++++++---
 uv.lock                         | 22 +++++++++++-----------
 3 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/pyproject.toml b/pyproject.toml
index ca6248c95..5d182648c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -138,7 +138,9 @@ dataset_viz = ["lerobot[dataset]", "lerobot[viz]"]
 # Common
 av-dep = ["av>=15.0.0,<16.0.0"]
 pygame-dep = ["pygame>=2.5.1,<2.7.0"]
-placo-dep = ["placo>=0.9.6,<0.9.17"]
+# NOTE: 0.9.16 links against liburdfdom_sensor.so.4, which is unavailable on Ubuntu 24.04
+# (noble ships urdfdom 3.x). Cap below 0.9.16 until system urdfdom 4.x is broadly available.
+placo-dep = ["placo>=0.9.6,<0.9.16"]
 transformers-dep = ["transformers>=5.4.0,<5.6.0"]
 grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
 can-dep = ["python-can>=4.2.0,<5.0.0"]
diff --git a/src/lerobot/model/kinematics.py b/src/lerobot/model/kinematics.py
index 01705ded5..45bd7d438 100644
--- a/src/lerobot/model/kinematics.py
+++ b/src/lerobot/model/kinematics.py
@@ -18,12 +18,25 @@ from typing import TYPE_CHECKING
 
 import numpy as np
 
-from lerobot.utils.import_utils import _placo_available, require_package
+from lerobot.utils.import_utils import require_package
 
-if TYPE_CHECKING or _placo_available:
+_placo_runtime_error: ImportError | None = None
+
+if TYPE_CHECKING:
     import placo  # type: ignore[import-not-found]
 else:
-    placo = None
+    try:
+        import placo  # type: ignore[import-not-found]
+    except ImportError as _placo_import_err:
+        placo = None
+        _placo_runtime_error = _placo_import_err
+
+
+def _raise_if_placo_unusable() -> None:
+    if placo is None and _placo_runtime_error is not None:
+        raise ImportError(
+            f"placo is installed but failed to import: {_placo_runtime_error!s}"
+        ) from _placo_runtime_error
 
 
 class RobotKinematics:
@@ -44,6 +57,7 @@ class RobotKinematics:
             joint_names (list[str] | None): List of joint names to use for the kinematics solver
         """
         require_package("placo", extra="placo-dep")
+        _raise_if_placo_unusable()
 
         self.robot = placo.RobotWrapper(urdf_path)
         self.solver = placo.KinematicsSolver(self.robot)
diff --git a/uv.lock b/uv.lock
index 7092f780a..c5f026517 100644
--- a/uv.lock
+++ b/uv.lock
@@ -3203,7 +3203,7 @@ requires-dist = [
     { name = "pandas", marker = "extra == 'video-benchmark'", specifier = ">=2.2.2,<2.4.0" },
     { name = "peft", marker = "extra == 'peft-dep'", specifier = ">=0.18.0,<1.0.0" },
     { name = "pillow", specifier = ">=10.0.0,<13.0.0" },
-    { name = "placo", marker = "extra == 'placo-dep'", specifier = ">=0.9.6,<0.9.17" },
+    { name = "placo", marker = "extra == 'placo-dep'", specifier = ">=0.9.6,<0.9.16" },
     { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.7.0,<5.0.0" },
     { name = "protobuf", marker = "extra == 'grpcio-dep'", specifier = ">=6.31.1,<6.32.0" },
     { name = "pyarrow", marker = "extra == 'dataset'", specifier = ">=21.0.0,<30.0.0" },
@@ -4592,7 +4592,7 @@ wheels = [
 
 [[package]]
 name = "placo"
-version = "0.9.16"
+version = "0.9.15"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
     { name = "cmeel" },
@@ -4602,16 +4602,16 @@ dependencies = [
     { name = "pin" },
     { name = "rhoban-cmeel-jsoncpp" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/9e/0a/36c5b729d0d69075e7dfafd1b36c4df6fbb8c1ff1585e88d3c56d4c15010/placo-0.9.16.tar.gz", hash = "sha256:5314faaf6442e7ffe17347680d236af953951813bbfb1c09c4a27f7388d332e4", size = 136871, upload-time = "2025-11-07T14:24:58.811Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/40/c4/a33a0ee2ad798471a1c43a96109d28f358fd95c78a56f8cff57acb66d2bc/placo-0.9.15.tar.gz", hash = "sha256:df47f1154bae305c943bd20ba4f56d50ffc65625efc98679fefb11e8ff3c462c", size = 136856, upload-time = "2025-11-03T10:49:13.151Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/a4/95/8a85b58033303fd354a680e1494f47801abdca9133c222ae1c2473983f25/placo-0.9.16-0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:417a89920b340e3aec19f1f49e1fb06789c679a807450157af8bdf4aef4bc82b", size = 1641806, upload-time = "2025-11-07T14:24:34.736Z" },
-    { url = "https://files.pythonhosted.org/packages/92/bd/2fb3556c71b0689b3168c0e85fce5befb605affcfe4afb3b5e7b5ba6749f/placo-0.9.16-0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:a7ef7ac33ba889d2122db0d7ed55eeecdffed020e2282712989bb11e408bab40", size = 1515468, upload-time = "2025-11-07T14:24:36.587Z" },
-    { url = "https://files.pythonhosted.org/packages/ea/fd/7dba380720dfb89df582a51d0b2cb43957a36849f676baa3dfc74704e67f/placo-0.9.16-0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:885773fe8a8e809022451ec16d47479562a042596f663b8c5bbe762cd616f573", size = 2106540, upload-time = "2025-11-07T14:24:38.149Z" },
-    { url = "https://files.pythonhosted.org/packages/7a/40/97c7c799fe4f89111b973d7a5f86626a2ec1d0e6e20ce2988e0a2bda66f5/placo-0.9.16-0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:19f097305c714e539fbf19e761897f6daab2ff73f639319431b144e77dd3852e", size = 2178511, upload-time = "2025-11-07T14:24:40.04Z" },
-    { url = "https://files.pythonhosted.org/packages/f7/4d/f1700aae269584477b5d72561d2fc5ace37b4bca167892a74a369849c67e/placo-0.9.16-0-cp313-cp313-macosx_10_9_x86_64.whl", hash = "sha256:be11fa987702114097ccf3d94e1c4a891796878429e25c8d88b187ecc652e7ae", size = 1641812, upload-time = "2025-11-07T14:24:41.308Z" },
-    { url = "https://files.pythonhosted.org/packages/43/d7/21d1d0dd1311c0cbd9ccd233cdae520bbe2370095e3c831059d6077c90bd/placo-0.9.16-0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:c2d65aeb4844eae28006ad3a50c8519b27c701912cc99c46c95e33ed049f3635", size = 1515457, upload-time = "2025-11-07T14:24:42.758Z" },
-    { url = "https://files.pythonhosted.org/packages/0f/e8/939ba23bfa539fb90ab9ab1c2c59ff9a9a46e24699fc90e8ca3ff2948646/placo-0.9.16-0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:a7633aff1c592c1f45e86a174a372d5d7972673935cb9151391277ff49ec2072", size = 2106538, upload-time = "2025-11-07T14:24:44.517Z" },
-    { url = "https://files.pythonhosted.org/packages/08/00/ad24cc0ad85fbe12267df28c2061e1eaef8f852146c467fcd7a681e11028/placo-0.9.16-0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:0d97a7284b65fc45aef27865c80cf7e53f04646d35bb18494ab62dfbbc9a35bd", size = 2178514, upload-time = "2025-11-07T14:24:45.994Z" },
+    { url = "https://files.pythonhosted.org/packages/ef/03/207b1c087996b918fdbaa5a3a685e3b14b068cd303bf87affdf83f722b33/placo-0.9.15-0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:eab7a299e73291fe631c02448b9e9826539f4824e198bcf85f7c91fdd77d054b", size = 1641975, upload-time = "2025-11-03T10:48:48.887Z" },
+    { url = "https://files.pythonhosted.org/packages/92/55/40432b26bb1c5b9e677fbc41e8d85b54fa8897b7daebb2a22d410b0a7f7b/placo-0.9.15-0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:23f9dd19b8d15fa9d86968948b57981ebc6f1decafeffc2d646d8b56f685b50d", size = 1515448, upload-time = "2025-11-03T10:48:50.562Z" },
+    { url = "https://files.pythonhosted.org/packages/fd/8e/e6283201d329409dccf2045b5c1efd73b3dad5268143bbea4668029ca9c6/placo-0.9.15-0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:2680a2166c23a0a2aa6226ad75c63a2b2310c812673a5db296616d9af053e076", size = 2106550, upload-time = "2025-11-03T10:48:52.364Z" },
+    { url = "https://files.pythonhosted.org/packages/51/c3/77efe4c999e1d80ec14879ef73ea2a2144aa12db2b67870a562f87ed5b43/placo-0.9.15-0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:1a2202a78bcd2874ca09a9a6526a95b38874803923cb9b3b4b96cd68ab4b7217", size = 2178531, upload-time = "2025-11-03T10:48:53.932Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/e7/b5cc5ad53414ff7af3357e0c9d97d902a3ce276e7810f8814fe9f0c1fb70/placo-0.9.15-0-cp313-cp313-macosx_10_9_x86_64.whl", hash = "sha256:84a445a99b059a512d1b4c64841a91d6f50149c7be9255c65bedeebbe6663989", size = 1641982, upload-time = "2025-11-03T10:48:55.277Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/1c/1c9163d941698a077617f218041efc573d3bf5a1c169a284112bd622fccd/placo-0.9.15-0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:b3106e7e6b05cbfa494239d8aa14795f7da8ee5dec851602f0d6297e311d7334", size = 1515447, upload-time = "2025-11-03T10:48:56.975Z" },
+    { url = "https://files.pythonhosted.org/packages/cd/22/3d9b9045b89248c8476dd42243bc9821a123d9199e4e96a944124ad80cf1/placo-0.9.15-0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:66c3d099e87551401aace04f1293a3c3563b1399319976647846845bf92c3ccf", size = 2106558, upload-time = "2025-11-03T10:48:58.667Z" },
+    { url = "https://files.pythonhosted.org/packages/20/0b/45dbdd2c378a7cece578b7344fda493d5a2aa6777089798a315ce4f97c22/placo-0.9.15-0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:0e06b7d3d618ddc2b649ab8b0b46db8001fe72fe2fbcc801524df0ccc8a3da40", size = 2178531, upload-time = "2025-11-03T10:49:00.533Z" },
 ]
 
 [[package]]

From f65f3f7a4a8bc2eb405d692ed297b9f9a3828e20 Mon Sep 17 00:00:00 2001
From: Reece O'Mahoney <66252930+reeceomahoney@users.noreply.github.com>
Date: Tue, 26 May 2026 13:01:19 +0100
Subject: [PATCH 14/17] Fix policy.path in YAML configs (PR #3145 followup)
 (#3597)

PR #3145 added YAML support for policy.path but left two bugs:

1. extract_path_fields_from_config only deleted config_data[field] when
   no sibling overrides existed. With siblings, the dict stayed in place
   and draccus crashed decoding it as PreTrainedConfig (no 'type' key).
   Sibling overrides go into _config_yaml_overrides and are applied later
   by from_pretrained(), so the field can always be removed.

2. wrap() updated config_path_cli to the cleaned temp file path but
   never propagated it to the draccus.parse fallback branch. cli_args
   still contained --config_path=<original>, so draccus read the
   original YAML with path: still present.
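
As a rough illustration of the first fix, the intended behaviour can be
sketched on a plain dict. This is a simplified stand-in, not the parser
code: extract_path_field is a hypothetical name, and dropping type
from the captured siblings is an assumption based on the test comments.

```python
from __future__ import annotations


def extract_path_field(config: dict, field: str) -> tuple[str | None, dict]:
    """Pull `path` out of config[field] and remove the whole block.

    Returns (path, sibling_overrides). The block is deleted even when
    siblings exist, because the siblings are returned separately and are
    meant to be applied later on top of the pretrained config.
    """
    block = config.get(field)
    if not isinstance(block, dict) or "path" not in block:
        return None, {}  # nothing to extract; config is left untouched

    siblings = dict(block)
    path = siblings.pop("path")
    siblings.pop("type", None)  # assumption: type is not kept as an override
    del config[field]  # always remove, with or without siblings
    return path, siblings
```

With the pre-fix logic the equivalent of the del only ran when
siblings was empty, so a block such as {n_action_steps: 10} stayed in
the cleaned config without a type key.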

Tests passed because they (a) called extract_path_fields_from_config
directly and (b) included type: alongside path: in the YAML, sidestepping
both bugs.

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
---
 src/lerobot/configs/parser.py  |  11 +++-
 tests/test_yaml_policy_path.py | 116 +++++++++++++++++++++++++++++++--
 2 files changed, 117 insertions(+), 10 deletions(-)

diff --git a/src/lerobot/configs/parser.py b/src/lerobot/configs/parser.py
index d55fa44aa..46cff2b48 100644
--- a/src/lerobot/configs/parser.py
+++ b/src/lerobot/configs/parser.py
@@ -255,8 +255,7 @@ def extract_path_fields_from_config(config_path: str, path_fields: list[str]) ->
             remaining = config_data[field]
             if remaining:
                 _config_yaml_overrides[field] = _flatten_to_cli_args(remaining)
-            else:
-                del config_data[field]
+            del config_data[field]
             modified = True
 
     if not modified:
@@ -311,7 +310,13 @@ def wrap(config_path: Path | None = None) -> Callable[[F], F]:
                     cli_args = filter_arg("config_path", cli_args)
                     cfg = argtype.from_pretrained(config_path_cli, cli_args=cli_args)
                 else:
-                    cfg = draccus.parse(config_class=argtype, config_path=config_path, args=cli_args)
+                    if config_path_cli:
+                        cli_args = filter_arg("config_path", cli_args)
+                    cfg = draccus.parse(
+                        config_class=argtype,
+                        config_path=config_path_cli or config_path,
+                        args=cli_args,
+                    )
             response = fn(cfg, *args, **kwargs)
             return response
 
diff --git a/tests/test_yaml_policy_path.py b/tests/test_yaml_policy_path.py
index 710a71c9a..8d8f7f2ec 100644
--- a/tests/test_yaml_policy_path.py
+++ b/tests/test_yaml_policy_path.py
@@ -1,10 +1,14 @@
 """Tests for policy.path support in YAML config files (issue #2957)."""
 
 import json
+import sys
 import tempfile
+from dataclasses import dataclass, field
+from unittest.mock import patch
 
 import yaml
 
+from lerobot.configs import parser
 from lerobot.configs.parser import (
     _config_path_args,
     _config_yaml_overrides,
@@ -16,7 +20,8 @@ from lerobot.configs.parser import (
 
 
 def test_extract_path_fields_from_yaml():
-    """Test that policy.path is extracted from a YAML config and removed."""
+    """Test that policy.path is extracted from a YAML config and the policy block
+    is removed entirely (siblings are captured separately as cli_overrides)."""
     config = {
         "dataset": {"repo_id": "lerobot/pusht"},
         "policy": {"type": "smolvla", "path": "lerobot/smolvla_base", "push_to_hub": False},
@@ -26,26 +31,33 @@ def test_extract_path_fields_from_yaml():
         config_path = f.name
 
     _config_path_args.clear()
+    _config_yaml_overrides.clear()
     cleaned_path = extract_path_fields_from_config(config_path, ["policy"])
 
     # Path should be extracted and stored
     assert _config_path_args["policy"] == "lerobot/smolvla_base"
 
-    # Cleaned config should not have the path field
+    # Cleaned config should not have the policy block at all -- draccus must not
+    # try to decode it as PreTrainedConfig; the actual config comes from
+    # from_pretrained(path) with the captured overrides applied on top.
     with open(cleaned_path) as f:
         cleaned = yaml.safe_load(f)
-    assert "path" not in cleaned["policy"]
-    assert cleaned["policy"]["type"] == "smolvla"
-    assert cleaned["policy"]["push_to_hub"] is False
+    assert "policy" not in cleaned
 
     # Original dataset should be untouched
     assert cleaned["dataset"]["repo_id"] == "lerobot/pusht"
 
+    # Sibling overrides (excluding type/path) captured for from_pretrained.
+    overrides = get_yaml_overrides("policy")
+    assert any("push_to_hub=false" in o for o in overrides)
+
     _config_path_args.clear()
+    _config_yaml_overrides.clear()
 
 
 def test_extract_path_fields_from_json():
-    """Test that policy.path is extracted from a JSON config."""
+    """Test that policy.path is extracted from a JSON config and the policy
+    block is removed entirely."""
     config = {
         "policy": {"type": "act", "path": "some/local/path"},
     }
@@ -54,15 +66,17 @@ def test_extract_path_fields_from_json():
         config_path = f.name
 
     _config_path_args.clear()
+    _config_yaml_overrides.clear()
     cleaned_path = extract_path_fields_from_config(config_path, ["policy"])
 
     assert _config_path_args["policy"] == "some/local/path"
 
     with open(cleaned_path) as f:
         cleaned = json.load(f)
-    assert "path" not in cleaned["policy"]
+    assert "policy" not in cleaned
 
     _config_path_args.clear()
+    _config_yaml_overrides.clear()
 
 
 def test_extract_no_path_returns_original():
@@ -216,3 +230,91 @@ def test_flatten_nested_with_bools():
     args = _flatten_to_cli_args(d)
     assert "--optimizer.use_warmup=true" in args
     assert "--optimizer.lr=0.01" in args
+
+
+def test_extract_removes_field_with_siblings_and_no_type():
+    """Regression: when policy.path has siblings but no type:, the entire policy
+    block must still be removed from the cleaned config. Otherwise draccus tries
+    to decode the leftover dict as PreTrainedConfig and crashes on the missing
+    type discriminator.
+    """
+    config = {
+        "dataset": {"repo_id": "lerobot/pusht"},
+        "policy": {
+            "path": "lerobot/smolvla_base",
+            "n_action_steps": 10,
+            "dtype": "bfloat16",
+        },
+    }
+    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
+        yaml.dump(config, f)
+        config_path = f.name
+
+    _config_path_args.clear()
+    _config_yaml_overrides.clear()
+    cleaned_path = extract_path_fields_from_config(config_path, ["policy"])
+
+    with open(cleaned_path) as f:
+        cleaned = yaml.safe_load(f) or {}
+    assert "policy" not in cleaned, "policy block should be fully removed when path is present"
+    assert cleaned["dataset"]["repo_id"] == "lerobot/pusht"
+    assert _config_path_args["policy"] == "lerobot/smolvla_base"
+    overrides = get_yaml_overrides("policy")
+    assert any("n_action_steps=10" in o for o in overrides)
+    assert any("dtype=bfloat16" in o for o in overrides)
+
+    _config_path_args.clear()
+    _config_yaml_overrides.clear()
+
+
+@dataclass
+class _DummyNested:
+    foo: int = 0
+
+
+@dataclass
+class _DummyConfig:
+    nested: _DummyNested = field(default_factory=_DummyNested)
+    other: str = "default"
+
+    @classmethod
+    def __get_path_fields__(cls):
+        return ["nested"]
+
+
+def test_wrap_uses_cleaned_config_for_draccus_parse():
+    """Regression: wrap() updates config_path_cli to point at the cleaned temp
+    file but must propagate that to the draccus.parse fallback branch. Without
+    the fix, cli_args still contains --config_path=<original> and draccus reads
+    the original YAML with `path:` still in it, crashing on the unknown field.
+    """
+    config = {
+        "nested": {"path": "some/checkpoint", "foo": 42},
+        "other": "set-via-yaml",
+    }
+    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
+        yaml.dump(config, f)
+        config_path = f.name
+
+    _config_path_args.clear()
+    _config_yaml_overrides.clear()
+
+    captured: dict = {}
+
+    @parser.wrap()
+    def main(cfg: _DummyConfig) -> _DummyConfig:
+        captured["cfg"] = cfg
+        return cfg
+
+    with patch.object(sys, "argv", ["prog", f"--config_path={config_path}"]):
+        main()
+
+    assert captured["cfg"].other == "set-via-yaml"
+    assert _config_path_args["nested"] == "some/checkpoint"
+    # Cleaned config dropped `nested:` entirely; defaults stand for this wrapper
+    # class (a real PreTrainedConfig would now load the checkpoint and apply
+    # the captured yaml_overrides via from_pretrained()).
+    assert captured["cfg"].nested.foo == 0
+
+    _config_path_args.clear()
+    _config_yaml_overrides.clear()

From 5c98e80430d4a747926b45893568e388105a2400 Mon Sep 17 00:00:00 2001
From: Haoming Song <haomingsong24@gmail.com>
Date: Tue, 26 May 2026 20:04:22 +0800
Subject: [PATCH 15/17] fix(gr00t): fix Eagle25VL model and processor crash in
 transformers>=5.4.0, <5.6.0 (#3652)

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
---
 .../policies/groot/eagle2_hg_model/modeling_eagle2_5_vl.py  | 1 +
 .../groot/eagle2_hg_model/processing_eagle2_5_vl.py         | 1 -
 src/lerobot/policies/groot/processor_groot.py               | 6 +++++-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/lerobot/policies/groot/eagle2_hg_model/modeling_eagle2_5_vl.py b/src/lerobot/policies/groot/eagle2_hg_model/modeling_eagle2_5_vl.py
index 5a66cfbce..6e5532ea4 100755
--- a/src/lerobot/policies/groot/eagle2_hg_model/modeling_eagle2_5_vl.py
+++ b/src/lerobot/policies/groot/eagle2_hg_model/modeling_eagle2_5_vl.py
@@ -60,6 +60,7 @@ class Eagle25VLPreTrainedModel(PreTrainedModel):
         "SiglipEncoderLayer",
     ]
     _skip_keys_device_placement = "past_key_values"
+    _supports_flash_attn = True
     _supports_flash_attn_2 = True
     _supports_cache_class = True
     _supports_static_cache = True
diff --git a/src/lerobot/policies/groot/eagle2_hg_model/processing_eagle2_5_vl.py b/src/lerobot/policies/groot/eagle2_hg_model/processing_eagle2_5_vl.py
index 7b1f67fef..b36e70c47 100755
--- a/src/lerobot/policies/groot/eagle2_hg_model/processing_eagle2_5_vl.py
+++ b/src/lerobot/policies/groot/eagle2_hg_model/processing_eagle2_5_vl.py
@@ -124,7 +124,6 @@ class Eagle25VLProcessor(ProcessorMixin):
         "videos_kwargs",
         "text_kwargs",
     ]
-    image_processor_class = "AutoImageProcessor"
     tokenizer_class = "AutoTokenizer"
 
     def __init__(
diff --git a/src/lerobot/policies/groot/processor_groot.py b/src/lerobot/policies/groot/processor_groot.py
index 3367de711..6848c7c84 100644
--- a/src/lerobot/policies/groot/processor_groot.py
+++ b/src/lerobot/policies/groot/processor_groot.py
@@ -206,7 +206,11 @@ def _build_eagle_processor(tokenizer_assets_repo: str = DEFAULT_TOKENIZER_ASSETS
             "Vendor files are copied during model creation. Create the policy/model first, "
             "or call ensure_eagle_cache_ready() before building processors."
         )
-    proc = AutoProcessor.from_pretrained(str(cache_dir), trust_remote_code=True, use_fast=True)
+    proc = AutoProcessor.from_pretrained(
+        str(cache_dir),
+        trust_remote_code=True,
+        fix_mistral_regex=False,
+    )
     proc.tokenizer.padding_side = "left"
     return proc
 

From e86f5af5bf30d7cd442d07b862b3fbb82f5c79b2 Mon Sep 17 00:00:00 2001
From: Khalil Meftah <khalil.meftah@huggingface.co>
Date: Wed, 27 May 2026 14:24:31 +0200
Subject: [PATCH 16/17] feat(rewards): add TOPReward reward model (#3629)

* feat(rewards): add TOPReward reward model

* refactor(rewards): clean up TOPReward processor/model

* fix(rewards/topreward): add missing input keys mm_token_type_ids

* fix(rewards/topreward): fix pyproject extra typo and simplify processor (#3653)

Add the lerobot[topreward] extra to `all` in
pyproject.toml, drop the redundant labels arg in scoring, and
collapse the dead-branch shape check in the encoder processor.

* optimize topreward input processing (#3660)

---------

Co-authored-by: Cole <91766445+jcoleharrison@users.noreply.github.com>
Co-authored-by: Haoming Song <haomingsong24@gmail.com>
---
 docs/source/_toctree.yml                      |   2 +
 docs/source/topreward.mdx                     | 177 +++++++++
 pyproject.toml                                |   2 +
 src/lerobot/rewards/__init__.py               |   2 +
 src/lerobot/rewards/factory.py                |  19 +-
 src/lerobot/rewards/topreward/__init__.py     |  19 +
 .../rewards/topreward/compute_rabc_weights.py | 353 ++++++++++++++++++
 .../topreward/configuration_topreward.py      | 146 ++++++++
 .../rewards/topreward/modeling_topreward.py   | 238 ++++++++++++
 .../rewards/topreward/processor_topreward.py  | 305 +++++++++++++++
 .../lerobot_rewardmodel_modelcard_template.md |   2 +
 tests/rewards/test_modeling_topreward.py      | 296 +++++++++++++++
 tests/rewards/test_topreward.py               |  80 ++++
 tests/rewards/test_topreward_processor.py     | 246 ++++++++++++
 uv.lock                                       |   7 +-
 15 files changed, 1891 insertions(+), 3 deletions(-)
 create mode 100644 docs/source/topreward.mdx
 create mode 100644 src/lerobot/rewards/topreward/__init__.py
 create mode 100644 src/lerobot/rewards/topreward/compute_rabc_weights.py
 create mode 100644 src/lerobot/rewards/topreward/configuration_topreward.py
 create mode 100644 src/lerobot/rewards/topreward/modeling_topreward.py
 create mode 100644 src/lerobot/rewards/topreward/processor_topreward.py
 create mode 100644 tests/rewards/test_modeling_topreward.py
 create mode 100644 tests/rewards/test_topreward.py
 create mode 100644 tests/rewards/test_topreward_processor.py

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 412386e2d..527cb7e63 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -73,6 +73,8 @@
 - sections:
   - local: sarm
     title: SARM
+  - local: topreward
+    title: TOPReward
   title: "Reward Models"
 - sections:
   - local: inference
diff --git a/docs/source/topreward.mdx b/docs/source/topreward.mdx
new file mode 100644
index 000000000..f84fbed49
--- /dev/null
+++ b/docs/source/topreward.mdx
@@ -0,0 +1,177 @@
+# TOPReward
+
+TOPReward is a **zero-shot reward model** that extracts token log-probabilities from an off-the-shelf vision-language model (VLM) as a robotic reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood that the instruction is true — no fine-tuning required.
+
+**Paper**: [TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics](https://arxiv.org/abs/2602.19313)
+**Project**: [topreward.github.io](https://topreward.github.io/webpage/)
+**Original code**: [github.com/TOPReward/TOPReward](https://github.com/TOPReward/TOPReward)
+**Default backbone**: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+
+## Overview
+
+TOPReward asks a generic VLM how likely a task instruction is, **conditioned on the video** of a robot trying to complete that task. Concretely, given:
+
+- A trajectory video (a sequence of frames).
+- A task instruction (e.g. _"open the drawer"_).
+
+it builds a chat prompt of the form
+
+```text
+<video>
+"The above video shows a robot manipulation trajectory that completes the
+ following task: <instruction> Decide whether the above statement is True
+ or not. The answer is: True"
+```
+
+forwards it through the VLM, label-masks everything except the very last token, and reads back the log-probability of that token — by default the literal `"True"` that closes the suffix template. The resulting `log P("True" | video + prompt + instruction)` is the reward.
+
+Because the method only depends on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the label-masking logic. The processor owns the tokeniser and builds the full chat prompt (EO-1/Robometer pattern).
+
+## What the LeRobot integration covers
+
+- Standard `reward_model.type=topreward` configuration through LeRobot.
+- VLM loading via the `transformers` `Qwen3VLForConditionalGeneration` API.
+- Prompt assembly + tokenisation in the processor (matching upstream `QwenClient.compute_instruction_reward`).
+- `compute_reward()` returns one scalar log-prob per sample.
+- LeRobot reward-model save/load — `save_pretrained` writes only `config.json` (the VLM is identified by `vlm_name`).
+- An offline labeling script that writes a `topreward_progress.parquet` (SARM-compatible schema) for RA-BC and overlay.
+
+The current LeRobot port supports the **Qwen3-VL client only**. Other upstream clients (Gemini, OpenAI, Gemma, Molmo) can be added as follow-up extras.
+
+## Installation Requirements
+
+1. Install LeRobot following the [Installation Guide](./installation).
+2. Install the TOPReward optional extra:
+
+```bash
+pip install -e ".[topreward]"
+```
+
+or, with `uv` from a source checkout:
+
+```bash
+uv sync --extra topreward
+```
+
+This pulls in `transformers`. The first time you run TOPReward, Hugging Face will also download the VLM weights from the Hub (~16 GB for Qwen3-VL-8B-Instruct). A GPU is strongly recommended.
+
+## Model Inputs and Outputs
+
+TOPReward expects:
+
+- A trajectory video or sequence of frames.
+- A natural-language task description.
+
+In LeRobot datasets the preprocessor reads:
+
+| Config field              | Default                     | Meaning                                       |
+| ------------------------- | --------------------------- | --------------------------------------------- |
+| `reward_model.image_key`  | `observation.images.top`    | Camera observation used by TOPReward          |
+| `reward_model.task_key`   | `task`                      | Key in complementary data for the task string |
+| `reward_model.max_frames` | `16`                        | Cap on frames per sample                      |
+| `reward_model.fps`        | `2.0`                       | Metadata passed to the Qwen video processor   |
+| `reward_model.vlm_name`   | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM     |
+
+The model returns:
+
+- `compute_reward(batch)`: one log-probability per sample. Higher = better task-video alignment. When `success_threshold` is finite, returns the binary thresholded value instead.
+
+## Usage
+
+### Load the reward model directly
+
+```python
+from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel
+
+cfg = TOPRewardConfig(
+    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+    device="cuda",
+)
+reward_model = TOPRewardModel(cfg)
+```
+
+### Use the reward factory
+
+```python
+from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors
+
+cfg = make_reward_model_config(
+    "topreward",
+    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+    device="cuda",
+    image_key="observation.images.top",
+)
+reward_model = make_reward_model(cfg)
+preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
+```
+
+The preprocessor tokenises the full prompt (video + prefix + instruction suffix) and writes the Qwen-VL tensors plus `prompt_length` under `observation.topreward.*`. The model reads those tensors, label-masks based on `prompt_length`, and extracts the log-prob reward.
+
+### Offline dataset labeling
+
+Write a `topreward_progress.parquet` for RA-BC training and overlay videos:
+
+```bash
+# Sparse-dense (15 anchors per episode, matches upstream)
+uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
+    --dataset-repo-id lerobot/libero_10_image \
+    --num-samples 15 \
+    --device cuda
+```
+
+Then render the progress overlay for any episode:
+
+```bash
+uv run examples/dataset/create_progress_videos.py \
+    --repo-id lerobot/libero_10_image \
+    --episode 0 \
+    --progress-file topreward_progress.parquet \
+    --gif
+```
+
+## Configuration Notes
+
+### Prompt knobs
+
+The default prompt mirrors the upstream paper:
+
+```text
+prompt_prefix = "The above video shows a robot manipulation trajectory that completes the following task: "
+prompt_suffix_template = "{instruction} Decide whether the above statement is True or not. The answer is: True"
+```
+
+Both are exposed on `TOPRewardConfig` for ablation. The suffix template **must** contain `{instruction}`.
+
+### Chat template
+
+`add_chat_template=True` wraps the full prompt (including instruction) with the tokenizer's chat template before tokenisation. Default is `False`, matching the upstream paper's main experiments.
+
+## Limitations
+
+- The current LeRobot port is **inference-only and zero-shot**; `forward()` is not overridden and `is_trainable` returns `False`.
+- Only the **Qwen3-VL family** is supported; the other upstream clients (Gemini, OpenAI, Gemma, Molmo) have not been ported.
+- TOPReward inherits the underlying VLM's biases.
+
+## References
+
+- [TOPReward project page](https://topreward.github.io/webpage/)
+- [TOPReward paper](https://arxiv.org/abs/2602.19313)
+- [Original TOPReward code](https://github.com/TOPReward/TOPReward)
+- [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+
+## Citation
+
+```bibtex
+@article{chen2026topreward,
+  title={TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
+  author={Chen, Shirui and Harrison, Cole and Lee, Ying-Chun and Yang, Angela Jin and
+          Ren, Zhongzheng and Ratliff, Lillian J and Duan, Jiafei and Fox, Dieter and
+          Krishna, Ranjay},
+  journal={arXiv preprint arXiv:2602.19313},
+  year={2026}
+}
+```
+
+## License
+
+The original TOPReward codebase is MIT-licensed. The LeRobot port follows the LeRobot Apache 2.0 license; the wrapped Qwen3-VL weights are subject to the original Qwen license.
diff --git a/pyproject.toml b/pyproject.toml
index 5d182648c..264297c5e 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -211,6 +211,7 @@ groot = [
     "flash-attn>=2.5.9,<3.0.0 ; sys_platform != 'darwin'"
 ]
 sarm = ["lerobot[transformers-dep]", "pydantic>=2.0.0,<3.0.0", "faker>=33.0.0,<35.0.0", "lerobot[matplotlib-dep]", "lerobot[qwen-vl-utils-dep]"]
+topreward = ["lerobot[transformers-dep]"]
 xvla = ["lerobot[transformers-dep]"]
 eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
 hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
@@ -288,6 +289,7 @@ all = [
     "lerobot[libero]; sys_platform == 'linux'",
     "lerobot[metaworld]",
     "lerobot[sarm]",
+    "lerobot[topreward]",
     "lerobot[peft]",
     # "lerobot[unitree_g1]", TODO: Unitree requires specific installation instructions for unitree_sdk2
 ]
diff --git a/src/lerobot/rewards/__init__.py b/src/lerobot/rewards/__init__.py
index 203fe2ee1..ae23424e3 100644
--- a/src/lerobot/rewards/__init__.py
+++ b/src/lerobot/rewards/__init__.py
@@ -21,11 +21,13 @@ from .factory import (
 )
 from .pretrained import PreTrainedRewardModel as PreTrainedRewardModel
 from .sarm.configuration_sarm import SARMConfig as SARMConfig
+from .topreward.configuration_topreward import TOPRewardConfig as TOPRewardConfig
 
 __all__ = [
     # Configuration classes
     "RewardClassifierConfig",
     "SARMConfig",
+    "TOPRewardConfig",
     # Base class
     "PreTrainedRewardModel",
     # Factory functions
diff --git a/src/lerobot/rewards/factory.py b/src/lerobot/rewards/factory.py
index c173f44a5..d500cc593 100644
--- a/src/lerobot/rewards/factory.py
+++ b/src/lerobot/rewards/factory.py
@@ -26,6 +26,7 @@ from lerobot.processor import PolicyAction, PolicyProcessorPipeline
 from .classifier.configuration_classifier import RewardClassifierConfig
 from .pretrained import PreTrainedRewardModel
 from .sarm.configuration_sarm import SARMConfig
+from .topreward.configuration_topreward import TOPRewardConfig
 
 
 def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
@@ -37,7 +38,7 @@ def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
 
     Args:
         name: The name of the reward model. Supported names are "reward_classifier",
-              "sarm".
+              "sarm", "topreward".
 
     Returns:
         The reward model class corresponding to the given name.
@@ -53,6 +54,10 @@ def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
         from lerobot.rewards.sarm.modeling_sarm import SARMRewardModel
 
         return SARMRewardModel
+    elif name == "topreward":
+        from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+        return TOPRewardModel
     else:
         try:
             return _get_reward_model_cls_from_name(name=name)
@@ -69,7 +74,7 @@ def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:
 
     Args:
         reward_type: The type of the reward model. Supported types include
-                     "reward_classifier", "sarm".
+                     "reward_classifier", "sarm", "topreward".
         **kwargs: Keyword arguments to be passed to the configuration class constructor.
 
     Returns:
@@ -82,6 +87,8 @@ def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:
         return RewardClassifierConfig(**kwargs)
     elif reward_type == "sarm":
         return SARMConfig(**kwargs)
+    elif reward_type == "topreward":
+        return TOPRewardConfig(**kwargs)
     else:
         try:
             config_cls = RewardModelConfig.get_choice_class(reward_type)
@@ -162,6 +169,14 @@ def make_reward_pre_post_processors(
             dataset_meta=kwargs.get("dataset_meta"),
         )
 
+    elif isinstance(reward_cfg, TOPRewardConfig):
+        from lerobot.rewards.topreward.processor_topreward import make_topreward_pre_post_processors
+
+        return make_topreward_pre_post_processors(
+            config=reward_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+        )
+
     else:
         try:
             processors = _make_processors_from_reward_model_config(
diff --git a/src/lerobot/rewards/topreward/__init__.py b/src/lerobot/rewards/topreward/__init__.py
new file mode 100644
index 000000000..9b03ca866
--- /dev/null
+++ b/src/lerobot/rewards/topreward/__init__.py
@@ -0,0 +1,19 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .configuration_topreward import TOPRewardConfig
+from .modeling_topreward import TOPRewardModel
+from .processor_topreward import make_topreward_pre_post_processors
+
+__all__ = ["TOPRewardConfig", "TOPRewardModel", "make_topreward_pre_post_processors"]
diff --git a/src/lerobot/rewards/topreward/compute_rabc_weights.py b/src/lerobot/rewards/topreward/compute_rabc_weights.py
new file mode 100644
index 000000000..a448654e5
--- /dev/null
+++ b/src/lerobot/rewards/topreward/compute_rabc_weights.py
@@ -0,0 +1,353 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Compute per-frame TOPReward progress curves for a LeRobot dataset.
+
+For each episode, scores trajectory prefixes of increasing length using
+the TOPReward reward model, min-max normalises the raw log-prob rewards per episode,
+and writes a parquet file with one row per frame.
+
+The parquet uses the same schema as SARM's :mod:`lerobot.rewards.sarm.compute_rabc_weights`.
+
+Usage:
+    # Sparse-dense mode (15 anchors per episode, matches upstream)
+    python -m lerobot.rewards.topreward.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --num-samples 15
+
+    # Use a different VLM backbone
+    python -m lerobot.rewards.topreward.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --vlm-name Qwen/Qwen3-VL-4B-Instruct
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pyarrow as pa
+import pyarrow.parquet as pq
+import torch
+from tqdm import tqdm
+
+from lerobot.datasets import LeRobotDataset
+from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig
+from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+from lerobot.rewards.topreward.processor_topreward import TOPRewardEncoderProcessorStep
+from lerobot.types import TransitionKey
+
+DEFAULT_OUTPUT_FILENAME = "topreward_progress.parquet"
+
+
+def get_reward_model_path_from_parquet(parquet_path: Path) -> str | None:
+    """Read ``reward_model_path`` from parquet metadata if available."""
+    if not parquet_path.exists():
+        return None
+    try:
+        metadata = pq.read_metadata(parquet_path).schema.to_arrow_schema().metadata
+        if metadata and b"reward_model_path" in metadata:
+            return metadata[b"reward_model_path"].decode()
+    except Exception:  # nosec B110
+        return None
+    return None
+
+
+def _resolve_task(sample: dict[str, Any], default: str) -> str:
+    """Best-effort task extraction from a dataset sample."""
+    task = sample.get("task")
+    if isinstance(task, str) and task:
+        return task
+    return default
+
+
+def normalize_rewards(rewards: list[float] | np.ndarray) -> np.ndarray:
+    """Min-max normalise raw log-prob rewards into ``[0, 1]``."""
+    rewards_arr = np.asarray(rewards, dtype=np.float64)
+    if rewards_arr.size == 0:
+        return rewards_arr.astype(np.float32)
+    if rewards_arr.size == 1:
+        return np.array([1.0], dtype=np.float32)
+    r_min, r_max = rewards_arr.min(), rewards_arr.max()
+    if r_max == r_min:
+        return np.ones_like(rewards_arr, dtype=np.float32)
+    return ((rewards_arr - r_min) / (r_max - r_min)).astype(np.float32)
+
+
+def compute_instruction_rewards_for_prefixes(
+    model: TOPRewardModel,
+    encoder: TOPRewardEncoderProcessorStep,
+    dataset: LeRobotDataset,
+    ep_start: int,
+    num_frames: int,
+    task: str,
+    image_key: str,
+    num_samples: int | None,
+    device: str,
+) -> np.ndarray:
+    """Score an episode via prefix sweep and return a per-frame normalised curve."""
+    if num_samples is None or num_samples >= num_frames:
+        prefix_lengths = np.arange(1, num_frames + 1, dtype=np.int64)
+    else:
+        prefix_lengths = np.unique(np.linspace(1, num_frames, num_samples).round().astype(np.int64))
+
+    episode_frames = torch.stack([dataset[ep_start + i][image_key] for i in range(num_frames)])
+    rewards: list[float] = []
+    for length in prefix_lengths:
+        frames = episode_frames[: int(length)].unsqueeze(0)  # (1, T, C, H, W)
+
+        transition = {
+            TransitionKey.OBSERVATION: {image_key: frames},
+            TransitionKey.COMPLEMENTARY_DATA: {"task": task},
+        }
+        encoded = encoder(transition)
+        obs = encoded[TransitionKey.OBSERVATION]
+        batch = {
+            key: value.to(device) if isinstance(value, torch.Tensor) else value for key, value in obs.items()
+        }
+
+        with torch.no_grad():
+            reward = model.compute_reward(batch)
+        rewards.append(float(reward.item()))
+
+    normalized_rewards = normalize_rewards(rewards)
+
+    if prefix_lengths.shape[0] == num_frames:
+        return normalized_rewards
+
+    return np.interp(
+        np.arange(1, num_frames + 1, dtype=np.float64),
+        prefix_lengths.astype(np.float64),
+        normalized_rewards.astype(np.float64),
+    ).astype(np.float32)
+
+
+def compute_topreward_progress(
+    dataset_repo_id: str,
+    reward_model_path: str | None = None,
+    vlm_name: str | None = None,
+    output_path: str | None = None,
+    device: str = "cuda",
+    num_samples: int | None = None,
+    fps: float | None = None,
+    episodes: list[int] | None = None,
+) -> Path:
+    """Run TOPReward over a dataset and write per-frame progress."""
+    if reward_model_path is not None:
+        logging.info(f"Loading TOPReward config from: {reward_model_path}")
+        model = TOPRewardModel.from_pretrained(reward_model_path)
+        config = model.config
+        config.device = device
+        if vlm_name is not None and vlm_name != config.vlm_name:
+            logging.info(f"Overriding vlm_name from config: {config.vlm_name} -> {vlm_name}")
+            config.vlm_name = vlm_name
+            model = TOPRewardModel(config)
+    else:
+        config_kwargs: dict[str, Any] = {"device": device}
+        if vlm_name is not None:
+            config_kwargs["vlm_name"] = vlm_name
+        if fps is not None:
+            config_kwargs["fps"] = fps
+        config = TOPRewardConfig(**config_kwargs)
+        logging.info(f"Constructing TOPReward with VLM: {config.vlm_name}")
+        model = TOPRewardModel(config)
+
+    model.to(device).eval()
+
+    encoder = TOPRewardEncoderProcessorStep(
+        vlm_name=config.vlm_name,
+        image_key=config.image_key,
+        task_key=config.task_key,
+        default_task=config.default_task,
+        max_frames=None,  # no tail-crop: we control prefix length explicitly
+        fps=config.fps,
+        prompt_prefix=config.prompt_prefix,
+        prompt_suffix_template=config.prompt_suffix_template,
+        add_chat_template=config.add_chat_template,
+        max_length=config.max_input_length,
+    )
+
+    image_key = config.image_key
+
+    logging.info(f"Loading dataset: {dataset_repo_id}")
+    dataset = LeRobotDataset(dataset_repo_id, download_videos=True)
+    logging.info(f"Dataset: {dataset.num_episodes} episodes, {dataset.num_frames} frames")
+
+    episode_indices = list(range(dataset.num_episodes)) if episodes is None else episodes
+    logging.info(f"Processing {len(episode_indices)} episode(s)")
+
+    all_index: list[int] = []
+    all_episode: list[int] = []
+    all_frame: list[int] = []
+    all_progress: list[float] = []
+
+    for episode_idx in tqdm(episode_indices, desc="Episodes"):
+        ep = dataset.meta.episodes[episode_idx]
+        ep_start = int(ep["dataset_from_index"])
+        ep_end = int(ep["dataset_to_index"])
+        num_frames = ep_end - ep_start
+        if num_frames <= 0:
+            continue
+
+        first_sample = dataset[ep_start]
+        task = _resolve_task(first_sample, default=config.default_task or "perform the task")
+
+        per_frame = compute_instruction_rewards_for_prefixes(
+            model=model,
+            encoder=encoder,
+            dataset=dataset,
+            ep_start=ep_start,
+            num_frames=num_frames,
+            task=task,
+            image_key=image_key,
+            num_samples=num_samples,
+            device=device,
+        )
+
+        for local in range(num_frames):
+            all_index.append(ep_start + local)
+            all_episode.append(episode_idx)
+            all_frame.append(local)
+            all_progress.append(float(per_frame[local]))
+
+        if device.startswith("cuda"):
+            torch.cuda.empty_cache()
+
+    table = pa.table(
+        {
+            "index": np.asarray(all_index, dtype=np.int64),
+            "episode_index": np.asarray(all_episode, dtype=np.int64),
+            "frame_index": np.asarray(all_frame, dtype=np.int64),
+            "progress_sparse": np.asarray(all_progress, dtype=np.float32),
+        }
+    )
+
+    schema_metadata: dict[bytes, bytes] = {b"vlm_name": config.vlm_name.encode()}
+    if reward_model_path is not None:
+        schema_metadata[b"reward_model_path"] = reward_model_path.encode()
+    table = table.replace_schema_metadata(schema_metadata)
+
+    out = Path(dataset.root) / DEFAULT_OUTPUT_FILENAME if output_path is None else Path(output_path)
+    out.parent.mkdir(parents=True, exist_ok=True)
+    pq.write_table(table, out)
+    logging.info(f"Saved {len(table)} frame values to {out}")
+
+    progress_arr = np.asarray(all_progress, dtype=np.float32)
+    if progress_arr.size:
+        logging.info(
+            f"Progress: mean={float(progress_arr.mean()):.4f}, "
+            f"std={float(progress_arr.std()):.4f}, "
+            f"min={float(progress_arr.min()):.4f}, "
+            f"max={float(progress_arr.max()):.4f}"
+        )
+    return out
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Compute per-frame TOPReward progress curves for RA-BC weighting.",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+    # Sparse-dense mode (matches upstream TOPReward num_samples=15)
+    python -m lerobot.rewards.topreward.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --num-samples 15
+
+    # Use a smaller VLM
+    python -m lerobot.rewards.topreward.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --vlm-name Qwen/Qwen3-VL-4B-Instruct
+        """,
+    )
+    parser.add_argument(
+        "--dataset-repo-id", type=str, required=True, help="HuggingFace dataset repo id or local path."
+    )
+    parser.add_argument(
+        "--reward-model-path", type=str, default=None, help="Optional TOPReward LeRobot config."
+    )
+    parser.add_argument("--vlm-name", type=str, default=None, help="Override the VLM backbone (HF Hub id).")
+    parser.add_argument("--output-path", type=str, default=None, help="Output parquet path.")
+    parser.add_argument("--device", type=str, default="cuda", help="Device to use (default: cuda).")
+    parser.add_argument(
+        "--num-samples",
+        type=int,
+        default=None,
+        help="Anchor prefix samples per episode. None = dense. 15 matches upstream.",
+    )
+    parser.add_argument(
+        "--episodes",
+        type=int,
+        nargs="+",
+        default=None,
+        help="Process only these episode indices (e.g. --episodes 0 or --episodes 0 5 10).",
+    )
+    parser.add_argument("--fps", type=float, default=None, help="Override TOPRewardConfig.fps.")
+    parser.add_argument(
+        "--push-to-hub", action="store_true", help="Upload to the dataset repo on HuggingFace Hub."
+    )
+
+    args = parser.parse_args()
+
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+
+    output_path = compute_topreward_progress(
+        dataset_repo_id=args.dataset_repo_id,
+        reward_model_path=args.reward_model_path,
+        vlm_name=args.vlm_name,
+        output_path=args.output_path,
+        device=args.device,
+        num_samples=args.num_samples,
+        fps=args.fps,
+        episodes=args.episodes,
+    )
+
+    print(f"\nTOPReward progress saved to: {output_path}")
+
+    if args.push_to_hub:
+        from huggingface_hub import HfApi
+
+        api = HfApi()
+        hub_path = DEFAULT_OUTPUT_FILENAME
+
+        print(f"\nUploading to Hub: {args.dataset_repo_id}/{hub_path}")
+        api.upload_file(
+            path_or_fileobj=str(output_path),
+            path_in_repo=hub_path,
+            repo_id=args.dataset_repo_id,
+            repo_type="dataset",
+        )
+        print(
+            "Successfully uploaded to: "
+            f"https://huggingface.co/datasets/{args.dataset_repo_id}/blob/main/{hub_path}"
+        )
+
+        print("\nTo use in training, add to your config:")
+        print("  use_rabc: true")
+        print(f"  rabc_progress_path: hf://datasets/{args.dataset_repo_id}/{hub_path}")
+        print("  rabc_head_mode: sparse")
+    else:
+        print("\nTo use in training, add to your config:")
+        print("  use_rabc: true")
+        print(f"  rabc_progress_path: {output_path}")
+        print("  rabc_head_mode: sparse")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/src/lerobot/rewards/topreward/configuration_topreward.py b/src/lerobot/rewards/topreward/configuration_topreward.py
new file mode 100644
index 000000000..7302734c8
--- /dev/null
+++ b/src/lerobot/rewards/topreward/configuration_topreward.py
@@ -0,0 +1,146 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature
+from lerobot.configs.rewards import RewardModelConfig
+from lerobot.utils.constants import OBS_IMAGES
+
+# Default prompt scaffolding from the upstream TOPReward paper / reference
+# implementation (``QwenClient.compute_instruction_reward``). The prompt
+# scores the terminal ``True`` token in ``f"{instruction} ... True"``
+# given the video.
+DEFAULT_PROMPT_PREFIX = (
+    "The above video shows a robot manipulation trajectory that completes the following task: "
+)
+DEFAULT_PROMPT_SUFFIX_TEMPLATE = (
+    "{instruction} Decide whether the above statement is True or not. The answer is: True"
+)
+
+
+@RewardModelConfig.register_subclass("topreward")
+@dataclass
+class TOPRewardConfig(RewardModelConfig):
+    """Configuration for the TOPReward zero-shot reward model.
+
+    TOPReward is **zero-shot**: it has no learnable parameters of its own.
+    The "model" is a generic vision-language model (default
+    ``Qwen/Qwen3-VL-8B-Instruct``) used with a fixed prompt to extract
+    token log-probabilities as a reward signal. There is therefore no
+    fine-tuned checkpoint to host: ``pretrained_path`` is unused at
+    runtime — the model identity is :attr:`vlm_name` (an HF Hub id).
+
+    Args:
+        vlm_name: Hugging Face Hub id of the underlying VLM. Must be a
+            Qwen3-VL family model (the only client implemented in this
+            LeRobot port).
+        torch_dtype: Torch dtype name passed to the VLM loader
+            (``"auto"``, ``"bfloat16"``, ``"float16"``, ...).
+        attn_implementation: ``transformers`` attention implementation
+            (e.g. ``"flash_attention_2"``, ``"sdpa"``). Defaults to
+            ``None`` so that ``transformers`` selects its own default.
+        image_key: Observation key that holds the trajectory frames.
+        task_key: Complementary-data key that holds the task instruction.
+        default_task: Fallback instruction when ``task_key`` is absent.
+        max_frames: Cap on the number of frames fed to the VLM per
+            sample. ``None`` = use all frames.
+        fps: Frames-per-second metadata for the Qwen video processor.
+        prompt_prefix: Text shown to the VLM right after the video and
+            before the suffix template.
+        prompt_suffix_template: Suffix appended after ``prompt_prefix``.
+            Must contain ``{instruction}``; with the default template the
+            VLM scores the log-probability of the terminal ``True`` token.
+        add_chat_template: If ``True``, wrap the full prompt with the
+            tokenizer's chat template before tokenisation (matches
+            upstream ``add_chat_template=True``).
+        success_threshold: Optional log-prob threshold. If finite,
+            :meth:`TOPRewardModel.compute_reward` returns
+            ``(reward > success_threshold).float()`` instead of the raw
+            log-prob.
+        max_input_length: Hard limit on the total tokenized input length;
+            samples that exceed it raise a ``ValueError``.
+    """
+
+    # Path to a local LeRobot dir or HF repo that holds a ``config.json``
+    # snapshot of this TOPRewardConfig. The VLM weights themselves are
+    # always identified by ``vlm_name``.
+    pretrained_path: str | None = None
+
+    vlm_name: str = "Qwen/Qwen3-VL-8B-Instruct"
+    torch_dtype: str = "auto"
+    attn_implementation: str | None = None
+
+    image_key: str = OBS_IMAGES + ".top"
+    task_key: str = "task"
+    default_task: str | None = None
+    max_frames: int | None = 16
+    fps: float = 2.0
+
+    prompt_prefix: str = DEFAULT_PROMPT_PREFIX
+    prompt_suffix_template: str = DEFAULT_PROMPT_SUFFIX_TEMPLATE
+    add_chat_template: bool = False
+
+    success_threshold: float = float("-inf")
+    max_input_length: int = 32768
+
+    license: str | None = "mit"  # matches upstream TOPReward
+    tags: list[str] | None = field(
+        default_factory=lambda: ["reward-model", "vision-language", "qwen3-vl", "zero-shot"]
+    )
+
+    input_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    output_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    normalization_mapping: dict[str, NormalizationMode] = field(
+        default_factory=lambda: {
+            "VISUAL": NormalizationMode.IDENTITY,
+            "REWARD": NormalizationMode.IDENTITY,
+        }
+    )
+
+    def __post_init__(self) -> None:
+        super().__post_init__()
+        if self.max_frames is not None and self.max_frames < 1:
+            raise ValueError(f"max_frames must be >= 1, got {self.max_frames}")
+        if self.fps <= 0:
+            raise ValueError(f"fps must be > 0, got {self.fps}")
+        if "{instruction}" not in self.prompt_suffix_template:
+            raise ValueError(
+                "prompt_suffix_template must contain `{instruction}` so the model "
+                "scores the log-likelihood of the task suffix."
+            )
+        if self.max_input_length <= 0:
+            raise ValueError(f"max_input_length must be > 0, got {self.max_input_length}")
+
+        if self.image_key not in self.input_features:
+            self.input_features[self.image_key] = PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL)
+        self.output_features.setdefault("reward", PolicyFeature(shape=(1,), type=FeatureType.REWARD))
+
+    @property
+    def observation_delta_indices(self) -> list[int] | None:
+        return None
+
+    @property
+    def action_delta_indices(self) -> None:
+        return None
+
+    @property
+    def reward_delta_indices(self) -> None:
+        return None
+
+    def validate_features(self) -> None:
+        if self.image_key not in self.input_features:
+            raise ValueError(f"TOPReward requires image input feature {self.image_key!r}")
diff --git a/src/lerobot/rewards/topreward/modeling_topreward.py b/src/lerobot/rewards/topreward/modeling_topreward.py
new file mode 100644
index 000000000..4958d5449
--- /dev/null
+++ b/src/lerobot/rewards/topreward/modeling_topreward.py
@@ -0,0 +1,238 @@
+# Copyright 2026 Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang,
+# Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics.
+
+Paper:         https://arxiv.org/abs/2602.19313
+Project:       https://topreward.github.io/webpage/
+Original code: https://github.com/TOPReward/TOPReward
+Backbone:      https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct  (default)
+
+TOPReward is a **zero-shot** reward model: it has no fine-tuned weights of
+its own. Given a video trajectory and a task instruction, it asks an
+off-the-shelf VLM how likely the instruction is, conditioned on the video,
+and returns that log-likelihood as the reward signal.
+
+Inference recipe:
+
+1. The processor builds a chat-style prompt, tokenises it, and emits
+   ``input_ids``, ``attention_mask``, vision tensors, and ``labels``.
+   The processor label-masks everything except the terminal answer token with
+   ``-100``.
+2. Forward the full token sequence through the VLM.
+3. Read the terminal answer token log-probability from the logits as the
+   scalar reward.
+
+With the default ``prompt_suffix_template``, the only unmasked token is the
+literal ``"True"`` at the end — the reward is
+``log P("True" | video + prompt + instruction)``.
+
+This LeRobot port is **inference-only and not trainable** — :meth:`forward`
+is intentionally inherited from :class:`PreTrainedRewardModel` and raises
+``NotImplementedError``, making :attr:`PreTrainedRewardModel.is_trainable`
+return ``False``.
+
+Because the VLM weights live on the Hugging Face Hub under their canonical
+id (``Qwen/Qwen3-VL-8B-Instruct`` etc.) and TOPReward never modifies them,
+:meth:`_save_pretrained` and :meth:`from_pretrained` are overridden so a
+TOPReward LeRobot "checkpoint" is a single ``config.json`` (the VLM is
+re-fetched from the Hub at load time).
+"""
+
+from __future__ import annotations
+
+import builtins
+import logging
+import os
+from pathlib import Path
+from tempfile import TemporaryDirectory
+from typing import TYPE_CHECKING, Any, TypeVar
+
+import numpy as np
+import torch
+from huggingface_hub import HfApi, hf_hub_download
+from huggingface_hub.constants import CONFIG_NAME
+from huggingface_hub.errors import HfHubHTTPError
+from torch import Tensor
+from torch.nn.functional import cross_entropy
+
+from lerobot.configs.rewards import RewardModelConfig
+from lerobot.rewards.pretrained import PreTrainedRewardModel
+from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig
+from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX, TOPREWARD_INPUT_KEYS
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING:
+    from lerobot.configs.train import TrainPipelineConfig
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import Qwen3VLForConditionalGeneration
+else:
+    Qwen3VLForConditionalGeneration = None  # type: ignore[assignment]
+
+logger = logging.getLogger(__name__)
+
+T = TypeVar("T", bound="TOPRewardModel")
+
+
+def _torch_dtype(name: str) -> torch.dtype | str:
+    """Resolve a torch dtype name; ``"auto"`` is passed through verbatim."""
+    if name == "auto":
+        return "auto"
+    dtype = getattr(torch, name, None)
+    if isinstance(dtype, torch.dtype):
+        return dtype
+    raise ValueError(f"Unknown torch dtype: {name!r}")
+
+
+class TOPRewardModel(PreTrainedRewardModel):
+    """TOPReward zero-shot reward model."""
+
+    name = "topreward"
+    config_class = TOPRewardConfig
+
+    def __init__(self, config: TOPRewardConfig) -> None:
+        require_package("transformers", extra="topreward")
+        super().__init__(config)
+        self.config = config
+
+        torch_dtype = _torch_dtype(config.torch_dtype)
+        model_kwargs: dict[str, Any] = {"dtype": torch_dtype, "trust_remote_code": True}
+        if config.attn_implementation is not None:
+            model_kwargs["attn_implementation"] = config.attn_implementation
+
+        self.model = Qwen3VLForConditionalGeneration.from_pretrained(config.vlm_name, **model_kwargs)
+
+    def compute_reward(self, batch: dict[str, Any]) -> Tensor:
+        """Return one log-prob reward per sample in the batch."""
+        inputs: dict[str, Any] = {}
+        for key in TOPREWARD_INPUT_KEYS:
+            batch_key = f"{TOPREWARD_FEATURE_PREFIX}{key}"
+            if batch_key not in batch:
+                raise KeyError(
+                    f"TOPReward batch missing `{batch_key}`. Make sure the "
+                    "TOPRewardEncoderProcessorStep ran before `compute_reward`."
+                )
+            inputs[key] = batch[batch_key]
+
+        device = next(self.model.parameters()).device
+        inputs = {key: value.to(device) if hasattr(value, "to") else value for key, value in inputs.items()}
+        labels = inputs.pop("labels")
+        inputs["logits_to_keep"] = 2
+
+        self.eval()
+        with torch.no_grad():
+            outputs = self.model(**inputs)
+        logits = outputs.logits
+        rewards = -cross_entropy(logits[:, -2, :].float(), labels[:, -1], reduction="none")
+        if np.isfinite(self.config.success_threshold):
+            rewards = (rewards > self.config.success_threshold).float()
+        return rewards.to(self.config.device or "cpu")
+
+    def _save_pretrained(self, save_directory: Path) -> None:
+        """Save ``config.json`` only."""
+        self.config._save_pretrained(save_directory)
+
+    @classmethod
+    def from_pretrained(
+        cls: builtins.type[T],
+        pretrained_name_or_path: str | Path,
+        *,
+        config: RewardModelConfig | None = None,
+        force_download: bool = False,
+        resume_download: bool | None = None,
+        proxies: dict | None = None,
+        token: str | bool | None = None,
+        cache_dir: str | Path | None = None,
+        local_files_only: bool = False,
+        revision: str | None = None,
+        strict: bool = False,  # noqa: ARG003 — accepted for API parity; unused (no safetensors to load)
+        **kwargs: Any,
+    ) -> T:
+        """Load a TOPReward configuration and instantiate the wrapped VLM."""
+        if config is None:
+            config = RewardModelConfig.from_pretrained(
+                pretrained_name_or_path=pretrained_name_or_path,
+                force_download=force_download,
+                resume_download=resume_download,
+                proxies=proxies,
+                token=token,
+                cache_dir=cache_dir,
+                local_files_only=local_files_only,
+                revision=revision,
+                **kwargs,
+            )
+        if not isinstance(config, TOPRewardConfig):
+            raise TypeError(
+                f"Expected a TOPRewardConfig, got {type(config).__name__}. Make sure "
+                f"`pretrained_name_or_path={pretrained_name_or_path!r}` points at a "
+                "TOPReward checkpoint."
+            )
+
+        model_id = str(pretrained_name_or_path)
+        if not os.path.isdir(model_id):
+            try:
+                hf_hub_download(
+                    repo_id=model_id,
+                    filename=CONFIG_NAME,
+                    revision=revision,
+                    cache_dir=cache_dir,
+                    force_download=force_download,
+                    proxies=proxies,
+                    resume_download=resume_download,
+                    token=token,
+                    local_files_only=local_files_only,
+                )
+            except HfHubHTTPError as e:
+                raise FileNotFoundError(
+                    f"{CONFIG_NAME} not found on the HuggingFace Hub in {model_id}"
+                ) from e
+
+        instance = cls(config, **kwargs)
+        instance.to(config.device)
+        instance.eval()
+        return instance
+
+    def push_model_to_hub(self, cfg: TrainPipelineConfig):
+        """Push the TOPReward ``config.json``, train config, and model card to the Hub."""
+        api = HfApi()
+        repo_id = api.create_repo(
+            repo_id=self.config.repo_id, private=self.config.private, exist_ok=True
+        ).repo_id
+
+        with TemporaryDirectory(ignore_cleanup_errors=True) as tmp:
+            saved_path = Path(tmp) / repo_id
+            saved_path.mkdir(parents=True, exist_ok=True)
+
+            self.config._save_pretrained(saved_path)
+
+            card = self.generate_model_card(
+                cfg.dataset.repo_id, self.config.type, self.config.license, self.config.tags
+            )
+            card.save(str(saved_path / "README.md"))
+
+            cfg.save_pretrained(saved_path)
+
+            commit_info = api.upload_folder(
+                repo_id=repo_id,
+                repo_type="model",
+                folder_path=saved_path,
+                commit_message="Upload TOPReward config and readme",
+                allow_patterns=["*.json", "*.yaml", "*.md"],
+                ignore_patterns=["*.tmp", "*.log", "*.safetensors"],
+            )
+
+            logger.info(f"Model pushed to {commit_info.repo_url.url}")
diff --git a/src/lerobot/rewards/topreward/processor_topreward.py b/src/lerobot/rewards/topreward/processor_topreward.py
new file mode 100644
index 000000000..ff0646e49
--- /dev/null
+++ b/src/lerobot/rewards/topreward/processor_topreward.py
@@ -0,0 +1,305 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""TOPReward pre/post processing pipeline."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+import torch
+from torch import Tensor
+
+from lerobot.configs import PipelineFeatureType, PolicyFeature
+from lerobot.processor import (
+    AddBatchDimensionProcessorStep,
+    DeviceProcessorStep,
+    PolicyAction,
+    PolicyProcessorPipeline,
+    ProcessorStep,
+    ProcessorStepRegistry,
+    policy_action_to_transition,
+)
+from lerobot.rewards.topreward.configuration_topreward import (
+    DEFAULT_PROMPT_PREFIX,
+    DEFAULT_PROMPT_SUFFIX_TEMPLATE,
+    TOPRewardConfig,
+)
+from lerobot.types import EnvTransition, TransitionKey
+from lerobot.utils.constants import (
+    OBS_IMAGES,
+    OBS_PREFIX,
+    POLICY_POSTPROCESSOR_DEFAULT_NAME,
+    POLICY_PREPROCESSOR_DEFAULT_NAME,
+)
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import AutoProcessor
+else:
+    AutoProcessor = None
+
+TOPREWARD_FEATURE_PREFIX = f"{OBS_PREFIX}topreward."
+
+_TRUE_ANSWER = "True"
+
+TOPREWARD_VLM_INPUT_KEYS = (
+    "input_ids",
+    "attention_mask",
+    "pixel_values_videos",
+    "video_grid_thw",
+    "mm_token_type_ids",
+)
+TOPREWARD_INPUT_KEYS = TOPREWARD_VLM_INPUT_KEYS + ("labels",)
+
+
+def _prepare_video_batch(video: Tensor, *, max_frames: int | None) -> Tensor:
+    """Return ``(B, T, C, H, W)`` uint8 videos for Qwen3-VL; floats are assumed to lie in [0, 1]."""
+    if video.ndim == 4:
+        video = video.unsqueeze(1)
+    elif video.ndim != 5:
+        raise ValueError(
+            f"Expected TOPReward frames with shape (B,C,H,W) or (B,T,C,H,W); got {tuple(video.shape)}"
+        )
+
+    if max_frames is not None:
+        video = video[:, -max_frames:]
+    if video.shape[-1] in (1, 3):
+        video = video.permute(0, 1, 4, 2, 3)
+    elif video.shape[2] not in (1, 3):
+        raise ValueError(f"Expected channel dim of size 1 or 3, got shape {tuple(video.shape)}")
+
+    if video.is_floating_point():
+        video = video * 255.0
+
+    return video.clamp(0, 255).to(torch.uint8).contiguous()
+
+
+def _expand_tasks(task: Any, *, batch_size: int, default: str | None) -> list[str]:
+    if task is None:
+        task = default
+    if task is None:
+        raise KeyError("TOPReward expected a task description in complementary data")
+    if isinstance(task, str):
+        return [task] * batch_size
+    if isinstance(task, tuple):
+        task = list(task)
+    if not (isinstance(task, list) and all(isinstance(item, str) for item in task)):
+        raise TypeError(f"TOPReward task must be a string or list of strings, got {type(task)}")
+    if len(task) == 1 and batch_size > 1:
+        return task * batch_size
+    if len(task) != batch_size:
+        raise ValueError(f"Expected {batch_size} tasks, got {len(task)}")
+    return task
+
+
+@dataclass
+@ProcessorStepRegistry.register(name="topreward_encoder")
+class TOPRewardEncoderProcessorStep(ProcessorStep):
+    """Encode raw frames + task into Qwen-VL tensors for the TOPReward model.
+
+    Loads a :class:`~transformers.AutoProcessor` matching ``vlm_name`` and
+    builds the full chat prompt including the instruction suffix. The
+    resulting ``input_ids``, ``attention_mask``, vision tensors, and
+    ``labels`` are written under the ``observation.topreward.*`` namespace
+    so the model can score without re-tokenising.
+
+    At call time the step reads:
+
+    - ``observation[image_key]``: ``(B, T, C, H, W)`` or ``(B, C, H, W)`` frames.
+    - ``complementary_data[task_key]``: a string or list of strings.
+
+    and writes ``observation[f"{TOPREWARD_FEATURE_PREFIX}<name>"]`` for the
+    Qwen-VL tensors plus ``labels``.
+    """
+
+    vlm_name: str = "Qwen/Qwen3-VL-8B-Instruct"
+    image_key: str = OBS_IMAGES + ".top"
+    task_key: str = "task"
+    default_task: str | None = None
+    max_frames: int | None = 16
+    fps: float = 2.0
+    prompt_prefix: str = DEFAULT_PROMPT_PREFIX
+    prompt_suffix_template: str = DEFAULT_PROMPT_SUFFIX_TEMPLATE
+    add_chat_template: bool = False
+    max_length: int = 32768
+
+    _processor: Any = field(default=None, init=False, repr=False)
+
+    def __post_init__(self) -> None:
+        require_package("transformers", extra="topreward")
+        self._processor = AutoProcessor.from_pretrained(self.vlm_name, trust_remote_code=True)
+
+    def __call__(self, transition: EnvTransition) -> EnvTransition:
+        observation = transition.get(TransitionKey.OBSERVATION)
+        complementary = transition.get(TransitionKey.COMPLEMENTARY_DATA) or {}
+        if self.image_key not in observation:
+            raise KeyError(f"TOPReward expected image key {self.image_key!r} in observation")
+
+        frames = observation[self.image_key]
+        videos = frames.detach().cpu() if isinstance(frames, Tensor) else torch.as_tensor(frames)
+        videos = _prepare_video_batch(videos, max_frames=self.max_frames)
+
+        batch_size = videos.shape[0]
+        tasks = _expand_tasks(
+            complementary.get(self.task_key, self.default_task),
+            batch_size=batch_size,
+            default=self.default_task,
+        )
+
+        encoded = self._encode_batch(videos, tasks, batch_size)
+
+        new_observation = dict(observation)
+        for key, value in encoded.items():
+            new_observation[f"{TOPREWARD_FEATURE_PREFIX}{key}"] = value
+
+        new_transition = transition.copy()
+        new_transition[TransitionKey.OBSERVATION] = new_observation
+        return new_transition
+
+    def _encode_batch(self, videos: Tensor, tasks: list[str], batch_size: int) -> dict[str, Any]:
+        """Tokenise a batch of (frames, task) pairs into Qwen-VL tensors.
+
+        The loop only builds per-sample chat strings. Tokenisation, padding,
+        video preprocessing, and label construction are batched.
+        """
+
+        texts: list[str] = []
+        video_metadata = [
+            {
+                "total_num_frames": int(videos.shape[1]),
+                "fps": float(self.fps),
+                "frames_indices": list(range(int(videos.shape[1]))),
+            }
+            for _ in range(batch_size)
+        ]
+        eos_token = self._processor.tokenizer.eos_token
+
+        for i in range(batch_size):
+            instruction_suffix = self.prompt_suffix_template.format(instruction=tasks[i])
+            if self.add_chat_template:
+                suffix_for_template = instruction_suffix.removesuffix(_TRUE_ANSWER).rstrip()
+                templated_messages = [
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "video", "video": videos[i], "fps": self.fps},
+                            {"type": "text", "text": f"{self.prompt_prefix}{suffix_for_template}"},
+                        ],
+                    }
+                ]
+                prompt_chat = self._processor.apply_chat_template(
+                    templated_messages, tokenize=False, add_generation_prompt=True
+                )
+                full_text = f"{prompt_chat}{_TRUE_ANSWER}"
+            else:
+                user_messages = [
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "video", "video": videos[i], "fps": self.fps},
+                            {"type": "text", "text": self.prompt_prefix},
+                        ],
+                    }
+                ]
+                prompt_chat = self._processor.apply_chat_template(
+                    user_messages, tokenize=False, add_generation_prompt=False
+                )
+                if eos_token is not None:
+                    prompt_chat = prompt_chat.split(eos_token)[0]
+                full_text = f"{prompt_chat}{instruction_suffix}"
+
+            texts.append(full_text)
+
+        result = self._processor(
+            text=texts,
+            videos=videos,
+            video_metadata=video_metadata,
+            do_sample_frames=False,
+            padding=True,
+            padding_side="left",
+            return_tensors="pt",
+        )
+        input_ids = result["input_ids"]
+
+        if input_ids.shape[-1] > self.max_length:
+            raise ValueError(
+                f"TOPReward input length {input_ids.shape[-1]} exceeds max_length "
+                f"{self.max_length}; lower `max_frames` or raise `max_length`."
+            )
+
+        labels = torch.full_like(input_ids, -100)
+        labels[:, -1] = input_ids[:, -1]
+        result["labels"] = labels
+        return result
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        return features
+
+    def get_config(self) -> dict[str, Any]:
+        return {
+            "vlm_name": self.vlm_name,
+            "image_key": self.image_key,
+            "task_key": self.task_key,
+            "default_task": self.default_task,
+            "max_frames": self.max_frames,
+            "fps": self.fps,
+            "prompt_prefix": self.prompt_prefix,
+            "prompt_suffix_template": self.prompt_suffix_template,
+            "add_chat_template": self.add_chat_template,
+            "max_length": self.max_length,
+        }
+
+
+def make_topreward_pre_post_processors(
+    config: TOPRewardConfig,
+    dataset_stats: dict[str, dict[str, Any]] | None = None,
+) -> tuple[
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    """Build the pipelines that pre-encode frames + task into Qwen-VL tensors.
+
+    The preprocessor adds a batch dimension if needed, runs TOPReward's
+    encoder (which tokenises the full prompt and emits ``labels``), and
+    moves everything to the configured device. The postprocessor is
+    the identity since TOPReward outputs a single reward tensor.
+    """
+    preprocessor = PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+        steps=[
+            AddBatchDimensionProcessorStep(),
+            TOPRewardEncoderProcessorStep(
+                vlm_name=config.vlm_name,
+                image_key=config.image_key,
+                task_key=config.task_key,
+                default_task=config.default_task,
+                max_frames=config.max_frames,
+                fps=config.fps,
+                prompt_prefix=config.prompt_prefix,
+                prompt_suffix_template=config.prompt_suffix_template,
+                add_chat_template=config.add_chat_template,
+                max_length=config.max_input_length,
+            ),
+            DeviceProcessorStep(device=config.device or "cpu"),
+        ],
+        name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+    )
+    postprocessor = PolicyProcessorPipeline(
+        name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+        to_transition=policy_action_to_transition,
+    )
+    return preprocessor, postprocessor
diff --git a/src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md b/src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
index 933bf7586..11df95de5 100644
--- a/src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
+++ b/src/lerobot/templates/lerobot_rewardmodel_modelcard_template.md
@@ -13,6 +13,8 @@
 A reward classifier is a lightweight neural network that scores observations or trajectories for task success, providing a learned reward signal or offline evaluation when explicit rewards are unavailable.
 {% elif model_name == "sarm" %}
 A Success-Aware Reward Model (SARM) predicts a dense reward signal from observations, typically used downstream for reinforcement learning or human-in-the-loop fine-tuning when task success is not directly observable.
+{% elif model_name == "topreward" %}
+TOPReward is a **zero-shot** reward model that extracts token log-probabilities from an off-the-shelf vision-language model (default Qwen3-VL) as a reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood of the instruction being true, with no fine-tuning required.
 {% else %}
 _Reward model type not recognized — please update this template._
 {% endif %}
diff --git a/tests/rewards/test_modeling_topreward.py b/tests/rewards/test_modeling_topreward.py
new file mode 100644
index 000000000..0cd185e12
--- /dev/null
+++ b/tests/rewards/test_modeling_topreward.py
@@ -0,0 +1,296 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for the TOPReward reward model."""
+
+from __future__ import annotations
+
+from types import SimpleNamespace
+
+import pytest
+import torch
+
+from lerobot.configs.rewards import RewardModelConfig
+from lerobot.rewards.factory import get_reward_model_class, make_reward_model_config
+from lerobot.rewards.topreward import TOPRewardConfig
+from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX, TOPREWARD_INPUT_KEYS
+from tests.utils import skip_if_package_missing
+
+
+class _FakeQwenModel(torch.nn.Module):
+    """Stand-in for ``Qwen3VLForConditionalGeneration``.
+
+    Returns a ``SimpleNamespace`` with ``logits`` of a controlled shape so
+    the log-prob extraction path in ``compute_reward`` can be exercised
+    without downloading real VLM weights.
+    """
+
+    def __init__(self) -> None:
+        super().__init__()
+        self._param = torch.nn.Parameter(torch.zeros(1))
+        self._reward_value: float = -1.5
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):  # noqa: ARG003
+        return cls()
+
+    def forward(  # noqa: ARG002
+        self, input_ids, attention_mask=None, labels=None, logits_to_keep=0, **kwargs
+    ):
+        batch_size, seq_len = input_ids.shape
+        vocab_size = 1000
+        logits = torch.zeros(batch_size, seq_len, vocab_size)
+        # Place a large logit at the target token position so the scored
+        # log-prob is predictable (close to 0).
+        # The only unmasked label is the last token.
+        # After the causal-LM shift (logits[:, :-1], labels[:, 1:]) the scored
+        # position is logits[:, -2, :] predicting labels[:, -1].
+        # With _reward_value = -1.5 the target logit is 15 and all others are 0.
+        for i in range(batch_size):
+            target_idx = int(input_ids[i, -1].item())
+            logits[i, -2, target_idx] = self._reward_value * -10  # high logit -> high log-prob
+        if logits_to_keep:
+            logits = logits[:, -logits_to_keep:, :]
+        return SimpleNamespace(logits=logits)
+
+
+def _patch_build(monkeypatch) -> None:
+    """Stub out ``Qwen3VLForConditionalGeneration`` so TOPReward construction is cheap and offline."""
+    from lerobot.rewards.topreward import modeling_topreward
+
+    monkeypatch.setattr(modeling_topreward, "Qwen3VLForConditionalGeneration", _FakeQwenModel)
+
+
+def _make_batch(
+    input_ids: torch.Tensor,
+    attention_mask: torch.Tensor | None = None,
+    labels: torch.Tensor | None = None,
+    *,
+    omit: str | None = None,
+) -> dict[str, torch.Tensor]:
+    """Build a ``compute_reward``-ready batch using TOPReward's namespaced keys."""
+    batch_size, seq_len = input_ids.shape
+    if attention_mask is None:
+        attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
+    batch: dict[str, torch.Tensor] = {}
+    if labels is not None:
+        batch[f"{TOPREWARD_FEATURE_PREFIX}labels"] = labels
+    batch.update(
+        {
+            f"{TOPREWARD_FEATURE_PREFIX}input_ids": input_ids,
+            f"{TOPREWARD_FEATURE_PREFIX}attention_mask": attention_mask,
+            f"{TOPREWARD_FEATURE_PREFIX}pixel_values_videos": torch.zeros(
+                batch_size, 1536, dtype=torch.float32
+            ),
+            f"{TOPREWARD_FEATURE_PREFIX}video_grid_thw": torch.ones(batch_size, 3, dtype=torch.long),
+            f"{TOPREWARD_FEATURE_PREFIX}mm_token_type_ids": torch.zeros_like(input_ids),
+        }
+    )
+    if omit is not None:
+        batch.pop(f"{TOPREWARD_FEATURE_PREFIX}{omit}", None)
+    return batch
+
+
+def _terminal_labels(input_ids: torch.Tensor) -> torch.Tensor:
+    labels = torch.full_like(input_ids, -100)
+    labels[:, -1] = input_ids[:, -1]
+    return labels
+
+
+# ---------------------------------------------------------------------------
+# Registry + factory
+# ---------------------------------------------------------------------------
+
+
+def test_topreward_config_registered():
+    assert "topreward" in RewardModelConfig.get_known_choices()
+    assert RewardModelConfig.get_choice_class("topreward") is TOPRewardConfig
+    assert isinstance(make_reward_model_config("topreward", device="cpu"), TOPRewardConfig)
+
+
+def test_topreward_factory_returns_in_tree_class():
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    assert get_reward_model_class("topreward") is TOPRewardModel
+
+
+# ---------------------------------------------------------------------------
+# Config validation
+# ---------------------------------------------------------------------------
+
+
+def test_topreward_config_rejects_zero_max_frames():
+    with pytest.raises(ValueError, match="max_frames must be >= 1"):
+        TOPRewardConfig(device="cpu", max_frames=0)
+
+
+def test_topreward_config_rejects_non_positive_fps():
+    with pytest.raises(ValueError, match="fps must be > 0"):
+        TOPRewardConfig(device="cpu", fps=0.0)
+
+
+def test_topreward_config_rejects_suffix_without_instruction_placeholder():
+    with pytest.raises(ValueError, match=r"\{instruction\}"):
+        TOPRewardConfig(device="cpu", prompt_suffix_template="no placeholder here")
+
+
+# ---------------------------------------------------------------------------
+# compute_reward
+# ---------------------------------------------------------------------------
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_returns_one_scalar_per_sample(monkeypatch):
+    """``compute_reward`` must return a ``(B,)`` float32 tensor with one
+    log-prob reward per sample, consuming pre-encoded Qwen-VL tensors."""
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    input_ids = torch.randint(0, 100, (2, 10))
+    attention_mask = torch.ones(2, 10, dtype=torch.long)
+    labels = _terminal_labels(input_ids)
+
+    batch = _make_batch(input_ids, attention_mask, labels)
+    rewards = model.compute_reward(batch)
+
+    assert rewards.shape == (2,)
+    assert rewards.dtype == torch.float32
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_applies_success_threshold(monkeypatch):
+    """When ``success_threshold`` is finite, the model returns binary success."""
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu", success_threshold=0.0)
+    model = TOPRewardModel(cfg)
+
+    input_ids = torch.randint(0, 100, (2, 10))
+    attention_mask = torch.ones(2, 10, dtype=torch.long)
+    labels = _terminal_labels(input_ids)
+
+    batch = _make_batch(input_ids, attention_mask, labels)
+    rewards = model.compute_reward(batch)
+
+    assert rewards.shape == (2,)
+    assert set(rewards.tolist()).issubset({0.0, 1.0})
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_errors_when_inputs_missing(monkeypatch):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    with pytest.raises(KeyError, match=r"observation\.topreward\.input_ids"):
+        model.compute_reward(_make_batch(torch.randint(0, 100, (1, 10)), omit="input_ids"))
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_errors_when_labels_missing(monkeypatch):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    input_ids = torch.randint(0, 100, (1, 10))
+    with pytest.raises(KeyError, match=r"observation\.topreward\.labels"):
+        model.compute_reward(_make_batch(input_ids, labels=None))
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_requires_all_encoder_keys(monkeypatch):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    input_ids = torch.randint(0, 100, (1, 10))
+    labels = _terminal_labels(input_ids)
+    required_encoder_keys = set(TOPREWARD_INPUT_KEYS) - {"input_ids", "labels"}
+
+    for key in required_encoder_keys:
+        with pytest.raises(KeyError, match=rf"observation\.topreward\.{key}"):
+            model.compute_reward(_make_batch(input_ids, labels=labels, omit=key))
+
+
+# ---------------------------------------------------------------------------
+# Save / load — config-only checkpoint
+# ---------------------------------------------------------------------------
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_save_pretrained_writes_only_config_json(monkeypatch, tmp_path):
+    from huggingface_hub.constants import CONFIG_NAME, SAFETENSORS_SINGLE_FILE
+
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(
+        device="cpu",
+        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+        fps=4.0,
+        image_key="observation.images.front",
+    )
+    model = TOPRewardModel(cfg)
+    model.save_pretrained(str(tmp_path))
+
+    assert (tmp_path / CONFIG_NAME).exists()
+    assert not (tmp_path / SAFETENSORS_SINGLE_FILE).exists()
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_from_pretrained_local_dir_roundtrips_config(monkeypatch, tmp_path):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(
+        device="cpu",
+        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+        fps=4.0,
+        image_key="observation.images.front",
+        add_chat_template=True,
+        success_threshold=-1.5,
+    )
+    TOPRewardModel(cfg).save_pretrained(str(tmp_path))
+
+    reloaded = TOPRewardModel.from_pretrained(str(tmp_path))
+
+    assert isinstance(reloaded.config, TOPRewardConfig)
+    assert reloaded.config.vlm_name == "Qwen/Qwen3-VL-8B-Instruct"
+    assert reloaded.config.fps == 4.0
+    assert reloaded.config.image_key == "observation.images.front"
+    assert reloaded.config.add_chat_template is True
+    assert reloaded.config.success_threshold == -1.5
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_is_not_trainable(monkeypatch):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    assert model.is_trainable is False
+    with pytest.raises(NotImplementedError, match="not trainable"):
+        model.forward({"x": torch.zeros(1)})
diff --git a/tests/rewards/test_topreward.py b/tests/rewards/test_topreward.py
new file mode 100644
index 000000000..cbf960751
--- /dev/null
+++ b/tests/rewards/test_topreward.py
@@ -0,0 +1,80 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""End-to-end TOPReward smoke test with the real Qwen3-VL model."""
+
+import os
+
+import pytest
+import torch
+
+pytest.importorskip("transformers")
+
+from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig  # noqa: E402
+from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel  # noqa: E402
+from lerobot.rewards.topreward.processor_topreward import (  # noqa: E402
+    TOPREWARD_FEATURE_PREFIX,
+    TOPREWARD_INPUT_KEYS,
+    make_topreward_pre_post_processors,
+)
+from tests.utils import require_cuda  # noqa: E402
+
+pytestmark = pytest.mark.skipif(
+    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
+    reason="This test requires downloading and loading Qwen3-VL and is not meant for CI",
+)
+
+
+def _make_dummy_topreward_batch(image_key: str, task_key: str) -> dict[str, object]:
+    num_frames = 4
+    image_size = 64
+    frames = torch.zeros(1, num_frames, 3, image_size, image_size, dtype=torch.uint8)
+    for frame_idx in range(num_frames):
+        frames[0, frame_idx, 0].fill_(min(frame_idx * 48, 255))
+        frames[0, frame_idx, 1].fill_(96)
+        frames[0, frame_idx, 2].fill_(192)
+
+    return {
+        image_key: frames,
+        task_key: ["pick up the red cube"],
+    }
+
+
+@require_cuda
+def test_topreward_full_qwen3vl_preprocessor_to_compute_reward():
+    cfg = TOPRewardConfig(
+        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+        device="cuda",
+        max_frames=4,
+        fps=2.0,
+        max_input_length=4096,
+    )
+
+    preprocessor, _ = make_topreward_pre_post_processors(cfg)
+    encoded_batch = preprocessor(_make_dummy_topreward_batch(cfg.image_key, cfg.task_key))
+    for key in TOPREWARD_INPUT_KEYS:
+        assert f"{TOPREWARD_FEATURE_PREFIX}{key}" in encoded_batch
+
+    model = TOPRewardModel(cfg)
+    try:
+        model.to(cfg.device)
+        model.eval()
+        rewards = model.compute_reward(encoded_batch)
+    finally:
+        del model
+        torch.cuda.empty_cache()
+
+    assert rewards.shape == (1,)
+    assert rewards.dtype == torch.float32
+    assert torch.isfinite(rewards).all()
diff --git a/tests/rewards/test_topreward_processor.py b/tests/rewards/test_topreward_processor.py
new file mode 100644
index 000000000..df379276e
--- /dev/null
+++ b/tests/rewards/test_topreward_processor.py
@@ -0,0 +1,246 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for TOPReward's pre-processing helpers and encoder step."""
+
+from __future__ import annotations
+
+import pytest
+import torch
+
+from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
+from lerobot.rewards.topreward.processor_topreward import (
+    TOPREWARD_FEATURE_PREFIX,
+    TOPREWARD_INPUT_KEYS,
+    _expand_tasks,
+    _prepare_video_batch,
+)
+from lerobot.types import TransitionKey
+from tests.utils import skip_if_package_missing
+
+# ---------------------------------------------------------------------------
+# _prepare_video_batch — raw image/video batch -> (B, T, C, H, W) uint8
+# ---------------------------------------------------------------------------
+
+
+def test_prepare_video_batch_batched_chw_float_is_converted_to_uint8():
+    video = torch.rand(2, 4, 3, 8, 8)
+    tensor = _prepare_video_batch(video, max_frames=None)
+
+    assert tensor.shape == (2, 4, 3, 8, 8)
+    assert tensor.dtype == torch.uint8
+    assert tensor.min() >= 0 and tensor.max() <= 255
+
+
+def test_prepare_video_batch_batched_thwc_uint8_is_permuted_to_channel_first():
+    video = torch.randint(0, 256, (2, 3, 8, 8, 3), dtype=torch.uint8)
+    tensor = _prepare_video_batch(video, max_frames=None)
+
+    assert tensor.shape == (2, 3, 3, 8, 8)
+    assert tensor.dtype == torch.uint8
+
+
+def test_prepare_video_batch_max_frames_tail_crops_recent_frames():
+    video = torch.zeros(1, 10, 3, 4, 4)
+    for t in range(10):
+        video[:, t] = t / 9.0
+
+    tensor = _prepare_video_batch(video, max_frames=3)
+
+    assert tensor.shape == (1, 3, 3, 4, 4)
+    assert int(tensor[0, 0, 0, 0, 0]) == int(7 / 9 * 255)
+    assert int(tensor[0, -1, 0, 0, 0]) == 255
+
+
+def test_prepare_video_batch_rejects_3d_input():
+    with pytest.raises(ValueError, match="Expected TOPReward frames"):
+        _prepare_video_batch(torch.zeros(4, 8, 8), max_frames=None)
+
+
+def test_prepare_video_batch_floats_above_one_are_rescaled_and_clipped():
+    video = torch.full((1, 1, 3, 2, 2), 5.0)
+    tensor = _prepare_video_batch(video, max_frames=None)
+
+    assert tensor.shape == (1, 1, 3, 2, 2)
+    assert int(tensor.max()) == 255
+
+
+def test_prepare_video_batch_clips_very_large_floats_to_uint8_max():
+    video = torch.full((1, 1, 3, 2, 2), 300.0)
+    tensor = _prepare_video_batch(video, max_frames=None)
+
+    assert int(tensor.max()) == 255
+
+
+# ---------------------------------------------------------------------------
+# _expand_tasks — string / list / tuple broadcasting to batch size
+# ---------------------------------------------------------------------------
+
+
+def test_expand_tasks_string_is_broadcast_to_batch_size():
+    assert _expand_tasks("pick up", batch_size=3, default=None) == ["pick up", "pick up", "pick up"]
+
+
+def test_expand_tasks_list_of_matching_size_passes_through():
+    assert _expand_tasks(["a", "b", "c"], batch_size=3, default=None) == ["a", "b", "c"]
+
+
+def test_expand_tasks_tuple_is_normalised_to_list():
+    assert _expand_tasks(("a", "b"), batch_size=2, default=None) == ["a", "b"]
+
+
+def test_expand_tasks_single_element_list_is_broadcast():
+    assert _expand_tasks(["only one"], batch_size=3, default=None) == ["only one"] * 3
+
+
+def test_expand_tasks_size_mismatch_raises():
+    with pytest.raises(ValueError, match="Expected 3 tasks"):
+        _expand_tasks(["a", "b"], batch_size=3, default=None)
+
+
+def test_expand_tasks_missing_uses_default():
+    assert _expand_tasks(None, batch_size=2, default="fallback") == ["fallback", "fallback"]
+
+
+def test_expand_tasks_missing_without_default_raises():
+    with pytest.raises(KeyError, match="task description"):
+        _expand_tasks(None, batch_size=1, default=None)
+
+
+def test_expand_tasks_wrong_type_raises():
+    with pytest.raises(TypeError, match="must be a string or list"):
+        _expand_tasks(42, batch_size=1, default=None)
+
+
+# ---------------------------------------------------------------------------
+# Encoder step — stubbed AutoProcessor
+# ---------------------------------------------------------------------------
+
+
+def _skip_if_topreward_extras_missing(func):
+    func = skip_if_package_missing("transformers")(func)
+    return func
+
+
+class _FakeTokenizer:
+    eos_token = "<|endoftext|>"
+    pad_token = "<|endoftext|>"
+
+    def __call__(self, *args, **kwargs):
+        return {"input_ids": torch.zeros(1, 10, dtype=torch.long)}
+
+
+class _FakeAutoProcessor:
+    def __init__(self) -> None:
+        self.tokenizer = _FakeTokenizer()
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):  # noqa: ARG003
+        return cls()
+
+    def apply_chat_template(self, messages, **kwargs):  # noqa: ARG002
+        return "fake_prompt_text"
+
+    def __call__(self, text=None, images=None, videos=None, **kwargs):  # noqa: ARG002
+        seq_len = 10
+        batch_size = len(text) if isinstance(text, list) else 1
+        return {
+            "input_ids": torch.randint(0, 100, (batch_size, seq_len)),
+            "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long),
+            "pixel_values_videos": torch.zeros(batch_size, 1536, dtype=torch.float32),
+            "video_grid_thw": torch.ones(batch_size, 3, dtype=torch.long),
+            "mm_token_type_ids": torch.zeros(batch_size, seq_len, dtype=torch.long),
+        }
+
+
+def _build_step(monkeypatch, **overrides):
+    from lerobot.rewards.topreward import processor_topreward
+
+    monkeypatch.setattr(processor_topreward, "AutoProcessor", _FakeAutoProcessor)
+    return processor_topreward.TOPRewardEncoderProcessorStep(**overrides)
+
+
+def _make_transition(observation: dict, complementary: dict | None = None) -> dict:
+    transition: dict = {TransitionKey.OBSERVATION: observation}
+    if complementary is not None:
+        transition[TransitionKey.COMPLEMENTARY_DATA] = complementary
+    return transition
+
+
+@_skip_if_topreward_extras_missing
+def test_encoder_step_emits_input_ids_and_labels(monkeypatch):
+    """The processor must emit Qwen-VL tensors including ``input_ids`` and
+    ``labels`` under the ``observation.topreward.*`` namespace."""
+    step = _build_step(monkeypatch)
+
+    frames_batch = torch.zeros(2, 4, 3, 8, 8)
+    out = step(
+        _make_transition(
+            observation={"observation.images.top": frames_batch},
+            complementary={"task": ["pick", "place"]},
+        )
+    )
+
+    obs_out = out[TransitionKey.OBSERVATION]
+    for key in TOPREWARD_INPUT_KEYS:
+        assert f"{TOPREWARD_FEATURE_PREFIX}{key}" in obs_out
+
+    input_ids = obs_out[f"{TOPREWARD_FEATURE_PREFIX}input_ids"]
+    labels = obs_out[f"{TOPREWARD_FEATURE_PREFIX}labels"]
+    assert labels.dtype == torch.long
+    assert labels.shape == (2, 10)
+    assert labels[:, :-1].eq(-100).all()
+    assert labels[:, -1].equal(input_ids[:, -1])
+
+
+@_skip_if_topreward_extras_missing
+def test_encoder_step_get_config_roundtrips_user_fields(monkeypatch):
+    step = _build_step(
+        monkeypatch,
+        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+        image_key="observation.images.cam_top",
+        task_key="task",
+        default_task="do the thing",
+        max_frames=8,
+        fps=4.0,
+        add_chat_template=True,
+        max_length=2048,
+    )
+
+    cfg = step.get_config()
+    assert cfg["vlm_name"] == "Qwen/Qwen3-VL-8B-Instruct"
+    assert cfg["image_key"] == "observation.images.cam_top"
+    assert cfg["default_task"] == "do the thing"
+    assert cfg["max_frames"] == 8
+    assert cfg["fps"] == 4.0
+    assert cfg["add_chat_template"] is True
+    assert cfg["max_length"] == 2048
+
+
+@_skip_if_topreward_extras_missing
+def test_encoder_step_transform_features_is_identity(monkeypatch):
+    step = _build_step(monkeypatch)
+    features = {
+        PipelineFeatureType.OBSERVATION: {
+            "observation.images.top": PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL),
+        }
+    }
+    assert step.transform_features(features) == features
+
+
+@_skip_if_topreward_extras_missing
+def test_encoder_step_rejects_missing_image_key(monkeypatch):
+    step = _build_step(monkeypatch, image_key="observation.images.top")
+    with pytest.raises(KeyError, match="image key"):
+        step(_make_transition(observation={}, complementary={"task": "pick"}))
diff --git a/uv.lock b/uv.lock
index c5f026517..3eb1dda23 100644
--- a/uv.lock
+++ b/uv.lock
@@ -3009,6 +3009,9 @@ test = [
     { name = "pytest-cov" },
     { name = "pytest-timeout" },
 ]
+topreward = [
+    { name = "transformers" },
+]
 training = [
     { name = "accelerate" },
     { name = "av" },
@@ -3167,6 +3170,7 @@ requires-dist = [
     { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'wallx'" },
     { name = "lerobot", extras = ["smolvla"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["test"], marker = "extra == 'all'" },
+    { name = "lerobot", extras = ["topreward"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["training"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'eo1'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'groot'" },
@@ -3177,6 +3181,7 @@ requires-dist = [
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'pi'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'sarm'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'smolvla'" },
+    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'topreward'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'wallx'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'xvla'" },
     { name = "lerobot", extras = ["video-benchmark"], marker = "extra == 'all'" },
@@ -3244,7 +3249,7 @@ requires-dist = [
     { name = "transformers", marker = "extra == 'transformers-dep'", specifier = ">=5.4.0,<5.6.0" },
     { name = "wandb", marker = "extra == 'training'", specifier = ">=0.24.0,<0.25.0" },
 ]
-provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "smolvla", "multi-task-dit", "groot", "sarm", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
+provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "smolvla", "multi-task-dit", "groot", "sarm", "topreward", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
 
 [[package]]
 name = "librt"

From 24017e960c39a24fe1b6ea6248522460fa5aa4b3 Mon Sep 17 00:00:00 2001
From: Haoquan Fang <71356829+hq-fang@users.noreply.github.com>
Date: Wed, 27 May 2026 09:58:37 -0700
Subject: [PATCH 17/17] Add MolmoAct2 policy (#3604)

* add molmoact2 policy

* add apache headers to molmoact2 files

* simplify molmoact2 package imports

* align molmoact2 feature validation with eo pattern

* remove molmoact2 processor override from factory

* guard molmoact2 transformers imports

* guard molmoact2 processor transformers import

* add scipy dependency to molmoact2 extra

* use a single molmoact2 action queue

* move molmoact2 config logic into config

* fix molmoact2 hf image key resolution

* load molmoact2 without remote code

* lazy import molmoact2 scipy

* format molmoact2 files

* skip molmoact2 tests without optional deps

* fix molmoact2 pre-commit checks

* validate molmoact2 gripper range
---
 docs/source/_toctree.yml                      |    2 +
 docs/source/molmoact2.mdx                     |  433 ++
 docs/source/policy_molmoact2_README.md        |   39 +
 pyproject.toml                                |    7 +-
 src/lerobot/policies/__init__.py              |    2 +
 src/lerobot/policies/factory.py               |   26 +-
 src/lerobot/policies/molmoact2/README.md      |    1 +
 src/lerobot/policies/molmoact2/__init__.py    |   21 +
 .../molmoact2/configuration_molmoact2.py      |  519 ++
 .../policies/molmoact2/hf_model/__init__.py   |   17 +
 .../molmoact2/hf_model/action_tokenizer.py    |  237 +
 .../hf_model/configuration_molmoact2.py       |  553 ++
 .../hf_model/image_processing_molmoact2.py    |  564 ++
 .../policies/molmoact2/hf_model/inference.py  |  748 +++
 .../molmoact2/hf_model/modeling_molmoact2.py  | 4591 +++++++++++++++++
 .../hf_model/processing_molmoact2.py          |  431 ++
 .../hf_model/video_processing_molmoact2.py    |  997 ++++
 .../policies/molmoact2/modeling_molmoact2.py  | 1551 ++++++
 .../policies/molmoact2/processor_molmoact2.py | 1083 ++++
 tests/policies/molmoact2/test_molmoact2.py    | 1397 +++++
 uv.lock                                       |   11 +-
 21 files changed, 13226 insertions(+), 4 deletions(-)
 create mode 100644 docs/source/molmoact2.mdx
 create mode 100644 docs/source/policy_molmoact2_README.md
 create mode 120000 src/lerobot/policies/molmoact2/README.md
 create mode 100644 src/lerobot/policies/molmoact2/__init__.py
 create mode 100644 src/lerobot/policies/molmoact2/configuration_molmoact2.py
 create mode 100644 src/lerobot/policies/molmoact2/hf_model/__init__.py
 create mode 100644 src/lerobot/policies/molmoact2/hf_model/action_tokenizer.py
 create mode 100644 src/lerobot/policies/molmoact2/hf_model/configuration_molmoact2.py
 create mode 100644 src/lerobot/policies/molmoact2/hf_model/image_processing_molmoact2.py
 create mode 100644 src/lerobot/policies/molmoact2/hf_model/inference.py
 create mode 100644 src/lerobot/policies/molmoact2/hf_model/modeling_molmoact2.py
 create mode 100644 src/lerobot/policies/molmoact2/hf_model/processing_molmoact2.py
 create mode 100644 src/lerobot/policies/molmoact2/hf_model/video_processing_molmoact2.py
 create mode 100644 src/lerobot/policies/molmoact2/modeling_molmoact2.py
 create mode 100644 src/lerobot/policies/molmoact2/processor_molmoact2.py
 create mode 100644 tests/policies/molmoact2/test_molmoact2.py

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 527cb7e63..1d4d9e770 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -59,6 +59,8 @@
     title: π₀-FAST (Pi0Fast)
   - local: pi05
     title: π₀.₅ (Pi05)
+  - local: molmoact2
+    title: MolmoAct2
   - local: eo1
     title: EO-1
   - local: groot
diff --git a/docs/source/molmoact2.mdx b/docs/source/molmoact2.mdx
new file mode 100644
index 000000000..ddd178acd
--- /dev/null
+++ b/docs/source/molmoact2.mdx
@@ -0,0 +1,433 @@
+# MolmoAct2 Policy
+
+MolmoAct2 is the LeRobot policy implementation of
+[MolmoAct2](https://allenai.org/blog/molmoact2), ported into the LeRobot
+training, evaluation, checkpointing, and dataset interfaces for easier use with
+LeRobot datasets.
+
+This implementation currently supports training and evaluation for the regular
+MolmoAct2 model. MolmoAct2-Think, which supports adaptive depth reasoning, is
+not included in this LeRobot policy yet and is coming soon.
+
+For the original MolmoAct2 training code used for the experiments reported in
+the paper, see [allenai/molmoact2](https://github.com/allenai/molmoact2).
+
+## Installation Requirements
+
+Install LeRobot with the MolmoAct2 optional dependencies:
+
+```bash
+pip install -e ".[molmoact2]"
+```
+
+To run the models in this repository, you need an NVIDIA GPU. The measurements
+below were taken on a single NVIDIA H100 80GB with bf16 model loading, on LIBERO with two RGB cameras. MolmoAct2 rows use `chunk_size=10`, a 7-dimensional action
+padded to `expected_max_action_dim=32`, and `num_flow_timesteps=8`. Training measurements use
+`gradient_checkpointing=true` and include the forward pass, backward pass,
+gradient clipping, optimizer step, and optimizer state allocation. Values are
+peak GPU memory sampled with `nvidia-smi`. Leave a few GiB of headroom for
+dataloader workers, CUDA context, and fragmentation.
+
+Multi-GPU training through `accelerate` increases throughput and global batch
+size, but this LeRobot port does not currently expose the original MolmoAct2
+`fsdp_devices` model-parallel training path. The current training script has
+not been tested for multi-node training.
+
+| Mode                                             | Peak Memory, bs=8 | Peak Memory, bs=16 | Peak Memory, bs=32 |
+| ------------------------------------------------ | ----------------: | -----------------: | -----------------: |
+| Inference, continuous, CUDA graph enabled (bs=1) |          12.1 GiB |                  - |                  - |
+| Fine-tuning, action expert only, continuous      |          16.5 GiB |           18.3 GiB |           21.4 GiB |
+| Fine-tuning, LoRA VLM, both action modes         |          20.2 GiB |           26.8 GiB |           41.3 GiB |
+| Fine-tuning, full model, both action modes       |          48.3 GiB |           49.8 GiB |           60.1 GiB |
+
+The repo has been tested with Ubuntu 22.04.
+
+## Usage
+
+To use MolmoAct2 in a LeRobot training config, set:
+
+```bash
+policy.type=molmoact2
+```
+
+## Training
+
+MolmoAct2 can be fine-tuned from either the released MolmoAct2 Hugging Face
+checkpoint format or from a checkpoint already saved by LeRobot. Both routes use
+the same LeRobot training loop, dataset transforms, checkpoint saving, and
+logging. The only difference is how the initial policy weights and processor
+state are loaded.
+
+### Training With Original MolmoAct2 Weights
+
+Use `policy.checkpoint_path` when starting from a released MolmoAct2 checkpoint,
+for example `allenai/MolmoAct2` or `allenai/MolmoAct2-LIBERO`. LeRobot will load
+the original HF model files, then build its own policy processor from the
+dataset metadata and the policy options below.
+
+The command below shows full fine-tuning on the merged LIBERO dataset. It uses
+bf16 model loading, 8 flow timesteps, LeRobot dataset statistics, image
+augmentation, and LeRobot's checkpointing/logging path.
+
+```bash
+accelerate launch \
+  --num_processes=8 \
+  --mixed_precision=bf16 \
+  -m lerobot.scripts.lerobot_train \
+  --dataset.repo_id=allenai/MolmoAct2-LIBERO-Dataset \
+  --dataset.root=/path/to/lerobot/data/allenai/MolmoAct2-LIBERO-Dataset \
+  --dataset.video_backend=pyav \
+  --dataset.image_transforms.enable=true \
+  --policy.type=molmoact2 \
+  --policy.checkpoint_path=allenai/MolmoAct2-LIBERO \
+  --policy.device=cuda \
+  --policy.action_mode=both \
+  --policy.chunk_size=10 \
+  --policy.n_action_steps=10 \
+  --policy.setup_type="single franka robotic arm in libero" \
+  --policy.control_mode="delta end-effector pose" \
+  --policy.image_keys='["observation.images.image","observation.images.wrist_image"]' \
+  --policy.model_dtype=bfloat16 \
+  --policy.num_flow_timesteps=8 \
+  --policy.gradient_checkpointing=true \
+  --policy.freeze_embedding=true \
+  --policy.normalize_gripper=false \
+  --policy.enable_knowledge_insulation=false \
+  --policy.push_to_hub=false \
+  --wandb.enable=true \
+  --wandb.entity=<wandb_entity> \
+  --wandb.project=<wandb_project> \
+  --job_name=<job_name> \
+  --output_dir=outputs/<job_name> \
+  --steps=10000 \
+  --batch_size=32 \
+  --num_workers=4 \
+  --log_freq=20 \
+  --eval_freq=-1 \
+  --save_checkpoint=true \
+  --save_freq=2000
+```
+
+### Training With LeRobot MolmoAct2 Weights
+
+Use `policy.path` when starting from a MolmoAct2 checkpoint that was saved by
+LeRobot, either from a local `pretrained_model` directory or from the Hub. This
+restores the saved LeRobot policy config, model weights, processor, and
+normalization statistics. You can still override training-time options such as
+`batch_size`, `steps`, LoRA flags, or `policy.action_mode`.
+
+```bash
+accelerate launch \
+  --num_processes=8 \
+  --mixed_precision=bf16 \
+  -m lerobot.scripts.lerobot_train \
+  --dataset.repo_id=allenai/MolmoAct2-LIBERO-Dataset \
+  --dataset.root=/path/to/lerobot/data/allenai/MolmoAct2-LIBERO-Dataset \
+  --dataset.video_backend=pyav \
+  --dataset.image_transforms.enable=true \
+  --policy.path=/path/to/pretrained_model \
+  --policy.device=cuda \
+  --policy.action_mode=both \
+  --policy.chunk_size=10 \
+  --policy.n_action_steps=10 \
+  --policy.model_dtype=bfloat16 \
+  --policy.num_flow_timesteps=8 \
+  --policy.gradient_checkpointing=true \
+  --wandb.enable=true \
+  --wandb.entity=<wandb_entity> \
+  --wandb.project=<wandb_project> \
+  --job_name=<job_name> \
+  --output_dir=outputs/<job_name> \
+  --steps=10000 \
+  --batch_size=32 \
+  --num_workers=4 \
+  --log_freq=20 \
+  --eval_freq=-1 \
+  --save_checkpoint=true \
+  --save_freq=2000
+```
+
+### Common Practices
+
+For fine-tuning on a comparatively small dataset, such as a single LIBERO suite
+or a real-world dataset with fewer than 200 demonstrations, a global batch size of
+16 to 32 is a good starting point. In these settings, `policy.enable_lora_vlm=true` or `policy.train_action_expert_only=true` is also a practical choice. In both
+cases, we intentionally keep the action expert fully trainable, which we found
+to be crucial for model performance. For larger fine-tuning datasets, larger
+global batch sizes and full fine-tuning are usually preferred.
+
+### Common Policy Options
+
+- `policy.checkpoint_path`: original MolmoAct2 HF checkpoint to initialize from.
+  Use this for released MolmoAct2 weights.
+- `policy.path`: LeRobot checkpoint to initialize from. Use this for checkpoints
+  created by LeRobot training.
+- `policy.action_mode`: training target, one of `continuous`, `discrete`, or
+  `both`. `both` trains the flow-matching action expert and the discrete
+  action-token loss.
+- `policy.train_action_expert_only`: trains only parameters whose names contain
+  `action_expert`. It requires `policy.action_mode=continuous`.
+- `policy.enable_lora_vlm`: enables LoRA on VLM linear layers. Use
+  `policy.enable_lora_action_expert=true` only if LoRA should also cover action
+  expert linear layers. When `policy.enable_lora_action_expert=false`, the
+  action expert base weights remain fully trainable while the VLM is trained
+  through LoRA adapters. When `policy.enable_lora_action_expert=true`, the
+  action expert is also adapter-tuned instead of fully fine-tuned.
+- `policy.enable_knowledge_insulation`: when `true`, detaches action-expert
+  context K/V states before the action loss. The default is `false`.
+- `policy.chunk_size`: action horizon used by the policy. For LIBERO we use
+  `10`. This LeRobot port overrides the loaded checkpoint's
+  `max_action_horizon` with this value.
+- `policy.n_action_steps`: number of actions consumed from each predicted
+  chunk before querying the policy again. For LIBERO, set it to `chunk_size`.
+- `policy.setup_type`: text inserted into the prompt to describe the robot and
+  scene, e.g. `single franka robotic arm in libero`. More examples are listed
+  in the `metadata_by_tag` entries of
+  [`norm_stats.json`](https://huggingface.co/allenai/MolmoAct2/blob/main/norm_stats.json).
+- `policy.control_mode`: text inserted into the prompt to describe the action
+  space, e.g. `delta end-effector pose` or `absolute joint pose`.
+- `policy.image_keys`: ordered LeRobot image observation keys passed to the
+  processor.
+- `policy.model_dtype`: checkpoint/forward dtype, one of `float32`,
+  `bfloat16`, or `float16`. Use `bfloat16` for normal training.
+- `policy.num_flow_timesteps`: number of flow-matching timesteps sampled per
+  example during training. We use `8` for fine-tuning.
+- `policy.num_inference_steps`: optional override for continuous action
+  generation steps at inference time.
+- `policy.gradient_checkpointing`: enables checkpointing in the VLM/action path
+  to reduce activation memory.
+- `policy.freeze_embedding`: freezes input embeddings. The default is `true`.
+- `policy.normalize_gripper`: controls whether gripper dimensions are included
+  in state/action quantile normalization. The default is `false`.
+- `policy.normalize_language`: normalizes task strings before prompt
+  construction. The default is `true`.
+- `policy.mask_action_dim_padding`: masks padded dimensions in the flow loss.
+  Released checkpoints use `policy.expected_max_action_dim=32`.
+- `policy.max_sequence_length`: optional manual sequence cap. Leave unset to
+  infer it from images, state dimension, action dimension, action horizon, and
+  discrete-action mode.
+
+### Learning Rates
+
+MolmoAct2 uses parameter-group learning rates to match the original MolmoAct2
+fine-tuning experiments.
+
+- Full fine-tuning uses `policy.optimizer_lr=1e-5` for the VLM,
+  `policy.optimizer_vit_lr=5e-6` for the vision tower,
+  `policy.optimizer_connector_lr=5e-6` for image connector layers, and
+  `policy.optimizer_action_expert_lr=5e-5` for the action expert.
+- LoRA VLM fine-tuning sets the VLM, vision, and connector LoRA parameter
+  groups to `5e-5` when `policy.enable_lora_vlm=true`. By default,
+  `policy.enable_lora_action_expert=false`, so the action expert is still fully
+  fine-tuned with `policy.optimizer_action_expert_lr`. If
+  `policy.enable_lora_action_expert=true`, the action expert is trained through
+  LoRA adapters instead.
+- Action-expert-only fine-tuning trains only the action expert and uses
+  `policy.optimizer_action_expert_lr=5e-5`.
+
+You can override the full fine-tuning and action-expert learning rates with
+`policy.optimizer_lr`, `policy.optimizer_vit_lr`,
+`policy.optimizer_connector_lr`, and `policy.optimizer_action_expert_lr`.
+Scheduler settings can be changed with `policy.scheduler_warmup_steps`,
+`policy.scheduler_decay_steps`, and `policy.scheduler_decay_lr`.
+
+### Dataset Quantile Statistics
+
+MolmoAct2 defaults to quantile normalization for state and action features. If
+your dataset has not been converted with quantile statistics, you can add them
+with:
+
+```bash
+python src/lerobot/datasets/v30/augment_dataset_quantile_stats.py \
+  --repo-id=your_dataset
+```
+
+Alternatively, train MolmoAct2 with mean/std normalization:
+
+```bash
+--policy.normalization_mapping='{"ACTION": "MEAN_STD", "STATE": "MEAN_STD", "VISUAL": "IDENTITY"}'
+```
+
+## Evaluation
+
+Evaluation also supports both LeRobot-saved checkpoints and original MolmoAct2
+HF checkpoints. For LIBERO replication, keep the EGL rendering environment
+fixed and use `policy.per_episode_seed=true`.
+
+**Important:** We found that `num_steps_wait=10` does not reliably let the
+LIBERO scene stabilize and can degrade measured success. All LIBERO evaluation
+results reported here use `num_steps_wait=50`.
+
+### Evaluation With LeRobot MolmoAct2 Weights
+
+Use `policy.path` for a checkpoint saved by LeRobot. The saved processor and
+normalization statistics are restored together with the model.
+
+```bash
+export MUJOCO_GL=egl
+export PYOPENGL_PLATFORM=egl
+export OMP_NUM_THREADS=1
+export MKL_NUM_THREADS=1
+
+lerobot-eval \
+  --policy.path=allenai/MolmoAct2-LIBERO-LeRobot \
+  --policy.inference_action_mode=continuous \
+  --policy.model_dtype=bfloat16 \
+  --policy.use_amp=true \
+  --policy.enable_inference_cuda_graph=true \
+  --policy.device=cuda \
+  --policy.per_episode_seed=true \
+  --policy.eval_seed=1000 \
+  --env.type=libero \
+  --env.task=libero_10,libero_goal,libero_object,libero_spatial \
+  --env.camera_name_mapping='{"agentview_image":"image","robot0_eye_in_hand_image":"wrist_image"}' \
+  --eval.batch_size=1 \
+  --eval.n_episodes=50 \
+  --seed=1000
+```
+
+### Evaluation With Original MolmoAct2 Weight
+
+You can evaluate a released Hugging Face checkpoint directly without first
+converting it to a LeRobot checkpoint. In this case, set
+`policy.checkpoint_path` to the HF model repo and provide `policy.norm_tag`.
+For LIBERO, `policy.norm_tag=libero` loads the LIBERO action/state
+normalization statistics, action horizon, prompt metadata, and image-key order
+from the checkpoint's `norm_stats.json`.
+
+To fully replicate the MolmoAct2 paper results with released Hugging Face
+checkpoints, we recommend using the v0.5.1-pinned
+[`allenai/lerobot` `molmoact2-hf-inference`](https://github.com/allenai/lerobot/tree/molmoact2-hf-inference)
+branch. That branch matches the original evaluation settings used for the
+reported numbers.
+
+```bash
+export MUJOCO_GL=egl
+export PYOPENGL_PLATFORM=egl
+export OMP_NUM_THREADS=1
+export MKL_NUM_THREADS=1
+
+lerobot-eval \
+  --policy.type=molmoact2 \
+  --policy.checkpoint_path=allenai/MolmoAct2-LIBERO \
+  --policy.norm_tag=libero \
+  --policy.inference_action_mode=continuous \
+  --policy.model_dtype=float32 \
+  --policy.use_amp=false \
+  --policy.enable_inference_cuda_graph=true \
+  --policy.device=cuda \
+  --policy.per_episode_seed=true \
+  --policy.eval_seed=1000 \
+  --env.type=libero \
+  --env.task=libero_goal \
+  --env.camera_name_mapping='{"agentview_image":"image","robot0_eye_in_hand_image":"wrist_image"}' \
+  --eval.batch_size=1 \
+  --eval.n_episodes=50 \
+  --seed=1000
+```
+
+Use `--env.task=libero_10,libero_goal,libero_object,libero_spatial` to run the
+full LIBERO suite. The same command works for other released MolmoAct2
+checkpoints as long as the requested `policy.norm_tag` exists in that
+checkpoint's `norm_stats.json`.
+
+### Common Evaluation Options
+
+- `policy.inference_action_mode`: required for rollout. Use `continuous` for
+  flow-matching inference or `discrete` for action-token inference. It must be
+  compatible with the training-time `policy.action_mode` saved in the
+  checkpoint.
+- `policy.path`: LeRobot checkpoint path or Hub repo. Use this for checkpoints
+  saved by LeRobot.
+- `policy.checkpoint_path`: original MolmoAct2 HF checkpoint path or Hub repo.
+  Use this with `policy.type=molmoact2` and `policy.norm_tag`.
+- `policy.norm_tag`: selects normalization statistics, prompt metadata,
+  image-key order, and action horizon from the original checkpoint's
+  `norm_stats.json`. It is required for direct original-HF checkpoint
+  evaluation.
+- `policy.model_dtype`: model load/forward dtype. Use `bfloat16` for normal
+  GPU evaluation. Use `float32` only when you explicitly want fp32 inference.
+- `policy.use_amp`: runs the policy forward pass under autocast during
+  evaluation. For `model_dtype=bfloat16`, keep this enabled.
+- `policy.enable_inference_cuda_graph`: enables the MolmoAct2 inference CUDA
+  graph path for faster repeated continuous-action rollout.
+- `policy.per_episode_seed` and `policy.eval_seed`: make stochastic continuous
+  action generation deterministic per episode for replication.
+- `env.task`: comma-separated LIBERO suites or a single suite. Use
+  `libero_10,libero_goal,libero_object,libero_spatial` for the full benchmark.
+- `env.camera_name_mapping`: maps LIBERO camera names to the image keys expected
+  by the policy processor.
+
+## Performance Results
+
+### LIBERO Benchmark Results
+
+MolmoAct2 has demonstrated strong performance on the LIBERO benchmark suite. To
+validate the LeRobot implementation, we fine-tuned
+[`allenai/MolmoAct2-LIBERO`](https://huggingface.co/allenai/MolmoAct2-LIBERO)
+for an additional 10k steps on the LIBERO dataset with a per-GPU batch size of
+32 on 8 H100 GPUs, then compared its success rates with the original MolmoAct2
+reference results.
+
+The LeRobot fine-tuned checkpoint reported here is available at
+[`allenai/MolmoAct2-LIBERO-LeRobot`](https://huggingface.co/allenai/MolmoAct2-LIBERO-LeRobot)
+and was trained on
+[`allenai/MolmoAct2-LIBERO-Dataset`](https://huggingface.co/datasets/allenai/MolmoAct2-LIBERO-Dataset).
+
+| Benchmark      | LeRobot Implementation | MolmoAct2 Original |
+| -------------- | ---------------------: | -----------------: |
+| LIBERO Spatial |                  98.4% |              97.8% |
+| LIBERO Object  |                 100.0% |             100.0% |
+| LIBERO Goal    |                  98.0% |              97.8% |
+| LIBERO 10      |                  96.6% |              93.2% |
+| Average        |                 98.25% |             97.20% |
+
+The LeRobot implementation matches or exceeds the original reference results
+on every LIBERO suite. To reproduce these numbers, follow the commands in the
+Evaluation section above.
+
+## Differences From the Original Implementation
+
+This LeRobot port is intended to match MolmoAct2 behavior while using LeRobot's
+dataset, training, evaluation, checkpoint, and logging infrastructure. The main
+differences from the original training repository are:
+
+- The original paper training stack loads the model in fp32 and trains under
+  mixed precision. This LeRobot port usually loads the checkpoint directly in
+  `policy.model_dtype=bfloat16` for lower memory use.
+- The original repository uses its own FSDP/model-parallel training path. The
+  LeRobot port uses the standard LeRobot/Accelerate training path and has not
+  been tested for multi-node training.
+- The original repository supports sequence packing. The LeRobot port trains on
+  one LeRobot sample per item and pads to an inferred fixed sequence budget.
+- The LeRobot port follows LeRobot's optimizer, scheduler, checkpoint saving,
+  dataset transforms, image augmentation, and Weights & Biases logging
+  conventions.
+- The original training path supports mixed action horizons by padding to
+  `max_action_horizon` and masking padded horizon slots in the action expert
+  self-attention. This is useful when training across datasets with different
+  control frequencies. The LeRobot port currently targets single-dataset
+  fine-tuning, so `policy.chunk_size` overrides the checkpoint
+  `max_action_horizon` and horizon masking is not implemented yet. Support for
+  this mixed-horizon path is planned.
+
+## Citation
+
+```bibtex
+@misc{fang2026molmoact2actionreasoningmodels,
+      title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
+      author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
+      year={2026},
+      eprint={2605.02881},
+      archivePrefix={arXiv},
+      primaryClass={cs.RO},
+      url={https://arxiv.org/abs/2605.02881},
+}
+```
+
+## License
+
+This model is licensed under Apache 2.0. It is intended for research and
+educational use in accordance with
+[Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use),
+consistent with [allenai/molmoact2](https://github.com/allenai/molmoact2).
diff --git a/docs/source/policy_molmoact2_README.md b/docs/source/policy_molmoact2_README.md
new file mode 100644
index 000000000..df3a6341e
--- /dev/null
+++ b/docs/source/policy_molmoact2_README.md
@@ -0,0 +1,39 @@
+# MolmoAct2
+
+This repository contains the LeRobot policy implementation of
+[MolmoAct2](https://allenai.org/blog/molmoact2), ported into LeRobot for
+training, evaluation, checkpointing, and dataset compatibility.
+
+This implementation currently supports training and evaluation for the regular
+MolmoAct2 model. MolmoAct2-Think, which supports adaptive depth reasoning, is
+not included in this LeRobot policy yet and is coming soon.
+
+For the original MolmoAct2 training code used for the experiments reported in
+the paper, see [allenai/molmoact2](https://github.com/allenai/molmoact2).
+
+## LIBERO Evaluation
+
+Important: we found that `num_steps_wait=10` does not reliably let the LIBERO
+scene stabilize and can degrade measured success. All LIBERO evaluation results
+reported for this LeRobot implementation use `num_steps_wait=50`.
+
+## Citation
+
+```bibtex
+@misc{fang2026molmoact2actionreasoningmodels,
+      title={MolmoAct2: Action Reasoning Models for Real-world Deployment},
+      author={Haoquan Fang and Jiafei Duan and Donovan Clay and Sam Wang and Shuo Liu and Weikai Huang and Xiang Fan and Wei-Chuan Tsai and Shirui Chen and Yi Ru Wang and Shanli Xing and Jaemin Cho and Jae Sung Park and Ainaz Eftekhar and Peter Sushko and Karen Farley and Angad Wadhwa and Cole Harrison and Winson Han and Ying-Chun Lee and Eli VanderBilt and Rose Hendrix and Suveen Ellawela and Lucas Ngoo and Joyce Chai and Zhongzheng Ren and Ali Farhadi and Dieter Fox and Ranjay Krishna},
+      year={2026},
+      eprint={2605.02881},
+      archivePrefix={arXiv},
+      primaryClass={cs.RO},
+      url={https://arxiv.org/abs/2605.02881},
+}
+```
+
+## License
+
+This model is licensed under Apache 2.0. It is intended for research and
+educational use in accordance with
+[Ai2's Responsible Use Guidelines](https://allenai.org/responsible-use),
+consistent with [allenai/molmoact2](https://github.com/allenai/molmoact2).
diff --git a/pyproject.toml b/pyproject.toml
index 264297c5e..a6785c564 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -198,6 +198,7 @@ wallx = [
     "lerobot[qwen-vl-utils-dep]",
 ]
 pi = ["lerobot[transformers-dep]", "lerobot[scipy-dep]"]
+molmoact2 = ["lerobot[transformers-dep]", "lerobot[peft-dep]", "lerobot[scipy-dep]"]
 smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0"]
 multi_task_dit = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]"]
 groot = [
@@ -275,6 +276,7 @@ all = [
     "lerobot[multi_task_dit]",
     "lerobot[wallx]",
     "lerobot[pi]",
+    "lerobot[molmoact2]",
     "lerobot[smolvla]",
     # "lerobot[groot]", TODO(Steven): Gr00t requires specific installation instructions for flash-attn
     "lerobot[xvla]",
@@ -405,8 +407,11 @@ default.extend-ignore-identifiers-re = [
     "ein",
     "thw",
     "inpt",
+    "arange",
+    "is_compileable",
     "ROBOTIS",
-    "OT_VALUE"
+    "OT_VALUE",
+    "VanderBilt"
 ]
 
 # TODO: Uncomment when ready to use
diff --git a/src/lerobot/policies/__init__.py b/src/lerobot/policies/__init__.py
index 3a6b8e5d2..68d23c9ca 100644
--- a/src/lerobot/policies/__init__.py
+++ b/src/lerobot/policies/__init__.py
@@ -20,6 +20,7 @@ from .eo1.configuration_eo1 import EO1Config as EO1Config
 from .factory import get_policy_class, make_policy, make_policy_config, make_pre_post_processors
 from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig as GaussianActorConfig
 from .groot.configuration_groot import GrootConfig as GrootConfig
+from .molmoact2.configuration_molmoact2 import MolmoAct2Config as MolmoAct2Config
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config as PI0Config
 from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
@@ -43,6 +44,7 @@ __all__ = [
     "EO1Config",
     "GaussianActorConfig",
     "GrootConfig",
+    "MolmoAct2Config",
     "MultiTaskDiTConfig",
     "PI0Config",
     "PI0FastConfig",
diff --git a/src/lerobot/policies/factory.py b/src/lerobot/policies/factory.py
index 8937bc6ae..05fda05d8 100644
--- a/src/lerobot/policies/factory.py
+++ b/src/lerobot/policies/factory.py
@@ -49,6 +49,7 @@ from .diffusion.configuration_diffusion import DiffusionConfig
 from .eo1.configuration_eo1 import EO1Config
 from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig
 from .groot.configuration_groot import GrootConfig
+from .molmoact2.configuration_molmoact2 import MolmoAct2Config
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config
 from .pi05.configuration_pi05 import PI05Config
@@ -88,7 +89,8 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
 
     Args:
         name: The name of the policy. Supported names are "tdmpc", "diffusion", "act",
-            "multi_task_dit", "vqbet", "pi0", "pi05", "gaussian_actor", "smolvla", "wall_x".
+            "multi_task_dit", "vqbet", "pi0", "pi05", "gaussian_actor", "smolvla", "wall_x",
+            "molmoact2".
     Returns:
         The policy class corresponding to the given name.
 
@@ -151,6 +153,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
         from .eo1.modeling_eo1 import EO1Policy
 
         return EO1Policy
+    elif name == "molmoact2":
+        from .molmoact2.modeling_molmoact2 import MolmoAct2Policy
+
+        return MolmoAct2Policy
     else:
         try:
             return _get_policy_cls_from_policy_name(name=name)
@@ -168,7 +174,7 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
     Args:
         policy_type: The type of the policy. Supported types include "tdmpc",
                      "multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "gaussian_actor",
-                     "smolvla", "wall_x".
+                     "smolvla", "wall_x", "molmoact2".
         **kwargs: Keyword arguments to be passed to the configuration class constructor.
 
     Returns:
@@ -203,6 +209,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
         return WallXConfig(**kwargs)
     elif policy_type == "eo1":
         return EO1Config(**kwargs)
+    elif policy_type == "molmoact2":
+        return MolmoAct2Config(**kwargs)
     else:
         try:
             config_cls = PreTrainedConfig.get_choice_class(policy_type)
@@ -231,6 +239,7 @@ class ProcessorConfigKwargs(TypedDict, total=False):
     preprocessor_overrides: dict[str, Any] | None
     postprocessor_overrides: dict[str, Any] | None
     dataset_stats: dict[str, dict[str, torch.Tensor]] | None
+    dataset_meta: Any | None
 
 
 def make_pre_post_processors(
@@ -414,6 +423,15 @@ def make_pre_post_processors(
             dataset_stats=kwargs.get("dataset_stats"),
         )
 
+    elif isinstance(policy_cfg, MolmoAct2Config):
+        from .molmoact2.processor_molmoact2 import make_molmoact2_pre_post_processors
+
+        processors = make_molmoact2_pre_post_processors(
+            config=policy_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+            dataset_meta=kwargs.get("dataset_meta"),
+        )
+
     else:
         try:
             processors = _make_processors_from_policy_config(
@@ -499,6 +517,10 @@ def make_policy(
         action_names = ds_meta.features.get(ACTION, {}).get("names")
         if action_names is not None:
             cfg.action_feature_names = list(action_names)
+    if ds_meta is not None:
+        set_dataset_feature_metadata = getattr(cfg, "set_dataset_feature_metadata", None)
+        if callable(set_dataset_feature_metadata):
+            set_dataset_feature_metadata(ds_meta.features)
 
     kwargs["config"] = cfg
 
diff --git a/src/lerobot/policies/molmoact2/README.md b/src/lerobot/policies/molmoact2/README.md
new file mode 120000
index 000000000..ef419516d
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/README.md
@@ -0,0 +1 @@
+../../../../docs/source/policy_molmoact2_README.md
\ No newline at end of file
diff --git a/src/lerobot/policies/molmoact2/__init__.py b/src/lerobot/policies/molmoact2/__init__.py
new file mode 100644
index 000000000..bfef53bb2
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/__init__.py
@@ -0,0 +1,21 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .configuration_molmoact2 import MolmoAct2Config
+from .modeling_molmoact2 import MolmoAct2Policy
+from .processor_molmoact2 import make_molmoact2_pre_post_processors
+
+__all__ = ["MolmoAct2Config", "MolmoAct2Policy", "make_molmoact2_pre_post_processors"]
diff --git a/src/lerobot/policies/molmoact2/configuration_molmoact2.py b/src/lerobot/policies/molmoact2/configuration_molmoact2.py
new file mode 100644
index 000000000..de2585281
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/configuration_molmoact2.py
@@ -0,0 +1,519 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import json
+import math
+import os
+from contextlib import suppress
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from huggingface_hub import snapshot_download
+
+from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature, PreTrainedConfig
+from lerobot.optim import (
+    AdamWConfig,
+    CosineDecayWithWarmupSchedulerConfig,
+    LRSchedulerConfig,
+    OptimizerConfig,
+)
+from lerobot.utils.constants import ACTION, OBS_STATE
+
+from ..rtc.configuration_rtc import RTCConfig
+
+MOLMOACT2_DEFAULT_NUM_IMAGES = 2
+MOLMOACT2_IMAGE_TOKENS_PER_IMAGE = 196
+MOLMOACT2_FIXED_PROMPT_TOKEN_BUDGET = 80
+MOLMOACT2_TASK_TOKEN_BUDGET = 32
+MOLMOACT2_SEQUENCE_LENGTH_MARGIN = 32
+MOLMOACT2_SEQUENCE_LENGTH_MULTIPLE = 64
+MOLMOACT2_DISCRETE_ACTION_WRAPPER_TOKENS = 4
+MOLMOACT2_MIN_DISCRETE_ACTION_TOKENS_PER_STEP = 6
+MOLMOACT2_DISCRETE_ACTION_TOKENS_PER_DIM = 0.95
+
+
+def _hf_token() -> str | None:
+    return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
+
+
+def _resolve_checkpoint_location(
+    checkpoint_path: str,
+    *,
+    revision: str | None = None,
+    force_download: bool = False,
+) -> str:
+    checkpoint_path = str(checkpoint_path or "").strip()
+    if not checkpoint_path:
+        raise ValueError("MolmoAct2 policy requires `checkpoint_path`.")
+    local_path = Path(checkpoint_path).expanduser()
+    if local_path.exists():
+        return str(local_path)
+    return snapshot_download(
+        repo_id=checkpoint_path,
+        repo_type="model",
+        revision=revision,
+        force_download=force_download,
+        ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
+        token=_hf_token(),
+    )
+
+
+def _load_hf_norm_metadata_for_tag(
+    checkpoint_path: str,
+    *,
+    revision: str | None,
+    force_download: bool,
+    norm_tag: str | None,
+) -> dict[str, Any]:
+    norm_tag = str(norm_tag or "").strip()
+    if not norm_tag:
+        return {}
+    checkpoint_location = Path(
+        _resolve_checkpoint_location(
+            checkpoint_path,
+            revision=revision,
+            force_download=force_download,
+        )
+    )
+    norm_stats_filename = "norm_stats.json"
+    config_path = checkpoint_location / "config.json"
+    if config_path.exists():
+        with suppress(OSError, json.JSONDecodeError):
+            norm_stats_filename = str(
+                json.loads(config_path.read_text()).get("norm_stats_filename") or norm_stats_filename
+            )
+    stats_path = checkpoint_location / norm_stats_filename
+    if not stats_path.exists():
+        raise FileNotFoundError(
+            f"MolmoAct2 HF checkpoint is missing {norm_stats_filename!r}; cannot resolve norm_tag={norm_tag!r}."
+        )
+    payload = json.loads(stats_path.read_text())
+    metadata_by_tag = payload.get("metadata_by_tag")
+    if not isinstance(metadata_by_tag, dict):
+        raise ValueError(f"MolmoAct2 norm stats file {stats_path} has no metadata_by_tag mapping.")
+    metadata = metadata_by_tag.get(norm_tag)
+    if not isinstance(metadata, dict):
+        available = sorted(str(tag) for tag in metadata_by_tag)
+        raise ValueError(f"Unknown MolmoAct2 norm_tag={norm_tag!r}. Available tags: {available}.")
+    return metadata
+
+
+@LRSchedulerConfig.register_subclass("molmoact2_cosine_decay_with_warmup")
+@dataclass
+class MolmoAct2CosineDecayWithWarmupSchedulerConfig(CosineDecayWithWarmupSchedulerConfig):
+    """MolmoAct2-local cosine scheduler with optional decay-step auto-match.
+
+    LeRobot's generic cosine scheduler keeps an explicit integer decay length.
+    For MolmoAct2, num_decay_steps=None means "decay across this run's
+    training steps"; build() is the first point where num_training_steps is known.
+    """
+
+    num_decay_steps: int | None
+
+    def build(self, optimizer, num_training_steps: int):
+        return CosineDecayWithWarmupSchedulerConfig(
+            peak_lr=self.peak_lr,
+            decay_lr=self.decay_lr,
+            num_warmup_steps=self.num_warmup_steps,
+            num_decay_steps=num_training_steps if self.num_decay_steps is None else self.num_decay_steps,
+        ).build(optimizer, num_training_steps=num_training_steps)
+
+
+def _round_up(value: int, multiple: int) -> int:
+    return int(math.ceil(value / multiple) * multiple)
+
+
+def infer_molmoact2_max_sequence_length(
+    *,
+    num_images: int,
+    state_dim: int,
+    action_dim: int,
+    action_horizon: int,
+    include_discrete_action: bool,
+) -> int:
+    """Infer the padded text/image sequence cap from MolmoAct2's fixed token layout."""
+    if num_images < 1:
+        num_images = MOLMOACT2_DEFAULT_NUM_IMAGES
+    if state_dim < 0:
+        state_dim = 0
+    if action_dim < 1:
+        action_dim = 1
+    if action_horizon < 1:
+        action_horizon = 1
+
+    image_tokens = num_images * MOLMOACT2_IMAGE_TOKENS_PER_IMAGE
+    prompt_tokens = (
+        MOLMOACT2_FIXED_PROMPT_TOKEN_BUDGET
+        + MOLMOACT2_TASK_TOKEN_BUDGET
+        + state_dim
+        + MOLMOACT2_SEQUENCE_LENGTH_MARGIN
+    )
+    action_tokens = 0
+    if include_discrete_action:
+        action_tokens_per_step = max(
+            MOLMOACT2_MIN_DISCRETE_ACTION_TOKENS_PER_STEP,
+            math.ceil(action_dim * MOLMOACT2_DISCRETE_ACTION_TOKENS_PER_DIM),
+        )
+        action_tokens = MOLMOACT2_DISCRETE_ACTION_WRAPPER_TOKENS + action_horizon * action_tokens_per_step
+
+    return _round_up(
+        image_tokens + prompt_tokens + action_tokens,
+        MOLMOACT2_SEQUENCE_LENGTH_MULTIPLE,
+    )
+
+
+@PreTrainedConfig.register_subclass("molmoact2")
+@dataclass
+class MolmoAct2Config(PreTrainedConfig):
+    """MolmoAct2 policy backed by the converted HF checkpoint implementation."""
+
+    checkpoint_path: str = "allenai/MolmoAct2"
+    checkpoint_revision: str | None = None
+    checkpoint_force_download: bool = False
+
+    n_obs_steps: int = 1
+    chunk_size: int = 30
+    n_action_steps: int = 30
+
+    action_mode: str = "both"
+    inference_action_mode: str | None = None
+    discrete_action_tokenizer: str = "allenai/MolmoAct2-FAST-Tokenizer"
+    discrete_generation_max_steps: int | None = None
+    norm_tag: str | None = None
+
+    setup_type: str = ""
+    control_mode: str = ""
+    image_keys: list[str] = field(default_factory=list)
+    normalize_language: bool = True
+    add_setup_tokens: bool = True
+    add_control_tokens: bool = True
+    normalize_gripper: bool = False
+    num_state_tokens: int = 256
+    # Leave unset for the default MolmoAct2 sequence budget inferred from the fixed
+    # image/prompt/state/action token layout. Override only for unusual long prompts.
+    max_sequence_length: int | None = None
+
+    # Fixed by released MolmoAct2 checkpoints. We validate this at model load.
+    expected_max_action_dim: int = 32
+
+    # Flow-matching training knobs copied from the original MolmoAct2 training path.
+    num_flow_timesteps: int = 8
+    flow_matching_cutoff: float = 1.0
+    flow_matching_time_offset: float = 0.001
+    flow_matching_time_scale: float = 0.999
+    flow_matching_beta_alpha: float = 1.0
+    flow_matching_beta_beta: float = 1.5
+    num_inference_steps: int | None = None
+    mask_action_dim_padding: bool = True
+    enable_inference_cuda_graph: bool = True
+    # MolmoAct2-local eval option. When enabled, stochastic continuous action
+    # generation uses a rollout-local generator derived from eval_seed.
+    per_episode_seed: bool = False
+    eval_seed: int | None = None
+    rtc_config: RTCConfig | None = None
+
+    # Default is full fine-tuning with gradients from the action expert flowing into the VLM.
+    enable_lora_vlm: bool = False
+    lora_rank: int = 64
+    lora_alpha: int = 16
+    lora_dropout: float = 0.05
+    lora_bias: str = "none"
+    enable_lora_action_expert: bool = False
+    enable_knowledge_insulation: bool = False
+    freeze_embedding: bool = True
+    train_action_expert_only: bool = False
+    gradient_checkpointing: bool = False
+
+    model_dtype: str = "bfloat16"
+    softmax_auxiliary_loss: bool = True
+    softmax_auxiliary_loss_scale: float = 1e-4
+    discrete_loss_token_weighting: str = "root_subsegments_root_tokens"
+
+    optimizer_lr: float = 1e-5
+    optimizer_vit_lr: float = 5e-6
+    optimizer_connector_lr: float = 5e-6
+    optimizer_action_expert_lr: float = 5e-5
+    optimizer_betas: tuple[float, float] = (0.9, 0.95)
+    optimizer_eps: float = 1e-6
+    optimizer_weight_decay: float = 0.0
+    optimizer_grad_clip_norm: float = 1.0
+
+    scheduler_warmup_steps: int = 200
+    scheduler_decay_steps: int | None = None
+    scheduler_decay_lr: float = 1e-6
+
+    normalization_mapping: dict[str, NormalizationMode] = field(
+        default_factory=lambda: {
+            "VISUAL": NormalizationMode.IDENTITY,
+            "STATE": NormalizationMode.QUANTILES,
+            "ACTION": NormalizationMode.QUANTILES,
+        }
+    )
+
+    input_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    output_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    dataset_feature_names: dict[str, Any] = field(default_factory=dict)
+
+    def __post_init__(self) -> None:
+        super().__post_init__()
+        if self.action_mode not in {"continuous", "discrete", "both"}:
+            raise ValueError(
+                f"Unsupported action_mode={self.action_mode!r}. "
+                "Expected one of {'continuous', 'discrete', 'both'}."
+            )
+        if self.inference_action_mode not in {None, "continuous", "discrete"}:
+            raise ValueError(
+                f"Unsupported inference_action_mode={self.inference_action_mode!r}. "
+                "Expected one of {None, 'continuous', 'discrete'}."
+            )
+        if self.inference_action_mode == "continuous" and self.action_mode == "discrete":
+            raise ValueError("MolmoAct2 action_mode='discrete' cannot run continuous inference.")
+        if self.inference_action_mode == "discrete" and self.action_mode == "continuous":
+            raise ValueError("MolmoAct2 action_mode='continuous' cannot run discrete inference.")
+        if self.train_action_expert_only and self.action_mode != "continuous":
+            raise ValueError("MolmoAct2 train_action_expert_only requires action_mode='continuous'.")
+        if self.train_action_expert_only and self.enable_lora_vlm:
+            raise ValueError("MolmoAct2 train_action_expert_only is incompatible with enable_lora_vlm.")
+        if self.enable_lora_action_expert and not self.enable_lora_vlm:
+            raise ValueError("MolmoAct2 enable_lora_action_expert requires enable_lora_vlm.")
+        if self.chunk_size < 1:
+            raise ValueError(f"chunk_size must be >= 1, got {self.chunk_size}.")
+        if self.n_action_steps < 1:
+            raise ValueError(f"n_action_steps must be >= 1, got {self.n_action_steps}.")
+        if self.n_action_steps > self.chunk_size:
+            raise ValueError(
+                f"n_action_steps ({self.n_action_steps}) cannot exceed chunk_size ({self.chunk_size})."
+            )
+        if self.expected_max_action_dim != 32:
+            raise ValueError("MolmoAct2 released checkpoints use expected_max_action_dim=32.")
+        if self.model_dtype not in {"float32", "bfloat16", "float16"}:
+            raise ValueError(
+                f"Unsupported model_dtype={self.model_dtype!r}. Expected 'float32', 'bfloat16', or 'float16'."
+            )
+        if self.lora_rank < 1:
+            raise ValueError(f"lora_rank must be >= 1, got {self.lora_rank}.")
+        if self.lora_alpha < 1:
+            raise ValueError(f"lora_alpha must be >= 1, got {self.lora_alpha}.")
+        if not 0 <= self.lora_dropout <= 1:
+            raise ValueError(f"lora_dropout must be in [0, 1], got {self.lora_dropout}.")
+        if self.lora_bias not in {"none", "all", "lora_only"}:
+            raise ValueError(
+                f"Unsupported lora_bias={self.lora_bias!r}. Expected one of 'none', 'all', or 'lora_only'."
+            )
+        if self.discrete_loss_token_weighting not in {
+            "none",
+            "token",
+            "root_tokens",
+            "root_subsegments",
+            "root_subsegments_root_tokens",
+        }:
+            raise ValueError(
+                f"Unsupported discrete_loss_token_weighting={self.discrete_loss_token_weighting!r}."
+            )
+        if self.discrete_generation_max_steps is not None and self.discrete_generation_max_steps < 1:
+            raise ValueError(
+                f"discrete_generation_max_steps must be >= 1 or None, got {self.discrete_generation_max_steps}."
+            )
+        if self.max_sequence_length is not None and self.max_sequence_length < 1:
+            raise ValueError(f"max_sequence_length must be >= 1 or None, got {self.max_sequence_length}.")
+
+    def inferred_max_sequence_length(
+        self,
+        *,
+        num_images: int | None = None,
+        state_dim: int | None = None,
+        action_dim: int | None = None,
+        action_horizon: int | None = None,
+        include_discrete_action: bool | None = None,
+    ) -> int:
+        if self.max_sequence_length is not None:
+            return int(self.max_sequence_length)
+
+        if num_images is None:
+            num_images = len(self.image_keys) or len(self.image_features) or MOLMOACT2_DEFAULT_NUM_IMAGES
+        if state_dim is None:
+            state_feature = self.robot_state_feature
+            state_dim = int(state_feature.shape[0]) if state_feature is not None else 0
+        if action_dim is None:
+            action_feature = self.action_feature
+            action_dim = (
+                int(action_feature.shape[0]) if action_feature is not None else self.expected_max_action_dim
+            )
+        if action_horizon is None:
+            action_horizon = self.chunk_size
+        if include_discrete_action is None:
+            include_discrete_action = self.action_mode in {"discrete", "both"}
+
+        return infer_molmoact2_max_sequence_length(
+            num_images=int(num_images),
+            state_dim=int(state_dim),
+            action_dim=int(action_dim),
+            action_horizon=int(action_horizon),
+            include_discrete_action=bool(include_discrete_action),
+        )
+
+    @property
+    def observation_delta_indices(self) -> None:
+        return None
+
+    @property
+    def action_delta_indices(self) -> list[int]:
+        return list(range(self.chunk_size))
+
+    @property
+    def reward_delta_indices(self) -> None:
+        return None
+
+    def get_optimizer_preset(self) -> OptimizerConfig:
+        return AdamWConfig(
+            lr=self.optimizer_lr,
+            betas=self.optimizer_betas,
+            eps=self.optimizer_eps,
+            weight_decay=self.optimizer_weight_decay,
+            grad_clip_norm=self.optimizer_grad_clip_norm,
+        )
+
+    def get_scheduler_preset(self) -> LRSchedulerConfig | None:
+        return MolmoAct2CosineDecayWithWarmupSchedulerConfig(
+            peak_lr=self.optimizer_lr,
+            decay_lr=self.scheduler_decay_lr,
+            num_warmup_steps=self.scheduler_warmup_steps,
+            num_decay_steps=self.scheduler_decay_steps,
+        )
+
+    def set_dataset_feature_metadata(self, features: dict[str, Any]) -> None:
+        self.dataset_feature_names = {}
+        for key in (ACTION, OBS_STATE):
+            feature = features.get(key) if isinstance(features, dict) else None
+            if isinstance(feature, dict) and feature.get("names") is not None:
+                self.dataset_feature_names[key] = feature["names"]
+
+    def validate_features(self) -> None:
+        """Validate and set up MolmoAct2 input and output features."""
+        image_features = [key for key, feat in self.input_features.items() if feat.type == FeatureType.VISUAL]
+        if not image_features:
+            raise ValueError(
+                "MolmoAct2 policy requires at least one visual input feature. "
+                "No features of type FeatureType.VISUAL found in input_features."
+            )
+
+        if OBS_STATE not in self.input_features:
+            state_feature = PolicyFeature(
+                type=FeatureType.STATE,
+                shape=(0,),
+            )
+            self.input_features[OBS_STATE] = state_feature
+
+        if ACTION not in self.output_features:
+            action_feature = PolicyFeature(
+                type=FeatureType.ACTION,
+                shape=(self.expected_max_action_dim,),
+            )
+            self.output_features[ACTION] = action_feature
+
+    def apply_norm_tag_metadata(self) -> None:
+        if not str(self.norm_tag or "").strip():
+            return
+        metadata = _load_hf_norm_metadata_for_tag(
+            self.checkpoint_path,
+            revision=self.checkpoint_revision,
+            force_download=bool(self.checkpoint_force_download),
+            norm_tag=self.norm_tag,
+        )
+        if metadata.get("action_horizon") is not None:
+            self.chunk_size = int(metadata["action_horizon"])
+        if metadata.get("n_action_steps") is not None:
+            self.n_action_steps = int(metadata["n_action_steps"])
+        if not self.setup_type and metadata.get("setup_type") is not None:
+            self.setup_type = str(metadata["setup_type"])
+        if not self.control_mode and metadata.get("control_mode") is not None:
+            self.control_mode = str(metadata["control_mode"])
+
+    def saved_policy_action_mode(self) -> str | None:
+        pretrained_path = getattr(self, "pretrained_path", None)
+        if pretrained_path is None:
+            return None
+        config_path = Path(pretrained_path) / "config.json"
+        if not config_path.exists():
+            return None
+        try:
+            mode = json.loads(config_path.read_text()).get("action_mode")
+        except (OSError, json.JSONDecodeError):
+            return None
+        if mode in {"continuous", "discrete", "both"}:
+            return str(mode)
+        return None
+
+    def training_action_mode(self, saved_policy_action_mode: str | None = None) -> str:
+        return saved_policy_action_mode or self.action_mode
+
+    def validate_inference_action_mode(self, saved_policy_action_mode: str | None = None) -> None:
+        requested_mode = self.inference_action_mode
+        if requested_mode is None:
+            return
+        training_mode = self.training_action_mode(saved_policy_action_mode)
+        if requested_mode == "continuous" and training_mode == "discrete":
+            raise ValueError(
+                "MolmoAct2 checkpoint was trained with action_mode='discrete' and cannot run "
+                "continuous inference."
+            )
+        if requested_mode == "discrete" and training_mode == "continuous":
+            raise ValueError(
+                "MolmoAct2 checkpoint was trained with action_mode='continuous' and cannot run "
+                "discrete inference. Train with action_mode='both' or action_mode='discrete' first."
+            )
+
+    def validate_checkpoint_action_mode(
+        self,
+        checkpoint_action_mode: str,
+        *,
+        has_action_expert: bool,
+    ) -> None:
+        if self.action_mode == "both" and checkpoint_action_mode != "both":
+            raise ValueError(
+                f"action_mode='both' requires checkpoint action_mode='both', got {checkpoint_action_mode!r}."
+            )
+        if self.action_mode == "discrete" and checkpoint_action_mode not in {"discrete", "both"}:
+            raise ValueError(
+                f"action_mode='discrete' requires checkpoint action_mode in {{'discrete', 'both'}}, "
+                f"got {checkpoint_action_mode!r}."
+            )
+        if self.action_mode in {"continuous", "both"} and not has_action_expert:
+            raise ValueError("Continuous MolmoAct2 training requires an action expert checkpoint.")
+
+    def resolve_inference_action_mode(
+        self,
+        requested_mode: str | None,
+        saved_policy_action_mode: str | None = None,
+    ) -> str:
+        training_mode = self.training_action_mode(saved_policy_action_mode)
+        if requested_mode is None:
+            requested_mode = self.inference_action_mode
+        if requested_mode is None:
+            raise ValueError(
+                "MolmoAct2 inference requires `inference_action_mode` to be set explicitly "
+                "to either 'continuous' or 'discrete'."
+            )
+        if requested_mode not in {"continuous", "discrete"}:
+            raise ValueError("MolmoAct2 inference_action_mode must be either 'continuous' or 'discrete'.")
+        if requested_mode == "continuous" and training_mode == "discrete":
+            raise ValueError("MolmoAct2 action_mode='discrete' checkpoint cannot run continuous inference.")
+        if requested_mode == "discrete" and training_mode == "continuous":
+            raise ValueError("MolmoAct2 action_mode='continuous' checkpoint cannot run discrete inference.")
+        return requested_mode
diff --git a/src/lerobot/policies/molmoact2/hf_model/__init__.py b/src/lerobot/policies/molmoact2/hf_model/__init__.py
new file mode 100644
index 000000000..39b15cb3a
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/hf_model/__init__.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
diff --git a/src/lerobot/policies/molmoact2/hf_model/action_tokenizer.py b/src/lerobot/policies/molmoact2/hf_model/action_tokenizer.py
new file mode 100644
index 000000000..f7dacbce6
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/hf_model/action_tokenizer.py
@@ -0,0 +1,237 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+import logging
+import os
+from pathlib import Path
+from typing import ClassVar
+
+import numpy as np
+from tokenizers import ByteLevelBPETokenizer
+from tokenizers.trainers import BpeTrainer
+from huggingface_hub import snapshot_download
+from transformers import PreTrainedTokenizerFast
+from transformers.processing_utils import ProcessorMixin
+
+
+def _hf_token() -> str | None:
+    return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
+
+
+def _resolve_tokenizer_location(
+    tokenizer_path: str,
+    *,
+    revision: str | None = None,
+    force_download: bool = False,
+) -> str:
+    local_path = Path(str(tokenizer_path)).expanduser()
+    if local_path.exists():
+        return str(local_path)
+    return snapshot_download(
+        repo_id=str(tokenizer_path),
+        repo_type="model",
+        revision=revision,
+        force_download=force_download,
+        ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
+        token=_hf_token(),
+    )
+
+
+class UniversalActionProcessor(ProcessorMixin):
+    attributes: ClassVar[list[str]] = ["tokenizer"]
+    tokenizer_class: str = "AutoTokenizer"
+
+    def __init__(
+        self,
+        tokenizer: PreTrainedTokenizerFast,
+        scale: float = 10,
+        vocab_size: int = 1024,
+        min_token: int = 0,
+        *,
+        action_dim: int | None = None,
+        time_horizon: int | None = None,
+    ):
+        self.scale = scale
+        self.vocab_size = vocab_size
+        self.min_token = min_token
+
+        # Action horizon and dimension needed during decoding. These can be specified
+        # in three ways (in order of priority):
+        # 1. passed in as kwargs to decode()
+        # 2. in the constructor
+        # 3. cached from the last time decode() was called
+        self.time_horizon = time_horizon
+        self.action_dim = action_dim
+        self.called_time_horizon = time_horizon
+        self.called_action_dim = action_dim
+
+        super().__init__(tokenizer)
+        self.bpe_tokenizer = self.tokenizer
+
+    def __call__(self, action_chunk: np.ndarray) -> list[list[int]]:
+        from scipy.fft import dct
+
+        assert action_chunk.ndim <= 3, "At most 3 dimensions supported: [batch, timesteps, action_dim]"
+        if action_chunk.ndim == 2:
+            action_chunk = action_chunk[None, ...]
+
+        # Cache the time horizon and action dimension for decoding
+        self.called_time_horizon = action_chunk.shape[-2]
+        self.called_action_dim = action_chunk.shape[-1]
+
+        dct_coeff = dct(action_chunk, axis=1, norm="ortho")
+        dct_coeff = np.around(dct_coeff * self.scale)
+        tokens = []
+        for elem in dct_coeff:
+            token_str = "".join(map(chr, np.maximum(elem.flatten() - self.min_token, 0).astype(int)))
+            tokens.append(self.bpe_tokenizer(token_str)["input_ids"])
+        return tokens
+
+    def decode(
+        self,
+        tokens: list[list[int]],
+        *,
+        time_horizon: int | None = None,
+        action_dim: int | None = None,
+    ) -> np.ndarray:
+        from scipy.fft import idct
+
+        self.time_horizon = time_horizon or self.time_horizon or self.called_time_horizon
+        self.action_dim = action_dim or self.action_dim or self.called_action_dim
+
+        # Cache the time horizon and action dimension for the next call
+        self.called_time_horizon = self.time_horizon
+        self.called_action_dim = self.action_dim
+
+        assert self.time_horizon is not None and self.action_dim is not None, (
+            "Tokenizer not initialized: call the processor on an action chunk once, or pass in time_horizon and action_dim."
+        )
+
+        decoded_actions = []
+        for token in tokens:
+            try:
+                decoded_tokens = self.bpe_tokenizer.decode(token)
+                decoded_dct_coeff = np.array(list(map(ord, decoded_tokens))) + self.min_token
+                decoded_dct_coeff = decoded_dct_coeff.reshape(-1, self.action_dim)
+                assert decoded_dct_coeff.shape == (
+                    self.time_horizon,
+                    self.action_dim,
+                ), (
+                    f"Decoded DCT coefficients have shape {decoded_dct_coeff.shape}, expected ({self.time_horizon}, {self.action_dim})"
+                )
+            except Exception as e:
+                print(f"Error decoding tokens: {e}")
+                print(f"Tokens: {token}")
+                decoded_dct_coeff = np.zeros((self.time_horizon, self.action_dim))
+            decoded_actions.append(idct(decoded_dct_coeff / self.scale, axis=0, norm="ortho"))
+        return np.stack(decoded_actions)
+
+    @classmethod
+    def fit(
+        cls,
+        action_data: list[np.ndarray],
+        scale: float = 10,
+        vocab_size: int = 1024,
+        *,
+        time_horizon: int | None = None,
+        action_dim: int | None = None,
+    ) -> "UniversalActionProcessor":
+        from scipy.fft import dct
+
+        # Run DCT over all inputs
+        dct_tokens = [dct(a, axis=0, norm="ortho").flatten() for a in action_data]
+
+        # Quantize and find min token
+        max_token = int(np.around(np.concatenate(dct_tokens) * scale).max())
+        min_token = int(np.around(np.concatenate(dct_tokens) * scale).min())
+        min_vocab_size = max_token - min_token
+
+        assert min_vocab_size <= vocab_size, (
+            f"Vocab size {vocab_size} is too small for the range of tokens {min_vocab_size}"
+        )
+        if min_vocab_size + 100 > vocab_size:
+            logging.warning(
+                f"Initial alphabet size {min_vocab_size} is almost as large as the vocab "
+                f"size {vocab_size}, consider increasing vocab size"
+            )
+
+        # Make token iterator for BPE training
+        def _token_iter():
+            for tokens in dct_tokens:
+                rounded_tokens = np.around(tokens * scale) - min_token
+                rounded_tokens = rounded_tokens.astype(int)
+                string = "".join(map(chr, rounded_tokens))
+                yield string
+
+        # Train BPE tokenizer
+        bpe = ByteLevelBPETokenizer()
+
+        # Set up the entire range of possible tokens as the initial alphabet
+        alphabet = [chr(i) for i in range(max_token - min_token + 1)]
+        trainer = BpeTrainer(
+            vocab_size=vocab_size,
+            min_frequency=2,
+            show_progress=True,
+            special_tokens=[],
+            initial_alphabet=alphabet,
+            max_token_length=10000,
+        )
+
+        # Train the inner tokenizer (don't use ByteLevelBPETokenizer.train_from_iterator()
+        # because it doesn't support custom alphabets)
+        bpe._tokenizer.train_from_iterator(_token_iter(), trainer=trainer)
+
+        return cls(
+            PreTrainedTokenizerFast(tokenizer_object=bpe, clean_up_tokenization_spaces=False),
+            scale=scale,
+            vocab_size=vocab_size,
+            min_token=min_token,
+            time_horizon=time_horizon,
+            action_dim=action_dim,
+        )
+
+    @classmethod
+    def from_pretrained_local(
+        cls,
+        pretrained_model_name_or_path: str,
+        *,
+        revision: str | None = None,
+        force_download: bool = False,
+    ) -> "UniversalActionProcessor":
+        location = Path(
+            _resolve_tokenizer_location(
+                pretrained_model_name_or_path,
+                revision=revision,
+                force_download=force_download,
+            )
+        )
+        processor_config = {}
+        processor_config_path = location / "processor_config.json"
+        if processor_config_path.exists():
+            import json
+
+            processor_config = json.loads(processor_config_path.read_text())
+        tokenizer = PreTrainedTokenizerFast.from_pretrained(str(location))
+        return cls(
+            tokenizer,
+            scale=processor_config.get("scale", 10),
+            vocab_size=processor_config.get("vocab_size", 1024),
+            min_token=processor_config.get("min_token", 0),
+            action_dim=processor_config.get("action_dim"),
+            time_horizon=processor_config.get("time_horizon"),
+        )
diff --git a/src/lerobot/policies/molmoact2/hf_model/configuration_molmoact2.py b/src/lerobot/policies/molmoact2/hf_model/configuration_molmoact2.py
new file mode 100644
index 000000000..29da68c14
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/hf_model/configuration_molmoact2.py
@@ -0,0 +1,553 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""
+MolmoAct2 configuration
+"""
+
+from typing import Optional, Any
+
+from transformers import PretrainedConfig
+from transformers.modeling_rope_utils import rope_config_validation
+from transformers.utils import logging
+
+logger = logging.get_logger(__name__)
+
+
+class MolmoAct2VitConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`MolmoAct2VisionTransformer`].
+    It is used to instantiate a `MolmoAct2VisionTransformer` according to the specified arguments,
+    defining the model architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Example:
+    ```python
+    >>> from transformers import MolmoAct2VitConfig, MolmoAct2VisionTransformer
+
+    >>> # Initializing a MolmoAct2VitConfig
+    >>> configuration = MolmoAct2VitConfig()
+
+    >>> # Initializing a MolmoAct2VisionTransformer (with random weights)
+    >>> model = MolmoAct2VisionTransformer(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "molmoact2"
+    base_config_key = "vit_config"
+
+    def __init__(
+        self,
+        hidden_size: int = 1152,
+        intermediate_size: int = 4304,
+        num_hidden_layers: int = 27,
+        num_attention_heads: int = 16,
+        num_key_value_heads: int = 16,
+        head_dim: int = 72,
+        hidden_act: str = "gelu_pytorch_tanh",
+        layer_norm_eps: float = 1e-6,
+        image_default_input_size: tuple[int, int] = (378, 378),
+        image_patch_size: int = 14,
+        image_num_pos: int = 577,
+        attention_dropout: float = 0.0,
+        residual_dropout: float = 0.0,
+        initializer_range: float = 0.02,
+        float32_attention: bool = True,
+        attn_implementation: str = "eager",
+        **kwargs,
+    ):
+        self.attn_implementation = attn_implementation
+        super().__init__(attn_implementation=attn_implementation, **kwargs)
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.hidden_act = hidden_act
+        self.layer_norm_eps = layer_norm_eps
+        self.image_default_input_size = image_default_input_size
+        self.image_patch_size = image_patch_size
+        self.image_num_pos = image_num_pos
+        self.attention_dropout = attention_dropout
+        self.residual_dropout = residual_dropout
+        self.initializer_range = initializer_range
+        self.float32_attention = float32_attention
+
+    @property
+    def image_num_patch(self):
+        h, w = self.image_default_input_size
+        return h // self.image_patch_size, w // self.image_patch_size
+
+
+class MolmoAct2AdapterConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of MolmoAct2Adapter. Together with MolmoAct2VitConfig,
+    it is used to instantiate a MolmoAct2VisionBackbone according to the specified arguments,
+    defining the model architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Example:
+
+    ```python
+    >>> from transformers import MolmoAct2VitConfig, MolmoAct2AdapterConfig, MolmoAct2VisionBackbone
+
+    >>> # Initializing a MolmoAct2VitConfig and a MolmoAct2AdapterConfig
+    >>> vit_config = MolmoAct2VitConfig()
+    >>> adapter_config = MolmoAct2AdapterConfig()
+
+    >>> # Initializing a MolmoAct2VisionBackbone (with random weights)
+    >>> model = MolmoAct2VisionBackbone(vit_config, adapter_config)
+
+    >>> # Accessing the model configuration
+    >>> vit_configuration = model.vit_config
+    >>> adapter_configuration = model.adapter_config
+    ```"""
+
+    model_type = "molmoact2"
+    base_config_key = "adapter_config"
+
+    def __init__(
+        self,
+        vit_layers: tuple = (-3, -9),
+        pooling_attention_mask: bool = False,
+        hidden_size: int = 1152,
+        num_attention_heads: int = 16,
+        num_key_value_heads: int = 16,
+        head_dim: int = 72,
+        float32_attention: bool = True,
+        attention_dropout: float = 0.0,
+        residual_dropout: float = 0.0,
+        hidden_act: str = "silu",
+        intermediate_size: int = 18944,
+        text_hidden_size: int = 3584,
+        image_feature_dropout: float = 0.0,
+        initializer_range: float = 0.02,
+        attn_implementation: str = "eager",
+        **kwargs,
+    ):
+        self.attn_implementation = attn_implementation
+        super().__init__(attn_implementation=attn_implementation, **kwargs)
+        self.vit_layers = vit_layers
+        self.pooling_attention_mask = pooling_attention_mask
+        self.hidden_size = hidden_size
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.float32_attention = float32_attention
+        self.attention_dropout = attention_dropout
+        self.residual_dropout = residual_dropout
+        self.hidden_act = hidden_act
+        self.intermediate_size = intermediate_size
+        self.text_hidden_size = text_hidden_size
+        self.image_feature_dropout = image_feature_dropout
+        self.initializer_range = initializer_range
+
+
+class MolmoAct2TextConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`MolmoAct2TextModel`]. It is used to instantiate a
+    `MolmoAct2TextModel` according to the specified arguments, defining the model architecture.
+
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+
+    Example:
+    ```python
+    >>> from transformers import MolmoAct2TextConfig, MolmoAct2TextModel
+
+    >>> # Initializing a MolmoAct2TextConfig
+    >>> configuration = MolmoAct2TextConfig()
+
+    >>> # Initializing a MolmoAct2TextModel (with random weights)
+    >>> model = MolmoAct2TextModel(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "molmoact2_text"
+    base_config_key = "text_config"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    base_model_tp_plan = {
+        "blocks.*.self_attn.att_proj": "colwise",
+        "blocks.*.self_attn.attn_out": "rowwise",
+        "blocks.*.mlp.ff_proj": "colwise",
+        "blocks.*.mlp.ff_out": "rowwise",
+    }
+    base_model_pp_plan = {
+        "wte": (["input_ids"], ["inputs_embeds"]),
+        "blocks": (["hidden_states", "attention_mask"], ["hidden_states"]),
+        "ln_f": (["hidden_states"], ["hidden_states"]),
+    }
+
+    def __init__(
+        self,
+        hidden_size: int = 3584,
+        num_attention_heads: int = 28,
+        num_key_value_heads: int | None = 4,
+        head_dim: int = 128,
+        vocab_size: int = 152064,
+        additional_vocab_size: int = 128,
+        qkv_bias: bool = True,
+        num_hidden_layers: int = 48,
+        intermediate_size: int = 18944,
+        hidden_act: str = "silu",
+        embedding_dropout: float = 0.0,
+        attention_dropout: float = 0.0,
+        residual_dropout: float = 0.0,
+        max_position_embeddings: int = 4096,
+        rope_theta: float = 1000000.0,
+        rope_scaling: dict[str, Any] | None = None,
+        rope_scaling_layers: list[int] | None = None,
+        use_qk_norm: bool = False,
+        qk_norm_type: str = "olmo",
+        layer_norm_eps: float = 1e-6,
+        norm_after: bool = False,
+        initializer_range: float = 0.02,
+        use_cache=True,
+        tie_word_embeddings=False,
+        attn_implementation: str = "eager",
+        **kwargs,
+    ):
+        self.attn_implementation = attn_implementation
+        super().__init__(
+            tie_word_embeddings=tie_word_embeddings, attn_implementation=attn_implementation, **kwargs
+        )
+        self.hidden_size = hidden_size
+        self.num_attention_heads = num_attention_heads
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim
+        self.vocab_size = vocab_size
+        self.additional_vocab_size = additional_vocab_size
+        self.qkv_bias = qkv_bias
+        self.num_hidden_layers = num_hidden_layers
+        self.intermediate_size = intermediate_size
+        self.hidden_act = hidden_act
+        self.embedding_dropout = embedding_dropout
+        self.attention_dropout = attention_dropout
+        self.residual_dropout = residual_dropout
+        self.max_position_embeddings = max_position_embeddings
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.rope_scaling_layers = rope_scaling_layers
+        self.use_qk_norm = use_qk_norm
+        self.qk_norm_type = qk_norm_type
+        self.layer_norm_eps = layer_norm_eps
+        self.norm_after = norm_after
+        self.initializer_range = initializer_range
+        self.use_cache = use_cache
+
+        # Validate the correctness of rotary position embeddings parameters
+        rope_config_validation(self)
+
+
+class MolmoAct2ActionExpertConfig(PretrainedConfig):
+    r"""Configuration for the MolmoAct2 modern action expert."""
+
+    model_type = "molmoact2_action_expert"
+    base_config_key = "action_expert_config"
+
+    def __init__(
+        self,
+        max_action_horizon: int = 32,
+        max_action_dim: int = 32,
+        hidden_size: int = 1024,
+        num_layers: int = 32,
+        num_heads: int = 16,
+        mlp_ratio: float = 8.0 / 3.0,
+        ffn_multiple_of: int = 256,
+        timestep_embed_dim: int = 256,
+        dropout: float = 0.0,
+        attn_dropout: float = 0.0,
+        context_layer_norm: bool = True,
+        qk_norm: bool = True,
+        qk_norm_eps: float = 1e-6,
+        rope: bool = True,
+        causal_attn: bool = False,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.max_action_horizon = max_action_horizon
+        self.max_action_dim = max_action_dim
+        self.hidden_size = hidden_size
+        self.num_layers = num_layers
+        self.num_heads = num_heads
+        self.mlp_ratio = mlp_ratio
+        self.ffn_multiple_of = ffn_multiple_of
+        self.timestep_embed_dim = timestep_embed_dim
+        self.dropout = dropout
+        self.attn_dropout = attn_dropout
+        self.context_layer_norm = context_layer_norm
+        self.qk_norm = qk_norm
+        self.qk_norm_eps = qk_norm_eps
+        self.rope = rope
+        self.causal_attn = causal_attn
+
+    def to_dict(self):
+        output = super().to_dict()
+        # These are derived from the parent MolmoAct2Config for HF exports. Keeping
+        # them out of the public nested config avoids duplicated sources of truth.
+        output.pop("max_action_horizon", None)
+        output.pop("max_action_dim", None)
+        return output
+
+
+class MolmoAct2Config(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`MolmoAct2ForConditionalGeneration`].
+    It is used to instantiate a MolmoAct2 model according to the specified arguments, defining the model architecture.
+
+    Example:
+
+    ```python
+    >>> from transformers import MolmoAct2Config, MolmoAct2VitConfig, MolmoAct2AdapterConfig, MolmoAct2TextConfig, MolmoAct2ForConditionalGeneration
+
+    >>> # Initializing a MolmoAct2VitConfig
+    >>> vit_config = MolmoAct2VitConfig()
+
+    >>> # Initializing a MolmoAct2AdapterConfig
+    >>> adapter_config = MolmoAct2AdapterConfig()
+
+    >>> # Initializing a MolmoAct2TextConfig
+    >>> text_config = MolmoAct2TextConfig()
+
+    >>> # Initializing a MolmoAct2Config
+    >>> configuration = MolmoAct2Config(
+    ...     vit_config=vit_config,
+    ...     adapter_config=adapter_config,
+    ...     text_config=text_config,
+    ...     image_start_token_id=151936,
+    ...     image_end_token_id=151937,
+    ...     image_patch_id=151938,
+    ...     image_col_id=151939,
+    ...     low_res_image_start_token_id=151940,
+    ...     image_low_res_id=151942,
+    ...     frame_start_token_id=151943,
+    ...     frame_end_token_id=151944,
+    ... )
+
+    >>> # Initializing a model
+    >>> model = MolmoAct2ForConditionalGeneration(configuration)
+
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+
+    model_type = "molmoact2"
+    sub_configs = {
+        "text_config": MolmoAct2TextConfig,
+        "vit_config": MolmoAct2VitConfig,
+        "adapter_config": MolmoAct2AdapterConfig,
+        "action_expert_config": MolmoAct2ActionExpertConfig,
+    }
+
+    def __init__(
+        self,
+        vit_config: MolmoAct2VitConfig = None,
+        adapter_config: MolmoAct2AdapterConfig = None,
+        text_config: MolmoAct2TextConfig = None,
+        action_expert_config: MolmoAct2ActionExpertConfig = None,
+        image_start_token_id: int = None,
+        low_res_image_start_token_id: int = None,
+        image_end_token_id: int = None,
+        image_low_res_id: int = None,
+        image_patch_id: int = None,
+        image_col_id: int = None,
+        frame_start_token_id: int = None,
+        frame_end_token_id: int = None,
+        use_frame_special_tokens: bool = True,
+        initializer_range: float = 0.02,
+        add_action_expert: bool = True,
+        max_action_dim: int = 32,
+        max_action_horizon: int = 30,
+        n_obs_steps: int = 30,
+        action_mode: str = "both",
+        state_format: str = "discrete",
+        flow_matching_num_steps: int = 10,
+        flow_matching_cutoff: float = 1.0,
+        flow_matching_time_offset: float = 0.001,
+        flow_matching_time_scale: float = 0.999,
+        flow_matching_beta_alpha: float = 1.0,
+        flow_matching_beta_beta: float = 1.5,
+        mask_action_dim_padding: bool = True,
+        enable_depth_reasoning: bool = False,
+        depth_mode: int = 2,
+        num_depth_codes: int = 100,
+        action_expert_depth_gate: bool = False,
+        action_expert_depth_gate_per_layer: bool = False,
+        action_expert_depth_gate_init_bias: float = -4.0,
+        action_output_token_id: int = None,
+        action_start_token_id: int = None,
+        action_end_token_id: int = None,
+        action_token_start_id: int = None,
+        num_action_tokens: int = 0,
+        depth_output_token_id: int = None,
+        depth_start_token_id: int = None,
+        depth_end_token_id: int = None,
+        depth_token_start_id: int = None,
+        num_depth_tokens: int = 0,
+        state_start_token_id: int = None,
+        state_end_token_id: int = None,
+        state_token_start_id: int = None,
+        num_state_tokens: int = 0,
+        add_setup_tokens: bool = True,
+        add_control_tokens: bool = True,
+        norm_stats_filename: str = "norm_stats.json",
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        if vit_config is None:
+            self.vit_config = MolmoAct2VitConfig()
+        elif isinstance(vit_config, dict):
+            self.vit_config = MolmoAct2VitConfig(**vit_config)
+        else:
+            self.vit_config = vit_config
+        if adapter_config is None:
+            self.adapter_config = MolmoAct2AdapterConfig()
+        elif isinstance(adapter_config, dict):
+            self.adapter_config = MolmoAct2AdapterConfig(**adapter_config)
+        else:
+            self.adapter_config = adapter_config
+        if text_config is None:
+            self.text_config = MolmoAct2TextConfig()
+        elif isinstance(text_config, dict):
+            self.text_config = MolmoAct2TextConfig(**text_config)
+        else:
+            self.text_config = text_config
+        self.add_action_expert = bool(add_action_expert)
+        if not self.add_action_expert:
+            self.action_expert_config = None
+        elif action_expert_config is None:
+            self.action_expert_config = MolmoAct2ActionExpertConfig(
+                max_action_horizon=max_action_horizon,
+                max_action_dim=max_action_dim,
+                num_layers=self.text_config.num_hidden_layers,
+            )
+        elif isinstance(action_expert_config, dict):
+            self.action_expert_config = MolmoAct2ActionExpertConfig(**action_expert_config)
+        else:
+            self.action_expert_config = action_expert_config
+        if self.add_action_expert:
+            self.action_expert_config.max_action_dim = int(max_action_dim)
+            self.action_expert_config.max_action_horizon = int(max_action_horizon)
+            self._validate_release_action_config(
+                state_format=state_format,
+            )
+        self.image_start_token_id = image_start_token_id
+        self.low_res_image_start_token_id = low_res_image_start_token_id
+        self.image_end_token_id = image_end_token_id
+        self.image_low_res_id = image_low_res_id
+        self.image_high_res_id = image_patch_id
+        self.image_patch_id = image_patch_id
+        self.image_col_id = image_col_id
+        self.frame_start_token_id = frame_start_token_id
+        self.frame_end_token_id = frame_end_token_id
+        self.use_frame_special_tokens = use_frame_special_tokens
+        self.initializer_range = initializer_range
+        self.max_action_dim = max_action_dim
+        self.max_action_horizon = max_action_horizon
+        self.n_obs_steps = n_obs_steps
+        self.action_mode = action_mode
+        self.state_format = state_format
+        self.flow_matching_num_steps = flow_matching_num_steps
+        self.flow_matching_cutoff = flow_matching_cutoff
+        self.flow_matching_time_offset = flow_matching_time_offset
+        self.flow_matching_time_scale = flow_matching_time_scale
+        self.flow_matching_beta_alpha = flow_matching_beta_alpha
+        self.flow_matching_beta_beta = flow_matching_beta_beta
+        self.mask_action_dim_padding = mask_action_dim_padding
+        self.enable_depth_reasoning = enable_depth_reasoning
+        self.depth_mode = depth_mode
+        self.num_depth_codes = num_depth_codes
+        self.action_expert_depth_gate = action_expert_depth_gate
+        self.action_expert_depth_gate_per_layer = action_expert_depth_gate_per_layer
+        self.action_expert_depth_gate_init_bias = action_expert_depth_gate_init_bias
+        self.action_output_token_id = action_output_token_id
+        self.action_start_token_id = action_start_token_id
+        self.action_end_token_id = action_end_token_id
+        self.action_token_start_id = action_token_start_id
+        self.num_action_tokens = num_action_tokens
+        self.depth_output_token_id = depth_output_token_id
+        self.depth_start_token_id = depth_start_token_id
+        self.depth_end_token_id = depth_end_token_id
+        self.depth_token_start_id = depth_token_start_id
+        self.num_depth_tokens = num_depth_tokens
+        self.state_start_token_id = state_start_token_id
+        self.state_end_token_id = state_end_token_id
+        self.state_token_start_id = state_token_start_id
+        self.num_state_tokens = num_state_tokens
+        self.add_setup_tokens = add_setup_tokens
+        self.add_control_tokens = add_control_tokens
+        self.norm_stats_filename = norm_stats_filename
+
+    @staticmethod
+    def _validate_release_action_config(
+        *,
+        state_format: str,
+    ) -> None:
+        if state_format != "discrete":
+            raise ValueError("MolmoAct2 HF export supports only state_format='discrete'.")
+
+    @property
+    def image_num_patch(self):
+        assert self.vit_config is not None
+        return self.vit_config.image_num_patch
+
+    @property
+    def num_attention_heads(self):
+        return self.text_config.num_attention_heads
+
+    @property
+    def num_key_value_heads(self):
+        return self.text_config.num_key_value_heads
+
+    @property
+    def head_dim(self):
+        return self.text_config.head_dim
+
+    @property
+    def num_hidden_layers(self):
+        return self.text_config.num_hidden_layers
+
+    @property
+    def hidden_size(self):
+        return self.text_config.hidden_size
+
+    @property
+    def vocab_size(self):
+        return self.text_config.vocab_size
+
+    @property
+    def max_position_embeddings(self):
+        return self.text_config.max_position_embeddings
+
+
+MolmoAct2VitConfig.register_for_auto_class()
+MolmoAct2AdapterConfig.register_for_auto_class()
+MolmoAct2TextConfig.register_for_auto_class()
+MolmoAct2ActionExpertConfig.register_for_auto_class()
+MolmoAct2Config.register_for_auto_class()
diff --git a/src/lerobot/policies/molmoact2/hf_model/image_processing_molmoact2.py b/src/lerobot/policies/molmoact2/hf_model/image_processing_molmoact2.py
new file mode 100644
index 000000000..a172c8477
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/hf_model/image_processing_molmoact2.py
@@ -0,0 +1,564 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""Image processor class for MolmoAct2"""
+
+from typing import Optional, Union
+import numpy as np
+import einops
+import torch
+import torchvision.transforms
+
+from transformers.image_utils import (
+    IMAGENET_STANDARD_MEAN,
+    IMAGENET_STANDARD_STD,
+    ImageInput,
+    PILImageResampling,
+    make_flat_list_of_images,
+    valid_images,
+    to_numpy_array,
+)
+from transformers.image_transforms import convert_to_rgb
+from transformers.processing_utils import ImagesKwargs
+from transformers.image_processing_utils import BaseImageProcessor, get_size_dict
+from transformers.utils import logging
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.utils import TensorType
+
+
+logger = logging.get_logger(__name__)
+
+
+def normalize_image(
+    image: np.ndarray,
+    image_mean: list[float],
+    image_std: list[float],
+) -> np.ndarray:
+    if np.allclose(image_mean, [0.5, 0.5, 0.5]) and np.allclose(image_std, [0.5, 0.5, 0.5]):
+        return image * np.asarray(2.0, dtype=np.float32) - np.asarray(1.0, dtype=np.float32)
+    image -= np.array(image_mean, dtype=np.float32)[None, None, :]
+    image /= np.array(image_std, dtype=np.float32)[None, None, :]
+    return image
+
+
+def resize_image(
+    image: np.ndarray,
+    desired_output_size: list[int],
+    resample: PILImageResampling,
+) -> np.ndarray:
+    image = torch.permute(torch.from_numpy(image), [2, 0, 1])
+    dtype = image.dtype
+    if torch.is_floating_point(image):
+        in_min = 0.0
+        in_max = 1.0
+        resized = torchvision.transforms.Resize(
+            desired_output_size,
+            resample,
+            antialias=False,
+        )(image)
+        resized = torch.clip(resized, 0.0, 1.0).to(dtype)
+    else:
+        assert image.dtype == torch.uint8, "SigLIP expects float images or uint8 images, but got {}".format(
+            image.dtype
+        )
+        in_min = 0.0
+        in_max = 255.0
+        resized = torchvision.transforms.Resize(
+            desired_output_size,
+            resample,
+            antialias=False,
+        )(image)
+        resized = torch.clip(resized, 0, 255).to(dtype)
+
+    resized = resized.to(torch.float32)
+    resized = (resized - in_min) / (in_max - in_min)
+
+    resized = torch.permute(resized, [1, 2, 0]).numpy()
+
+    return resized
+
+
+def select_tiling(h, w, patch_size, max_num_crops):
+    """Divide an image of size [h, w] into up to max_num_crops tiles of size patch_size"""
+    original_size = np.stack([h, w])  # [1, 2]
+    original_res = h * w
+    tilings = []
+    for i in range(1, max_num_crops + 1):
+        for j in range(1, max_num_crops + 1):
+            if i * j <= max_num_crops:
+                tilings.append((i, j))
+    # sort so argmin and argmax favour smaller tilings in the event of a tie
+    tilings.sort(key=lambda x: (x[0] * x[1], x[0]))
+    candidate_tilings = np.array(tilings, dtype=np.int32)  # [n_resolutions, 2]
+    candidate_resolutions = candidate_tilings * patch_size  # [n_resolutions, 2]
+
+    # How much we would need to scale the image to fit exactly in each tiling
+    original_size = np.stack([h, w], dtype=np.float32)  # [1, 2]
+
+    # The original size can be zero in rare cases if the image is smaller than the margin
+    # In those cases letting the scale become infinite means the tiling is based on the
+    # other side, or falls back to the smallest tiling
+    with np.errstate(divide="ignore"):
+        required_scale_d = candidate_resolutions.astype(np.float32) / original_size  # [n_resolutions, 2]
+    required_scale = np.min(required_scale_d, axis=-1, keepdims=True)  # [n_resolutions, 1]
+    if np.all(required_scale < 1):
+        # We are forced to downscale, so try to minimize the amount of downscaling
+        ix = np.argmax(required_scale)
+    else:
+        # Pick the resolution that required the least upscaling so that it most closely fits the image
+        required_scale = np.where(required_scale < 1.0, 10e9, required_scale)
+        ix = np.argmin(required_scale)
+    return candidate_tilings[ix]
+
+
+def build_resized_image(
+    image: np.ndarray,
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+) -> tuple[np.ndarray, np.ndarray]:
+    resized = resize_image(
+        image,
+        base_image_input_size,
+        resample,
+    )
+    resized = normalize_image(resized, image_mean, image_std)
+    if len(resized.shape) == 3:
+        resized = np.expand_dims(resized, 0)
+    crop_patch_w = base_image_input_size[1] // image_patch_size
+    crop_patch_h = base_image_input_size[0] // image_patch_size
+    resize_idx = np.arange(crop_patch_w * crop_patch_h).reshape([crop_patch_h, crop_patch_w])
+    return resized, resize_idx
+
+
+def build_overlapping_crops(
+    image: np.ndarray,
+    max_crops: int,
+    overlap_margins: list[int],
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+) -> tuple[np.ndarray, np.ndarray]:
+    """Decompose an image into a set of overlapping crops
+
+    :return crop_arr: [n_crops, h, w, 3] The crops
+    :return patch_idx: [overlap_patch_h, overlap_patch_w] For each patch in the resized image
+                        the crops were extracted from, what patch in `crop_arr` it corresponds to
+    """
+    original_image_h, original_image_w = image.shape[:2]
+    crop_size = base_image_input_size[0]
+    assert base_image_input_size[0] == base_image_input_size[1]
+
+    left_margin, right_margin = overlap_margins
+    total_margin_pixels = image_patch_size * (right_margin + left_margin)  # pixels removed per dim
+    crop_patches = base_image_input_size[0] // image_patch_size  # patches per crop dim
+    crop_window_patches = crop_patches - (right_margin + left_margin)  # usable patches
+    crop_window_size = crop_window_patches * image_patch_size
+    crop_patch_w = base_image_input_size[1] // image_patch_size
+    crop_patch_h = base_image_input_size[0] // image_patch_size
+    original_image_h, original_image_w = image.shape[:2]
+    crop_size = base_image_input_size[0]
+
+    # Decide how to tile the image, to account for the overlap margins we compute the tiling
+    # as if we had an image without the margins and were using a crop size without the margins
+    tiling = select_tiling(
+        original_image_h - total_margin_pixels,
+        original_image_w - total_margin_pixels,
+        crop_window_size,
+        max_crops,
+    )
+
+    src = resize_image(
+        image,
+        [
+            tiling[0] * crop_window_size + total_margin_pixels,
+            tiling[1] * crop_window_size + total_margin_pixels,
+        ],
+        resample,
+    )
+    src = normalize_image(src, image_mean, image_std)
+
+    # Now we have to split the image into crops, and track what patches came from
+    # where in `patch_idx_arr`
+    n_crops = tiling[0] * tiling[1]
+    crop_arr = np.zeros([n_crops, crop_size, crop_size, 3], dtype=src.dtype)
+    patch_idx_arr = np.zeros([n_crops, crop_patch_h, crop_patch_w], dtype=np.int32)
+    on_crop = 0
+    for i in range(tiling[0]):
+        # Slide over `src` by `crop_window_size` steps, but extract crops of size `crop_size`
+        # which results in overlapping crop windows
+        y0 = i * crop_window_size
+        for j in range(tiling[1]):
+            x0 = j * crop_window_size
+            crop_arr[on_crop] = src[y0 : y0 + crop_size, x0 : x0 + crop_size]
+            patch_idx = np.arange(crop_patch_w * crop_patch_h).reshape(crop_patch_h, crop_patch_w)
+            patch_idx += on_crop * crop_patch_h * crop_patch_w
+
+            # Mask out idx that are in the overlap region
+            if i != 0:
+                patch_idx[:left_margin, :] = -1
+            if j != 0:
+                patch_idx[:, :left_margin] = -1
+            if i != tiling[0] - 1:
+                patch_idx[-right_margin:, :] = -1
+            if j != tiling[1] - 1:
+                patch_idx[:, -right_margin:] = -1
+            patch_idx_arr[on_crop] = patch_idx
+            on_crop += 1
+
+    # `patch_idx_arr` is ordered crop-by-crop, here we transpose `patch_idx_arr`
+    # so it is in left-to-right order
+    patch_idx_arr = np.reshape(patch_idx_arr, [tiling[0], tiling[1], crop_patch_h, crop_patch_w])
+    patch_idx_arr = np.transpose(patch_idx_arr, [0, 2, 1, 3])
+    patch_idx_arr = np.reshape(patch_idx_arr, [-1])
+
+    # Now get the parts not in the overlap region, so it should map each patch in `src`
+    # to the correct patch it should come from in `crop_arr`
+    patch_idx_arr = patch_idx_arr[patch_idx_arr >= 0].reshape(
+        src.shape[0] // image_patch_size,
+        src.shape[1] // image_patch_size,
+    )
+    return crop_arr, patch_idx_arr
+
+
+def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
+    """Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
+    if len(array.shape) == 3:
+        n_crops, h, w = array.shape
+        h_patches = h // patch_size
+        w_patches = w // patch_size
+        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size])
+        array = np.transpose(array, [0, 1, 3, 2, 4])
+        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size])
+        return array
+    else:
+        n_crops, h, w, c = array.shape
+        h_patches = h // patch_size
+        w_patches = w // patch_size
+        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size, c])
+        array = np.transpose(array, [0, 1, 3, 2, 4, 5])
+        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size * c])
+        return array
+
+
+def arange_for_pooling(
+    idx_arr: np.ndarray,
+    pool_h: int,
+    pool_w: int,
+) -> np.ndarray:
+    h_pad = pool_h * ((idx_arr.shape[0] + pool_h - 1) // pool_h) - idx_arr.shape[0]
+    w_pad = pool_w * ((idx_arr.shape[1] + pool_w - 1) // pool_w) - idx_arr.shape[1]
+    idx_arr = np.pad(
+        idx_arr,
+        [[h_pad // 2, (h_pad + 1) // 2], [w_pad // 2, (w_pad + 1) // 2]],
+        mode="constant",
+        constant_values=-1,
+    )
+    return einops.rearrange(idx_arr, "(h dh) (w dw) -> h w (dh dw)", dh=pool_h, dw=pool_w)
+
+
+def image_to_patches_and_grids(
+    image: np.ndarray,
+    max_crops: int,
+    overlap_margins: list[int],
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+    image_pooling_w: int,
+    image_pooling_h: int,
+    crop_mode: str = "overlap-and-resize-c2",
+) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+    """
+    :return image_grids, the shape of each (low-res, high-res) image after pooling
+    :return crops, the image crops to process with the ViT
+    :return pooled_patch_idx, for each patch_id tokens in `image_tokens`, the indices of the
+                                patches in `crops` to pool for that token, masked with -1
+    """
+    if isinstance(base_image_input_size, int):
+        base_image_input_size = (base_image_input_size, base_image_input_size)
+
+    base_image_input_d = image_patch_size
+    pooling_w = image_pooling_w
+    pooling_h = image_pooling_h
+    crop_patch_w = base_image_input_size[1] // base_image_input_d
+    crop_patch_h = base_image_input_size[0] // base_image_input_d
+
+    if crop_mode == "resize":
+        resized, resize_idx = build_resized_image(
+            image,
+            base_image_input_size,
+            resample,
+            image_mean,
+            image_std,
+            image_patch_size,
+        )
+        resize_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
+        resized_h, resized_w = resize_idx.shape[:2]
+        resize_idx = resize_idx.reshape([-1, pooling_h * pooling_w])
+        image_grid = [np.array([resized_h, resized_w, 0, 0])]
+        return (
+            np.stack(image_grid, 0),
+            batch_pixels_to_patches(resized, image_patch_size),
+            resize_idx,
+        )
+
+    if crop_mode not in {"overlap-and-resize-c2", "overlap-and-resize"}:
+        raise ValueError(f"Unsupported MolmoAct2 image crop_mode {crop_mode!r}.")
+
+    crop_arr, patch_idx_arr = build_overlapping_crops(
+        image,
+        max_crops,
+        overlap_margins,
+        base_image_input_size,
+        resample,
+        image_mean,
+        image_std,
+        image_patch_size,
+    )
+    pooling_idx = arange_for_pooling(patch_idx_arr, pooling_h, pooling_w)
+    h, w = pooling_idx.shape[:2]
+    pooling_idx = pooling_idx.reshape([-1, pooling_h * pooling_w])
+
+    # Finally do the same for the global image
+    resized, resize_idx = build_resized_image(
+        image,
+        base_image_input_size,
+        resample,
+        image_mean,
+        image_std,
+        image_patch_size,
+    )
+    crop_arr = np.concatenate([resized, crop_arr], 0)
+
+    resize_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
+    resized_h, resized_w = resize_idx.shape[:2]
+    resize_idx = resize_idx.reshape([-1, pooling_h * pooling_w])
+
+    # Global image goes first, so the order of patches in previous crops gets increased
+    pooling_idx = np.where(pooling_idx >= 0, pooling_idx + crop_patch_h * crop_patch_w, -1)
+    pooling_idx = np.concatenate([resize_idx, pooling_idx])
+    image_grid = [np.array([resized_h, resized_w, h, w])]
+
+    return (np.stack(image_grid, 0), batch_pixels_to_patches(crop_arr, image_patch_size), pooling_idx)
+
+
+class MolmoAct2ImagesKwargs(ImagesKwargs, total=False):
+    max_crops: int | None
+    overlap_margins: list[int] | None
+    crop_mode: str | None
+    patch_size: int | None
+    pooling_size: list[int] | None
+
+
+class MolmoAct2ImageProcessor(BaseImageProcessor):
+    r"""
+    Constructs a MolmoAct2 image processor that preprocesses images for the model.
+
+    Args:
+        size (`dict[str, int]`, *optional*, defaults to `{"height": 378, "width": 378}`):
+            Size of the image after resizing.
+        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BILINEAR`):
+            Resampling filter to use when resizing the image.
+        image_mean (`float` or `list[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
+            Mean to use if normalizing the image. This is a float or list of floats for each channel in the image.
+        image_std (`float` or `list[float]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
+            Standard deviation to use if normalizing the image. This is a float or list of floats for each channel in the image.
+        do_convert_rgb (`bool`, *optional*, defaults to `True`):
+            Whether to convert the image to RGB.
+        max_crops (`int`, *optional*, defaults to `8`):
+            Maximum number of crops to use per image.
+        overlap_margins (`list[int]`, *optional*, defaults to `[4, 4]`):
+            Overlap margins to use.
+        patch_size (`int`, *optional*, defaults to `14`):
+            The spatial patch size of the vision encoder.
+        pooling_size (`list[int]`, *optional*, defaults to `[2, 2]`):
+            The pooling size of the vision adapter.
+    """
+
+    model_input_names = ["pixel_values", "image_token_pooling", "image_grids", "image_num_crops"]
+
+    def __init__(
+        self,
+        size: dict[str, int] | None = None,
+        resample: PILImageResampling = PILImageResampling.BILINEAR,
+        image_mean: float | list[float] | None = None,
+        image_std: float | list[float] | None = None,
+        do_convert_rgb: bool = True,
+        max_crops: int = 8,
+        overlap_margins: list[int] = [4, 4],
+        crop_mode: str = "overlap-and-resize-c2",
+        patch_size: int = 14,
+        pooling_size: list[int] = [2, 2],
+        **kwargs,
+    ) -> None:
+        super().__init__(**kwargs)
+        size = size if size is not None else {"height": 378, "width": 378}
+        size = get_size_dict(size, default_to_square=True)
+        self.size = size
+
+        self.resample = resample
+        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
+        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
+        self.do_convert_rgb = do_convert_rgb
+
+        self.max_crops = max_crops
+        self.overlap_margins = overlap_margins
+        self.crop_mode = crop_mode
+        self.patch_size = patch_size
+        self.pooling_size = pooling_size
+
+    def preprocess(
+        self,
+        images: ImageInput,
+        size: dict[str, int] | None = None,
+        resample: PILImageResampling | None = None,
+        image_mean: float | list[float] | None = None,
+        image_std: float | list[float] | None = None,
+        do_convert_rgb: bool | None = None,
+        max_crops: int | None = None,
+        overlap_margins: list[int] | None = None,
+        crop_mode: str | None = None,
+        patch_size: int | None = None,
+        pooling_size: list[int] | None = None,
+        return_tensors: str | TensorType | None = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Args:
+            images (`ImageInput`):
+                Image to preprocess.
+            size (`dict[str, int]`, *optional*, defaults to `self.size`):
+                Size of the image after resizing.
+            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
+                Resampling filter to use when resizing the image. This can be one of the enum `PILImageResampling`
+                values. Resizing is always applied.
+            image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
+                Image mean to use for normalization. Normalization is always applied.
+            image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
+                Image standard deviation to use for normalization. Normalization is always
+                applied.
+            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
+                Whether to convert the image to RGB.
+            max_crops (`int`, *optional*, defaults to `self.max_crops`):
+                Maximum number of crops to use per image.
+            overlap_margins (`list[int]`, *optional*, defaults to `self.overlap_margins`):
+                Overlap margins to use.
+            patch_size (`int`, *optional*, defaults to `self.patch_size`):
+                The spatial patch size of the vision encoder.
+            pooling_size (`list[int]`, *optional*, defaults to `self.pooling_size`):
+                The pooling size of the vision adapter.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                - Unset: Return a list of `np.ndarray`.
+                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+
+        Returns:
+            A `BatchFeature` containing the following keys:
+                - `pixel_values`: The preprocessed images.
+                - `image_token_pooling`: The indices of the patches in `crops` to pool for each token in `image_tokens`.
+                - `image_grids`: The image grids.
+                - `image_num_crops`: The number of crops for each image.
+        """
+        if size is not None:
+            if "height" not in size or "width" not in size:
+                raise ValueError("size must contain 'height' and 'width' keys.")
+        else:
+            size = {**self.size}
+
+        base_image_input_size = [size["height"], size["width"]]
+
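+        resample = resample if resample is not None else self.resample  # NEAREST == 0 is falsy
+        image_mean = image_mean if image_mean is not None else self.image_mean
+        image_std = image_std if image_std is not None else self.image_std
+        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb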
+
+        max_crops = max_crops or self.max_crops
+        overlap_margins = overlap_margins or self.overlap_margins
+        crop_mode = crop_mode or self.crop_mode
+        patch_size = patch_size or self.patch_size
+        pooling_size = pooling_size or self.pooling_size
+
+        image_pooling_h, image_pooling_w = pooling_size
+
+        if images is not None:
+            images = self.fetch_images(images)
+            images = make_flat_list_of_images(images)
+
+        if images is not None and not valid_images(images):
+            raise ValueError(
+                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
+                "torch.Tensor, tf.Tensor or jax.ndarray."
+            )
+
+        if images is not None and do_convert_rgb:
+            images = [convert_to_rgb(image) for image in images]
+
+        # All transformations expect numpy arrays.
+        images = [to_numpy_array(image) for image in images] if images is not None else None
+
+        data = {}
+        if images is not None:
+            batch_grids = []
+            batch_crops = []
+            batch_pooled_patches_idx = []
+            batch_num_crops = []
+
+            for image in images:
+                image_grid, crops, pooled_idx = image_to_patches_and_grids(
+                    image,
+                    max_crops,
+                    overlap_margins,
+                    base_image_input_size,
+                    resample,
+                    image_mean,
+                    image_std,
+                    patch_size,
+                    image_pooling_w,
+                    image_pooling_h,
+                    crop_mode,
+                )
+                batch_grids.append(image_grid)
+                batch_crops.append(crops)
+                batch_pooled_patches_idx.append(pooled_idx)
+                batch_num_crops.append(crops.shape[0])
+
+            pixel_values = np.concatenate(batch_crops, 0)
+            image_token_pooling = np.concatenate(batch_pooled_patches_idx, 0)
+            image_grids = np.concatenate(batch_grids, 0)
+            image_num_crops = np.array(batch_num_crops)
+
+            data.update(
+                pixel_values=pixel_values,
+                image_token_pooling=image_token_pooling,
+                image_grids=image_grids,
+                image_num_crops=image_num_crops,
+            )
+
+        return BatchFeature(data, tensor_type=return_tensors)
+
+
+MolmoAct2ImageProcessor.register_for_auto_class()
diff --git a/src/lerobot/policies/molmoact2/hf_model/inference.py b/src/lerobot/policies/molmoact2/hf_model/inference.py
new file mode 100644
index 000000000..2c0243880
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/hf_model/inference.py
@@ -0,0 +1,748 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""Inference utilities for MolmoAct2"""
+
+from dataclasses import dataclass
+from typing import Any
+from collections.abc import Iterable, Sequence
+
+import torch
+from torch.nn import functional as F
+from transformers.cache_utils import Cache
+from transformers.configuration_utils import PretrainedConfig
+
+
+@dataclass
+class _ActionFlowInputs:
+    trajectory: torch.Tensor
+    context: Any
+    modulations: Sequence[Any]
+    action_dim_is_pad: torch.Tensor | None
+
+
+@dataclass
+class _ActionFlowCudaGraph:
+    key: tuple[Any, ...]
+    graph: torch.cuda.CUDAGraph
+    static_inputs: _ActionFlowInputs
+    output: torch.Tensor
+
+
+@dataclass
+class _DepthDecodeCudaGraphLayerStage:
+    residual: torch.Tensor
+    query: torch.Tensor
+    key: torch.Tensor
+    value: torch.Tensor
+
+
+@dataclass
+class _DepthDecodeCudaGraphPostStage:
+    graph: torch.cuda.CUDAGraph
+    attn_context: torch.Tensor
+
+
+@dataclass
+class _DepthDecodeCudaGraph:
+    cache_key: tuple[Any, ...]
+    pre_graph: torch.cuda.CUDAGraph
+    token_ids: torch.Tensor
+    cos: torch.Tensor
+    sin: torch.Tensor
+    positions: torch.Tensor
+    stages: Sequence[_DepthDecodeCudaGraphLayerStage]
+    post_graphs: Sequence[_DepthDecodeCudaGraphPostStage]
+    output: torch.Tensor
+
+
+@dataclass
+class _DepthDecodeCudaGraphSpec:
+    eligible: bool
+    cache_key_prefix: tuple[Any, ...]
+    num_hidden_layers: int
+    head_dim: int
+    num_attention_heads: int
+
+
+def _cache_seq_len_int(past_key_values: Cache | None) -> int:
+    if past_key_values is None:
+        return 0
+    seq_len = past_key_values.get_seq_length()
+    if torch.is_tensor(seq_len):
+        return int(seq_len.item())
+    return int(seq_len)
+
+
+def _cache_max_len_int(past_key_values: Cache | None) -> int:
+    if past_key_values is None:
+        return -1
+    max_len = past_key_values.get_max_cache_shape()
+    if torch.is_tensor(max_len):
+        return int(max_len.item())
+    return int(max_len)
+
+
+def _iter_cache_key_values(
+    past_key_values: Cache,
+) -> Iterable[tuple[torch.Tensor | None, torch.Tensor | None]]:
+    layers = getattr(past_key_values, "layers", None)
+    if layers is not None:
+        for layer in layers:
+            yield getattr(layer, "keys", None), getattr(layer, "values", None)
+        return
+    for layer in past_key_values:
+        yield layer[0], layer[1]
+
+
+class _DepthDecodeStaticLayerCache:
+    is_compileable = False
+    is_sliding = False
+
+    def __init__(self, max_cache_len: int) -> None:
+        self.max_cache_len = int(max_cache_len)
+        self.cumulative_length = 0
+        self.keys: torch.Tensor | None = None
+        self.values: torch.Tensor | None = None
+
+    def _allocate(self, key_states: torch.Tensor, value_states: torch.Tensor) -> None:
+        bsz, n_heads = key_states.shape[:2]
+        self.keys = torch.empty(
+            (bsz, n_heads, self.max_cache_len, key_states.shape[-1]),
+            dtype=key_states.dtype,
+            device=key_states.device,
+        )
+        self.values = torch.empty(
+            (bsz, n_heads, self.max_cache_len, value_states.shape[-1]),
+            dtype=value_states.dtype,
+            device=value_states.device,
+        )
+
+    def update(
+        self,
+        key_states: torch.Tensor,
+        value_states: torch.Tensor,
+        *args,
+        **kwargs,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        if self.keys is None:
+            self._allocate(key_states, value_states)
+        start = self.cumulative_length
+        end = start + key_states.shape[-2]
+        if end > self.max_cache_len:
+            raise RuntimeError(f"KV cache length {end} exceeds max_cache_len={self.max_cache_len}.")
+        self.keys[:, :, start:end, :].copy_(key_states)
+        self.values[:, :, start:end, :].copy_(value_states)
+        self.cumulative_length = end
+        return self.keys[:, :, :end, :], self.values[:, :, :end, :]
+
+    def get_seq_length(self) -> int:
+        return self.cumulative_length
+
+    def get_max_cache_shape(self) -> int:
+        return -1
+
+    def reset(self) -> None:
+        self.cumulative_length = 0
+
+
+class _DepthDecodeStaticCache(Cache):
+    def __init__(self, config: PretrainedConfig, max_cache_len: int) -> None:
+        text_config = config.get_text_config(decoder=True)
+        super().__init__(
+            layers=[
+                _DepthDecodeStaticLayerCache(max_cache_len=max_cache_len)
+                for _ in range(text_config.num_hidden_layers)
+            ]
+        )
+
+    def get_seq_length(self, layer_idx: int = 0) -> int:
+        return self.layers[layer_idx].get_seq_length()
+
+    def get_max_cache_shape(self, layer_idx: int = 0) -> int:
+        return self.layers[layer_idx].get_max_cache_shape()
+
+    def reset(self) -> None:
+        for layer in self.layers:
+            layer.reset()
+
+
+class ActionCudaGraphManager:
+    def __init__(self, model: Any) -> None:
+        self.model = model
+        self.enabled = True
+        self.action_flow_graph: _ActionFlowCudaGraph | None = None
+
+    def set_enabled(self, enabled: bool) -> None:
+        self.enabled = bool(enabled)
+
+    def can_use_action_flow(self, inputs: _ActionFlowInputs) -> bool:
+        action_model = self.model
+        if not self.enabled:
+            return False
+        if action_model.training or action_model._require_action_expert().training:
+            return False
+        if inputs.trajectory.device.type != "cuda":
+            return False
+
+        def all_on_cuda():
+            yield inputs.trajectory
+            for k, v in inputs.context.kv_contexts:
+                yield k
+                yield v
+            for t in (
+                inputs.context.cross_mask,
+                inputs.context.self_mask,
+                inputs.context.valid_action,
+                inputs.action_dim_is_pad,
+            ):
+                if t is not None:
+                    yield t
+            if inputs.context.rope_cache is not None:
+                yield from inputs.context.rope_cache
+            for step in inputs.modulations:
+                yield step.conditioning
+                for block_modulation in step.block_modulations:
+                    yield from block_modulation
+                yield from step.final_modulation
+
+        return all(t.device.type == "cuda" for t in all_on_cuda())
+
+    def run_action_flow(
+        self,
+        inputs: _ActionFlowInputs,
+        steps: int,
+        run_loop,
+    ) -> torch.Tensor:
+        key = _cuda_graph_key(inputs, steps)
+        cache = self.action_flow_graph
+        if cache is None or cache.key != key:
+            static_inputs = _clone_static_inputs(inputs)
+            graph, output = _capture_cuda_graph(
+                lambda: run_loop(static_inputs, steps),
+                inputs.trajectory.device,
+                after_warmup=lambda: static_inputs.trajectory.copy_(inputs.trajectory),
+            )
+            cache = _ActionFlowCudaGraph(
+                key=key,
+                graph=graph,
+                static_inputs=static_inputs,
+                output=output,
+            )
+            self.action_flow_graph = cache
+        else:
+            _copy_inputs_(cache.static_inputs, inputs)
+
+        cache.graph.replay()
+        return cache.output.clone()
+
+
+class DepthDecodeCudaGraphManager:
+    def __init__(self, model: Any) -> None:
+        self.model = model
+        self.backbone = model.model
+        self.enabled = True
+        self.graph: _DepthDecodeCudaGraph | None = None
+        self.graph_spec: _DepthDecodeCudaGraphSpec | None = None
+
+    def set_enabled(self, enabled: bool) -> None:
+        self.enabled = bool(enabled)
+
+    def make_static_cache(self, max_cache_len: int) -> _DepthDecodeStaticCache:
+        return _DepthDecodeStaticCache(
+            config=self.model.config.text_config,
+            max_cache_len=max_cache_len,
+        )
+
+    def _depth_decode_spec(self) -> _DepthDecodeCudaGraphSpec:
+        static = self.graph_spec
+        if static is None:
+            cfg = self.backbone.transformer.config
+            rotary_emb = getattr(self.backbone.transformer, "rotary_emb", None)
+            static = _DepthDecodeCudaGraphSpec(
+                eligible=(
+                    not cfg.norm_after
+                    and cfg.rope_scaling_layers is None
+                    and getattr(rotary_emb, "rope_type", None) == "default"
+                    and cfg._attn_implementation == "sdpa"
+                ),
+                cache_key_prefix=(
+                    cfg.hidden_size,
+                    cfg.num_attention_heads,
+                    cfg.num_key_value_heads,
+                    cfg.head_dim,
+                    cfg.num_hidden_layers,
+                    cfg.use_qk_norm,
+                    cfg.qk_norm_type,
+                    cfg._attn_implementation,
+                ),
+                num_hidden_layers=cfg.num_hidden_layers,
+                head_dim=cfg.head_dim,
+                num_attention_heads=cfg.num_attention_heads,
+            )
+            self.graph_spec = static
+        return static
+
+    def can_use(
+        self,
+        next_input_ids: torch.Tensor,
+        *,
+        past_key_values: Cache,
+        attention_bias: torch.Tensor,
+    ) -> bool:
+        if not self.enabled or self.model.training or self.backbone.transformer.training:
+            return False
+        if next_input_ids.device.type != "cuda":
+            return False
+        if next_input_ids.ndim != 2 or next_input_ids.shape[0] != 1 or next_input_ids.shape[1] != 1:
+            return False
+        if not isinstance(past_key_values, _DepthDecodeStaticCache):
+            return False
+        if not torch.is_tensor(attention_bias) or attention_bias.device != next_input_ids.device:
+            return False
+        return self._depth_decode_spec().eligible
+
+    def _depth_decode_key(
+        self,
+        next_input_ids: torch.Tensor,
+        attention_bias: torch.Tensor,
+    ) -> tuple[Any, ...]:
+        device = next_input_ids.device
+        return (
+            self._depth_decode_spec().cache_key_prefix,
+            device.type,
+            device.index,
+            self.model.lm_head.weight.dtype,
+            attention_bias.shape[-1],
+        )
+
+    def _select_depth_decode_rope(self, cos: torch.Tensor, sin: torch.Tensor, *, past_length: int) -> None:
+        emb = self.backbone.transformer.rotary_emb
+        cos.copy_(emb._pos_cos_cache[0, :, past_length : past_length + 1, :])
+        sin.copy_(emb._pos_sin_cache[0, :, past_length : past_length + 1, :])
+
+    def _depth_decode_pre_layer(
+        self,
+        layer_idx: int,
+        hidden_states: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        block = self.backbone.transformer.blocks[layer_idx]
+        attention = block.self_attn
+        residual = hidden_states
+        hidden_states = block.attn_norm(hidden_states)
+
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, attention.head_dim)
+        qkv = attention.att_proj(hidden_states)
+        query_states, key_states, value_states = qkv.split(attention.fused_dims, dim=-1)
+        value_states = value_states.view(hidden_shape)
+
+        apply_qk_norm = attention.q_norm is not None and attention.k_norm is not None
+        norm_after_view = apply_qk_norm and attention.qk_norm_type == "qwen3"
+
+        if apply_qk_norm and not norm_after_view:
+            query_states = attention.q_norm(query_states)
+            key_states = attention.k_norm(key_states)
+
+        query_states = query_states.view(hidden_shape)
+        key_states = key_states.view(hidden_shape)
+
+        if norm_after_view:
+            query_states = attention.q_norm(query_states)
+            key_states = attention.k_norm(key_states)
+
+        query_states = query_states.transpose(1, 2)
+        key_states = key_states.transpose(1, 2)
+        value_states = value_states.transpose(1, 2)
+        query_states, key_states = _apply_rotary_pos_emb(query_states, key_states, cos, sin)
+        return residual, query_states, key_states, value_states
+
+    def _depth_decode_pre0(
+        self,
+        token_ids: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        inputs_embeds = self.model._embed_base_tokens(token_ids)
+        return self._depth_decode_pre_layer(0, inputs_embeds, cos, sin)
+
+    def _depth_decode_post_layer(
+        self,
+        layer_idx: int,
+        residual: torch.Tensor,
+        attn_context: torch.Tensor,
+    ) -> torch.Tensor:
+        block = self.backbone.transformer.blocks[layer_idx]
+        attention = block.self_attn
+        input_shape = residual.shape[:-1]
+        attn_output = attn_context.reshape(*input_shape, -1).contiguous()
+        attn_output = attention.attn_out(attn_output)
+        hidden_states = residual + block.dropout(attn_output)
+
+        residual = hidden_states
+        hidden_states = block.ff_norm(hidden_states)
+        hidden_states = block.mlp(hidden_states)
+        hidden_states = residual + block.dropout(hidden_states)
+        return hidden_states
+
+    def _depth_decode_post_and_pre_next(
+        self,
+        layer_idx: int,
+        residual: torch.Tensor,
+        attn_context: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        hidden_states = self._depth_decode_post_layer(layer_idx, residual, attn_context)
+        return self._depth_decode_pre_layer(layer_idx + 1, hidden_states, cos, sin)
+
+    def _depth_decode_last_post(
+        self,
+        layer_idx: int,
+        residual: torch.Tensor,
+        attn_context: torch.Tensor,
+    ) -> torch.Tensor:
+        hidden_states = self._depth_decode_post_layer(layer_idx, residual, attn_context)
+        return self.backbone.transformer.ln_f(hidden_states)
+
+    def _build_depth_decode_graph(
+        self,
+        next_input_ids: torch.Tensor,
+        *,
+        past_length: int,
+        attention_bias: torch.Tensor,
+    ) -> _DepthDecodeCudaGraph:
+        text_config = self.backbone.transformer.config
+        device = next_input_ids.device
+        dtype = self.model.lm_head.weight.dtype
+        static = self._depth_decode_spec()
+        num_layers = static.num_hidden_layers
+        head_dim = static.head_dim
+        max_cache_len = int(attention_bias.shape[-1])
+        max_rope_len = max(int(text_config.max_position_embeddings or 0), max_cache_len)
+        self.backbone.transformer.prepare_rope_cache(device=device, max_seq_len=max_rope_len)
+
+        token_ids = torch.empty((1, 1), device=device, dtype=torch.long)
+        cos = torch.empty((1, 1, head_dim), device=device, dtype=dtype)
+        sin = torch.empty_like(cos)
+        positions = torch.arange(max_cache_len, device=device, dtype=torch.long)
+        context_shape = (1, 1, static.num_attention_heads, head_dim)
+
+        token_ids.copy_(next_input_ids)
+        self._select_depth_decode_rope(cos, sin, past_length=past_length)
+
+        pre_graph, pre_output = _capture_cuda_graph(
+            lambda: self._depth_decode_pre0(token_ids, cos, sin),
+            device,
+        )
+        stages = [_DepthDecodeCudaGraphLayerStage(*pre_output)]
+        post_graphs = []
+        for layer_idx in range(num_layers - 1):
+            stage = stages[-1]
+            attn_context = torch.empty(context_shape, device=device, dtype=dtype)
+            graph, output = _capture_cuda_graph(
+                lambda layer_idx=layer_idx, stage=stage, attn_context=attn_context: (
+                    self._depth_decode_post_and_pre_next(
+                        layer_idx,
+                        stage.residual,
+                        attn_context,
+                        cos,
+                        sin,
+                    )
+                ),
+                device,
+            )
+            post_graphs.append(_DepthDecodeCudaGraphPostStage(graph=graph, attn_context=attn_context))
+            stages.append(_DepthDecodeCudaGraphLayerStage(*output))
+
+        last_stage = stages[-1]
+        last_attn_context = torch.empty(context_shape, device=device, dtype=dtype)
+        last_graph, last_output = _capture_cuda_graph(
+            lambda: self._depth_decode_last_post(
+                num_layers - 1,
+                last_stage.residual,
+                last_attn_context,
+            ),
+            device,
+        )
+        post_graphs.append(_DepthDecodeCudaGraphPostStage(graph=last_graph, attn_context=last_attn_context))
+        return _DepthDecodeCudaGraph(
+            cache_key=self._depth_decode_key(next_input_ids, attention_bias),
+            pre_graph=pre_graph,
+            token_ids=token_ids,
+            cos=cos,
+            sin=sin,
+            positions=positions,
+            stages=tuple(stages),
+            post_graphs=tuple(post_graphs),
+            output=last_output,
+        )
+
+    def _get_depth_decode_graph(
+        self,
+        next_input_ids: torch.Tensor,
+        *,
+        past_length: int,
+        attention_bias: torch.Tensor,
+    ) -> _DepthDecodeCudaGraph:
+        key = self._depth_decode_key(next_input_ids, attention_bias)
+        decode_graph = self.graph
+        if decode_graph is None or decode_graph.cache_key != key:
+            decode_graph = self._build_depth_decode_graph(
+                next_input_ids,
+                past_length=past_length,
+                attention_bias=attention_bias,
+            )
+            self.graph = decode_graph
+        else:
+            decode_graph.token_ids.copy_(next_input_ids)
+            self._select_depth_decode_rope(decode_graph.cos, decode_graph.sin, past_length=past_length)
+        return decode_graph
+
+    def _run_depth_decode_attention_core(
+        self,
+        layer_idx: int,
+        stage: _DepthDecodeCudaGraphLayerStage,
+        *,
+        past_key_values: Cache,
+        attention_bias: torch.Tensor,
+        cache_position: torch.Tensor,
+        cos: torch.Tensor,
+        sin: torch.Tensor,
+    ) -> torch.Tensor:
+        attention = self.backbone.transformer.blocks[layer_idx].self_attn
+        cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+        key_states, value_states = past_key_values.update(
+            stage.key,
+            stage.value,
+            layer_idx,
+            cache_kwargs,
+        )
+        key_states = _repeat_kv(key_states, attention.num_key_value_groups)
+        value_states = _repeat_kv(value_states, attention.num_key_value_groups)
+        attn_output = F.scaled_dot_product_attention(
+            stage.query,
+            key_states,
+            value_states,
+            attn_mask=attention_bias,
+            dropout_p=0.0,
+            is_causal=False,
+        )
+        return attn_output.transpose(1, 2)
+
+    def run(
+        self,
+        next_input_ids: torch.Tensor,
+        *,
+        past_key_values: Cache,
+        attention_bias: torch.Tensor,
+        past_length: int,
+    ) -> tuple[torch.Tensor, Cache]:
+        end = past_length + 1
+        decode_graph = self._get_depth_decode_graph(
+            next_input_ids,
+            past_length=past_length,
+            attention_bias=attention_bias,
+        )
+        cache_position = decode_graph.positions[past_length:end]
+        attention_bias_q = attention_bias[:, :, past_length:end, :end]
+
+        decode_graph.pre_graph.replay()
+
+        for layer_idx, post_graph in enumerate(decode_graph.post_graphs):
+            attn_context = self._run_depth_decode_attention_core(
+                layer_idx,
+                decode_graph.stages[layer_idx],
+                past_key_values=past_key_values,
+                attention_bias=attention_bias_q,
+                cache_position=cache_position,
+                cos=decode_graph.cos,
+                sin=decode_graph.sin,
+            )
+            post_graph.attn_context.copy_(attn_context)
+            post_graph.graph.replay()
+
+        return decode_graph.output, past_key_values
+
+
+def _cuda_graph_tensor_signature(
+    tensor: torch.Tensor | None,
+) -> tuple[Any, ...] | None:
+    if tensor is None:
+        return None
+    return (
+        tuple(tensor.shape),
+        tuple(tensor.stride()),
+        str(tensor.dtype),
+        str(tensor.device),
+    )
+
+
+def _cuda_graph_context_signature(context: Any) -> tuple[Any, ...]:
+    sig = _cuda_graph_tensor_signature
+    return (
+        tuple((sig(k), sig(v)) for k, v in context.kv_contexts),
+        sig(context.cross_mask),
+        sig(context.self_mask),
+        sig(context.valid_action),
+        None if context.rope_cache is None else tuple(sig(t) for t in context.rope_cache),
+    )
+
+
+def _cuda_graph_modulation_signature(modulations: Sequence[Any]) -> tuple[Any, ...]:
+    sig = _cuda_graph_tensor_signature
+    return tuple(
+        (
+            sig(step.conditioning),
+            tuple(tuple(sig(t) for t in block_modulation) for block_modulation in step.block_modulations),
+            tuple(sig(t) for t in step.final_modulation),
+        )
+        for step in modulations
+    )
+
+
+def _cuda_graph_key(inputs: _ActionFlowInputs, steps: int) -> tuple[Any, ...]:
+    sig = _cuda_graph_tensor_signature
+    return (
+        sig(inputs.trajectory),
+        _cuda_graph_context_signature(inputs.context),
+        _cuda_graph_modulation_signature(inputs.modulations),
+        sig(inputs.action_dim_is_pad),
+        int(steps),
+    )
+
+
+def _clone_static_tensor(tensor: torch.Tensor | None) -> torch.Tensor | None:
+    if tensor is None:
+        return None
+    static = torch.empty_strided(
+        tuple(tensor.shape),
+        tuple(tensor.stride()),
+        device=tensor.device,
+        dtype=tensor.dtype,
+    )
+    static.copy_(tensor)
+    return static
+
+
+def _clone_static_context(context: Any) -> Any:
+    rope_cache = None
+    if context.rope_cache is not None:
+        rope_cache = tuple(_clone_static_tensor(t) for t in context.rope_cache)
+    return context.__class__(
+        kv_contexts=tuple((_clone_static_tensor(k), _clone_static_tensor(v)) for k, v in context.kv_contexts),
+        cross_mask=_clone_static_tensor(context.cross_mask),
+        self_mask=_clone_static_tensor(context.self_mask),
+        valid_action=_clone_static_tensor(context.valid_action),
+        rope_cache=rope_cache,
+    )
+
+
+def _clone_static_modulations(modulations: Sequence[Any]) -> Sequence[Any]:
+    return tuple(
+        step.__class__(
+            conditioning=_clone_static_tensor(step.conditioning),
+            block_modulations=tuple(
+                tuple(_clone_static_tensor(t) for t in block_modulation)
+                for block_modulation in step.block_modulations
+            ),
+            final_modulation=tuple(_clone_static_tensor(t) for t in step.final_modulation),
+        )
+        for step in modulations
+    )
+
+
+def _clone_static_inputs(inputs: _ActionFlowInputs) -> _ActionFlowInputs:
+    return _ActionFlowInputs(
+        trajectory=_clone_static_tensor(inputs.trajectory),
+        context=_clone_static_context(inputs.context),
+        modulations=_clone_static_modulations(inputs.modulations),
+        action_dim_is_pad=_clone_static_tensor(inputs.action_dim_is_pad),
+    )
+
+
+def _copy_context_(dst: Any, src: Any) -> None:
+    for (dst_k, dst_v), (src_k, src_v) in zip(dst.kv_contexts, src.kv_contexts):
+        dst_k.copy_(src_k)
+        dst_v.copy_(src_v)
+    if src.cross_mask is not None:
+        dst.cross_mask.copy_(src.cross_mask)
+    if src.self_mask is not None:
+        dst.self_mask.copy_(src.self_mask)
+    if src.valid_action is not None:
+        dst.valid_action.copy_(src.valid_action)
+    if src.rope_cache is not None:
+        for dst_tensor, src_tensor in zip(dst.rope_cache, src.rope_cache):
+            dst_tensor.copy_(src_tensor)
+
+
+def _copy_inputs_(dst: _ActionFlowInputs, src: _ActionFlowInputs) -> None:
+    dst.trajectory.copy_(src.trajectory)
+    _copy_context_(dst.context, src.context)
+    if src.action_dim_is_pad is not None:
+        dst.action_dim_is_pad.copy_(src.action_dim_is_pad)
+
+
+def _rotate_half(x: torch.Tensor) -> torch.Tensor:
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+
+
+def _apply_rotary_pos_emb(
+    q: torch.Tensor,
+    k: torch.Tensor,
+    cos: torch.Tensor,
+    sin: torch.Tensor,
+    unsqueeze_dim: int = 1,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (_rotate_half(q) * sin)
+    k_embed = (k * cos) + (_rotate_half(k) * sin)
+    return q_embed, k_embed
+
+
+def _repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
+def _capture_cuda_graph(
+    fn,
+    device: torch.device,
+    *,
+    after_warmup=None,
+) -> tuple[torch.cuda.CUDAGraph, Any]:
+    warmup_stream = torch.cuda.Stream(device=device)
+    warmup_stream.wait_stream(torch.cuda.current_stream(device))
+    with torch.cuda.stream(warmup_stream):
+        fn()
+    torch.cuda.current_stream(device).wait_stream(warmup_stream)
+    if after_warmup is not None:
+        after_warmup()
+
+    graph = torch.cuda.CUDAGraph()
+    with torch.cuda.graph(graph):
+        output = fn()
+    return graph, output
diff --git a/src/lerobot/policies/molmoact2/hf_model/modeling_molmoact2.py b/src/lerobot/policies/molmoact2/hf_model/modeling_molmoact2.py
new file mode 100644
index 000000000..4c36b04c8
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/hf_model/modeling_molmoact2.py
@@ -0,0 +1,4591 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""Modeling code for MolmoAct2"""
+
+import json
+import math
+import os
+import re
+from copy import deepcopy
+from dataclasses import dataclass
+from typing import Any, Dict, List, Optional, Tuple, Union
+from collections.abc import Callable, Mapping, Sequence
+
+import numpy as np
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import functional as F
+from torch.nn.attention import SDPBackend, sdpa_kernel
+from transformers.activations import ACT2FN
+from transformers.cache_utils import Cache, DynamicCache
+from transformers.configuration_utils import PretrainedConfig
+from transformers.generation import GenerationMixin
+from transformers.masking_utils import create_causal_mask, create_masks_for_generate
+from transformers.modeling_flash_attention_utils import (
+    FlashAttentionKwargs,
+    _flash_attention_forward,
+    flash_attn_supports_top_left_mask,
+)
+from transformers.modeling_layers import GradientCheckpointingLayer
+from transformers.modeling_outputs import (
+    BaseModelOutputWithPast,
+)
+from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS, dynamic_rope_update
+from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
+from transformers.processing_utils import Unpack
+from transformers.utils import (
+    ModelOutput,
+    TransformersKwargs,
+    can_return_tuple,
+    logging,
+)
+
+from .configuration_molmoact2 import (
+    MolmoAct2ActionExpertConfig,
+    MolmoAct2AdapterConfig,
+    MolmoAct2Config,
+    MolmoAct2TextConfig,
+    MolmoAct2VitConfig,
+)
+from .inference import (
+    ActionCudaGraphManager,
+    DepthDecodeCudaGraphManager,
+    _ActionFlowInputs,
+    _cache_max_len_int,
+    _cache_seq_len_int,
+    _iter_cache_key_values,
+)
+
+logger = logging.get_logger(__name__)
+
+
+ACTION_START_TOKEN = "<action_start>"  # nosec B105
+ACTION_END_TOKEN = "<action_end>"  # nosec B105
+ACTION_OUTPUT_TOKEN = "<action_output>"  # nosec B105
+STATE_START_TOKEN = "<state_start>"  # nosec B105
+STATE_END_TOKEN = "<state_end>"  # nosec B105
+STATE_TOKEN_PREFIX = "<state_"  # nosec B105
+DEPTH_START_TOKEN = "<depth_start>"  # nosec B105
+DEPTH_END_TOKEN = "<depth_end>"  # nosec B105
+DEPTH_OUTPUT_TOKEN = "<depth_output>"  # nosec B105
+DEPTH_TOKEN_PREFIX = "<depth_"  # nosec B105
+SETUP_START_TOKEN = "<setup_start>"  # nosec B105
+SETUP_END_TOKEN = "<setup_end>"  # nosec B105
+CONTROL_START_TOKEN = "<control_start>"  # nosec B105
+CONTROL_END_TOKEN = "<control_end>"  # nosec B105
+
+_QUESTION_TRAILING_SENTENCE_PUNCTUATION = ".,!?;:,…"
+_QUESTION_TRAILING_CLOSERS = "\"'”’)]}"
+_QUESTION_SURROUNDING_DELIMITERS = "\"'`“”‘’[](){}"
+_QUESTION_PREFIX_PATTERNS = tuple(
+    re.compile(pattern, flags=re.IGNORECASE)
+    for pattern in (
+        r"^(?:task|instruction|language[_ ]instruction|goal)\s*[:\-]\s*",
+        r"^(?:the\s+task\s+is\s+to|your\s+task\s+is\s+to)\s+",
+    )
+)
+
+_DEPTH_REASONING_PATCH_SIZE = 32
+_DEPTH_REASONING_THRESHOLD = 0.996
+
+
+def _modulate(x: torch.Tensor, shift: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
+    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
+
+
+def _round_up_multiple(value: int, multiple_of: int) -> int:
+    if multiple_of <= 0:
+        return value
+    return int(math.ceil(value / multiple_of) * multiple_of)
+
+
+def _init_linear(linear: nn.Linear, *, zero: bool = False, scale: float = 1.0) -> None:
+    if zero:
+        nn.init.zeros_(linear.weight)
+    else:
+        nn.init.xavier_uniform_(linear.weight)
+        if scale != 1.0:
+            with torch.no_grad():
+                linear.weight.mul_(scale)
+    if linear.bias is not None:
+        nn.init.zeros_(linear.bias)
+
+
+@dataclass
+class ActionExpertContext:
+    kv_contexts: Sequence[tuple[torch.Tensor, torch.Tensor]]
+    cross_mask: torch.Tensor | None
+    self_mask: torch.Tensor | None
+    valid_action: torch.Tensor | None
+    rope_cache: tuple[torch.Tensor, torch.Tensor] | None = None
+
+
+@dataclass
+class ActionExpertStepModulation:
+    conditioning: torch.Tensor
+    block_modulations: Sequence[tuple[torch.Tensor, ...]]
+    final_modulation: tuple[torch.Tensor, torch.Tensor]
+
+
+class ActionExpertRMSNorm(nn.Module):
+    def __init__(
+        self,
+        size: int,
+        *,
+        eps: float = 1e-6,
+        elementwise_affine: bool = False,
+        device=None,
+    ) -> None:
+        super().__init__()
+        self.size = size
+        self.eps = eps
+        if elementwise_affine:
+            self.weight = nn.Parameter(torch.ones(size, device=device))
+        else:
+            self.register_parameter("weight", None)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        with torch.autocast(enabled=False, device_type=x.device.type):
+            dtype = x.dtype
+            x_float = x.to(torch.float32)
+            variance = x_float.pow(2).mean(dim=-1, keepdim=True)
+            out = x_float * torch.rsqrt(variance + self.eps)
+            out = out.to(dtype)
+        if self.weight is not None:
+            out = out * self.weight
+        return out
+
+    def reset_parameters(self) -> None:
+        if self.weight is not None:
+            nn.init.ones_(self.weight)
+
+
+class ActionExpertRotaryEmbedding(nn.Module):
+    def __init__(self, head_dim: int, base: float = 10000.0) -> None:
+        super().__init__()
+        if head_dim % 2 != 0:
+            raise ValueError("RoPE requires an even head_dim.")
+        self.head_dim = head_dim
+        self.base = base
+
+    def build_cache(
+        self,
+        *,
+        seq_len: int,
+        device: torch.device,
+        dtype: torch.dtype,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        half_dim = self.head_dim // 2
+        inv_freq = 1.0 / (
+            self.base ** (torch.arange(0, half_dim, device=device, dtype=torch.float32) / max(half_dim, 1))
+        )
+        positions = torch.arange(seq_len, device=device, dtype=torch.float32)
+        freqs = torch.outer(positions, inv_freq)
+        cos = freqs.cos().to(dtype=dtype).view(1, 1, seq_len, half_dim)
+        sin = freqs.sin().to(dtype=dtype).view(1, 1, seq_len, half_dim)
+        return cos, sin
+
+    def forward(
+        self,
+        q: torch.Tensor,
+        k: torch.Tensor,
+        *,
+        rope_cache: tuple[torch.Tensor, torch.Tensor] | None = None,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        if rope_cache is None:
+            rope_cache = self.build_cache(seq_len=q.shape[-2], device=q.device, dtype=q.dtype)
+        cos, sin = rope_cache
+        half_dim = self.head_dim // 2
+
+        def _apply(x: torch.Tensor) -> torch.Tensor:
+            x1, x2 = x[..., :half_dim], x[..., half_dim:]
+            return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
+
+        return _apply(q), _apply(k)
+
+
+class ActionExpertSelfAttention(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        *,
+        attn_dropout: float = 0.0,
+        proj_dropout: float = 0.0,
+        qk_norm: bool = True,
+        qk_norm_eps: float = 1e-6,
+        use_rope: bool = True,
+    ) -> None:
+        super().__init__()
+        if hidden_size % num_heads != 0:
+            raise ValueError("hidden_size must be divisible by num_heads")
+        self.hidden_size = hidden_size
+        self.num_heads = num_heads
+        self.head_dim = hidden_size // num_heads
+        self.attn_dropout = attn_dropout
+        self.q_norm = ActionExpertRMSNorm(self.head_dim, eps=qk_norm_eps) if qk_norm else None
+        self.k_norm = ActionExpertRMSNorm(self.head_dim, eps=qk_norm_eps) if qk_norm else None
+        self.rope = ActionExpertRotaryEmbedding(self.head_dim) if use_rope else None
+        self.qkv = nn.Linear(hidden_size, hidden_size * 3)
+        self.out_proj = nn.Linear(hidden_size, hidden_size)
+        self.out_drop = nn.Dropout(proj_dropout)
+
+    def _apply_qk_norm(self, q: torch.Tensor, k: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+        if self.q_norm is None or self.k_norm is None:
+            return q, k
+        return self.q_norm(q), self.k_norm(k)
+
+    def _attention(
+        self,
+        q: torch.Tensor,
+        k: torch.Tensor,
+        v: torch.Tensor,
+        *,
+        attn_mask: torch.Tensor | None = None,
+        is_causal: bool = False,
+    ) -> torch.Tensor:
+        dropout_p = self.attn_dropout if self.training else 0.0
+        out = F.scaled_dot_product_attention(
+            q.transpose(1, 2),
+            k.transpose(1, 2),
+            v.transpose(1, 2),
+            attn_mask=attn_mask,
+            dropout_p=dropout_p,
+            is_causal=is_causal,
+        )
+        return out.transpose(1, 2).contiguous()
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        *,
+        attn_mask: torch.Tensor | None = None,
+        is_causal: bool = False,
+        rope_cache: tuple[torch.Tensor, torch.Tensor] | None = None,
+    ) -> torch.Tensor:
+        bsz, seq_len, _ = x.shape
+        qkv = self.qkv(x).view(bsz, seq_len, 3, self.num_heads, self.head_dim)
+        q = qkv[:, :, 0].transpose(1, 2)
+        k = qkv[:, :, 1].transpose(1, 2)
+        v = qkv[:, :, 2].contiguous()
+        q, k = self._apply_qk_norm(q, k)
+        if self.rope is not None:
+            q, k = self.rope(q, k, rope_cache=rope_cache)
+        q = q.transpose(1, 2)
+        k = k.transpose(1, 2)
+        out = self._attention(q, k, v, attn_mask=attn_mask, is_causal=is_causal)
+        out = out.reshape(bsz, seq_len, self.hidden_size)
+        return self.out_drop(self.out_proj(out))
+
+
+class ActionExpertCrossAttention(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        *,
+        attn_dropout: float = 0.0,
+        proj_dropout: float = 0.0,
+        qk_norm: bool = True,
+        qk_norm_eps: float = 1e-6,
+    ) -> None:
+        super().__init__()
+        if hidden_size % num_heads != 0:
+            raise ValueError("hidden_size must be divisible by num_heads")
+        self.hidden_size = hidden_size
+        self.num_heads = num_heads
+        self.head_dim = hidden_size // num_heads
+        self.attn_dropout = attn_dropout
+        self.q_norm = ActionExpertRMSNorm(self.head_dim, eps=qk_norm_eps) if qk_norm else None
+        self.k_norm = ActionExpertRMSNorm(self.head_dim, eps=qk_norm_eps) if qk_norm else None
+        self.q_proj = nn.Linear(hidden_size, hidden_size)
+        self.out_proj = nn.Linear(hidden_size, hidden_size)
+        self.out_drop = nn.Dropout(proj_dropout)
+
+    def _as_heads(self, x: torch.Tensor) -> torch.Tensor:
+        if x.dim() == 4:
+            if x.shape[2] == self.num_heads:
+                return x
+            if x.shape[1] == self.num_heads:
+                return x.transpose(1, 2).contiguous()
+            raise ValueError(f"Unexpected cross-attention KV shape {tuple(x.shape)}")
+        if x.dim() != 3:
+            raise ValueError(f"Expected 3D/4D cross-attention KV, got {tuple(x.shape)}")
+        bsz, seq_len, _ = x.shape
+        return x.view(bsz, seq_len, self.num_heads, self.head_dim)
+
+    def _attention(
+        self,
+        q: torch.Tensor,
+        k: torch.Tensor,
+        v: torch.Tensor,
+        *,
+        attn_mask: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        dropout_p = self.attn_dropout if self.training else 0.0
+        out = F.scaled_dot_product_attention(
+            q.transpose(1, 2),
+            k.transpose(1, 2),
+            v.transpose(1, 2),
+            attn_mask=attn_mask,
+            dropout_p=dropout_p,
+            is_causal=False,
+        )
+        return out.transpose(1, 2).contiguous()
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        *,
+        kv_k: torch.Tensor,
+        kv_v: torch.Tensor,
+        attn_mask: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        bsz, tgt_len, _ = x.shape
+        q = self.q_proj(x).view(bsz, tgt_len, self.num_heads, self.head_dim)
+        k = self._as_heads(kv_k)
+        v = self._as_heads(kv_v)
+        q = q.transpose(1, 2)
+        k = k.transpose(1, 2)
+        if self.q_norm is not None:  # k_norm is applied upstream in ActionExpert._prepare_kv_context
+            q = self.q_norm(q)
+        q = q.transpose(1, 2)
+        k = k.transpose(1, 2)
+        out = self._attention(q, k, v, attn_mask=attn_mask)
+        out = out.reshape(bsz, tgt_len, self.hidden_size)
+        return self.out_drop(self.out_proj(out))
+
+
+class ActionExpertMLP(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        *,
+        mlp_ratio: float,
+        multiple_of: int,
+        dropout: float = 0.0,
+    ) -> None:
+        super().__init__()
+        inner_dim = _round_up_multiple(int(hidden_size * mlp_ratio), multiple_of)
+        self.up_proj = nn.Linear(hidden_size, inner_dim)
+        self.gate_proj = nn.Linear(hidden_size, inner_dim)
+        self.down_proj = nn.Linear(inner_dim, hidden_size)
+        self.dropout = nn.Dropout(dropout)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = F.silu(self.gate_proj(x)) * self.up_proj(x)
+        x = self.dropout(x)
+        x = self.down_proj(x)
+        return self.dropout(x)
+
+
+class ActionExpertModulation(nn.Module):
+    def __init__(self, hidden_size: int, num_chunks: int) -> None:
+        super().__init__()
+        self.act = nn.SiLU()
+        self.linear = nn.Linear(hidden_size, num_chunks * hidden_size)
+
+    def forward(self, conditioning: torch.Tensor) -> torch.Tensor:
+        return self.linear(self.act(conditioning))
+
+
+class ActionExpertBlock(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        *,
+        mlp_ratio: float,
+        ffn_multiple_of: int,
+        attn_dropout: float = 0.0,
+        dropout: float = 0.0,
+        qk_norm: bool = True,
+        qk_norm_eps: float = 1e-6,
+        rope: bool = True,
+    ) -> None:
+        super().__init__()
+        self.self_norm = ActionExpertRMSNorm(hidden_size, eps=1e-6)
+        self.cross_norm = ActionExpertRMSNorm(hidden_size, eps=1e-6)
+        self.ff_norm = ActionExpertRMSNorm(hidden_size, eps=1e-6)
+        self.self_attn = ActionExpertSelfAttention(
+            hidden_size,
+            num_heads,
+            attn_dropout=attn_dropout,
+            proj_dropout=dropout,
+            qk_norm=qk_norm,
+            qk_norm_eps=qk_norm_eps,
+            use_rope=rope,
+        )
+        self.cross_attn = ActionExpertCrossAttention(
+            hidden_size,
+            num_heads,
+            attn_dropout=attn_dropout,
+            proj_dropout=dropout,
+            qk_norm=qk_norm,
+            qk_norm_eps=qk_norm_eps,
+        )
+        self.mlp = ActionExpertMLP(
+            hidden_size,
+            mlp_ratio=mlp_ratio,
+            multiple_of=ffn_multiple_of,
+            dropout=dropout,
+        )
+        self.modulation = ActionExpertModulation(hidden_size, 9)  # 3 x (shift, scale, gate)
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        conditioning: torch.Tensor,
+        *,
+        cross_kv: tuple[torch.Tensor, torch.Tensor],
+        self_attn_mask: torch.Tensor | None = None,
+        attn_mask: torch.Tensor | None = None,
+        is_causal: bool = False,
+        modulation: tuple[torch.Tensor, ...] | None = None,
+        rope_cache: tuple[torch.Tensor, torch.Tensor] | None = None,
+    ) -> torch.Tensor:
+        if modulation is None:
+            modulation = self.modulation(conditioning).chunk(9, dim=1)
+        (
+            shift_msa,
+            scale_msa,
+            gate_msa,
+            shift_mca,
+            scale_mca,
+            gate_mca,
+            shift_mlp,
+            scale_mlp,
+            gate_mlp,
+        ) = modulation
+        x = x + gate_msa.unsqueeze(1) * self.self_attn(
+            _modulate(self.self_norm(x), shift_msa, scale_msa),
+            attn_mask=self_attn_mask,
+            is_causal=is_causal,
+            rope_cache=rope_cache,
+        )
+        x = x + gate_mca.unsqueeze(1) * self.cross_attn(
+            _modulate(self.cross_norm(x), shift_mca, scale_mca),
+            kv_k=cross_kv[0],
+            kv_v=cross_kv[1],
+            attn_mask=attn_mask,
+        )
+        x = x + gate_mlp.unsqueeze(1) * self.mlp(_modulate(self.ff_norm(x), shift_mlp, scale_mlp))
+        return x
+
+
+class ActionExpertFinalLayer(nn.Module):
+    def __init__(self, hidden_size: int, output_dim: int) -> None:
+        super().__init__()
+        self.norm = ActionExpertRMSNorm(hidden_size, eps=1e-6)
+        self.modulation = ActionExpertModulation(hidden_size, 2)  # (shift, scale)
+        self.linear = nn.Linear(hidden_size, output_dim)
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        conditioning: torch.Tensor,
+        *,
+        modulation: tuple[torch.Tensor, torch.Tensor] | None = None,
+    ) -> torch.Tensor:
+        if modulation is None:
+            modulation = self.modulation(conditioning).chunk(2, dim=1)
+        shift, scale = modulation
+        return self.linear(_modulate(self.norm(x), shift, scale))
+
+
+class SinusoidalTimeEmbedding(nn.Module):
+    def __init__(self, dim: int):
+        super().__init__()
+        self.dim = dim
+
+    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
+        if timesteps.dim() > 1:
+            timesteps = timesteps.view(timesteps.shape[0], -1)[:, 0]
+        half_dim = self.dim // 2
+        freq = torch.exp(
+            torch.arange(half_dim, device=timesteps.device, dtype=timesteps.dtype)
+            * (-math.log(10000.0) / max(half_dim - 1, 1))
+        )
+        args = timesteps[:, None] * freq[None, :]
+        emb = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
+        if self.dim % 2 == 1:
+            emb = F.pad(emb, (0, 1))
+        return emb
+
+
+class ActionExpert(nn.Module):
+    """MolmoAct2 action expert: time-modulated blocks that cross-attend to per-layer LLM KV states."""
+
+    def __init__(
+        self,
+        config: MolmoAct2ActionExpertConfig,
+        *,
+        llm_dim: int,
+        llm_kv_dim: int,
+        llm_num_layers: int,
+        device=None,
+    ):
+        super().__init__()
+        if config.num_layers != llm_num_layers:
+            raise ValueError(
+                "MolmoAct2 HF action expert supports only per-layer conditioning with one "
+                f"action block per LLM layer (action={config.num_layers}, llm={llm_num_layers})."
+            )
+        self.config = config
+        self.hidden_size = config.hidden_size
+        self.llm_dim = llm_dim
+        self.llm_kv_dim = llm_kv_dim
+        self.action_head_dim = config.hidden_size // config.num_heads
+
+        self.time_embed = nn.Sequential(
+            SinusoidalTimeEmbedding(config.timestep_embed_dim),
+            nn.Linear(config.timestep_embed_dim, config.hidden_size, device=device),
+            nn.SiLU(),
+            nn.Linear(config.hidden_size, config.hidden_size, device=device),
+        )
+        self.action_embed = nn.Linear(config.max_action_dim, config.hidden_size, device=device)
+        self.context_k_proj = nn.Linear(self.llm_kv_dim, config.hidden_size, bias=False, device=device)
+        self.context_v_proj = nn.Linear(self.llm_kv_dim, config.hidden_size, bias=False, device=device)
+        self.context_norm = (
+            ActionExpertRMSNorm(config.hidden_size, eps=1e-6) if config.context_layer_norm else nn.Identity()
+        )
+        self._modulation_cache_key: tuple[Any, ...] | None = None
+        self._modulation_cache_value: Sequence[ActionExpertStepModulation] | None = None
+        self.blocks = nn.ModuleList(
+            [
+                ActionExpertBlock(
+                    config.hidden_size,
+                    config.num_heads,
+                    mlp_ratio=config.mlp_ratio,
+                    ffn_multiple_of=config.ffn_multiple_of,
+                    attn_dropout=config.attn_dropout,
+                    dropout=config.dropout,
+                    qk_norm=config.qk_norm,
+                    qk_norm_eps=config.qk_norm_eps,
+                    rope=config.rope,
+                )
+                for _ in range(config.num_layers)
+            ]
+        )
+        self.final_layer = ActionExpertFinalLayer(config.hidden_size, config.max_action_dim)
+        self.reset_parameters()
+
+    def reset_parameters(self) -> None:
+        for module in self.time_embed.modules():
+            if isinstance(module, nn.Linear):
+                _init_linear(module)
+        _init_linear(self.action_embed)
+        _init_linear(self.context_k_proj)
+        _init_linear(self.context_v_proj)
+        if isinstance(self.context_norm, ActionExpertRMSNorm):
+            self.context_norm.reset_parameters()
+        residual_scale = (2 * max(self.config.num_layers, 1)) ** -0.5
+        for block in self.blocks:
+            _init_linear(block.self_attn.qkv)
+            _init_linear(block.self_attn.out_proj, scale=residual_scale)
+            _init_linear(block.cross_attn.q_proj)
+            _init_linear(block.cross_attn.out_proj, scale=residual_scale)
+            _init_linear(block.mlp.up_proj)
+            _init_linear(block.mlp.gate_proj)
+            _init_linear(block.mlp.down_proj, scale=residual_scale)
+            _init_linear(block.modulation.linear, zero=True)
+            block.self_norm.reset_parameters()
+            block.cross_norm.reset_parameters()
+            block.ff_norm.reset_parameters()
+            if block.self_attn.q_norm is not None:
+                block.self_attn.q_norm.reset_parameters()
+            if block.self_attn.k_norm is not None:
+                block.self_attn.k_norm.reset_parameters()
+            if block.cross_attn.q_norm is not None:
+                block.cross_attn.q_norm.reset_parameters()
+            if block.cross_attn.k_norm is not None:
+                block.cross_attn.k_norm.reset_parameters()
+        self.final_layer.norm.reset_parameters()
+        _init_linear(self.final_layer.modulation.linear, zero=True)
+        _init_linear(self.final_layer.linear, zero=True)
+
+    def _reshape_hidden_to_heads(self, x: torch.Tensor) -> torch.Tensor:
+        return x.view(x.shape[0], x.shape[1], self.config.num_heads, self.action_head_dim)
+
+    def _time_conditioning(self, timesteps: torch.Tensor) -> torch.Tensor:
+        conditioning = self.time_embed[0](timesteps)
+        first_linear = self.time_embed[1]
+        if isinstance(first_linear, nn.Linear):
+            conditioning = conditioning.to(dtype=first_linear.weight.dtype)
+        for module in list(self.time_embed.children())[1:]:
+            conditioning = module(conditioning)
+        return conditioning
+
+    def _project_kv_tensor(self, x: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
+        flat = self.context_norm(proj(x))
+        return self._reshape_hidden_to_heads(flat)
+
+    def _prepare_kv_context(
+        self,
+        encoder_kv_states: Sequence[tuple[torch.Tensor, torch.Tensor]],
+    ) -> Sequence[tuple[torch.Tensor, torch.Tensor]]:
+        if len(encoder_kv_states) != len(self.blocks):
+            raise ValueError(
+                f"Expected {len(self.blocks)} KV layers for per-layer conditioning, "
+                f"got {len(encoder_kv_states)}."
+            )
+        kv_contexts = []
+        for block, (k_in, v_in) in zip(self.blocks, encoder_kv_states):
+            k_ctx = self._project_kv_tensor(k_in, self.context_k_proj)
+            v_ctx = self._project_kv_tensor(v_in, self.context_v_proj)
+            k_norm = block.cross_attn.k_norm
+            if k_norm is not None:
+                k_ctx = k_norm(k_ctx.transpose(1, 2)).transpose(1, 2)
+            kv_contexts.append((k_ctx, v_ctx))
+        return kv_contexts
+
+    @staticmethod
+    def _build_cross_attention_mask(
+        encoder_attention_mask: torch.Tensor | None,
+        batch_size: int,
+        dtype: torch.dtype,
+    ) -> torch.Tensor | None:
+        if encoder_attention_mask is None:
+            return None
+        mask = encoder_attention_mask[:, None, None, :].to(dtype=dtype)
+        return (1.0 - mask) * torch.finfo(dtype).min
+
+    def _build_self_attention_mask(
+        self,
+        action_attention_mask: torch.Tensor | None,
+        seq_len: int,
+        device: torch.device,
+        dtype: torch.dtype,
+    ) -> torch.Tensor | None:
+        mask = None
+        if action_attention_mask is not None:
+            valid = action_attention_mask.to(device=device, dtype=torch.bool)
+            key_mask = (~valid)[:, None, None, :].to(dtype=dtype)
+            mask = key_mask * torch.finfo(dtype).min
+        if self.config.causal_attn:
+            causal = torch.ones(seq_len, seq_len, device=device, dtype=torch.bool).triu(diagonal=1)
+            causal = causal.unsqueeze(0).unsqueeze(0).to(dtype=dtype) * torch.finfo(dtype).min
+            mask = causal if mask is None else mask + causal
+        return mask
+
+    def prepare_context(
+        self,
+        *,
+        encoder_kv_states: Sequence[tuple[torch.Tensor, torch.Tensor]],
+        encoder_attention_mask: torch.Tensor | None = None,
+        action_attention_mask: torch.Tensor | None = None,
+        state_embeddings: torch.Tensor | None = None,
+        batch_size: int,
+        seq_len: int,
+        device: torch.device,
+        dtype: torch.dtype,
+    ) -> ActionExpertContext:
+        if state_embeddings is not None:
+            raise ValueError(
+                "MolmoAct2 HF action expert supports only discrete state tokens. "
+                "Continuous state embeddings are not supported."
+            )
+        valid_action = None
+        if action_attention_mask is not None:
+            valid_action = action_attention_mask.to(device=device, dtype=dtype).unsqueeze(-1)
+        rope_cache = None
+        if len(self.blocks) > 0 and self.blocks[0].self_attn.rope is not None:
+            rope_cache = self.blocks[0].self_attn.rope.build_cache(
+                seq_len=seq_len,
+                device=device,
+                dtype=dtype,
+            )
+        kv_contexts = self._prepare_kv_context(encoder_kv_states)
+        cross_mask = self._build_cross_attention_mask(
+            encoder_attention_mask,
+            batch_size,
+            dtype,
+        )
+        self_mask = self._build_self_attention_mask(action_attention_mask, seq_len, device, dtype)
+        return ActionExpertContext(
+            kv_contexts=kv_contexts,
+            cross_mask=cross_mask,
+            self_mask=self_mask,
+            valid_action=valid_action,
+            rope_cache=rope_cache,
+        )
+
+    def prepare_modulation_cache(
+        self,
+        timesteps: Sequence[torch.Tensor],
+    ) -> Sequence[ActionExpertStepModulation]:
+        cache = []
+        for step_t in timesteps:
+            conditioning = self._time_conditioning(step_t)
+            block_modulations = []
+            for block in self.blocks:
+                block_modulations.append(tuple(block.modulation(conditioning).chunk(9, dim=1)))
+            final_modulation = tuple(self.final_layer.modulation(conditioning).chunk(2, dim=1))
+            cache.append(
+                ActionExpertStepModulation(
+                    conditioning=conditioning,
+                    block_modulations=block_modulations,
+                    final_modulation=final_modulation,
+                )
+            )
+        return cache
+
+    def get_or_prepare_modulation_cache(
+        self,
+        timesteps: Sequence[torch.Tensor],
+        *,
+        cache_key: tuple[Any, ...] | None = None,
+    ) -> Sequence[ActionExpertStepModulation]:
+        if self.training or cache_key is None:
+            return self.prepare_modulation_cache(timesteps)
+        if self._modulation_cache_key == cache_key and self._modulation_cache_value is not None:
+            return self._modulation_cache_value
+        cached = self.prepare_modulation_cache(timesteps)
+        self._modulation_cache_key = cache_key
+        self._modulation_cache_value = cached
+        return cached
+
+    def forward_with_context(
+        self,
+        actions: torch.Tensor,
+        timesteps: torch.Tensor,
+        *,
+        context: ActionExpertContext,
+        modulation: ActionExpertStepModulation | None = None,
+    ) -> torch.Tensor:
+        _, seq_len, _ = actions.shape
+        if seq_len > self.config.max_action_horizon:
+            raise ValueError(
+                f"Action sequence length {seq_len} exceeds configured max_action_horizon={self.config.max_action_horizon}"
+            )
+        if modulation is None:
+            conditioning = self._time_conditioning(timesteps)
+            block_modulations: Sequence[tuple[torch.Tensor, ...] | None] = [None] * len(self.blocks)
+            final_modulation = None
+        else:
+            conditioning = modulation.conditioning
+            block_modulations = modulation.block_modulations
+            final_modulation = modulation.final_modulation
+        x = self.action_embed(actions)
+        if context.valid_action is not None:
+            x = x * context.valid_action
+        for block, kv_context, block_modulation in zip(
+            self.blocks, context.kv_contexts, block_modulations
+        ):
+            x = block(
+                x,
+                conditioning,
+                cross_kv=kv_context,
+                self_attn_mask=context.self_mask,
+                attn_mask=context.cross_mask,
+                is_causal=self.config.causal_attn and context.self_mask is None,
+                modulation=block_modulation,
+                rope_cache=context.rope_cache,
+            )
+            if context.valid_action is not None:
+                x = x * context.valid_action
+        out = self.final_layer(x, conditioning, modulation=final_modulation)
+        if context.valid_action is not None:
+            out = out * context.valid_action
+        return out
+
+    def forward(
+        self,
+        actions: torch.Tensor,
+        timesteps: torch.Tensor,
+        *,
+        encoder_kv_states: Sequence[tuple[torch.Tensor, torch.Tensor]],
+        encoder_attention_mask: torch.Tensor | None = None,
+        action_attention_mask: torch.Tensor | None = None,
+        state_embeddings: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        bsz, seq_len, _ = actions.shape
+        context = self.prepare_context(
+            encoder_kv_states=encoder_kv_states,
+            encoder_attention_mask=encoder_attention_mask,
+            action_attention_mask=action_attention_mask,
+            state_embeddings=state_embeddings,
+            batch_size=bsz,
+            seq_len=seq_len,
+            device=actions.device,
+            dtype=actions.dtype,
+        )
+        return self.forward_with_context(actions, timesteps, context=context)
+
+
+def _to_numpy(value: Any) -> np.ndarray:
+    if isinstance(value, np.ndarray):
+        return value
+    if torch.is_tensor(value):
+        return value.detach().cpu().numpy()
+    return np.asarray(value)
+
+
+def _to_array(value: Any) -> np.ndarray | None:
+    if value is None:
+        return None
+    if torch.is_tensor(value):
+        tensor = value.detach()
+        if tensor.dtype in (torch.bfloat16, torch.float16):
+            tensor = tensor.float()
+        return tensor.cpu().numpy().astype(np.float32, copy=False)
+    return np.asarray(value, dtype=np.float32)
+
+
+def _to_mask(value: Any, fallback_like: np.ndarray | None) -> np.ndarray | None:
+    if value is None:
+        return None
+    mask = np.asarray(value, dtype=np.bool_)
+    if fallback_like is not None and mask.shape != fallback_like.shape:
+        mask = np.broadcast_to(mask, fallback_like.shape)
+    return mask
+
+
+def _feature_dim_from_stats(stats: Mapping[str, Any] | None) -> int | None:
+    if not isinstance(stats, Mapping):
+        return None
+    for key in (
+        "mean",
+        "std",
+        "min",
+        "max",
+        "q01",
+        "q99",
+        "q10",
+        "q90",
+        "mask",
+        "names",
+    ):
+        value = stats.get(key)
+        if value is None:
+            continue
+        arr = np.asarray(value)
+        if arr.shape:
+            return int(arr.shape[-1])
+        if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
+            return int(len(value))
+    return None
+
+
+class _FeatureNormalizer:
+    def __init__(
+        self,
+        *,
+        mode: str,
+        mean: np.ndarray | None = None,
+        std: np.ndarray | None = None,
+        min_val: np.ndarray | None = None,
+        max_val: np.ndarray | None = None,
+        q_low: np.ndarray | None = None,
+        q_high: np.ndarray | None = None,
+        mask: np.ndarray | None = None,
+        zero_mask: np.ndarray | None = None,
+    ):
+        self.mode = mode
+        self.mean = mean
+        self.std = std
+        self.min_val = min_val
+        self.max_val = max_val
+        self.q_low = q_low
+        self.q_high = q_high
+        self.mask = mask
+        self.zero_mask = zero_mask
+
+    @classmethod
+    def from_stats(cls, stats: Mapping[str, Any] | None, mode: str) -> Optional["_FeatureNormalizer"]:
+        if stats is None:
+            return None
+        raw_mask = stats.get("mask") if isinstance(stats, Mapping) else None
+        if mode == "none":
+            fallback = None
+            for key in (
+                "mean",
+                "std",
+                "min",
+                "max",
+                "q01",
+                "q99",
+                "q10",
+                "q90",
+                "mask",
+            ):
+                fallback = _to_array(stats.get(key))
+                if fallback is not None:
+                    break
+            return cls(mode=mode, mask=_to_mask(raw_mask, fallback))
+        if mode == "mean_std":
+            mean = _to_array(stats.get("mean"))
+            std = _to_array(stats.get("std"))
+            if mean is None or std is None:
+                raise ValueError("norm_mode='mean_std' requires mean and std stats.")
+            return cls(mode=mode, mean=mean, std=std, mask=_to_mask(raw_mask, mean))
+        if mode == "min_max":
+            min_val = _to_array(stats.get("min"))
+            max_val = _to_array(stats.get("max"))
+            if min_val is None or max_val is None:
+                raise ValueError("norm_mode='min_max' requires min and max stats.")
+            return cls(
+                mode=mode,
+                min_val=min_val,
+                max_val=max_val,
+                mask=_to_mask(raw_mask, min_val),
+                zero_mask=(min_val == max_val),
+            )
+        if mode in {"q01_q99", "q10_q90"}:
+            low_key, high_key = ("q01", "q99") if mode == "q01_q99" else ("q10", "q90")
+            q_low = _to_array(stats.get(low_key))
+            q_high = _to_array(stats.get(high_key))
+            if q_low is None or q_high is None:
+                raise ValueError(f"norm_mode={mode!r} requires {low_key} and {high_key} stats.")
+            min_val = _to_array(stats.get("min"))
+            max_val = _to_array(stats.get("max"))
+            fallback = min_val if min_val is not None else q_low
+            zero_mask = None if min_val is None or max_val is None else (min_val == max_val)
+            return cls(
+                mode=mode,
+                min_val=min_val,
+                max_val=max_val,
+                q_low=q_low,
+                q_high=q_high,
+                mask=_to_mask(raw_mask, fallback),
+                zero_mask=zero_mask,
+            )
+        raise ValueError(f"Unsupported robot normalization mode {mode!r}.")
+
+    def normalize(self, x: Any) -> Any:
+        arr = _to_array(x)
+        if arr is None:
+            return None
+        eps = 1e-6
+        if self.mode == "none":
+            normed = arr
+        elif self.mode == "mean_std":
+            normed = (arr - self.mean) / np.maximum(self.std, eps)
+        elif self.mode == "min_max":
+            normed = 2.0 * (arr - self.min_val) / np.maximum(self.max_val - self.min_val, eps) - 1.0
+        elif self.mode in {"q01_q99", "q10_q90"}:
+            normed = 2.0 * (arr - self.q_low) / np.maximum(self.q_high - self.q_low, eps) - 1.0
+        else:
+            normed = arr
+        if self.mode in {"min_max", "q01_q99", "q10_q90"}:
+            normed = np.clip(normed, -1.0, 1.0)
+        if self.mask is not None:
+            normed = np.where(self.mask, normed, arr)
+        if self.zero_mask is not None:
+            normed = np.where(self.zero_mask, 0.0, normed)
+        if torch.is_tensor(x):
+            return torch.as_tensor(normed, device=x.device, dtype=x.dtype)
+        return normed
+
+    def unnormalize(self, x: Any) -> Any:
+        arr = _to_array(x)
+        if arr is None:
+            return None
+        if self.mode in {"min_max", "q01_q99", "q10_q90"}:
+            arr = np.clip(arr, -1.0, 1.0)
+        if self.mode == "none":
+            out = arr
+        elif self.mode == "mean_std":
+            out = arr * self.std + self.mean
+        elif self.mode == "min_max":
+            out = (arr + 1.0) * (self.max_val - self.min_val) / 2.0 + self.min_val
+        elif self.mode in {"q01_q99", "q10_q90"}:
+            out = (arr + 1.0) * (self.q_high - self.q_low) / 2.0 + self.q_low
+        else:
+            out = arr
+        if self.mask is not None:
+            out = np.where(self.mask, out, arr)
+        if torch.is_tensor(x):
+            return torch.as_tensor(out, device=x.device, dtype=x.dtype)
+        return out
+
+
+class _RobotStats:
+    def __init__(self, payload: Mapping[str, Any]):
+        self.norm_mode = str(payload.get("norm_mode", "min_max"))
+        self.metadata_by_tag: dict[str, dict[str, Any]] = {
+            str(tag): dict(metadata or {})
+            for tag, metadata in dict(payload.get("metadata_by_tag") or {}).items()
+        }
+        self.action_normalizers = {}
+        self.state_normalizers = {}
+        for tag, metadata in self.metadata_by_tag.items():
+            if metadata.get("action_stats") is not None:
+                self.action_normalizers[tag] = _FeatureNormalizer.from_stats(
+                    metadata.get("action_stats"),
+                    self.norm_mode,
+                )
+            if metadata.get("state_stats") is not None:
+                self.state_normalizers[tag] = _FeatureNormalizer.from_stats(
+                    metadata.get("state_stats"),
+                    self.norm_mode,
+                )
+
+    def validate_tag(self, norm_tag: str | None) -> str:
+        tag = str(norm_tag or "").strip()
+        if not tag:
+            raise ValueError("MolmoAct2 `predict_action` requires `norm_tag`.")
+        if tag not in self.metadata_by_tag:
+            allowed = ", ".join(sorted(self.metadata_by_tag))
+            raise ValueError(f"Unknown MolmoAct2 normalization tag {tag!r}. Allowed tags: {allowed}.")
+        return tag
+
+    def get_metadata(self, norm_tag: str | None) -> dict[str, Any]:
+        if norm_tag is None:
+            return {}
+        return dict(self.metadata_by_tag.get(str(norm_tag), {}) or {})
+
+    def normalize_state(self, state: Any, norm_tag: str) -> Any:
+        normalizer = self.state_normalizers.get(str(norm_tag))
+        return state if normalizer is None else normalizer.normalize(state)
+
+    def unnormalize_action(self, action: Any, norm_tag: str) -> Any:
+        normalizer = self.action_normalizers.get(str(norm_tag))
+        return action if normalizer is None else normalizer.unnormalize(action)
+
+    def get_action_dim(self, norm_tag: str) -> int | None:
+        metadata = self.get_metadata(norm_tag)
+        stats = metadata.get("action_stats")
+        dim = _feature_dim_from_stats(stats)
+        return dim
+
+    def get_state_dim(self, norm_tag: str) -> int | None:
+        metadata = self.get_metadata(norm_tag)
+        return _feature_dim_from_stats(metadata.get("state_stats"))
+
+    def get_action_horizon(self, norm_tag: str) -> int | None:
+        return self._get_positive_int(norm_tag, "action_horizon")
+
+    def get_n_action_steps(self, norm_tag: str) -> int | None:
+        return self._get_positive_int(norm_tag, "n_action_steps")
+
+    def _get_positive_int(self, norm_tag: str, key: str) -> int | None:
+        value = self.get_metadata(norm_tag).get(key)
+        if value is None:
+            return None
+        value = int(value)
+        if value < 1:
+            raise ValueError(f"Robot metadata for norm_tag={norm_tag!r} must define {key} >= 1.")
+        return value
+
+
+def _normalize_image_for_cache(image: Any) -> np.ndarray:
+    arr = np.asarray(image)
+    if arr.ndim == 2:
+        arr = np.stack([arr] * 3, axis=-1)
+    if arr.ndim == 3 and arr.shape[0] in {1, 3, 4} and arr.shape[-1] not in {1, 3, 4}:
+        arr = np.moveaxis(arr, 0, -1)
+    if arr.ndim == 3 and arr.shape[-1] == 1:
+        arr = np.repeat(arr, 3, axis=-1)
+    if arr.dtype in (np.float32, np.float64):
+        if arr.size > 0 and float(arr.max()) <= 1.0:
+            arr = arr * 255.0
+        arr = np.clip(arr, 0, 255).astype(np.uint8)
+    elif arr.dtype != np.uint8:
+        arr = np.clip(arr, 0, 255).astype(np.uint8)
+    return arr
+
+
+def _extract_first_image(images: Any) -> np.ndarray | None:
+    if images is None:
+        return None
+    if isinstance(images, (list, tuple)):
+        if not images:
+            return None
+        return _normalize_image_for_cache(images[0])
+    arr = _to_numpy(images)
+    if arr.ndim == 4:
+        return _normalize_image_for_cache(arr[0])
+    return _normalize_image_for_cache(arr)
+
+
+def _resize_depth_reasoning_image(image: np.ndarray, target_size: int) -> np.ndarray:
+    from PIL import Image
+
+    if image.shape[0] == target_size and image.shape[1] == target_size:
+        return image
+    pil_image = Image.fromarray(np.asarray(image, dtype=np.uint8))
+    return np.asarray(pil_image.resize((target_size, target_size), Image.BILINEAR))
+
+
+def _compute_depth_update_mask(
+    current_image: np.ndarray,
+    previous_image: np.ndarray,
+    *,
+    num_depth_codes: int,
+) -> np.ndarray:
+    grid_side = int(math.isqrt(int(num_depth_codes)))
+    if grid_side * grid_side != int(num_depth_codes):
+        raise ValueError(
+            f"enable_adaptive_depth=True requires a square depth grid, got num_depth_codes={int(num_depth_codes)}."
+        )
+    target_size = grid_side * _DEPTH_REASONING_PATCH_SIZE
+    current_resized = _resize_depth_reasoning_image(current_image, target_size).astype(np.float32)
+    previous_resized = _resize_depth_reasoning_image(previous_image, target_size).astype(np.float32)
+    current_patches = (
+        current_resized.reshape(
+            grid_side,
+            _DEPTH_REASONING_PATCH_SIZE,
+            grid_side,
+            _DEPTH_REASONING_PATCH_SIZE,
+            3,
+        )
+        .transpose(0, 2, 1, 3, 4)
+        .reshape(grid_side, grid_side, -1)
+    )
+    previous_patches = (
+        previous_resized.reshape(
+            grid_side,
+            _DEPTH_REASONING_PATCH_SIZE,
+            grid_side,
+            _DEPTH_REASONING_PATCH_SIZE,
+            3,
+        )
+        .transpose(0, 2, 1, 3, 4)
+        .reshape(grid_side, grid_side, -1)
+    )
+    dot = np.sum(current_patches * previous_patches, axis=-1)
+    norm_current = np.linalg.norm(current_patches, axis=-1)
+    norm_previous = np.linalg.norm(previous_patches, axis=-1)
+    denom = norm_current * norm_previous
+    similarity = np.where(denom < 1e-8, 1.0, dot / (denom + 1e-12))
+    return np.asarray(similarity < _DEPTH_REASONING_THRESHOLD, dtype=np.bool_).reshape(-1)
+
+
+def _build_depth_update_spans(
+    update_mask: Sequence[bool],
+) -> list[tuple[int, int, bool]]:
+    flat_mask = np.asarray(update_mask, dtype=np.bool_).reshape(-1)
+    if flat_mask.size == 0:
+        return []
+    spans: list[tuple[int, int, bool]] = []
+    start = 0
+    current_value = bool(flat_mask[0])
+    for idx in range(1, int(flat_mask.shape[0])):
+        next_value = bool(flat_mask[idx])
+        if next_value == current_value:
+            continue
+        spans.append((start, idx, current_value))
+        start = idx
+        current_value = next_value
+    spans.append((start, int(flat_mask.shape[0]), current_value))
+    return spans
+
+
+def _wrap_setup_text(setup_type: str, add_setup_tokens: bool = False) -> str:
+    setup_type = str(setup_type or "")
+    if setup_type.startswith(SETUP_START_TOKEN) and setup_type.endswith(SETUP_END_TOKEN):
+        return setup_type
+    if not setup_type or not add_setup_tokens:
+        return setup_type
+    return f"{SETUP_START_TOKEN}{setup_type}{SETUP_END_TOKEN}"
+
+
+def _wrap_control_text(control_mode: str, add_control_tokens: bool = False) -> str:
+    control_mode = str(control_mode or "")
+    if control_mode.startswith(CONTROL_START_TOKEN) and control_mode.endswith(CONTROL_END_TOKEN):
+        return control_mode
+    if not control_mode or not add_control_tokens:
+        return control_mode
+    return f"{CONTROL_START_TOKEN}{control_mode}{CONTROL_END_TOKEN}"
+
+
+def _discretize_normalized_state(state: np.ndarray, num_state_tokens: int) -> np.ndarray:
+    arr = np.asarray(state, dtype=np.float32)
+    arr = np.nan_to_num(arr, nan=0.0, posinf=1.0, neginf=-1.0)
+    arr = np.clip(arr, -1.0, 1.0)
+    scaled = (arr + 1.0) / 2.0 * float(num_state_tokens - 1)
+    return np.clip(np.rint(scaled).astype(np.int64), 0, int(num_state_tokens) - 1)
+
+
+def _build_discrete_state_string(state: np.ndarray | None, num_state_tokens: int) -> str:
+    if state is None:
+        return ""
+    token_ids = _discretize_normalized_state(state, num_state_tokens).reshape(-1)
+    return f"{STATE_START_TOKEN}{''.join(f'{STATE_TOKEN_PREFIX}{int(token_id)}>' for token_id in token_ids)}{STATE_END_TOKEN}"
+
+
+def _normalize_question_text(text: str) -> str:
+    normalized = re.sub(r"\s+", " ", text).strip()
+    if not normalized:
+        return ""
+    previous = None
+    while normalized and normalized != previous:
+        previous = normalized
+        normalized = normalized.strip().strip(_QUESTION_SURROUNDING_DELIMITERS).strip()
+        for pattern in _QUESTION_PREFIX_PATTERNS:
+            normalized = pattern.sub("", normalized, count=1).strip()
+        normalized = normalized.rstrip(_QUESTION_TRAILING_SENTENCE_PUNCTUATION).rstrip()
+        normalized = normalized.rstrip(_QUESTION_TRAILING_CLOSERS).rstrip()
+        normalized = normalized.rstrip(_QUESTION_TRAILING_SENTENCE_PUNCTUATION).rstrip()
+    sentence_chunks = [chunk.strip() for chunk in re.split(r"[.!?]+", normalized) if chunk.strip()]
+    if len(sentence_chunks) > 1:
+        normalized = "; ".join(sentence_chunks)
+    normalized = normalized.lower()
+    return normalized
+
+
+def _build_robot_text(
+    *,
+    task: str,
+    style: str,
+    discrete_state_string: str,
+    setup_type: str,
+    control_mode: str,
+    add_setup_tokens: bool,
+    add_control_tokens: bool,
+    num_images: int,
+) -> str:
+    setup_text = _wrap_setup_text(setup_type, add_setup_tokens=add_setup_tokens)
+    control_text = _wrap_control_text(control_mode, add_control_tokens=add_control_tokens)
+    state_clause = (
+        f" The current state of the robot is {discrete_state_string}." if discrete_state_string else ""
+    )
+    if style == "robot_depth_action":
+        prompt = (
+            f"The task is to {task}. The setup is {setup_text}.{state_clause} "
+            f"The expected control mode is {control_text}. Given these, first predict the depth map of the main image "
+            "and then predict the action the robot should take to complete the task?"
+        )
+        trigger = f"{DEPTH_OUTPUT_TOKEN}{ACTION_OUTPUT_TOKEN}"
+    else:
+        prompt = (
+            f"The task is to {task}. The setup is {setup_text}.{state_clause} "
+            f"The expected control mode is {control_text}. Given these, what action should the robot take to complete the task?"
+        )
+        trigger = ACTION_OUTPUT_TOKEN
+    if num_images <= 0:
+        image_prefix = ""
+    elif num_images == 1:
+        image_prefix = "<|image|>"
+    else:
+        image_prefix = "".join(f"Image {idx + 1}<|image|>" for idx in range(num_images))
+    return f"{image_prefix}<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n{trigger}"
+
+
+def _flatten_generated_token_ids(token_ids: torch.Tensor) -> list[int]:
+    if token_ids.ndim == 3:
+        return [int(x) for x in token_ids[0, 0].detach().cpu().tolist()]
+    if token_ids.ndim == 2:
+        return [int(x) for x in token_ids[0].detach().cpu().tolist()]
+    if token_ids.ndim == 1:
+        return [int(x) for x in token_ids.detach().cpu().tolist()]
+    raise ValueError(f"Unexpected generated token tensor shape {tuple(token_ids.shape)}")
+
+
+def _extract_discrete_token_bins(
+    generated_ids: list[int],
+    start_token_id: int,
+    end_token_id: int,
+    token_id_to_bin: dict[int, int],
+) -> list[int]:
+    start_idx = None
+    end_idx = None
+    for idx, token_id in enumerate(generated_ids):
+        if token_id == start_token_id:
+            start_idx = idx
+            break
+    if start_idx is not None:
+        for idx in range(start_idx + 1, len(generated_ids)):
+            if generated_ids[idx] == end_token_id:
+                end_idx = idx
+                break
+    span_start = 0 if start_idx is None else start_idx + 1
+    span_end = len(generated_ids) if end_idx is None else end_idx
+    return [
+        int(token_id_to_bin[token_id])
+        for token_id in generated_ids[span_start:span_end]
+        if token_id in token_id_to_bin
+    ]
+
+
+@dataclass
+class MolmoAct2ActionOutput(ModelOutput):
+    actions: torch.FloatTensor | None = None
+    generated_token_ids: torch.LongTensor | None = None
+    depth_bins: torch.LongTensor | None = None
+    depth_cache: dict[str, Any] | None = None
+
+
+@dataclass
+class _DepthPrefix:
+    token_ids: torch.Tensor
+    depth_bins: torch.Tensor
+    full_input_ids: torch.Tensor
+    attention_mask: torch.Tensor | None
+    encoder_kv_states: Sequence[tuple[torch.Tensor, torch.Tensor]]
+    next_output: Any
+    past_key_values: Cache | None
+
+
+@dataclass
+class MolmoAct2CausalLMOutputWithPast(ModelOutput):
+    """
+    Base class for MolmoAct2 causal language model (or autoregressive) outputs.
+
+    Args:
+        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
+            Language modeling loss (for next-token prediction).
+        logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
+            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
+        past_key_values (`Cache`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
+            It is a [`~cache_utils.Cache`] instance. For more details, see our [kv cache guide](https://huggingface.co/docs/transformers/en/kv_cache).
+
+            Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
+            `past_key_values` input) to speed up sequential decoding.
+        image_hidden_states (`torch.FloatTensor`, *optional*):
+            A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
+            image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
+    """
+
+    loss: torch.FloatTensor | None = None
+    logits: torch.FloatTensor | None = None
+    past_key_values: Cache | None = None
+    hidden_states: tuple[torch.FloatTensor] | None = None
+    attentions: tuple[torch.FloatTensor] | None = None
+    image_hidden_states: torch.FloatTensor | None = None
+
+
+@dataclass
+class MolmoAct2ModelOutputWithPast(BaseModelOutputWithPast):
+    """
+    Base class for MolmoAct2 outputs, with hidden states and attentions.
+
+    Args:
+        image_hidden_states (`torch.FloatTensor`, *optional*):
+            A `torch.FloatTensor` of size `(batch_num_patches, hidden_size)`.
+            image_hidden_states of the model produced by the vision backbone.
+    """
+
+    last_hidden_state: torch.FloatTensor | None = None
+    past_key_values: Cache | None = None
+    hidden_states: tuple[torch.FloatTensor] | None = None
+    attentions: tuple[torch.FloatTensor] | None = None
+    image_hidden_states: torch.FloatTensor | None = None
+
+
+class ViTMLP(nn.Module):
+    def __init__(
+        self,
+        dim: int,
+        hidden_dim: int,
+        hidden_act: str,
+        device: str | torch.device = None,
+    ):
+        super().__init__()
+        self.w1 = nn.Linear(dim, hidden_dim, bias=True, device=device)
+        self.act = ACT2FN[hidden_act]
+        self.w2 = nn.Linear(hidden_dim, dim, bias=True, device=device)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.w2(self.act(self.w1(x)))
+
+
+class ViTMultiHeadDotProductAttention(nn.Module):
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        num_key_value_heads: int,
+        head_dim: int,
+        use_bias: bool = True,
+        input_dim: int | None = None,
+        float32_attention: bool = True,
+        attention_dropout: float = 0.0,
+        residual_dropout: float = 0.0,
+        device: str | torch.device = None,
+        attn_implementation: str = "eager",
+    ):
+        super().__init__()
+
+        self.hidden_size = hidden_size
+        self.num_heads = num_heads
+        self.head_dim = head_dim
+        self.num_key_value_heads = num_key_value_heads
+        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+        self.attn_implementation = attn_implementation
+        self.is_causal = False
+
+        input_dim = input_dim or hidden_size
+
+        self.wq = nn.Linear(
+            input_dim,
+            self.num_heads * self.head_dim,
+            bias=use_bias,
+            device=device,
+        )
+        self.wk = nn.Linear(
+            input_dim,
+            self.num_key_value_heads * self.head_dim,
+            bias=use_bias,
+            device=device,
+        )
+        self.wv = nn.Linear(
+            input_dim,
+            self.num_key_value_heads * self.head_dim,
+            bias=use_bias,
+            device=device,
+        )
+        self.wo = nn.Linear(
+            self.num_heads * self.head_dim,
+            self.hidden_size,
+        )
+        self.float32_attention = float32_attention
+        self.attention_dropout = attention_dropout
+        self.residual_dropout = nn.Dropout(residual_dropout)
+        self.sdpa_backend_list = [
+            SDPBackend.FLASH_ATTENTION,
+            SDPBackend.CUDNN_ATTENTION,
+            SDPBackend.EFFICIENT_ATTENTION,
+            SDPBackend.MATH,
+        ]
+
+    def _split_heads(self, hidden_states, num_heads) -> torch.Tensor:
+        return hidden_states.reshape(hidden_states.shape[:2] + (num_heads, self.head_dim))
+
+    def _merge_heads(self, hidden_states) -> torch.Tensor:
+        return hidden_states.reshape(hidden_states.shape[:2] + (self.hidden_size,))
+
+    def forward(
+        self,
+        inputs_q: torch.Tensor,
+        inputs_kv: torch.Tensor | None = None,
+        attn_mask: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        if inputs_kv is not None:
+            inputs_k = inputs_kv
+            inputs_v = inputs_kv
+        else:
+            inputs_k = inputs_q
+            inputs_v = inputs_q
+
+        xq, xk, xv = self.wq(inputs_q), self.wk(inputs_k), self.wv(inputs_v)
+
+        xq = self._split_heads(xq, self.num_heads)
+        xk = self._split_heads(xk, self.num_key_value_heads)
+        xv = self._split_heads(xv, self.num_key_value_heads)
+
+        if self.num_heads != self.num_key_value_heads:
+            xk = xk.repeat_interleave(self.num_key_value_groups, dim=2, output_size=self.num_heads)
+            xv = xv.repeat_interleave(self.num_key_value_groups, dim=2, output_size=self.num_heads)
+
+        og_dtype = xq.dtype
+
+        if self.float32_attention:
+            xq = xq.to(torch.float)
+            xk = xk.to(torch.float)
+
+        dropout_p = 0.0 if not self.training else self.attention_dropout
+
+        if self.attn_implementation == "eager":
+            attn_weights = torch.einsum("...qhd,...khd->...hqk", xq / math.sqrt(xq.size(-1)), xk)
+            attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(xq.dtype)
+            attn_weights = F.dropout(attn_weights, p=dropout_p, training=self.training)
+            attn_output = torch.einsum("...hqk,...khd->...qhd", attn_weights.to(xv.dtype), xv)
+
+        elif self.attn_implementation == "sdpa":
+            if self.float32_attention:
+                xv = xv.to(torch.float32)
+
+            query = xq.transpose(1, 2).contiguous()
+            key = xk.transpose(1, 2).contiguous()
+            value = xv.transpose(1, 2).contiguous()
+            if inputs_kv is not None:
+                with sdpa_kernel(self.sdpa_backend_list):
+                    attn_output = F.scaled_dot_product_attention(
+                        query,
+                        key,
+                        value,
+                        attn_mask=attn_mask,
+                        is_causal=False,
+                        dropout_p=dropout_p,
+                    ).transpose(1, 2)
+            else:
+                attn_output = F.scaled_dot_product_attention(
+                    query,
+                    key,
+                    value,
+                    attn_mask=attn_mask,
+                    is_causal=False,
+                    dropout_p=dropout_p,
+                ).transpose(1, 2)
+
+        elif self.attn_implementation == "flash_attention_2":
+            target_dtype = None
+            if xq.dtype == torch.float32:
+                target_dtype = (
+                    torch.get_autocast_gpu_dtype() if torch.is_autocast_enabled() else self.wq.weight.dtype
+                )
+            attn_output = _flash_attention_forward(
+                xq,
+                xk,
+                xv,
+                attention_mask=attn_mask,
+                query_length=inputs_q.shape[1],
+                is_causal=False,
+                dropout=dropout_p,
+                softmax_scale=xq.shape[-1] ** -0.5,
+                use_top_left_mask=flash_attn_supports_top_left_mask(),
+                target_dtype=target_dtype,
+                implementation=self.attn_implementation,
+            )
+        else:
+            raise ValueError(f"Attention implementation {self.attn_implementation} not supported")
+
+        attn_output = attn_output.to(og_dtype)
+        attn_output = self._merge_heads(attn_output)
+        attn_output = self.wo(attn_output)
+        attn_output = self.residual_dropout(attn_output)
+
+        return attn_output
+
+
+class MolmoAct2VisionBlock(nn.Module):
+    def __init__(self, config: MolmoAct2VitConfig, device: str | torch.device = None):
+        super().__init__()
+        self.attention = ViTMultiHeadDotProductAttention(
+            hidden_size=config.hidden_size,
+            num_heads=config.num_attention_heads,
+            num_key_value_heads=config.num_key_value_heads,
+            head_dim=config.head_dim,
+            float32_attention=config.float32_attention,
+            attention_dropout=config.attention_dropout,
+            residual_dropout=config.residual_dropout,
+            device=device,
+            attn_implementation=config._attn_implementation,
+        )
+        self.feed_forward = ViTMLP(
+            config.hidden_size,
+            config.intermediate_size,
+            config.hidden_act,
+            device=device,
+        )
+        self.attention_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps, device=device)
+        self.ffn_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps, device=device)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = x + self.attention(self.attention_norm(x))
+        x = x + self.feed_forward(self.ffn_norm(x))
+        return x
+
+
+class MolmoAct2VisionBlockCollection(nn.Module):
+    def __init__(self, config: MolmoAct2VitConfig, device: str | torch.device = None):
+        super().__init__()
+        self.config = config
+        self.resblocks = nn.ModuleList(
+            [MolmoAct2VisionBlock(config, device) for _ in range(config.num_hidden_layers)]
+        )
+
+    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
+        hidden_states = []
+        for r in self.resblocks:
+            x = r(x)
+            hidden_states.append(x)
+        return hidden_states
+
+
+class MolmoAct2VisionTransformer(nn.Module):
+    def __init__(self, config: MolmoAct2VitConfig, device: str | torch.device = None):
+        super().__init__()
+        self.config = config
+
+        # positional embeddings
+        self.scale = config.hidden_size**-0.5
+        self.num_prefix_tokens: int = 0  # no class embeddings
+        self.positional_embedding = nn.Parameter(
+            torch.zeros(config.image_num_pos, config.hidden_size, device=device),
+        )
+
+        image_patch_size = config.image_patch_size
+        self.patch_embedding = nn.Linear(
+            image_patch_size * image_patch_size * 3,
+            config.hidden_size,
+            bias=True,
+            device=device,
+        )
+
+        self.transformer = MolmoAct2VisionBlockCollection(config, device)
+
+    def add_pos_emb(self, x: torch.Tensor, patch_num: tuple[int, int]) -> torch.Tensor:
+        pos_emb = self.positional_embedding
+
+        pos_emb = pos_emb.reshape(
+            (
+                int(math.sqrt(pos_emb.shape[0])),
+                int(math.sqrt(pos_emb.shape[0])),
+                pos_emb.shape[1],
+            )
+        )
+
+        (patch_num_0, patch_num_1) = patch_num
+
+        if pos_emb.shape[0] != patch_num_0 or pos_emb.shape[1] != patch_num_1:
+            # Derived from https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
+            # antialias: default True in jax.image.resize
+            pos_emb = pos_emb.unsqueeze(0).permute(0, 3, 1, 2)
+            pos_emb = F.interpolate(
+                pos_emb,
+                size=(patch_num_0, patch_num_1),
+                mode="bicubic",
+                align_corners=False,
+                antialias=True,
+            )
+            pos_emb = pos_emb.permute(0, 2, 3, 1).squeeze(0)
+
+        pos_emb = pos_emb.reshape(-1, pos_emb.shape[-1])
+        x = x + pos_emb[None, :, :].to(x.dtype)
+        return x
+
+    def forward(self, x: torch.Tensor, patch_num: tuple[int, int] | None = None) -> list[torch.Tensor]:
+        """
+        :param x: (batch_size, num_patch, n_pixels)
+        """
+        if patch_num is None:
+            patch_num = self.config.image_num_patch
+
+        B, N, D = x.shape
+
+        x = self.patch_embedding(x)
+
+        # positional embeddings only (this ViT has no class embedding)
+        x = self.add_pos_emb(x, patch_num)
+
+        hidden_states = self.transformer(x)
+        return hidden_states
+
+
+class ImageProjectorMLP(nn.Module):
+    def __init__(
+        self,
+        input_dim: int,
+        hidden_dim: int,
+        output_dim: int,
+        hidden_act: str,
+        device: str | torch.device = None,
+    ):
+        super().__init__()
+        self.w1 = nn.Linear(input_dim, hidden_dim, bias=False, device=device)
+        self.w2 = nn.Linear(hidden_dim, output_dim, bias=False, device=device)
+        self.w3 = nn.Linear(input_dim, hidden_dim, bias=False, device=device)
+        self.act = ACT2FN[hidden_act]
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.w2(self.act(self.w1(x)) * self.w3(x))
+
+
+class MolmoAct2VisionBackbone(nn.Module):
+    def __init__(self, vit_config: MolmoAct2VitConfig, adapter_config: MolmoAct2AdapterConfig):
+        super().__init__()
+        self.vit_config = vit_config
+        self.adapter_config = adapter_config
+
+        self.vit_layers = []
+        for layer in adapter_config.vit_layers:
+            if layer >= 0:
+                self.vit_layers.append(layer)
+            else:
+                self.vit_layers.append(layer + vit_config.num_hidden_layers)
+
+        last_layer_needed = max(self.vit_layers) + 1
+        if last_layer_needed < vit_config.num_hidden_layers:
+            new_vit_config = deepcopy(vit_config)
+            new_vit_config.num_hidden_layers = last_layer_needed
+            self.image_vit = MolmoAct2VisionTransformer(new_vit_config)
+        else:
+            self.image_vit = MolmoAct2VisionTransformer(vit_config)
+
+        self.num_prefix_tokens: int = self.image_vit.num_prefix_tokens
+
+        pool_dim = vit_config.hidden_size * len(adapter_config.vit_layers)
+        self.image_pooling_2d = ViTMultiHeadDotProductAttention(
+            hidden_size=adapter_config.hidden_size,
+            num_heads=adapter_config.num_attention_heads,
+            num_key_value_heads=adapter_config.num_key_value_heads,
+            head_dim=adapter_config.head_dim,
+            input_dim=pool_dim,
+            float32_attention=adapter_config.float32_attention,
+            attention_dropout=adapter_config.attention_dropout,
+            residual_dropout=adapter_config.residual_dropout,
+            attn_implementation=adapter_config._attn_implementation,
+        )
+        self.image_projector = ImageProjectorMLP(
+            adapter_config.hidden_size,
+            adapter_config.intermediate_size,
+            adapter_config.text_hidden_size,
+            adapter_config.hidden_act,
+        )
+        self.image_feature_dropout = nn.Dropout(adapter_config.image_feature_dropout)
+        self.gradient_checkpointing = False
+
+    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
+        """
+        :param images: (batch_size, num_crops, num_patch, n_pixels)
+        """
+        batch_size, num_crops, num_patches, patch_dim = images.shape
+        images = images.view(batch_size * num_crops, num_patches, patch_dim)
+
+        x = self.image_vit.patch_embedding(images)
+        x = self.image_vit.add_pos_emb(x, self.image_vit.config.image_num_patch)
+
+        needed_layers = {int(layer) for layer in self.vit_layers}
+        selected_features: dict[int, torch.Tensor] = {}
+        use_checkpoint = bool(self.gradient_checkpointing and self.training and torch.is_grad_enabled())
+        for layer_idx, block in enumerate(self.image_vit.transformer.resblocks):
+            if use_checkpoint:
+                x = torch.utils.checkpoint.checkpoint(block, x, use_reentrant=False)
+            else:
+                x = block(x)
+            if layer_idx in needed_layers:
+                selected_features[layer_idx] = x
+
+        missing = needed_layers - set(selected_features)
+        if missing:
+            raise RuntimeError(
+                f"MolmoAct2 vision backbone did not produce requested layers: {sorted(missing)}."
+            )
+
+        image_features = torch.cat([selected_features[int(layer)] for layer in self.vit_layers], dim=-1)
+
+        if self.num_prefix_tokens > 0:
+            image_features = image_features[:, 1:]
+        image_features = image_features.view(batch_size, num_crops, num_patches, -1)
+        return image_features
+
+    @property
+    def dtype(self) -> torch.dtype:
+        return self.image_vit.patch_embedding.weight.dtype
+
+    @property
+    def device(self) -> torch.device:
+        return self.image_vit.patch_embedding.weight.device
+
+    def forward(
+        self,
+        images: torch.Tensor,
+        pooled_patches_idx: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor | None]:
+        # image_features: (batch_size, num_crops(=num_image), num_patch, n_layers * image_emb_dim)
+        batch_size, num_image = images.shape[:2]
+        images = images.to(device=self.device)
+        if images.dtype == torch.uint8:
+            images = images.to(dtype=torch.float32) / 255.0
+            images = images * 2.0 - 1.0
+        elif torch.is_floating_point(images):
+            # Native MolmoAct2 eval keeps resized SigLIP pixels as uint8 and normalizes
+            # on device. Canonicalize HF processor floats to that exact grid.
+            images = torch.round(((images.to(dtype=torch.float32) + 1.0) * 0.5) * 255.0)
+            images = torch.clamp(images, 0.0, 255.0) / 255.0
+            images = images * 2.0 - 1.0
+        images = images.to(dtype=self.dtype)
+        image_features = self.encode_image(images)
+
+        image_features = self.image_feature_dropout(image_features)
+        dim = image_features.shape[-1]
+        valid = pooled_patches_idx >= 0
+        valid_token = torch.any(valid, -1)
+
+        # Use `pooled_patches_idx` to arrange the features for image pooling
+        batch_idx = torch.arange(
+            pooled_patches_idx.shape[0],
+            dtype=torch.long,
+            device=pooled_patches_idx.device,
+        )
+        batch_idx = torch.tile(
+            batch_idx.view(batch_size, 1, 1),
+            [1, pooled_patches_idx.shape[1], pooled_patches_idx.shape[2]],
+        )
+
+        # Now [batch, num_high_res_features, pool_dim, dim]
+        to_pool = image_features.reshape(batch_size, -1, dim)[batch_idx, torch.clip(pooled_patches_idx, 0)]
+        to_pool = to_pool * valid.to(self.dtype)[:, :, :, None]
+        to_pool = to_pool.reshape([-1, pooled_patches_idx.shape[-1], dim])
+        if self.adapter_config.pooling_attention_mask:
+            attn_mask = valid.reshape([-1, 1, 1, valid.shape[-1]])
+            denom = valid.view(-1, to_pool.shape[-2]).float().sum(-1)
+            denom = torch.where(denom == 0, 1, denom)
+            query = to_pool.sum(-2, keepdim=True) / denom[:, None, None].to(to_pool.dtype)
+        else:
+            attn_mask = None
+            query = to_pool.mean(-2, keepdim=True)
+        pooled_features = self.image_pooling_2d(query, to_pool, attn_mask=attn_mask)
+        pooled_features = pooled_features.reshape([batch_size, -1, pooled_features.shape[-1]])
+
+        # MLP layer to map the feature.
+        pooled_features = self.image_projector(pooled_features)
+        return pooled_features.view(-1, pooled_features.shape[-1])[valid_token.flatten()]
+
+
+# Copied from transformers.models.llama.modeling_llama.rotate_half
+
+
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
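+    # Worked example with illustrative values (not from the original source): if the last
+    # dimension of x is [1, 2, 3, 4], then x1 = [1, 2] and x2 = [3, 4], and the function
+    # returns [-3, -4, 1, 2].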
+    return torch.cat((-x2, x1), dim=-1)
+
+
+# Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
+def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+    """Applies Rotary Position Embedding to the query and key tensors.
+
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        position_ids (`torch.Tensor`, *optional*):
+            Deprecated and unused.
+        unsqueeze_dim (`int`, *optional*, defaults to 1):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+
+
+class MolmoAct2RotaryEmbedding(nn.Module):
+    inv_freq: torch.Tensor  # fix linting for `register_buffer`
+
+    def __init__(
+        self,
+        config: MolmoAct2TextConfig,
+        device: str | torch.device | None = None,
+        rope_type: str | None = None,
+    ):
+        super().__init__()
+        if rope_type is not None:
+            self.rope_type = rope_type
+        elif hasattr(config, "rope_scaling") and isinstance(config.rope_scaling, dict):
+            # BC: "rope_type" was originally "type"
+            self.rope_type = config.rope_scaling.get("rope_type", config.rope_scaling.get("type"))
+        else:
+            self.rope_type = "default"
+        self.max_seq_len_cached = config.max_position_embeddings
+        self.original_max_seq_len = config.max_position_embeddings
+
+        self.config = config
+        if self.rope_type == "default":
+            self.rope_init_fn = self._default_rope_init
+        else:
+            self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]
+
+        inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+        self.register_buffer("inv_freq", inv_freq, persistent=True)
+        self.original_inv_freq = self.inv_freq
+        self.register_buffer("_pos_sin_cache", torch.empty(0), persistent=False)
+        self.register_buffer("_pos_cos_cache", torch.empty(0), persistent=False)
+
+    @staticmethod
+    def _default_rope_init(
+        config: MolmoAct2TextConfig, device: str | torch.device | None = None, **_
+    ) -> tuple[torch.Tensor, float]:
+        inv_freq = 1.0 / (
+            config.rope_theta
+            ** (torch.arange(0, config.head_dim, 2, dtype=torch.float32, device=device) / config.head_dim)
+        )
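+        # Illustrative values (assumed here, not taken from any config): head_dim=4 and
+        # rope_theta=10000 give exponents [0.0, 0.5], so inv_freq = [1.0, 0.01]. The first
+        # entry is always 1.0 because its exponent is 0.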
+        return inv_freq, 1.0
+
+    def _target_cache_seq_len(self, x: torch.Tensor, position_ids: torch.Tensor | None) -> int:
+        if self.config.max_position_embeddings:
+            return int(self.config.max_position_embeddings)
+        if position_ids is not None:
+            return int(position_ids.max().item()) + 1
+        return int(x.shape[-2])
+
+    def _rope_cache_ready(self, device: torch.device, seq_len: int) -> bool:
+        return (
+            self._pos_sin_cache.numel() > 0
+            and self._pos_sin_cache.device == device
+            and self._pos_cos_cache.device == device
+            and self._pos_sin_cache.shape[-2] >= seq_len
+            and self._pos_cos_cache.shape[-2] >= seq_len
+        )
+
+    def _refresh_inv_freq_if_needed(self, device: torch.device) -> None:
+        device = torch.device(device)
+        expected = int(self.config.head_dim) // 2
+        needs_refresh = (
+            self.inv_freq is None
+            or self._pos_sin_cache.numel() == 0
+            or self.inv_freq.device.type == "meta"
+            or self.inv_freq.device != device
+            or self.inv_freq.numel() != expected
+        )
+        if not needs_refresh:
+            inv_freq_cpu = self.inv_freq.detach()
+            needs_refresh = (
+                not bool(torch.isfinite(inv_freq_cpu).all().item())
+                or bool((inv_freq_cpu <= 0).any().item())
+                or not bool(torch.isclose(inv_freq_cpu[0].cpu(), torch.tensor(1.0)).item())
+            )
+        if needs_refresh:
+            inv_freq, self.attention_scaling = self.rope_init_fn(self.config, device)
+            self.register_buffer("inv_freq", inv_freq, persistent=True)
+            self.original_inv_freq = self.inv_freq
+            self._pos_sin_cache = torch.empty(0, device=device)
+            self._pos_cos_cache = torch.empty(0, device=device)
+
+    def _build_rope_cache(self, device: torch.device, seq_len: int) -> None:
+        device_type = device.type if device.type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            seq = torch.arange(seq_len, device=device, dtype=torch.float)
+            freqs = torch.einsum("i,j->ij", seq, self.inv_freq.to(device=device, dtype=torch.float))
+            emb = torch.cat((freqs, freqs), dim=-1)
+            self._pos_sin_cache = emb.sin()[None, None, :, :] * self.attention_scaling
+            self._pos_cos_cache = emb.cos()[None, None, :, :] * self.attention_scaling
+
+    @torch.no_grad()
+    def prepare_rope_cache(
+        self,
+        *,
+        device: str | torch.device,
+        max_seq_len: int | None = None,
+    ) -> None:
+        if self.rope_type != "default":
+            return
+        device = torch.device(device)
+        seq_len = int(max_seq_len or self.config.max_position_embeddings or 0)
+        if seq_len <= 0:
+            raise ValueError("RoPE cache preparation requires a positive max sequence length.")
+        if self._rope_cache_ready(device, seq_len):
+            return
+        self._refresh_inv_freq_if_needed(device)
+        self._build_rope_cache(device, seq_len)
+
+    def _select_rope_cache(
+        self,
+        x: torch.Tensor,
+        position_ids: torch.Tensor | None,
+        seq_len: int,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        pos_sin = self._pos_sin_cache[:, :, :seq_len, :]
+        pos_cos = self._pos_cos_cache[:, :, :seq_len, :]
+        if position_ids is None:
+            sin = pos_sin[0, 0, : x.shape[-2], :]
+            cos = pos_cos[0, 0, : x.shape[-2], :]
+        else:
+            sin = pos_sin[0, 0][position_ids].view(position_ids.shape + (pos_sin.shape[-1],))
+            cos = pos_cos[0, 0][position_ids].view(position_ids.shape + (pos_cos.shape[-1],))
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+    @torch.no_grad()
+    @dynamic_rope_update  # power user: used with advanced RoPE types (e.g. dynamic rope)
+    def forward(self, x, position_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
+        seq_len = self._target_cache_seq_len(x, position_ids)
+        if not self._rope_cache_ready(x.device, seq_len):
+            self._refresh_inv_freq_if_needed(x.device)
+            self._build_rope_cache(x.device, seq_len)
+        return self._select_rope_cache(x, position_ids, seq_len)
+
+
+class MolmoAct2RMSNorm(nn.Module):
+    def __init__(
+        self,
+        size: int,
+        eps: float = 1e-6,
+        device: str | torch.device | None = None,
+    ):
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(size, device=device))
+        self.eps = eps
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
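+        # RMS normalization: x / sqrt(mean(x^2) + eps), computed in float32. Illustrative
+        # values (not from the original source): for x = [3, 4], mean(x^2) = 12.5, so the
+        # normalized vector is roughly [0.8485, 1.1314] before the learned weight is applied.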
+        with torch.autocast(enabled=False, device_type=x.device.type):
+            og_dtype = x.dtype
+            x = x.to(torch.float32)
+            variance = x.pow(2).mean(-1, keepdim=True)
+            x = x * torch.rsqrt(variance + self.eps)
+            x = x.to(og_dtype)
+
+        return self.weight * x
+
+    def extra_repr(self):
+        return f"{tuple(self.weight.shape)}, eps={self.eps}"
+
+
+# Copied from transformers.models.llama.modeling_llama.repeat_kv
+def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+    """
+    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
+    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
+    """
+    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
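+    # Layout sketch with illustrative sizes (not from the original source): for two KV heads
+    # [k0, k1] and n_rep=3, the result is ordered [k0, k0, k0, k1, k1, k1], i.e. each head is
+    # repeated contiguously rather than the whole block being tiled.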
+    if n_rep == 1:
+        return hidden_states
+    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
+def eager_attention_forward(
+    module: nn.Module,
+    query: torch.Tensor,
+    key: torch.Tensor,
+    value: torch.Tensor,
+    attention_mask: torch.Tensor | None,
+    scaling: float,
+    dropout: float = 0.0,
+    **kwargs,
+) -> tuple[torch.Tensor, torch.Tensor | None]:
+    key_states = repeat_kv(key, module.num_key_value_groups)
+    value_states = repeat_kv(value, module.num_key_value_groups)
+
+    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
+    if attention_mask is not None:
+        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+        attn_weights = attn_weights + causal_mask
+
+    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
+    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
+    attn_output = torch.matmul(attn_weights, value_states)
+    attn_output = attn_output.transpose(1, 2).contiguous()
+
+    return attn_output, attn_weights
+
+
+class MolmoAct2Attention(nn.Module):
+    """Multi-headed attention from 'Attention Is All You Need' paper"""
+
+    def __init__(self, config: MolmoAct2TextConfig, layer_idx: int) -> None:
+        super().__init__()
+        self.config = config
+        self.layer_idx = layer_idx
+        self.num_heads = config.num_attention_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.num_key_value_groups = config.num_attention_heads // config.num_key_value_heads
+        self.head_dim = config.head_dim
+        self.scaling = self.head_dim**-0.5
+        self.is_causal = True
+
+        self.fused_dims = (
+            config.num_attention_heads * config.head_dim,
+            config.head_dim * config.num_key_value_heads,
+            config.head_dim * config.num_key_value_heads,
+        )
+        self.att_proj = nn.Linear(
+            config.hidden_size,
+            sum(self.fused_dims),
+            bias=config.qkv_bias,
+        )
+
+        # Layer norms.
+        self.k_norm: MolmoAct2RMSNorm | None = None
+        self.q_norm: MolmoAct2RMSNorm | None = None
+        self.qk_norm_type: str | None = None
+        if config.use_qk_norm:
+            k_norm_size = (
+                config.head_dim
+                if config.qk_norm_type == "qwen3"
+                else config.num_key_value_heads * config.head_dim
+            )
+            self.k_norm = MolmoAct2RMSNorm(k_norm_size, eps=config.layer_norm_eps)
+            q_norm_size = (
+                config.head_dim
+                if config.qk_norm_type == "qwen3"
+                else config.num_attention_heads * config.head_dim
+            )
+            self.q_norm = MolmoAct2RMSNorm(q_norm_size, eps=config.layer_norm_eps)
+            self.qk_norm_type = config.qk_norm_type
+
+        self.attention_dropout = config.attention_dropout
+
+        self.attn_out = nn.Linear(
+            config.head_dim * config.num_attention_heads,
+            config.hidden_size,
+            bias=False,
+        )
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor],
+        attention_mask: torch.Tensor | None,
+        past_key_values: Cache | None = None,
+        cache_position: torch.LongTensor | None = None,
+        **kwargs: Unpack[FlashAttentionKwargs],
+    ) -> tuple[torch.Tensor, torch.Tensor | None, tuple[torch.Tensor] | None]:
+        collect_layer_kv_states = bool(kwargs.pop("collect_layer_kv_states", False))
+        input_shape = hidden_states.shape[:-1]
+        hidden_shape = (*input_shape, -1, self.head_dim)
+
+        qkv = self.att_proj(hidden_states)
+        query_states, key_states, value_states = qkv.split(self.fused_dims, dim=-1)
+        value_states = value_states.view(hidden_shape)
+
+        # Optionally apply layer norm to keys and queries.
+        if self.q_norm is not None and self.k_norm is not None and self.qk_norm_type != "qwen3":
+            query_states = self.q_norm(query_states)
+            key_states = self.k_norm(key_states)
+
+        query_states = query_states.view(hidden_shape)
+        key_states = key_states.view(hidden_shape)
+        if self.q_norm is not None and self.k_norm is not None and self.qk_norm_type == "qwen3":
+            query_states = self.q_norm(query_states)
+            key_states = self.k_norm(key_states)
+        query_states = query_states.transpose(1, 2)
+        key_states = key_states.transpose(1, 2)
+        value_states = value_states.transpose(1, 2)
+
+        cos, sin = position_embeddings
+        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+        if past_key_values is not None:
+            # sin and cos are specific to RoPE models; cache_position needed for the static cache
+            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+            key_states, value_states = past_key_values.update(
+                key_states, value_states, self.layer_idx, cache_kwargs
+            )
+
+        collected_key_states = key_states
+        collected_value_states = value_states
+
+        dropout_p = 0.0 if not self.training else self.attention_dropout
+        if self.config._attn_implementation == "sdpa" and (
+            attention_mask is None or torch.is_tensor(attention_mask)
+        ):
+            key_states = repeat_kv(key_states, self.num_key_value_groups)
+            value_states = repeat_kv(value_states, self.num_key_value_groups)
+            attn_output = F.scaled_dot_product_attention(
+                query_states,
+                key_states,
+                value_states,
+                attn_mask=attention_mask,
+                dropout_p=dropout_p,
+                is_causal=attention_mask is None,
+            )
+            attn_output = attn_output.transpose(1, 2).contiguous()
+            attn_weights = None
+        else:
+            attention_interface: Callable = eager_attention_forward
+            if self.config._attn_implementation != "eager":
+                attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+
+            attn_output, attn_weights = attention_interface(
+                self,
+                query_states,
+                key_states,
+                value_states,
+                attention_mask,
+                dropout=dropout_p,
+                scaling=self.scaling,
+                **kwargs,
+            )
+
+        attn_output = attn_output.reshape(*input_shape, -1).contiguous()
+        attn_output = self.attn_out(attn_output)
+        if collect_layer_kv_states:
+            return attn_output, attn_weights, collected_key_states, collected_value_states
+        return attn_output, attn_weights
+
+
+class LanguageModelMLP(nn.Module):
+    def __init__(
+        self,
+        input_dim: int,
+        intermediate_size: int,
+        hidden_act: str,
+        device: str | torch.device | None = None,
+    ):
+        super().__init__()
+        self.ff_proj = nn.Linear(input_dim, intermediate_size * 2, bias=False, device=device)
+        self.ff_out = nn.Linear(intermediate_size, input_dim, bias=False, device=device)
+        self.act = ACT2FN[hidden_act]
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
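+        # Gated MLP: ff_proj emits 2 * intermediate_size features, which chunk() splits into a
+        # value half (first) and a gate half (second). For an illustrative intermediate_size of
+        # 4 (an assumed value), ff_proj outputs width 8 and x and gate each have width 4.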
+        x = self.ff_proj(x)
+        x, gate = x.chunk(2, dim=-1)
+        x = self.act(gate) * x
+        x = self.ff_out(x)
+        return x
+
+
+class MolmoAct2DecoderLayer(GradientCheckpointingLayer):
+    def __init__(
+        self,
+        config: MolmoAct2TextConfig,
+        layer_idx: int | None = None,
+        device: str | torch.device | None = None,
+    ):
+        super().__init__()
+        self.config = config
+
+        self.self_attn = MolmoAct2Attention(config, layer_idx)
+        self.attn_norm = MolmoAct2RMSNorm(config.hidden_size, eps=config.layer_norm_eps, device=device)
+        self.dropout = nn.Dropout(config.residual_dropout)
+        self.mlp = LanguageModelMLP(
+            config.hidden_size,
+            config.intermediate_size,
+            config.hidden_act,
+            device=device,
+        )
+        self.ff_norm = MolmoAct2RMSNorm(config.hidden_size, eps=config.layer_norm_eps, device=device)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor],
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_values: Cache | None = None,
+        output_attentions: bool | None = False,
+        use_cache: bool | None = False,
+        cache_position: torch.LongTensor | None = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> tuple[torch.FloatTensor, tuple[torch.FloatTensor, torch.FloatTensor] | None]:
+        collect_layer_kv_states = bool(kwargs.pop("collect_layer_kv_states", False))
+
+        residual = hidden_states
+        hidden_states = self.attn_norm(hidden_states)
+
+        # Self Attention
+        attention_outputs = self.self_attn(
+            hidden_states=hidden_states,
+            position_embeddings=position_embeddings,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            collect_layer_kv_states=collect_layer_kv_states,
+            **kwargs,
+        )
+        hidden_states = attention_outputs[0]
+        self_attn_weights = attention_outputs[1]
+
+        hidden_states = residual + self.dropout(hidden_states)
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.ff_norm(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+
+        hidden_states = residual + self.dropout(hidden_states)
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (self_attn_weights,)
+        if collect_layer_kv_states:
+            outputs += (attention_outputs[2], attention_outputs[3])
+
+        return outputs
+
+
+class MolmoAct2PostNormDecoderLayer(MolmoAct2DecoderLayer):
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        position_embeddings: tuple[torch.Tensor, torch.Tensor],
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_values: Cache | None = None,
+        output_attentions: bool | None = False,
+        use_cache: bool | None = False,
+        cache_position: torch.LongTensor | None = None,
+        **kwargs,
+    ) -> tuple[torch.FloatTensor, tuple[torch.FloatTensor, torch.FloatTensor] | None]:
+        collect_layer_kv_states = bool(kwargs.pop("collect_layer_kv_states", False))
+
+        residual = hidden_states
+
+        # Self Attention
+        attention_outputs = self.self_attn(
+            hidden_states=hidden_states,
+            position_embeddings=position_embeddings,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            output_attentions=output_attentions,
+            use_cache=use_cache,
+            cache_position=cache_position,
+            collect_layer_kv_states=collect_layer_kv_states,
+            **kwargs,
+        )
+        hidden_states = attention_outputs[0]
+        self_attn_weights = attention_outputs[1]
+        hidden_states = self.attn_norm(hidden_states)
+
+        hidden_states = residual + self.dropout(hidden_states)
+
+        # Fully Connected
+        residual = hidden_states
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = self.ff_norm(hidden_states)
+
+        hidden_states = residual + self.dropout(hidden_states)
+
+        outputs = (hidden_states,)
+
+        if output_attentions:
+            outputs += (self_attn_weights,)
+        if collect_layer_kv_states:
+            outputs += (attention_outputs[2], attention_outputs[3])
+
+        return outputs
+
+
+class MolmoAct2Embedding(nn.Module):
+    def __init__(
+        self,
+        num_embeddings: int,
+        num_new_embeddings: int,
+        features: int,
+        device: str | torch.device | None = None,
+    ):
+        super().__init__()
+        self.embedding = nn.Parameter(
+            torch.zeros(num_embeddings, features, device=device),
+        )
+        self.new_embedding = nn.Parameter(
+            torch.zeros(num_new_embeddings, features, device=device),
+        )
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return F.embedding(x, torch.cat([self.embedding, self.new_embedding], dim=0))
+
+
+class MolmoAct2PreTrainedModel(PreTrainedModel):
+    config: MolmoAct2Config
+    base_model_prefix = "model"
+    supports_gradient_checkpointing = True
+    _no_split_modules = [
+        "MolmoAct2DecoderLayer",
+        "MolmoAct2PostNormDecoderLayer",
+        "MolmoAct2VisionBlock",
+        "ViTMultiHeadDotProductAttention",
+    ]
+    _skip_keys_device_placement = "past_key_values"
+    _supports_flash_attn = True
+    _supports_sdpa = True
+
+    _can_compile_fullgraph = True
+    _supports_attention_backend = True
+    _can_record_outputs = {
+        "hidden_states": MolmoAct2DecoderLayer,
+        "attentions": MolmoAct2Attention,
+    }
+
+    def _init_weights(self, module):
+        std = self.config.initializer_range
+        if isinstance(module, (nn.Linear,)):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, MolmoAct2Embedding):
+            module.embedding.data.normal_(mean=0.0, std=std)
+            module.new_embedding.data.normal_(mean=0.0, std=std)
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=std)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, MolmoAct2RMSNorm):
+            module.weight.data.fill_(1.0)
+        elif isinstance(module, nn.LayerNorm):
+            module.weight.data.fill_(1.0)
+            if module.bias is not None:
+                module.bias.data.zero_()
+
+
+class MolmoAct2TextModel(MolmoAct2PreTrainedModel):
+    config: MolmoAct2TextConfig
+    _no_split_modules = ["MolmoAct2DecoderLayer", "MolmoAct2PostNormDecoderLayer"]
+
+    def __init__(self, config: MolmoAct2TextConfig):
+        super().__init__(config)
+        if config.additional_vocab_size is not None:
+            self.wte = MolmoAct2Embedding(
+                config.vocab_size,
+                config.additional_vocab_size,
+                config.hidden_size,
+            )
+        else:
+            self.wte = nn.Embedding(config.vocab_size, config.hidden_size)
+        self.emb_drop = nn.Dropout(config.embedding_dropout)
+        decoder_layer = MolmoAct2PostNormDecoderLayer if config.norm_after else MolmoAct2DecoderLayer
+        self.blocks = nn.ModuleList(
+            [decoder_layer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
+        )
+        self.ln_f = MolmoAct2RMSNorm(config.hidden_size, eps=config.layer_norm_eps)
+        if config.rope_scaling_layers is not None:
+            self.rotary_embs = nn.ModuleDict(
+                {
+                    "default": MolmoAct2RotaryEmbedding(config, rope_type="default"),
+                    "scaling": MolmoAct2RotaryEmbedding(config),
+                }
+            )
+        else:
+            self.rotary_emb = MolmoAct2RotaryEmbedding(config)
+        self.gradient_checkpointing = False
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    @torch.no_grad()
+    def prepare_rope_cache(
+        self,
+        *,
+        device: str | torch.device,
+        max_seq_len: int | None = None,
+    ) -> None:
+        if self.config.rope_scaling_layers is not None:
+            for rotary_emb in self.rotary_embs.values():
+                rotary_emb.prepare_rope_cache(device=device, max_seq_len=max_seq_len)
+            return
+        self.rotary_emb.prepare_rope_cache(device=device, max_seq_len=max_seq_len)
+
+    def get_input_embeddings(self) -> torch.nn.Module:
+        return self.wte
+
+    def set_input_embeddings(self, value: torch.nn.Module) -> None:
+        self.wte = value
+
+    @can_return_tuple
+    def forward(
+        self,
+        input_ids: torch.LongTensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_values: Cache | None = None,
+        inputs_embeds: torch.FloatTensor | None = None,
+        use_cache: bool | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        cache_position: torch.LongTensor | None = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> BaseModelOutputWithPast:
+        output_attentions = (
+            output_attentions if output_attentions is not None else self.config.output_attentions
+        )
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+        collect_layer_kv_states = bool(kwargs.pop("collect_layer_kv_states", False))
+        if collect_layer_kv_states and past_key_values is not None:
+            raise ValueError("collect_layer_kv_states cannot be used with past_key_values.")
+        if collect_layer_kv_states:
+            use_cache = False
+
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+        if self.gradient_checkpointing and self.training and use_cache:
+            logger.warning_once(
+                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
+            )
+            use_cache = False
+
+        if inputs_embeds is None:
+            input_ids = input_ids * (input_ids != -1).to(input_ids.dtype)
+            inputs_embeds = self.wte(input_ids)
+
+        # torch.jit.trace() doesn't support cache objects in the output
+        if use_cache and past_key_values is None and not torch.jit.is_tracing():
+            past_key_values = DynamicCache(config=self.config)
+
+        if cache_position is None:
+            past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
+            cache_position = torch.arange(
+                past_seen_tokens,
+                past_seen_tokens + inputs_embeds.shape[1],
+                device=inputs_embeds.device,
+            )
+
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+
+        # It may already have been prepared by e.g. `generate`
+        if torch.is_tensor(attention_mask) and attention_mask.ndim == 4:
+            causal_mask_mapping = attention_mask
+        elif not isinstance(causal_mask_mapping := attention_mask, dict):
+            # Prepare mask arguments
+            mask_kwargs = {
+                "config": self.config,
+                "input_embeds": inputs_embeds,
+                "attention_mask": attention_mask,
+                "cache_position": cache_position,
+                "past_key_values": past_key_values,
+                "position_ids": position_ids,
+            }
+
+            # Create the mask
+            causal_mask_mapping = create_causal_mask(**mask_kwargs)
+
+        hidden_states = inputs_embeds
+
+        # create position embeddings to be shared across the decoder layers
+        if self.config.rope_scaling_layers is not None:
+            position_embeddings_mapping = {
+                "default": self.rotary_embs["default"](hidden_states, position_ids),
+                "scaling": self.rotary_embs["scaling"](hidden_states, position_ids),
+            }
+        else:
+            position_embeddings = self.rotary_emb(hidden_states, position_ids)
+
+        # decoder layers
+        all_hidden_states = () if output_hidden_states else None
+        all_self_attns = () if output_attentions else None
+        collected_kv_states = [] if collect_layer_kv_states else None
+
+        for layer_idx, decoder_block in enumerate(self.blocks[: self.config.num_hidden_layers]):
+            if output_hidden_states:
+                all_hidden_states += (hidden_states,)
+
+            if self.config.rope_scaling_layers is not None:
+                position_embeddings_i = (
+                    position_embeddings_mapping["scaling"]
+                    if layer_idx in self.config.rope_scaling_layers
+                    else position_embeddings_mapping["default"]
+                )
+            else:
+                position_embeddings_i = position_embeddings
+
+            layer_outputs = decoder_block(
+                hidden_states,
+                attention_mask=causal_mask_mapping,
+                position_ids=position_ids,
+                past_key_values=past_key_values,
+                output_attentions=output_attentions,
+                use_cache=use_cache,
+                cache_position=cache_position,
+                position_embeddings=position_embeddings_i,
+                collect_layer_kv_states=collect_layer_kv_states,
+                **kwargs,
+            )
+
+            hidden_states = layer_outputs[0]
+
+            output_idx = 1
+            if output_attentions:
+                all_self_attns += (layer_outputs[output_idx],)
+                output_idx += 1
+            if collect_layer_kv_states:
+                collected_kv_states.append((layer_outputs[output_idx], layer_outputs[output_idx + 1]))
+
+        hidden_states = self.ln_f(hidden_states)
+
+        # add hidden states from the last decoder layer
+        if output_hidden_states:
+            all_hidden_states += (hidden_states,)
+
+        return BaseModelOutputWithPast(
+            last_hidden_state=hidden_states,
+            past_key_values=tuple(collected_kv_states) if collect_layer_kv_states else past_key_values,
+            hidden_states=all_hidden_states,
+            attentions=all_self_attns,
+        )
+
+
+# Adapted from transformers.models.gemma3.modeling_gemma3
+def token_type_ids_mask_function(
+    token_type_ids: torch.Tensor | None = None,
+) -> Callable | None:
+    """
+    Return a mask function that enables bidirectional attention between image tokens (positions where
+    `token_type_ids == 1`), or `None` when `token_type_ids` is not provided.
+    """
+    # Do not return an additional mask in this case
+    if token_type_ids is None:
+        return None
+
+    def inner_mask(batch_idx: int, head_idx: int, q_idx: int, kv_idx: int) -> bool:
+        # If it's 1 for both query and key/value, we are in an image block
+        # NOTE: static cache shape goes beyond input seq length, while token_type_ids.shape[1] == input seq length
+        # Since vmap doesn't support `if` statements, we work around it with `torch.where`
+        safe_idx = torch.where(kv_idx < token_type_ids.shape[1], kv_idx, 0)
+        token_type_ids_at_kv_idx = token_type_ids[batch_idx, safe_idx]
+        token_type_ids_at_kv_idx = torch.where(kv_idx < token_type_ids.shape[1], token_type_ids_at_kv_idx, 0)
+
+        is_image_block = (token_type_ids[batch_idx, q_idx] == 1) & (token_type_ids_at_kv_idx == 1)
+
+        # This is bidirectional attention whenever we are dealing with image tokens
+        return is_image_block
+
+    return inner_mask
+
+
+class MolmoAct2Model(MolmoAct2PreTrainedModel):
+    base_model_prefix = ""
+    _checkpoint_conversion_mapping = {}
+    # Reference: fix gemma3 grad acc #37208
+    accepts_loss_kwargs = False
+    config: MolmoAct2Config
+
+    def __init__(self, config: MolmoAct2Config):
+        super().__init__(config)
+        self.transformer: MolmoAct2TextModel = MolmoAct2TextModel(config.text_config)
+        self.vision_backbone: MolmoAct2VisionBackbone | None = None
+        if config.vit_config is not None and config.adapter_config is not None:
+            self.vision_backbone = MolmoAct2VisionBackbone(config.vit_config, config.adapter_config)
+        llm_kv_dim = config.text_config.num_key_value_heads * config.text_config.head_dim
+        if config.add_action_expert:
+            self.action_expert = ActionExpert(
+                config.action_expert_config,
+                llm_dim=config.hidden_size,
+                llm_kv_dim=llm_kv_dim,
+                llm_num_layers=config.num_hidden_layers,
+            )
+        else:
+            self.action_expert = None
+        if config.add_action_expert and config.action_expert_depth_gate:
+            if config.action_expert_depth_gate_per_layer:
+                self.action_expert_depth_gate = nn.ModuleList(
+                    nn.Linear(llm_kv_dim, 1) for _ in range(config.action_expert_config.num_layers)
+                )
+            else:
+                self.action_expert_depth_gate = nn.Linear(llm_kv_dim, 1)
+            self.reset_action_expert_depth_gate_parameters()
+        else:
+            self.action_expert_depth_gate = None
+        self._depth_gate_token_ids = self._resolve_depth_gate_token_ids()
+        self.action_cuda_graph_manager: ActionCudaGraphManager | None = None
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> torch.nn.Module:
+        return self.transformer.wte
+
+    def set_input_embeddings(self, value: torch.nn.Module) -> None:
+        self.transformer.wte = value
+
+    def set_decoder(self, decoder):
+        self.transformer = decoder
+
+    def get_decoder(self):
+        return self.transformer
+
+    @property
+    def device(self) -> torch.device:
+        return self.transformer.ln_f.weight.device
+
+    def reset_action_expert_depth_gate_parameters(self) -> None:
+        if self.action_expert_depth_gate is None:
+            return
+        gates = (
+            self.action_expert_depth_gate
+            if isinstance(self.action_expert_depth_gate, nn.ModuleList)
+            else [self.action_expert_depth_gate]
+        )
+        for gate in gates:
+            nn.init.zeros_(gate.weight)
+            nn.init.constant_(gate.bias, float(self.config.action_expert_depth_gate_init_bias))
+
+    def _resolve_depth_gate_token_ids(self) -> tuple[int, ...]:
+        if not self.config.action_expert_depth_gate:
+            return ()
+        token_ids = []
+        for token_id in (
+            self.config.depth_output_token_id,
+            self.config.depth_start_token_id,
+            self.config.depth_end_token_id,
+        ):
+            if token_id is not None:
+                token_ids.append(int(token_id))
+        if self.config.depth_token_start_id is not None and int(self.config.num_depth_tokens or 0) > 0:
+            start = int(self.config.depth_token_start_id)
+            token_ids.extend(range(start, start + int(self.config.num_depth_tokens)))
+        return tuple(dict.fromkeys(token_ids))
+
+    def _require_action_expert(self) -> ActionExpert:
+        if self.action_expert is None:
+            raise RuntimeError("This MolmoAct2 checkpoint does not include an action expert.")
+        return self.action_expert
+
+    def _cache_to_sequence(self, cache: torch.Tensor) -> torch.Tensor:
+        if cache.dim() != 4:
+            raise ValueError(f"Expected KV cache tensor with 4 dims, got shape {tuple(cache.shape)}")
+        head_candidates = {
+            self.config.text_config.num_key_value_heads,
+            self.config.text_config.num_attention_heads,
+        }
+        if cache.shape[1] in head_candidates:
+            bsz, n_heads, seq_len, head_dim = cache.shape
+            return cache.permute(0, 2, 1, 3).reshape(bsz, seq_len, n_heads * head_dim)
+        if cache.shape[2] in head_candidates:
+            bsz, seq_len, n_heads, head_dim = cache.shape
+            return cache.reshape(bsz, seq_len, n_heads * head_dim)
+        if cache.shape[1] <= cache.shape[2]:
+            bsz, n_heads, seq_len, head_dim = cache.shape
+            return cache.permute(0, 2, 1, 3).reshape(bsz, seq_len, n_heads * head_dim)
+        bsz, seq_len, n_heads, head_dim = cache.shape
+        return cache.reshape(bsz, seq_len, n_heads * head_dim)
+
+    def _extract_kv_states(self, past_key_values: Cache) -> Sequence[tuple[torch.Tensor, torch.Tensor]]:
+        if past_key_values is None:
+            raise RuntimeError("Action generation requires past_key_values from the VLM forward pass.")
+        seq_len = _cache_seq_len_int(past_key_values)
+        kv_states = []
+        for key, value in _iter_cache_key_values(past_key_values):
+            if key is None or value is None:
+                continue
+            if key.shape[-2] > seq_len:
+                key = key[..., :seq_len, :]
+                value = value[..., :seq_len, :]
+            kv_states.append((self._cache_to_sequence(key), self._cache_to_sequence(value)))
+        if len(kv_states) != self.config.action_expert_config.num_layers:
+            raise RuntimeError(
+                f"Expected {self.config.action_expert_config.num_layers} KV layers, got {len(kv_states)}."
+            )
+        return kv_states
+
+    @staticmethod
+    def _mask_discrete_output_span(
+        row_ids: torch.Tensor,
+        row_mask: torch.Tensor,
+        start_id: int | None,
+        end_id: int | None,
+    ) -> None:
+        if start_id is None or end_id is None:
+            return
+        start_positions = (row_ids == start_id).nonzero(as_tuple=False).flatten().tolist()
+        if not start_positions:
+            return
+        end_positions = (row_ids == end_id).nonzero(as_tuple=False).flatten().tolist()
+        end_ptr = 0
+        for start_pos in start_positions:
+            while end_ptr < len(end_positions) and end_positions[end_ptr] < start_pos:
+                end_ptr += 1
+            if end_ptr >= len(end_positions):
+                row_mask[start_pos:] = False
+                break
+            end_pos = end_positions[end_ptr]
+            row_mask[start_pos : end_pos + 1] = False
+            end_ptr += 1
+
+    def _get_encoder_attention_mask(
+        self,
+        input_ids: torch.Tensor | None,
+        attention_mask: torch.Tensor | None,
+    ) -> torch.Tensor | None:
+        if attention_mask is not None:
+            mask = attention_mask.to(dtype=torch.bool).clone()
+        elif input_ids is not None:
+            mask = input_ids != -1
+        else:
+            return None
+        if self.config.action_mode != "both" or input_ids is None:
+            return mask
+        eos_id = getattr(self.config, "eos_token_id", None)
+        if eos_id is not None:
+            mask &= input_ids != int(eos_id)
+        for batch_idx in range(input_ids.shape[0]):
+            self._mask_discrete_output_span(
+                input_ids[batch_idx],
+                mask[batch_idx],
+                self.config.action_start_token_id,
+                self.config.action_end_token_id,
+            )
+        return mask
+
+    def _get_depth_token_mask(
+        self,
+        input_ids: torch.Tensor | None,
+        encoder_attention_mask: torch.Tensor | None,
+    ) -> torch.Tensor | None:
+        if not self.config.action_expert_depth_gate or input_ids is None or not self._depth_gate_token_ids:
+            return None
+        depth_token_ids = torch.as_tensor(
+            self._depth_gate_token_ids,
+            device=input_ids.device,
+            dtype=input_ids.dtype,
+        )
+        depth_mask = (input_ids.unsqueeze(-1) == depth_token_ids).any(dim=-1)
+        if encoder_attention_mask is not None:
+            depth_mask = depth_mask & encoder_attention_mask.to(device=input_ids.device, dtype=torch.bool)
+        return depth_mask
+
+    @staticmethod
+    def _depth_gate_from_source(
+        gate_head: nn.Linear,
+        *,
+        source: torch.Tensor,
+        depth_mask: torch.Tensor,
+        encoder_attention_mask: torch.Tensor | None,
+    ) -> torch.Tensor:
+        if source.ndim == 4:
+            source = source.reshape(source.shape[0], source.shape[1], -1)
+        if source.ndim != 3:
+            raise ValueError(f"Depth gate expected a 3D sequence tensor, got {tuple(source.shape)}.")
+        if encoder_attention_mask is not None:
+            valid_mask = encoder_attention_mask.to(device=source.device, dtype=torch.bool)
+        else:
+            valid_mask = torch.ones(depth_mask.shape, device=source.device, dtype=torch.bool)
+        depth_mask = depth_mask.to(device=source.device, dtype=torch.bool)
+        pool_mask = valid_mask & ~depth_mask
+        has_pool = pool_mask.any(dim=-1, keepdim=True)
+        pool_mask = torch.where(has_pool, pool_mask, valid_mask)
+        weights = pool_mask.to(dtype=source.dtype).unsqueeze(-1)
+        pooled = (source * weights).sum(dim=1) / weights.sum(dim=1).clamp_min(1.0)
+        gate_logits = gate_head(pooled.to(dtype=gate_head.weight.dtype))
+        return torch.sigmoid(gate_logits).to(dtype=source.dtype)
+
+    def _depth_gate_from_condition(
+        self,
+        *,
+        input_ids: torch.Tensor | None,
+        encoder_attention_mask: torch.Tensor | None,
+        layer_kv_states: Sequence[tuple[torch.Tensor, torch.Tensor]] | None,
+    ) -> tuple[torch.Tensor | Sequence[torch.Tensor] | None, torch.Tensor | None]:
+        gate_head = self.action_expert_depth_gate
+        if gate_head is None:
+            return None, None
+        depth_mask = self._get_depth_token_mask(input_ids, encoder_attention_mask)
+        if depth_mask is None or layer_kv_states is None:
+            return None, depth_mask
+        sources = [value for _, value in layer_kv_states]
+        if isinstance(gate_head, nn.ModuleList):
+            if len(gate_head) != len(sources):
+                raise ValueError(
+                    f"Depth gate layer count mismatch: gates={len(gate_head)}, sources={len(sources)}."
+                )
+            gates = [
+                self._depth_gate_from_source(
+                    gate,
+                    source=source,
+                    depth_mask=depth_mask,
+                    encoder_attention_mask=encoder_attention_mask,
+                )
+                for gate, source in zip(gate_head, sources)
+            ]
+            return gates, depth_mask
+        gate = self._depth_gate_from_source(
+            gate_head,
+            source=sources[-1],
+            depth_mask=depth_mask,
+            encoder_attention_mask=encoder_attention_mask,
+        )
+        return gate, depth_mask
+
+    @staticmethod
+    def _depth_gate_for_layer(
+        gate: torch.Tensor | Sequence[torch.Tensor],
+        layer_idx: int,
+        *,
+        num_layers: int,
+    ) -> torch.Tensor:
+        if isinstance(gate, torch.Tensor):
+            return gate
+        if len(gate) != num_layers:
+            raise ValueError(f"Depth gate layer count mismatch: gates={len(gate)}, layers={num_layers}.")
+        return gate[layer_idx]
+
+    def _apply_depth_gate_to_layer_kv_states(
+        self,
+        layer_kv_states: Sequence[tuple[torch.Tensor, torch.Tensor]] | None,
+        depth_mask: torch.Tensor | None,
+        gate: torch.Tensor | Sequence[torch.Tensor] | None,
+    ) -> Sequence[tuple[torch.Tensor, torch.Tensor]] | None:
+        if layer_kv_states is None or depth_mask is None or gate is None:
+            return layer_kv_states
+        gated_kv = []
+        for layer_idx, (key, value) in enumerate(layer_kv_states):
+            layer_gate = self._depth_gate_for_layer(gate, layer_idx, num_layers=len(layer_kv_states))
+            mask = depth_mask.to(device=key.device, dtype=torch.bool)
+            view_shape = [mask.shape[0], mask.shape[1]] + [1] * (key.ndim - 2)
+            scale = torch.ones(view_shape, device=key.device, dtype=key.dtype)
+            gate_view = layer_gate.to(device=key.device, dtype=key.dtype).view(
+                layer_gate.shape[0],
+                *([1] * (key.ndim - 1)),
+            )
+            scale = torch.where(mask.view(view_shape), gate_view, scale)
+            gated_kv.append((key * scale, value * scale))
+        return gated_kv
+
+    @staticmethod
+    def _action_dim_valid_mask(
+        target: torch.Tensor,
+        action_dim_is_pad: torch.Tensor | None,
+    ) -> torch.Tensor | None:
+        if action_dim_is_pad is None:
+            return None
+        mask = ~action_dim_is_pad.to(device=target.device, dtype=torch.bool)
+        if mask.ndim == 1:
+            mask = mask.unsqueeze(0)
+        if mask.shape[-1] != target.shape[-1]:
+            raise ValueError(
+                f"action_dim_is_pad width {mask.shape[-1]} does not match target width {target.shape[-1]}."
+            )
+        if mask.shape[0] == 1 and target.shape[0] != 1:
+            mask = mask.expand(target.shape[0], -1)
+        if mask.shape[0] != target.shape[0]:
+            raise ValueError(
+                f"action_dim_is_pad batch {mask.shape[0]} does not match target batch {target.shape[0]}."
+            )
+        while mask.ndim < target.ndim:
+            mask = mask.unsqueeze(1)
+        return mask
+
+    @classmethod
+    def _mask_action_dim_tensor(
+        cls,
+        tensor: torch.Tensor,
+        *,
+        action_dim_is_pad: torch.Tensor | None,
+        enabled: bool,
+    ) -> torch.Tensor:
+        if not enabled:
+            return tensor
+        valid_mask = cls._action_dim_valid_mask(tensor, action_dim_is_pad)
+        if valid_mask is None:
+            return tensor
+        return tensor.masked_fill(~valid_mask, 0)
+
+    def _run_action_flow_loop(self, inputs: _ActionFlowInputs, steps: int) -> torch.Tensor:
+        action_expert = self._require_action_expert()
+        dt = 1.0 / steps
+        trajectory = inputs.trajectory
+        action_dim_is_pad = inputs.action_dim_is_pad
+        mask_enabled = self.config.mask_action_dim_padding
+        for idx in range(steps):
+            velocity = action_expert.forward_with_context(
+                trajectory,
+                inputs.modulations[idx].conditioning,
+                context=inputs.context,
+                modulation=inputs.modulations[idx],
+            )
+            velocity = self._mask_action_dim_tensor(
+                velocity,
+                action_dim_is_pad=action_dim_is_pad,
+                enabled=mask_enabled,
+            )
+            trajectory = trajectory + dt * velocity
+            trajectory = self._mask_action_dim_tensor(
+                trajectory,
+                action_dim_is_pad=action_dim_is_pad,
+                enabled=mask_enabled,
+            )
+        return trajectory
+
+    def _resolve_action_horizon(self, action_horizon: int | None = None) -> int:
+        max_action_horizon = int(self.config.max_action_horizon or 1)
+        resolved = max_action_horizon if action_horizon is None else int(action_horizon)
+        if resolved < 1:
+            raise ValueError(f"action_horizon must be >= 1, got {resolved}.")
+        if resolved > max_action_horizon:
+            raise ValueError(
+                f"Requested action_horizon={resolved} exceeds checkpoint max_action_horizon={max_action_horizon}."
+            )
+        return resolved
+
+    @torch.no_grad()
+    def generate_actions_from_inputs(
+        self,
+        *,
+        input_ids: torch.LongTensor,
+        pixel_values: torch.Tensor | None = None,
+        image_token_pooling: torch.Tensor | None = None,
+        image_grids: torch.Tensor | None = None,
+        image_num_crops: torch.Tensor | None = None,
+        pixel_values_videos: torch.Tensor | None = None,
+        video_token_pooling: torch.Tensor | None = None,
+        video_grids: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        token_type_ids: torch.LongTensor | None = None,
+        states: torch.Tensor | None = None,
+        action_dim_is_pad: torch.Tensor | None = None,
+        action_horizon: int | None = None,
+        num_steps: int | None = None,
+        generator: torch.Generator | None = None,
+        encoder_kv_states: Sequence[tuple[torch.Tensor, torch.Tensor]] | None = None,
+        encoder_attention_mask: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        action_expert = self._require_action_expert()
+        if encoder_kv_states is None:
+            outputs = self(
+                input_ids=input_ids,
+                pixel_values=pixel_values,
+                image_token_pooling=image_token_pooling,
+                image_grids=image_grids,
+                image_num_crops=image_num_crops,
+                pixel_values_videos=pixel_values_videos,
+                video_token_pooling=video_token_pooling,
+                video_grids=video_grids,
+                attention_mask=attention_mask,
+                token_type_ids=token_type_ids,
+                use_cache=True,
+            )
+            encoder_kv_states = self._extract_kv_states(outputs.past_key_values)
+            encoder_attention_mask = self._get_encoder_attention_mask(input_ids, attention_mask)
+        elif encoder_attention_mask is None:
+            encoder_attention_mask = self._get_encoder_attention_mask(input_ids, attention_mask)
+
+        depth_gate, depth_mask = self._depth_gate_from_condition(
+            input_ids=input_ids,
+            encoder_attention_mask=encoder_attention_mask,
+            layer_kv_states=encoder_kv_states,
+        )
+        encoder_kv_states = self._apply_depth_gate_to_layer_kv_states(
+            encoder_kv_states,
+            depth_mask,
+            depth_gate,
+        )
+        steps = int(num_steps or self.config.flow_matching_num_steps)
+        if steps <= 0:
+            raise ValueError(f"num_steps must be >= 1, got {steps}.")
+        source_tensor = encoder_kv_states[0][0]
+        batch_size = source_tensor.shape[0]
+        device = source_tensor.device
+        action_horizon = self._resolve_action_horizon(action_horizon)
+        trajectory_dtype = action_expert.action_embed.weight.dtype
+        trajectory = torch.randn(
+            (batch_size, action_horizon, self.config.max_action_dim),
+            device=device,
+            dtype=trajectory_dtype,
+            generator=generator,
+        )
+        trajectory = self._mask_action_dim_tensor(
+            trajectory,
+            action_dim_is_pad=action_dim_is_pad,
+            enabled=self.config.mask_action_dim_padding,
+        )
+        action_context = action_expert.prepare_context(
+            encoder_kv_states=encoder_kv_states,
+            encoder_attention_mask=encoder_attention_mask,
+            state_embeddings=states,
+            batch_size=batch_size,
+            seq_len=trajectory.shape[1],
+            device=device,
+            dtype=trajectory.dtype,
+        )
+        flow_timesteps = [
+            torch.full((batch_size,), idx / steps, device=device, dtype=torch.float32) for idx in range(steps)
+        ]
+        modulation_cache = action_expert.get_or_prepare_modulation_cache(
+            flow_timesteps,
+            cache_key=(steps, batch_size, device, trajectory.dtype),
+        )
+        flow_inputs = _ActionFlowInputs(
+            trajectory=trajectory,
+            context=action_context,
+            modulations=modulation_cache,
+            action_dim_is_pad=action_dim_is_pad,
+        )
+        action_cuda_graph_manager = self.action_cuda_graph_manager
+        if action_cuda_graph_manager is not None and action_cuda_graph_manager.can_use_action_flow(
+            flow_inputs
+        ):
+            trajectory = action_cuda_graph_manager.run_action_flow(
+                flow_inputs, steps, self._run_action_flow_loop
+            )
+        else:
+            trajectory = self._run_action_flow_loop(flow_inputs, steps)
+        return trajectory
+
+    def build_batched_images(
+        self,
+        input_ids: torch.LongTensor,
+        pixel_values: torch.Tensor,
+        image_token_pooling: torch.Tensor,
+        image_grids: torch.Tensor,
+        image_num_crops: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        # 1) Count the number of images in each example
+        raw_counts = (input_ids == self.config.image_end_token_id).sum(1)  # [N]
+        total_images = int(image_grids.size(0))
+        total_end_tokens = int(raw_counts.sum().item())
+        if total_images <= 0:
+            counts = raw_counts.new_zeros(raw_counts.shape)
+        elif total_end_tokens == total_images:
+            counts = raw_counts
+        elif total_end_tokens == 2 * total_images:
+            counts = raw_counts // 2
+        else:
+            raise ValueError(
+                "Could not infer image counts from image end tokens: "
+                f"end_tokens={total_end_tokens}, image_grids={total_images}."
+            )
+        N = counts.size(0)
+        device = input_ids.device
+
+        # Total number of images in the batch
+        num_images = total_images
+
+        # Sanity check
+        assert image_grids.size(0) == num_images, (
+            f"Expected {num_images} image grids, but got {image_grids.size(0)}"
+        )
+        assert image_num_crops.size(0) == num_images, (
+            f"Expected {num_images} image num crops, but got {image_num_crops.size(0)}"
+        )
+
+        # 1-1) Compute per-image pooled patch count from image grids
+        with torch.no_grad():
+            first_prod = image_grids[:, :2].prod(dim=1)  # [num_images]
+            second_prod = image_grids[:, 2:].prod(dim=1)  # [num_images]
+            num_pooled_patches_per_image = (first_prod + second_prod).to(
+                image_num_crops.dtype
+            )  # [num_images]
+
+        # pixel_values: [n_crops, n_patches, pixels_per_patch]
+        n_crops, n_patches, pixels_per_patch = pixel_values.shape
+
+        # 2) Map each image index → example index
+        # Example: if counts = [2, 1, 3], then this becomes [0,0,1,2,2,2]
+        example_ids_for_image = torch.arange(N, device=device).repeat_interleave(counts)  # [num_images]
+        assert example_ids_for_image.numel() == num_images
+
+        # 2-1) Compute crops_per_example by summing per-image crop counts
+        crops_per_example = torch.zeros(N, dtype=image_num_crops.dtype, device=image_num_crops.device)
+        crops_per_example.index_add_(0, example_ids_for_image, image_num_crops)  # [N]
+
+        # 2-2) Per-image number of patches = (crops per image) * n_patches
+        patches_per_image = image_num_crops * n_patches  # [num_images]
+
+        # 2-3) Compute per-example per-image patch offsets
+        counts_list = counts.tolist()
+        index_offset_per_example_list = []
+        offset_img = 0
+        for c in counts_list:
+            per_img_patches = patches_per_image[offset_img : offset_img + c]  # [c]
+            # Offsets: [0, img0_total_patches, img0+img1_total_patches, ...]
+            index_offset = [0] + per_img_patches.cumsum(0).tolist()[:-1]
+            index_offset_per_example_list.append(index_offset)
+            offset_img += c
+
+        # 2-4) Compute num_pooled_patches_per_example
+        num_pooled_patches_per_example = torch.zeros(
+            N,
+            dtype=num_pooled_patches_per_image.dtype,
+            device=num_pooled_patches_per_image.device,
+        )
+        num_pooled_patches_per_example.index_add_(0, example_ids_for_image, num_pooled_patches_per_image)
+
+        # Sanity checks
+        total_crops = int(crops_per_example.sum().item())
+        assert total_crops == n_crops, f"Expected {total_crops} crops, but got {n_crops}"
+
+        total_num_pooled_patches = int(num_pooled_patches_per_example.sum().item())
+        assert total_num_pooled_patches == image_token_pooling.size(0), (
+            f"Expected {total_num_pooled_patches} pooled patches, but got {image_token_pooling.size(0)}"
+        )
+
+        # 3) Build images tensor filled with -1
+        M = int(crops_per_example.max().item())
+        images = torch.full(
+            (N, M, n_patches, pixels_per_patch),
+            fill_value=-1,
+            dtype=pixel_values.dtype,
+            device=pixel_values.device,
+        )
+
+        # 4) Fill images with per-example slices from pixel_values
+        offset_crop = 0
+        for i in range(N):
+            num = int(crops_per_example[i].item())
+            cur = pixel_values[offset_crop : offset_crop + num]  # [num, n_patches, pixels_per_patch]
+            images[i, :num] = cur
+            offset_crop += num
+
+        # Sanity check
+        assert offset_crop == n_crops
+
+        # 5) Build new_token_pooling tensor filled with -1
+        P = int(num_pooled_patches_per_example.max().item())
+        _, dim = image_token_pooling.shape
+        new_token_pooling = torch.full(
+            (N, P, dim),
+            fill_value=-1,
+            dtype=image_token_pooling.dtype,
+            device=image_token_pooling.device,
+        )
+
+        # 6) Fill new_token_pooling with per-example slices, adding per-image patch offsets
+        patch_offset = 0
+        img_offset = 0
+
+        for i, c in enumerate(counts_list):
+            num_patches = int(num_pooled_patches_per_example[i].item())
+
+            # Subsequence of pooled tokens belonging to this example
+            cur = image_token_pooling[patch_offset : patch_offset + num_patches].clone()  # [num_patches, dim]
+
+            index_offset_per_example = index_offset_per_example_list[i]  # length = c
+            per_img_pooled = num_pooled_patches_per_image[img_offset : img_offset + c]  # [c]
+
+            assert len(index_offset_per_example) == per_img_pooled.numel()
+
+            # Apply per-image offsets to the (ragged) subsequence
+            offset = 0
+            for j in range(c):
+                index_offset = int(index_offset_per_example[j])
+                n = int(per_img_pooled[j].item())
+                cur_slice = cur[offset : offset + n]
+
+                # Apply offset across all columns
+                cur[offset : offset + n] = torch.where(
+                    cur_slice >= 0,
+                    cur_slice + index_offset,
+                    cur_slice,
+                )
+                offset += n
+
+            new_token_pooling[i, :num_patches] = cur
+
+            patch_offset += num_patches
+            img_offset += c
+
+        # Final sanity checks
+        assert patch_offset == total_num_pooled_patches
+        assert img_offset == num_images
+
+        return images, new_token_pooling
+
+    def build_batched_videos(
+        self,
+        input_ids: torch.LongTensor,
+        pixel_values_videos: torch.Tensor,
+        video_token_pooling: torch.Tensor,
+        video_grids: torch.Tensor,
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        # 1) Count the number of videos in each example (0 or 1, since `any` is used)
+        if self.config.use_frame_special_tokens:
+            end_token_id = self.config.frame_end_token_id
+        else:
+            end_token_id = self.config.image_end_token_id
+        counts = (input_ids == end_token_id).any(dim=1).long()  # [N]
+        N = counts.size(0)
+        device = input_ids.device
+
+        # Total number of videos in the batch
+        num_videos = int(counts.sum().item())
+
+        # Sanity check
+        assert video_grids.size(0) == num_videos, (
+            f"Expected {num_videos} videos, but got {video_grids.size(0)}"
+        )
+
+        video_num_frames = video_grids[:, 0]  # [num_videos]
+        num_pooled_patches_per_video = video_grids.prod(dim=1)  # [num_videos]
+
+        # pixel_values_videos: [n_frames, n_patches, pixels_per_patch]
+        n_frames, n_patches, pixels_per_patch = pixel_values_videos.shape
+
+        # 2) Map each video index -> example index
+        # Example: if counts = [1, 0, 1], then this becomes [0, 2]
+        example_ids_for_video = torch.arange(N, device=device).repeat_interleave(counts)  # [num_videos]
+        assert example_ids_for_video.numel() == num_videos
+
+        # 2-1) Compute frames_per_example by summing per-video frame counts
+        frames_per_example = torch.zeros(
+            N,
+            dtype=video_num_frames.dtype,
+            device=device,
+        )
+        frames_per_example.index_add_(0, example_ids_for_video, video_num_frames)  # [N]
+
+        # 2-2) Compute num_pooled_patches_per_example
+        num_pooled_patches_per_example = torch.zeros(
+            N,
+            dtype=num_pooled_patches_per_video.dtype,
+            device=num_pooled_patches_per_video.device,
+        )
+        num_pooled_patches_per_example.index_add_(
+            0,
+            example_ids_for_video,
+            num_pooled_patches_per_video,
+        )
+
+        # Sanity checks
+        total_frames = int(frames_per_example.sum().item())
+        assert total_frames == n_frames, f"Expected {total_frames} frames, but got {n_frames}"
+
+        total_num_pooled_patches = int(num_pooled_patches_per_example.sum().item())
+        assert total_num_pooled_patches == video_token_pooling.size(0), (
+            f"Expected {total_num_pooled_patches} pooled patches, but got {video_token_pooling.size(0)}"
+        )
+
+        # 3) Build videos tensor filled with -1
+        M = int(frames_per_example.max().item())
+        videos = torch.full(
+            (N, M, n_patches, pixels_per_patch),
+            fill_value=-1,
+            dtype=pixel_values_videos.dtype,
+            device=device,
+        )
+
+        # 4) Fill videos with per-example slices from pixel_values_videos
+        offset_frame = 0
+        for i in range(N):
+            num = int(frames_per_example[i].item())
+            cur = pixel_values_videos[offset_frame : offset_frame + num]  # [num, n_patches, pixels_per_patch]
+            videos[i, :num] = cur
+            offset_frame += num
+
+        # Sanity check
+        assert offset_frame == n_frames
+
+        # 5) Build new token_pooling tensor filled with -1
+        P = int(num_pooled_patches_per_example.max().item())
+        _, dim = video_token_pooling.shape
+        new_token_pooling = torch.full(
+            (N, P, dim),
+            fill_value=-1,
+            dtype=video_token_pooling.dtype,
+            device=video_token_pooling.device,
+        )
+
+        # 6) Fill new token_pooling with per-example slices from video_token_pooling
+        patch_offset = 0
+        for i in range(N):
+            num_patches = int(num_pooled_patches_per_example[i].item())
+            cur = video_token_pooling[patch_offset : patch_offset + num_patches]  # [num_patches, dim]
+            new_token_pooling[i, :num_patches] = cur
+            patch_offset += num_patches
+
+        # Final sanity check
+        assert patch_offset == total_num_pooled_patches
+
+        return videos, new_token_pooling
+
+    def merge_visual_inputs(
+        self,
+        input_ids: torch.LongTensor | None = None,
+        pixel_values: torch.Tensor | None = None,
+        image_token_pooling: torch.Tensor | None = None,
+        image_grids: torch.Tensor | None = None,
+        image_num_crops: torch.Tensor | None = None,
+        pixel_values_videos: torch.Tensor | None = None,
+        video_token_pooling: torch.Tensor | None = None,
+        video_grids: torch.Tensor | None = None,
+    ) -> tuple[torch.Tensor | None, torch.Tensor | None]:
+        if pixel_values is not None and pixel_values_videos is not None:
+            raise ValueError("pixel_values and pixel_values_videos cannot both be provided")
+        elif pixel_values is not None:
+            assert input_ids is not None
+            images, token_pooling = self.build_batched_images(
+                input_ids=input_ids,
+                pixel_values=pixel_values,
+                image_token_pooling=image_token_pooling,
+                image_grids=image_grids,
+                image_num_crops=image_num_crops,
+            )
+        elif pixel_values_videos is not None:
+            assert input_ids is not None
+            images, token_pooling = self.build_batched_videos(
+                input_ids=input_ids,
+                pixel_values_videos=pixel_values_videos,
+                video_token_pooling=video_token_pooling,
+                video_grids=video_grids,
+            )
+        else:
+            images, token_pooling = None, None
+        return images, token_pooling
+
+    def build_input_embeddings(
+        self,
+        input_ids: torch.LongTensor,
+        images: torch.FloatTensor | None = None,  # image inputs
+        token_pooling: torch.LongTensor | None = None,
+    ) -> tuple[torch.Tensor, torch.Tensor | None]:
+        # Replace padding ids (-1) with 0 before the embedding lookup.
+        # Embedding shape: (batch_size, seq_len, d_model)
+        input_ids = input_ids * (input_ids != -1).to(input_ids.dtype)
+        x = self.transformer.wte(input_ids)
+
+        image_features: torch.FloatTensor | None = None
+        if images is not None:
+            image_features = self.vision_backbone(images, token_pooling).to(x.device)
+            is_image_patch = input_ids.reshape(-1) == self.config.image_patch_id
+            if is_image_patch.sum() != len(image_features):
+                raise RuntimeError(
+                    f"Expected {int(is_image_patch.sum())} image patch embeddings, got {len(image_features)}."
+                )
+            flat_x = x.reshape(-1, x.shape[-1]).clone()
+            flat_x[is_image_patch] = flat_x[is_image_patch] + image_features
+            x = flat_x.reshape_as(x)
+
+        # shape: (batch_size, seq_len, d_model)
+        x = self.transformer.emb_drop(x)  # type: ignore
+
+        return x, image_features
+
+    def _build_native_attention_bias(
+        self,
+        *,
+        inputs_embeds: torch.Tensor,
+        attention_mask: torch.Tensor | None,
+        token_type_ids: torch.Tensor | None,
+        past_key_values: Cache | None,
+    ) -> torch.Tensor:
+        if attention_mask is not None and attention_mask.ndim == 4:
+            return attention_mask.to(device=inputs_embeds.device)
+        batch_size, seq_len = inputs_embeds.shape[:2]
+        past_length = _cache_seq_len_int(past_key_values)
+        current_length = past_length + int(seq_len)
+        max_cache_len = _cache_max_len_int(past_key_values)
+        attention_mask_len = max_cache_len if max_cache_len > 0 else current_length
+        device = inputs_embeds.device
+
+        if attention_mask is None:
+            positions = torch.arange(attention_mask_len, device=device)
+            valid_mask = positions.unsqueeze(0) < current_length
+            valid_mask = valid_mask.expand(batch_size, -1)
+        elif attention_mask.ndim == 2:
+            valid_mask = torch.zeros((batch_size, attention_mask_len), device=device, dtype=torch.bool)
+            source_mask = attention_mask.to(device=device, dtype=torch.bool)
+            copy_len = min(int(source_mask.shape[-1]), attention_mask_len)
+            if copy_len > 0:
+                valid_mask[:, :copy_len] = source_mask[:, :copy_len]
+            if attention_mask_len > current_length:
+                valid_mask[:, current_length:] = False
+        else:
+            raise ValueError(f"Unsupported attention_mask shape for MolmoAct2: {tuple(attention_mask.shape)}")
+
+        valid_mask = valid_mask[:, None, None, :]
+        causal_mask = torch.tril(
+            torch.ones(attention_mask_len, attention_mask_len, device=device, dtype=torch.bool)
+        )[None, None, past_length:current_length, :attention_mask_len]
+
+        if token_type_ids is not None and past_length == 0:
+            causal_mask = causal_mask.expand(batch_size, -1, -1, -1).clone()
+            image_mask = token_type_ids.to(device=device, dtype=torch.bool)
+            can_attend_back = image_mask[:, :, None] & image_mask[:, None, :]
+            image_len = min(int(token_type_ids.shape[1]), attention_mask_len)
+            causal_mask[:, :, :, :image_len] = (
+                causal_mask[:, :, :, :image_len] | can_attend_back[:, None, :, :image_len]
+            )
+
+        allowed = valid_mask & causal_mask
+        return torch.where(
+            allowed,
+            torch.zeros((), device=device, dtype=inputs_embeds.dtype),
+            torch.full(
+                (),
+                torch.finfo(inputs_embeds.dtype).min,
+                device=device,
+                dtype=inputs_embeds.dtype,
+            ),
+        )
+
+    @can_return_tuple
+    def forward(
+        self,
+        input_ids: torch.LongTensor | None = None,
+        pixel_values: torch.FloatTensor | None = None,
+        image_token_pooling: torch.Tensor | None = None,
+        image_grids: torch.Tensor | None = None,
+        image_num_crops: torch.Tensor | None = None,
+        pixel_values_videos: torch.Tensor | None = None,
+        video_token_pooling: torch.Tensor | None = None,
+        video_grids: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.Tensor | None = None,
+        past_key_values: Cache | None = None,
+        token_type_ids: torch.LongTensor | None = None,
+        inputs_embeds: torch.FloatTensor | None = None,
+        use_cache: bool | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        cache_position: torch.LongTensor | None = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> tuple | MolmoAct2ModelOutputWithPast:
+        output_attentions = (
+            output_attentions if output_attentions is not None else self.config.output_attentions
+        )
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+        )
+        use_cache = use_cache if use_cache is not None else self.config.use_cache
+
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+
+        images, token_pooling = self.merge_visual_inputs(
+            input_ids=input_ids,
+            pixel_values=pixel_values,
+            image_token_pooling=image_token_pooling,
+            image_grids=image_grids,
+            image_num_crops=image_num_crops,
+            pixel_values_videos=pixel_values_videos,
+            video_token_pooling=video_token_pooling,
+            video_grids=video_grids,
+        )
+
+        if images is not None and inputs_embeds is not None:
+            raise ValueError("You cannot specify both images and inputs_embeds at the same time.")
+
+        if inputs_embeds is None:
+            inputs_embeds, image_features = self.build_input_embeddings(
+                input_ids,
+                images,
+                token_pooling,
+            )
+
+        if cache_position is None:
+            past_seen_tokens = _cache_seq_len_int(past_key_values)
+            cache_position = torch.arange(
+                past_seen_tokens,
+                past_seen_tokens + inputs_embeds.shape[1],
+                device=inputs_embeds.device,
+            )
+
+        if isinstance(attention_mask, dict):
+            causal_mask_mapping = attention_mask
+        else:
+            causal_mask_mapping = self._build_native_attention_bias(
+                inputs_embeds=inputs_embeds,
+                attention_mask=attention_mask,
+                token_type_ids=token_type_ids,
+                past_key_values=past_key_values,
+            )
+
+        outputs = self.transformer(
+            attention_mask=causal_mask_mapping,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            cache_position=cache_position,
+            **kwargs,
+        )
+
+        return MolmoAct2ModelOutputWithPast(
+            last_hidden_state=outputs.last_hidden_state,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+            image_hidden_states=image_features if images is not None else None,
+        )
+
+
+class MolmoAct2ForConditionalGeneration(MolmoAct2PreTrainedModel, GenerationMixin):
+    _checkpoint_conversion_mapping = {}
+    _tied_weights_keys = []  # Weights are not tied
+    # Reference: fix gemma3 grad acc #37208
+    accepts_loss_kwargs = False
+    config: MolmoAct2Config
+
+    def __init__(self, config: MolmoAct2Config):
+        super().__init__(config)
+
+        self.model = MolmoAct2Model(config)
+        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        self.vocab_size = config.vocab_size
+        self.model.action_cuda_graph_manager = ActionCudaGraphManager(self.model)
+        self.depth_decode_cuda_graph_manager = DepthDecodeCudaGraphManager(self)
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def get_input_embeddings(self) -> torch.nn.Module:
+        return self.model.transformer.wte
+
+    def set_input_embeddings(self, value: torch.nn.Module) -> None:
+        self.model.transformer.wte = value
+
+    def set_decoder(self, decoder):
+        self.model.set_decoder(decoder)
+
+    def get_decoder(self):
+        return self.model.get_decoder()
+
+    # Expose submodules on the conditional-generation class for backward compatibility
+    @property
+    def language_model(self) -> torch.nn.Module:
+        return self.model.transformer
+
+    @property
+    def vision_backbone(self) -> torch.nn.Module:
+        return self.model.vision_backbone
+
+    def _get_robot_stats(self) -> _RobotStats:
+        stats = getattr(self, "_molmoact2_robot_stats", None)
+        if stats is not None:
+            return stats
+        filename = getattr(self.config, "norm_stats_filename", "norm_stats.json")
+        base_dir = getattr(self.config, "_name_or_path", None) or getattr(self, "name_or_path", None)
+        if not base_dir:
+            raise ValueError(
+                "MolmoAct2 normalization stats are not loaded and config._name_or_path is empty; "
+                "load the model from a converted HF directory containing norm_stats.json."
+            )
+        stats_path = os.path.join(str(base_dir), filename)
+        if not os.path.isfile(stats_path):
+            try:
+                from huggingface_hub import hf_hub_download
+
+                stats_path = hf_hub_download(str(base_dir), filename, repo_type="model")
+            except Exception as exc:
+                raise FileNotFoundError(
+                    f"MolmoAct2 normalization stats file is missing: {stats_path}. "
+                    "Converted checkpoints must include norm_stats.json."
+                ) from exc
+        with open(stats_path, encoding="utf-8") as f:
+            payload = json.load(f)
+        stats = _RobotStats(payload)
+        self._molmoact2_robot_stats = stats
+        return stats
+
+    @staticmethod
+    def _move_inputs_to_device(inputs: Mapping[str, Any], device: torch.device) -> dict[str, Any]:
+        out = {}
+        for key, value in inputs.items():
+            out[key] = value.to(device) if torch.is_tensor(value) else value
+        return out
+
+    @staticmethod
+    def _drop_trivial_attention_mask(inputs: Mapping[str, Any]) -> dict[str, Any]:
+        out = dict(inputs)
+        attention_mask = out.get("attention_mask")
+        if torch.is_tensor(attention_mask) and bool(attention_mask.to(dtype=torch.bool).all().item()):
+            out.pop("attention_mask", None)
+        return out
+
+    @staticmethod
+    def _count_images(images: Any) -> int:
+        if images is None:
+            return 0
+        if isinstance(images, (list, tuple)):
+            return len(images)
+        arr = np.asarray(images) if not torch.is_tensor(images) else images
+        if getattr(arr, "ndim", 0) == 4:
+            return int(arr.shape[0])
+        return 1
+
+    @staticmethod
+    def _build_action_dim_is_pad(
+        *,
+        action_dim: int,
+        max_action_dim: int,
+        batch_size: int,
+        device: torch.device,
+    ) -> torch.Tensor | None:
+        if int(action_dim) > int(max_action_dim):
+            raise ValueError(
+                f"Requested action_dim {int(action_dim)} exceeds checkpoint max_action_dim {int(max_action_dim)}."
+            )
+        if int(action_dim) == int(max_action_dim):
+            return None
+        mask = torch.ones((int(batch_size), int(max_action_dim)), device=device, dtype=torch.bool)
+        mask[:, : int(action_dim)] = False
+        return mask
+
+    @staticmethod
+    def _slice_action_dim(actions: torch.Tensor, action_dim: int) -> torch.Tensor:
+        if actions.shape[-1] < int(action_dim):
+            raise ValueError(
+                f"Requested action_dim {int(action_dim)} but chunk only has width {actions.shape[-1]}."
+            )
+        return actions[..., : int(action_dim)]
+
+    @staticmethod
+    def _slice_action_chunk(
+        actions: torch.Tensor, n_obs_steps: int, n_action_steps: int | None
+    ) -> torch.Tensor:
+        if n_action_steps is None:
+            return actions
+        start = int(n_obs_steps) - 1
+        end = start + int(n_action_steps)
+        if end > actions.shape[1]:
+            raise ValueError(f"Requested actions up to {end} but model produced horizon {actions.shape[1]}.")
+        return actions[:, start:end]
+
+    def _depth_token_id_to_bin(self) -> dict[int, int]:
+        if self.config.depth_token_start_id is None or int(self.config.num_depth_tokens or 0) <= 0:
+            return {}
+        start = int(self.config.depth_token_start_id)
+        return {start + idx: idx for idx in range(int(self.config.num_depth_tokens))}
+
+    def _action_token_id_to_bin(self) -> dict[int, int]:
+        if self.config.action_token_start_id is None or int(self.config.num_action_tokens or 0) <= 0:
+            return {}
+        start = int(self.config.action_token_start_id)
+        return {start + idx: idx for idx in range(int(self.config.num_action_tokens))}
+
+    def _require_eos_token_id(self) -> int:
+        eos_token_id = getattr(self.config, "eos_token_id", None)
+        if eos_token_id is None and getattr(self, "generation_config", None) is not None:
+            eos_token_id = getattr(self.generation_config, "eos_token_id", None)
+        if isinstance(eos_token_id, (list, tuple)):
+            eos_token_id = eos_token_id[0] if eos_token_id else None
+        if eos_token_id is None:
+            raise RuntimeError(
+                "Discrete action generation requires `eos_token_id` in the converted HF config."
+            )
+        return int(eos_token_id)
+
+    def _decode_depth_bins_from_token_ids(self, token_ids: torch.Tensor) -> torch.Tensor:
+        if self.config.depth_start_token_id is None or self.config.depth_end_token_id is None:
+            raise RuntimeError("Depth generation requires <depth_start>/<depth_end> token IDs.")
+        token_id_to_bin = self._depth_token_id_to_bin()
+        if not token_id_to_bin:
+            raise RuntimeError("Depth generation requires indexed depth tokens in the converted config.")
+        depth_token_bins = _extract_discrete_token_bins(
+            _flatten_generated_token_ids(token_ids),
+            int(self.config.depth_start_token_id),
+            int(self.config.depth_end_token_id),
+            token_id_to_bin,
+        )
+        if not depth_token_bins:
+            raise RuntimeError("Model generated no decodable depth tokens between <depth_start>/<depth_end>.")
+        return torch.as_tensor([depth_token_bins], device=self.device, dtype=torch.long)
+
+    def _consume_generation_tokens(
+        self,
+        token_ids: torch.Tensor,
+        *,
+        past_key_values: Cache | None,
+        attention_mask: torch.Tensor | None,
+    ) -> tuple[MolmoAct2CausalLMOutputWithPast, torch.Tensor | None]:
+        if token_ids.ndim == 1:
+            next_input_ids = token_ids.unsqueeze(1)
+        elif token_ids.ndim == 2:
+            next_input_ids = token_ids
+        else:
+            raise ValueError(f"Expected token_ids to have rank 1 or 2, got {tuple(token_ids.shape)}.")
+        next_attention_mask = attention_mask
+        past_length = _cache_seq_len_int(past_key_values)
+        if next_attention_mask is not None:
+            # Pad the mask with ones so it covers the cached tokens plus the new ones.
+            required_len = int(past_length) + int(next_input_ids.shape[1])
+            if int(next_attention_mask.shape[-1]) < required_len:
+                pad_len = required_len - int(next_attention_mask.shape[-1])
+                next_attention_mask = torch.cat(
+                    (
+                        next_attention_mask,
+                        next_attention_mask.new_ones((next_input_ids.shape[0], pad_len)),
+                    ),
+                    dim=-1,
+                )
+        output = self(
+            input_ids=next_input_ids,
+            attention_mask=next_attention_mask,
+            past_key_values=past_key_values,
+            use_cache=True,
+            cache_position=(
+                torch.arange(
+                    past_length,
+                    past_length + int(next_input_ids.shape[1]),
+                    device=next_input_ids.device,
+                )
+                if past_key_values is not None
+                else None
+            ),
+        )
+        return output, next_attention_mask
+
+    def _make_depth_decode_attention_bias(
+        self, inputs: Mapping[str, Any], past_key_values: Cache
+    ) -> torch.Tensor:
+        layers = getattr(past_key_values, "layers", None)
+        max_cache_len = int(getattr(layers[0], "max_cache_len", 0)) if layers else 0
+        if max_cache_len <= 0:
+            raise RuntimeError("Depth decode fast path requires a cache with a fixed maximum length.")
+        input_ids = inputs["input_ids"]
+        batch_size = int(input_ids.shape[0])
+        device = input_ids.device
+        dtype = self.lm_head.weight.dtype
+
+        positions = torch.arange(max_cache_len, device=device, dtype=torch.long)
+        valid_mask = torch.ones((batch_size, max_cache_len), device=device, dtype=torch.bool)
+        attention_mask = inputs.get("attention_mask")
+        if attention_mask is not None:
+            source_mask = attention_mask.to(device=device, dtype=torch.bool)
+            copy_len = min(int(source_mask.shape[-1]), max_cache_len)
+            if copy_len > 0:
+                valid_mask[:, :copy_len] = source_mask[:, :copy_len]
+        causal_mask = positions[None, :] <= positions[:, None]
+        allowed = causal_mask.unsqueeze(0) & valid_mask[:, None, :]
+        attention_bias = torch.where(
+            allowed[:, None, :, :],
+            torch.zeros((), device=device, dtype=dtype),
+            torch.full((), torch.finfo(dtype).min, device=device, dtype=dtype),
+        )
+        return attention_bias
+
+    def _embed_base_tokens(self, input_ids: torch.Tensor) -> torch.Tensor:
+        # Skips MolmoAct2Embedding's per-call cat([base, new]); safe only for IDs
+        # below text_config.vocab_size. This includes released depth/action tokens.
+        wte = self.model.transformer.wte
+        base_embedding = getattr(wte, "embedding", None)
+        if base_embedding is None:
+            return wte(input_ids)
+        return F.embedding(input_ids, base_embedding)
+
+    def _run_ar_decode_step(
+        self,
+        token_ids: torch.Tensor,
+        *,
+        past_key_values: Cache,
+        attention_bias: torch.Tensor,
+    ) -> tuple[torch.Tensor, Cache]:
+        if token_ids.ndim == 1:
+            next_input_ids = token_ids.unsqueeze(1)
+        elif token_ids.ndim == 2:
+            next_input_ids = token_ids
+        else:
+            raise ValueError(f"Expected token_ids to have rank 1 or 2, got {tuple(token_ids.shape)}.")
+        past_length = _cache_seq_len_int(past_key_values)
+        end = past_length + int(next_input_ids.shape[1])
+        if self.depth_decode_cuda_graph_manager.can_use(
+            next_input_ids,
+            past_key_values=past_key_values,
+            attention_bias=attention_bias,
+        ):
+            return self.depth_decode_cuda_graph_manager.run(
+                next_input_ids,
+                past_key_values=past_key_values,
+                attention_bias=attention_bias,
+                past_length=past_length,
+            )
+        cache_position = torch.arange(past_length, end, device=next_input_ids.device, dtype=torch.long)
+        attention_bias = attention_bias[:, :, past_length:end, :end]
+        inputs_embeds = self._embed_base_tokens(next_input_ids)
+        outputs = self.model.transformer(
+            attention_mask=attention_bias,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=True,
+            output_attentions=False,
+            output_hidden_states=False,
+            cache_position=cache_position,
+        )
+        return outputs.last_hidden_state[:, -1:, :], outputs.past_key_values
+
+    def _run_depth_decode_step(
+        self,
+        token_ids: torch.Tensor,
+        *,
+        past_key_values: Cache,
+        attention_bias: torch.Tensor,
+    ) -> tuple[torch.Tensor, Cache]:
+        return self._run_ar_decode_step(
+            token_ids,
+            past_key_values=past_key_values,
+            attention_bias=attention_bias,
+        )
+
+    def _project_depth_logits(self, last_hidden: torch.Tensor) -> torch.Tensor:
+        start = int(self.config.depth_token_start_id)
+        end_id = start + int(self.config.num_depth_tokens)
+        return F.linear(last_hidden, self.lm_head.weight[start:end_id])
+
+    def _max_depth_decode_steps(self) -> int:
+        return max(
+            int(self.config.num_depth_codes or 0) + 8,
+            self.model._resolve_action_horizon() * 16,
+            1,
+        )
+
+    def _make_ar_decode_static_cache(self, inputs: Mapping[str, Any], max_steps: int) -> Cache:
+        prompt_len = inputs["input_ids"].shape[1]
+        return self.depth_decode_cuda_graph_manager.make_static_cache(
+            max_cache_len=prompt_len + max(1, int(max_steps)),
+        )
+
+    def _make_depth_static_cache(self, inputs: Mapping[str, Any]) -> Cache:
+        prompt_len = inputs["input_ids"].shape[1]
+        action_horizon = self.model._resolve_action_horizon()
+        max_end_steps = max(8, action_horizon)
+        action_token_budget = max(1, action_horizon * 16)
+        return self.depth_decode_cuda_graph_manager.make_static_cache(
+            max_cache_len=prompt_len + self._max_depth_decode_steps() + max_end_steps + action_token_budget,
+        )
+
+    def _continue_discrete_generation_from_output(
+        self,
+        initial_output: MolmoAct2CausalLMOutputWithPast,
+        *,
+        past_key_values: Cache | None,
+        attention_mask: torch.Tensor | None,
+        end_token_id: int,
+        max_steps: int,
+        attention_bias: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        generated_tokens: list[torch.Tensor] = []
+        current_output = initial_output
+        current_past_key_values = past_key_values
+        current_attention_mask = attention_mask
+        hit_end = False
+        for _ in range(int(max_steps)):
+            next_token = torch.argmax(current_output.logits[:, -1, :], dim=-1)
+            generated_tokens.append(next_token)
+            if bool((next_token == int(end_token_id)).all()):
+                hit_end = True
+                break
+            if attention_bias is None:
+                current_output, current_attention_mask = self._consume_generation_tokens(
+                    next_token,
+                    past_key_values=current_past_key_values,
+                    attention_mask=current_attention_mask,
+                )
+                current_past_key_values = current_output.past_key_values
+            else:
+                last_hidden, current_past_key_values = self._run_ar_decode_step(
+                    next_token,
+                    past_key_values=current_past_key_values,
+                    attention_bias=attention_bias,
+                )
+                current_output = MolmoAct2CausalLMOutputWithPast(
+                    logits=self.lm_head(last_hidden),
+                    past_key_values=current_past_key_values,
+                )
+        if not generated_tokens:
+            raise RuntimeError("Discrete continuation generated no tokens.")
+        if not hit_end:
+            raise RuntimeError(
+                f"Discrete continuation did not emit end token {int(end_token_id)} within {int(max_steps)} steps."
+            )
+        return torch.stack(generated_tokens, dim=1)
+
+    def _generate_depth_prefix(
+        self,
+        inputs: Mapping[str, Any],
+        *,
+        latest_first_image: np.ndarray | None,
+        depth_cache: Mapping[str, Any] | None,
+        enable_adaptive_depth: bool,
+    ) -> _DepthPrefix:
+        if self.config.depth_start_token_id is None or self.config.depth_end_token_id is None:
+            raise RuntimeError("Depth reasoning requires single-token <depth_start>/<depth_end>.")
+        if self.config.depth_token_start_id is None or int(self.config.num_depth_tokens or 0) <= 0:
+            raise RuntimeError("Depth reasoning requires indexed depth tokens.")
+        batch_size = int(inputs["input_ids"].shape[0])
+        if batch_size != 1 and enable_adaptive_depth:
+            raise ValueError("enable_adaptive_depth=True currently supports batch size 1.")
+        static_cache = self._make_depth_static_cache(inputs)
+        output = self(**inputs, use_cache=True, past_key_values=static_cache)
+        current_output = output
+        current_past_key_values = output.past_key_values
+        current_attention_mask = inputs.get("attention_mask")
+        generated_tokens: list[torch.Tensor] = []
+
+        if not enable_adaptive_depth:
+            hit_depth_end = False
+            max_steps = self._max_depth_decode_steps()
+            for _ in range(max_steps):
+                next_token = torch.argmax(current_output.logits[:, -1, :], dim=-1)
+                generated_tokens.append(next_token)
+                current_output, current_attention_mask = self._consume_generation_tokens(
+                    next_token,
+                    past_key_values=current_past_key_values,
+                    attention_mask=current_attention_mask,
+                )
+                current_past_key_values = current_output.past_key_values
+                if bool((next_token == int(self.config.depth_end_token_id)).all()):
+                    hit_depth_end = True
+                    break
+            if not generated_tokens:
+                raise RuntimeError("Depth generation produced no tokens.")
+            if not hit_depth_end:
+                raise RuntimeError(f"Depth generation did not emit <depth_end> within {max_steps} steps.")
+            depth_token_ids = torch.stack(generated_tokens, dim=1)
+            full_input_ids = torch.cat([inputs["input_ids"], depth_token_ids], dim=1)
+            full_attention_mask = None
+            if current_attention_mask is not None:
+                full_attention_mask = current_attention_mask[:, : full_input_ids.shape[1]]
+            encoder_kv_states = self.model._extract_kv_states(current_past_key_values)
+            return _DepthPrefix(
+                token_ids=depth_token_ids,
+                depth_bins=self._decode_depth_bins_from_token_ids(depth_token_ids),
+                full_input_ids=full_input_ids,
+                attention_mask=full_attention_mask,
+                encoder_kv_states=encoder_kv_states,
+                next_output=current_output,
+                past_key_values=current_past_key_values,
+            )
+
+        depth_start = torch.full(
+            (batch_size,),
+            int(self.config.depth_start_token_id),
+            device=self.device,
+            dtype=torch.long,
+        )
+        code_token_ids = torch.arange(
+            int(self.config.depth_token_start_id),
+            int(self.config.depth_token_start_id) + int(self.config.num_depth_tokens),
+            device=self.device,
+            dtype=torch.long,
+        )
+        depth_attention_bias = self._make_depth_decode_attention_bias(inputs, current_past_key_values)
+        generated_tokens.append(depth_start)
+        last_hidden, current_past_key_values = self._run_depth_decode_step(
+            depth_start,
+            past_key_values=current_past_key_values,
+            attention_bias=depth_attention_bias,
+        )
+        previous_image = None
+        previous_bins = None
+        if depth_cache is not None:
+            previous_image = depth_cache.get("image")
+            previous_bins = depth_cache.get("depth_bins")
+        selective = (
+            bool(enable_adaptive_depth)
+            and latest_first_image is not None
+            and previous_image is not None
+            and previous_bins is not None
+        )
+        update_mask = None
+        previous_buffer_t = None
+        if selective:
+            previous_buffer = np.asarray(previous_bins, dtype=np.int64).reshape(-1)
+            if previous_buffer.shape[0] == int(self.config.num_depth_codes):
+                update_mask = _compute_depth_update_mask(
+                    latest_first_image,
+                    _normalize_image_for_cache(previous_image),
+                    num_depth_codes=int(self.config.num_depth_codes),
+                )
+                previous_buffer_t = (
+                    torch.from_numpy(previous_buffer)
+                    .to(
+                        device=self.device,
+                        dtype=torch.long,
+                    )
+                    .unsqueeze(0)
+                )
+            else:
+                selective = False
+
+        depth_bins = torch.zeros(
+            (batch_size, int(self.config.num_depth_codes)),
+            device=self.device,
+            dtype=torch.long,
+        )
+        num_depth_codes = int(self.config.num_depth_codes)
+        if not selective or update_mask is None or previous_buffer_t is None:
+            for depth_idx in range(num_depth_codes):
+                depth_logits = self._project_depth_logits(last_hidden)
+                predicted_bins = depth_logits.squeeze(1).argmax(dim=-1)
+                depth_bins[:, depth_idx] = predicted_bins
+                chosen_token_ids = code_token_ids[predicted_bins]
+                generated_tokens.append(chosen_token_ids)
+                last_hidden, current_past_key_values = self._run_depth_decode_step(
+                    chosen_token_ids,
+                    past_key_values=current_past_key_values,
+                    attention_bias=depth_attention_bias,
+                )
+        else:
+            for start_idx, end_idx, should_generate in _build_depth_update_spans(update_mask):
+                if should_generate:
+                    for depth_idx in range(start_idx, end_idx):
+                        depth_logits = self._project_depth_logits(last_hidden)
+                        predicted_bins = depth_logits.squeeze(1).argmax(dim=-1)
+                        depth_bins[:, depth_idx] = predicted_bins
+                        chosen_token_ids = code_token_ids[predicted_bins]
+                        generated_tokens.append(chosen_token_ids)
+                        last_hidden, current_past_key_values = self._run_depth_decode_step(
+                            chosen_token_ids,
+                            past_key_values=current_past_key_values,
+                            attention_bias=depth_attention_bias,
+                        )
+                    continue
+                replay_bins = previous_buffer_t[:, start_idx:end_idx].expand(batch_size, -1)
+                depth_bins[:, start_idx:end_idx] = replay_bins
+                replay_token_ids = code_token_ids[replay_bins]
+                generated_tokens.extend(replay_token_ids.unbind(dim=1))
+                last_hidden, current_past_key_values = self._run_depth_decode_step(
+                    replay_token_ids,
+                    past_key_values=current_past_key_values,
+                    attention_bias=depth_attention_bias,
+                )
+        hit_depth_end = False
+        max_depth_end_steps = max(8, self.model._resolve_action_horizon())
+        full_logits = self.lm_head(last_hidden)
+        for _ in range(max_depth_end_steps):
+            next_token = full_logits.squeeze(1).argmax(dim=-1)
+            generated_tokens.append(next_token)
+            last_hidden, current_past_key_values = self._run_depth_decode_step(
+                next_token,
+                past_key_values=current_past_key_values,
+                attention_bias=depth_attention_bias,
+            )
+            full_logits = self.lm_head(last_hidden)
+            if bool((next_token == int(self.config.depth_end_token_id)).all()):
+                hit_depth_end = True
+                break
+        if not hit_depth_end:
+            raise RuntimeError(
+                f"Depth generation did not emit <depth_end> within {max_depth_end_steps} steps "
+                "after adaptive depth tokens."
+            )
+
+        depth_token_ids = torch.stack(generated_tokens, dim=1)
+        full_input_ids = torch.cat([inputs["input_ids"], depth_token_ids], dim=1)
+        attention_mask = inputs.get("attention_mask")
+        if attention_mask is not None:
+            full_attention_mask = torch.cat(
+                (attention_mask, attention_mask.new_ones(depth_token_ids.shape)),
+                dim=-1,
+            )[:, : full_input_ids.shape[1]]
+        else:
+            full_attention_mask = None
+        current_output = MolmoAct2CausalLMOutputWithPast(
+            logits=full_logits,
+            past_key_values=current_past_key_values,
+        )
+        encoder_kv_states = self.model._extract_kv_states(current_past_key_values)
+        return _DepthPrefix(
+            token_ids=depth_token_ids,
+            depth_bins=depth_bins,
+            full_input_ids=full_input_ids,
+            attention_mask=full_attention_mask,
+            encoder_kv_states=encoder_kv_states,
+            next_output=current_output,
+            past_key_values=current_past_key_values,
+        )
+
+    def _decode_discrete_action_chunk(
+        self,
+        generated_token_ids: torch.Tensor,
+        *,
+        action_tokenizer: Any,
+        action_dim: int,
+        action_horizon: int,
+    ) -> torch.Tensor:
+        if action_tokenizer is None:
+            raise ValueError("inference_action_mode='discrete' requires an `action_tokenizer` input.")
+        if self.config.action_start_token_id is None or self.config.action_end_token_id is None:
+            raise RuntimeError("Discrete action generation requires <action_start>/<action_end> token IDs.")
+        token_id_to_bin = self._action_token_id_to_bin()
+        if not token_id_to_bin:
+            raise RuntimeError(
+                "Discrete action generation requires indexed action tokens in the converted config."
+            )
+        discrete_token_ids = _extract_discrete_token_bins(
+            _flatten_generated_token_ids(generated_token_ids),
+            int(self.config.action_start_token_id),
+            int(self.config.action_end_token_id),
+            token_id_to_bin,
+        )
+        if not discrete_token_ids:
+            raise RuntimeError(
+                "Model generated no decodable action tokens between <action_start>/<action_end>."
+            )
+        try:
+            decoded = action_tokenizer.decode(
+                [discrete_token_ids],
+                time_horizon=int(action_horizon),
+                action_dim=int(action_dim),
+            )
+        except TypeError:
+            decoded = action_tokenizer.decode([discrete_token_ids])
+        action_chunk = np.asarray(decoded, dtype=np.float32)
+        if action_chunk.ndim == 1:
+            action_chunk = action_chunk[None, None, :]
+        elif action_chunk.ndim == 2:
+            action_chunk = action_chunk[None, :, :]
+        elif action_chunk.ndim > 3:
+            action_chunk = action_chunk.reshape(1, action_chunk.shape[-2], action_chunk.shape[-1])
+        if action_chunk.ndim != 3:
+            raise RuntimeError(f"Decoded action chunk has unexpected shape {action_chunk.shape}.")
+        return torch.as_tensor(action_chunk, device=self.device, dtype=torch.float32)
+
+    @torch.no_grad()
+    def predict_action(
+        self,
+        *,
+        processor: Any,
+        images: Any,
+        task: str,
+        state: Any,
+        norm_tag: str,
+        inference_action_mode: str | None = None,
+        enable_depth_reasoning: bool = False,
+        enable_adaptive_depth: bool = True,
+        depth_cache: Mapping[str, Any] | None = None,
+        action_tokenizer: Any = None,
+        num_steps: int | None = None,
+        n_action_steps: int | None = None,
+        generator: torch.Generator | None = None,
+        normalize_language: bool = True,
+        enable_cuda_graph: bool = True,
+        return_dict: bool = True,
+    ) -> MolmoAct2ActionOutput | torch.Tensor:
+        if state is None:
+            raise ValueError("MolmoAct2 `predict_action` requires `state` for discrete state prompting.")
+        if inference_action_mode is None:
+            raise ValueError(
+                "`inference_action_mode` must be provided explicitly as either 'continuous' or 'discrete'."
+            )
+        inference_action_mode = str(inference_action_mode)
+        if inference_action_mode not in {"continuous", "discrete"}:
+            raise ValueError("inference_action_mode must be either 'continuous' or 'discrete'.")
+        if inference_action_mode == "continuous" and not bool(self.config.add_action_expert):
+            raise RuntimeError(
+                "inference_action_mode='continuous' requires an action expert, but this checkpoint "
+                "was converted with add_action_expert=False."
+            )
+        if inference_action_mode == "continuous" and self.config.action_mode not in {
+            "continuous",
+            "both",
+        }:
+            raise ValueError(
+                "inference_action_mode='continuous' requires checkpoint action_mode in "
+                f"{{'continuous', 'both'}}, got {self.config.action_mode!r}."
+            )
+        if inference_action_mode == "discrete":
+            if action_tokenizer is None:
+                raise ValueError("inference_action_mode='discrete' requires an `action_tokenizer` input.")
+            if self.config.action_mode not in {"discrete", "both"}:
+                raise ValueError(
+                    "inference_action_mode='discrete' requires checkpoint action_mode in "
+                    f"{{'discrete', 'both'}}, got {self.config.action_mode!r}."
+                )
+        if enable_depth_reasoning and not bool(self.config.enable_depth_reasoning):
+            raise ValueError("This checkpoint was not trained with `--enable_depth_reasoning`.")
+
+        stats = self._get_robot_stats()
+        norm_tag = stats.validate_tag(norm_tag)
+        metadata = stats.get_metadata(norm_tag)
+        normalized_state = np.asarray(stats.normalize_state(state, norm_tag), dtype=np.float32)
+        num_state_tokens = int(self.config.num_state_tokens or 0)
+        if num_state_tokens <= 0:
+            raise RuntimeError(
+                "Discrete state prompting requires indexed state tokens in the converted config."
+            )
+        discrete_state_string = _build_discrete_state_string(normalized_state, num_state_tokens)
+        style = "robot_depth_action" if enable_depth_reasoning else "robot_action"
+        task_text = str(task or "")
+        if normalize_language:
+            task_text = _normalize_question_text(task_text)
+        text = _build_robot_text(
+            task=task_text,
+            style=style,
+            discrete_state_string=discrete_state_string,
+            setup_type=str(metadata.get("setup_type", "") or ""),
+            control_mode=str(metadata.get("control_mode", "") or ""),
+            add_setup_tokens=bool(self.config.add_setup_tokens),
+            add_control_tokens=bool(self.config.add_control_tokens),
+            num_images=self._count_images(images),
+        )
+        inputs = processor(text=text, images=images, return_tensors="pt")
+        inputs = self._move_inputs_to_device(inputs, self.device)
+        inputs = self._drop_trivial_attention_mask(inputs)
+
+        action_dim = stats.get_action_dim(norm_tag)
+        if action_dim is None:
+            action_dim = int(self.config.max_action_dim)
+        action_dim = int(action_dim)
+        max_action_horizon = self.model._resolve_action_horizon()
+        action_horizon = stats.get_action_horizon(norm_tag) or max_action_horizon
+        if int(action_horizon) > max_action_horizon:
+            raise ValueError(
+                f"Tag action_horizon={int(action_horizon)} exceeds checkpoint max_action_horizon={max_action_horizon}."
+            )
+        generation_horizon = int(action_horizon)
+        resolved_n_action_steps = n_action_steps
+        if resolved_n_action_steps is None:
+            resolved_n_action_steps = stats.get_n_action_steps(norm_tag)
+        if resolved_n_action_steps is None:
+            resolved_n_action_steps = int(action_horizon)
+        resolved_n_action_steps = int(resolved_n_action_steps)
+        if resolved_n_action_steps < 1:
+            raise ValueError(f"n_action_steps must be >= 1, got {resolved_n_action_steps}.")
+        if resolved_n_action_steps > int(action_horizon):
+            raise ValueError(
+                f"Requested n_action_steps={resolved_n_action_steps} exceeds tag action_horizon={int(action_horizon)}."
+            )
+        batch_size = int(inputs["input_ids"].shape[0])
+        action_dim_is_pad = self._build_action_dim_is_pad(
+            action_dim=action_dim,
+            max_action_dim=int(self.config.max_action_dim),
+            batch_size=batch_size,
+            device=self.device,
+        )
+        self.model.action_cuda_graph_manager.set_enabled(enable_cuda_graph)
+        self.depth_decode_cuda_graph_manager.set_enabled(enable_cuda_graph)
+
+        generated_token_ids = None
+        depth_bins = None
+        updated_depth_cache = depth_cache
+        if inference_action_mode == "continuous":
+            if enable_depth_reasoning:
+                latest_first_image = _extract_first_image(images)
+                depth_prefix = self._generate_depth_prefix(
+                    inputs,
+                    latest_first_image=latest_first_image,
+                    depth_cache=depth_cache,
+                    enable_adaptive_depth=bool(enable_adaptive_depth),
+                )
+                generated_token_ids = depth_prefix.token_ids
+                depth_bins = depth_prefix.depth_bins
+                actions = self.model.generate_actions_from_inputs(
+                    input_ids=depth_prefix.full_input_ids,
+                    attention_mask=depth_prefix.attention_mask,
+                    action_dim_is_pad=action_dim_is_pad,
+                    action_horizon=generation_horizon,
+                    num_steps=num_steps,
+                    generator=generator,
+                    encoder_kv_states=depth_prefix.encoder_kv_states,
+                    encoder_attention_mask=self.model._get_encoder_attention_mask(
+                        depth_prefix.full_input_ids,
+                        depth_prefix.attention_mask,
+                    ),
+                )
+                if latest_first_image is not None:
+                    updated_depth_cache = {
+                        "image": latest_first_image,
+                        "depth_bins": depth_bins.detach().cpu().reshape(-1).numpy().astype(np.int64),
+                    }
+            else:
+                actions = self.model.generate_actions_from_inputs(
+                    **inputs,
+                    action_dim_is_pad=action_dim_is_pad,
+                    action_horizon=generation_horizon,
+                    num_steps=num_steps,
+                    generator=generator,
+                )
+        else:
+            if enable_depth_reasoning:
+                latest_first_image = _extract_first_image(images)
+                depth_prefix = self._generate_depth_prefix(
+                    inputs,
+                    latest_first_image=latest_first_image,
+                    depth_cache=depth_cache,
+                    enable_adaptive_depth=bool(enable_adaptive_depth),
+                )
+                action_token_ids = self._continue_discrete_generation_from_output(
+                    depth_prefix.next_output,
+                    past_key_values=depth_prefix.past_key_values,
+                    attention_mask=depth_prefix.attention_mask,
+                    end_token_id=self._require_eos_token_id(),
+                    max_steps=max(1, int(generation_horizon * 16)),
+                )
+                generated_token_ids = torch.cat([depth_prefix.token_ids, action_token_ids], dim=1)
+                depth_bins = depth_prefix.depth_bins
+                if latest_first_image is not None:
+                    updated_depth_cache = {
+                        "image": latest_first_image,
+                        "depth_bins": depth_bins.detach().cpu().reshape(-1).numpy().astype(np.int64),
+                    }
+            else:
+                max_action_decode_steps = max(1, int(generation_horizon * 16))
+                action_attention_bias = None
+                if enable_cuda_graph:
+                    action_static_cache = self._make_ar_decode_static_cache(
+                        inputs,
+                        max_steps=max_action_decode_steps,
+                    )
+                    action_attention_bias = self._make_depth_decode_attention_bias(
+                        inputs,
+                        action_static_cache,
+                    )
+                    prefill_output = self(
+                        **inputs,
+                        use_cache=True,
+                        past_key_values=action_static_cache,
+                    )
+                else:
+                    prefill_output = self(**inputs, use_cache=True)
+                action_token_ids = self._continue_discrete_generation_from_output(
+                    prefill_output,
+                    past_key_values=prefill_output.past_key_values,
+                    attention_mask=inputs.get("attention_mask"),
+                    end_token_id=self._require_eos_token_id(),
+                    max_steps=max_action_decode_steps,
+                    attention_bias=action_attention_bias,
+                )
+                generated_token_ids = action_token_ids
+            actions = self._decode_discrete_action_chunk(
+                generated_token_ids,
+                action_tokenizer=action_tokenizer,
+                action_dim=action_dim,
+                action_horizon=generation_horizon,
+            )
+
+        actions = self._slice_action_dim(actions, action_dim)
+        actions = self._slice_action_chunk(actions, int(self.config.n_obs_steps), resolved_n_action_steps)
+        actions = stats.unnormalize_action(actions, norm_tag)
+        if not torch.is_tensor(actions):
+            actions = torch.as_tensor(actions, device=self.device, dtype=torch.float32)
+        else:
+            actions = actions.to(device=self.device, dtype=torch.float32)
+        output = MolmoAct2ActionOutput(
+            actions=actions,
+            generated_token_ids=generated_token_ids,
+            depth_bins=depth_bins,
+            depth_cache=updated_depth_cache,
+        )
+        if return_dict:
+            return output
+        return actions
+
+    @can_return_tuple
+    def forward(
+        self,
+        input_ids: torch.LongTensor | None = None,
+        pixel_values: torch.Tensor | None = None,
+        image_token_pooling: torch.Tensor | None = None,
+        image_grids: torch.Tensor | None = None,
+        image_num_crops: torch.Tensor | None = None,
+        pixel_values_videos: torch.Tensor | None = None,
+        video_token_pooling: torch.Tensor | None = None,
+        video_grids: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        position_ids: torch.LongTensor | None = None,
+        past_key_values: Cache | None = None,
+        token_type_ids: torch.LongTensor | None = None,
+        inputs_embeds: torch.FloatTensor | None = None,
+        labels: torch.LongTensor | None = None,
+        use_cache: bool | None = None,
+        output_attentions: bool | None = None,
+        output_hidden_states: bool | None = None,
+        cache_position: torch.LongTensor | None = None,
+        logits_to_keep: int | torch.Tensor = 0,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> tuple | MolmoAct2CausalLMOutputWithPast:
+        r"""
+        ```python
+        >>> from PIL import Image
+        >>> import requests
+        >>> from lerobot.policies.molmoact2.hf_model.modeling_molmoact2 import MolmoAct2ForConditionalGeneration
+        >>> from lerobot.policies.molmoact2.processor_molmoact2 import _load_local_molmoact2_processor
+
+        >>> model = MolmoAct2ForConditionalGeneration.from_pretrained("...")
+        >>> processor = _load_local_molmoact2_processor("...")
+
+        >>> prompt = "What's the content of the image?"
+        >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
+        >>> image = Image.open(requests.get(url, stream=True).raw)
+
+        >>> messages = [{"role": "user", "content": [{"type": "text", "text": prompt}, {"type": "image", "image": image}]}]
+
+        >>> inputs = processor.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
+
+        >>> # Generate
+        >>> generated_ids = model.generate(**inputs, max_new_tokens=15)
+        >>> generated_tokens = generated_ids[:, inputs['input_ids'].size(1):]
+        >>> processor.post_process_image_text_to_text(generated_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+        "The image shows a bustling street scene in what appears to be a Chinatown area. There's ..."
+        ```"""
+        outputs = self.model(
+            input_ids=input_ids,
+            pixel_values=pixel_values,
+            image_token_pooling=image_token_pooling,
+            image_grids=image_grids,
+            image_num_crops=image_num_crops,
+            pixel_values_videos=pixel_values_videos,
+            video_token_pooling=video_token_pooling,
+            video_grids=video_grids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            token_type_ids=token_type_ids,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            cache_position=cache_position,
+            **kwargs,
+        )
+
+        hidden_states = outputs.last_hidden_state
+        # Only compute necessary logits, and do not upcast them to float if we are not computing the loss
+        slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
+        logits = self.lm_head(hidden_states[:, slice_indices, :])
+
+        loss = None
+        if labels is not None:
+            loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.vocab_size)
+
+        return MolmoAct2CausalLMOutputWithPast(
+            loss=loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+            image_hidden_states=outputs.image_hidden_states,
+        )
+
+    def prepare_inputs_for_generation(
+        self,
+        input_ids: torch.LongTensor,
+        past_key_values: Cache | None = None,
+        inputs_embeds: torch.FloatTensor | None = None,
+        pixel_values: torch.FloatTensor | None = None,
+        image_token_pooling: torch.Tensor | None = None,
+        image_grids: torch.Tensor | None = None,
+        image_num_crops: torch.Tensor | None = None,
+        pixel_values_videos: torch.Tensor | None = None,
+        video_token_pooling: torch.Tensor | None = None,
+        video_grids: torch.Tensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        token_type_ids: torch.LongTensor | None = None,
+        cache_position: torch.LongTensor | None = None,
+        logits_to_keep: int | torch.Tensor | None = None,
+        **kwargs,
+    ):
+        model_inputs = super().prepare_inputs_for_generation(
+            input_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            cache_position=cache_position,
+            logits_to_keep=logits_to_keep,
+            token_type_ids=token_type_ids,
+            **kwargs,
+        )
+
+        include_visual_inputs = past_key_values is None
+        if past_key_values is not None and hasattr(past_key_values, "get_seq_length"):
+            include_visual_inputs = int(past_key_values.get_seq_length()) == 0
+        if include_visual_inputs:
+            model_inputs["pixel_values"] = pixel_values
+            model_inputs["image_token_pooling"] = image_token_pooling
+            model_inputs["image_grids"] = image_grids
+            model_inputs["image_num_crops"] = image_num_crops
+            model_inputs["pixel_values_videos"] = pixel_values_videos
+            model_inputs["video_token_pooling"] = video_token_pooling
+            model_inputs["video_grids"] = video_grids
+
+        return model_inputs
+
+    # Adapted from transformers.models.gemma3.modeling_gemma3
+    @staticmethod
+    def create_masks_for_generate(
+        config: PretrainedConfig,
+        input_embeds: torch.Tensor,
+        attention_mask: torch.Tensor | None,
+        cache_position: torch.Tensor,
+        past_key_values: Cache | None,
+        position_ids: torch.Tensor | None,
+        token_type_ids: torch.Tensor | None = None,
+        **kwargs,
+    ) -> dict:
+        # Prepare mask arguments
+        mask_kwargs = {
+            "config": config.get_text_config(),
+            "input_embeds": input_embeds,
+            "attention_mask": attention_mask,
+            "cache_position": cache_position,
+            "past_key_values": past_key_values,
+            "position_ids": position_ids,
+        }
+        # Add the token type ids mask for generate as well
+        if token_type_ids is not None and input_embeds.shape[1] != 1:
+            # We need to pass an additional mask function to account for token type ids, and it needs to be an `or`
+            mask_kwargs["or_mask_function"] = token_type_ids_mask_function(
+                token_type_ids.to(cache_position.device)
+            )
+
+        return create_masks_for_generate(**mask_kwargs)
diff --git a/src/lerobot/policies/molmoact2/hf_model/processing_molmoact2.py b/src/lerobot/policies/molmoact2/hf_model/processing_molmoact2.py
new file mode 100644
index 000000000..7b8775faa
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/hf_model/processing_molmoact2.py
@@ -0,0 +1,431 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""
+Processor class for MolmoAct2.
+"""
+
+from typing import Optional, Union
+import dataclasses
+
+import numpy as np
+
+from transformers.image_utils import ImageInput
+from transformers.video_utils import VideoInput
+from transformers.processing_utils import (
+    Unpack,
+    ProcessingKwargs,
+    ProcessorMixin,
+)
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.tokenization_utils_base import TextInput, PreTokenizedInput
+from transformers.utils import logging
+
+from transformers import AutoTokenizer
+from .image_processing_molmoact2 import MolmoAct2ImagesKwargs, MolmoAct2ImageProcessor
+from .video_processing_molmoact2 import MolmoAct2VideoProcessorKwargs, MolmoAct2VideoProcessor
+
+
+logger = logging.get_logger(__name__)
+
+
+# Special tokens. These must be present in any tokenizer we use, since the preprocessor relies on them.
+IMAGE_PATCH_TOKEN = "<im_patch>"  # Where to insert high-res tokens
+IMAGE_LOW_RES_TOKEN = "<im_low>"  # Where to insert low-res tokens
+IM_START_TOKEN = "<im_start>"
+LOW_RES_IMAGE_START_TOKEN = "<low_res_im_start>"
+FRAME_START_TOKEN = "<frame_start>"
+IM_END_TOKEN = "<im_end>"
+FRAME_END_TOKEN = "<frame_end>"
+IM_COL_TOKEN = "<im_col>"
+IMAGE_PROMPT = "<|image|>"
+VIDEO_PROMPT = "<|video|>"
+
+IMAGE_TOKENS = [
+    IMAGE_PATCH_TOKEN,
+    IM_COL_TOKEN,
+    IM_START_TOKEN,
+    LOW_RES_IMAGE_START_TOKEN,
+    FRAME_START_TOKEN,
+    IM_END_TOKEN,
+    FRAME_END_TOKEN,
+    IMAGE_LOW_RES_TOKEN,
+]
+
+
+class MolmoAct2ProcessorKwargs(ProcessingKwargs, total=False):
+    """MolmoAct2 processor kwargs"""
+
+    images_kwargs: MolmoAct2ImagesKwargs
+    videos_kwargs: MolmoAct2VideoProcessorKwargs
+    _defaults = {
+        "text_kwargs": {
+            "padding": False,
+            "return_mm_token_type_ids": True,
+        },
+        "videos_kwargs": {"return_metadata": True},
+    }
+
+
+class MolmoAct2Processor(ProcessorMixin):
+    attributes = ["image_processor", "video_processor", "tokenizer"]
+    optional_attributes = [
+        "chat_template",
+        "time_mode",
+        "image_use_col_tokens",
+        "use_single_crop_col_tokens",
+        "use_single_crop_start_token",
+        "video_use_col_tokens",
+        "use_frame_special_tokens",
+    ]
+    image_processor_class = "AutoImageProcessor"
+    video_processor_class = "AutoVideoProcessor"
+    tokenizer_class = "AutoTokenizer"
+
+    def __init__(
+        self,
+        image_processor: MolmoAct2ImageProcessor = None,
+        video_processor: MolmoAct2VideoProcessor = None,
+        tokenizer: AutoTokenizer = None,
+        chat_template: str | None = None,
+        image_use_col_tokens: bool | None = True,
+        use_single_crop_col_tokens: bool | None = None,
+        use_single_crop_start_token: bool | None = True,
+        video_use_col_tokens: bool | None = False,
+        use_frame_special_tokens: bool | None = True,
+        **kwargs,
+    ) -> None:
+        super().__init__(
+            image_processor,
+            video_processor,
+            tokenizer,
+            chat_template=chat_template,
+        )
+        self.image_use_col_tokens = image_use_col_tokens
+        self.use_single_crop_col_tokens = use_single_crop_col_tokens
+        self.use_single_crop_start_token = use_single_crop_start_token
+        self.video_use_col_tokens = video_use_col_tokens
+        self.use_frame_special_tokens = use_frame_special_tokens
+
+        self.image_placeholder_token = IMAGE_PROMPT
+        self.video_placeholder_token = VIDEO_PROMPT
+        self.image_token_ids = [tokenizer.convert_tokens_to_ids(token) for token in IMAGE_TOKENS]
+
+    def get_image_tokens(self, image_grid: np.ndarray):
+        resized_h, resized_w, height, width = image_grid
+        if int(height) == 0 or int(width) == 0:
+            per_row = np.full(resized_w, IMAGE_PATCH_TOKEN)
+            use_single_crop_col_tokens = (
+                self.image_use_col_tokens
+                if self.use_single_crop_col_tokens is None
+                else self.use_single_crop_col_tokens
+            )
+            if use_single_crop_col_tokens:
+                per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
+            joint = [
+                [IM_START_TOKEN],
+                np.tile(per_row, [resized_h]),
+                [IM_END_TOKEN],
+            ]
+            return np.concatenate(joint)
+        per_row = np.full(width, IMAGE_PATCH_TOKEN)
+        if self.image_use_col_tokens:
+            per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
+        joint = [
+            [IM_START_TOKEN],
+            np.tile(per_row, [height]),
+            [IM_END_TOKEN],
+        ]
+        per_row = np.full(resized_w, IMAGE_PATCH_TOKEN)
+        use_single_crop_col_tokens = (
+            self.image_use_col_tokens
+            if self.use_single_crop_col_tokens is None
+            else self.use_single_crop_col_tokens
+        )
+        image_start_token = LOW_RES_IMAGE_START_TOKEN if self.use_single_crop_start_token else IM_START_TOKEN
+        if use_single_crop_col_tokens:
+            per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
+        joint = [
+            [image_start_token],
+            np.tile(per_row, [resized_h]),
+            [IM_END_TOKEN],
+        ] + joint
+
+        return np.concatenate(joint)
+
+    def get_video_string(
+        self,
+        video_grid: np.ndarray,
+        timestamps: np.ndarray,
+    ):
+        if self.use_frame_special_tokens:
+            start_token_id = FRAME_START_TOKEN
+            end_token_id = FRAME_END_TOKEN
+        else:
+            start_token_id = IM_START_TOKEN
+            end_token_id = IM_END_TOKEN
+
+        num_frames, h, w = video_grid
+        video_string: str = ""
+        for frame_idx, frame_time in enumerate(timestamps):
+            # `per-frame-compact` time mode
+            prev_space = " " if frame_idx > 0 else ""
+            frame_prefix = prev_space + f"{frame_time:.1f} "  # explicit whitespace before/after image tokens
+
+            video_string += frame_prefix
+            per_row = np.full(w, IMAGE_PATCH_TOKEN)
+            if self.video_use_col_tokens:
+                per_row = np.concatenate([per_row, [IM_COL_TOKEN]], 0)
+            extra_tokens = np.tile(per_row, [h])
+            video_tokens = [
+                [start_token_id],
+                extra_tokens,
+                [end_token_id],
+            ]
+            video_string += "".join(np.concatenate(video_tokens, 0))
+
+        return video_string
+
+    def insert_bos(
+        self,
+        input_ids: np.ndarray,
+        attention_mask: np.ndarray,
+        bos_token_id: int,
+        pad_token_id: int,
+    ):
+        """
+        Args:
+            input_ids: [B, S] array with left padding
+            attention_mask: [B, S] array (0 for pad, 1 for valid)
+            bos_token_id: int
+            pad_token_id: int
+        Returns:
+            input_ids_out: [B, S] or [B, S+1] array with bos inserted if needed
+            attention_mask_out: same shape as input_ids_out
+        """
+
+        need_to_expand = len(input_ids.shape) == 1
+        if need_to_expand:
+            input_ids = input_ids[None, :]
+            attention_mask = attention_mask[None, :]
+
+        B, S = input_ids.shape
+
+        # Handle zero-length sequence
+        if S == 0:
+            new_input_ids = np.full((B, 1), bos_token_id, dtype=input_ids.dtype)
+            new_attention_mask = np.ones((B, 1), dtype=attention_mask.dtype)
+            if need_to_expand:
+                new_input_ids = new_input_ids[0]
+                new_attention_mask = new_attention_mask[0]
+            return new_input_ids, new_attention_mask
+
+        first_valid_index = (attention_mask == 1).argmax(axis=-1)  # [B]
+        bos_already_present = np.all(input_ids[np.arange(B), first_valid_index] == bos_token_id)
+
+        if bos_already_present:
+            if need_to_expand:
+                input_ids = input_ids[0]
+                attention_mask = attention_mask[0]
+            return input_ids, attention_mask
+        else:
+            new_input_ids = np.full((B, S + 1), pad_token_id, dtype=input_ids.dtype)
+            new_attention_mask = np.zeros((B, S + 1), dtype=attention_mask.dtype)
+
+            src_idx = np.tile(np.arange(S), (B, 1))  # [B, S]
+            valid_mask = src_idx >= first_valid_index[:, None]  # [B, S]
+            tgt_idx = src_idx + 1  # shift right
+            batch_idx = np.tile(np.arange(B)[:, None], (1, S))  # [B, S]
+
+            # flatten valid_positions
+            flat_vals = input_ids[valid_mask]
+            flat_batch = batch_idx[valid_mask]
+            flat_tgt = tgt_idx[valid_mask]
+
+            new_input_ids[flat_batch, flat_tgt] = flat_vals
+            new_attention_mask[flat_batch, flat_tgt] = 1
+
+            insert_pos = first_valid_index
+            new_input_ids[np.arange(B), insert_pos] = bos_token_id
+            new_attention_mask[np.arange(B), insert_pos] = 1
+
+            if need_to_expand:
+                new_input_ids = new_input_ids[0]
+                new_attention_mask = new_attention_mask[0]
+
+            return new_input_ids, new_attention_mask
+
+    def __call__(
+        self,
+        text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
+        images: ImageInput = None,
+        videos: VideoInput = None,
+        **kwargs: Unpack[MolmoAct2ProcessorKwargs],
+    ) -> BatchFeature:
+        """
+
+        Args:
+            text (`str`, `list[str]`, `list[list[str]]`):
+                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
+                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
+                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
+            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `list[PIL.Image.Image]`, `list[np.ndarray]`, `list[torch.Tensor]`):
+                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
+                tensor. Both channels-first and channels-last formats are supported.
+            videos (`dict[str, Any]` or `list[dict[str, Any]]`):
+                The video or batch of videos to be prepared. Each video can be a dictionary with the following keys:
+                - `"frames"`: `np.ndarray` of shape (T, H, W, 3)
+                - `"timestamps"`: `np.ndarray` of shape (T,)
+                - `"sampled_fps"`: `float` (optional)
+                - `"sampling_augmentation"`: `str` (optional)
+            return_tensors (`str` or [`~utils.TensorType`], *optional*):
+                If set, will return tensors of a particular framework. Acceptable values are:
+                - `'tf'`: Return TensorFlow `tf.constant` objects.
+                - `'pt'`: Return PyTorch `torch.Tensor` objects.
+                - `'np'`: Return NumPy `np.ndarray` objects.
+                - `'jax'`: Return JAX `jnp.ndarray` objects.
+
+        Returns:
+            `BatchFeature`: A [`BatchFeature`] with the following fields:
+            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
+            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
+              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not `None`).
+            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
+            - **image_token_pooling** -- Indices of the patches in `image_grids` to pool for each token in `image_tokens`.
+              Returned when `images` is not `None`.
+            - **image_grids** -- Grids of images. Returned when `images` is not `None`.
+            - **image_num_crops** -- Number of crops for each image. Returned when `images` is not `None`.
+            - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
+            - **video_token_pooling** -- Indices of the patches in `video_grids` to pool for each token in `video_tokens`.
+              Returned when `videos` is not `None`.
+            - **video_grids** -- Grids of videos. Returned when `videos` is not `None`.
+        """
+
+        output_kwargs = self._merge_kwargs(
+            MolmoAct2ProcessorKwargs,
+            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
+            **kwargs,
+        )
+
+        if images is not None:
+            image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
+            image_grids = image_inputs["image_grids"]
+        else:
+            image_inputs = {}
+            image_grids = None
+
+        if videos is not None:
+            videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
+            video_grids = videos_inputs["video_grids"]
+            # If user has not requested video metadata, pop it
+            if "return_metadata" not in kwargs:
+                video_metadata = videos_inputs.pop("video_metadata")
+            else:
+                video_metadata = videos_inputs["video_metadata"]
+        else:
+            videos_inputs = {}
+            video_grids = None
+
+        if not isinstance(text, list):
+            text = [text]
+
+        text = text.copy()  # below lines change text in-place
+
+        if image_grids is not None:
+            index = 0
+            for i in range(len(text)):
+                num_images = text[i].count(self.image_placeholder_token)
+                image_grids_i = image_grids[index : index + num_images]
+                for image_grid in image_grids_i:
+                    image_tokens = self.get_image_tokens(image_grid)
+                    image_string = "".join(image_tokens)
+                    text[i] = text[i].replace(self.image_placeholder_token, image_string, 1)
+                index += num_images
+
+        if video_grids is not None:
+            index = 0
+            for i in range(len(text)):
+                num_videos = text[i].count(self.video_placeholder_token)
+                assert num_videos in {0, 1}, "At most one video is supported for now"
+                video_grids_i = video_grids[index : index + num_videos]
+                metadata_i = video_metadata[index : index + num_videos]
+                for video_grid, metadata in zip(video_grids_i, metadata_i):
+                    video_string = self.get_video_string(
+                        video_grid,
+                        metadata.timestamps,
+                    )
+                    text[i] = text[i].replace(self.video_placeholder_token, video_string, 1)
+                index += num_videos
+
+        return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
+        return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", False)
+        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
+
+        input_ids = text_inputs["input_ids"]
+        attention_mask = text_inputs["attention_mask"]
+
+        input_ids = np.array(input_ids)
+        attention_mask = np.array(attention_mask)
+
+        bos = self.tokenizer.bos_token_id or self.tokenizer.eos_token_id
+        input_ids, attention_mask = self.insert_bos(
+            input_ids, attention_mask, bos, self.tokenizer.pad_token_id
+        )
+
+        if return_mm_token_type_ids:
+            image_tokens = np.array(self.image_token_ids).astype(input_ids.dtype)
+            token_type_ids = np.any(input_ids[:, :, None] == image_tokens[None, None, :], axis=-1)
+            text_inputs["token_type_ids"] = token_type_ids.tolist()
+
+        text_inputs["input_ids"] = input_ids.tolist()
+        text_inputs["attention_mask"] = attention_mask.tolist()
+
+        return BatchFeature(
+            data={**text_inputs, **image_inputs, **videos_inputs},
+            tensor_type=return_tensors,
+        )
+
+    def post_process_image_text_to_text(
+        self, generated_outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False, **kwargs
+    ):
+        """
+        Post-process the output of the model to decode the text.
+
+        Args:
+            generated_outputs (`torch.Tensor` or `np.ndarray`):
+                The output of the model `generate` function. The output is expected to be a tensor of shape `(batch_size, sequence_length)`
+                or `(sequence_length,)`.
+            skip_special_tokens (`bool`, *optional*, defaults to `True`):
+                Whether or not to remove special tokens in the output. Argument passed to the tokenizer's `batch_decode` method.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
+                Whether or not to clean up the tokenization spaces. Argument passed to the tokenizer's `batch_decode` method.
+            **kwargs:
+                Additional arguments to be passed to the tokenizer's `batch_decode` method.
+
+        Returns:
+            `list[str]`: The decoded text.
+        """
+        return self.tokenizer.batch_decode(
+            generated_outputs,
+            skip_special_tokens=skip_special_tokens,
+            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+            **kwargs,
+        )
+
+
+MolmoAct2Processor.register_for_auto_class()
diff --git a/src/lerobot/policies/molmoact2/hf_model/video_processing_molmoact2.py b/src/lerobot/policies/molmoact2/hf_model/video_processing_molmoact2.py
new file mode 100644
index 000000000..644d5a691
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/hf_model/video_processing_molmoact2.py
@@ -0,0 +1,997 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ruff: noqa
+
+"""Video processor class for MolmoAct2"""
+
+from functools import partial
+import os
+import warnings
+from contextlib import redirect_stdout
+from io import BytesIO
+from urllib.parse import urlparse
+from typing import Optional, Union
+from collections.abc import Callable
+
+import numpy as np
+import requests
+import einops
+import torch
+import torchvision.transforms
+
+from transformers.image_utils import (
+    IMAGENET_STANDARD_MEAN,
+    IMAGENET_STANDARD_STD,
+    ImageInput,
+    PILImageResampling,
+    SizeDict,
+    validate_kwargs,
+)
+from transformers.video_utils import (
+    VideoInput,
+    is_valid_video,
+    make_batched_videos,
+    make_batched_metadata,
+    VideoMetadata,
+)
+from transformers.processing_utils import Unpack, VideosKwargs
+from transformers.video_processing_utils import BaseVideoProcessor
+from transformers.utils import logging
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.utils import (
+    is_av_available,
+    is_decord_available,
+    is_torchcodec_available,
+    is_yt_dlp_available,
+    TensorType,
+    logging,
+    to_numpy,
+)
+
+
+logger = logging.get_logger(__name__)
+
+MAX_VIDEO_FPS = 8
+
+
+def normalize_image(
+    image: np.ndarray,
+    image_mean: list[float],
+    image_std: list[float],
+) -> np.ndarray:
+    if np.allclose(image_mean, [0.5, 0.5, 0.5]) and np.allclose(image_std, [0.5, 0.5, 0.5]):
+        return image * np.asarray(2.0, dtype=np.float32) - np.asarray(1.0, dtype=np.float32)
+    image -= np.array(image_mean, dtype=np.float32)[None, None, :]
+    image /= np.array(image_std, dtype=np.float32)[None, None, :]
+    return image
+
+
+def resize_image(
+    image: np.ndarray,
+    desired_output_size: list[int],
+    resample: PILImageResampling,
+) -> np.ndarray:
+    if len(image.shape) == 3:
+        is_video = False
+        image = torch.permute(torch.from_numpy(image), [2, 0, 1])
+    else:
+        is_video = True
+        image = torch.permute(torch.from_numpy(image), [0, 3, 1, 2])
+    dtype = image.dtype
+    if torch.is_floating_point(image):
+        in_min = 0.0
+        in_max = 1.0
+        resized = torchvision.transforms.Resize(
+            desired_output_size,
+            resample,
+            antialias=False,
+        )(image)
+        resized = torch.clip(resized, 0.0, 1.0).to(dtype)
+    else:
+        assert image.dtype == torch.uint8, "SigLIP expects float images or uint8 images, but got {}".format(
+            image.dtype
+        )
+        in_min = 0.0
+        in_max = 255.0
+        resized = torchvision.transforms.Resize(
+            desired_output_size,
+            resample,
+            antialias=False,
+        )(image)
+        resized = torch.clip(resized, 0, 255).to(dtype)
+
+    resized = resized.to(torch.float32)
+    resized = (resized - in_min) / (in_max - in_min)
+
+    if is_video:
+        resized = torch.permute(resized, [0, 2, 3, 1]).numpy()
+    else:
+        resized = torch.permute(resized, [1, 2, 0]).numpy()
+
+    return resized
+
+
+def build_resized_image(
+    image: np.ndarray,
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+) -> tuple[np.ndarray, np.ndarray]:
+    resized = resize_image(
+        image,
+        base_image_input_size,
+        resample,
+    )
+    resized = normalize_image(resized, image_mean, image_std)
+    if len(resized.shape) == 3:
+        resized = np.expand_dims(resized, 0)
+    crop_patch_w = base_image_input_size[1] // image_patch_size
+    crop_patch_h = base_image_input_size[0] // image_patch_size
+    resize_idx = np.arange(crop_patch_w * crop_patch_h).reshape([crop_patch_h, crop_patch_w])
+    return resized, resize_idx
+
+
+def batch_pixels_to_patches(array: np.ndarray, patch_size: int) -> np.ndarray:
+    """Reshape images of [n_images, h, w, 3] -> [n_images, n_patches, pixels_per_patch]"""
+    if len(array.shape) == 3:
+        n_crops, h, w = array.shape
+        h_patches = h // patch_size
+        w_patches = w // patch_size
+        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size])
+        array = np.transpose(array, [0, 1, 3, 2, 4])
+        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size])
+        return array
+    else:
+        n_crops, h, w, c = array.shape
+        h_patches = h // patch_size
+        w_patches = w // patch_size
+        array = np.reshape(array, [n_crops, h_patches, patch_size, w_patches, patch_size, c])
+        array = np.transpose(array, [0, 1, 3, 2, 4, 5])
+        array = np.reshape(array, [n_crops, h_patches * w_patches, patch_size * patch_size * c])
+        return array
+
+
+def arange_for_pooling(
+    idx_arr: np.ndarray,
+    pool_h: int,
+    pool_w: int,
+) -> np.ndarray:
+    h_pad = pool_h * ((idx_arr.shape[0] + pool_h - 1) // pool_h) - idx_arr.shape[0]
+    w_pad = pool_w * ((idx_arr.shape[1] + pool_w - 1) // pool_w) - idx_arr.shape[1]
+    idx_arr = np.pad(
+        idx_arr,
+        [[h_pad // 2, (h_pad + 1) // 2], [w_pad // 2, (w_pad + 1) // 2]],
+        mode="constant",
+        constant_values=-1,
+    )
+    return einops.rearrange(idx_arr, "(h dh) (w dw) -> h w (dh dw)", dh=pool_h, dw=pool_w)
+
+
+def image_to_patches_and_grids(
+    image: ImageInput,
+    base_image_input_size: list[int],
+    resample: PILImageResampling,
+    image_mean: list[float],
+    image_std: list[float],
+    image_patch_size: int,
+    image_pooling_w: int,
+    image_pooling_h: int,
+) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
+    """
+    :return image_grids, the shape of each image after pooling
+    :return crops, the image crops to process with the ViT
+    :return pooled_patch_idx, for each patch token in `image_tokens`, the indices of the
+                                patches in `crops` to pool for that token, padded with -1
+    """
+    if isinstance(base_image_input_size, int):
+        base_image_input_size = (base_image_input_size, base_image_input_size)
+
+    pooling_w = image_pooling_w
+    pooling_h = image_pooling_h
+
+    resized, resize_idx = build_resized_image(
+        image,
+        base_image_input_size,
+        resample,
+        image_mean,
+        image_std,
+        image_patch_size,
+    )
+    pooling_idx = arange_for_pooling(resize_idx, pooling_h, pooling_w)
+    h, w = pooling_idx.shape[:2]
+    pooling_idx = pooling_idx.reshape([-1, pooling_h * pooling_w])
+    image_grid = [h, w]
+    return (
+        image_grid,
+        batch_pixels_to_patches(resized, image_patch_size),
+        pooling_idx,
+    )
+
+
+def get_candidate_target_fps(
+    video_fps: int | float,
+    sampling_fps: int | float,
+    max_fps: int | float = MAX_VIDEO_FPS,
+) -> list[float]:
+    """
+    Return the factors of `video_fps` that are multiples of `sampling_fps` and do not exceed `max_fps`.
+
+    Examples:
+        >>> get_candidate_target_fps(video_fps=6, sampling_fps=2)
+        [2.0, 6.0]
+        >>> get_candidate_target_fps(video_fps=5, sampling_fps=1)
+        [1.0, 5.0]
+        >>> get_candidate_target_fps(video_fps=2, sampling_fps=2)
+        [2.0]
+        >>> get_candidate_target_fps(video_fps=5, sampling_fps=2)
+        Traceback (most recent call last):
+            ...
+        ValueError: sampling_fps=2 must divide video_fps=5.
+    """
+    if sampling_fps is None:
+        raise ValueError("sampling_fps must be provided")
+
+    video_fps = int(video_fps)
+    sampling_fps = int(sampling_fps)
+    max_fps = int(max_fps)
+    if video_fps <= 0 or sampling_fps <= 0:
+        raise ValueError(f"video_fps and sampling_fps must be positive (got {video_fps}, {sampling_fps})")
+    if video_fps % sampling_fps != 0:
+        raise ValueError(f"sampling_fps={sampling_fps} must divide video_fps={video_fps}.")
+
+    candidates = []
+    for candidate in range(sampling_fps, video_fps + 1, sampling_fps):
+        if candidate > max_fps:
+            break
+        if video_fps % candidate == 0:
+            candidates.append(float(candidate))
+
+    return candidates
+
+
+def read_video_decord(
+    video_path,
+    sample_timestamps_fn: Callable,
+    **kwargs,
+) -> tuple[np.ndarray, VideoMetadata]:
+    """
+    Decode a video using the Decord backend.
+
+    Args:
+        video_path (`str`):
+            Path to the video file.
+        sample_timestamps_fn (`Callable`):
+            A callable function that will return timestamps at which the video should be sampled.
+
+    Returns:
+        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
+            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
+            - `VideoMetadata` object.
+    """
+    # Lazy import from decord
+    import importlib
+
+    decord = importlib.import_module("decord")
+
+    vr = decord.VideoReader(uri=video_path, ctx=decord.cpu(0))  # decord has problems with gpu
+    video_fps = vr.get_avg_fps()
+    total_num_frames = len(vr)
+    time_stamps = vr.get_frame_timestamp(list(range(len(vr))))
+    duration = time_stamps[-1][1] - time_stamps[0][0]
+
+    metadata = VideoMetadata(
+        total_num_frames=int(total_num_frames),
+        fps=float(video_fps),
+        duration=float(duration),
+        video_backend="decord",
+    )
+
+    target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
+    target_timestamps = np.array(target_timestamps)
+    offset = time_stamps[0, 0]
+
+    ix = np.searchsorted(time_stamps[:, 1], target_timestamps + offset, side="right")
+    ix = np.minimum(ix, len(time_stamps) - 1)
+
+    video = vr.get_batch(ix).asnumpy()
+    metadata.update(
+        {
+            "frames_indices": target_timestamps * video_fps,
+            "height": video.shape[1],
+            "width": video.shape[2],
+        }
+    )
+    return video, metadata
+
+
+def read_video_torchcodec(
+    video_path,
+    sample_timestamps_fn: Callable,
+    **kwargs,
+) -> tuple[np.ndarray, VideoMetadata]:
+    """
+    Decode a video using torchcodec decoder.
+
+    Args:
+        video_path (`str`):
+            Path to the video file.
+        sample_timestamps_fn (`Callable`):
+            A callable function that will return timestamps at which the video should be sampled.
+
+    Returns:
+        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
+            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
+            - `VideoMetadata` object.
+    """
+    # Lazy import torchcodec
+    import importlib
+
+    torchcodec = importlib.import_module("torchcodec")
+
+    decoder = torchcodec.decoders.VideoDecoder(
+        video_path,
+        # Interestingly, `exact` mode takes less time than `approximate` when we load the whole video
+        seek_mode="exact",
+        # Let FFmpeg decide on the number of threads for efficiency
+        num_ffmpeg_threads=0,
+    )
+    # If the first frame starts at > 0, we effectively clip the video starting at that time
+    # since (most) video players would also skip to that time
+    time_offset = decoder.metadata.begin_stream_seconds_from_content
+    # Note this duration does assume we started playing at `time_offset`
+    duration = decoder.metadata.duration_seconds
+
+    metadata = VideoMetadata(
+        total_num_frames=decoder.metadata.num_frames,
+        fps=decoder.metadata.average_fps,
+        duration=duration,
+        video_backend="torchcodec",
+        height=decoder.metadata.height,
+        width=decoder.metadata.width,
+    )
+
+    target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
+
+    # Floating point/rounding issues might cause `target_timestamps` to be very slightly
+    # out-of-bounds, to handle this we sanity check then clip them
+    assert all(x >= 0 for x in target_timestamps)
+    assert all(x < duration + 1e-6 for x in target_timestamps)
+    # 1e-6 padding since torchcodec can throw out-of-bounds errors even if you ask for the
+    # exact boundary value, we should still get the first/last frame anyway
+    max_timestamp = decoder.metadata.end_stream_seconds_from_content - 1e-6
+    min_timestamp = decoder.metadata.begin_stream_seconds_from_content + 1e-6
+    # Note we avoid using numpy ops here to reduce floating precision issues
+    timestamps = [x + time_offset for x in target_timestamps]
+    timestamps = [max(min_timestamp, min(max_timestamp, x)) for x in timestamps]
+
+    video = (
+        decoder.get_frames_played_at(timestamps).data.numpy().transpose(0, 2, 3, 1)
+    )  # Convert to THWC format
+    target_timestamps = np.array(target_timestamps)
+    metadata.frames_indices = target_timestamps * metadata.fps
+
+    return video, metadata
+
+
+def read_video_pyav(
+    video_path,
+    sample_timestamps_fn: Callable,
+    **kwargs,
+) -> tuple[np.ndarray, VideoMetadata]:
+    """
+    Decode a video using the PyAV backend.
+
+    Args:
+        video_path (`str`):
+            Path to the video file.
+        sample_timestamps_fn (`Callable`):
+            A callable function that will return timestamps at which the video should be sampled.
+
+    Returns:
+        tuple[`np.array`, `VideoMetadata`]: A tuple containing:
+            - Numpy array of frames in RGB (shape: [num_frames, height, width, 3]).
+            - `VideoMetadata` object.
+    """
+    # Lazy import from av
+    import importlib
+
+    av = importlib.import_module("av")
+
+    with av.open(video_path) as container:
+        video_stream = container.streams.video[0]
+        fps = video_stream.average_rate or video_stream.guessed_rate
+        it = container.decode(video=0)
+        frames = list(it)
+
+        stream = container.streams.video[0]
+        start = frames[0].pts * stream.time_base
+        container_end = stream.duration
+        if container_end is not None:
+            container_end *= stream.time_base
+        if container_end is None or container_end < frames[-1].pts * stream.time_base:
+            # Some problem with stream duration, so use the frame PTS directly
+            # and guess the duration of the last frame
+            end = frames[-1].pts * stream.time_base + 1 / fps
+        else:
+            end = container_end
+        duration = float(end - start)
+
+        metadata = VideoMetadata(
+            total_num_frames=len(frames),
+            fps=float(fps),
+            duration=float(duration),
+            video_backend="pyav",
+            height=video_stream.height,
+            width=video_stream.width,
+        )
+
+        target_timestamps = sample_timestamps_fn(metadata=metadata, **kwargs)
+        offset = float(start)
+
+        target_timestamps = np.array(target_timestamps)
+        end_time_stamps = np.array([float(frame.pts * stream.time_base) for frame in frames[1:]] + [duration])
+        indices = np.searchsorted(end_time_stamps, target_timestamps + offset, side="right")
+        indices = np.minimum(indices, len(end_time_stamps) - 1)
+
+        video = np.stack(
+            [frames[i].to_ndarray(format="rgb24", channel_last=True) for i in indices],
+            axis=0,
+        )
+
+        metadata.frames_indices = target_timestamps * fps
+
+        return video, metadata
+
+
+VIDEO_DECODERS = {
+    "decord": read_video_decord,
+    "torchcodec": read_video_torchcodec,
+    "pyav": read_video_pyav,
+}
+
+
+def load_video(
+    video: VideoInput,
+    backend: str = "decord",
+    sample_timestamps_fn: Callable | None = None,
+    **kwargs,
+):
+    """
+    Loads `video` to a numpy array.
+
+    Args:
+        video (`VideoInput`):
+            The video to convert to the numpy array format. Can be a link to video or local path.
+        backend (`str`, *optional*, defaults to `"decord"`):
+            The backend to use when loading the video. Can be any of ["decord", "pyav", "torchcodec"]. Defaults to "decord".
+        sample_timestamps_fn (`Callable`):
+            A callable function that will return timestamps at which the video should be sampled.
+    """
+
+    # Early exit if provided an array or `PIL` frames
+    if not isinstance(video, str):
+        metadata = [None] * len(video)
+        return video, metadata
+
+    if urlparse(video).netloc in ["www.youtube.com", "youtube.com"]:
+        if not is_yt_dlp_available():
+            raise ImportError("To load a video from a YouTube URL you have to install `yt_dlp` first.")
+        # Lazy import from yt_dlp
+        import importlib
+
+        yt_dlp = importlib.import_module("yt_dlp")
+
+        buffer = BytesIO()
+        with redirect_stdout(buffer), yt_dlp.YoutubeDL() as f:
+            f.download([video])
+        bytes_obj = buffer.getvalue()
+        file_obj = BytesIO(bytes_obj)
+    elif video.startswith("http://") or video.startswith("https://"):
+        file_obj = BytesIO(requests.get(video, timeout=10).content)
+    elif os.path.isfile(video):
+        file_obj = video
+    else:
+        raise TypeError(
+            "Incorrect format used for video. Should be a URL linking to a video or a local path."
+        )
+
+    # URLs can also be loaded with decord, but not with cv2/torchvision:
+    # both of those fail when given URL links
+    video_is_url = video.startswith("http://") or video.startswith("https://")
+    if video_is_url and backend == "opencv":
+        raise ValueError("If you are trying to load a video from URL, you cannot use 'opencv' as backend")
+
+    if (
+        (not is_decord_available() and backend == "decord")
+        or (not is_torchcodec_available() and backend == "torchcodec")
+        or (not is_av_available() and backend == "pyav")
+    ):
+        raise ImportError(
+            f"You chose backend={backend} for loading the video but the required library is not found in your environment. "
+            f"Make sure to install {backend} before loading the video."
+        )
+
+    video_decoder = VIDEO_DECODERS[backend]
+    video, metadata = video_decoder(file_obj, sample_timestamps_fn, **kwargs)
+    return video, metadata
+
+
+def get_target_fps(
+    video_fps: float,
+    max_frames: int,
+    total_frames: int,
+    frame_sample_mode: str,
+    candidate_target_fps: tuple[float, ...],
+) -> float:
+    """
+    Get the target fps that best spans the video and has the most frames sampled.
+    """
+    num_frames_sampled = 0
+    selected_target_fps = None
+    for target_fps in candidate_target_fps:
+        step_size = max(int(video_fps / target_fps), 1)
+        num_frames_sampled_at_fps = int(total_frames / step_size)
+        if num_frames_sampled == 0:
+            if "uniform" in frame_sample_mode:
+                if num_frames_sampled_at_fps > max_frames:
+                    break
+            selected_target_fps = target_fps
+            num_frames_sampled = num_frames_sampled_at_fps
+
+        else:
+            # the candidate sampling fps increases so frame count can't decrease
+            assert num_frames_sampled <= num_frames_sampled_at_fps
+            if num_frames_sampled_at_fps > max_frames:
+                # choose the sampling fps that spans the video
+                continue
+
+            elif num_frames_sampled_at_fps > num_frames_sampled:
+                # both are less than max_frames, choose the one with higher density of frames sampled
+                selected_target_fps = target_fps
+                num_frames_sampled = num_frames_sampled_at_fps
+    return selected_target_fps
+
+
+def get_frame_times_and_chosen_fps(selected_target_fps, total_frames, max_frames, video_fps):
+    if selected_target_fps is None:
+        frame_indices = np.linspace(0, total_frames, max_frames, endpoint=False, dtype=int)
+    else:
+        step_size = max(int(video_fps / selected_target_fps), 1)
+        frame_indices = np.arange(0, total_frames, step_size)
+    if len(frame_indices) > max_frames:
+        frame_indices = frame_indices[:max_frames]
+    return selected_target_fps, frame_indices
+
+
+class MolmoAct2VideoProcessorKwargs(VideosKwargs, total=False):
+    patch_size: int | None
+    pooling_size: list[int] | None
+    frame_sample_mode: str | None
+    max_fps: int | None
+    sampling_fps: int | None
+
+
+class MolmoAct2VideoProcessor(BaseVideoProcessor):
+    resample = PILImageResampling.BILINEAR
+    size = {"height": 378, "width": 378}
+    image_mean = IMAGENET_STANDARD_MEAN
+    image_std = IMAGENET_STANDARD_STD
+    do_resize = True
+    do_rescale = True
+    do_normalize = True
+    do_convert_rgb = True
+    patch_size = 14
+    pooling_size = [3, 3]
+    do_sample_frames = True
+    frame_sample_mode = "uniform_last_frame"
+    max_fps = 2
+    sampling_fps = 2
+    valid_kwargs = MolmoAct2VideoProcessorKwargs
+    model_input_names = ["pixel_values_videos", "video_token_pooling", "video_grids"]
+
+    def __init__(self, **kwargs: Unpack[MolmoAct2VideoProcessorKwargs]):
+        super().__init__(**kwargs)
+        if self.size is not None and (
+            self.size.get("height", None) is None or self.size.get("width", None) is None
+        ):
+            raise ValueError("size must contain 'height' and 'width' keys.")
+
+    def _further_process_kwargs(
+        self,
+        size: SizeDict | None = None,
+        **kwargs,
+    ) -> dict:
+        """
+        Update kwargs that need further processing before being validated.
+        Can be overridden by subclasses to customize the processing of kwargs.
+        """
+        if size is not None and ("height" not in size or "width" not in size):
+            raise ValueError("size must contain 'height' and 'width' keys.")
+
+        return super()._further_process_kwargs(size=size, **kwargs)
+
+    def sample_times(
+        self,
+        metadata: VideoMetadata,
+        frame_sample_mode: str,
+        num_frames: int,
+        max_fps: int | None = None,
+        sampling_fps: int | None = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Time-based sampling, used when the video is decoded from a path or URL
+        Args:
+            metadata (`VideoMetadata`):
+                Metadata of the video containing information about total duration, fps and total number of frames.
+            frame_sample_mode (`str`, *optional*):
+                Mode to sample frames. Defaults to `self.frame_sample_mode`.
+            num_frames (`int`, *optional*):
+                Maximum number of frames to sample. Defaults to `self.num_frames`.
+            max_fps (`int`, *optional*):
+                Maximum frames per second to sample.
+            sampling_fps (`int`, *optional*):
+                Sampling frames per second. Defaults to `self.sampling_fps`.
+                Used when `frame_sample_mode` is `"fps"`.
+        """
+        frame_sample_mode = frame_sample_mode or self.frame_sample_mode
+        num_frames = num_frames or self.num_frames
+        sampling_fps = sampling_fps or self.sampling_fps
+
+        duration = metadata.duration or metadata.total_num_frames / metadata.fps
+        if frame_sample_mode == "fps":
+            candidate_target_fps = get_candidate_target_fps(metadata.fps, sampling_fps)
+            # Try larger and larger FPSs until we hit one that can't span the video
+            target_fps = candidate_target_fps[0]
+            for candidate_fps in candidate_target_fps[1:]:
+                if num_frames / candidate_fps < duration:
+                    break
+                target_fps = candidate_fps
+            times = np.arange(0, num_frames) / target_fps
+            times = times[times < duration]
+            return times
+        elif frame_sample_mode == "uniform_last_frame":
+            if max_fps is not None:
+                max_duration = (num_frames - 1) / max_fps  # -1 to include the last frame
+                if max_duration < duration:
+                    times = np.linspace(0, duration, num=num_frames, endpoint=True, dtype=np.float64)
+                else:
+                    times = np.arange(0.0, stop=duration, step=1 / max_fps)
+                    times = np.concatenate([times, [duration]], axis=0)
+                    assert len(times) <= num_frames
+            else:
+                times = np.linspace(0, duration, num=num_frames, endpoint=True, dtype=np.float64)
+            return times
+        else:
+            raise NotImplementedError(frame_sample_mode)
+
+    def sample_frames(
+        self,
+        metadata: VideoMetadata,
+        frame_sample_mode: str | None = None,
+        num_frames: int | None = None,
+        max_fps: int | None = None,
+        sampling_fps: int | None = None,
+        **kwargs,
+    ) -> np.ndarray:
+        """
+        Frame-based sampling if an array video is passed
+        Args:
+            metadata (`VideoMetadata`):
+                Metadata of the video containing information about total duration, fps and total number of frames.
+            frame_sample_mode (`str`, *optional*):
+                Mode to sample frames. Defaults to `self.frame_sample_mode`.
+            num_frames (`int`, *optional*):
+                Maximum number of frames to sample. Defaults to `self.num_frames`.
+            max_fps (`int`, *optional*):
+                Maximum frames per second to sample.
+            sampling_fps (`int`, *optional*):
+                Sampling frames per second. Defaults to `self.sampling_fps`.
+                Used when `frame_sample_mode` is `"fps"`.
+        """
+        frame_sample_mode = frame_sample_mode or self.frame_sample_mode
+        num_frames = num_frames or self.num_frames
+        sampling_fps = sampling_fps or self.sampling_fps
+
+        total_num_frames = metadata.total_num_frames
+        if frame_sample_mode == "uniform_last_frame" and max_fps is not None:
+            duration = total_num_frames / metadata.fps
+            if total_num_frames <= 2:
+                return np.arange(total_num_frames).astype(int)
+            if duration > (num_frames - 1) / max_fps:  # -1 to include the last frame
+                # uniform fallback
+                indices = np.linspace(
+                    0,
+                    total_num_frames - 1,
+                    num=min(num_frames, total_num_frames),
+                    endpoint=True,
+                ).astype(int)
+                return indices
+            else:
+                float_indices = np.arange(
+                    0.0,
+                    stop=total_num_frames - 1,
+                    step=float(metadata.fps / max_fps),
+                )
+                if np.round(float_indices[-1]) != total_num_frames - 1:
+                    float_indices = np.concatenate([float_indices, [total_num_frames - 1]], axis=0)
+                indices = np.round(float_indices).astype(int)
+                assert indices[-1] < total_num_frames
+                assert len(float_indices) <= num_frames
+                return indices
+        elif frame_sample_mode == "uniform_last_frame":
+            indices = np.linspace(
+                0,
+                total_num_frames - 1,
+                num=min(num_frames, total_num_frames),
+                endpoint=True,
+            ).astype(int)
+            return indices
+        elif frame_sample_mode == "fps":
+            candidate_target_fps = get_candidate_target_fps(metadata.fps, sampling_fps)
+            selected_target_fps = get_target_fps(
+                metadata.fps,
+                num_frames,
+                total_num_frames,
+                frame_sample_mode,
+                candidate_target_fps,
+            )
+            _, indices = get_frame_times_and_chosen_fps(
+                selected_target_fps,
+                total_num_frames,
+                num_frames,
+                metadata.fps,
+            )
+            return indices
+        else:
+            raise NotImplementedError(frame_sample_mode)
+
+    def fetch_videos(self, video_url_or_urls: str | list[str] | list[list[str]], sample_timestamps_fn=None):
+        """
+        Convert a single or a list of urls into the corresponding `np.array` objects.
+
+        If a single url is passed, the return value will be a single object. If a list is passed a list of objects is
+        returned.
+        """
+        if (not is_decord_available()) and (not is_torchcodec_available()) and (not is_av_available()):
+            raise ImportError(
+                "MolmoAct2VideoProcessor requires `decord`, `torchcodec`, or `av` to be installed."
+            )
+
+        if is_decord_available():
+            backend = "decord"
+        elif is_torchcodec_available():
+            warnings.warn(
+                "`decord` is not installed and cannot be used to decode the video by default. "
+                "Falling back to `torchcodec`."
+            )
+            backend = "torchcodec"
+        else:
+            warnings.warn(
+                "`decord` and `torchcodec` are not installed and cannot be used to decode the video. "
+                "Falling back to `PyAV`."
+            )
+            backend = "pyav"
+
+        if isinstance(video_url_or_urls, list):
+            return list(
+                zip(
+                    *[
+                        self.fetch_videos(x, sample_timestamps_fn=sample_timestamps_fn)
+                        for x in video_url_or_urls
+                    ]
+                )
+            )
+        else:
+            return load_video(video_url_or_urls, backend=backend, sample_timestamps_fn=sample_timestamps_fn)
+
+    def _decode_and_sample_videos(
+        self,
+        videos: VideoInput,
+        video_metadata: VideoMetadata | dict,
+        do_sample_frames: bool | None = None,
+        sample_indices_fn: Callable | None = None,
+        sample_timestamps_fn: Callable | None = None,
+    ):
+        """
+        Decode input videos and sample frames if needed.
+        """
+        videos = make_batched_videos(videos)
+        video_metadata = make_batched_metadata(videos, video_metadata=video_metadata)
+
+        # Frame-based sampling if an array video is passed
+        # Otherwise, time-based sampling with decoding
+        if is_valid_video(videos[0]) and do_sample_frames:
+            assert video_metadata[0].fps is not None, "FPS must be provided for video input"
+            sampled_videos = []
+            sampled_metadata = []
+            for video, metadata in zip(videos, video_metadata):
+                indices = sample_indices_fn(metadata=metadata)
+                metadata.frames_indices = indices
+                sampled_videos.append(video[indices])
+                sampled_metadata.append(metadata)
+            videos = sampled_videos
+            video_metadata = sampled_metadata
+        elif not is_valid_video(videos[0]):
+            if sample_indices_fn is None:
+                logger.warning(
+                    "do_sample_frames is False, but video array is not provided: "
+                    "Will decode the video and sample frames using MolmoAct2's default sampling mode"
+                )
+            if isinstance(videos[0], list):
+                raise ValueError("A list of images is not supported for video input!")
+            else:
+                videos, video_metadata = self.fetch_videos(videos, sample_timestamps_fn=sample_timestamps_fn)
+
+        return videos, video_metadata
+
+    def _prepare_input_videos(
+        self,
+        videos: VideoInput,
+        **kwargs,
+    ) -> list[np.ndarray]:
+        processed_videos = [to_numpy(video) for video in videos]
+        return processed_videos
+
+    def preprocess(
+        self,
+        videos: VideoInput,
+        **kwargs: Unpack[MolmoAct2VideoProcessorKwargs],
+    ) -> BatchFeature:
+        validate_kwargs(
+            captured_kwargs=kwargs.keys(),
+            valid_processor_keys=list(self.valid_kwargs.__annotations__.keys()) + ["return_tensors"],
+        )
+
+        # Set default kwargs from self. This ensures that if a kwarg is not provided
+        # by the user, it gets its default value from the instance, or is set to None.
+        for kwarg_name in self.valid_kwargs.__annotations__:
+            kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))
+
+        do_sample_frames = kwargs.pop("do_sample_frames")
+        video_metadata = kwargs.pop("video_metadata")
+
+        sample_indices_fn = partial(self.sample_frames, **kwargs) if do_sample_frames else None
+        sample_timestamps_fn = partial(self.sample_times, **kwargs)
+        videos, video_metadata = self._decode_and_sample_videos(
+            videos,
+            video_metadata=video_metadata,
+            do_sample_frames=do_sample_frames,
+            sample_indices_fn=sample_indices_fn,
+            sample_timestamps_fn=sample_timestamps_fn,
+        )
+        videos = self._prepare_input_videos(videos=videos)
+
+        kwargs = self._further_process_kwargs(**kwargs)
+
+        return_metadata = kwargs.pop("return_metadata")
+        preprocessed_videos = self._preprocess(videos=videos, **kwargs)
+        if return_metadata:
+            preprocessed_videos["video_metadata"] = video_metadata
+        return preprocessed_videos
+
+    def _preprocess(
+        self,
+        videos: list[np.ndarray],
+        size: SizeDict | None = None,
+        resample: PILImageResampling | None = None,
+        image_mean: float | list[float] | None = None,
+        image_std: float | list[float] | None = None,
+        do_convert_rgb: bool | None = None,
+        patch_size: int | None = None,
+        pooling_size: list[int] | None = None,
+        return_tensors: str | TensorType | None = None,
+        **kwargs,
+    ) -> BatchFeature:
+        """
+        Preprocess a video for the model.
+        Args:
+            videos (`VideoInput`):
+                Video to preprocess.
+            size (`SizeDict`, *optional*, defaults to `self.size`):
+                Size of the image after resizing.
+            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
+                Resampling filter to use when resizing the image. This can be one of the enum `PILImageResampling`. Only
+                has an effect if `do_resize` is set to `True`.
+            image_mean (`float` or `list[float]`, *optional*, defaults to `self.image_mean`):
+                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
+            image_std (`float` or `list[float]`, *optional*, defaults to `self.image_std`):
+                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
+                `True`.
+            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
+                Whether to convert the image to RGB.
+            patch_size (`int`, *optional*, defaults to `self.patch_size`):
+                The spatial patch size of the vision encoder.
+            pooling_size (`list[int]`, *optional*, defaults to `self.pooling_size`):
+                The pooling size of the vision adapter.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                - Unset: Return a list of `np.ndarray`.
+                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+
+        Returns:
+            A `BatchFeature` containing the following keys:
+                - `pixel_values_videos`: The preprocessed videos.
+                - `video_token_pooling`: The indices of the patches in `crops` to pool for each token in `video_tokens`.
+                - `video_grids`: The video grids.
+        """
+        if size.height is None or size.width is None:
+            raise ValueError("size must contain 'height' and 'width' keys.")
+
+        base_image_input_size = [size.height, size.width]
+
+        resample = resample or self.resample
+        image_mean = image_mean or self.image_mean
+        image_std = image_std or self.image_std
+        do_convert_rgb = do_convert_rgb or self.do_convert_rgb
+
+        patch_size = patch_size or self.patch_size
+        pooling_size = pooling_size or self.pooling_size
+
+        image_pooling_h, image_pooling_w = pooling_size
+
+        batch_grids = []
+        batch_crops = []
+        batch_pooled_patches_idx = []
+
+        for video in videos:
+            all_crops = []
+            pooled_patches_idx = []
+
+            for frame in video:
+                image_grid, crops, pooled_idx = image_to_patches_and_grids(
+                    frame,
+                    base_image_input_size,
+                    resample,
+                    image_mean,
+                    image_std,
+                    patch_size,
+                    image_pooling_w,
+                    image_pooling_h,
+                )
+                offset = sum(np.prod(x.shape[:2]) for x in all_crops)
+                pooled_idx_with_offset = np.where(pooled_idx >= 0, pooled_idx + offset, pooled_idx)
+                pooled_patches_idx.append(pooled_idx_with_offset)
+                all_crops.append(crops)
+
+            video_grid = np.array([len(video), image_grid[0], image_grid[1]])
+            all_crops = np.concatenate(all_crops, 0)
+            pooled_patches_idx = np.concatenate(pooled_patches_idx, 0)
+
+            batch_grids.append(video_grid)
+            batch_crops.append(all_crops)
+            batch_pooled_patches_idx.append(pooled_patches_idx)
+
+        video_grids = np.stack(batch_grids, 0)
+        pixel_values_videos = np.concatenate(batch_crops, 0)
+        video_token_pooling = np.concatenate(batch_pooled_patches_idx, 0)
+
+        data = dict(
+            pixel_values_videos=pixel_values_videos,
+            video_token_pooling=video_token_pooling,
+            video_grids=video_grids,
+        )
+
+        return BatchFeature(data, tensor_type=return_tensors)
+
+
+MolmoAct2VideoProcessor.register_for_auto_class()
diff --git a/src/lerobot/policies/molmoact2/modeling_molmoact2.py b/src/lerobot/policies/molmoact2/modeling_molmoact2.py
new file mode 100644
index 000000000..f86be0904
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/modeling_molmoact2.py
@@ -0,0 +1,1551 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import json
+import os
+import types
+from collections import deque
+from contextlib import nullcontext
+from typing import TYPE_CHECKING, Any
+
+import numpy as np
+import torch
+import torch.nn.functional as F  # noqa: N812
+from safetensors.torch import load_file as load_safetensors_file
+from torch import Tensor
+from torch.distributions import Beta
+
+from lerobot.policies.pretrained import PreTrainedPolicy
+from lerobot.utils.constants import ACTION
+from lerobot.utils.import_utils import _scipy_available, _transformers_available, require_package
+
+from ..rtc.modeling_rtc import RTCProcessor
+from .configuration_molmoact2 import MolmoAct2Config, _hf_token, _resolve_checkpoint_location
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers.utils import SAFE_WEIGHTS_INDEX_NAME, SAFE_WEIGHTS_NAME
+
+    from .hf_model.configuration_molmoact2 import MolmoAct2Config as HFMolmoAct2Config
+    from .hf_model.modeling_molmoact2 import MolmoAct2ForConditionalGeneration
+else:
+    SAFE_WEIGHTS_INDEX_NAME = "model.safetensors.index.json"
+    SAFE_WEIGHTS_NAME = "model.safetensors"
+    HFMolmoAct2Config = None
+    MolmoAct2ForConditionalGeneration = None
+
+if TYPE_CHECKING or (_transformers_available and _scipy_available):
+    from .hf_model.action_tokenizer import UniversalActionProcessor
+else:
+    UniversalActionProcessor = None
+
+_MODEL_INPUT_KEYS = {
+    "input_ids",
+    "pixel_values",
+    "image_token_pooling",
+    "image_grids",
+    "image_num_crops",
+    "pixel_values_videos",
+    "video_token_pooling",
+    "video_grids",
+    "attention_mask",
+    "position_ids",
+    "past_key_values",
+    "token_type_ids",
+    "inputs_embeds",
+}
+
+
+def _strict_load_safetensors_weights(model: torch.nn.Module, checkpoint_location: str) -> None:
+    index_path = os.path.join(checkpoint_location, SAFE_WEIGHTS_INDEX_NAME)
+    single_file_path = os.path.join(checkpoint_location, SAFE_WEIGHTS_NAME)
+    if os.path.isfile(index_path):
+        with open(index_path, encoding="utf-8") as f:
+            index = json.load(f)
+        weight_map = index["weight_map"]
+        loaded_keys = set(weight_map)
+        model_keys = set(model.state_dict())
+        missing_keys = sorted(model_keys - loaded_keys)
+        unexpected_keys = sorted(loaded_keys - model_keys)
+        if missing_keys or unexpected_keys:
+            message = ["MolmoAct2 safetensors do not match the local model implementation."]
+            if missing_keys:
+                message.append(f"Missing keys: {missing_keys[:8]}")
+            if unexpected_keys:
+                message.append(f"Unexpected keys: {unexpected_keys[:8]}")
+            raise RuntimeError(" ".join(message))
+        for shard_file in sorted(set(weight_map.values())):
+            state_dict = load_safetensors_file(os.path.join(checkpoint_location, shard_file), device="cpu")
+            model.load_state_dict(state_dict, strict=False)
+            del state_dict
+        return
+    if os.path.isfile(single_file_path):
+        state_dict = load_safetensors_file(single_file_path, device="cpu")
+        model.load_state_dict(state_dict, strict=True)
+        return
+    raise FileNotFoundError(
+        f"MolmoAct2 checkpoint at {checkpoint_location} must contain {SAFE_WEIGHTS_NAME} "
+        f"or {SAFE_WEIGHTS_INDEX_NAME}."
+    )
+
+
+def _torch_dtype(dtype: str) -> torch.dtype:
+    if dtype == "float32":
+        return torch.float32
+    if dtype == "bfloat16":
+        return torch.bfloat16
+    if dtype == "float16":
+        return torch.float16
+    raise ValueError(f"Unsupported dtype: {dtype}")
+
+
+def _sample_beta_timesteps(
+    *,
+    batch_size: int,
+    device: torch.device,
+    cutoff: float,
+    time_offset: float,
+    time_scale: float,
+    alpha: float,
+    beta: float,
+) -> Tensor:
+    if cutoff < time_offset:
+        raise ValueError(f"flow-matching cutoff must be >= time_offset, got {cutoff} < {time_offset}")
+    if time_scale <= 0:
+        raise ValueError(f"flow-matching time_scale must be > 0, got {time_scale}")
+    upper = min(cutoff, time_offset + time_scale)
+    dist = Beta(torch.tensor(alpha, device=device), torch.tensor(beta, device=device))
+    samples = dist.sample((batch_size,))
+    scale = upper - time_offset
+    if scale == 0:
+        return torch.full((batch_size,), time_offset, device=device, dtype=samples.dtype)
+    return time_offset + scale * samples
+
+
+class MolmoAct2Policy(PreTrainedPolicy):
+    config_class = MolmoAct2Config
+    name = "molmoact2"
+
+    def __init__(
+        self,
+        config: MolmoAct2Config,
+        *inputs,
+        dataset_stats: dict[str, dict[str, Tensor]] | None = None,
+        dataset_meta: Any | None = None,
+        **kwargs,
+    ):
+        super().__init__(config, *inputs, **kwargs)
+        self.config.apply_norm_tag_metadata()
+        self.config.validate_features()
+        del inputs, kwargs, dataset_stats, dataset_meta
+        self._checkpoint_action_mode = self.config.saved_policy_action_mode()
+        self._action_queue: deque[Tensor] = deque(maxlen=self.config.n_action_steps)
+        self._rollout_action_generator: torch.Generator | None = None
+        self._rollout_task_key: tuple[Any, ...] | None = None
+        self._rollout_index_for_task = -1
+        self.rtc_processor: RTCProcessor | None = None
+        self.action_tokenizer: Any | None = None
+        self._load_hf_model()
+        self.config.validate_inference_action_mode(self._checkpoint_action_mode)
+        if self.config.enable_lora_vlm:
+            self._apply_lora_adapters()
+        self.init_rtc_processor()
+
+    def _load_hf_model(self) -> None:
+        require_package("transformers", extra="molmoact2")
+
+        checkpoint_location = _resolve_checkpoint_location(
+            self.config.checkpoint_path,
+            revision=self.config.checkpoint_revision,
+            force_download=bool(self.config.checkpoint_force_download),
+        )
+        model_dtype = _torch_dtype(self.config.model_dtype)
+        if HFMolmoAct2Config is None or MolmoAct2ForConditionalGeneration is None:
+            raise RuntimeError("transformers is required to load MolmoAct2 checkpoints.")
+        hf_config = HFMolmoAct2Config.from_pretrained(
+            checkpoint_location,
+            token=_hf_token(),
+        )
+        self.model = MolmoAct2ForConditionalGeneration.from_pretrained(
+            checkpoint_location,
+            config=hf_config,
+            dtype=model_dtype,
+            low_cpu_mem_usage=True,
+            token=_hf_token(),
+        )
+        # Keep Hub loading limited to local code plus safetensors, and verify the
+        # local implementation exactly matches the checkpoint key space.
+        _strict_load_safetensors_weights(self.model, checkpoint_location)
+        hf_max_action_dim = int(getattr(self.model.config, "max_action_dim", -1))
+        if hf_max_action_dim != int(self.config.expected_max_action_dim):
+            raise ValueError(
+                "MolmoAct2 checkpoint max_action_dim mismatch: "
+                f"checkpoint={hf_max_action_dim}, expected={self.config.expected_max_action_dim}."
+            )
+        if hf_max_action_dim != 32:
+            raise ValueError(
+                f"MolmoAct2 released checkpoints must have max_action_dim=32, got {hf_max_action_dim}."
+            )
+
+        if not hasattr(self.model.config, "max_action_horizon"):
+            raise ValueError("MolmoAct2 HF checkpoints must define `max_action_horizon`.")
+        self._override_loaded_max_action_horizon(int(self.config.chunk_size))
+
+        if not hasattr(self.model.config, "action_mode"):
+            raise ValueError(
+                "MolmoAct2 HF checkpoints must define `action_mode`. If this is a released "
+                "MolmoAct2 checkpoint, refresh the local Hub cache with "
+                "`policy.checkpoint_force_download=true` after the updated files are pushed."
+            )
+        checkpoint_action_mode = str(self.model.config.action_mode)
+        self.config.validate_checkpoint_action_mode(
+            checkpoint_action_mode,
+            has_action_expert=bool(getattr(self.model.config, "add_action_expert", False)),
+        )
+
+        if self.config.freeze_embedding:
+            self._freeze_input_embeddings()
+        if self.config.train_action_expert_only:
+            self._freeze_non_action_expert_parameters()
+        if self.config.gradient_checkpointing:
+            self._enable_gradient_checkpointing()
+        self.train(self.training)
+
+    def reset(self) -> None:
+        self._action_queue = deque(maxlen=self.config.n_action_steps)
+        self._rollout_action_generator = None
+
+    def _set_inference_cuda_graph_enabled(self, enabled: bool) -> None:
+        if not hasattr(self, "model"):
+            return
+        hf_model = self._hf_model()
+        enabled = bool(enabled and getattr(self.config, "enable_inference_cuda_graph", True))
+        managers = [
+            getattr(self._backbone(), "action_cuda_graph_manager", None),
+            getattr(hf_model, "action_cuda_graph_manager", None),
+            getattr(hf_model, "depth_decode_cuda_graph_manager", None),
+        ]
+        seen: set[int] = set()
+        for manager in managers:
+            if manager is None or id(manager) in seen:
+                continue
+            seen.add(id(manager))
+            set_enabled = getattr(manager, "set_enabled", None)
+            if callable(set_enabled):
+                set_enabled(enabled)
+
+    def init_rtc_processor(self) -> None:
+        self.rtc_processor = None
+        if self.config.rtc_config is not None:
+            self.rtc_processor = RTCProcessor(self.config.rtc_config)
+
+    def _rtc_enabled(self) -> bool:
+        return self.config.rtc_config is not None and self.config.rtc_config.enabled
+
+    def _action_expert(self) -> torch.nn.Module:
+        return self._backbone()._require_action_expert()
+
+    def _enable_gradient_checkpointing(self) -> None:
+        enable_gradient_checkpointing = getattr(self._hf_model(), "gradient_checkpointing_enable", None)
+        if callable(enable_gradient_checkpointing):
+            try:
+                enable_gradient_checkpointing(gradient_checkpointing_kwargs={"use_reentrant": False})
+            except TypeError:
+                enable_gradient_checkpointing()
+        else:
+            transformer = getattr(self._backbone(), "transformer", None)
+            if transformer is None:
+                raise RuntimeError("gradient_checkpointing=true, but MolmoAct2 exposes no text transformer.")
+            transformer.gradient_checkpointing = True
+
+        transformer = getattr(self._backbone(), "transformer", None)
+        if transformer is not None:
+            transformer.gradient_checkpointing = True
+        vision_backbone = getattr(self._backbone(), "vision_backbone", None)
+        if vision_backbone is not None:
+            vision_backbone.gradient_checkpointing = True
+
+    def _freeze_non_action_expert_parameters(self) -> None:
+        trainable_params = 0
+        for name, param in self.named_parameters():
+            param.requires_grad = "action_expert" in name
+            if param.requires_grad:
+                trainable_params += param.numel()
+        if trainable_params == 0:
+            raise RuntimeError("train_action_expert_only=true, but no action_expert parameters were found.")
+
+    def _unfreeze_action_expert_parameters(self) -> None:
+        trainable_params = 0
+        for name, param in self.named_parameters():
+            if "action_expert" in name:
+                param.requires_grad_(True)
+                trainable_params += param.numel()
+        if trainable_params == 0:
+            raise RuntimeError("enable_lora_vlm=true, but no action_expert parameters were found.")
+
+    def train(self, mode: bool = True):
+        super().train(mode)
+        if getattr(self.config, "train_action_expert_only", False) and hasattr(self, "model"):
+            self._hf_model().eval()
+            self._action_expert().train(mode)
+        self._set_inference_cuda_graph_enabled(not mode)
+        return self
+
+    def _freeze_input_embeddings(self) -> None:
+        embedding_modules: list[torch.nn.Module] = []
+        seen_module_ids: set[int] = set()
+        hf_model = self._hf_model()
+        for module in (hf_model, self._backbone()):
+            get_input_embeddings = getattr(module, "get_input_embeddings", None)
+            if not callable(get_input_embeddings):
+                continue
+            embeddings = get_input_embeddings()
+            if embeddings is None or id(embeddings) in seen_module_ids:
+                continue
+            embedding_modules.append(embeddings)
+            seen_module_ids.add(id(embeddings))
+
+        if not embedding_modules:
+            raise RuntimeError("freeze_embedding=true, but MolmoAct2 checkpoint exposes no input embeddings.")
+
+        lm_head = getattr(hf_model, "lm_head", None)
+        lm_head_params = {id(param) for param in lm_head.parameters()} if lm_head is not None else set()
+        embedding_params = [param for embeddings in embedding_modules for param in embeddings.parameters()]
+        if any(id(param) in lm_head_params for param in embedding_params):
+            raise RuntimeError(
+                "freeze_embedding=true would also freeze lm_head because input embeddings and lm_head "
+                "share parameters in this checkpoint."
+            )
+        for param in embedding_params:
+            param.requires_grad = False
+
+    def get_optim_params(self) -> list[dict[str, Any]]:
+        vit_params: list[Tensor] = []
+        connector_params: list[Tensor] = []
+        action_expert_params: list[Tensor] = []
+        vlm_params: list[Tensor] = []
+        for name, param in self.named_parameters():
+            if not param.requires_grad:
+                continue
+            if "action_expert" in name:
+                action_expert_params.append(param)
+            elif any(part in name for part in ("image_pooling_2d", "image_projector")):
+                connector_params.append(param)
+            elif any(part in name for part in ("vision", "image_encoder", "vit")):
+                vit_params.append(param)
+            elif any(part in name for part in ("multi_modal_projector", "connector", "mm_projector")):
+                connector_params.append(param)
+            else:
+                vlm_params.append(param)
+
+        vlm_lr = 5e-5 if self.config.enable_lora_vlm else self.config.optimizer_lr
+        vit_lr = 5e-5 if self.config.enable_lora_vlm else self.config.optimizer_vit_lr
+        connector_lr = 5e-5 if self.config.enable_lora_vlm else self.config.optimizer_connector_lr
+
+        groups: list[dict[str, Any]] = []
+        if vlm_params:
+            groups.append({"params": vlm_params, "lr": vlm_lr})
+        if vit_params:
+            groups.append({"params": vit_params, "lr": vit_lr})
+        if connector_params:
+            groups.append({"params": connector_params, "lr": connector_lr})
+        if action_expert_params:
+            groups.append({"params": action_expert_params, "lr": self.config.optimizer_action_expert_lr})
+        return groups
+
+    def _model_inputs(self, batch: dict[str, Tensor]) -> dict[str, Tensor]:
+        compute_dtype = _torch_dtype(self.config.model_dtype)
+        return {
+            key: value.to(dtype=compute_dtype) if value.is_floating_point() else value
+            for key, value in batch.items()
+            if key in _MODEL_INPUT_KEYS and value is not None
+        }
+
+    def _output_action_dim(self, batch: dict[str, Tensor]) -> int:
+        action_feature = self.config.output_features.get(ACTION)
+        if action_feature is not None and action_feature.shape:
+            action_dim = int(action_feature.shape[0])
+            if action_dim > 0:
+                return action_dim
+
+        action_dim_is_pad = batch.get("action_dim_is_pad")
+        if action_dim_is_pad is not None:
+            valid_counts = (~action_dim_is_pad.to(dtype=torch.bool)).sum(dim=-1).reshape(-1)
+            if bool((valid_counts == valid_counts[0]).all()) and int(valid_counts[0]) > 0:
+                return int(valid_counts[0])
+
+        raise RuntimeError("MolmoAct2 inference requires a positive action dimension in output_features.")
+
+    def _hf_model(self):
+        base_model = getattr(self.model, "base_model", None)
+        wrapped_model = getattr(base_model, "model", None) if base_model is not None else None
+        return wrapped_model if wrapped_model is not None else self.model
+
+    def _backbone(self):
+        return self._hf_model().model
+
+    def _override_loaded_max_action_horizon(self, action_horizon: int) -> None:
+        if action_horizon < 1:
+            raise ValueError(f"action_horizon must be >= 1, got {action_horizon}.")
+        hf_model = self._hf_model()
+        for cfg in (getattr(hf_model, "config", None), getattr(self._backbone(), "config", None)):
+            if cfg is not None:
+                cfg.max_action_horizon = int(action_horizon)
+
+    def _generation_action_horizon(self) -> int:
+        chunk_size = getattr(self.config, "chunk_size", None)
+        if chunk_size is not None:
+            return int(chunk_size)
+        hf_model = self._hf_model()
+        for cfg in (getattr(hf_model, "config", None), getattr(self._backbone(), "config", None)):
+            if cfg is None:
+                continue
+            value = getattr(cfg, "max_action_horizon", None)
+            if value is not None:
+                return int(value)
+        raise RuntimeError("MolmoAct2 could not resolve an action generation horizon.")
+
+    @staticmethod
+    def _mask_discrete_action_spans(
+        *,
+        input_ids: Tensor,
+        mask: Tensor,
+        start_token_id: int | None,
+        end_token_id: int | None,
+    ) -> Tensor:
+        if start_token_id is None or end_token_id is None:
+            return mask
+        mask = mask.clone()
+        for batch_idx in range(input_ids.shape[0]):
+            row = input_ids[batch_idx]
+            starts = (row == int(start_token_id)).nonzero(as_tuple=False).flatten().tolist()
+            ends = (row == int(end_token_id)).nonzero(as_tuple=False).flatten().tolist()
+            end_ptr = 0
+            for start in starts:
+                while end_ptr < len(ends) and ends[end_ptr] < start:
+                    end_ptr += 1
+                if end_ptr >= len(ends):
+                    mask[batch_idx, start:] = False  # unterminated span: mask to end of row
+                    break
+                end = int(ends[end_ptr])
+                mask[batch_idx, start : end + 1] = False
+                end_ptr += 1
+        return mask
+
+    def _encoder_attention_mask_for_action_expert(
+        self,
+        *,
+        input_ids: Tensor | None,
+        attention_mask: Tensor | None,
+    ) -> Tensor | None:
+        backbone = self._backbone()
+        get_encoder_attention_mask = getattr(backbone, "_get_encoder_attention_mask", None)
+        if callable(get_encoder_attention_mask):
+            mask = get_encoder_attention_mask(input_ids, attention_mask)
+        elif attention_mask is not None:
+            mask = attention_mask.to(dtype=torch.bool)
+        elif input_ids is not None:
+            mask = input_ids != -1
+        else:
+            return None
+
+        if getattr(self.config, "action_mode", None) != "both" or input_ids is None or mask is None:
+            return mask
+
+        mask = mask.to(dtype=torch.bool).clone()
+        eos_token_id = getattr(self.model.config, "eos_token_id", None)
+        if eos_token_id is not None:
+            mask &= input_ids != int(eos_token_id)
+        return self._mask_discrete_action_spans(
+            input_ids=input_ids,
+            mask=mask,
+            start_token_id=getattr(self.model.config, "action_start_token_id", None),
+            end_token_id=getattr(self.model.config, "action_end_token_id", None),
+        )
+
+    @staticmethod
+    def _drop_trivial_attention_mask(model_inputs: dict[str, Tensor]) -> dict[str, Tensor]:
+        attention_mask = model_inputs.get("attention_mask")
+        if torch.is_tensor(attention_mask) and bool(attention_mask.to(dtype=torch.bool).all().item()):
+            model_inputs = dict(model_inputs)
+            model_inputs.pop("attention_mask", None)
+        return model_inputs
+
+    def _load_discrete_action_tokenizer(self) -> Any:
+        if self.action_tokenizer is None:
+            require_package("transformers", extra="molmoact2")
+            require_package("scipy", extra="molmoact2")
+
+            if UniversalActionProcessor is None:
+                raise RuntimeError("transformers and scipy are required to load MolmoAct2 action tokenizer.")
+            self.action_tokenizer = UniversalActionProcessor.from_pretrained_local(
+                self.config.discrete_action_tokenizer,
+            )
+        return self.action_tokenizer
+
+    def _resolve_inference_action_mode(self, requested_mode: str | None) -> str:
+        return self.config.resolve_inference_action_mode(requested_mode, self._checkpoint_action_mode)
+
+    @staticmethod
+    def _combine_rollout_seeds(first_seed: int, batch_size: int) -> int:
+        seed = 0
+        for idx in range(batch_size):
+            seed = (seed + (idx + 1) * (first_seed + idx)) % (2**63 - 1)
+        return seed
+
+    @staticmethod
+    def _rollout_task_signature(batch: dict[str, Any]) -> tuple[Any, ...] | None:
+        task = batch.get("task")
+        if task is None:
+            task = batch.get("observation.language")
+        if task is None:
+            return None
+        if isinstance(task, str):
+            return (task,)
+        if isinstance(task, (list, tuple)):
+            return tuple(str(item) for item in task)
+        return (str(task),)
+
+    def _rollout_generator_for_inputs(
+        self,
+        batch: dict[str, Any],
+        *,
+        batch_size: int,
+        device: torch.device,
+    ) -> torch.Generator | None:
+        if not bool(getattr(self.config, "per_episode_seed", False)):
+            return None
+        if self._rollout_action_generator is not None:
+            return self._rollout_action_generator
+
+        task_signature = self._rollout_task_signature(batch)
+        if task_signature != self._rollout_task_key:
+            self._rollout_task_key = task_signature
+            self._rollout_index_for_task = 0
+        else:
+            self._rollout_index_for_task += 1
+
+        base_seed = int(getattr(self.config, "eval_seed", None) or 0)
+        first_seed = base_seed + self._rollout_index_for_task * batch_size
+        generator_device = (
+            device if device.type == "cuda" and torch.cuda.is_available() else torch.device("cpu")
+        )
+        generator = torch.Generator(device=generator_device)
+        generator.manual_seed(self._combine_rollout_seeds(first_seed, batch_size))
+        self._rollout_action_generator = generator
+        return generator
+
+    @staticmethod
+    def _expand_mask(mask: Tensor | None, num_flow_timesteps: int) -> Tensor | None:
+        if mask is None:
+            return None
+        return (
+            mask.unsqueeze(1)
+            .expand(-1, num_flow_timesteps, *([-1] * (mask.ndim - 1)))
+            .reshape(mask.shape[0] * num_flow_timesteps, *mask.shape[1:])
+        )
+
+    @staticmethod
+    def _action_dim_valid_mask(target: Tensor, action_dim_is_pad: Tensor | None) -> Tensor | None:
+        if action_dim_is_pad is None:
+            return None
+        mask = ~action_dim_is_pad.to(device=target.device, dtype=torch.bool)
+        if mask.ndim == 1:
+            mask = mask.unsqueeze(0)
+        if mask.shape[-1] != target.shape[-1]:
+            raise ValueError(
+                f"action_dim_is_pad width {mask.shape[-1]} does not match target width {target.shape[-1]}."
+            )
+        if mask.shape[0] == 1 and target.shape[0] != 1:
+            mask = mask.expand(target.shape[0], -1)
+        if mask.shape[0] != target.shape[0]:
+            raise ValueError(
+                f"action_dim_is_pad batch {mask.shape[0]} does not match target batch {target.shape[0]}."
+            )
+        while mask.ndim < target.ndim:
+            mask = mask.unsqueeze(1)
+        return mask
+
+    @classmethod
+    def _mask_action_dim_tensor(cls, tensor: Tensor, action_dim_is_pad: Tensor | None) -> Tensor:
+        if not cls._mask_enabled_static(action_dim_is_pad):
+            return tensor
+        valid_mask = cls._action_dim_valid_mask(tensor, action_dim_is_pad)
+        if valid_mask is None:
+            return tensor
+        return tensor.masked_fill(~valid_mask, 0)
+
+    @staticmethod
+    def _mask_enabled_static(action_dim_is_pad: Tensor | None) -> bool:
+        return action_dim_is_pad is not None
+
+    @classmethod
+    def _apply_action_dim_padding_mask(cls, loss: Tensor, action_dim_is_pad: Tensor | None) -> Tensor:
+        valid_mask = cls._action_dim_valid_mask(loss, action_dim_is_pad)
+        if valid_mask is None:
+            return loss
+        valid = valid_mask.to(dtype=loss.dtype)
+        denom = valid.sum(dim=-1).clamp_min(1.0)
+        return (loss * valid).sum(dim=-1) / denom
+
+    @staticmethod
+    def _apply_action_chunk_padding_mask(loss: Tensor, action_horizon_is_pad: Tensor | None) -> Tensor:
+        if action_horizon_is_pad is None:
+            return loss
+        valid_action = (
+            (~action_horizon_is_pad.to(device=loss.device, dtype=torch.bool)).unsqueeze(1).unsqueeze(-1)
+        )
+        return loss * valid_action
+
+    def _prepare_flow_matching_tensors(
+        self,
+        *,
+        actions: Tensor,
+        action_dim_is_pad: Tensor | None,
+        timesteps: Tensor | None = None,
+        noise: Tensor | None = None,
+    ) -> tuple[Tensor, Tensor, Tensor, Tensor]:
+        action_expert = self._action_expert()
+        action_dtype = next(action_expert.parameters()).dtype
+        actions = actions.to(dtype=action_dtype)
+        batch_size = int(actions.shape[0])
+        device = actions.device
+        num_flow_timesteps = max(1, int(self.config.num_flow_timesteps))
+
+        if timesteps is None:
+            timesteps = (
+                _sample_beta_timesteps(
+                    batch_size=batch_size * num_flow_timesteps,
+                    device=device,
+                    cutoff=self.config.flow_matching_cutoff,
+                    time_offset=self.config.flow_matching_time_offset,
+                    time_scale=self.config.flow_matching_time_scale,
+                    alpha=self.config.flow_matching_beta_alpha,
+                    beta=self.config.flow_matching_beta_beta,
+                )
+                .to(dtype=action_dtype)
+                .view(batch_size, num_flow_timesteps)
+            )
+        else:
+            expected_timesteps_shape = (batch_size, num_flow_timesteps)
+            timesteps = timesteps.to(device=device, dtype=action_dtype)
+            if tuple(timesteps.shape) != expected_timesteps_shape:
+                raise ValueError(
+                    f"flow timesteps must have shape {expected_timesteps_shape}, got {tuple(timesteps.shape)}."
+                )
+
+        if self.config.mask_action_dim_padding:
+            actions = self._mask_action_dim_tensor(actions, action_dim_is_pad)
+
+        expected_noise_shape = (batch_size, num_flow_timesteps, actions.shape[1], actions.shape[2])
+        if noise is None:
+            noise = torch.randn(*expected_noise_shape, device=device, dtype=actions.dtype)
+        else:
+            noise = noise.to(device=device, dtype=actions.dtype)
+            if tuple(noise.shape) != expected_noise_shape:
+                raise ValueError(
+                    f"flow noise must have shape {expected_noise_shape}, got {tuple(noise.shape)}."
+                )
+        if self.config.mask_action_dim_padding:
+            noise = self._mask_action_dim_tensor(noise, action_dim_is_pad)
+
+        t_broadcast = timesteps.view(batch_size, num_flow_timesteps, 1, 1)
+        actions_expanded = actions.unsqueeze(1).expand(-1, num_flow_timesteps, -1, -1)
+        xt = (1.0 - t_broadcast) * noise + t_broadcast * actions_expanded  # t=0: noise, t=1: actions
+        target_velocity = actions_expanded - noise  # d(xt)/dt along the linear path
+        return actions, timesteps, xt, target_velocity
+
+    def _prepare_joint_training_backbone_inputs(
+        self,
+        model_inputs: dict[str, Tensor],
+    ) -> tuple[Tensor, Tensor | dict[str, Any], Tensor, Tensor]:
+        backbone = self._backbone()
+        input_ids = model_inputs.get("input_ids")
+        inputs_embeds = model_inputs.get("inputs_embeds")
+        if (input_ids is None) == (inputs_embeds is None):
+            raise ValueError(
+                "MolmoAct2 joint flow training requires exactly one of input_ids or inputs_embeds."
+            )
+
+        images = None
+        token_pooling = None
+        merge_visual_inputs = getattr(backbone, "merge_visual_inputs", None)
+        if callable(merge_visual_inputs):
+            images, token_pooling = merge_visual_inputs(
+                input_ids=input_ids,
+                pixel_values=model_inputs.get("pixel_values"),
+                image_token_pooling=model_inputs.get("image_token_pooling"),
+                image_grids=model_inputs.get("image_grids"),
+                image_num_crops=model_inputs.get("image_num_crops"),
+                pixel_values_videos=model_inputs.get("pixel_values_videos"),
+                video_token_pooling=model_inputs.get("video_token_pooling"),
+                video_grids=model_inputs.get("video_grids"),
+            )
+        elif (
+            model_inputs.get("pixel_values") is not None
+            or model_inputs.get("pixel_values_videos") is not None
+        ):
+            raise RuntimeError("MolmoAct2 checkpoint does not expose merge_visual_inputs for joint training.")
+
+        if images is not None and inputs_embeds is not None:
+            raise ValueError("MolmoAct2 joint flow training cannot combine inputs_embeds with visual inputs.")
+        if inputs_embeds is None:
+            inputs_embeds, _image_features = backbone.build_input_embeddings(input_ids, images, token_pooling)
+
+        cache_position = torch.arange(0, inputs_embeds.shape[1], device=inputs_embeds.device)
+        position_ids = model_inputs.get("position_ids")
+        if position_ids is None:
+            position_ids = cache_position.unsqueeze(0)
+
+        attention_mask = model_inputs.get("attention_mask")
+        if isinstance(attention_mask, dict):
+            causal_mask_mapping = attention_mask
+        else:
+            causal_mask_mapping = backbone._build_native_attention_bias(
+                inputs_embeds=inputs_embeds,
+                attention_mask=attention_mask,
+                token_type_ids=model_inputs.get("token_type_ids"),
+                past_key_values=None,
+            )
+        return inputs_embeds, causal_mask_mapping, position_ids, cache_position
+
+    @staticmethod
+    def _decoder_layer_kv_outputs(
+        layer_outputs: tuple[Any, ...], *, output_attentions: bool
+    ) -> tuple[Tensor, Tensor]:
+        output_idx = 2 if output_attentions else 1
+        return layer_outputs[output_idx], layer_outputs[output_idx + 1]
+
+    @staticmethod
+    def _action_time_conditioning(action_expert: torch.nn.Module, timesteps: Tensor) -> Tensor:
+        time_conditioning = getattr(action_expert, "_time_conditioning", None)
+        if callable(time_conditioning):
+            return time_conditioning(timesteps)
+        return action_expert.time_embed(timesteps)
+
+    def _compute_flow_matching_loss_joint_per_layer(
+        self,
+        *,
+        batch: dict[str, Tensor],
+        model_inputs: dict[str, Tensor],
+        timesteps: Tensor | None = None,
+        noise: Tensor | None = None,
+        reduction: str = "mean",
+    ) -> tuple[Tensor, Tensor]:
+        if reduction not in {"mean", "none"}:
+            raise ValueError(f"Unsupported reduction={reduction!r}. Expected 'mean' or 'none'.")
+        backbone = self._backbone()
+        transformer = getattr(backbone, "transformer", None)
+        action_expert = backbone._require_action_expert()
+        if transformer is None:
+            raise RuntimeError("MolmoAct2 joint flow training requires a patchable text transformer.")
+        if len(action_expert.blocks) != int(transformer.config.num_hidden_layers):
+            raise RuntimeError(
+                "MolmoAct2 joint flow training requires one action expert block per text transformer layer."
+            )
+
+        actions, timesteps, xt, target_velocity = self._prepare_flow_matching_tensors(
+            actions=batch[ACTION],
+            action_dim_is_pad=batch.get("action_dim_is_pad"),
+            timesteps=timesteps,
+            noise=noise,
+        )
+        num_flow_timesteps = max(1, int(self.config.num_flow_timesteps))
+        batch_size = int(actions.shape[0])
+        device = actions.device
+        xt_flat = xt.reshape(batch_size * num_flow_timesteps, actions.shape[1], actions.shape[2])
+        timesteps_flat = timesteps.reshape(batch_size * num_flow_timesteps)
+
+        hidden_states, causal_mask_mapping, position_ids, cache_position = (
+            self._prepare_joint_training_backbone_inputs(model_inputs)
+        )
+        if hidden_states.shape[0] != batch_size:
+            raise ValueError(
+                f"Backbone batch size {hidden_states.shape[0]} does not match action batch size {batch_size}."
+            )
+
+        encoder_attention_mask = self._encoder_attention_mask_for_action_expert(
+            input_ids=model_inputs.get("input_ids"),
+            attention_mask=model_inputs.get("attention_mask"),
+        )
+        action_attention_mask = None
+        if batch.get("action_horizon_is_pad") is not None:
+            action_attention_mask = ~batch["action_horizon_is_pad"].to(device=device, dtype=torch.bool)
+
+        valid_action = None
+        if action_attention_mask is not None:
+            valid_action = action_attention_mask.to(device=device, dtype=actions.dtype).unsqueeze(-1)
+            valid_action = self._expand_mask(valid_action, num_flow_timesteps)
+
+        rope_cache = None
+        if len(action_expert.blocks) > 0 and action_expert.blocks[0].self_attn.rope is not None:
+            rope_cache = action_expert.blocks[0].self_attn.rope.build_cache(
+                seq_len=actions.shape[1],
+                device=device,
+                dtype=actions.dtype,
+            )
+
+        cross_mask = action_expert._build_cross_attention_mask(
+            encoder_attention_mask,
+            batch_size,
+            actions.dtype,
+        )
+        cross_mask = self._expand_mask(cross_mask, num_flow_timesteps)
+        self_mask = action_expert._build_self_attention_mask(
+            action_attention_mask,
+            actions.shape[1],
+            device,
+            actions.dtype,
+        )
+        self_mask = self._expand_mask(self_mask, num_flow_timesteps)
+
+        conditioning = self._action_time_conditioning(action_expert, timesteps_flat)
+        action_hidden = action_expert.action_embed(xt_flat)
+        if valid_action is not None:
+            action_hidden = action_hidden * valid_action
+
+        if transformer.config.rope_scaling_layers is not None:
+            position_embeddings_mapping = {
+                "default": transformer.rotary_embs["default"](hidden_states, position_ids),
+                "scaling": transformer.rotary_embs["scaling"](hidden_states, position_ids),
+            }
+        else:
+            position_embeddings = transformer.rotary_emb(hidden_states, position_ids)
+
+        use_gradient_checkpointing = bool(
+            getattr(self.config, "gradient_checkpointing", False)
+            and self.training
+            and torch.is_grad_enabled()
+        )
+
+        def run_layer(
+            layer_idx: int, layer_hidden: Tensor, layer_action_hidden: Tensor
+        ) -> tuple[Tensor, Tensor]:
+            decoder_block = transformer.blocks[layer_idx]
+            action_block = action_expert.blocks[layer_idx]
+            if transformer.config.rope_scaling_layers is not None:
+                position_embeddings_i = (
+                    position_embeddings_mapping["scaling"]
+                    if layer_idx in transformer.config.rope_scaling_layers
+                    else position_embeddings_mapping["default"]
+                )
+            else:
+                position_embeddings_i = position_embeddings
+
+            layer_outputs = decoder_block(
+                layer_hidden,
+                position_embeddings=position_embeddings_i,
+                attention_mask=causal_mask_mapping,
+                position_ids=position_ids,
+                past_key_values=None,
+                output_attentions=False,
+                use_cache=False,
+                cache_position=cache_position,
+                collect_layer_kv_states=True,
+            )
+            next_hidden = layer_outputs[0]
+            key_states, value_states = self._decoder_layer_kv_outputs(layer_outputs, output_attentions=False)
+            key_states = backbone._cache_to_sequence(key_states)
+            value_states = backbone._cache_to_sequence(value_states)
+            if self.config.enable_knowledge_insulation:
+                key_states = key_states.detach()
+                value_states = value_states.detach()
+
+            k_ctx = action_expert._project_kv_tensor(key_states, action_expert.context_k_proj)
+            v_ctx = action_expert._project_kv_tensor(value_states, action_expert.context_v_proj)
+            k_norm = action_block.cross_attn.k_norm
+            if k_norm is not None:
+                k_ctx = k_norm(k_ctx.transpose(1, 2)).transpose(1, 2)
+            if num_flow_timesteps != 1:
+                k_ctx = self._expand_mask(k_ctx, num_flow_timesteps)
+                v_ctx = self._expand_mask(v_ctx, num_flow_timesteps)
+
+            next_action_hidden = action_block(
+                layer_action_hidden,
+                conditioning,
+                cross_kv=(k_ctx, v_ctx),
+                self_attn_mask=self_mask,
+                attn_mask=cross_mask,
+                is_causal=action_expert.config.causal_attn,
+                modulation=None,
+                rope_cache=rope_cache,
+            )
+            if valid_action is not None:
+                next_action_hidden = next_action_hidden * valid_action
+            return next_hidden, next_action_hidden
+
+        for layer_idx in range(int(transformer.config.num_hidden_layers)):
+            if use_gradient_checkpointing:
+                hidden_states, action_hidden = torch.utils.checkpoint.checkpoint(
+                    lambda layer_hidden, layer_action_hidden, idx=layer_idx: run_layer(
+                        idx,
+                        layer_hidden,
+                        layer_action_hidden,
+                    ),
+                    hidden_states,
+                    action_hidden,
+                    use_reentrant=False,
+                )
+            else:
+                hidden_states, action_hidden = run_layer(layer_idx, hidden_states, action_hidden)
+
+        hidden_states = transformer.ln_f(hidden_states)
+        pred_velocity = action_expert.final_layer(action_hidden, conditioning)
+        if valid_action is not None:
+            pred_velocity = pred_velocity * valid_action
+        pred_velocity = pred_velocity.reshape(
+            batch_size, num_flow_timesteps, actions.shape[1], actions.shape[2]
+        )
+
+        loss = F.mse_loss(pred_velocity, target_velocity, reduction="none")
+        loss = self._apply_action_chunk_padding_mask(loss, batch.get("action_horizon_is_pad"))
+        if self.config.mask_action_dim_padding:
+            loss = self._apply_action_dim_padding_mask(loss, batch.get("action_dim_is_pad"))
+        loss = loss.reshape(batch_size, -1).mean(dim=1)
+        if reduction == "mean":
+            loss = loss.mean()
+        return loss, hidden_states
+
+    def _discrete_token_weights(self, valid_positions: Tensor) -> Tensor | None:
+        mode = self.config.discrete_loss_token_weighting
+        if mode in {"none", "token", "root_subsegments"}:
+            return None
+        if mode not in {"root_subsegments_root_tokens", "root_tokens"}:
+            raise ValueError(f"Unsupported discrete_loss_token_weighting={mode!r}.")
+
+        token_counts = valid_positions.sum(dim=1).to(dtype=torch.float32)
+        example_weights = torch.zeros_like(token_counts)
+        nonempty = token_counts > 0
+        example_weights[nonempty] = 2.0 / torch.sqrt(token_counts[nonempty])
+        return example_weights[:, None].expand_as(valid_positions)[valid_positions].to(dtype=torch.float32)
+
+    @staticmethod
+    def _weighted_mean(values: Tensor, weights: Tensor | None) -> Tensor:
+        if weights is None:
+            return values.mean()
+        weights = weights.to(device=values.device, dtype=values.dtype)
+        return torch.dot(values, weights) / weights.sum().clamp_min(1.0)
+
+    @staticmethod
+    def _weighted_per_example(
+        values: Tensor,
+        weights: Tensor | None,
+        example_indices: Tensor,
+        batch_size: int,
+    ) -> Tensor:
+        values = values.float()
+        if weights is None:
+            weights = torch.ones_like(values)
+        else:
+            weights = weights.to(device=values.device, dtype=values.dtype)
+        loss_sum = torch.zeros(batch_size, device=values.device, dtype=torch.float32)
+        weight_sum = torch.zeros(batch_size, device=values.device, dtype=torch.float32)
+        loss_sum.scatter_add_(0, example_indices, values * weights)
+        weight_sum.scatter_add_(0, example_indices, weights)
+        global_weight_sum = weight_sum.sum().clamp_min(1.0)
+        return loss_sum * float(batch_size) / global_weight_sum
+
+    def _discrete_loss_from_backbone_outputs(
+        self,
+        batch: dict[str, Tensor],
+        outputs: Any,
+        reduction: str = "mean",
+    ) -> tuple[Tensor, Tensor | None]:
+        if reduction not in {"mean", "none"}:
+            raise ValueError(f"Unsupported reduction={reduction!r}. Expected 'mean' or 'none'.")
+        labels = batch.get("labels")
+        if labels is None:
+            raise RuntimeError("MolmoAct2 discrete training requires labels.")
+        hidden_states = outputs.last_hidden_state
+        if hidden_states is None:
+            raise RuntimeError("MolmoAct2 backbone did not return last_hidden_state.")
+
+        ignore_index = -100
+        shift_labels = F.pad(labels, (0, 1), value=ignore_index)[..., 1:].contiguous()
+        valid_positions = shift_labels != ignore_index
+        if not bool(valid_positions.any()):
+            raise RuntimeError("MolmoAct2 discrete training labels contain no valid action tokens.")
+
+        hidden_size = hidden_states.shape[-1]
+        selected_hidden = hidden_states.reshape(-1, hidden_size)[valid_positions.reshape(-1)]
+        selected_labels = shift_labels.reshape(-1)[valid_positions.reshape(-1)].to(
+            device=hidden_states.device
+        )
+        logits = F.linear(selected_hidden, self.model.lm_head.weight).float()
+        log_z = logits.logsumexp(dim=-1)
+        target_logits = logits.gather(dim=-1, index=selected_labels[:, None]).squeeze(-1)
+        token_ce_loss = log_z - target_logits
+        token_weights = self._discrete_token_weights(valid_positions)
+        if reduction == "none":
+            example_indices = valid_positions.nonzero(as_tuple=False)[:, 0].to(device=hidden_states.device)
+            ce_loss = self._weighted_per_example(
+                token_ce_loss,
+                token_weights,
+                example_indices,
+                int(labels.shape[0]),
+            )
+        else:
+            ce_loss = self._weighted_mean(token_ce_loss, token_weights)
+        if not self.config.softmax_auxiliary_loss:
+            return ce_loss, None
+
+        if reduction == "none":
+            z_loss = self.config.softmax_auxiliary_loss_scale * self._weighted_per_example(
+                log_z.pow(2),
+                token_weights,
+                example_indices,
+                int(labels.shape[0]),
+            )
+        else:
+            z_loss = self.config.softmax_auxiliary_loss_scale * self._weighted_mean(
+                log_z.pow(2), token_weights
+            )
+        return ce_loss, z_loss
+
+    @staticmethod
+    def _extract_discrete_token_bins(
+        generated_ids: list[int],
+        start_token_id: int,
+        end_token_id: int,
+        token_id_to_bin: dict[int, int],
+    ) -> list[int]:
+        start_idx = None
+        end_idx = None
+        for idx, token_id in enumerate(generated_ids):
+            if token_id == start_token_id:
+                start_idx = idx
+                break
+        if start_idx is not None:
+            for idx in range(start_idx + 1, len(generated_ids)):
+                if generated_ids[idx] == end_token_id:
+                    end_idx = idx
+                    break
+        span_start = 0 if start_idx is None else start_idx + 1
+        span_end = len(generated_ids) if end_idx is None else end_idx
+        return [
+            int(token_id_to_bin[token_id])
+            for token_id in generated_ids[span_start:span_end]
+            if token_id in token_id_to_bin
+        ]
+
+    def _action_token_id_to_bin(self) -> dict[int, int]:
+        method = getattr(self.model, "_action_token_id_to_bin", None)
+        if callable(method):
+            return dict(method())
+        start = getattr(self.model.config, "action_token_start_id", None)
+        num_tokens = int(getattr(self.model.config, "num_action_tokens", 0) or 0)
+        if start is None or num_tokens <= 0:
+            return {}
+        return {int(start) + idx: idx for idx in range(num_tokens)}
+
+    def _require_discrete_eos_token_id(self) -> int:
+        method = getattr(self.model, "_require_eos_token_id", None)
+        if callable(method):
+            return int(method())
+        eos_token_id = getattr(self.model.config, "eos_token_id", None)
+        if eos_token_id is None and getattr(self.model, "generation_config", None) is not None:
+            eos_token_id = getattr(self.model.generation_config, "eos_token_id", None)
+        if isinstance(eos_token_id, (list, tuple)):
+            eos_token_id = eos_token_id[0] if eos_token_id else None
+        if eos_token_id is None:
+            raise RuntimeError("Discrete action generation requires eos_token_id in the checkpoint config.")
+        return int(eos_token_id)
+
+    def _discrete_generation_max_steps(self) -> int:
+        if self.config.discrete_generation_max_steps is not None:
+            return int(self.config.discrete_generation_max_steps)
+        return max(1, self._generation_action_horizon() * 16)
+
+    def _continue_discrete_generation_from_output(
+        self,
+        initial_output: Any,
+        *,
+        past_key_values: Any | None,
+        attention_mask: Tensor | None,
+        end_token_id: int,
+        max_steps: int,
+        attention_bias: Tensor | None = None,
+    ) -> Tensor:
+        consume_generation_tokens = getattr(self.model, "_consume_generation_tokens", None)
+        ar_decode_step = getattr(self.model, "_run_ar_decode_step", None)
+        if ar_decode_step is None:
+            ar_decode_step = getattr(self.model, "_run_depth_decode_step", None)
+        if attention_bias is None and not callable(consume_generation_tokens):
+            raise RuntimeError("MolmoAct2 checkpoint does not expose discrete token generation helpers.")
+        if attention_bias is not None and not callable(ar_decode_step):
+            raise RuntimeError("MolmoAct2 checkpoint does not expose graph-backed AR decode helpers.")
+
+        generated_tokens: list[Tensor] = []
+        current_output = initial_output
+        current_past_key_values = past_key_values
+        current_attention_mask = attention_mask
+        hit_end = False
+        for _ in range(int(max_steps)):
+            next_token = torch.argmax(current_output.logits[:, -1, :], dim=-1)
+            generated_tokens.append(next_token)
+            if bool((next_token == int(end_token_id)).all()):
+                hit_end = True
+                break
+            if attention_bias is None:
+                current_output, current_attention_mask = consume_generation_tokens(
+                    next_token,
+                    past_key_values=current_past_key_values,
+                    attention_mask=current_attention_mask,
+                )
+                current_past_key_values = current_output.past_key_values
+            else:
+                last_hidden, current_past_key_values = ar_decode_step(
+                    next_token,
+                    past_key_values=current_past_key_values,
+                    attention_bias=attention_bias,
+                )
+                current_output = types.SimpleNamespace(
+                    logits=self.model.lm_head(last_hidden),
+                    past_key_values=current_past_key_values,
+                )
+        if not generated_tokens:
+            raise RuntimeError("Discrete continuation generated no tokens.")
+        if not hit_end:
+            raise RuntimeError(
+                f"Discrete continuation did not emit end token {int(end_token_id)} within {int(max_steps)} steps."
+            )
+        return torch.stack(generated_tokens, dim=1)
+
+    def _make_discrete_ar_graph_decode_inputs(
+        self,
+        model_inputs: dict[str, Tensor],
+        *,
+        max_steps: int,
+    ) -> tuple[Any | None, Tensor | None]:
+        if not bool(getattr(self.config, "enable_inference_cuda_graph", False)):
+            return None, None
+        if self.training or self.model.training:
+            return None, None
+        ar_decode_step = getattr(self.model, "_run_ar_decode_step", None)
+        if ar_decode_step is None:
+            ar_decode_step = getattr(self.model, "_run_depth_decode_step", None)
+        make_attention_bias = getattr(self.model, "_make_depth_decode_attention_bias", None)
+        if not callable(ar_decode_step) or not callable(make_attention_bias):
+            return None, None
+
+        make_static_cache = getattr(self.model, "_make_ar_decode_static_cache", None)
+        if callable(make_static_cache):
+            static_cache = make_static_cache(model_inputs, max_steps=max_steps)
+        else:
+            graph_manager = getattr(self.model, "depth_decode_cuda_graph_manager", None)
+            make_manager_static_cache = getattr(graph_manager, "make_static_cache", None)
+            if not callable(make_manager_static_cache):
+                return None, None
+            prompt_len = int(model_inputs["input_ids"].shape[1])
+            static_cache = make_manager_static_cache(max_cache_len=prompt_len + max(1, int(max_steps)))
+
+        attention_bias = make_attention_bias(model_inputs, static_cache)
+        return static_cache, attention_bias
+
+    def _decode_discrete_action_chunk(self, generated_token_ids: Tensor, *, action_dim: int) -> Tensor:
+        if (
+            getattr(self.model.config, "action_start_token_id", None) is None
+            or getattr(self.model.config, "action_end_token_id", None) is None
+        ):
+            raise RuntimeError("Discrete action generation requires <action_start>/<action_end> token IDs.")
+        token_id_to_bin = self._action_token_id_to_bin()
+        if not token_id_to_bin:
+            raise RuntimeError(
+                "Discrete action generation requires indexed action tokens in the checkpoint config."
+            )
+
+        action_tokenizer = self._load_discrete_action_tokenizer()
+        if generated_token_ids.ndim == 1:
+            generated_token_ids = generated_token_ids.unsqueeze(0)
+        if generated_token_ids.ndim == 3:
+            generated_token_ids = generated_token_ids[:, 0, :]
+        if generated_token_ids.ndim != 2:
+            raise ValueError(f"Unexpected generated token tensor shape {tuple(generated_token_ids.shape)}.")
+
+        chunks: list[Tensor] = []
+        for token_row in generated_token_ids:
+            generated_ids = [int(token_id) for token_id in token_row.detach().cpu().tolist()]
+            discrete_token_ids = self._extract_discrete_token_bins(
+                generated_ids,
+                int(self.model.config.action_start_token_id),
+                int(self.model.config.action_end_token_id),
+                token_id_to_bin,
+            )
+            if not discrete_token_ids:
+                raise RuntimeError(
+                    "Model generated no decodable action tokens between <action_start>/<action_end>."
+                )
+            try:
+                decoded = action_tokenizer.decode(
+                    [discrete_token_ids],
+                    time_horizon=self._generation_action_horizon(),
+                    action_dim=int(action_dim),
+                )
+            except TypeError:
+                decoded = action_tokenizer.decode([discrete_token_ids])
+            action_chunk = np.asarray(decoded, dtype=np.float32)
+            if action_chunk.ndim == 1:
+                action_chunk = action_chunk[None, :]
+            elif action_chunk.ndim == 3:
+                if int(action_chunk.shape[0]) != 1:
+                    action_chunk = action_chunk.reshape(action_chunk.shape[-2], action_chunk.shape[-1])
+                else:
+                    action_chunk = action_chunk[0]
+            elif action_chunk.ndim > 3:
+                action_chunk = action_chunk.reshape(action_chunk.shape[-2], action_chunk.shape[-1])
+            if action_chunk.ndim != 2:
+                raise RuntimeError(f"Decoded action chunk has unexpected shape {action_chunk.shape}.")
+            chunks.append(torch.as_tensor(action_chunk, device=token_row.device, dtype=torch.float32))
+        return torch.stack(chunks, dim=0)
+
+    def _generate_discrete_actions_from_inputs(
+        self,
+        *,
+        model_inputs: dict[str, Tensor],
+        action_dim: int,
+    ) -> Tensor:
+        model_inputs = self._drop_trivial_attention_mask(model_inputs)
+        max_steps = self._discrete_generation_max_steps()
+        static_cache, attention_bias = self._make_discrete_ar_graph_decode_inputs(
+            model_inputs,
+            max_steps=max_steps,
+        )
+        prefill_kwargs: dict[str, Any] = {}
+        if static_cache is not None:
+            prefill_kwargs["past_key_values"] = static_cache
+        prefill_output = self.model(
+            **model_inputs,
+            use_cache=True,
+            output_attentions=False,
+            output_hidden_states=False,
+            **prefill_kwargs,
+        )
+        generated_token_ids = self._continue_discrete_generation_from_output(
+            prefill_output,
+            past_key_values=prefill_output.past_key_values,
+            attention_mask=model_inputs.get("attention_mask"),
+            end_token_id=self._require_discrete_eos_token_id(),
+            max_steps=max_steps,
+            attention_bias=attention_bias,
+        )
+        return self._decode_discrete_action_chunk(generated_token_ids, action_dim=action_dim)
+
+    def _generate_actions_from_inputs_with_rtc(
+        self,
+        *,
+        model_inputs: dict[str, Tensor],
+        action_dim_is_pad: Tensor | None,
+        num_steps: int | None,
+        generator: torch.Generator | None,
+        inference_delay: int | None,
+        prev_chunk_left_over: Tensor | None,
+        execution_horizon: int | None,
+    ) -> Tensor:
+        backbone = self._backbone()
+        action_expert = self._action_expert()
+        outputs = backbone(
+            **model_inputs,
+            use_cache=True,
+            output_attentions=False,
+            output_hidden_states=False,
+        )
+        encoder_kv_states = backbone._extract_kv_states(outputs.past_key_values)
+        encoder_attention_mask = self._encoder_attention_mask_for_action_expert(
+            input_ids=model_inputs.get("input_ids"),
+            attention_mask=model_inputs.get("attention_mask"),
+        )
+        depth_gate, depth_mask = backbone._depth_gate_from_condition(
+            input_ids=model_inputs.get("input_ids"),
+            encoder_attention_mask=encoder_attention_mask,
+            layer_kv_states=encoder_kv_states,
+        )
+        encoder_kv_states = backbone._apply_depth_gate_to_layer_kv_states(
+            encoder_kv_states,
+            depth_mask,
+            depth_gate,
+        )
+
+        steps = int(num_steps or backbone.config.flow_matching_num_steps)
+        if steps <= 0:
+            raise ValueError(f"num_steps must be >= 1, got {steps}.")
+        source_tensor = encoder_kv_states[0][0]
+        batch_size = int(source_tensor.shape[0])
+        device = source_tensor.device
+        trajectory = torch.randn(
+            batch_size,
+            self._generation_action_horizon(),
+            int(backbone.config.max_action_dim),
+            device=device,
+            dtype=torch.float32,
+            generator=generator,
+        )
+        if self.config.mask_action_dim_padding:
+            trajectory = self._mask_action_dim_tensor(trajectory, action_dim_is_pad)
+
+        action_context = action_expert.prepare_context(
+            encoder_kv_states=encoder_kv_states,
+            encoder_attention_mask=encoder_attention_mask,
+            state_embeddings=None,
+            batch_size=batch_size,
+            seq_len=trajectory.shape[1],
+            device=device,
+            dtype=trajectory.dtype,
+        )
+        flow_timesteps = [
+            torch.full((batch_size,), idx / steps, device=device, dtype=trajectory.dtype)
+            for idx in range(steps)
+        ]
+        modulation_cache = action_expert.get_or_prepare_modulation_cache(
+            flow_timesteps,
+            cache_key=(steps, batch_size, device, trajectory.dtype),
+        )
+
+        dt = 1.0 / steps
+        mask_enabled = self.config.mask_action_dim_padding
+        for idx, flow_timestep in enumerate(flow_timesteps):
+            modulation = modulation_cache[idx]
+
+            def denoise_step(input_trajectory: Tensor, step_modulation=modulation) -> Tensor:
+                velocity = action_expert.forward_with_context(
+                    input_trajectory,
+                    step_modulation.conditioning,
+                    context=action_context,
+                    modulation=step_modulation,
+                )
+                if mask_enabled:
+                    velocity = self._mask_action_dim_tensor(velocity, action_dim_is_pad)
+                return velocity
+
+            if self._rtc_enabled():
+                if self.rtc_processor is None:
+                    raise RuntimeError("RTC is enabled but rtc_processor is not initialized.")
+
+                def rtc_denoise_step(input_trajectory: Tensor) -> Tensor:
+                    return -denoise_step(input_trajectory)
+
+                rtc_time = 1.0 - float(flow_timestep[0].item())
+                rtc_velocity = self.rtc_processor.denoise_step(
+                    x_t=trajectory,
+                    prev_chunk_left_over=prev_chunk_left_over,
+                    inference_delay=int(inference_delay or 0),
+                    time=rtc_time,
+                    original_denoise_step_partial=rtc_denoise_step,
+                    execution_horizon=execution_horizon,
+                )
+                velocity = -rtc_velocity
+            else:
+                velocity = denoise_step(trajectory)
+
+            trajectory = trajectory + dt * velocity
+            if mask_enabled:
+                trajectory = self._mask_action_dim_tensor(trajectory, action_dim_is_pad)
+            if self.rtc_processor is not None and self.rtc_processor.is_debug_enabled():
+                self.rtc_processor.track(time=float(flow_timestep[0].item()), x_t=trajectory, v_t=velocity)
+
+        return trajectory
+
+    def forward(
+        self,
+        batch: dict[str, Tensor],
+        reduction: str = "mean",
+    ) -> tuple[Tensor, dict[str, Any]]:
+        if reduction not in {"mean", "none"}:
+            raise ValueError(f"Unsupported reduction={reduction!r}. Expected 'mean' or 'none'.")
+        model_inputs = self._model_inputs(batch)
+        losses: list[Tensor] = []
+        metrics: dict[str, Any] = {}
+
+        if self.config.action_mode == "discrete":
+            outputs = self._backbone()(
+                **model_inputs,
+                use_cache=False,
+                output_attentions=False,
+                output_hidden_states=False,
+            )
+            discrete_ce_loss, discrete_z_loss = self._discrete_loss_from_backbone_outputs(
+                batch, outputs, reduction=reduction
+            )
+            discrete_loss = (
+                discrete_ce_loss if discrete_z_loss is None else discrete_ce_loss + discrete_z_loss
+            )
+            losses.append(discrete_loss)
+            metrics["discrete_ce_loss"] = discrete_ce_loss.detach().float().mean().item()
+            if discrete_z_loss is not None:
+                metrics["discrete_z_loss"] = discrete_z_loss.detach().float().mean().item()
+
+        elif self.config.action_mode == "continuous":
+            flow_loss, _ = self._compute_flow_matching_loss_joint_per_layer(
+                batch=batch,
+                model_inputs=model_inputs,
+                reduction=reduction,
+            )
+            losses.append(flow_loss)
+            metrics["action_flow_loss"] = flow_loss.detach().float().mean().item()
+
+        else:
+            flow_loss, hidden_states = self._compute_flow_matching_loss_joint_per_layer(
+                batch=batch,
+                model_inputs=model_inputs,
+                reduction=reduction,
+            )
+            outputs = types.SimpleNamespace(last_hidden_state=hidden_states)
+            discrete_ce_loss, discrete_z_loss = self._discrete_loss_from_backbone_outputs(
+                batch, outputs, reduction=reduction
+            )
+            discrete_loss = (
+                discrete_ce_loss if discrete_z_loss is None else discrete_ce_loss + discrete_z_loss
+            )
+            losses.append(discrete_loss)
+            metrics["discrete_ce_loss"] = discrete_ce_loss.detach().float().mean().item()
+            if discrete_z_loss is not None:
+                metrics["discrete_z_loss"] = discrete_z_loss.detach().float().mean().item()
+            losses.append(flow_loss)
+            metrics["action_flow_loss"] = flow_loss.detach().float().mean().item()
+
+        loss = torch.stack(losses).sum(dim=0)
+        metrics["loss"] = loss.detach().float().mean().item()
+        return loss, metrics
+
+    @torch.no_grad()
+    def predict_action_chunk(self, batch: dict[str, Tensor], **kwargs) -> Tensor:
+        if "action_mode" in kwargs:
+            raise TypeError(
+                "MolmoAct2 predict_action_chunk got unexpected keyword argument 'action_mode'; "
+                "use 'inference_action_mode'."
+            )
+        model_inputs = self._model_inputs(batch)
+        inference_action_mode = self._resolve_inference_action_mode(kwargs.get("inference_action_mode"))
+        num_steps = kwargs.get("num_steps", getattr(self.config, "num_inference_steps", None))
+        generator = kwargs.get("generator")
+        model_dtype = _torch_dtype(self.config.model_dtype)
+        device = next(self.parameters()).device
+        batch_size = int(next(iter(model_inputs.values())).shape[0])
+        if generator is None:
+            generator = self._rollout_generator_for_inputs(
+                batch,
+                batch_size=batch_size,
+                device=device,
+            )
+        action_dim = self._output_action_dim(batch)
+        autocast_context = (
+            torch.autocast(device_type=device.type, dtype=model_dtype)
+            if device.type in {"cuda", "cpu"} and model_dtype in {torch.bfloat16, torch.float16}
+            else nullcontext()
+        )
+        with autocast_context:
+            if inference_action_mode == "discrete":
+                if self._rtc_enabled():
+                    raise ValueError("RTC is only supported for continuous MolmoAct2 inference.")
+                actions = self._generate_discrete_actions_from_inputs(
+                    model_inputs=model_inputs,
+                    action_dim=action_dim,
+                )
+            elif self._rtc_enabled():
+                actions = self._generate_actions_from_inputs_with_rtc(
+                    model_inputs=model_inputs,
+                    action_dim_is_pad=batch.get("action_dim_is_pad"),
+                    num_steps=num_steps,
+                    generator=generator,
+                    inference_delay=kwargs.get("inference_delay"),
+                    prev_chunk_left_over=kwargs.get("prev_chunk_left_over"),
+                    execution_horizon=kwargs.get("execution_horizon"),
+                )
+            else:
+                actions = self._backbone().generate_actions_from_inputs(
+                    **model_inputs,
+                    action_dim_is_pad=batch.get("action_dim_is_pad"),
+                    action_horizon=self._generation_action_horizon(),
+                    num_steps=num_steps,
+                    generator=generator,
+                )
+        return actions[:, : self.config.n_action_steps, :action_dim].to(dtype=torch.float32)
+
+    @torch.no_grad()
+    def select_action(self, batch: dict[str, Tensor], **kwargs) -> Tensor:
+        if self._rtc_enabled():
+            raise AssertionError("RTC is not supported for select_action, use it with predict_action_chunk")
+        self.eval()
+        if len(self._action_queue) == 0:
+            actions = self.predict_action_chunk(batch, **kwargs)[:, : self.config.n_action_steps]
+            self._action_queue.extend(actions.transpose(0, 1))
+        return self._action_queue.popleft()
+
+    def _get_default_peft_targets(self) -> dict[str, Any]:
+        target_modules = self._lora_target_modules(prefix=r"model\.model")
+        return {
+            "target_modules": target_modules,
+            "modules_to_save": [],
+            "r": self.config.lora_rank,
+            "lora_alpha": self.config.lora_alpha,
+            "lora_dropout": self.config.lora_dropout,
+            "bias": self.config.lora_bias,
+        }
+
+    def _get_inner_peft_targets(self) -> dict[str, Any]:
+        target_modules = self._lora_target_modules(prefix="model")
+        return {
+            "target_modules": target_modules,
+            "modules_to_save": [],
+            "r": self.config.lora_rank,
+            "lora_alpha": self.config.lora_alpha,
+            "lora_dropout": self.config.lora_dropout,
+            "bias": self.config.lora_bias,
+        }
+
+    def _lora_target_modules(self, *, prefix: str) -> str:
+        vlm_linear_leaves = "w1|w2|w3|wq|wk|wv|wo|att_proj|attn_out|ff_proj|ff_out|patch_embedding"
+        target_modules = rf"{prefix}\.(transformer|vision_backbone)\.(?:.*\.)?({vlm_linear_leaves})$"
+        if self.config.enable_lora_action_expert:
+            action_expert_linear_paths = (
+                r"time_embed\.(1|3)|"
+                r"action_embed|context_k_proj|context_v_proj|"
+                r"blocks\.\d+\.self_attn\.(qkv|out_proj)|"
+                r"blocks\.\d+\.cross_attn\.(q_proj|out_proj)|"
+                r"blocks\.\d+\.mlp\.(up_proj|gate_proj|down_proj)|"
+                r"blocks\.\d+\.modulation\.linear|"
+                r"final_layer\.(modulation\.linear|linear)"
+            )
+            target_modules = (
+                f"({target_modules}|"
+                rf"{prefix}\.action_expert\.({action_expert_linear_paths})$)"
+            )
+        return target_modules
+
+    def _build_inner_lora_config(self):
+        require_package("peft", extra="molmoact2")
+        from peft import LoraConfig
+
+        return LoraConfig(**self._get_inner_peft_targets())
+
+    def _apply_lora_adapters(self) -> None:
+        require_package("peft", extra="molmoact2")
+        from peft import get_peft_model
+
+        peft_config = self._build_inner_lora_config()
+        self._validate_peft_config(peft_config)
+
+        for param in self.model.parameters():
+            param.requires_grad_(False)
+        self.model = get_peft_model(self.model, peft_config)
+        if not self.config.enable_lora_action_expert:
+            self._unfreeze_action_expert_parameters()
+        self.train(self.training)
+
+    def _validate_peft_config(self, peft_config) -> None:
+        del peft_config
+        if not self.config.checkpoint_path:
+            raise ValueError("MolmoAct2 LoRA fine-tuning requires `policy.checkpoint_path`.")
diff --git a/src/lerobot/policies/molmoact2/processor_molmoact2.py b/src/lerobot/policies/molmoact2/processor_molmoact2.py
new file mode 100644
index 000000000..6c7a3ed5c
--- /dev/null
+++ b/src/lerobot/policies/molmoact2/processor_molmoact2.py
@@ -0,0 +1,1083 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import json
+import os
+import re
+from contextlib import suppress
+from copy import deepcopy
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import TYPE_CHECKING, Any
+
+import numpy as np
+import torch
+from huggingface_hub import snapshot_download
+from torch import Tensor
+
+from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
+from lerobot.processor import (
+    AddBatchDimensionProcessorStep,
+    DeviceProcessorStep,
+    NormalizerProcessorStep,
+    PolicyAction,
+    PolicyProcessorPipeline,
+    ProcessorStep,
+    ProcessorStepRegistry,
+    RenameObservationsProcessorStep,
+    UnnormalizerProcessorStep,
+    policy_action_to_transition,
+    transition_to_policy_action,
+)
+from lerobot.types import EnvTransition, TransitionKey
+from lerobot.utils.constants import (
+    ACTION,
+    OBS_IMAGES,
+    OBS_STATE,
+    POLICY_POSTPROCESSOR_DEFAULT_NAME,
+    POLICY_PREPROCESSOR_DEFAULT_NAME,
+)
+from lerobot.utils.import_utils import _scipy_available, _transformers_available, require_package
+
+from .configuration_molmoact2 import MolmoAct2Config, infer_molmoact2_max_sequence_length
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import Qwen2Tokenizer
+
+    from .hf_model.image_processing_molmoact2 import MolmoAct2ImageProcessor
+    from .hf_model.processing_molmoact2 import MolmoAct2Processor
+    from .hf_model.video_processing_molmoact2 import MolmoAct2VideoProcessor
+else:
+    Qwen2Tokenizer = None
+    MolmoAct2ImageProcessor = None
+    MolmoAct2Processor = None
+    MolmoAct2VideoProcessor = None
+
+if TYPE_CHECKING or (_transformers_available and _scipy_available):
+    from .hf_model.action_tokenizer import UniversalActionProcessor
+else:
+    UniversalActionProcessor = None
+
+ACTION_OUTPUT_TOKEN = "<action_output>"  # nosec B105
+ACTION_START_TOKEN = "<action_start>"  # nosec B105
+ACTION_END_TOKEN = "<action_end>"  # nosec B105
+ACTION_TOKEN_PREFIX = "<action_"  # nosec B105
+STATE_START_TOKEN = "<state_start>"  # nosec B105
+STATE_END_TOKEN = "<state_end>"  # nosec B105
+STATE_TOKEN_PREFIX = "<state_"  # nosec B105
+SETUP_START_TOKEN = "<setup_start>"  # nosec B105
+SETUP_END_TOKEN = "<setup_end>"  # nosec B105
+CONTROL_START_TOKEN = "<control_start>"  # nosec B105
+CONTROL_END_TOKEN = "<control_end>"  # nosec B105
+
+_QUESTION_TRAILING_SENTENCE_PUNCTUATION = ".,!?;:\u2026"
+_QUESTION_TRAILING_CLOSERS = "\"'\u201d\u2019)]}"
+_QUESTION_SURROUNDING_DELIMITERS = "\"'`\u201c\u201d\u2018\u2019[](){}"
+_QUESTION_PREFIX_PATTERNS = tuple(
+    re.compile(pattern, flags=re.IGNORECASE)
+    for pattern in (
+        r"^(?:task|instruction|language[_ ]instruction|goal)\s*[:\-]\s*",
+        r"^(?:the\s+task\s+is\s+to|your\s+task\s+is\s+to)\s+",
+    )
+)
+
+
+def _hf_token() -> str | None:
+    return os.environ.get("HF_TOKEN") or os.environ.get("HF_ACCESS_TOKEN")
+
+
+def _resolve_checkpoint_location(
+    checkpoint_path: str,
+    *,
+    revision: str | None = None,
+    force_download: bool = False,
+) -> str:
+    checkpoint_path = str(checkpoint_path or "").strip()
+    if not checkpoint_path:
+        raise ValueError("MolmoAct2 policy requires `checkpoint_path`.")
+    local_path = Path(checkpoint_path).expanduser()
+    if local_path.exists():
+        return str(local_path)
+    return snapshot_download(
+        repo_id=checkpoint_path,
+        repo_type="model",
+        revision=revision,
+        force_download=force_download,
+        ignore_patterns=["*.py", "*.pyc", "__pycache__/*"],
+        token=_hf_token(),
+    )
+
+
+def _load_hf_norm_stats_for_tag(
+    checkpoint_path: str,
+    *,
+    revision: str | None,
+    force_download: bool,
+    norm_tag: str | None,
+) -> tuple[dict[str, dict[str, Any]], dict[str, Any]]:
+    norm_tag = str(norm_tag or "").strip()
+    if not norm_tag:
+        raise ValueError("MolmoAct2 HF checkpoint inference requires `policy.norm_tag` for normalization.")
+
+    checkpoint_location = Path(
+        _resolve_checkpoint_location(
+            checkpoint_path,
+            revision=revision,
+            force_download=force_download,
+        )
+    )
+    config_path = checkpoint_location / "config.json"
+    norm_stats_filename = "norm_stats.json"
+    if config_path.exists():
+        with suppress(OSError, json.JSONDecodeError):
+            norm_stats_filename = str(
+                json.loads(config_path.read_text()).get("norm_stats_filename") or norm_stats_filename
+            )
+
+    stats_path = checkpoint_location / norm_stats_filename
+    if not stats_path.exists():
+        raise FileNotFoundError(
+            f"MolmoAct2 HF checkpoint is missing {norm_stats_filename!r}; cannot resolve norm_tag={norm_tag!r}."
+        )
+    payload = json.loads(stats_path.read_text())
+    metadata_by_tag = payload.get("metadata_by_tag")
+    if not isinstance(metadata_by_tag, dict):
+        raise ValueError(f"MolmoAct2 norm stats file {stats_path} has no metadata_by_tag mapping.")
+    metadata = metadata_by_tag.get(norm_tag)
+    if metadata is None:
+        available = sorted(str(tag) for tag in metadata_by_tag)
+        raise ValueError(f"Unknown MolmoAct2 norm_tag={norm_tag!r}. Available tags: {available}.")
+    if not isinstance(metadata, dict):
+        raise ValueError(f"MolmoAct2 norm_tag={norm_tag!r} metadata must be a mapping.")
+
+    def numeric_stats(raw_stats: dict[str, Any]) -> dict[str, Any]:
+        stats: dict[str, Any] = {}
+        for key, value in raw_stats.items():
+            if key == "names":
+                continue
+            if isinstance(value, (list, tuple)) and any(isinstance(item, str) for item in value):
+                continue
+            stats[key] = deepcopy(value)
+        return stats
+
+    action_stats = metadata.get("action_stats")
+    state_stats = metadata.get("state_stats")
+    if not isinstance(action_stats, dict) or not isinstance(state_stats, dict):
+        raise ValueError(f"MolmoAct2 norm_tag={norm_tag!r} must define action_stats and state_stats.")
+    return {ACTION: numeric_stats(action_stats), OBS_STATE: numeric_stats(state_stats)}, metadata
+
+
+def _strip_processor_config(config: dict[str, Any], *metadata_keys: str) -> dict[str, Any]:
+    return {
+        key: value
+        for key, value in config.items()
+        if key not in {"auto_map", "processor_class", *metadata_keys}
+    }
+
+
+def _load_local_molmoact2_processor(checkpoint_location: str) -> Any:
+    if (
+        Qwen2Tokenizer is None
+        or MolmoAct2ImageProcessor is None
+        or MolmoAct2Processor is None
+        or MolmoAct2VideoProcessor is None
+    ):
+        raise RuntimeError("transformers is required to load MolmoAct2 processor.")
+
+    checkpoint_path = Path(checkpoint_location)
+    processor_config_path = checkpoint_path / "processor_config.json"
+    if not processor_config_path.exists():
+        raise FileNotFoundError(f"MolmoAct2 checkpoint is missing {processor_config_path}.")
+    processor_config = json.loads(processor_config_path.read_text())
+
+    image_config = _strip_processor_config(
+        dict(processor_config.get("image_processor") or {}),
+        "image_processor_type",
+    )
+    video_config = _strip_processor_config(
+        dict(processor_config.get("video_processor") or {}),
+        "video_processor_type",
+    )
+    image_processor = MolmoAct2ImageProcessor(**image_config)
+    video_processor = MolmoAct2VideoProcessor(**video_config)
+    tokenizer = Qwen2Tokenizer.from_pretrained(
+        checkpoint_location,
+        token=_hf_token(),
+    )
+
+    chat_template_path = checkpoint_path / "chat_template.jinja"
+    chat_template = chat_template_path.read_text() if chat_template_path.exists() else None
+    return MolmoAct2Processor(
+        image_processor=image_processor,
+        video_processor=video_processor,
+        tokenizer=tokenizer,
+        chat_template=chat_template,
+        image_use_col_tokens=processor_config.get("image_use_col_tokens", True),
+        use_single_crop_col_tokens=processor_config.get("use_single_crop_col_tokens"),
+        use_single_crop_start_token=processor_config.get("use_single_crop_start_token", True),
+        video_use_col_tokens=processor_config.get("video_use_col_tokens", False),
+        use_frame_special_tokens=processor_config.get("use_frame_special_tokens", True),
+    )
+
+
+def _to_numpy(value: Any) -> np.ndarray:
+    if isinstance(value, np.ndarray):
+        return value
+    if torch.is_tensor(value):
+        return value.detach().cpu().numpy()
+    return np.asarray(value)
+
+
+def _normalize_image(value: Any) -> np.ndarray:
+    arr = _to_numpy(value)
+    while arr.ndim > 3 and int(arr.shape[0]) == 1:
+        arr = arr[0]
+    if arr.ndim == 2:
+        arr = np.stack([arr] * 3, axis=-1)
+    if arr.ndim == 3 and arr.shape[0] in {1, 3, 4} and arr.shape[-1] not in {1, 3, 4}:
+        arr = np.moveaxis(arr, 0, -1)
+    if arr.ndim == 3 and arr.shape[-1] == 1:
+        arr = np.repeat(arr, 3, axis=-1)
+    if arr.ndim != 3 or arr.shape[-1] not in {3, 4}:
+        raise ValueError(f"Unsupported image shape for MolmoAct2: {arr.shape}.")
+    if arr.shape[-1] == 4:
+        arr = arr[..., :3]
+    if arr.dtype in (np.float16, np.float32, np.float64):
+        if arr.size > 0 and float(np.nanmax(arr)) <= 1.0:
+            arr = arr * 255.0
+        arr = np.clip(arr, 0, 255).astype(np.uint8)
+    elif arr.dtype != np.uint8:
+        arr = np.clip(arr, 0, 255).astype(np.uint8)
+    return arr
+
+
+def _normalize_question_text(text: str) -> str:
+    normalized = re.sub(r"\s+", " ", str(text or "")).strip()
+    if not normalized:
+        return ""
+    previous = None
+    while normalized and normalized != previous:
+        previous = normalized
+        normalized = normalized.strip().strip(_QUESTION_SURROUNDING_DELIMITERS).strip()
+        for pattern in _QUESTION_PREFIX_PATTERNS:
+            normalized = pattern.sub("", normalized, count=1).strip()
+        normalized = normalized.rstrip(_QUESTION_TRAILING_SENTENCE_PUNCTUATION).rstrip()
+        normalized = normalized.rstrip(_QUESTION_TRAILING_CLOSERS).rstrip()
+        normalized = normalized.rstrip(_QUESTION_TRAILING_SENTENCE_PUNCTUATION).rstrip()
+    chunks = [chunk.strip() for chunk in re.split(r"[.!?]+", normalized) if chunk.strip()]
+    if len(chunks) > 1:
+        normalized = "; ".join(chunks)
+    return normalized.lower()
+
+
+def _wrap_setup_text(setup_type: str, add_setup_tokens: bool) -> str:
+    setup_type = str(setup_type or "")
+    if setup_type.startswith(SETUP_START_TOKEN) and setup_type.endswith(SETUP_END_TOKEN):
+        return setup_type
+    if not setup_type or not add_setup_tokens:
+        return setup_type
+    return f"{SETUP_START_TOKEN}{setup_type}{SETUP_END_TOKEN}"
+
+
+def _wrap_control_text(control_mode: str, add_control_tokens: bool) -> str:
+    control_mode = str(control_mode or "")
+    if control_mode.startswith(CONTROL_START_TOKEN) and control_mode.endswith(CONTROL_END_TOKEN):
+        return control_mode
+    if not control_mode or not add_control_tokens:
+        return control_mode
+    return f"{CONTROL_START_TOKEN}{control_mode}{CONTROL_END_TOKEN}"
+
+
+def _build_discrete_state_string(state: np.ndarray, num_state_tokens: int) -> str:
+    if num_state_tokens <= 0:
+        raise ValueError(f"num_state_tokens must be > 0, got {num_state_tokens}.")
+    arr = np.asarray(state, dtype=np.float32)
+    arr = np.nan_to_num(arr, nan=0.0, posinf=1.0, neginf=-1.0)
+    arr = np.clip(arr, -1.0, 1.0)
+    scaled = (arr + 1.0) / 2.0 * float(num_state_tokens - 1)
+    token_ids = np.clip(np.rint(scaled).astype(np.int64), 0, int(num_state_tokens) - 1).reshape(-1)
+    return f"{STATE_START_TOKEN}{''.join(f'{STATE_TOKEN_PREFIX}{int(token_id)}>' for token_id in token_ids)}{STATE_END_TOKEN}"
+
+
+def _build_robot_text(
+    *,
+    task: str,
+    discrete_state_string: str,
+    setup_type: str,
+    control_mode: str,
+    add_setup_tokens: bool,
+    add_control_tokens: bool,
+    num_images: int,
+) -> str:
+    setup_text = _wrap_setup_text(setup_type, add_setup_tokens=add_setup_tokens)
+    control_text = _wrap_control_text(control_mode, add_control_tokens=add_control_tokens)
+    state_clause = (
+        f" The current state of the robot is {discrete_state_string}." if discrete_state_string else ""
+    )
+    prompt = (
+        f"The task is to {task}. The setup is {setup_text}.{state_clause} "
+        f"The expected control mode is {control_text}. Given these, what action should the robot take to complete the task?"
+    )
+    if num_images <= 0:
+        image_prefix = ""
+    elif num_images == 1:
+        image_prefix = "<|image|>"
+    else:
+        image_prefix = "".join(f"Image {idx + 1}<|image|>" for idx in range(num_images))
+    return f"{image_prefix}<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n{ACTION_OUTPUT_TOKEN}"
+
+
+def _as_text_list(value: Any, batch_size: int) -> list[str]:
+    if value is None:
+        return [""] * batch_size
+    if isinstance(value, str):
+        return [value] * batch_size
+    if torch.is_tensor(value):
+        if value.ndim == 0:
+            return [str(value.item())] * batch_size
+        flat = value.detach().cpu().reshape(-1).tolist()
+        texts = [str(item) for item in flat]
+    elif isinstance(value, np.ndarray):
+        if value.ndim == 0:
+            return [str(value.item())] * batch_size
+        texts = [str(item) for item in value.reshape(-1).tolist()]
+    elif isinstance(value, (list, tuple)):
+        texts = [str(item) for item in value]
+    else:
+        texts = [str(value)]
+    if len(texts) == batch_size:
+        return texts
+    if len(texts) == 1:
+        return texts * batch_size
+    raise ValueError(f"Expected {batch_size} task strings, got {len(texts)}.")
+
+
+def _tokenize_discrete_action(action: np.ndarray, processor: Any) -> list[int]:
+    arr = np.asarray(action, dtype=np.float32)
+    if arr.ndim == 2:
+        arr = arr[None, :, :]
+    elif arr.ndim == 1:
+        arr = arr[None, None, :]
+    tokens_out = processor(arr)
+    if isinstance(tokens_out, dict):
+        tokens_out = tokens_out.get("input_ids", next(iter(tokens_out.values())))
+    if isinstance(tokens_out, np.ndarray):
+        tokens_out = tokens_out.tolist()
+    if torch.is_tensor(tokens_out):
+        tokens_out = tokens_out.detach().cpu().tolist()
+    if not isinstance(tokens_out, list):
+        raise TypeError(f"Unexpected discrete action tokenizer output type: {type(tokens_out)}")
+    if tokens_out and isinstance(tokens_out[0], (list, tuple, np.ndarray)):
+        tokens_out = tokens_out[0]
+    return [int(token_id) for token_id in tokens_out]
+
+
+def _build_discrete_action_string(action: np.ndarray, processor: Any) -> str:
+    token_ids = _tokenize_discrete_action(action, processor)
+    pieces = "".join(f"{ACTION_TOKEN_PREFIX}{int(token_id)}>" for token_id in token_ids)
+    return f"{ACTION_START_TOKEN}{pieces}{ACTION_END_TOKEN}"
+
+
+def _single_token_id(tokenizer: Any, token: str) -> int:
+    token_ids = tokenizer.encode(token, add_special_tokens=False)
+    if len(token_ids) != 1:
+        raise ValueError(f"MolmoAct2 token {token!r} must encode to one token, got {token_ids}.")
+    return int(token_ids[0])
+
+
+def _flatten_feature_names(raw_names: Any) -> list[str] | None:
+    if raw_names is None:
+        return None
+    if isinstance(raw_names, dict):
+        names: list[str] = []
+        for value in raw_names.values():
+            if isinstance(value, (list, tuple)):
+                names.extend(str(item) for item in value)
+            elif value is not None:
+                names.append(str(value))
+        return names or None
+    if isinstance(raw_names, (list, tuple)):
+        names = [str(item) for item in raw_names]
+        return names or None
+    return [str(raw_names)]
+
+
+def _feature_dim(stats: dict[str, Any] | None) -> int | None:
+    if not isinstance(stats, dict):
+        return None
+    for key in ("mean", "std", "min", "max", "q01", "q99", "q10", "q90", "mask"):
+        value = stats.get(key)
+        if value is None:
+            continue
+        if torch.is_tensor(value):
+            return int(value.shape[-1]) if value.ndim > 0 else None
+        arr = np.asarray(value)
+        return int(arr.shape[-1]) if arr.ndim > 0 else None
+    return None
+
+
+def _stats_array(value: Any) -> np.ndarray | None:
+    if value is None:
+        return None
+    if torch.is_tensor(value):
+        return value.detach().cpu().numpy() if value.ndim > 0 else None
+    arr = np.asarray(value)
+    return arr if arr.ndim > 0 else None
+
+
+def _validate_masked_passthrough_stats(feature_stats: dict[str, Any], mask: list[bool], key: str) -> None:
+    min_values = _stats_array(feature_stats.get("min"))
+    max_values = _stats_array(feature_stats.get("max"))
+    if min_values is None or max_values is None:
+        return
+
+    mask_array = np.asarray(mask, dtype=bool)
+    if (
+        mask_array.ndim != 1
+        or min_values.shape[-1] != mask_array.shape[0]
+        or max_values.shape[-1] != mask_array.shape[0]
+        or not bool((~mask_array).any())
+    ):
+        return
+
+    passthrough_min = min_values[..., ~mask_array]
+    passthrough_max = max_values[..., ~mask_array]
+    if bool(((passthrough_min < -1.0) | (passthrough_max > 1.0)).any()):
+        raise ValueError(
+            f"MolmoAct2 {key} gripper values are outside [-1, 1]. Please set normalize_gripper=True."
+        )
+
+
+def _feature_names_from_meta(dataset_meta: Any | None, feature_key: str) -> list[str] | None:
+    if dataset_meta is None:
+        return None
+
+    root = getattr(dataset_meta, "root", None)
+    candidate_roots = []
+    if root is not None:
+        repo_id = str(getattr(dataset_meta, "repo_id", "") or "").strip()
+        if repo_id:
+            candidate_roots.append(Path(root) / repo_id)
+        candidate_roots.append(Path(root))
+    for candidate_root in candidate_roots:
+        info_path = candidate_root / "meta" / "info.json"
+        if info_path.exists():
+            try:
+                with info_path.open("r", encoding="utf-8") as f:
+                    info = json.load(f)
+                names = _flatten_feature_names((info.get("features") or {}).get(feature_key, {}).get("names"))
+                if names:
+                    return names
+            except (OSError, json.JSONDecodeError, AttributeError):
+                pass
+
+    for container in (
+        getattr(getattr(dataset_meta, "info", None), "features", None),
+        getattr(dataset_meta, "features", None),
+    ):
+        if not isinstance(container, dict):
+            continue
+        feature = container.get(feature_key)
+        if not isinstance(feature, dict):
+            continue
+        names = _flatten_feature_names(feature.get("names"))
+        if names:
+            return names
+    return None
+
+
+def _add_gripper_masks_to_stats(
+    dataset_stats: dict[str, dict[str, Any]] | None,
+    dataset_meta: Any | None,
+    *,
+    normalize_gripper: bool,
+    dataset_feature_names: dict[str, Any] | None = None,
+) -> dict[str, dict[str, Any]] | None:
+    if not dataset_stats:
+        return dataset_stats
+
+    stats = deepcopy(dataset_stats)
+    for key in (ACTION, OBS_STATE):
+        feature_stats = stats.get(key)
+        if not isinstance(feature_stats, dict):
+            continue
+        dim = _feature_dim(feature_stats)
+        if dim is None:
+            continue
+
+        if normalize_gripper:
+            feature_stats["mask"] = [True] * dim
+            continue
+
+        names = _flatten_feature_names((dataset_feature_names or {}).get(key))
+        if names is None:
+            names = _feature_names_from_meta(dataset_meta, key)
+        if names is None:
+            names = _flatten_feature_names(feature_stats.get("names"))
+        if names is None:
+            continue
+        if len(names) != dim:
+            continue
+        mask = ["gripper" not in name.lower() for name in names]
+        _validate_masked_passthrough_stats(feature_stats, mask, key)
+        feature_stats["mask"] = mask
+    return stats
+
+
+def _normalization_masks_from_stats(
+    dataset_stats: dict[str, dict[str, Any]] | None,
+) -> dict[str, list[bool]]:
+    masks: dict[str, list[bool]] = {}
+    for key in (ACTION, OBS_STATE):
+        feature_stats = (dataset_stats or {}).get(key)
+        if not isinstance(feature_stats, dict):
+            continue
+        mask = feature_stats.get("mask")
+        if isinstance(mask, Tensor):
+            mask = mask.detach().cpu().tolist()
+        if isinstance(mask, list) and all(isinstance(value, bool) for value in mask):
+            masks[key] = mask
+    return masks
+
+
+class _MolmoAct2MaskedNormalizationMixin:
+    @staticmethod
+    def _broadcast_feature_mask(mask: Tensor, tensor: Tensor) -> Tensor | None:
+        mask = mask.to(device=tensor.device, dtype=torch.bool)
+        if mask.ndim != 1 or tensor.shape[-1] != mask.shape[0]:
+            return None
+        while mask.ndim < tensor.ndim:
+            mask = mask.unsqueeze(0)
+        return mask
+
+    @staticmethod
+    def _validate_masked_passthrough_range(tensor: Tensor, mask: Tensor, key: str) -> None:
+        passthrough_mask = ~mask.expand_as(tensor)
+        if not bool(passthrough_mask.any()):
+            return
+        passthrough_values = tensor[passthrough_mask]
+        if bool(((passthrough_values < -1.0) | (passthrough_values > 1.0)).any()):
+            raise ValueError(
+                f"MolmoAct2 {key} gripper values are outside [-1, 1]. Please set normalize_gripper=True."
+            )
+
+    def _apply_transform(
+        self, tensor: Tensor, key: str, feature_type: Any, *, inverse: bool = False
+    ) -> Tensor:
+        transformed = super()._apply_transform(tensor, key, feature_type, inverse=inverse)
+        stats = getattr(self, "_tensor_stats", {}).get(key, {})
+        mask = stats.get("mask") if isinstance(stats, dict) else None
+        if mask is None:
+            return transformed
+        mask = self._broadcast_feature_mask(mask, tensor)
+        if mask is None:
+            return transformed
+        if not inverse:
+            self._validate_masked_passthrough_range(tensor, mask, key)
+        return torch.where(mask, transformed, tensor)
+
+
+@ProcessorStepRegistry.register(name="molmoact2_masked_normalizer")
+@dataclass
+class MolmoAct2MaskedNormalizerProcessorStep(_MolmoAct2MaskedNormalizationMixin, NormalizerProcessorStep):
+    pass
+
+
+@ProcessorStepRegistry.register(name="molmoact2_masked_unnormalizer")
+@dataclass
+class MolmoAct2MaskedUnnormalizerProcessorStep(_MolmoAct2MaskedNormalizationMixin, UnnormalizerProcessorStep):
+    pass
+
+
+@ProcessorStepRegistry.register(name="molmoact2_clamp_normalized")
+@dataclass
+class MolmoAct2ClampNormalizedProcessorStep(ProcessorStep):
+    """Clamp q01/q99-normalized state and action to the range used by the old trainer."""
+
+    normalization_masks: dict[str, list[bool]] | None = None
+
+    @staticmethod
+    def _broadcast_feature_mask(mask: list[bool], tensor: Tensor) -> Tensor | None:
+        tensor_mask = torch.tensor(mask, device=tensor.device, dtype=torch.bool)
+        if tensor_mask.ndim != 1 or tensor.shape[-1] != tensor_mask.shape[0]:
+            return None
+        while tensor_mask.ndim < tensor.ndim:
+            tensor_mask = tensor_mask.unsqueeze(0)
+        return tensor_mask
+
+    @staticmethod
+    def _validate_masked_passthrough_range(tensor: Tensor, mask: Tensor, key: str) -> None:
+        passthrough_mask = ~mask.expand_as(tensor)
+        if not bool(passthrough_mask.any()):
+            return
+        passthrough_values = tensor[passthrough_mask]
+        if bool(((passthrough_values < -1.0) | (passthrough_values > 1.0)).any()):
+            raise ValueError(
+                f"MolmoAct2 {key} gripper values are outside [-1, 1]. Please set normalize_gripper=True."
+            )
+
+    def _clamp_tensor(self, tensor: Tensor, key: str) -> Tensor:
+        mask = (self.normalization_masks or {}).get(key)
+        if mask is None:
+            return tensor.clamp(-1.0, 1.0)
+        tensor_mask = self._broadcast_feature_mask(mask, tensor)
+        if tensor_mask is None:
+            return tensor.clamp(-1.0, 1.0)
+        self._validate_masked_passthrough_range(tensor, tensor_mask, key)
+        return torch.where(tensor_mask, tensor.clamp(-1.0, 1.0), tensor)
+
+    def __call__(self, transition: EnvTransition) -> EnvTransition:
+        transition = transition.copy()
+        observation = transition.get(TransitionKey.OBSERVATION)
+        if isinstance(observation, dict) and OBS_STATE in observation:
+            observation = observation.copy()
+            observation[OBS_STATE] = self._clamp_tensor(torch.as_tensor(observation[OBS_STATE]), OBS_STATE)
+            transition[TransitionKey.OBSERVATION] = observation
+        action = transition.get(TransitionKey.ACTION)
+        if action is not None:
+            transition[TransitionKey.ACTION] = self._clamp_tensor(torch.as_tensor(action), ACTION)
+        return transition
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        return features
+
+
+@ProcessorStepRegistry.register(name="molmoact2_pack_inputs")
+@dataclass
+class MolmoAct2PackInputsProcessorStep(ProcessorStep):
+    checkpoint_path: str
+    checkpoint_revision: str | None = None
+    checkpoint_force_download: bool = False
+    action_mode: str = "both"
+    discrete_action_tokenizer: str = "allenai/MolmoAct2-FAST-Tokenizer"
+    image_keys: list[str] = field(default_factory=list)
+    allow_image_key_fallback: bool = False
+    setup_type: str = ""
+    control_mode: str = ""
+    normalize_language: bool = True
+    add_setup_tokens: bool = True
+    add_control_tokens: bool = True
+    num_state_tokens: int = 256
+    max_sequence_length: int | None = None
+    chunk_size: int = 30
+    max_action_dim: int = 32
+    env_action_dim: int | None = None
+
+    def __post_init__(self) -> None:
+        require_package("transformers", extra="molmoact2")
+
+        checkpoint_location = _resolve_checkpoint_location(
+            self.checkpoint_path,
+            revision=self.checkpoint_revision,
+            force_download=bool(self.checkpoint_force_download),
+        )
+        self.processor = _load_local_molmoact2_processor(checkpoint_location)
+        self.action_processor = None
+        if self.action_mode in {"discrete", "both"}:
+            require_package("scipy", extra="molmoact2")
+            if UniversalActionProcessor is None:
+                raise RuntimeError("transformers and scipy are required to load MolmoAct2 action tokenizer.")
+            self.action_processor = UniversalActionProcessor.from_pretrained_local(
+                self.discrete_action_tokenizer,
+            )
+        self._action_start_id = _single_token_id(self.processor.tokenizer, ACTION_START_TOKEN)
+        self._action_end_id = _single_token_id(self.processor.tokenizer, ACTION_END_TOKEN)
+        self._eos_token = self.processor.tokenizer.eos_token or ""
+        self._eos_token_id = self.processor.tokenizer.eos_token_id
+
+    def get_config(self) -> dict[str, Any]:
+        return {
+            "checkpoint_path": self.checkpoint_path,
+            "checkpoint_revision": self.checkpoint_revision,
+            "checkpoint_force_download": self.checkpoint_force_download,
+            "action_mode": self.action_mode,
+            "discrete_action_tokenizer": self.discrete_action_tokenizer,
+            "image_keys": list(self.image_keys),
+            "allow_image_key_fallback": self.allow_image_key_fallback,
+            "setup_type": self.setup_type,
+            "control_mode": self.control_mode,
+            "normalize_language": self.normalize_language,
+            "add_setup_tokens": self.add_setup_tokens,
+            "add_control_tokens": self.add_control_tokens,
+            "num_state_tokens": self.num_state_tokens,
+            "max_sequence_length": self.max_sequence_length,
+            "chunk_size": self.chunk_size,
+            "max_action_dim": self.max_action_dim,
+            "env_action_dim": self.env_action_dim,
+        }
+
+    def _resolve_max_sequence_length(
+        self,
+        *,
+        num_images: int,
+        state_dim: int,
+        action_dim: int,
+        action_horizon: int,
+        include_discrete_action: bool,
+    ) -> int:
+        if self.max_sequence_length is not None:
+            return int(self.max_sequence_length)
+        return infer_molmoact2_max_sequence_length(
+            num_images=num_images,
+            state_dim=state_dim,
+            action_dim=action_dim,
+            action_horizon=action_horizon,
+            include_discrete_action=include_discrete_action,
+        )
+
+    def _batch_size(self, observation: dict[str, Any], action: Tensor | None) -> int:
+        if action is not None:
+            return int(action.shape[0])
+        state = observation.get(OBS_STATE)
+        if torch.is_tensor(state) or isinstance(state, np.ndarray):
+            return int(state.shape[0]) if getattr(state, "ndim", 0) > 1 else 1
+        for key in self._resolve_image_keys(observation):
+            value = observation[key]
+            if torch.is_tensor(value) or isinstance(value, np.ndarray):
+                return int(value.shape[0]) if getattr(value, "ndim", 0) == 4 else 1
+        return 1
+
+    @staticmethod
+    def _observation_image_keys(observation: dict[str, Any]) -> list[str]:
+        keys = [key for key in observation if str(key).startswith(f"{OBS_IMAGES}.")]
+        if not keys:
+            keys = [key for key in observation if str(key).startswith("observation.image")]
+        return sorted(keys)
+
+    def _resolve_image_keys(self, observation: dict[str, Any]) -> list[str]:
+        if self.image_keys:
+            missing = [key for key in self.image_keys if key not in observation]
+            if missing:
+                fallback_keys = self._observation_image_keys(observation)
+                if self.allow_image_key_fallback and fallback_keys:
+                    return fallback_keys
+                raise ValueError(f"MolmoAct2 image_keys missing from observation: {missing}.")
+            return list(self.image_keys)
+        keys = self._observation_image_keys(observation)
+        if not keys:
+            raise ValueError("MolmoAct2 requires at least one image observation.")
+        return sorted(keys)
+
+    def _extract_images(self, observation: dict[str, Any], batch_size: int) -> list[list[np.ndarray]]:
+        images_by_example: list[list[np.ndarray]] = [[] for _ in range(batch_size)]
+        for key in self._resolve_image_keys(observation):
+            value = observation[key]
+            for batch_idx in range(batch_size):
+                item = value
+                if (torch.is_tensor(value) or isinstance(value, np.ndarray)) and getattr(
+                    value, "ndim", 0
+                ) >= 4:
+                    item = value[batch_idx]
+                images_by_example[batch_idx].append(_normalize_image(item))
+        return images_by_example
+
+    def _extract_state(self, observation: dict[str, Any], batch_size: int) -> Tensor:
+        if OBS_STATE not in observation:
+            raise ValueError("MolmoAct2 requires observation.state for discrete state prompting.")
+        state = torch.as_tensor(observation[OBS_STATE], dtype=torch.float32)
+        if state.ndim == 1:
+            state = state.unsqueeze(0)
+        if int(state.shape[0]) != batch_size:
+            raise ValueError(f"State batch size {state.shape[0]} does not match batch size {batch_size}.")
+        return state
+
+    def _pad_action(self, action: Tensor, action_is_pad: Any | None) -> tuple[Tensor, Tensor, Tensor]:
+        if action.ndim == 2:
+            action = action.unsqueeze(1)
+        if action.ndim != 3:
+            raise ValueError(f"MolmoAct2 expected action shape [B, T, D], got {tuple(action.shape)}.")
+        if action.shape[-1] > self.max_action_dim:
+            raise ValueError(
+                f"Action dim {action.shape[-1]} exceeds MolmoAct2 max_action_dim={self.max_action_dim}."
+            )
+        padded = torch.zeros(
+            (*action.shape[:-1], self.max_action_dim),
+            device=action.device,
+            dtype=torch.float32,
+        )
+        padded[..., : action.shape[-1]] = action.to(dtype=torch.float32)
+        action_dim_is_pad = torch.ones(
+            (action.shape[0], self.max_action_dim), device=action.device, dtype=torch.bool
+        )
+        action_dim_is_pad[:, : action.shape[-1]] = False
+        if action_is_pad is None:
+            action_horizon_is_pad = torch.zeros(action.shape[:2], device=action.device, dtype=torch.bool)
+        else:
+            action_horizon_is_pad = torch.as_tensor(action_is_pad, device=action.device, dtype=torch.bool)
+            if action_horizon_is_pad.ndim == 1:
+                action_horizon_is_pad = action_horizon_is_pad.unsqueeze(0)
+            if tuple(action_horizon_is_pad.shape) != tuple(action.shape[:2]):
+                raise ValueError(
+                    "action_is_pad must match action horizon shape: "
+                    f"got {tuple(action_horizon_is_pad.shape)} for action {tuple(action.shape)}."
+                )
+        return padded, action_horizon_is_pad, action_dim_is_pad
+
+    def _build_labels(self, input_ids: Tensor, attention_mask: Tensor) -> Tensor:
+        labels = torch.full_like(input_ids, -100)
+        for batch_idx in range(input_ids.shape[0]):
+            valid = attention_mask[batch_idx].to(dtype=torch.bool)
+            row = input_ids[batch_idx]
+            starts = (row == self._action_start_id).nonzero(as_tuple=False).flatten().tolist()
+            ends = (row == self._action_end_id).nonzero(as_tuple=False).flatten().tolist()
+            end_ptr = 0
+            for start in starts:
+                while end_ptr < len(ends) and ends[end_ptr] < start:
+                    end_ptr += 1
+                if end_ptr >= len(ends):
+                    raise ValueError(
+                        "Found <action_start> without matching <action_end> in MolmoAct2 labels."
+                    )
+                end = int(ends[end_ptr])
+                label_end = end + 1
+                if (
+                    self._eos_token_id is not None
+                    and label_end < int(row.shape[0])
+                    and int(row[label_end]) == int(self._eos_token_id)
+                ):
+                    label_end += 1
+                labels[batch_idx, start:label_end] = row[start:label_end]
+                end_ptr += 1
+            if not starts:
+                raise ValueError("No discrete action span found in MolmoAct2 training text.")
+            labels[batch_idx] = torch.where(
+                valid, labels[batch_idx], torch.full_like(labels[batch_idx], -100)
+            )
+        return labels
+
+    def __call__(self, transition: EnvTransition) -> EnvTransition:
+        transition = transition.copy()
+        observation = transition.get(TransitionKey.OBSERVATION) or {}
+        if not isinstance(observation, dict):
+            raise ValueError("MolmoAct2 expected an observation dictionary.")
+        complementary = dict(transition.get(TransitionKey.COMPLEMENTARY_DATA) or {})
+
+        raw_action = transition.get(TransitionKey.ACTION)
+        action = torch.as_tensor(raw_action, dtype=torch.float32) if raw_action is not None else None
+        batch_size = self._batch_size(observation, action)
+        state = self._extract_state(observation, batch_size)
+        images_by_example = self._extract_images(observation, batch_size)
+
+        task_source = complementary.get("task")
+        if task_source is None:
+            task_source = observation.get("task")
+        if task_source is None:
+            task_source = observation.get("observation.language")
+        if task_source is None:
+            task_source = complementary.get("language_instruction")
+        tasks = _as_text_list(task_source, batch_size)
+        if self.normalize_language:
+            tasks = [_normalize_question_text(task) for task in tasks]
+        complementary["task"] = tasks
+
+        action_padded = None
+        action_horizon_is_pad = None
+        action_dim_is_pad = torch.ones((batch_size, self.max_action_dim), dtype=torch.bool)
+        real_action_dim = int(self.env_action_dim or 0)
+        if action is not None:
+            action_is_pad = complementary.get("action_is_pad")
+            if action_is_pad is None:
+                action_is_pad = complementary.get("action_horizon_is_pad")
+            action_padded, action_horizon_is_pad, action_dim_is_pad = self._pad_action(action, action_is_pad)
+            real_action_dim = int(action.shape[-1])
+        elif real_action_dim > 0:
+            action_dim_is_pad[:, :real_action_dim] = False
+
+        prompt_texts: list[str] = []
+        full_texts: list[str] = []
+        flat_images: list[np.ndarray] = []
+        state_np = state.detach().cpu().numpy()
+        build_action_labels = action is not None and self.action_mode in {"discrete", "both"}
+        for batch_idx in range(batch_size):
+            images = images_by_example[batch_idx]
+            flat_images.extend(images)
+            discrete_state = _build_discrete_state_string(state_np[batch_idx], self.num_state_tokens)
+            prompt = _build_robot_text(
+                task=tasks[batch_idx],
+                discrete_state_string=discrete_state,
+                setup_type=self.setup_type,
+                control_mode=self.control_mode,
+                add_setup_tokens=self.add_setup_tokens,
+                add_control_tokens=self.add_control_tokens,
+                num_images=len(images),
+            )
+            prompt_texts.append(prompt)
+            if build_action_labels:
+                if self.action_processor is None:
+                    raise ValueError("Discrete MolmoAct2 training requires an action tokenizer.")
+                answer = _build_discrete_action_string(
+                    action[batch_idx].detach().cpu().numpy(), self.action_processor
+                )
+                full_texts.append(f"{prompt}{answer}{self._eos_token}")
+            else:
+                full_texts.append(prompt)
+
+        text = full_texts if build_action_labels else prompt_texts
+        inputs = self.processor(text=text, images=flat_images, return_tensors="pt", padding=True)
+        if action is None:
+            action_horizon = self.chunk_size
+        elif action.ndim == 2:
+            action_horizon = 1
+        else:
+            action_horizon = int(action.shape[1])
+        max_sequence_length = self._resolve_max_sequence_length(
+            num_images=max((len(images) for images in images_by_example), default=0),
+            state_dim=int(state.shape[-1]),
+            action_dim=max(real_action_dim, 1),
+            action_horizon=action_horizon,
+            include_discrete_action=build_action_labels,
+        )
+        if int(inputs["input_ids"].shape[1]) > max_sequence_length:
+            raise ValueError(
+                f"MolmoAct2 sequence length {int(inputs['input_ids'].shape[1])} exceeds "
+                f"max_sequence_length={max_sequence_length}."
+            )
+
+        if build_action_labels:
+            inputs["labels"] = self._build_labels(inputs["input_ids"], inputs["attention_mask"])
+
+        complementary.update(dict(inputs))
+        complementary["action_dim_is_pad"] = action_dim_is_pad
+        if action_horizon_is_pad is not None:
+            complementary["action_horizon_is_pad"] = action_horizon_is_pad
+
+        if action_padded is not None:
+            transition[TransitionKey.ACTION] = action_padded
+        transition[TransitionKey.COMPLEMENTARY_DATA] = complementary
+        return transition
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        return features
+
+
+@ProcessorStepRegistry.register(name="molmoact2_clamp_action")
+@dataclass
+class MolmoAct2ClampActionProcessorStep(ProcessorStep):
+    def __call__(self, transition: EnvTransition) -> EnvTransition:
+        transition = transition.copy()
+        action = transition.get(TransitionKey.ACTION)
+        if action is not None:
+            transition[TransitionKey.ACTION] = torch.as_tensor(action).clamp(-1.0, 1.0)
+        return transition
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        return features
+
+
+def make_molmoact2_pre_post_processors(
+    config: MolmoAct2Config,
+    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
+    dataset_meta: Any | None = None,
+) -> tuple[
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    env_action_dim = None
+    if config.output_features and ACTION in config.output_features:
+        env_action_dim = int(config.output_features[ACTION].shape[0])
+
+    hf_metadata: dict[str, Any] = {}
+    if dataset_stats is None and str(config.norm_tag or "").strip():
+        dataset_stats, hf_metadata = _load_hf_norm_stats_for_tag(
+            config.checkpoint_path,
+            revision=config.checkpoint_revision,
+            force_download=bool(config.checkpoint_force_download),
+            norm_tag=config.norm_tag,
+        )
+
+    image_keys = list(config.image_keys)
+    visual_feature_keys = [
+        key for key, feature in config.input_features.items() if feature.type == FeatureType.VISUAL
+    ]
+    if not image_keys and isinstance(hf_metadata.get("camera_keys"), list):
+        metadata_image_keys = [str(key) for key in hf_metadata["camera_keys"]]
+        if not visual_feature_keys or all(key in config.input_features for key in metadata_image_keys):
+            image_keys = metadata_image_keys
+    if not image_keys:
+        image_keys = visual_feature_keys
+    setup_type = config.setup_type or str(hf_metadata.get("setup_type") or "")
+    control_mode = config.control_mode or str(hf_metadata.get("control_mode") or "")
+    chunk_size = int(hf_metadata.get("action_horizon") or config.chunk_size)
+
+    masked_dataset_stats = _add_gripper_masks_to_stats(
+        dataset_stats,
+        dataset_meta,
+        normalize_gripper=config.normalize_gripper,
+        dataset_feature_names=config.dataset_feature_names,
+    )
+    normalization_masks = _normalization_masks_from_stats(masked_dataset_stats)
+
+    input_steps: list[ProcessorStep] = [
+        RenameObservationsProcessorStep(rename_map={}),
+        AddBatchDimensionProcessorStep(),
+        MolmoAct2MaskedNormalizerProcessorStep(
+            features={**config.input_features, **config.output_features},
+            norm_map=config.normalization_mapping,
+            stats=masked_dataset_stats,
+        ),
+        MolmoAct2ClampNormalizedProcessorStep(normalization_masks=normalization_masks),
+        MolmoAct2PackInputsProcessorStep(
+            checkpoint_path=config.checkpoint_path,
+            checkpoint_revision=config.checkpoint_revision,
+            checkpoint_force_download=config.checkpoint_force_download,
+            action_mode=config.action_mode,
+            discrete_action_tokenizer=config.discrete_action_tokenizer,
+            image_keys=image_keys,
+            allow_image_key_fallback=not bool(config.image_keys),
+            setup_type=setup_type,
+            control_mode=control_mode,
+            normalize_language=config.normalize_language,
+            add_setup_tokens=config.add_setup_tokens,
+            add_control_tokens=config.add_control_tokens,
+            num_state_tokens=config.num_state_tokens,
+            max_sequence_length=config.max_sequence_length,
+            chunk_size=chunk_size,
+            max_action_dim=config.expected_max_action_dim,
+            env_action_dim=env_action_dim,
+        ),
+        DeviceProcessorStep(device=config.device),
+    ]
+
+    output_steps: list[ProcessorStep] = [
+        MolmoAct2ClampActionProcessorStep(),
+        MolmoAct2MaskedUnnormalizerProcessorStep(
+            features=config.output_features,
+            norm_map=config.normalization_mapping,
+            stats=masked_dataset_stats,
+        ),
+        DeviceProcessorStep(device="cpu"),
+    ]
+
+    return (
+        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+            steps=input_steps,
+            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+        ),
+        PolicyProcessorPipeline[PolicyAction, PolicyAction](
+            steps=output_steps,
+            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+            to_transition=policy_action_to_transition,
+            to_output=transition_to_policy_action,
+        ),
+    )
diff --git a/tests/policies/molmoact2/test_molmoact2.py b/tests/policies/molmoact2/test_molmoact2.py
new file mode 100644
index 000000000..3631bcc9b
--- /dev/null
+++ b/tests/policies/molmoact2/test_molmoact2.py
@@ -0,0 +1,1397 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The Allen Institute for Artificial Intelligence and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for MolmoAct2's LeRobot policy interface."""
+
+# ruff: noqa: E402
+
+from __future__ import annotations
+
+import json
+from collections import deque
+from types import SimpleNamespace
+
+import numpy as np
+import pytest
+import torch
+import torch.nn.functional as F  # noqa: N812
+
+pytest.importorskip("transformers")
+pytest.importorskip("scipy")
+
+from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature
+from lerobot.policies import get_policy_class, make_policy_config
+from lerobot.policies.molmoact2 import (
+    configuration_molmoact2 as molmoact2_config,
+    modeling_molmoact2 as molmoact2_modeling,
+    processor_molmoact2 as molmoact2_processor,
+)
+from lerobot.policies.molmoact2.configuration_molmoact2 import (
+    MolmoAct2Config,
+    MolmoAct2CosineDecayWithWarmupSchedulerConfig,
+    infer_molmoact2_max_sequence_length,
+)
+from lerobot.policies.molmoact2.modeling_molmoact2 import MolmoAct2Policy
+from lerobot.policies.molmoact2.processor_molmoact2 import (
+    MolmoAct2ClampNormalizedProcessorStep,
+    MolmoAct2MaskedNormalizerProcessorStep,
+    MolmoAct2MaskedUnnormalizerProcessorStep,
+    MolmoAct2PackInputsProcessorStep,
+    _add_gripper_masks_to_stats,
+    _build_discrete_state_string,
+    _normalize_question_text,
+    make_molmoact2_pre_post_processors,
+)
+from lerobot.policies.rtc.configuration_rtc import RTCConfig
+from lerobot.types import TransitionKey
+from lerobot.utils.constants import ACTION, OBS_STATE
+
+
+def test_molmoact2_policy_registration():
+    cfg = make_policy_config("molmoact2", checkpoint_path="/tmp/not-a-real-checkpoint")
+
+    assert cfg.type == "molmoact2"
+    assert cfg.action_mode == "both"
+    assert cfg.normalize_gripper is False
+    assert cfg.enable_knowledge_insulation is False
+    assert cfg.freeze_embedding is True
+    assert cfg.per_episode_seed is False
+    assert cfg.eval_seed is None
+    assert cfg.normalize_language is True
+    assert cfg.get_scheduler_preset().num_decay_steps is None
+    assert cfg.action_delta_indices == list(range(cfg.chunk_size))
+    assert get_policy_class("molmoact2") is MolmoAct2Policy
+
+
+def test_molmoact2_checkpoint_download_ignores_remote_python(monkeypatch):
+    download_kwargs = {}
+
+    def fake_snapshot_download(**kwargs):
+        download_kwargs.update(kwargs)
+        return "/tmp/downloaded-molmoact2"
+
+    monkeypatch.setattr(molmoact2_config, "snapshot_download", fake_snapshot_download)
+
+    checkpoint_location = molmoact2_config._resolve_checkpoint_location("allenai/MolmoAct2")
+
+    assert checkpoint_location == "/tmp/downloaded-molmoact2"
+    assert download_kwargs["ignore_patterns"] == ["*.py", "*.pyc", "__pycache__/*"]
+
+
+def test_molmoact2_scheduler_decay_steps_auto_match_training_steps():
+    param = torch.nn.Parameter(torch.ones(()))
+    optimizer = torch.optim.AdamW([param], lr=0.001)
+    config = MolmoAct2CosineDecayWithWarmupSchedulerConfig(
+        peak_lr=0.01,
+        decay_lr=0.001,
+        num_warmup_steps=10,
+        num_decay_steps=None,
+    )
+
+    scheduler = config.build(optimizer, num_training_steps=100)
+    for _ in range(100):
+        optimizer.step()
+        scheduler.step()
+
+    assert scheduler.get_last_lr() == pytest.approx([0.0001])
+
+
+def test_molmoact2_rollout_generator_uses_eval_seed_per_task():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = MolmoAct2Config(per_episode_seed=True, eval_seed=1000)
+    policy._rollout_action_generator = None
+    policy._rollout_task_key = None
+    policy._rollout_index_for_task = -1
+
+    policy.reset()
+    first = policy._rollout_generator_for_inputs(
+        {"task": ["pick", "pick", "pick"]},
+        batch_size=3,
+        device=torch.device("cpu"),
+    )
+    expected_first = torch.Generator().manual_seed(
+        MolmoAct2Policy._combine_rollout_seeds(first_seed=1000, batch_size=3)
+    )
+    assert torch.allclose(torch.rand(4, generator=first), torch.rand(4, generator=expected_first))
+
+    policy.reset()
+    second = policy._rollout_generator_for_inputs(
+        {"task": ["pick", "pick", "pick"]},
+        batch_size=3,
+        device=torch.device("cpu"),
+    )
+    expected_second = torch.Generator().manual_seed(
+        MolmoAct2Policy._combine_rollout_seeds(first_seed=1003, batch_size=3)
+    )
+    assert torch.allclose(torch.rand(4, generator=second), torch.rand(4, generator=expected_second))
+
+    policy.reset()
+    new_task = policy._rollout_generator_for_inputs(
+        {"task": ["place", "place", "place"]},
+        batch_size=3,
+        device=torch.device("cpu"),
+    )
+    expected_new_task = torch.Generator().manual_seed(
+        MolmoAct2Policy._combine_rollout_seeds(first_seed=1000, batch_size=3)
+    )
+    assert torch.allclose(torch.rand(4, generator=new_task), torch.rand(4, generator=expected_new_task))
+
+
+def test_molmoact2_gripper_mask_uses_feature_names(tmp_path):
+    meta_dir = tmp_path / "meta"
+    meta_dir.mkdir()
+    (meta_dir / "info.json").write_text(
+        json.dumps(
+            {
+                "features": {
+                    ACTION: {"names": {"motors": ["x", "gripper"]}},
+                    OBS_STATE: {"names": {"motors": ["joint", "gripper"]}},
+                }
+            }
+        ),
+        encoding="utf-8",
+    )
+    dataset_meta = SimpleNamespace(root=tmp_path)
+    stats = {
+        ACTION: {"q01": [0.0, 0.0], "q99": [10.0, 10.0]},
+        OBS_STATE: {"q01": [0.0, 0.0], "q99": [10.0, 10.0]},
+    }
+
+    masked_stats = _add_gripper_masks_to_stats(stats, dataset_meta, normalize_gripper=False)
+
+    assert masked_stats is not None
+    assert masked_stats[ACTION]["mask"] == [True, False]
+    assert masked_stats[OBS_STATE]["mask"] == [True, False]
+
+    features = {
+        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(2,)),
+        OBS_STATE: PolicyFeature(type=FeatureType.STATE, shape=(2,)),
+    }
+    norm_map = {
+        FeatureType.ACTION: NormalizationMode.QUANTILES,
+        FeatureType.STATE: NormalizationMode.QUANTILES,
+    }
+    transition = {
+        TransitionKey.OBSERVATION: {OBS_STATE: torch.tensor([[5.0, 0.7]])},
+        TransitionKey.ACTION: torch.tensor([[5.0, -0.7]]),
+    }
+    normalizer = MolmoAct2MaskedNormalizerProcessorStep(
+        features=features,
+        norm_map=norm_map,
+        stats=masked_stats,
+    )
+    normalized = normalizer(transition)
+
+    assert torch.equal(normalized[TransitionKey.OBSERVATION][OBS_STATE], torch.tensor([[0.0, 0.7]]))
+    assert torch.equal(normalized[TransitionKey.ACTION], torch.tensor([[0.0, -0.7]]))
+
+    with pytest.raises(ValueError, match="gripper values are not under \\[-1, 1\\]"):
+        normalizer(
+            {
+                TransitionKey.OBSERVATION: {OBS_STATE: torch.tensor([[5.0, 7.0]])},
+                TransitionKey.ACTION: torch.tensor([[5.0, -0.7]]),
+            }
+        )
+
+    unnormalizer = MolmoAct2MaskedUnnormalizerProcessorStep(
+        features={ACTION: features[ACTION]},
+        norm_map=norm_map,
+        stats=masked_stats,
+    )
+    unnormalized = unnormalizer({TransitionKey.ACTION: torch.tensor([[0.0, -0.7]])})
+
+    assert torch.equal(unnormalized[TransitionKey.ACTION], torch.tensor([[5.0, -0.7]]))
+
+
+def test_molmoact2_gripper_mask_validates_dataset_stats(tmp_path):
+    meta_dir = tmp_path / "meta"
+    meta_dir.mkdir()
+    (meta_dir / "info.json").write_text(
+        json.dumps({"features": {ACTION: {"names": ["x", "gripper"]}}}),
+        encoding="utf-8",
+    )
+    stats = {
+        ACTION: {
+            "min": [-0.5, -2.0],
+            "max": [0.5, 0.5],
+        }
+    }
+
+    with pytest.raises(ValueError, match="gripper values are not under \\[-1, 1\\]"):
+        _add_gripper_masks_to_stats(stats, SimpleNamespace(root=tmp_path), normalize_gripper=False)
+
+    masked_stats = _add_gripper_masks_to_stats(stats, SimpleNamespace(root=tmp_path), normalize_gripper=True)
+    assert masked_stats is not None
+    assert masked_stats[ACTION]["mask"] == [True, True]
+
+
+def test_molmoact2_clamp_normalized_respects_masked_gripper_dims():
+    step = MolmoAct2ClampNormalizedProcessorStep(
+        normalization_masks={
+            ACTION: [True, False],
+            OBS_STATE: [True, False],
+        }
+    )
+    transition = {
+        TransitionKey.OBSERVATION: {OBS_STATE: torch.tensor([[-2.0, 0.8]])},
+        TransitionKey.ACTION: torch.tensor([[2.0, -0.8]]),
+    }
+
+    clamped = step(transition)
+
+    assert torch.equal(clamped[TransitionKey.OBSERVATION][OBS_STATE], torch.tensor([[-1.0, 0.8]]))
+    assert torch.equal(clamped[TransitionKey.ACTION], torch.tensor([[1.0, -0.8]]))
+
+    with pytest.raises(ValueError, match="gripper values are not under \\[-1, 1\\]"):
+        step({TransitionKey.OBSERVATION: {OBS_STATE: torch.tensor([[0.0, 1.2]])}})
+
+
+def test_molmoact2_normalize_gripper_true_keeps_all_dims_normalized(tmp_path):
+    meta_dir = tmp_path / "meta"
+    meta_dir.mkdir()
+    (meta_dir / "info.json").write_text(
+        json.dumps({"features": {ACTION: {"names": ["x", "gripper"]}}}),
+        encoding="utf-8",
+    )
+    stats = {ACTION: {"q01": [0.0, 0.0], "q99": [10.0, 10.0]}}
+
+    masked_stats = _add_gripper_masks_to_stats(
+        stats,
+        SimpleNamespace(root=tmp_path),
+        normalize_gripper=True,
+    )
+
+    assert masked_stats is not None
+    assert masked_stats[ACTION]["mask"] == [True, True]
+
+
+def test_molmoact2_uses_supplied_stats_with_repo_scoped_names(tmp_path):
+    repo_root = tmp_path / "test-org" / "libero"
+    (repo_root / "meta").mkdir(parents=True)
+    (repo_root / "meta" / "info.json").write_text(
+        json.dumps({"features": {ACTION: {"names": ["x", "gripper"]}}}),
+        encoding="utf-8",
+    )
+    base_stats = {ACTION: {"q01": [0.0, 0.0], "q99": [10.0, 10.0]}}
+
+    masked_stats = _add_gripper_masks_to_stats(
+        base_stats,
+        SimpleNamespace(root=tmp_path, repo_id="test-org/libero"),
+        normalize_gripper=False,
+    )
+
+    assert masked_stats is not None
+    assert masked_stats[ACTION]["q01"] == [0.0, 0.0]
+    assert masked_stats[ACTION]["mask"] == [True, False]
+
+
+def test_molmoact2_uses_config_feature_names_without_dataset_meta():
+    base_stats = {ACTION: {"q01": [0.0, 0.0], "q99": [10.0, 10.0]}}
+
+    masked_stats = _add_gripper_masks_to_stats(
+        base_stats,
+        None,
+        normalize_gripper=False,
+        dataset_feature_names={ACTION: ["x", "gripper"]},
+    )
+
+    assert masked_stats is not None
+    assert masked_stats[ACTION]["mask"] == [True, False]
+
+
+def test_molmoact2_processor_uses_available_visual_features_over_missing_metadata_keys(monkeypatch):
+    monkeypatch.setattr(
+        molmoact2_processor,
+        "_load_hf_norm_stats_for_tag",
+        lambda *args, **kwargs: (
+            {},
+            {"camera_keys": ["observation.images.image", "observation.images.wrist_image"]},
+        ),
+    )
+    monkeypatch.setattr(MolmoAct2PackInputsProcessorStep, "__post_init__", lambda self: None)
+    cfg = MolmoAct2Config(
+        checkpoint_path="/tmp/not-a-real-checkpoint",
+        norm_tag="libero",
+        input_features={
+            "observation.images.image": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 224, 224)),
+            "observation.images.image2": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 224, 224)),
+            OBS_STATE: PolicyFeature(type=FeatureType.STATE, shape=(7,)),
+        },
+        output_features={ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(7,))},
+    )
+
+    preprocessor, _ = make_molmoact2_pre_post_processors(cfg)
+    pack_step = next(
+        step for step in preprocessor.steps if isinstance(step, MolmoAct2PackInputsProcessorStep)
+    )
+
+    assert pack_step.image_keys == ["observation.images.image", "observation.images.image2"]
+    assert pack_step.allow_image_key_fallback is True
+
+
+def test_molmoact2_metadata_image_keys_can_fall_back_to_observation_keys():
+    step = object.__new__(MolmoAct2PackInputsProcessorStep)
+    step.image_keys = ["observation.images.image", "observation.images.wrist_image"]
+    step.allow_image_key_fallback = True
+    observation = {
+        "observation.images.image": torch.zeros(3, 4, 4),
+        "observation.images.image2": torch.zeros(3, 4, 4),
+    }
+
+    assert step._resolve_image_keys(observation) == ["observation.images.image", "observation.images.image2"]
+
+
+def test_molmoact2_explicit_image_keys_stay_strict():
+    step = object.__new__(MolmoAct2PackInputsProcessorStep)
+    step.image_keys = ["observation.images.image", "observation.images.wrist_image"]
+    step.allow_image_key_fallback = False
+    observation = {
+        "observation.images.image": torch.zeros(3, 4, 4),
+        "observation.images.image2": torch.zeros(3, 4, 4),
+    }
+
+    with pytest.raises(ValueError, match="wrist_image"):
+        step._resolve_image_keys(observation)
+
+
+def test_enable_lora_vlm_builds_policy_local_peft_config():
+    pytest.importorskip("peft")
+    policy_cfg = MolmoAct2Config(
+        checkpoint_path="/tmp/not-a-real-checkpoint",
+        device="cpu",
+        enable_lora_vlm=True,
+        lora_rank=64,
+        push_to_hub=False,
+    )
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = policy_cfg
+
+    peft_config = policy._build_inner_lora_config()
+
+    assert peft_config.r == 64
+    assert peft_config.target_modules == policy._get_inner_peft_targets()["target_modules"]
+    assert not policy_cfg.use_peft
+
+
+def test_cuda_graph_managers_are_inference_only():
+    class DummyManager:
+        def __init__(self):
+            self.enabled = None
+
+        def set_enabled(self, enabled):
+            self.enabled = enabled
+
+    class DummyBackbone(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.action_cuda_graph_manager = DummyManager()
+
+        def _require_action_expert(self):
+            return torch.nn.Linear(1, 1)
+
+    class DummyModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.model = DummyBackbone()
+            self.depth_decode_cuda_graph_manager = DummyManager()
+
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(train_action_expert_only=False, enable_inference_cuda_graph=True)
+    policy.model = DummyModel()
+
+    policy.train()
+    assert policy.model.model.action_cuda_graph_manager.enabled is False
+    assert policy.model.depth_decode_cuda_graph_manager.enabled is False
+
+    policy.eval()
+    assert policy.model.model.action_cuda_graph_manager.enabled is True
+    assert policy.model.depth_decode_cuda_graph_manager.enabled is True
+
+    policy.config.enable_inference_cuda_graph = False
+    policy.eval()
+    assert policy.model.model.action_cuda_graph_manager.enabled is False
+    assert policy.model.depth_decode_cuda_graph_manager.enabled is False
+
+
+def test_lora_action_expert_target_is_opt_in():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(
+        lora_rank=64,
+        lora_alpha=16,
+        lora_dropout=0.05,
+        lora_bias="none",
+        enable_lora_action_expert=False,
+    )
+
+    targets = policy._get_default_peft_targets()["target_modules"]
+
+    assert "transformer|vision_backbone" in targets
+    assert "action_expert" not in targets
+
+    policy.config.enable_lora_action_expert = True
+    targets = policy._get_default_peft_targets()["target_modules"]
+
+    assert "action_expert" in targets
+    assert "state_encoder" not in targets
+    assert "state_norm" not in targets
+    assert "kv_proj" not in targets
+
+
+def test_enable_lora_vlm_wraps_loaded_hf_model_locally():
+    pytest.importorskip("peft")
+
+    class DummyInnerModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.transformer = torch.nn.Module()
+            self.transformer.wq = torch.nn.Linear(2, 2)
+            self.action_expert = torch.nn.Module()
+            self.action_expert.action_embed = torch.nn.Linear(2, 2)
+
+    class DummyHFModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.config = {}
+            self.model = DummyInnerModel()
+
+        def forward(self, x):
+            return self.model.transformer.wq(x)
+
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(
+        checkpoint_path="/tmp/base",
+        lora_rank=2,
+        lora_alpha=4,
+        lora_dropout=0.0,
+        lora_bias="none",
+        enable_lora_action_expert=False,
+        train_action_expert_only=False,
+        enable_inference_cuda_graph=False,
+    )
+    policy.model = DummyHFModel()
+
+    policy._apply_lora_adapters()
+
+    assert policy._backbone() is policy.model.base_model.model.model
+    trainable = [name for name, param in policy.named_parameters() if param.requires_grad]
+    assert trainable
+    assert any("lora_" in name for name in trainable)
+    assert any("action_expert.action_embed" in name and "lora_" not in name for name in trainable)
+    assert policy.model(torch.ones(1, 2)).shape == (1, 2)
+
+
+def test_lora_vlm_unfreezes_action_expert_base_weights():
+    class DummyInnerModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.transformer = torch.nn.Module()
+            self.transformer.wq = torch.nn.Linear(2, 2)
+            self.action_expert = torch.nn.Module()
+            self.action_expert.action_embed = torch.nn.Linear(2, 2)
+
+    class DummyHFModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.model = DummyInnerModel()
+
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.model = DummyHFModel()
+
+    for param in policy.parameters():
+        param.requires_grad_(False)
+    policy._unfreeze_action_expert_parameters()
+
+    trainable = [name for name, param in policy.named_parameters() if param.requires_grad]
+    assert trainable
+    assert all("action_expert" in name for name in trainable)
+
+
+def test_train_action_expert_only_requires_continuous_action_mode():
+    with pytest.raises(ValueError, match="requires action_mode='continuous'"):
+        MolmoAct2Config(action_mode="both", train_action_expert_only=True)
+
+    with pytest.raises(ValueError, match="incompatible with enable_lora_vlm"):
+        MolmoAct2Config(action_mode="continuous", train_action_expert_only=True, enable_lora_vlm=True)
+
+    cfg = MolmoAct2Config(action_mode="continuous", train_action_expert_only=True)
+    assert cfg.train_action_expert_only
+
+
+def test_molmoact2_sequence_length_is_inferred_from_fixed_token_budget():
+    cfg = MolmoAct2Config(
+        action_mode="both",
+        chunk_size=10,
+        n_action_steps=10,
+        image_keys=["observation.images.image", "observation.images.wrist_image"],
+        input_features={OBS_STATE: PolicyFeature(type=FeatureType.STATE, shape=(8,))},
+        output_features={ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(7,))},
+    )
+
+    assert cfg.max_sequence_length is None
+    assert cfg.inferred_max_sequence_length() == 640
+    assert cfg.inferred_max_sequence_length(include_discrete_action=False) == 576
+    assert (
+        infer_molmoact2_max_sequence_length(
+            num_images=2,
+            state_dim=8,
+            action_dim=7,
+            action_horizon=30,
+            include_discrete_action=True,
+        )
+        == 768
+    )
+
+
+def test_molmoact2_sequence_length_override_is_preserved():
+    cfg = MolmoAct2Config(max_sequence_length=1024)
+
+    assert cfg.inferred_max_sequence_length(num_images=2, state_dim=8, action_dim=7) == 1024
+
+
+def test_train_action_expert_only_freezes_non_action_expert_params():
+    class DummyBackbone(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.transformer = torch.nn.Linear(2, 2)
+            self.vision_backbone = torch.nn.Linear(2, 2)
+            self.action_expert = torch.nn.Linear(2, 2)
+
+        def _require_action_expert(self):
+            return self.action_expert
+
+    class DummyModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.model = DummyBackbone()
+            self.lm_head = torch.nn.Linear(2, 2)
+
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(train_action_expert_only=True)
+    policy.model = DummyModel()
+
+    policy._freeze_non_action_expert_parameters()
+    policy.train()
+
+    assert policy.model.model.action_expert.training
+    assert not policy.model.training
+    assert not policy.model.model.transformer.training
+    assert all(param.requires_grad for param in policy.model.model.action_expert.parameters())
+    assert not any(param.requires_grad for param in policy.model.model.transformer.parameters())
+    assert not any(param.requires_grad for param in policy.model.model.vision_backbone.parameters())
+    assert not any(param.requires_grad for param in policy.model.lm_head.parameters())
+
+
+def test_load_hf_model_accepts_max_action_horizon_schema(monkeypatch):
+    class DummyLoadedModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.config = SimpleNamespace(
+                max_action_dim=32,
+                max_action_horizon=30,
+                action_mode="both",
+                add_action_expert=True,
+            )
+            self.model = torch.nn.Module()
+            self.embed_tokens = torch.nn.Embedding(4, 4)
+            self.lm_head = torch.nn.Linear(4, 4, bias=False)
+
+        def get_input_embeddings(self):
+            return self.embed_tokens
+
+    loaded_model = DummyLoadedModel()
+    resolved_kwargs = {}
+
+    def fake_resolve_checkpoint_location(checkpoint_path, **kwargs):
+        resolved_kwargs.update(kwargs)
+        return checkpoint_path
+
+    config_kwargs = {}
+    model_kwargs = {}
+
+    class DummyHFConfig:
+        @classmethod
+        def from_pretrained(cls, *args, **kwargs):
+            del args
+            config_kwargs.update(kwargs)
+            return SimpleNamespace()
+
+    class DummyMolmoAct2ForConditionalGeneration:
+        @classmethod
+        def from_pretrained(cls, *args, **kwargs):
+            del args
+            model_kwargs.update(kwargs)
+            return loaded_model
+
+    monkeypatch.setattr(molmoact2_modeling, "_resolve_checkpoint_location", fake_resolve_checkpoint_location)
+    monkeypatch.setattr(molmoact2_modeling, "HFMolmoAct2Config", DummyHFConfig)
+    monkeypatch.setattr(
+        molmoact2_modeling,
+        "MolmoAct2ForConditionalGeneration",
+        DummyMolmoAct2ForConditionalGeneration,
+    )
+    monkeypatch.setattr(molmoact2_modeling, "_strict_load_safetensors_weights", lambda *args: None)
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = MolmoAct2Config(
+        checkpoint_path="/tmp/new-schema-checkpoint",
+        checkpoint_revision="main",
+        checkpoint_force_download=True,
+        chunk_size=10,
+        n_action_steps=10,
+        action_mode="both",
+    )
+
+    policy._load_hf_model()
+
+    assert policy.model is loaded_model
+    assert not hasattr(policy.model.config, "action_horizon")
+    assert policy.model.config.max_action_horizon == 10
+    assert policy._generation_action_horizon() == 10
+    assert resolved_kwargs == {"revision": "main", "force_download": True}
+    assert "trust_remote_code" not in config_kwargs
+    assert "trust_remote_code" not in model_kwargs
+
+
+def test_load_hf_model_chunk_size_overrides_larger_than_checkpoint_horizon(monkeypatch):
+    class DummyLoadedModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.config = SimpleNamespace(
+                max_action_dim=32,
+                max_action_horizon=10,
+                action_mode="both",
+                add_action_expert=True,
+            )
+            self.model = torch.nn.Module()
+            self.embed_tokens = torch.nn.Embedding(4, 4)
+            self.lm_head = torch.nn.Linear(4, 4, bias=False)
+
+        def get_input_embeddings(self):
+            return self.embed_tokens
+
+    loaded_model = DummyLoadedModel()
+    monkeypatch.setattr(
+        molmoact2_modeling,
+        "_resolve_checkpoint_location",
+        lambda checkpoint_path, **kwargs: checkpoint_path,
+    )
+
+    class DummyHFConfig:
+        @classmethod
+        def from_pretrained(cls, *args, **kwargs):
+            del args, kwargs
+            return SimpleNamespace()
+
+    class DummyMolmoAct2ForConditionalGeneration:
+        @classmethod
+        def from_pretrained(cls, *args, **kwargs):
+            del args, kwargs
+            return loaded_model
+
+    monkeypatch.setattr(molmoact2_modeling, "HFMolmoAct2Config", DummyHFConfig)
+    monkeypatch.setattr(
+        molmoact2_modeling,
+        "MolmoAct2ForConditionalGeneration",
+        DummyMolmoAct2ForConditionalGeneration,
+    )
+    monkeypatch.setattr(molmoact2_modeling, "_strict_load_safetensors_weights", lambda *args: None)
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = MolmoAct2Config(
+        checkpoint_path="/tmp/new-schema-checkpoint",
+        chunk_size=30,
+        n_action_steps=30,
+        action_mode="both",
+    )
+
+    policy._load_hf_model()
+
+    assert policy.model.config.max_action_horizon == 30
+    assert policy._generation_action_horizon() == 30
+
+
+def test_load_hf_model_rejects_legacy_action_horizon_schema(monkeypatch):
+    class DummyLoadedModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.config = SimpleNamespace(
+                max_action_dim=32,
+                action_horizon=30,
+                action_mode="both",
+                add_action_expert=True,
+            )
+            self.model = torch.nn.Module()
+
+    monkeypatch.setattr(
+        molmoact2_modeling,
+        "_resolve_checkpoint_location",
+        lambda checkpoint_path, **kwargs: checkpoint_path,
+    )
+
+    class DummyHFConfig:
+        @classmethod
+        def from_pretrained(cls, *args, **kwargs):
+            del args, kwargs
+            return SimpleNamespace()
+
+    class DummyMolmoAct2ForConditionalGeneration:
+        @classmethod
+        def from_pretrained(cls, *args, **kwargs):
+            del args, kwargs
+            return DummyLoadedModel()
+
+    monkeypatch.setattr(molmoact2_modeling, "HFMolmoAct2Config", DummyHFConfig)
+    monkeypatch.setattr(
+        molmoact2_modeling,
+        "MolmoAct2ForConditionalGeneration",
+        DummyMolmoAct2ForConditionalGeneration,
+    )
+    monkeypatch.setattr(molmoact2_modeling, "_strict_load_safetensors_weights", lambda *args: None)
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = MolmoAct2Config(
+        checkpoint_path="/tmp/legacy-schema-checkpoint",
+        chunk_size=10,
+        n_action_steps=10,
+        action_mode="both",
+    )
+
+    with pytest.raises(ValueError, match="max_action_horizon"):
+        policy._load_hf_model()
+
+
+def test_rtc_processor_initialization_and_select_action_guard():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(rtc_config=RTCConfig(enabled=True))
+
+    policy.init_rtc_processor()
+
+    assert policy.rtc_processor is not None
+    with pytest.raises(AssertionError, match="RTC is not supported for select_action"):
+        policy.select_action({})
+
+
+def test_select_action_uses_single_full_batch_queue():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(rtc_config=None, n_action_steps=2)
+    policy._action_queue = deque(maxlen=2)
+    calls = 0
+
+    def predict_action_chunk(batch, **kwargs):
+        nonlocal calls
+        del batch, kwargs
+        calls += 1
+        return torch.tensor(
+            [
+                [[1.0], [2.0]],
+                [[3.0], [4.0]],
+            ]
+        )
+
+    policy.predict_action_chunk = predict_action_chunk
+
+    first = policy.select_action({})
+    second = policy.select_action({})
+
+    assert calls == 1
+    assert torch.equal(first, torch.tensor([[1.0], [3.0]]))
+    assert torch.equal(second, torch.tensor([[2.0], [4.0]]))
+
+
+def test_inference_action_mode_is_explicit_and_has_no_action_mode_alias():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = MolmoAct2Config(action_mode="both", inference_action_mode=None)
+    policy._checkpoint_action_mode = None
+
+    with pytest.raises(ValueError, match="inference_action_mode.*explicitly"):
+        policy._resolve_inference_action_mode(None)
+    with pytest.raises(TypeError, match="unexpected keyword argument 'action_mode'"):
+        policy.predict_action_chunk({}, action_mode="continuous")
+
+
+def test_rtc_generation_uses_previous_chunk_prefix():
+    class DummyActionExpert(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.weight = torch.nn.Parameter(torch.tensor(1.0))
+
+        def prepare_context(self, **kwargs):
+            del kwargs
+            return SimpleNamespace()
+
+        def get_or_prepare_modulation_cache(self, timesteps, *, cache_key=None):
+            del cache_key
+            return [SimpleNamespace(conditioning=timestep) for timestep in timesteps]
+
+        def forward_with_context(self, actions, timesteps, *, context, modulation=None):
+            del timesteps, context, modulation
+            return torch.ones_like(actions) * self.weight
+
+    class DummyBackbone(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.config = SimpleNamespace(
+                flow_matching_num_steps=2,
+                max_action_horizon=4,
+                max_action_dim=3,
+            )
+            self.action_expert = DummyActionExpert()
+            self.batch_size = 1
+
+        def _require_action_expert(self):
+            return self.action_expert
+
+        def forward(self, **kwargs):
+            self.batch_size = int(kwargs["input_ids"].shape[0])
+            return SimpleNamespace(past_key_values=object())
+
+        def _extract_kv_states(self, past_key_values):
+            del past_key_values
+            kv = torch.zeros(self.batch_size, 1, 1)
+            return [(kv, kv)]
+
+        def _get_encoder_attention_mask(self, input_ids, attention_mask):
+            del input_ids
+            return attention_mask
+
+        def _depth_gate_from_condition(self, **kwargs):
+            del kwargs
+            return None, None
+
+        def _apply_depth_gate_to_layer_kv_states(self, encoder_kv_states, depth_mask, depth_gate):
+            del depth_mask, depth_gate
+            return encoder_kv_states
+
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(
+        mask_action_dim_padding=True,
+        rtc_config=RTCConfig(enabled=True, execution_horizon=2, max_guidance_weight=1.0),
+    )
+    policy.rtc_processor = None
+    policy.model = torch.nn.Module()
+    policy.model.model = DummyBackbone()
+    policy.init_rtc_processor()
+    model_inputs = {
+        "input_ids": torch.ones(1, 2, dtype=torch.long),
+        "attention_mask": torch.ones(1, 2, dtype=torch.long),
+    }
+    action_dim_is_pad = torch.tensor([[False, False, False]])
+
+    without_prefix = policy._generate_actions_from_inputs_with_rtc(
+        model_inputs=model_inputs,
+        action_dim_is_pad=action_dim_is_pad,
+        num_steps=2,
+        generator=torch.Generator().manual_seed(0),
+        inference_delay=0,
+        prev_chunk_left_over=None,
+        execution_horizon=None,
+    )
+    with_prefix = policy._generate_actions_from_inputs_with_rtc(
+        model_inputs=model_inputs,
+        action_dim_is_pad=action_dim_is_pad,
+        num_steps=2,
+        generator=torch.Generator().manual_seed(0),
+        inference_delay=0,
+        prev_chunk_left_over=torch.zeros(1, 4, 3),
+        execution_horizon=None,
+    )
+
+    assert without_prefix.shape == (1, 4, 3)
+    assert not torch.allclose(without_prefix, with_prefix)
+
+
+def test_discrete_state_string_matches_molmoact2_bins():
+    state = np.asarray([-1.0, 0.0, 1.0, np.nan, np.inf, -np.inf], dtype=np.float32)
+
+    assert _build_discrete_state_string(state, 256) == (
+        "<state_start><state_0><state_128><state_255><state_128><state_255><state_0><state_end>"
+    )
+
+
+def test_question_normalization_matches_release_prompt_style():
+    assert _normalize_question_text("Instruction: Pick up the cube, please!") == "pick up the cube, please"
+    assert (
+        _normalize_question_text("The task is to open drawer. Then close it.") == "open drawer; then close it"
+    )
+
+
+def test_action_padding_marks_only_real_dimensions():
+    step = object.__new__(MolmoAct2PackInputsProcessorStep)
+    step.max_action_dim = 32
+    action = torch.ones(2, 3, 7)
+
+    padded, horizon_is_pad, dim_is_pad = step._pad_action(action, None)
+
+    assert padded.shape == (2, 3, 32)
+    assert torch.equal(padded[..., :7], action)
+    assert torch.count_nonzero(padded[..., 7:]) == 0
+    assert not horizon_is_pad.any()
+    assert not dim_is_pad[:, :7].any()
+    assert dim_is_pad[:, 7:].all()
+
+
+def test_action_dim_padding_loss_reduces_like_old_trainer():
+    loss = torch.arange(2 * 2 * 3 * 4, dtype=torch.float32).reshape(2, 2, 3, 4)
+    action_dim_is_pad = torch.tensor(
+        [
+            [False, False, True, True],
+            [False, True, True, True],
+        ]
+    )
+
+    reduced = MolmoAct2Policy._apply_action_dim_padding_mask(loss, action_dim_is_pad)
+
+    expected = torch.stack(
+        [
+            loss[0, :, :, :2].sum(dim=-1) / 2,
+            loss[1, :, :, :1].sum(dim=-1) / 1,
+        ],
+        dim=0,
+    )
+    assert torch.equal(reduced, expected)
+
+
+def test_action_chunk_padding_keeps_old_mean_denominator():
+    loss = torch.ones(1, 2, 4, 3)
+    action_horizon_is_pad = torch.tensor([[False, False, True, True]])
+
+    masked = MolmoAct2Policy._apply_action_chunk_padding_mask(loss, action_horizon_is_pad)
+
+    assert masked.mean().item() == 0.5  # padded steps are zeroed but still counted in the mean
+
+
+def test_selected_discrete_loss_matches_full_causal_lm_loss():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(
+        softmax_auxiliary_loss=False,
+        softmax_auxiliary_loss_scale=1e-4,
+        discrete_loss_token_weighting="none",
+    )
+    policy.model = torch.nn.Module()
+    policy.model.lm_head = torch.nn.Linear(3, 5, bias=False)
+    outputs = type("Outputs", (), {})()
+    outputs.last_hidden_state = torch.randn(2, 4, 3)
+    labels = torch.tensor(
+        [
+            [-100, 1, 2, -100],
+            [-100, -100, 3, 4],
+        ]
+    )
+
+    selected_loss, z_loss = policy._discrete_loss_from_backbone_outputs({"labels": labels}, outputs)
+
+    logits = policy.model.lm_head(outputs.last_hidden_state)
+    shift_labels = F.pad(labels, (0, 1), value=-100)[..., 1:].contiguous()
+    expected_loss = F.cross_entropy(logits.float().view(-1, 5), shift_labels.view(-1), ignore_index=-100)
+    assert torch.allclose(selected_loss, expected_loss)
+    assert z_loss is None
+
+
+def test_discrete_z_loss_matches_old_trainer_formula():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(
+        softmax_auxiliary_loss=True,
+        softmax_auxiliary_loss_scale=1e-4,
+        discrete_loss_token_weighting="none",
+    )
+    policy.model = torch.nn.Module()
+    policy.model.lm_head = torch.nn.Linear(3, 5, bias=False)
+    outputs = type("Outputs", (), {})()
+    outputs.last_hidden_state = torch.randn(2, 4, 3)
+    labels = torch.tensor(
+        [
+            [-100, 1, 2, -100],
+            [-100, -100, 3, 4],
+        ]
+    )
+
+    ce_loss, z_loss = policy._discrete_loss_from_backbone_outputs({"labels": labels}, outputs)
+
+    logits = policy.model.lm_head(outputs.last_hidden_state).float()
+    shift_labels = F.pad(labels, (0, 1), value=-100)[..., 1:].contiguous()
+    valid = shift_labels != -100
+    expected_ce = F.cross_entropy(logits.view(-1, 5), shift_labels.view(-1), ignore_index=-100)
+    expected_z = 1e-4 * logits.logsumexp(dim=-1)[valid].pow(2).mean()
+    assert torch.allclose(ce_loss, expected_ce)
+    assert z_loss is not None
+    assert torch.allclose(z_loss, expected_z)
+
+
+def test_discrete_reduction_none_preserves_mean_loss():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(
+        softmax_auxiliary_loss=True,
+        softmax_auxiliary_loss_scale=1e-4,
+        discrete_loss_token_weighting="root_subsegments_root_tokens",
+    )
+    policy.model = torch.nn.Module()
+    policy.model.lm_head = torch.nn.Linear(3, 5, bias=False)
+    outputs = type("Outputs", (), {})()
+    outputs.last_hidden_state = torch.randn(3, 5, 3)
+    labels = torch.tensor(
+        [
+            [-100, 1, -100, -100, -100],
+            [-100, -100, 2, 3, -100],
+            [-100, 4, 3, 2, 1],
+        ]
+    )
+
+    ce_mean, z_mean = policy._discrete_loss_from_backbone_outputs(
+        {"labels": labels},
+        outputs,
+        reduction="mean",
+    )
+    ce_none, z_none = policy._discrete_loss_from_backbone_outputs(
+        {"labels": labels},
+        outputs,
+        reduction="none",
+    )
+
+    assert ce_none.shape == (3,)
+    assert z_none is not None
+    assert z_none.shape == (3,)
+    assert torch.allclose(ce_none.mean(), ce_mean)
+    assert torch.allclose(z_none.mean(), z_mean)
+
+
+def test_forward_reduction_none_returns_per_sample_discrete_loss():
+    class DummyBackbone(torch.nn.Module):
+        def __init__(self, hidden_states):
+            super().__init__()
+            self.hidden_states = hidden_states
+
+        def forward(self, **kwargs):
+            del kwargs
+            return SimpleNamespace(last_hidden_state=self.hidden_states)
+
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(
+        action_mode="discrete",
+        inference_action_mode="discrete",
+        model_dtype="float32",
+        softmax_auxiliary_loss=True,
+        softmax_auxiliary_loss_scale=1e-4,
+        discrete_loss_token_weighting="none",
+    )
+    policy.model = torch.nn.Module()
+    policy.model.lm_head = torch.nn.Linear(3, 5, bias=False)
+    hidden_states = torch.randn(2, 4, 3)
+    policy._backbone = lambda: DummyBackbone(hidden_states)
+    batch = {
+        "input_ids": torch.ones(2, 4, dtype=torch.long),
+        "labels": torch.tensor(
+            [
+                [-100, 1, 2, -100],
+                [-100, -100, 3, 4],
+            ]
+        ),
+    }
+
+    loss_none, metrics_none = policy.forward(batch, reduction="none")
+    loss_mean, metrics_mean = policy.forward(batch, reduction="mean")
+
+    assert loss_none.shape == (2,)
+    assert torch.allclose(loss_none.mean(), loss_mean)
+    assert metrics_none["loss"] == pytest.approx(metrics_mean["loss"])
+
+
+def test_discrete_root_token_weighting_matches_old_loss_mask_scaling():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(
+        softmax_auxiliary_loss=True,
+        softmax_auxiliary_loss_scale=1e-4,
+        discrete_loss_token_weighting="root_subsegments_root_tokens",
+    )
+    policy.model = torch.nn.Module()
+    policy.model.lm_head = torch.nn.Linear(3, 5, bias=False)
+    outputs = type("Outputs", (), {})()
+    outputs.last_hidden_state = torch.randn(2, 4, 3)
+    labels = torch.tensor(
+        [
+            [-100, -100, 1, -100],
+            [-100, 2, 3, 4],
+        ]
+    )
+
+    ce_loss, z_loss = policy._discrete_loss_from_backbone_outputs({"labels": labels}, outputs)
+
+    logits = policy.model.lm_head(outputs.last_hidden_state).float()
+    shift_labels = F.pad(labels, (0, 1), value=-100)[..., 1:].contiguous()
+    valid = shift_labels != -100
+    log_z = logits.logsumexp(dim=-1)
+    token_ce = log_z - logits.gather(dim=-1, index=shift_labels.clamp_min(0).unsqueeze(-1)).squeeze(-1)
+    weights = torch.zeros_like(token_ce)
+    counts = valid.sum(dim=1).float()
+    weights[valid] = (2.0 / torch.sqrt(counts))[:, None].expand_as(weights)[valid]
+    expected_ce = (token_ce * weights).sum() / weights.sum()
+    expected_z = 1e-4 * (log_z.pow(2) * weights).sum() / weights.sum()
+    assert torch.allclose(ce_loss, expected_ce)
+    assert z_loss is not None
+    assert torch.allclose(z_loss, expected_z)
+
+
+class _DummyActionTokenizer:
+    def decode(self, tokens, *, time_horizon=None, action_dim=None):
+        decoded = []
+        for token_row in tokens:
+            decoded.append(np.full((time_horizon, action_dim), sum(token_row), dtype=np.float32))
+        return np.stack(decoded)
+
+
+def test_discrete_decode_extracts_action_bins_for_each_batch():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = SimpleNamespace(chunk_size=2)
+    policy.action_tokenizer = _DummyActionTokenizer()
+    policy.model = torch.nn.Module()
+    policy.model.config = SimpleNamespace(
+        action_start_token_id=10,
+        action_end_token_id=11,
+        action_token_start_id=100,
+        num_action_tokens=4,
+        action_horizon=2,
+    )
+
+    actions = policy._decode_discrete_action_chunk(
+        torch.tensor(
+            [
+                [10, 100, 101, 11, 2],
+                [10, 102, 103, 11, 2],
+            ]
+        ),
+        action_dim=2,
+    )
+
+    assert actions.shape == (2, 2, 2)
+    assert torch.equal(actions[0], torch.ones(2, 2))  # bins 0 + 1, summed by the dummy tokenizer
+    assert torch.equal(actions[1], torch.full((2, 2), 5.0))  # bins 2 + 3
+
+
+def test_discrete_predict_action_chunk_uses_hf_cached_generation_path():
+    class DummyOutput:
+        def __init__(self, token_id, batch_size):
+            logits = torch.full((batch_size, 1, 128), -1e9)
+            logits[:, :, token_id] = 1.0
+            self.logits = logits
+            self.past_key_values = object()
+
+    class DummyModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.weight = torch.nn.Parameter(torch.tensor(1.0))
+            self.config = SimpleNamespace(
+                action_start_token_id=10,
+                action_end_token_id=11,
+                action_token_start_id=100,
+                num_action_tokens=4,
+                action_horizon=2,
+            )
+            self.tokens = [10, 100, 101, 11, 2]
+            self.index = 0
+
+        def forward(self, **kwargs):
+            batch_size = int(kwargs["input_ids"].shape[0])
+            return DummyOutput(self.tokens[self.index], batch_size)
+
+        def _consume_generation_tokens(self, token_ids, *, past_key_values, attention_mask):
+            del past_key_values
+            self.index += 1
+            if attention_mask is not None:
+                attention_mask = torch.cat([attention_mask, torch.ones_like(token_ids[:, None])], dim=-1)
+            return DummyOutput(self.tokens[self.index], int(token_ids.shape[0])), attention_mask
+
+        def _require_eos_token_id(self):
+            return 2
+
+        def _action_token_id_to_bin(self):
+            return {100: 0, 101: 1, 102: 2, 103: 3}
+
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = MolmoAct2Config(
+        action_mode="discrete",
+        inference_action_mode="discrete",
+        model_dtype="float32",
+        output_features={ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(2,))},
+        discrete_generation_max_steps=None,
+        discrete_action_tokenizer="unused",
+        chunk_size=2,
+        n_action_steps=1,
+        rtc_config=None,
+    )
+    policy._checkpoint_action_mode = None
+    policy.model = DummyModel()
+    policy.action_tokenizer = _DummyActionTokenizer()
+
+    actions = policy.predict_action_chunk(
+        {
+            "input_ids": torch.ones(1, 3, dtype=torch.long),
+            "attention_mask": torch.ones(1, 3, dtype=torch.long),
+        }
+    )
+
+    assert policy.model.index == 4
+    assert actions.shape == (1, 1, 2)
+    assert torch.equal(actions, torch.ones(1, 1, 2))
+
+
+def test_discrete_predict_action_chunk_uses_graph_backed_ar_decode_when_enabled():
+    class DummyOutput:
+        def __init__(self, token_id, past_key_values):
+            logits = torch.full((1, 1, 128), -1e9)
+            logits[:, :, token_id] = 1.0
+            self.logits = logits
+            self.past_key_values = past_key_values
+
+    class DummyLmHead(torch.nn.Module):
+        def forward(self, hidden_states):
+            token_id = int(hidden_states[0, 0, 0].item())
+            logits = torch.full((1, 1, 128), -1e9)
+            logits[:, :, token_id] = 1.0
+            return logits
+
+    class DummyModel(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.weight = torch.nn.Parameter(torch.tensor(1.0))
+            self.lm_head = DummyLmHead()
+            self.config = SimpleNamespace(
+                action_start_token_id=10,
+                action_end_token_id=11,
+                action_token_start_id=100,
+                num_action_tokens=4,
+                action_horizon=2,
+            )
+            self.tokens = [10, 100, 101, 11, 2]
+            self.index = 0
+            self.used_static_cache = False
+            self.graph_steps = 0
+
+        def forward(self, **kwargs):
+            self.used_static_cache = kwargs.get("past_key_values") == "static-cache"
+            return DummyOutput(self.tokens[self.index], kwargs.get("past_key_values"))
+
+        def _make_ar_decode_static_cache(self, inputs, *, max_steps):
+            assert int(inputs["input_ids"].shape[1]) == 3
+            assert max_steps == 32
+            return "static-cache"
+
+        def _make_depth_decode_attention_bias(self, inputs, past_key_values):
+            assert past_key_values == "static-cache"
+            return torch.ones(1, 1, 35, 35, dtype=torch.float32)
+
+        def _run_ar_decode_step(self, token_ids, *, past_key_values, attention_bias):
+            assert past_key_values == "static-cache"
+            assert attention_bias.shape == (1, 1, 35, 35)
+            self.index += 1
+            self.graph_steps += 1
+            return torch.tensor([[[float(self.tokens[self.index])]]]), past_key_values
+
+        def _require_eos_token_id(self):
+            return 2
+
+        def _action_token_id_to_bin(self):
+            return {100: 0, 101: 1, 102: 2, 103: 3}
+
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.config = MolmoAct2Config(
+        action_mode="discrete",
+        inference_action_mode="discrete",
+        model_dtype="float32",
+        output_features={ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(2,))},
+        discrete_generation_max_steps=None,
+        discrete_action_tokenizer="unused",
+        chunk_size=2,
+        n_action_steps=1,
+        rtc_config=None,
+        enable_inference_cuda_graph=True,
+    )
+    policy._checkpoint_action_mode = None
+    policy.model = DummyModel()
+    policy.action_tokenizer = _DummyActionTokenizer()
+    torch.nn.Module.train(policy, False)
+
+    actions = policy.predict_action_chunk(
+        {
+            "input_ids": torch.ones(1, 3, dtype=torch.long),
+            "attention_mask": torch.ones(1, 3, dtype=torch.long),
+        }
+    )
+
+    assert policy.model.used_static_cache
+    assert policy.model.graph_steps == 4
+    assert actions.shape == (1, 1, 2)
+    assert torch.equal(actions, torch.ones(1, 1, 2))
+
+
+class _DummyMolmoBackbone(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.embed = torch.nn.Embedding(5, 3)
+
+    def get_input_embeddings(self):
+        return self.embed
+
+
+class _DummyMolmoModel(torch.nn.Module):
+    def __init__(self, *, tie_lm_head: bool = False):
+        super().__init__()
+        self.model = _DummyMolmoBackbone()
+        self.lm_head = torch.nn.Linear(3, 5, bias=False)
+        if tie_lm_head:
+            self.lm_head.weight = self.model.embed.weight
+
+    def get_input_embeddings(self):
+        return self.model.embed
+
+
+def test_freeze_embedding_freezes_input_embeddings_only_when_untied():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.model = _DummyMolmoModel()
+
+    policy._freeze_input_embeddings()
+
+    assert not policy.model.model.embed.weight.requires_grad
+    assert policy.model.lm_head.weight.requires_grad
+
+
+def test_freeze_embedding_rejects_tied_lm_head_without_mutating():
+    policy = object.__new__(MolmoAct2Policy)
+    torch.nn.Module.__init__(policy)
+    policy.model = _DummyMolmoModel(tie_lm_head=True)
+
+    with pytest.raises(RuntimeError, match="would also freeze lm_head"):
+        policy._freeze_input_embeddings()
+
+    assert policy.model.model.embed.weight.requires_grad
diff --git a/uv.lock b/uv.lock
index 3eb1dda23..eebbb7f95 100644
--- a/uv.lock
+++ b/uv.lock
@@ -2915,6 +2915,11 @@ metaworld = [
     { name = "scipy" },
     { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" },
 ]
+molmoact2 = [
+    { name = "peft" },
+    { name = "scipy" },
+    { name = "transformers" },
+]
 motorbridge-dep = [
     { name = "motorbridge" },
 ]
@@ -3131,6 +3136,7 @@ requires-dist = [
     { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'sarm'" },
     { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'unitree-g1'" },
     { name = "lerobot", extras = ["metaworld"], marker = "extra == 'all'" },
+    { name = "lerobot", extras = ["molmoact2"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["motorbridge-dep"], marker = "extra == 'rebot'" },
     { name = "lerobot", extras = ["motorbridge-smart-servo-dep"], marker = "extra == 'rebot'" },
     { name = "lerobot", extras = ["multi-task-dit"], marker = "extra == 'all'" },
@@ -3138,6 +3144,7 @@ requires-dist = [
     { name = "lerobot", extras = ["openarms"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["peft"], marker = "extra == 'all'" },
     { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'groot'" },
+    { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'molmoact2'" },
     { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'peft'" },
     { name = "lerobot", extras = ["peft-dep"], marker = "extra == 'wallx'" },
     { name = "lerobot", extras = ["phone"], marker = "extra == 'all'" },
@@ -3165,6 +3172,7 @@ requires-dist = [
     { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'aloha'" },
     { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'libero'" },
     { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'metaworld'" },
+    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'molmoact2'" },
     { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'phone'" },
     { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'pi'" },
     { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'wallx'" },
@@ -3176,6 +3184,7 @@ requires-dist = [
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'groot'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'hilserl'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'libero'" },
+    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'molmoact2'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'multi-task-dit'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'peft'" },
     { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'pi'" },
@@ -3249,7 +3258,7 @@ requires-dist = [
     { name = "transformers", marker = "extra == 'transformers-dep'", specifier = ">=5.4.0,<5.6.0" },
     { name = "wandb", marker = "extra == 'training'", specifier = ">=0.24.0,<0.25.0" },
 ]
-provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "smolvla", "multi-task-dit", "groot", "sarm", "topreward", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
+provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "molmoact2", "smolvla", "multi-task-dit", "groot", "sarm", "topreward", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
 
 [[package]]
 name = "librt"