Fix EVO1 LIBERO rollout processors

Merge remote-tracking branch 'upstream/main' into codex/add-evo1-policy
docs(evo1): format results table
2026-06-16 15:57:03 +00:00 · 2026-06-09 15:10:10 +08:00 · 2026-05-12 17:40:59 +08:00 · 2026-05-12 17:40:18 +08:00 · 2026-05-11 19:47:55 +02:00 · 2026-05-11 21:51:41 +08:00
114 changed files with 6863 additions and 3564 deletions
@@ -382,6 +382,7 @@ jobs:
                --policy.path=\"\$ROBOTWIN_POLICY\" \
                --env.type=robotwin \
                --env.task=\"\$ROBOTWIN_TASKS\" \
+                --env.max_parallel_tasks=5 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
@@ -482,6 +483,7 @@ jobs:
                --policy.path=lerobot/smolvla_robocasa \
                --env.type=robocasa \
                --env.task=CloseFridge,OpenCabinet,OpenDrawer,TurnOnMicrowave,TurnOffStove,CloseToasterOvenDoor,SlideDishwasherRack,TurnOnSinkFaucet,NavigateKitchen,TurnOnElectricKettle \
+                --env.max_parallel_tasks=5 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
@@ -693,6 +695,7 @@ jobs:
                --env.task=\"\$ROBOMME_TASKS\" \
                --env.dataset_split=test \
                --env.task_ids=[0] \
+                --env.max_parallel_tasks=5 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
@@ -800,6 +803,7 @@ jobs:
                --env.type=libero_plus \
                --env.task=\"\$LIBERO_PLUS_SUITE\" \
                --env.task_ids=\"\$LIBERO_PLUS_TASK_IDS\" \
+                --env.max_parallel_tasks=5 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
@@ -900,6 +904,8 @@ jobs:
                --policy.path=lerobot/smolvla_vlabench \
                --env.type=vlabench \
                --env.task=select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
+                --env.episode_length=50 \
+                --env.max_parallel_tasks=5 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
@@ -152,13 +152,14 @@ jobs:
            BASE_VERSION="${VERSION%%-*}"
            echo "Installing pre-release version $BASE_VERSION from TestPyPI..."
            uv pip install \
+              --torch-backend cpu \
              --index-url https://test.pypi.org/simple/ \
              --extra-index-url https://pypi.org/simple \
              --index-strategy unsafe-best-match \
               "lerobot[all]==$BASE_VERSION"
          else
            echo "Installing release version $VERSION from PyPI..."
-            uv pip install "lerobot[all]==$VERSION"
+            uv pip install --torch-backend cpu "lerobot[all]==$VERSION"
          fi
      - name: Check lerobot version
        run: uv run python -c "import lerobot; print(lerobot.__version__)"
@@ -19,19 +19,19 @@ on:
  workflow_dispatch:

  # Runs at 02:00
-  schedule:
-    - cron: "0 2 * * *"
+  # schedule:
+  #   - cron: "0 2 * * *"

 env:
  CLOSE_ISSUE_MESSAGE: >
-    This issue was closed because it has been stalled for 14 days with no activity.
+    This issue was closed because it has been stalled for 30 days with no activity.
    Feel free to reopen if it is still relevant, or to ping a collaborator if you have any questions.
  CLOSE_PR_MESSAGE: >
-    This PR was closed because it has been stalled for 21 days with no activity.
+    This PR was closed because it has been stalled for 30 days with no activity.
    Feel free to reopen if it is still relevant, or to ping a collaborator if you have any questions.
  WARN_ISSUE_MESSAGE: >
    This issue has been automatically marked as stale because it has not had
-    recent activity (6 months). It will be closed if no further activity occurs.
+    recent activity (1 year). It will be closed if no further activity occurs.
    Any change, comment or update to this issue will reset this count.
    Thank you for your contributions.
  WARN_PR_MESSAGE: >
@@ -59,10 +59,10 @@ jobs:
          stale-pr-label: stale
          exempt-issue-labels: never-stale
          exempt-pr-labels: never-stale
-          days-before-issue-stale: 180
-          days-before-issue-close: 14
+          days-before-issue-stale: 365
+          days-before-issue-close: 30
          days-before-pr-stale: 365
-          days-before-pr-close: 21
+          days-before-pr-close: 30
          delete-branch: true
          close-issue-message: ${{ env.CLOSE_ISSUE_MESSAGE }}
          close-pr-message: ${{ env.CLOSE_PR_MESSAGE }}
@@ -232,6 +232,8 @@ Match the policy to the user's **GPU memory** and **time budget**. Numbers below

 All policies typically train for **5–10 epochs** (see §7).

+> **Human-facing version:** the [Compute Hardware Guide](./docs/source/hardware_guide.mdx) reuses the table below and adds a cloud-GPU tier guide and a Hugging Face Jobs pointer.
+
 | Policy      | Batch | Update (ms) | Peak GPU mem (GB) | Best for                                                                                         |
 | ----------- | ----: | ----------: | ----------------: | ------------------------------------------------------------------------------------------------ |
 | `act`       |     4 |    **83.9** |          **0.94** | First-time users, laptops, single-task. Fast and reliable.                                       |
@@ -109,7 +109,7 @@ lerobot-train \

 Similarly to the hardware, you can easily implement your own policy & leverage LeRobot's data collection, training, and visualization tools, and share your model to the HF Hub

-For detailed policy setup guides, see the [Policy Documentation](https://huggingface.co/docs/lerobot/bring_your_own_policies).
+For detailed policy setup guides, see the [Policy Documentation](https://huggingface.co/docs/lerobot/bring_your_own_policies). For GPU/RAM requirements and expected training time per policy, see the [Compute Hardware Guide](https://huggingface.co/docs/lerobot/hardware_guide).

 ## Inference & Evaluation

@@ -39,7 +39,6 @@ from tqdm import tqdm

 from lerobot.datasets.lerobot_dataset import LeRobotDataset
 from lerobot.datasets.video_utils import (
-    VideoEncoderConfig,
    decode_video_frames,
    encode_video_frames,
 )
@@ -252,13 +251,10 @@ def benchmark_encoding_decoding(
            imgs_dir=imgs_dir,
            video_path=video_path,
            fps=fps,
-            camera_encoder_config=VideoEncoderConfig(
-                vcodec=encoding_cfg["vcodec"],
-                pix_fmt=encoding_cfg["pix_fmt"],
-                g=encoding_cfg.get("g"),
-                crf=encoding_cfg.get("crf"),
-                preset=encoding_cfg.get("preset"),
-            ),
+            vcodec=encoding_cfg["vcodec"],
+            pix_fmt=encoding_cfg["pix_fmt"],
+            g=encoding_cfg.get("g"),
+            crf=encoding_cfg.get("crf"),
            # fast_decode=encoding_cfg.get("fastdecode"),
            overwrite=True,
        )
@@ -35,7 +35,7 @@ USER root
 ARG ROBOTWIN_SHA=0aeea2d669c0f8516f4d5785f0aa33ba812c14b4
 RUN apt-get update \
    && apt-get install -y --no-install-recommends \
-         cuda-nvcc-12-4 cuda-cudart-dev-12-4 \
+         cuda-nvcc-12-8 cuda-cudart-dev-12-8 \
         libvulkan1 vulkan-tools \
    && mkdir -p /usr/share/vulkan/icd.d \
    && echo '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.3.0"}}' \
@@ -18,9 +18,8 @@
 # docker build -f docker/Dockerfile.internal -t lerobot-internal .

 # Configure the base image for CI with GPU access
-# TODO(Steven): Bump these versions
-ARG CUDA_VERSION=12.4.1
-ARG OS_VERSION=22.04
+ARG CUDA_VERSION=12.8.1
+ARG OS_VERSION=24.04
 FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}

 # Define Python version argument
@@ -36,16 +35,13 @@ ENV DEBIAN_FRONTEND=noninteractive \

 # Install Python, system dependencies, and uv (as root)
 RUN apt-get update && apt-get install -y --no-install-recommends \
-    software-properties-common build-essential git curl \
-    libglib2.0-0 libgl1-mesa-glx libegl1-mesa ffmpeg \
+    build-essential git curl \
+    libglib2.0-0 libgl1 libegl1 ffmpeg \
    libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
    cmake pkg-config ninja-build \
-    && add-apt-repository -y ppa:deadsnakes/ppa \
-    && apt-get update \
-    && apt-get install -y --no-install-recommends \
-       python${PYTHON_VERSION} \
-       python${PYTHON_VERSION}-venv \
-       python${PYTHON_VERSION}-dev \
+    python${PYTHON_VERSION} \
+    python${PYTHON_VERSION}-venv \
+    python${PYTHON_VERSION}-dev \
    && curl -LsSf https://astral.sh/uv/install.sh | sh \
    && mv /root/.local/bin/uv /usr/local/bin/uv \
    && useradd --create-home --shell /bin/bash user_lerobot \
@@ -8,7 +8,7 @@
  - local: il_robots
    title: Imitation Learning for Robots
  - local: bring_your_own_policies
-    title: Bring Your Own Policies
+    title: Adding a Policy
  - local: integrate_hardware
    title: Bring Your Own Hardware
  - local: hilserl
@@ -24,6 +24,12 @@
  - local: rename_map
    title: Using Rename Map and Empty Cameras
  title: "Tutorials"
+- sections:
+  - local: hardware_guide
+    title: Compute Hardware Guide
+  - local: torch_accelerators
+    title: PyTorch accelerators
+  title: "Compute & Hardware"
 - sections:
  - local: lerobot-dataset-v3
    title: Using LeRobotDataset
@@ -47,6 +53,10 @@
    title: π₀-FAST (Pi0Fast)
  - local: pi05
    title: π₀.₅ (Pi05)
+  - local: eo1
+    title: EO-1
+  - local: evo1
+    title: EVO1
  - local: groot
    title: NVIDIA GR00T N1.5
  - local: xvla
@@ -140,10 +150,6 @@
  - local: cameras
    title: Cameras
  title: "Sensors"
-- sections:
-  - local: torch_accelerators
-    title: PyTorch accelerators
-  title: "Supported Hardware"
 - sections:
  - local: notebooks
    title: Notebooks
@@ -90,6 +90,6 @@ lerobot-record \
  --dataset.single_task="Your task description" \
  --dataset.streaming_encoding=true \
  --dataset.encoder_threads=2 \
-  # --dataset.camera_encoder_config.vcodec=auto \
+  # --dataset.vcodec=auto \
  --policy.path=${HF_USER}/act_policy
 ```
@@ -1,60 +1,37 @@
-# Bring Your Own Policies
+# Adding a Policy

-This tutorial explains how to integrate your own custom policy implementations into the LeRobot ecosystem, allowing you to leverage all LeRobot tools for training, evaluation, and deployment while using your own algorithms.
+This guide walks you through implementing a custom policy and getting it to work with LeRobot's training, evaluation, and deployment tools. There are two paths:

-## Step 1: Create a Policy Package
+- **Plugin (out-of-tree)** — ship your policy as a standalone `lerobot_policy_*` package. Faster, no PR required, easy to iterate. Right for experimentation, internal use, or when you want to publish independently.
+- **In-tree (contributed to LeRobot)** — land your policy directly in `src/lerobot/policies/`. Requires a PR, but makes your policy a first-class citizen of the library.

-Your custom policy should be organized as an installable Python package following LeRobot's plugin conventions.
+The plugin route is usually the right starting point — promote to in-tree once the policy has stabilized and there's clear value in shipping it with the library.

-### Package Structure
+Either way, the building blocks are the same: a configuration class, a policy class, and a processor factory. The first half of this guide covers those shared pieces; the second half covers the path-specific scaffolding ([Path A](#path-a-out-of-tree-plugin), [Path B](#path-b-contributing-in-tree)).

-Create a package with the prefix `lerobot_policy_` (IMPORTANT!) followed by your policy name:
+A note on tone: robot-learning is an actively evolving field, and "what a policy looks like" can shift with each new architecture. The conventions described here exist because they let `lerobot-train` and `lerobot-eval` work uniformly across very different models. When a new policy genuinely doesn't fit them, raise it (in your PR, or an issue) — the conventions are not sacred.

-```bash
-lerobot_policy_my_custom_policy/
-├── pyproject.toml
-└── src/
-    └── lerobot_policy_my_custom_policy/
-        ├── __init__.py
-        ├── configuration_my_custom_policy.py
-        ├── modeling_my_custom_policy.py
-        └── processor_my_custom_policy.py
-```
+---

-### Package Configuration
+## Anatomy of a policy

-Set up your `pyproject.toml`:
+Three building blocks make up every policy. The names below use `my_policy` as a placeholder — replace with your policy's name. That name is load-bearing: it must match the string you pass to `@PreTrainedConfig.register_subclass`, the `MyPolicy.name` class attribute, and the `make_<name>_pre_post_processors` factory function (more on each below).

-```toml
-[project]
-name = "lerobot_policy_my_custom_policy"
-version = "0.1.0"
-dependencies = [
-    # your policy-specific dependencies
-]
-requires-python = ">= 3.12"
+### Configuration class

-[build-system]
-build-backend = # your-build-backend
-requires = # your-build-system
-```
-
-## Step 2: Define the Policy Configuration
-
-Create a configuration class that inherits from [`PreTrainedConfig`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/configs/policies.py) and registers your policy type:
-Here is a template to get you started, customize the parameters and methods as needed for your policy's architecture and training requirements.
+Inherit from [`PreTrainedConfig`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/configs/policies.py) and register your policy type. Here is a template — customize the parameters and methods as needed for your policy's architecture and training requirements.

 ```python
-# configuration_my_custom_policy.py
+# configuration_my_policy.py
 from dataclasses import dataclass, field
 from lerobot.configs import PreTrainedConfig
 from lerobot.optim import AdamWConfig
 from lerobot.optim import CosineDecayWithWarmupSchedulerConfig

-@PreTrainedConfig.register_subclass("my_custom_policy")
+@PreTrainedConfig.register_subclass("my_policy")
@dataclass
-class MyCustomPolicyConfig(PreTrainedConfig):
-    """Configuration class for MyCustomPolicy.
+class MyPolicyConfig(PreTrainedConfig):
+    """Configuration class for MyPolicy.

    Args:
        n_obs_steps: Number of observation steps to use as input
@@ -77,16 +54,20 @@ class MyCustomPolicyConfig(PreTrainedConfig):
            raise ValueError("n_action_steps cannot exceed horizon")

    def validate_features(self) -> None:
-        """Validate input/output feature compatibility."""
+        """Validate input/output feature compatibility.
+
+        Call this explicitly from your policy's __init__ — the base class does not.
+        """
        if not self.image_features:
-            raise ValueError("MyCustomPolicy requires at least one image feature.")
+            raise ValueError("MyPolicy requires at least one image feature.")
        if self.action_feature is None:
-            raise ValueError("MyCustomPolicy requires 'action' in output_features.")
+            raise ValueError("MyPolicy requires 'action' in output_features.")

    def get_optimizer_preset(self) -> AdamWConfig:
        return AdamWConfig(lr=self.optimizer_lr, weight_decay=self.optimizer_weight_decay)

    def get_scheduler_preset(self):
+        """Return a LRSchedulerConfig from lerobot.optim, or None."""
        return None

    @property
@@ -101,8 +82,7 @@ class MyCustomPolicyConfig(PreTrainedConfig):

    @property
    def action_delta_indices(self) -> list[int]:
-        """Relative timestep offsets for the action chunk the dataset loader returns.
-        """
+        """Relative timestep offsets for the action chunk the dataset loader returns."""
        return list(range(self.horizon))

    @property
@@ -110,32 +90,34 @@ class MyCustomPolicyConfig(PreTrainedConfig):
        return None
 ```

-## Step 3: Implement the Policy Class
+The string you pass to `@register_subclass` must match `MyPolicy.name` (next section) and is what users supply as `--policy.type` on the CLI. Default to `AdamWConfig` from `lerobot.optim` for `get_optimizer_preset` unless your policy needs a different optimizer.
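
To make the name-matching requirement concrete, here is a minimal, self-contained sketch of a name-keyed registry. It is not LeRobot's implementation: `BaseConfig`, `_REGISTRY`, and `get_config_class` are hypothetical stand-ins, and only the decorator name `register_subclass` is borrowed from the text above.

```python
from dataclasses import dataclass


class BaseConfig:
    """Hypothetical stand-in for PreTrainedConfig; not the real class."""

    _REGISTRY: dict[str, type["BaseConfig"]] = {}

    @classmethod
    def register_subclass(cls, name: str):
        """Return a decorator that stores the decorated class under `name`."""

        def decorator(subclass: type["BaseConfig"]) -> type["BaseConfig"]:
            if name in cls._REGISTRY:
                raise ValueError(f"policy type {name!r} is already registered")
            cls._REGISTRY[name] = subclass
            return subclass

        return decorator

    @classmethod
    def get_config_class(cls, name: str) -> type["BaseConfig"]:
        """Resolve the string a user would pass as --policy.type."""
        try:
            return cls._REGISTRY[name]
        except KeyError:
            raise ValueError(f"unknown policy type {name!r}") from None


@BaseConfig.register_subclass("my_policy")
@dataclass
class MyPolicyConfig(BaseConfig):
    horizon: int = 50
    n_action_steps: int = 50


config_cls = BaseConfig.get_config_class("my_policy")
config = config_cls(horizon=16, n_action_steps=8)
print(config_cls.__name__, config.horizon)
```

Because the lookup is purely by string, a typo in either the decorator argument or the CLI value surfaces as an "unknown policy type" error rather than a silent fallback.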

-Create your policy implementation by inheriting from [`PreTrainedPolicy`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/pretrained.py):
+### Policy class
+
+Inherit from [`PreTrainedPolicy`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/pretrained.py) and set two class attributes — both are checked by `__init_subclass__`:

 ```python
-# modeling_my_custom_policy.py
+# modeling_my_policy.py
 import torch
 import torch.nn as nn
 from typing import Any

 from lerobot.policies import PreTrainedPolicy
 from lerobot.utils.constants import ACTION
-from .configuration_my_custom_policy import MyCustomPolicyConfig
+from .configuration_my_policy import MyPolicyConfig

-class MyCustomPolicy(PreTrainedPolicy):
-    config_class = MyCustomPolicyConfig  # must match the string in @register_subclass
-    name = "my_custom_policy"
+class MyPolicy(PreTrainedPolicy):
+    config_class = MyPolicyConfig
+    name = "my_policy"  # must match the string in @register_subclass

-    def __init__(self, config: MyCustomPolicyConfig, dataset_stats: dict[str, Any] = None):
+    def __init__(self, config: MyPolicyConfig, dataset_stats: dict[str, Any] | None = None):
        super().__init__(config, dataset_stats)
        config.validate_features()  # not called automatically by the base class
        self.config = config
        self.model = ...  # your nn.Module here

    def reset(self):
-        """Reset episode state."""
+        """Reset per-episode state. Called by lerobot-eval at the start of each episode."""
        ...

    def get_optim_params(self) -> dict:
@@ -147,35 +129,51 @@ class MyCustomPolicy(PreTrainedPolicy):
        ...

    def select_action(self, batch: dict[str, torch.Tensor], **kwargs) -> torch.Tensor:
-        """Return a single action for the current timestep (called at inference)."""
+        """Return a single action for the current timestep (called every step at inference)."""
        ...

-    def forward(self, batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
+    def forward(self, batch: dict[str, torch.Tensor]) -> tuple[torch.Tensor, dict | None]:
        """Compute the training loss.

+        Returns `(loss, output_dict)`. `output_dict` may be `None`; everything in it must be
+        logging-friendly Python natives (no tensors with gradients).
+
        `batch["action_is_pad"]` is a bool mask of shape (B, horizon) that marks
-        timesteps padded because the episode ended before `horizon` steps, you
+        timesteps padded because the episode ended before `horizon` steps; you
        can exclude those from your loss.
        """
        actions = batch[ACTION]
        action_is_pad = batch.get("action_is_pad")
        ...
-        return {"loss": ...}
+        return loss, {"some_loss_component": some_loss_component.item()}
 ```
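
The `action_is_pad` masking described in the `forward` docstring can be illustrated with a small sketch. It uses plain Python lists so it runs without `torch`; the function name `masked_l1_loss` and the choice of an L1 loss are assumptions made for illustration, and a real policy would apply the same idea to tensors.

```python
def masked_l1_loss(
    predicted: list[list[float]],
    target: list[list[float]],
    is_pad: list[list[bool]],
) -> float:
    """Mean absolute error over non-padded timesteps.

    All arguments have shape (B, horizon); a real policy would also carry an
    action dimension and operate on tensors.
    """
    total = 0.0
    n_valid = 0
    for pred_row, target_row, pad_row in zip(predicted, target, is_pad):
        for pred, tgt, pad in zip(pred_row, target_row, pad_row):
            if pad:
                continue  # timestep lies past the end of the episode
            total += abs(pred - tgt)
            n_valid += 1
    # Guard against a batch in which every timestep is padded.
    return total / n_valid if n_valid else 0.0


# Batch of 2 samples, horizon 3; the second episode ended after 1 step.
predicted = [[0.0, 1.0, 2.0], [1.0, 5.0, 5.0]]
target = [[0.0, 2.0, 2.0], [3.0, 0.0, 0.0]]
is_pad = [[False, False, False], [False, True, True]]

loss = masked_l1_loss(predicted, target, is_pad)
print(loss)
```

Without the mask, the two padded timesteps in the second sample would add their large errors to the average even though no real action exists there.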

-## Step 4: Add Data Processors
+The methods called by the train/eval loops:

-Create processor functions. For a concrete reference, see [processor_act.py](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/act/processor_act.py) or [processor_diffusion.py](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/diffusion/processor_diffusion.py).
+| Method                                                            | Used by           | What it does                                                                                                                                                                                                                                         |
+| ----------------------------------------------------------------- | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `reset() -> None`                                                 | `lerobot-eval`    | Clear per-episode state at the start of each episode.                                                                                                                                                                                                |
+| `select_action(batch, **kwargs) -> Tensor`                        | `lerobot-eval`    | Return the next action `(B, action_dim)`. Called every step.                                                                                                                                                                                         |
+| `predict_action_chunk(batch, **kwargs) -> Tensor`                 | the policy itself | Return an action chunk `(B, chunk_size, action_dim)`. Currently abstract on the base class — raise `NotImplementedError` if your policy doesn't chunk.                                                                                               |
+| `forward(batch, reduction="mean") -> tuple[Tensor, dict \| None]` | `lerobot-train`   | Return `(loss, output_dict)`. Accept `reduction="none"` if you want to support per-sample weighting.                                                                                                                                                 |
+| `get_optim_params() -> dict`                                      | the optimizer     | Return `self.parameters()` for simple policies; return a named parameter dict for [multi-optimizer policies](https://github.com/huggingface/lerobot/blob/ecd38c50d7d15b4184cf42649ff1185ee2e11eeb/src/lerobot/policies/sac/modeling_sac.py#L61-L73). |
+| `update() -> None` _(optional)_                                   | `lerobot-train`   | Called after each optimizer step _if defined_. Use for EMA, target nets, replay buffers (TDMPC uses this).                                                                                                                                           |
+
+Batches are flat dictionaries keyed by the constants in [`lerobot.utils.constants`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/utils/constants.py): `OBS_STATE` (`observation.state.<motor>`), `OBS_IMAGES` (`observation.images.<camera>`), `OBS_LANGUAGE`, `ACTION`, etc. Reuse the constants — don't invent new prefixes.
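
One common way for `reset`, `select_action`, and `predict_action_chunk` from the table above to fit together is an action queue. The sketch below is an assumption about a typical chunking policy, not a description of any specific LeRobot model: `ChunkingPolicySketch` is hypothetical, and observations and actions are plain integers to keep it dependency-free.

```python
from collections import deque


class ChunkingPolicySketch:
    """Hypothetical policy that predicts `chunk_size` actions at a time."""

    def __init__(self, chunk_size: int = 3):
        self.chunk_size = chunk_size
        self.n_chunk_calls = 0
        self._queue: deque[int] = deque()

    def reset(self) -> None:
        """Drop actions left over from the previous episode."""
        self._queue.clear()

    def predict_action_chunk(self, observation: int) -> list[int]:
        """Stand-in for a model forward pass: returns `chunk_size` actions."""
        self.n_chunk_calls += 1
        return [observation + i for i in range(self.chunk_size)]

    def select_action(self, observation: int) -> int:
        """Return one action per call, refilling the queue only when empty."""
        if not self._queue:
            self._queue.extend(self.predict_action_chunk(observation))
        return self._queue.popleft()


policy = ChunkingPolicySketch(chunk_size=3)
policy.reset()
actions = [policy.select_action(observation=10 * step) for step in range(4)]
print(actions, policy.n_chunk_calls)
```

In this sketch, skipping `reset()` between episodes would replay stale actions from the previous episode's queue, which is why the eval loop calls it at the start of each episode.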
+
+### Processor functions
+
+LeRobot uses `PolicyProcessorPipeline`s to normalize inputs and de-normalize outputs around your policy. For a concrete reference, see [`processor_act.py`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/act/processor_act.py) or [`processor_diffusion.py`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/diffusion/processor_diffusion.py).

 ```python
-# processor_my_custom_policy.py
+# processor_my_policy.py
 from typing import Any
 import torch

 from lerobot.processor import PolicyAction, PolicyProcessorPipeline


-def make_my_custom_policy_pre_post_processors(
+def make_my_policy_pre_post_processors(
    config,
    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
 ) -> tuple[
@@ -187,11 +185,48 @@ def make_my_custom_policy_pre_post_processors(
    return preprocessor, postprocessor
 ```

-**Important - function naming:** LeRobot discovers your processor by name. The function **must** be called `make_{policy_name}_pre_post_processors` (matching the string you passed to `@PreTrainedConfig.register_subclass`).
+**Important — function naming:** LeRobot discovers your processor by name. The function **must** be called `make_{policy_name}_pre_post_processors` (matching the string you passed to `@PreTrainedConfig.register_subclass`).
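
A rough sketch of what name-based discovery implies is shown below. `find_processor_factory` and `fake_policy_module` are hypothetical; the actual lookup inside LeRobot may differ, and only the `make_{policy_name}_pre_post_processors` pattern comes from the text above.

```python
import types


def make_my_policy_pre_post_processors(config, dataset_stats=None):
    """Placeholder factory; a real one would return two processor pipelines."""
    return f"pre({config})", f"post({config})"


# Stand-in for an installed policy package; a real lookup would import the module.
fake_policy_module = types.SimpleNamespace(
    make_my_policy_pre_post_processors=make_my_policy_pre_post_processors,
)


def find_processor_factory(module, policy_name: str):
    """Hypothetical helper: resolve the factory from the policy name alone."""
    function_name = f"make_{policy_name}_pre_post_processors"
    factory = getattr(module, function_name, None)
    if factory is None:
        raise AttributeError(f"expected a function named {function_name!r}")
    return factory


factory = find_processor_factory(fake_policy_module, "my_policy")
preprocessor, postprocessor = factory(config="cfg")
print(preprocessor, postprocessor)
```

Since nothing but the string ties the pieces together, renaming the policy means renaming the factory function in the same change.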

-## Step 5: Package Initialization
+---

-Expose your classes in the package's `__init__.py`:
+## Path A: Out-of-tree plugin
+
+The fastest way to ship a policy: package it as a standalone Python distribution and install it alongside LeRobot. No PR required, you own the release cycle, and you can publish to PyPI under your own namespace.
+
+### Package structure
+
+Create a package with the prefix `lerobot_policy_` (IMPORTANT!) followed by your policy name:
+
+```bash
+lerobot_policy_my_policy/
+├── pyproject.toml
+└── src/
+    └── lerobot_policy_my_policy/
+        ├── __init__.py
+        ├── configuration_my_policy.py
+        ├── modeling_my_policy.py
+        └── processor_my_policy.py
+```
+
+### `pyproject.toml`
+
+```toml
+[project]
+name = "lerobot_policy_my_policy"
+version = "0.1.0"
+dependencies = [
+    # your policy-specific dependencies
+]
+requires-python = ">= 3.12"
+
+[build-system]
+build-backend = # your-build-backend
+requires = # your-build-system
+```
+
+### Package `__init__.py`
+
+Expose your classes in the package's `__init__.py` and guard against missing `lerobot`:

 ```python
 # __init__.py
@@ -204,44 +239,148 @@ except ImportError:
        "lerobot is not installed. Please install lerobot to use this policy package."
    )

-from .configuration_my_custom_policy import MyCustomPolicyConfig
-from .modeling_my_custom_policy import MyCustomPolicy
-from .processor_my_custom_policy import make_my_custom_policy_pre_post_processors
+from .configuration_my_policy import MyPolicyConfig
+from .modeling_my_policy import MyPolicy
+from .processor_my_policy import make_my_policy_pre_post_processors

 __all__ = [
-    "MyCustomPolicyConfig",
-    "MyCustomPolicy",
-    "make_my_custom_policy_pre_post_processors",
+    "MyPolicyConfig",
+    "MyPolicy",
+    "make_my_policy_pre_post_processors",
 ]
 ```

-## Step 6: Installation and Usage
-
-### Install Your Policy Package
+### Install and use

 ```bash
-cd lerobot_policy_my_custom_policy
+cd lerobot_policy_my_policy
 pip install -e .

 # Or install from PyPI if published
-pip install lerobot_policy_my_custom_policy
+pip install lerobot_policy_my_policy
 ```

-### Use Your Policy
-
 Once installed, your policy automatically integrates with LeRobot's training and evaluation tools:

 ```bash
 lerobot-train \
-    --policy.type my_custom_policy \
+    --policy.type my_policy \
    --env.type pusht \
    --steps 200000
 ```

-## Examples and Community Contributions
+---
+
+## Path B: Contributing in-tree
+
+When your policy has stabilized and there's clear value in shipping it with the library, you can land it directly in LeRobot. Read the general [contribution guide](./contributing) and the [PR template](https://github.com/huggingface/lerobot/blob/main/.github/PULL_REQUEST_TEMPLATE.md) first — that's where you'll find the testing/quality expectations every PR has to meet (`pre-commit run -a`, `pytest`, the community-review rule, etc.). What's below is the policy-specific layer on top of that.
+
+### In-tree layout
+
+```
+src/lerobot/policies/my_policy/
+├── __init__.py                    # re-exports config + modeling + processor factory
+├── configuration_my_policy.py     # MyPolicyConfig + @register_subclass
+├── modeling_my_policy.py          # MyPolicy(PreTrainedPolicy)
+├── processor_my_policy.py         # make_my_policy_pre_post_processors
+└── README.md                      # symlink → ../../../../docs/source/policy_my_policy_README.md
+```
+
+Two notes:
+
+- The `README.md` next to the source is a **symlink** into `docs/source/policy_<name>_README.md` — the actual file lives under `docs/`. Existing policies (act, smolvla, diffusion, …) all do this; copy one of those symlinks. The policy README is conventionally minimal: paper link + BibTeX citation.
+- The user-facing tutorial — what to install, how to train, hyperparameters, benchmark numbers — lives separately at `docs/source/<my_policy>.mdx` and is registered in `_toctree.yml` under "Policies".
+
+The file names are load-bearing: the factory does lazy imports by name, and the processor is discovered by the `make_<policy_name>_pre_post_processors` convention.
+
+### Wiring
+
+Three places need to know about your policy, all by name.
+
+1. **`policies/__init__.py`** — re-export `MyPolicyConfig` and add it to `__all__`. **Don't** re-export the modeling class; it loads lazily through the factory (so `import lerobot` stays fast).
+2. **`factory.py:get_policy_class`** — add a branch returning `MyPolicy` from a lazy import.
+3. **`factory.py:make_policy_config`** and **`factory.py:make_pre_post_processors`** — same idea, two more branches.
+
+Mirror an existing policy that's structurally similar to yours; the diff is small.
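
The "branch returning a class from a lazy import" pattern from the wiring steps can be sketched as follows. This is an illustration only: the function body is hypothetical, and standard-library classes stand in for policy classes so the snippet runs without LeRobot installed.

```python
def get_policy_class(name: str) -> type:
    """Hypothetical factory: each branch imports its class only when requested."""
    if name == "fraction":
        from fractions import Fraction  # deferred until this branch runs

        return Fraction
    if name == "ordered_dict":
        from collections import OrderedDict

        return OrderedDict
    raise ValueError(f"unknown policy type {name!r}")


policy_cls = get_policy_class("fraction")
print(policy_cls.__name__, policy_cls(1, 2))
```

Keeping the import inside the branch means the cost of loading a heavy modeling module is paid only by users who actually request that policy.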
+
+### Heavy / optional dependencies
+
+Most policies need a heavy backbone (transformers, diffusers, a specific VLM SDK). The convention is **two-step gating**: a `TYPE_CHECKING`-guarded import at module top, and a `require_package` runtime check in the constructor. [`modeling_diffusion.py`](https://github.com/huggingface/lerobot/blob/main/src/lerobot/policies/diffusion/modeling_diffusion.py) is the canonical reference:
+
+```python
+from typing import TYPE_CHECKING
+from lerobot.utils.import_utils import _diffusers_available, require_package
+
+if TYPE_CHECKING or _diffusers_available:
+    from diffusers.schedulers.scheduling_ddim import DDIMScheduler
+else:
+    DDIMScheduler = None  # keeps the symbol bindable at import time
+
+class DiffusionPolicy(PreTrainedPolicy):
+    def __init__(self, config):
+        require_package("diffusers", extra="diffusion")
+        super().__init__(config)
+        ...
+```
+
+This way:
+
+- `import lerobot.policies` keeps working without the extra installed (the symbol is just bound to `None`).
+- Type checkers see the real symbol.
+- Instantiating the policy without the extra raises a clear `ImportError` pointing at `pip install 'lerobot[diffusion]'`.
+
+Add a matching extra to [`pyproject.toml`](https://github.com/huggingface/lerobot/blob/main/pyproject.toml) `[project.optional-dependencies]` and include it in the `all` extra so `pip install 'lerobot[all]'` keeps installing everything.
+
+### Benchmarks and a published checkpoint
+
+A new policy is much easier to review — and far more useful — when it ships with a working checkpoint and at least one number you can reproduce.
+
+**Pick at least one in-tree benchmark.** LeRobot ships sim benchmarks with per-benchmark Docker images (LIBERO, LIBERO-plus, Meta-World, RoboTwin 2.0, RoboCasa365, RoboCerebra, RoboMME, VLABench and more). Pick the one that matches your policy's modality — VLAs usually go to LIBERO or VLABench; image-only BC to LIBERO or Meta-World. The full list lives under [Benchmarks](./libero) in the docs sidebar.
+
+**Push the checkpoint & processors** to the Hub under `lerobot/<policy>_<benchmark>` (or your namespace if you don't have write access; a maintainer can mirror it). Use `PreTrainedPolicy.push_model_to_hub` so the repo gets `config.json`, `model.safetensors`, and a model card.
+
+**Report results in your policy's MDX**, with the exact `lerobot-eval` command and hardware so anyone can re-run:
+
+```markdown
+## Results
+
+Evaluated on LIBERO with `lerobot/<policy>_libero`:
+
+| Suite          | Success rate | n_episodes |
+| -------------- | -----------: | ---------: |
+| libero_spatial |        87.5% |         50 |
+| libero_object  |        93.0% |         50 |
+| libero_goal    |        81.5% |         50 |
+| libero_10      |        62.0% |         50 |
+| **average**    |    **81.0%** |        200 |
+
+Reproduce: `lerobot-eval --policy.path=lerobot/<policy>_libero --env.type=libero --env.task=libero_spatial --eval.n_episodes=50` (1× A100 40 GB).
+```
+
+Use `n_episodes ≥ 50` per suite for stable success-rate estimates.
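
A quick way to see why the episode count matters is to compute the standard error of a success rate. The sketch below assumes episodes are independent and identically distributed (a simple binomial model, which real evaluations only approximate); the helper name `success_rate_standard_error` is made up for this example.

```python
import math


def success_rate_standard_error(success_rate: float, n_episodes: int) -> float:
    """Standard error of a success rate under a binomial model."""
    return math.sqrt(success_rate * (1.0 - success_rate) / n_episodes)


for n_episodes in (10, 50, 200):
    stderr = success_rate_standard_error(0.875, n_episodes)
    print(f"n_episodes={n_episodes:>3}: 87.5% +/- {100 * stderr:.1f} points (1 s.e.)")
```

Under these assumptions, an 87.5% success rate carries an uncertainty of roughly 10 points at 10 episodes and under 5 points at 50, so small differences between policies are hard to interpret at low episode counts.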
+
+If your policy is real-robot-only and no sim benchmark applies, swap the sim eval for: a public training dataset on the Hub, the `lerobot-train` command, the checkpoint, and a real-robot success rate over ≥10 episodes via `lerobot-rollout --policy.path=...`.
+
+### PR checklist
+
+The general expectations are in [`CONTRIBUTING.md`](https://github.com/huggingface/lerobot/blob/main/CONTRIBUTING.md) and the [PR template](https://github.com/huggingface/lerobot/blob/main/.github/PULL_REQUEST_TEMPLATE.md). On top of those, reviewers will look for:
+
+- [ ] `MyPolicy` and `MyPolicyConfig` cover the surface above; `__init_subclass__` accepts the class.
+- [ ] `factory.py` and `policies/__init__.py` are wired (lazy imports for modeling).
+- [ ] `make_my_policy_pre_post_processors` follows the naming convention.
+- [ ] Optional deps live behind a `[project.optional-dependencies]` extra and the `TYPE_CHECKING + require_package` guard.
+- [ ] `tests/policies/` updated: backward-compat artifact committed and policy-specific tests added.
+- [ ] `src/lerobot/policies/<name>/README.md` symlinked into `docs/source/policy_<name>_README.md`; user-facing `docs/source/<name>.mdx` written and added to `_toctree.yml`.
+- [ ] At least one reproducible benchmark eval in the policy MDX with a published checkpoint (sim benchmark, or real-robot dataset + checkpoint).
+
+The fastest way to get a clean PR is to copy the directory of the existing policy closest to yours, rename it, and replace the contents method by method. Don't wait until everything is polished: open a draft PR early and iterate with us; reviewers would much rather give feedback on a half-finished branch than on a fully polished one.
+
+---
+
+## Examples and community contributions

 Check out these example policy implementations:

- [DiTFlow Policy](https://github.com/danielsanjosepro/lerobot_policy_ditflow) - Diffusion Transformer policy with flow-matching objective. Try it out in this example: [DiTFlow Example](https://github.com/danielsanjosepro/test_lerobot_policy_ditflow)
+- [DiTFlow Policy](https://github.com/danielsanjosepro/lerobot_policy_ditflow) — Diffusion Transformer policy with flow-matching objective. Try it out in this example: [DiTFlow Example](https://github.com/danielsanjosepro/test_lerobot_policy_ditflow)

-Share your policy implementations with the community! 🤗
+Thanks for taking the time to bring a new policy into LeRobot. Every architecture that lands in `main` — and every plugin published by the community — makes the library a little more useful for the next person, and a little more representative of where robot learning is going. We're looking forward to seeing what you ship. 🤗
@@ -194,7 +194,7 @@ lerobot-record \
    --dataset.single_task="Navigate around obstacles" \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    # --dataset.camera_encoder_config.vcodec=auto \
+    # --dataset.vcodec=auto \
    --display_data=true
 ```

@@ -0,0 +1,168 @@
+# EO-1
+
+EO-1 is a **Vision-Language-Action policy for robot control**. The LeRobot implementation integrates EO-1 with the standard LeRobot training, evaluation, and processor interfaces.
+
+## Model Overview
+
+EO-1 uses a Qwen2.5-VL backbone for vision-language understanding and adds a continuous flow-matching action head for robot control. The policy formats each robot-control sample as a multimodal conversation: camera images are passed to Qwen2.5-VL, the robot state is represented with EO-1 state tokens, and the future action chunk is represented with EO-1 action tokens.
+
+<img
+  src="https://huggingface.co/datasets/HaomingSong/lerobot-documentation-images/resolve/main/lerobot/eo_pipeline.png"
+  alt="An overview of EO-1"
+  width="85%"
+/>
+
+During training, EO-1 learns to denoise continuous action chunks at the action-token positions. During inference, it samples an action chunk, returns continuous actions, and executes `n_action_steps` from the chunk before sampling again.
+
+### What the LeRobot Integration Covers
+
+- Standard `policy.type=eo1` configuration through LeRobot
+- Qwen2.5-VL image and text preprocessing through policy processors
+- Continuous flow-matching action prediction
+- Checkpoint save/load through LeRobot policy APIs
+- Training with `lerobot-train` and evaluation with `lerobot-eval`
+
+The broader EO-1 project also includes interleaved vision-text-action pretraining and multimodal reasoning workflows. This page focuses on the LeRobot robot-control policy path.
+
+## Installation Requirements
+
+1. Install LeRobot by following the [Installation Guide](./installation).
+2. Install EO-1 dependencies by running:
+
+   ```bash
+   pip install -e ".[eo1]"
+   ```
+
+3. If you want to train or evaluate on LIBERO, install the LIBERO dependencies too:
+
+   ```bash
+   pip install -e ".[eo1,libero]"
+   ```
+
+EO-1 can use the standard PyTorch scaled-dot-product attention backend through `policy.attn_implementation=sdpa`. If your environment has a compatible `flash_attn` installation, you can request `policy.attn_implementation=flash_attention_2`.
+
+## Data Requirements
+
+EO-1 expects a LeRobot dataset with:
+
+- At least one visual observation, for example `observation.images.image`
+- `observation.state`
+- `action`
+- A language task instruction through the dataset `task` field
+
+If your dataset uses different observation names, use `rename_map` to align them with the names expected by your training or evaluation setup.
+
+## Usage
+
+To use EO-1 in a LeRobot configuration, specify the policy type as:
+
+```python
+policy.type=eo1
+```
+
+By default, a new EO-1 policy initializes its backbone from:
+
+```python
+policy.vlm_base=Qwen/Qwen2.5-VL-3B-Instruct
+```
+
+Once a LeRobot-format EO-1 checkpoint is available, load it with:
+
+```python
+policy.path=your-org/your-eo1-checkpoint
+```
+
+## Training
+
+### Training Command Example
+
+```bash
+lerobot-train \
+  --dataset.repo_id=your_org/your_dataset \
+  --policy.type=eo1 \
+  --policy.vlm_base=Qwen/Qwen2.5-VL-3B-Instruct \
+  --policy.dtype=bfloat16 \
+  --policy.attn_implementation=sdpa \
+  --policy.gradient_checkpointing=false \
+  --output_dir=./outputs/eo1_training \
+  --job_name=eo1_training \
+  --steps=300000 \
+  --batch_size=16 \
+  --policy.device=cuda
+```
+
+### Key Training Parameters
+
+| Parameter                              | Default                       | Description                                                             |
+| -------------------------------------- | ----------------------------- | ----------------------------------------------------------------------- |
+| `policy.vlm_base`                      | `Qwen/Qwen2.5-VL-3B-Instruct` | Qwen2.5-VL checkpoint used to initialize a new policy                   |
+| `policy.dtype`                         | `auto`                        | Backbone dtype request: `auto`, `bfloat16`, or `float32`                |
+| `policy.attn_implementation`           | `None`                        | Optional Qwen attention backend, such as `sdpa`                         |
+| `policy.gradient_checkpointing`        | `false`                       | Reduces memory usage during training                                    |
+| `policy.chunk_size`                    | `8`                           | Number of future actions predicted per chunk                            |
+| `policy.n_action_steps`                | `8`                           | Number of actions consumed from a sampled chunk                         |
+| `policy.num_denoise_steps`             | `10`                          | Number of flow-matching denoising steps used during sampling            |
+| `policy.max_state_dim`                 | `32`                          | State padding dimension                                                 |
+| `policy.max_action_dim`                | `32`                          | Action padding dimension                                                |
+| `policy.force_fp32_autocast`           | `true`                        | Keeps the flow head in fp32 even when the backbone uses mixed precision |
+| `policy.supervise_padding_action_dims` | `true`                        | Controls whether padded action dimensions are supervised                |
+| `policy.supervise_padding_actions`     | `true`                        | Controls whether padded future action rows are supervised               |
+
+## Evaluation
+
+EO-1 can be evaluated through `lerobot-eval` once you have a LeRobot-format checkpoint:
+
+```bash
+lerobot-eval \
+  --policy.path=your-org/your-eo1-checkpoint \
+  --env.type=libero \
+  --env.task=libero_object \
+  --eval.batch_size=1 \
+  --eval.n_episodes=20
+```
+
+For datasets or environments whose camera names differ from the checkpoint configuration, pass a `rename_map`:
+
+```bash
+lerobot-eval \
+  --policy.path=your-org/your-eo1-checkpoint \
+  --env.type=libero \
+  --env.task=libero_object \
+  --rename_map='{"observation.images.image2":"observation.images.wrist_image"}'
+```
+
+## Configuration Notes
+
+### Image Processing
+
+EO-1 uses the Qwen2.5-VL processor. The `policy.image_min_pixels` and `policy.image_max_pixels` settings control the image resizing bounds before the visual tokens are passed into the backbone.
+
+### State and Action Dimensions
+
+The policy pads state and action vectors to `policy.max_state_dim` and `policy.max_action_dim` before the EO-1 flow head. Predictions are cropped back to the original action dimension before being returned by the policy.
+
+### Attention Backend
+
+Use `policy.attn_implementation=sdpa` for a portable setup. Use `flash_attention_2` only when `flash_attn` is installed and compatible with your environment.
+
+## References
+
+- [EO-1 project](https://github.com/EO-Robotics/EO1)
+- [EO-1 paper](https://arxiv.org/abs/2508.21112)
+- [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
+
+## Citation
+
+```bibtex
+@article{eo1,
+  title={EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control},
+  author={Delin Qu and Haoming Song and Qizhi Chen and Zhaoqing Chen and Xianqiang Gao and Xinyi Ye and Qi Lv and Modi Shi and Guanghui Ren and Cheng Ruan and Maoqing Yao and Haoran Yang and Jiacheng Bao and Bin Zhao and Dong Wang},
+  journal={arXiv preprint arXiv:2508.21112},
+  year={2025},
+  url={https://arxiv.org/abs/2508.21112}
+}
+```
+
+## License
+
+This LeRobot integration follows the **Apache 2.0 License** used by LeRobot. Check the upstream EO-1 model and dataset pages for the licenses of released EO-1 checkpoints and data.
@@ -0,0 +1,186 @@
+# EVO1
+
+EVO1 is a Vision-Language-Action policy for robot control built around an InternVL3 backbone and a continuous flow-matching action head. This LeRobot integration exposes EVO1 as a standard policy type so it can be trained and evaluated with the usual LeRobot dataset, checkpoint, and processor APIs.
+
+## Model Overview
+
+The policy embeds one or more camera images and the language task prompt with InternVL3, pads robot state/action vectors to fixed maximum dimensions, and predicts future action chunks with a flow-matching action head. During inference, the policy samples an action chunk and returns `n_action_steps` actions from that chunk before sampling again.
+
+### What the LeRobot Integration Covers
+
+- Standard `policy.type=evo1` configuration through LeRobot
+- InternVL3 image/text embedding with optional FlashAttention and a fallback to standard attention
+- Stage-based finetuning controls for action-head-only and VLM finetuning runs
+- Continuous flow-matching action prediction
+- Checkpoint save/load through LeRobot policy APIs
+- Training with `lerobot-train` and evaluation with standard policy inference APIs
+
+The broader EVO1 project may include additional training scripts and dataset tooling. This page focuses on the LeRobot robot-control policy path.
+
+## Installation Requirements
+
+1. Install LeRobot by following the [Installation Guide](./installation).
+2. Install EVO1 dependencies:
+
+   ```bash
+   pip install -e ".[evo1]"
+   ```
+
+   For LIBERO evaluation, install the LIBERO extra as well:
+
+   ```bash
+   pip install -e ".[evo1,libero]"
+   ```
+
+3. Install a `flash-attn` wheel only if it is compatible with your Python, PyTorch, CUDA, and GPU stack. EVO1 falls back to standard attention when `flash_attn` is not available, but reproducing the official LIBERO checkpoint conversion result below requires the same FlashAttention path used by the original EVO1 checkpoint.
+
+EVO1 uses InternVL3 through the Hugging Face `transformers` remote-code path, so the first run may download the configured VLM checkpoint unless `policy.vlm_model_name` points to a local model directory.
+
+## Data Requirements
+
+EVO1 expects a LeRobot dataset with:
+
+- Between one and `policy.max_views` visual observations, for example `observation.images.image`
+- `observation.state`
+- `action`
+- A language task instruction in the dataset `task` field, or another field configured with `policy.task_field`
+
+State and action vectors are padded to `policy.max_state_dim` and `policy.max_action_dim`. Predictions are cropped back to the dataset action dimension before being returned.
+
+## Usage
+
+To use EVO1 in a LeRobot configuration, specify:
+
+```python
+policy.type=evo1
+```
+
+By default, a new EVO1 policy initializes its VLM from:
+
+```python
+policy.vlm_model_name=OpenGVLab/InternVL3-1B
+```
+
+Once a LeRobot-format EVO1 checkpoint is available, load it with:
+
+```python
+policy.path=your-org/your-evo1-checkpoint
+```
+
+The converted LIBERO checkpoint used for the results below is available at:
+
+```python
+policy.path=javadcc/evo1-libero-lerobot
+```
+
+## Training
+
+### Stage 1
+
+Stage 1 freezes the VLM and trains the action head:
+
+```bash
+lerobot-train \
+  --dataset.repo_id=your_org/your_dataset \
+  --policy.type=evo1 \
+  --policy.training_stage=stage1 \
+  --policy.vlm_model_name=OpenGVLab/InternVL3-1B \
+  --policy.device=cuda \
+  --policy.chunk_size=50 \
+  --policy.n_action_steps=50 \
+  --policy.max_state_dim=24 \
+  --policy.max_action_dim=24 \
+  --policy.optimizer_lr=1e-5 \
+  --batch_size=4 \
+  --steps=5000 \
+  --output_dir=./outputs/evo1_stage1
+```
+
+### Stage 2
+
+Stage 2 finetunes the VLM branches and action head. A common workflow starts from a Stage 1 checkpoint:
+
+```bash
+lerobot-train \
+  --dataset.repo_id=your_org/your_dataset \
+  --policy.path=./outputs/evo1_stage1/checkpoints/005000/pretrained_model \
+  --policy.training_stage=stage2 \
+  --policy.vlm_model_name=OpenGVLab/InternVL3-1B \
+  --policy.device=cuda \
+  --policy.chunk_size=50 \
+  --policy.n_action_steps=50 \
+  --policy.max_state_dim=24 \
+  --policy.max_action_dim=24 \
+  --policy.optimizer_lr=1e-5 \
+  --batch_size=4 \
+  --steps=80000 \
+  --output_dir=./outputs/evo1_stage2
+```
+
+By default, `policy.training_stage` reapplies the finetuning defaults for that stage. This is important when
+starting Stage 2 from a Stage 1 checkpoint, because the Stage 1 checkpoint config stores the VLM finetuning
+flags as disabled. These stage defaults take precedence over saved or manually supplied `policy.finetune_*`
+flags unless `policy.apply_training_stage_defaults=false`, so set that flag only when manually controlling
+every finetuning flag.
+
+### Key Training Parameters
+
+| Parameter                                     | Default                  | Description                                                       |
+| --------------------------------------------- | ------------------------ | ----------------------------------------------------------------- |
+| `policy.vlm_model_name`                       | `OpenGVLab/InternVL3-1B` | InternVL3 checkpoint or local model directory                     |
+| `policy.training_stage`                       | `stage1`                 | `stage1` trains the action head; `stage2` finetunes VLM branches  |
+| `policy.apply_training_stage_defaults`        | `true`                   | Reapplies stage finetuning defaults after loading a checkpoint    |
+| `policy.vlm_num_layers`                       | `14`                     | Number of InternVL3 language layers kept for the policy           |
+| `policy.vlm_dtype`                            | `bfloat16`               | Requested VLM dtype                                               |
+| `policy.use_flash_attn`                       | `true`                   | Requests FlashAttention when installed; otherwise falls back      |
+| `policy.enable_gradient_checkpointing`        | `true`                   | Enables checkpointing on supported InternVL3 modules              |
+| `policy.gradient_checkpointing_use_reentrant` | `false`                  | Reentrant setting passed to gradient checkpointing when supported |
+| `policy.chunk_size`                           | `50`                     | Number of future actions predicted per chunk                      |
+| `policy.n_action_steps`                       | `50`                     | Number of actions consumed from a sampled chunk                   |
+| `policy.max_state_dim`                        | `24`                     | State padding dimension                                           |
+| `policy.max_action_dim`                       | `24`                     | Action padding dimension                                          |
+| `policy.task_field`                           | `task`                   | Batch field used as the language prompt                           |
+
+## Results
+
+### LIBERO Object Checkpoint Conversion
+
+The checkpoint [javadcc/evo1-libero-lerobot](https://huggingface.co/javadcc/evo1-libero-lerobot)
+is the LeRobot-format conversion of the official EVO1 LIBERO checkpoint. The conversion was checked against
+the official EVO1 checkpoint with the same LIBERO Object initial states and action postprocessing.
+
+| Checkpoint                   | Suite           | Episodes         | Success Rate |
+| ---------------------------- | --------------- | ---------------- | ------------ |
+| Official EVO1 checkpoint     | `libero_object` | 10, one per task | 100%         |
+| LeRobot converted checkpoint | `libero_object` | 10, one per task | 100%         |
+
+For a fixed `libero_object` rollout, the official checkpoint and LeRobot checkpoint produced identical
+pixel embeddings, VLM fused tokens, normalized actions, and denormalized actions for the checked action step
+(`max_abs_diff=0.0`).
+
+The published checkpoint expects the raw LIBERO camera feature names
+`observation.images.agentview_image` and `observation.images.robot0_eye_in_hand_image`. To run the converted
+checkpoint with LeRobot LIBERO evaluation for the same one-episode-per-task setting, keep those camera names
+instead of the default `image`/`image2` mapping:
+
+```bash
+lerobot-eval \
+  --policy.path=javadcc/evo1-libero-lerobot \
+  --policy.device=cuda \
+  --env.type=libero \
+  --env.task=libero_object \
+  --env.camera_name_mapping="{agentview_image: agentview_image, robot0_eye_in_hand_image: robot0_eye_in_hand_image}" \
+  --env.observation_height=448 \
+  --env.observation_width=448 \
+  --eval.batch_size=1 \
+  --eval.n_episodes=1
+```
+
+## References
+
+- [EVO1 repository](https://github.com/MINT-SJTU/Evo-1)
+- [InternVL3-1B](https://huggingface.co/OpenGVLab/InternVL3-1B)
+
+## License
+
+This LeRobot integration follows the Apache 2.0 License used by LeRobot. Check the upstream EVO1 and InternVL3 model pages for the licenses of released checkpoints and data.
@@ -123,7 +123,7 @@ lerobot-record \
  --dataset.single_task="Grab and handover the red cube to the other arm" \
  --dataset.streaming_encoding=true \
  --dataset.encoder_threads=2 \
-  # --dataset.camera_encoder_config.vcodec=auto \
+  # --dataset.vcodec=auto \
  --policy.path=<user>/groot-bimanual \ # your trained model
  --dataset.episode_time_s=30 \
  --dataset.reset_time_s=10
@@ -0,0 +1,98 @@
+# Compute HW Guide for LeRobot Training
+
+Rough sizing for training a LeRobot policy: how much VRAM each policy needs, what training time looks like, and where to run when local hardware isn't enough.
+
+The numbers below are **indicative** — order-of-magnitude figures for picking hardware, not exact predictions. Throughput depends heavily on dataset I/O, image resolution, batch size, and number of GPUs.
+
+## Memory by policy group
+
+Policies cluster by backbone size; the groupings below give a single VRAM envelope per group instead of repeating numbers per policy. Memory scales roughly linearly with batch size; AdamW (the LeRobot default) carries optimizer state that adds ~30–100% over a forward+backward pass alone.
+
+| Group      | Policies                                    | Peak VRAM (BS 8, AdamW) | Suitable starter GPUs             |
+| ---------- | ------------------------------------------- | ----------------------: | --------------------------------- |
+| Light BC   | `act`, `vqbet`, `tdmpc`                     |                 ~2–6 GB | Laptop GPU (RTX 3060), L4, A10G   |
+| Diffusion  | `diffusion`, `multi_task_dit`               |                ~8–14 GB | RTX 4070+ / L4 / A10G             |
+| Small VLA  | `smolvla`                                   |               ~10–16 GB | RTX 4080+ / L4 / A10G             |
+| Large VLA  | `pi0`, `pi0_fast`, `pi05`, `xvla`, `wall_x` |               ~24–40 GB | A100 40 GB+ (24 GB tight at BS 1) |
+| Multimodal | `groot`, `eo1`                              |               ~24–40 GB | A100 40 GB+                       |
+| RL         | `sac`                                       |             config-dep. | See [HIL-SERL guide](./hilserl)   |
+
+Memory-bound? Reduce the batch size (memory falls roughly linearly), use gradient accumulation to recover the effective batch size, or, for SmolVLA, leave `freeze_vision_encoder=True`.
+
+## Training time
+
+Robotics imitation learning typically converges in **5–10 epochs over the dataset**, not hundreds of thousands of raw steps. Once you know your epoch count, wall-clock is essentially:
+
+```text
+total_frames    = sum of frames over all episodes      # 50 ep × 30 fps × 30 s ≈ 45,000
+steps_per_epoch = ceil(total_frames / (num_gpus × batch_size))
+total_steps     = epochs × steps_per_epoch
+wall_clock      ≈ total_steps × per_step_time
+```
+
+Per-step time depends on the policy and the GPU. The numbers in the table below are anchors — pick the row closest to your setup and scale linearly with `total_steps` if you train longer or shorter.
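+
+As a worked example of the formula above, take the ~45,000-frame dataset from its comment, a single GPU, batch size 8, and 5 epochs. The 0.1 s per-step time is an assumed placeholder, not a measurement; substitute a per-step time measured on your own setup.
+
+```text
+steps_per_epoch = ceil(45,000 / (1 × 8))   = 5,625
+total_steps     = 5 × 5,625                = 28,125
+wall_clock      ≈ 28,125 × 0.1 s           ≈ 2,813 s ≈ 47 min
+```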
+
+### Common scenarios
+
+Indicative wall-clock for **5 epochs on a ~50-episode dataset (~45k frames at 30 fps × 30 s)**, default optimizer (AdamW), 640×480 images:
+
+| Setup                                | Policy         | Batch | Wall-clock |
+| ------------------------------------ | -------------- | ----- | ---------: |
+| Single RTX 4090 / RTX 3090 (24 GB)   | `act`          | 8     |  ~30–60min |
+| Single RTX 4090 / RTX 3090 (24 GB)   | `diffusion`    | 8     |      ~2–4h |
+| Single L4 / A10G (24 GB)             | `act`          | 8     |      ~1–2h |
+| Single L4 / A10G (24 GB)             | `smolvla`      | 4     |      ~3–6h |
+| Single A100 40 GB                    | `smolvla`      | 16    |      ~1–2h |
+| Single A100 40 GB                    | `pi0` / `pi05` | 4     |      ~4–8h |
+| 4× H100 80 GB cluster (`accelerate`) | `diffusion`    | 32    |  ~30–60min |
+| 4× H100 80 GB cluster (`accelerate`) | `smolvla`      | 32    |      ~1–2h |
+| Apple Silicon M1/M2/M3 Max (MPS)     | `act`          | 4     |     ~6–14h |
+
+These are order-of-magnitude figures. Real runs deviate by ±50% depending on image resolution, dataset I/O, dataloader threading, and exact GPU SKU. They are useful as "is this run going to take an hour or a day?" intuition, not as SLAs.
+
+### Multi-GPU matters a lot
+
+`accelerate launch --num_processes=N` is the easiest way to cut training time. Each optimizer step processes `N × batch_size` samples in roughly the same wall-clock as a single-GPU step, so 4 GPUs ≈ 4× speedup for compute-bound runs. See the [Multi GPU training](./multi_gpu_training) guide for the full setup.
+
+Reference data points on a 4×H100 80 GB cluster (`accelerate launch --num_processes=4`), 5000 steps, batch 32, AdamW, dataset [`imstevenpmwork/super_poulain_draft`](https://huggingface.co/datasets/imstevenpmwork/super_poulain_draft) (~50 episodes, ~640×480 images):
+
+| Policy      | Wall-clock | `update_s` | `dataloading_s` | GPU util | Notable flags                                                                                                                  |
+| ----------- | ---------- | ---------: | --------------: | -------- | ------------------------------------------------------------------------------------------------------------------------------ |
+| `diffusion` | 16m 17s    |      0.167 |           0.015 | ~90%     | defaults (training from scratch)                                                                                               |
+| `smolvla`   | 27m 49s    |      0.312 |           0.011 | ~80%     | `--policy.path=lerobot/smolvla_base`, `freeze_vision_encoder=false`, `train_expert_only=false`                                 |
+| `pi05`      | 3h 41m     |      2.548 |           0.014 | ~95%     | `--policy.pretrained_path=lerobot/pi05_base`, `gradient_checkpointing=true`, `dtype=bfloat16`, vision encoder + expert trained |
+
+The `dataloading_s` vs. `update_s` ratio is the diagnostic that matters: when `dataloading_s` approaches `update_s`, more GPUs stop helping — your dataloader is the bottleneck and you should look at `--num_workers`, image resolution, and disk speed before adding compute.
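+
+Applying that ratio to the three reference runs gives the following. The inputs are copied from the table above, and the quotients are rounded:
+
+```text
+diffusion : 0.015 / 0.167 ≈ 0.09     # dataloading is ~9% of update time
+smolvla   : 0.011 / 0.312 ≈ 0.035    # ~3.5%
+pi05      : 0.014 / 2.548 ≈ 0.005    # ~0.5%
+```
+
+All three ratios sit far below 1, so by this diagnostic none of these runs was dataloader-bound.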
+
+### Schedule and checkpoints
+
+If you shorten training (e.g. 5k–10k steps on a small dataset), also shorten the LR schedule by setting `--policy.scheduler_decay_steps` to roughly the value of `--steps`; otherwise the LR stays near its peak and never decays. Likewise, lower `--save_freq` so that the shorter run still writes intermediate checkpoints.
+
+## Where to run
+
+VRAM is the first filter. Within a tier, pick by budget and availability. The `$`–`$$$$` tiers are relative; check current pricing on the provider you actually use.
+
+| Class                      | VRAM  | Tier   | Comfortable for                                             |
+| -------------------------- | ----- | ------ | ----------------------------------------------------------- |
+| RTX 3090 / 4090 (consumer) | 24 GB | `$`    | Light BC, Diffusion, SmolVLA. Tight for VLAs at batch 1.    |
+| L4 / A10G (cloud)          | 24 GB | `$–$$` | Same envelope; common on Google Cloud, RunPod, AWS `g5/g6`. |
+| A100 40 GB                 | 40 GB | `$$$`  | Any policy at reasonable batch sizes.                       |
+| A100 80 GB / H100 80 GB    | 80 GB | `$$$$` | Multi-GPU clusters; large batches for VLAs.                 |
+| **CPU only**               | —     | —      | Don't train. Use Colab or rent a GPU.                       |
+
+### Hugging Face Jobs
+
+[Hugging Face Jobs](https://huggingface.co/docs/hub/jobs) lets you run training on managed HF infrastructure, billed by the second. The repo publishes a ready-to-use image: **`huggingface/lerobot-gpu:latest`**, rebuilt **every night at 02:00 UTC from `main`** ([`docker_publish.yml`](https://github.com/huggingface/lerobot/blob/main/.github/workflows/docker_publish.yml)) — so it tracks the current state of the repo, not a tagged release.
+
+```bash
+hf jobs run --flavor a10g-large huggingface/lerobot-gpu:latest \
+  bash -c "nvidia-smi && lerobot-train \
+    --policy.type=act --dataset.repo_id=<USER>/<DATASET> \
+    --policy.repo_id=<USER>/act_<task> --batch_size=8 --steps=50000"
+```
+
+Notes:
+
+- The leading `nvidia-smi` is a quick sanity check that CUDA is visible inside the container, useful for failing fast if the flavor or driver is mismatched.
+- The default Job timeout is 30 minutes; pass `--timeout 4h` (or longer) for real training.
+- `--flavor` maps onto the table above: `t4-small`/`t4-medium` (T4, ACT only), `l4x1`/`l4x4` (L4 24 GB), `a10g-small/large/largex2/largex4` (A10G 24 GB scaled out), `a100-large` (A100). For the current full catalogue + pricing see [https://huggingface.co/docs/hub/jobs](https://huggingface.co/docs/hub/jobs).
@@ -232,7 +232,7 @@ lerobot-record \
    --dataset.private=true \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    # --dataset.camera_encoder_config.vcodec=auto \
+    # --dataset.vcodec=auto \
    --display_data=true
 ```

@@ -278,6 +278,6 @@ lerobot-record \
  --dataset.num_episodes=10 \
  --dataset.streaming_encoding=true \
  --dataset.encoder_threads=2 \
-  # --dataset.camera_encoder_config.vcodec=auto \
+  # --dataset.vcodec=auto \
  --policy.path=outputs/train/hopejr_hand/checkpoints/last/pretrained_model
 ```
@@ -193,7 +193,7 @@ lerobot-record \
    --dataset.num_episodes=5 \
    --dataset.single_task="Grab the black cube" \
    --dataset.streaming_encoding=true \
-    # --dataset.camera_encoder_config.vcodec=auto \
+    # --dataset.vcodec=auto \
    --dataset.encoder_threads=2
 ```
 </hfoption>
@@ -207,6 +207,56 @@ pip install 'lerobot[feetech]'        # Feetech motor support

 _Multiple extras can be combined (e.g., `.[core_scripts,pi,pusht]`). For a full list of available extras, refer to `pyproject.toml`._

+### PyTorch CUDA variant (Linux only)
+
+On Linux, the install path determines which CUDA wheel you get. macOS and Windows installs use the PyPI default (MPS / CPU / CUDA-Windows wheel respectively) and can skip this section.
+
+<!-- prettier-ignore-start -->
+
+<hfoptions id="cuda_variant">
+<hfoption id="uv-source">
+
+**Source install via `uv` (`uv sync` or `uv pip install -e .`)**
+
+`torch` and `torchvision` are pinned by the project to the **CUDA 12.8** PyTorch index (`https://download.pytorch.org/whl/cu128`, driver floor **570.86**) — covers Ampere/Ada/Hopper/Blackwell GPUs. No action needed for typical NVIDIA setups.
+
+To override for a different CUDA variant:
+
+```bash
+uv pip install --force-reinstall torch torchvision \
+    --index-url https://download.pytorch.org/whl/cu126   # older drivers; or cu130 for Blackwell on driver ≥ 580
+```
+
+</hfoption>
+<hfoption id="pip-conda">
+
+**Source install via `pip`/`conda`, or `pip install lerobot` from PyPI**
+
+The default torch wheel on PyPI is currently a cu130-bundled Linux wheel, with a driver floor of **580.65**.
+
+To pick a specific CUDA variant:
+
+**Using `pip` or `conda`** — install torch first with an explicit index, then lerobot:
+
+```bash
+pip install --index-url https://download.pytorch.org/whl/cu128 torch torchvision
+pip install -e ".[all]"          # source
+# — or —
+pip install lerobot              # from PyPI
+```
+
+**Using `uv` to install from PyPI** — one-liner via `--torch-backend` (uv ≥ 0.6):
+
+```bash
+uv pip install --torch-backend cu128 lerobot
+```
+
+Supported values include `auto`, `cpu`, `cu126`, `cu128`, `cu129`, `cu130`, plus various `rocm*` and `xpu`. Swap as needed for your driver.
+
+</hfoption>
+</hfoptions>
+<!-- prettier-ignore-end -->
+
 ### Troubleshooting

 If you encounter build errors, you may need to install additional system dependencies: `cmake`, `build-essential`, and `ffmpeg libs`.
@@ -43,7 +43,7 @@ lerobot-record \
  --dataset.num_episodes=5 \
  --dataset.single_task="Grab the black cube" \
  --dataset.streaming_encoding=true \
-  # --dataset.camera_encoder_config.vcodec=auto \
+  # --dataset.vcodec=auto \
  --dataset.encoder_threads=2
 ```

@@ -0,0 +1,18 @@
+# EVO1
+
+EVO1 is a Vision-Language-Action policy for robot control. The LeRobot
+integration uses an InternVL3 vision-language backbone with a flow-matching
+action head, and supports staged training through the standard LeRobot policy
+APIs.
+
+The upstream EVO1 project is available at
+[MINT-SJTU/Evo-1](https://github.com/MINT-SJTU/Evo-1).
+
+```bibtex
+@misc{evo1,
+  title = {EVO1},
+  author = {{MINT-SJTU}},
+  year = {2026},
+  howpublished = {\url{https://github.com/MINT-SJTU/Evo-1}},
+}
+```
@@ -161,7 +161,7 @@ lerobot-record \
    --dataset.private=true \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    # --dataset.camera_encoder_config.vcodec=auto \
+    # --dataset.vcodec=auto \
    --display_data=true
 ```

@@ -203,7 +203,7 @@ lerobot-record \
    --dataset.private=true \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    # --dataset.camera_encoder_config.vcodec=auto \
+    # --dataset.vcodec=auto \
    --display_data=true
 ```

@@ -108,7 +108,7 @@ lerobot-record \
  --dataset.num_episodes=10 \
  --dataset.streaming_encoding=true \
  --dataset.encoder_threads=2 \
-  # --dataset.camera_encoder_config.vcodec=auto \
+  # --dataset.vcodec=auto \
  # <- Teleop optional if you want to teleoperate in between episodes \
  # --teleop.type=so100_leader \
  # --teleop.port=/dev/ttyACM0 \
@@ -14,22 +14,12 @@ This makes `save_episode()` near-instant (the video is already encoded by the ti

 ## 2. Tuning Parameters

-All encoding parameters are grouped under `camera_encoder_config` (a `VideoEncoderConfig` dataclass), accessible from the CLI via `--dataset.camera_encoder_config.<field>`.
-
-| Parameter               | CLI Flag                                      | Type          | Default       | Description                                                         |
-| ----------------------- | --------------------------------------------- | ------------- | ------------- | ------------------------------------------------------------------- |
-| `streaming_encoding`    | `--dataset.streaming_encoding`                | `bool`        | `True`        | Enable real-time encoding during capture                            |
-| `vcodec`                | `--dataset.camera_encoder_config.vcodec`      | `str`         | `"libsvtav1"` | Video codec. `"auto"` detects best HW encoder                       |
-| `pix_fmt`               | `--dataset.camera_encoder_config.pix_fmt`     | `str`         | `"yuv420p"`   | Pixel format                                                        |
-| `g`                     | `--dataset.camera_encoder_config.g`           | `int \| None` | `2`           | GOP size (keyframe interval)                                        |
-| `crf`                   | `--dataset.camera_encoder_config.crf`         | `int \| None` | `30`          | Quality level (mapped to codec-specific parameter)                  |
-| `preset`                | `--dataset.camera_encoder_config.preset`      | `int \| None` | `12`          | Speed preset (libsvtav1 only, 0 = slowest … 13 = fastest)           |
-| `fast_decode`           | `--dataset.camera_encoder_config.fast_decode` | `int`         | `0`           | Fast-decode tuning level                                            |
-| `encoder_threads`       | `--dataset.encoder_threads`                   | `int \| None` | `None` (auto) | Threads per encoder instance (global). `None` lets the codec decide |
-| `encoder_queue_maxsize` | `--dataset.encoder_queue_maxsize`             | `int`         | `60`          | Max buffered frames per camera (~2s at 30fps). Consumes RAM         |
-
-> [!TIP]
-> Not all parameters apply to every codec. `VideoEncoderConfig` will warn at startup if you set a parameter that your chosen codec ignores (e.g. `preset` with `h264_nvenc`).
+| Parameter               | CLI Flag                          | Type          | Default       | Description                                                       |
+| ----------------------- | --------------------------------- | ------------- | ------------- | ----------------------------------------------------------------- |
+| `streaming_encoding`    | `--dataset.streaming_encoding`    | `bool`        | `True`        | Enable real-time encoding during capture                          |
+| `vcodec`                | `--dataset.vcodec`                | `str`         | `"libsvtav1"` | Video codec. `"auto"` detects best HW encoder                     |
+| `encoder_threads`       | `--dataset.encoder_threads`       | `int \| None` | `None` (auto) | Threads per encoder instance. `None` lets the codec decide        |
+| `encoder_queue_maxsize` | `--dataset.encoder_queue_maxsize` | `int`         | `60`          | Max buffered frames per camera (~2s at 30fps). Consumes RAM       |
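+
+As a rough sizing aid for `encoder_queue_maxsize`, the worst-case buffer per camera can be estimated from the frame size. This sketch assumes frames are queued as uncompressed 8-bit RGB at 640x480, which is an assumption about the buffer layout rather than something the table states; substitute your own resolution.
+
+```bash
+# bytes per frame (width * height * channels) * queue length
+echo $(( 640 * 480 * 3 * 60 ))   # 55296000 bytes, about 55 MB per camera
+
+# seconds of video the default queue holds at 30 fps
+echo $(( 60 / 30 ))              # 2
+```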

 ## 3. Performance Considerations

@@ -50,7 +40,7 @@ Streaming encoding means the CPU is encoding video **during** the capture loop,

 ### `encoder_threads` Tuning

-This parameter (`--dataset.encoder_threads`) controls how many threads each encoder instance uses internally:
+This parameter controls how many threads each encoder instance uses internally:

 - **Higher values** (e.g., 4-5): Faster encoding, but uses more CPU cores per camera. Good for high-end systems with many cores.
 - **Lower values** (e.g., 1-2): Less CPU per camera, freeing cores for capture and visualization. Good for low-res images and capable CPUs.
@@ -92,15 +82,15 @@ Use HW encoding when:

 ### Available HW Encoders

-| Encoder             | Platform      | Hardware                                                                                         | CLI Value                                                  |
-| ------------------- | ------------- | ------------------------------------------------------------------------------------------------ | ---------------------------------------------------------- |
-| `h264_videotoolbox` | macOS         | Apple Silicon / Intel                                                                            | `--dataset.camera_encoder_config.vcodec=h264_videotoolbox` |
-| `hevc_videotoolbox` | macOS         | Apple Silicon / Intel                                                                            | `--dataset.camera_encoder_config.vcodec=hevc_videotoolbox` |
-| `h264_nvenc`        | Linux/Windows | NVIDIA GPU                                                                                       | `--dataset.camera_encoder_config.vcodec=h264_nvenc`        |
-| `hevc_nvenc`        | Linux/Windows | NVIDIA GPU                                                                                       | `--dataset.camera_encoder_config.vcodec=hevc_nvenc`        |
-| `h264_vaapi`        | Linux         | Intel/AMD GPU                                                                                    | `--dataset.camera_encoder_config.vcodec=h264_vaapi`        |
-| `h264_qsv`          | Linux/Windows | Intel Quick Sync                                                                                 | `--dataset.camera_encoder_config.vcodec=h264_qsv`          |
-| `auto`              | Any           | Probes the system for available HW encoders. Falls back to `libsvtav1` if no HW encoder is found | `--dataset.camera_encoder_config.vcodec=auto`              |
+| Encoder             | Platform      | Hardware                                                                                         | CLI Value                            |
+| ------------------- | ------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------ |
+| `h264_videotoolbox` | macOS         | Apple Silicon / Intel                                                                            | `--dataset.vcodec=h264_videotoolbox` |
+| `hevc_videotoolbox` | macOS         | Apple Silicon / Intel                                                                            | `--dataset.vcodec=hevc_videotoolbox` |
+| `h264_nvenc`        | Linux/Windows | NVIDIA GPU                                                                                       | `--dataset.vcodec=h264_nvenc`        |
+| `hevc_nvenc`        | Linux/Windows | NVIDIA GPU                                                                                       | `--dataset.vcodec=hevc_nvenc`        |
+| `h264_vaapi`        | Linux         | Intel/AMD GPU                                                                                    | `--dataset.vcodec=h264_vaapi`        |
+| `h264_qsv`          | Linux/Windows | Intel Quick Sync                                                                                 | `--dataset.vcodec=h264_qsv`          |
+| `auto`              | Any           | Probes the system for available HW encoders. Falls back to `libsvtav1` if no HW encoder is found | `--dataset.vcodec=auto`              |

 > [!NOTE]
 > In order to use the HW accelerated encoders you might need to upgrade your GPU drivers.
@@ -110,15 +100,15 @@ Use HW encoding when:

 ## 5. Troubleshooting

-| Symptom                                                            | Likely Cause                                 | Fix                                                                                                                                                                                                                                                                                                        |
-| ------------------------------------------------------------------ | -------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| System freezes or choppy robot movement or Rerun visualization lag | CPU starved (100% load usage)                | Close other apps, reduce encoding throughput, lower `encoder_threads`, use `h264`, use `display_data=False`. If the CPU continues to be at 100% then it might be insufficient for your setup, consider `--dataset.streaming_encoding=false` or HW encoding (`--dataset.camera_encoder_config.vcodec=auto`) |
-| "Encoder queue full" warnings or dropped frames in dataset         | Encoder can't keep up (Queue overflow)       | If CPU is not at 100%: Increase `encoder_threads`, increase `encoder_queue_maxsize` or use HW encoding (`--dataset.camera_encoder_config.vcodec=auto`).                                                                                                                                                    |
-| High RAM usage                                                     | Queue filling faster than encoding           | `encoder_threads` too low or CPU insufficient. Reduce `encoder_queue_maxsize` or use HW encoding                                                                                                                                                                                                           |
-| Large video files                                                  | Using HW encoder or H.264                    | Expected trade-off. Switch to `libsvtav1` if CPU allows                                                                                                                                                                                                                                                    |
-| `save_episode()` still slow                                        | `streaming_encoding` is `False`              | Set `--dataset.streaming_encoding=true`                                                                                                                                                                                                                                                                    |
-| Encoder thread crash                                               | Codec not available or invalid settings      | Check `vcodec` is installed, try `--dataset.camera_encoder_config.vcodec=auto`                                                                                                                                                                                                                             |
-| Recorded dataset is missing frames                                 | CPU/GPU starvation or occasional load spikes | If ~5% of frames are missing, your system is likely overloaded — follow the recommendations above. If fewer frames are missing (~2%), they are probably due to occasional transient load spikes (often at startup) and can be considered expected.                                                         |
+| Symptom                                                            | Likely Cause                                 | Fix                                                                                                                                                                                                                                                                                  |
+| ------------------------------------------------------------------ | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| System freezes or choppy robot movement or Rerun visualization lag | CPU starved (100% load usage)                | Close other apps, reduce encoding throughput, lower `encoder_threads`, use `h264`, use `display_data=False`. If the CPU continues to be at 100% then it might be insufficient for your setup, consider `--dataset.streaming_encoding=false` or HW encoding (`--dataset.vcodec=auto`) |
+| "Encoder queue full" warnings or dropped frames in dataset         | Encoder can't keep up (Queue overflow)       | If CPU is not at 100%: Increase `encoder_threads`, increase `encoder_queue_maxsize` or use HW encoding (`--dataset.vcodec=auto`).                                                                                                                                                    |
+| High RAM usage                                                     | Queue filling faster than encoding           | `encoder_threads` too low or CPU insufficient. Reduce `encoder_queue_maxsize` or use HW encoding                                                                                                                                                                                     |
+| Large video files                                                  | Using HW encoder or H.264                    | Expected trade-off. Switch to `libsvtav1` if CPU allows                                                                                                                                                                                                                              |
+| `save_episode()` still slow                                        | `streaming_encoding` is `False`              | Set `--dataset.streaming_encoding=true`                                                                                                                                                                                                                                              |
+| Encoder thread crash                                               | Codec not available or invalid settings      | Check `vcodec` is installed, try `--dataset.vcodec=auto`                                                                                                                                                                                                                             |
+| Recorded dataset is missing frames                                 | CPU/GPU starvation or occasional load spikes | If ~5% of frames are missing, your system is likely overloaded — follow the recommendations above. If fewer frames are missing (~2%), they are probably due to occasional transient load spikes (often at startup) and can be considered expected.                                   |

 ## 6. Recommended Configurations

@@ -156,10 +146,10 @@ On very constrained systems, streaming encoding may compete too heavily with the
 # 2camsx 640x480x3 @30fps: Requires some tuning.

 # Use H.264, disable streaming, consider batching encoding
-lerobot-record --dataset.camera_encoder_config.vcodec=h264 --dataset.streaming_encoding=false ...
+lerobot-record --dataset.vcodec=h264 --dataset.streaming_encoding=false ...
 ```

 ## 7. Closing note

 Performance ultimately depends on your exact setup — frames-per-second, resolution, CPU cores and load, available memory, episode length, and the encoder you choose. Always test with your target workload, be mindful about your CPU & system capabilities and tune `encoder_threads`, `encoder_queue_maxsize`, and
-`camera_encoder_config.vcodec` reasonably. That said, a common practical configuration (for many applications) is three cameras at 640×480x3 @30fps; this usually runs fine with the default streaming video encoding settings in modern systems. Always verify your recorded dataset is healthy by comparing the video duration to the CLI episode duration and confirming the row count equals FPS × CLI duration.
+`vcodec` reasonably. That said, a common practical configuration (for many applications) is three cameras at 640×480x3 @30fps; this usually runs fine with the default streaming video encoding settings in modern systems. Always verify your recorded dataset is healthy by comparing the video duration to the CLI episode duration and confirming the row count equals FPS × CLI duration.
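+
+One way to run the duration check is with `ffprobe`, which ships with FFmpeg. This is a sketch, and the video path is a placeholder: point it at one of the video files in your own dataset directory.
+
+```bash
+# Print the container duration in seconds for a recorded video.
+# VIDEO is a placeholder path, not a location LeRobot guarantees.
+VIDEO=path/to/episode_video.mp4
+ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$VIDEO"
+```
+
+Compare the printed value with the episode duration reported by the CLI, and check that the dataset row count matches FPS times that duration.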
@@ -117,10 +117,10 @@ lerobot-edit-dataset \
    --repo_id lerobot/pusht_image \
    --operation.type convert_image_to_video \
    --operation.output_dir outputs/pusht_video \
-    --operation.camera_encoder_config.vcodec libsvtav1 \
-    --operation.camera_encoder_config.pix_fmt yuv420p \
-    --operation.camera_encoder_config.g 2 \
-    --operation.camera_encoder_config.crf 30
+    --operation.vcodec libsvtav1 \
+    --operation.pix_fmt yuv420p \
+    --operation.g 2 \
+    --operation.crf 30

 # Convert only specific episodes
 lerobot-edit-dataset \
@@ -147,14 +147,11 @@ lerobot-edit-dataset \
 **Parameters:**

 - `output_dir`: Custom output directory (optional - by default uses `new_repo_id` or `{repo_id}_video`)
- `camera_encoder_config`: Video encoder settings — all sub-fields accessible via `--operation.camera_encoder_config.<field>`:
-  - `vcodec`: Video codec — `h264`, `hevc`, `libsvtav1`, `auto`, or hardware codecs (default: `libsvtav1`)
-  - `pix_fmt`: Pixel format — `yuv420p`, `yuv444p` (default: `yuv420p`)
-  - `g`: GOP size — lower values give better quality but larger files (default: 2)
-  - `crf`: Quality level — lower is better, 0 is lossless (default: 30)
-  - `preset`: Speed preset, libsvtav1 only (default: 12)
-  - `fast_decode`: Fast-decode tuning (default: 0)
-  - `encoder_threads`: Threads per encoder instance — global setting, separate from `camera_encoder_config` (default: None)
+- `vcodec`: Video codec to use - options: `h264`, `hevc`, `libsvtav1` (default: `libsvtav1`)
+- `pix_fmt`: Pixel format - options: `yuv420p`, `yuv444p` (default: `yuv420p`)
+- `g`: Group of pictures (GOP) size - lower values give better quality but larger files (default: 2)
+- `crf`: Constant rate factor - lower values give better quality but larger files, 0 is lossless (default: 30)
+- `fast_decode`: Fast decode tuning option (default: 0)
 - `episode_indices`: List of specific episodes to convert (default: all episodes)
 - `num_workers`: Number of parallel workers for processing (default: 4)

@@ -0,0 +1,136 @@
+# OMX Follower — Cube Pick And Place Example
+
+This example shows what LeRobot can do on a physical setup.
+It is a work in progress, used internally at LeRobot and specific to our setup, but we hope it is a useful reference for the LeRobot APIs and CLIs.
+
+It includes an end-to-end example for the **OMX Follower** robot arm: collect a cube pick-and-place dataset, train a policy, and deploy it autonomously.
+
+## Hardware
+
+| Component | Value                                |
+| --------- | ------------------------------------ |
+| Robot     | OMX Follower                         |
+| Cameras   | 2× OpenCV cameras (wrist + top-down) |
+
+## Scripts
+
+| Script                 | Purpose                                                         |
+| ---------------------- | --------------------------------------------------------------- |
+| `reset_environment.py` | Standalone utility: sweep workspace, grab cube, place cube      |
+| `record_grab.py`       | Automated data collection: reset → place → record grab episodes |
+
+## Setup
+
+Make sure LeRobot is installed in your environment (see [the installation guide](https://huggingface.co/docs/lerobot/installation)).
+
+Next, we will declare some environment variables for convenience. Adjust the camera indices and robot port to match your system configuration.
+
+```bash
+export ROBOT_PORT=/dev/ttyACM0
+export TELEOP_PORT=/dev/ttyACM1
+export HF_USERNAME=<your_hf_username>
+export ROBOT_CAMERAS="{ wrist: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30, fourcc: MJPG}, top: {type: opencv, index_or_path: 2, width: 640, height: 480, fps: 30, fourcc: MJPG} }"
+```
+
+## Step 1 — Collect Data
+
+```bash
+lerobot-record \
+    --robot.type=omx_follower \
+    --robot.port=$ROBOT_PORT \
+    --robot.id=omx_follower \
+    --robot.cameras="$ROBOT_CAMERAS" \
+    --teleop.type=omx_leader \
+    --teleop.port=$TELEOP_PORT \
+    --teleop.id=omx_leader \
+    --dataset.repo_id=$HF_USERNAME/omx_pickandplace \
+    --dataset.root=data/omx_pickandplace \
+    --dataset.num_episodes=50 \
+    --dataset.single_task="Pick the cube and place it in the blue square" \
+    --dataset.streaming_encoding=true \
+    --dataset.push_to_hub=true
+```
+
+### Bonus Auto-Collect script
+
+/!\ This is specific to our setup and the task of picking and placing a cube. It is not a general-purpose data collection script. As you may notice, it doesn't require a teleop.
+
+```bash
+python -m examples.omx.record_grab \
+    --robot.type=omx_follower \
+    --robot.port=$ROBOT_PORT \
+    --robot.id=omx_follower \
+    --robot.cameras="$ROBOT_CAMERAS" \
+    --dataset.repo_id=$HF_USERNAME/omx_pickandplace \
+    --dataset.root=data/omx_pickandplace \
+    --dataset.num_episodes=50 \
+    --dataset.single_task="Pick the cube and place it in the blue square" \
+    --dataset.streaming_encoding=true \
+    --dataset.push_to_hub=true
+```
+
+Each episode:
+
+1. The arm grabs the cube from the center of the workspace and places it at a random position.
+2. The arm returns to HOME.
+3. A targeted grab is recorded: HOME → approach raised → lower onto cube → grasp → lift → carry → drop → HOME.
+
+A dataset is already available here [`maximellerbach/omx_pickandplace`](https://huggingface.co/datasets/maximellerbach/omx_pickandplace), so you can skip directly to training if you want.
+
+## Step 2 — Train
+
+To train a simple `ACT` policy on the collected dataset, you can use the `lerobot-train` CLI:
+
+```bash
+lerobot-train \
+    --dataset.repo_id=$HF_USERNAME/omx_pickandplace \
+    --policy.type=act \
+    --output_dir=outputs/train/omx_pickandplace_act \
+    --policy.device=cuda \
+    --policy.repo_id=$HF_USERNAME/omx_pickandplace_act \
+    --steps=20000 \
+    --wandb.enable=true
+```
+
+A pretrained `ACT` policy is already available here [`maximellerbach/omx_pickandplace_act`](https://huggingface.co/maximellerbach/omx_pickandplace_act).
+
+## Step 3 — Rollout
+
+Use the `lerobot-rollout` CLI with the `base` strategy:
+
+```bash
+lerobot-rollout \
+    --strategy.type=base \
+    --robot.type=omx_follower \
+    --robot.port=$ROBOT_PORT \
+    --robot.id=omx_follower \
+    --robot.cameras="$ROBOT_CAMERAS" \
+    --policy.path=$HF_USERNAME/omx_pickandplace_act \
+```
+
+For continuous recording with automatic upload (sentry mode):
+
+```bash
+lerobot-rollout \
+    --strategy.type=sentry \
+    --strategy.upload_every_n_episodes=10 \
+    --robot.type=omx_follower \
+    --robot.port=$ROBOT_PORT \
+    --robot.id=omx_follower \
+    --robot.cameras="$ROBOT_CAMERAS" \
+    --policy.path=$HF_USERNAME/omx_pickandplace_act \
+    --dataset.repo_id=$HF_USERNAME/rollout_omx_pickandplace_act \
+```
+
+## Environment Reset Utility
+
+These scripts are specific to this particular physical setup. They execute hardcoded sequences of actions on the robot to reset the environment, which is useful for data collection and evaluation. They are not general-purpose.
+
+`reset_environment.py` can be run standalone to prepare the workspace:
+
+```bash
+# Grab cube + place it at a random position on the left side
+python -m examples.omx.reset_environment --port $ROBOT_PORT --mode grab_and_place
+```
+
+It also exposes `grab_cube(robot)` and `place_cube(robot)` for use in custom scripts.
@@ -0,0 +1,422 @@
+#!/usr/bin/env python3
+"""
+Auto-record grab episodes for the OMX robot arm.
+
+Each episode cycle:
+  1. grab_and_place  — grab cube from workspace center and place at a random (pan, reach) position
+  2. HOME            — return arm to home with gripper open
+  3. record_grab     — execute a targeted grab to the stored position while recording
+                       observations + actions to a LeRobotDataset
+
+Usage (run from repo root):
+    python -m examples.omx.record_grab \\
+        --robot.type=omx_follower \\
+        --robot.port=/dev/ttyACM0 \\
+        --robot.id=omx_follower \\
+        --robot.cameras="{ wrist: {type: opencv, index_or_path: 6, width: 640, height: 480, fps: 30, fourcc: MJPG}, top: {type: opencv, index_or_path: 4, width: 640, height: 480, fps: 30, fourcc: MJPG} }" \\
+        --dataset.repo_id=<hf_username>/<dataset_name> \\
+        --dataset.root=data/omx_grab \\
+        --dataset.num_episodes=50 \\
+        --dataset.single_task="Grab the cube" \\
+        --dataset.streaming_encoding=true
+"""
+
+import logging
+from dataclasses import dataclass
+from pprint import pformat
+
+import numpy as np
+
+from lerobot.cameras import CameraConfig  # noqa: F401
+from lerobot.cameras.opencv import OpenCVCameraConfig  # noqa: F401
+from lerobot.configs import parser
+from lerobot.configs.dataset import DatasetRecordConfig
+from lerobot.datasets import (
+    LeRobotDataset,
+    VideoEncodingManager,
+    aggregate_pipeline_dataset_features,
+    create_initial_features,
+)
+from lerobot.processor import make_default_processors
+from lerobot.robots import RobotConfig, make_robot_from_config
+from lerobot.robots.omx_follower import OmxFollower
+from lerobot.utils.constants import ACTION, OBS_STR
+from lerobot.utils.feature_utils import build_dataset_frame, combine_feature_dicts
+from lerobot.utils.robot_utils import precise_sleep
+
+from .reset_environment import (
+    APPROACH_SPEED,
+    GRIPPER_CLOSE_POS,
+    HOME_POSE,
+    PUSH_END_ELBOW_FLEX,
+    PUSH_END_SHOULDER_LIFT,
+    PUSH_START_ELBOW_FLEX,
+    PUSH_START_SHOULDER_LIFT,
+    array_to_pose,
+    grab_cube,
+    horizontal_wrist_flex,
+    move_to_pose,
+    place_cube,
+    pose_to_array,
+)
+
+# ── Grab-episode motion parameters ────────────────────────────────────────────
+
+# Shoulder-lift offset for the raised approach phase (subtracted from the target sl, arm is higher).
+GRAB_RAISE_SL_OFFSET = 20.0
+GRAB_LOWER_SPEED = 20.0
+RECORD_SPEED = 30.0
+
+# Pose the arm travels to after closing the gripper (cube held).
+GRAB_CARRY_POSE = {
+    "shoulder_pan.pos": -23.0,
+    "shoulder_lift.pos": 5.0,
+    "elbow_flex.pos": 18.0,
+    "wrist_flex.pos": -14.0,
+    "wrist_roll.pos": 0.0,
+    "gripper.pos": GRIPPER_CLOSE_POS,
+}
+
+# Per-joint jitter limits (degrees) applied to transit waypoints for human-like variation.
+# Cube-approach and carry poses are never jittered to preserve precision.
+_JITTER_LIMITS: dict[str, float] = {
+    "shoulder_pan.pos": 5.0,
+    "shoulder_lift.pos": 4.0,
+    "elbow_flex.pos": 4.0,
+    "wrist_flex.pos": 3.0,
+    "wrist_roll.pos": 2.0,
+    "gripper.pos": 0.0,
+}
+
+
+def _jitter_pose(pose: dict, rng: np.random.Generator) -> dict:
+    """Return a copy of pose with independent per-joint random perturbations."""
+    return {
+        k: v + rng.uniform(-_JITTER_LIMITS.get(k, 0.0), _JITTER_LIMITS.get(k, 0.0)) for k, v in pose.items()
+    }
+
+
+def _random_stuck_pose(rng: np.random.Generator) -> dict:
+    """Return a physically plausible stuck pose (failed grasp), gripper closed.
+
+    ef bounds are piecewise-linear in sl so the arm stays in a reachable,
+    table-safe envelope across the full sl range:
+      sl=-50 → ef ∈ [  0,  50]   (arm raised, can be bent forward)
+      sl=  0 → ef ∈ [-25,  25]   (mid reach)
+      sl= 30 → ef ∈ [-20,   0]   (arm extended, little room to flex)
+    wrist_flex is randomly offset from the horizontal value.
+    """
+    pan = float(rng.uniform(-5.0, 35.0))
+    sl = float(rng.uniform(-50.0, 30.0))
+
+    if sl <= 0.0:
+        alpha = (sl + 50.0) / 50.0  # 0 at sl=-50, 1 at sl=0
+        ef_lo = alpha * -25.0  # 0 → -25
+        ef_hi = 50.0 + alpha * -25.0  # 50 → 25
+    else:
+        alpha = sl / 30.0  # 0 at sl=0, 1 at sl=30
+        ef_lo = -25.0 + alpha * 5.0  # -25 → -20
+        ef_hi = 25.0 + alpha * -25.0  # 25 → 0
+
+    ef = float(rng.uniform(ef_lo, ef_hi))
+    wf = horizontal_wrist_flex(sl, ef) + float(rng.uniform(-15.0, 15.0))
+    return {
+        "shoulder_pan.pos": pan,
+        "shoulder_lift.pos": sl,
+        "elbow_flex.pos": ef,
+        "wrist_flex.pos": wf,
+        "wrist_roll.pos": float(rng.uniform(-15.0, 15.0)),
+        "gripper.pos": GRIPPER_CLOSE_POS,
+    }
+
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class OmxRecordGrabConfig:
+    robot: RobotConfig
+    dataset: DatasetRecordConfig
+    # Resume recording on an existing dataset.
+    resume: bool = False
+    # Fraction of episodes that start from a random stuck pose (gripper closed) to
+    # generate recovery data.  0.0 = disabled, 1.0 = all episodes are recovery starts.
+    recovery_prob: float = 0.5
+
+
+def record_episode_spline(
+    robot: OmxFollower,
+    waypoints: list[dict],
+    speeds: list[float],
+    dataset: LeRobotDataset,
+    task: str,
+) -> None:
+    """Execute a Catmull-Rom-style spline through waypoints, recording each frame.
+
+    Segment durations are parameterized from the maximum absolute joint delta
+    between consecutive waypoints divided by the requested segment speed,
+    producing non-uniform timing in joint space. Interior tangents are derived
+    from the adjacent per-segment velocities, with clamped (zero-velocity)
+    endpoints so the arm starts and stops smoothly. Each segment is cubic
+    Hermite, giving C1 continuity at every waypoint.
+    """
+    pts = [pose_to_array(w) for w in waypoints]
+    n = len(pts)
+
+    # Steps and duration per segment
+    n_steps_list = []
+    timestamps = []
+    for i in range(n - 1):
+        max_dist = float(np.max(np.abs(pts[i + 1] - pts[i])))
+        ns = max(1, int(max_dist / speeds[i] * dataset.fps)) if max_dist >= 0.5 else 0
+        n_steps_list.append(ns)
+        timestamps.append(ns / dataset.fps)
+
+    # Velocity tangents (deg/sec) — clamped at endpoints, Catmull-Rom for interior
+    vels = [np.zeros_like(pts[0])]
+    for i in range(1, n - 1):
+        v_prev = (pts[i] - pts[i - 1]) / timestamps[i - 1] if timestamps[i - 1] > 0 else np.zeros_like(pts[0])
+        v_next = (pts[i + 1] - pts[i]) / timestamps[i] if timestamps[i] > 0 else np.zeros_like(pts[0])
+        vels.append(0.5 * (v_prev + v_next))
+    vels.append(np.zeros_like(pts[0]))
+
+    dt = 1.0 / dataset.fps
+    for seg in range(n - 1):
+        ns = n_steps_list[seg]
+        if ns == 0:
+            continue
+        p0, p1 = pts[seg], pts[seg + 1]
+        # Scale velocity (deg/sec) to t-space tangent (deg/t-unit, where t: 0→1 over ns steps)
+        m0 = vels[seg] * timestamps[seg]
+        m1 = vels[seg + 1] * timestamps[seg]
+
+        for step in range(1, ns + 1):
+            t = step / ns
+            h00 = 2 * t**3 - 3 * t**2 + 1
+            h10 = t**3 - 2 * t**2 + t
+            h01 = -2 * t**3 + 3 * t**2
+            h11 = t**3 - t**2
+            commanded = h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1
+
+            action = array_to_pose(commanded)
+            robot.send_action(action)
+            obs = robot.get_observation()
+            obs_frame = build_dataset_frame(dataset.features, obs, prefix=OBS_STR)
+            action_frame = build_dataset_frame(dataset.features, action, prefix=ACTION)
+            dataset.add_frame({**obs_frame, **action_frame, "task": task})
+            precise_sleep(dt)
+
+
+def record_grab_episode(
+    robot: OmxFollower,
+    dataset: LeRobotDataset,
+    pan: float,
+    t: float,
+    task: str,
+    recovery_start: bool = False,
+) -> None:
+    """Execute a targeted grab to the stored (pan, t) position, recording every frame.
+
+    Normal sequence (the initial move to a jittered HOME is NOT recorded):
+      jittered HOME → raised approach above cube → lower → close gripper
+           → raise [jittered] → retract [jittered] → GRAB_CARRY_POSE → drop → HOME
+
+    Recovery sequence (recovery_start=True): arm is moved to a random stuck pose
+    (gripper closed) without recording, then recording begins from there:
+      stuck_pose → raised approach above cube → [normal grab sequence from there]
+
+    Waypoints are joined by a cubic Hermite spline with Catmull-Rom-style interior tangents (C1-continuous).
+    """
+    sl = PUSH_START_SHOULDER_LIFT + t * (PUSH_END_SHOULDER_LIFT - PUSH_START_SHOULDER_LIFT)
+    ef = PUSH_START_ELBOW_FLEX + t * (PUSH_END_ELBOW_FLEX - PUSH_START_ELBOW_FLEX)
+    sl_raised = sl - GRAB_RAISE_SL_OFFSET
+    wf_horizontal = horizontal_wrist_flex(sl, ef)
+
+    rng = np.random.default_rng()
+
+    if recovery_start:
+        stuck_pose = _random_stuck_pose(rng)
+        logger.info(f"Recovery start: {stuck_pose}")
+        move_to_pose(robot, stuck_pose, APPROACH_SPEED)
+        first_waypoints = [stuck_pose]
+        first_speeds = []
+    else:
+        jittery_start = _jitter_pose(HOME_POSE, rng)
+        move_to_pose(robot, jittery_start, APPROACH_SPEED)
+        first_waypoints = [jittery_start]
+        first_speeds = []
+
+    waypoints = first_waypoints + [
+        {  # raised approach: arm above cube
+            "shoulder_pan.pos": pan,
+            "shoulder_lift.pos": sl_raised,
+            "elbow_flex.pos": ef,
+            "wrist_flex.pos": horizontal_wrist_flex(sl_raised, ef),
+            "wrist_roll.pos": 0.0,
+            "gripper.pos": 60.0,
+        },
+        {  # lower onto cube — no jitter: precision needed
+            "shoulder_pan.pos": pan,
+            "shoulder_lift.pos": sl,
+            "elbow_flex.pos": ef,
+            "wrist_flex.pos": wf_horizontal,
+            "wrist_roll.pos": 0.0,
+            "gripper.pos": 60.0,
+        },
+        {  # close gripper — no jitter: precision needed
+            "shoulder_pan.pos": pan,
+            "shoulder_lift.pos": sl,
+            "elbow_flex.pos": ef,
+            "wrist_flex.pos": wf_horizontal,
+            "wrist_roll.pos": 0.0,
+            "gripper.pos": GRIPPER_CLOSE_POS,
+        },
+        _jitter_pose(
+            {  # raise with cube
+                "shoulder_pan.pos": pan,
+                "shoulder_lift.pos": sl_raised,
+                "elbow_flex.pos": ef,
+                "wrist_flex.pos": horizontal_wrist_flex(sl_raised, ef),
+                "wrist_roll.pos": 0.0,
+                "gripper.pos": GRIPPER_CLOSE_POS,
+            },
+            rng,
+        ),
+        _jitter_pose(
+            {  # retract: fold arm toward HOME before sweeping to carry zone
+                "shoulder_pan.pos": pan * 0.25,
+                "shoulder_lift.pos": HOME_POSE["shoulder_lift.pos"] + 5.0,
+                "elbow_flex.pos": HOME_POSE["elbow_flex.pos"] - 5.0,
+                "wrist_flex.pos": 0.0,
+                "wrist_roll.pos": 0.0,
+                "gripper.pos": GRIPPER_CLOSE_POS,
+            },
+            rng,
+        ),
+        GRAB_CARRY_POSE,  # no jitter: target drop zone
+        {**GRAB_CARRY_POSE, "gripper.pos": 60.0},  # drop cube
+        HOME_POSE,
+    ]
+    speeds = first_speeds + [
+        RECORD_SPEED,  # start pose (jittered HOME or stuck pose) → raised approach
+        GRAB_LOWER_SPEED,  # raised approach → lower
+        GRAB_LOWER_SPEED,  # lower → close gripper
+        RECORD_SPEED,  # close gripper → raise
+        RECORD_SPEED,  # raise → retract
+        RECORD_SPEED,  # retract → carry pose
+        RECORD_SPEED,  # carry pose → drop
+        RECORD_SPEED,  # drop → HOME
+    ]
+
+    record_episode_spline(robot, waypoints, speeds, dataset, task)
+
+    # Dwell at HOME for ~0.5 s before next episode
+    home_action = build_dataset_frame(dataset.features, HOME_POSE, prefix=ACTION)
+    dt = 1.0 / dataset.fps
+    for _ in range(int(dataset.fps * 0.5)):
+        robot.send_action(HOME_POSE)
+        obs = robot.get_observation()
+        obs_frame = build_dataset_frame(dataset.features, obs, prefix=OBS_STR)
+        dataset.add_frame({**obs_frame, **home_action, "task": task})
+        precise_sleep(dt)
+
+
+@parser.wrap()
+def record_grab(cfg: OmxRecordGrabConfig) -> LeRobotDataset:
+    logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+    logger.info(pformat(cfg))
+
+    robot = make_robot_from_config(cfg.robot)
+    use_videos = cfg.dataset.video
+
+    teleop_action_processor, _, robot_obs_processor = make_default_processors()
+
+    dataset_features = combine_feature_dicts(
+        aggregate_pipeline_dataset_features(
+            pipeline=teleop_action_processor,
+            initial_features=create_initial_features(action=robot.action_features),
+            use_videos=use_videos,
+        ),
+        aggregate_pipeline_dataset_features(
+            pipeline=robot_obs_processor,
+            initial_features=create_initial_features(observation=robot.observation_features),
+            use_videos=use_videos,
+        ),
+    )
+
+    num_cameras = len(robot.cameras) if hasattr(robot, "cameras") else 0
+    dataset = None
+
+    try:
+        if cfg.resume:
+            dataset = LeRobotDataset.resume(
+                cfg.dataset.repo_id,
+                root=cfg.dataset.root,
+                streaming_encoding=cfg.dataset.streaming_encoding,
+                batch_encoding_size=cfg.dataset.video_encoding_batch_size,
+                vcodec=cfg.dataset.vcodec,
+                encoder_threads=cfg.dataset.encoder_threads,
+                image_writer_processes=cfg.dataset.num_image_writer_processes if num_cameras > 0 else 0,
+                image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * num_cameras
+                if num_cameras > 0
+                else 0,
+            )
+        else:
+            cfg.dataset.stamp_repo_id()
+            dataset = LeRobotDataset.create(
+                cfg.dataset.repo_id,
+                cfg.dataset.fps,
+                root=cfg.dataset.root,
+                robot_type=robot.name,
+                features=dataset_features,
+                use_videos=use_videos,
+                streaming_encoding=cfg.dataset.streaming_encoding,
+                batch_encoding_size=cfg.dataset.video_encoding_batch_size,
+                vcodec=cfg.dataset.vcodec,
+                encoder_threads=cfg.dataset.encoder_threads,
+                image_writer_processes=cfg.dataset.num_image_writer_processes if num_cameras > 0 else 0,
+                image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * num_cameras
+                if num_cameras > 0
+                else 0,
+            )
+
+        robot.connect(calibrate=True)
+
+        rng = np.random.default_rng()
+        with VideoEncodingManager(dataset):
+            for episode_idx in range(cfg.dataset.num_episodes):
+                logger.info(f"=== Episode {episode_idx + 1}/{cfg.dataset.num_episodes} ===")
+
+                logger.info("Step 1: grabbing and placing cube...")
+                grab_cube(robot)
+                pan, t = place_cube(robot)
+                logger.info(f"Cube placed at pan={pan:.1f}, reach={t:.2f}")
+
+                recovery_start = cfg.recovery_prob > 0 and float(rng.random()) < cfg.recovery_prob
+                logger.info(f"Step 2: recording {'recovery ' if recovery_start else ''}grab episode...")
+                record_grab_episode(
+                    robot,
+                    dataset,
+                    pan,
+                    t,
+                    cfg.dataset.single_task,
+                    recovery_start=recovery_start,
+                )
+
+                dataset.save_episode()
+                logger.info(f"Episode {episode_idx + 1} saved.")
+
+    finally:
+        if dataset is not None:
+            dataset.finalize()
+        if robot.is_connected:
+            robot.disconnect()
+
+    if cfg.dataset.push_to_hub and dataset and dataset.num_episodes > 0:
+        dataset.push_to_hub(tags=cfg.dataset.tags, private=cfg.dataset.private)
+
+    return dataset
+
+
+if __name__ == "__main__":
+    record_grab()
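
The timing and tangent scheme in `record_episode_spline` can be checked offline, without a robot or dataset. The following is a minimal sketch under that assumption: the function name `hermite_trajectory` and the `min_delta` parameter are illustrative and not part of this PR, and the robot and dataset calls are replaced by collecting the commanded joint vectors in an array.

```python
import numpy as np


def hermite_trajectory(pts, speeds, fps, min_delta=0.5):
    """Sample a clamped cubic Hermite spline through pts (one row per waypoint)."""
    pts = np.asarray(pts, dtype=float)
    n, dim = pts.shape

    # Step count and duration per segment, driven by the largest joint delta.
    steps, durations = [], []
    for i in range(n - 1):
        max_dist = float(np.max(np.abs(pts[i + 1] - pts[i])))
        ns = max(1, int(max_dist / speeds[i] * fps)) if max_dist >= min_delta else 0
        steps.append(ns)
        durations.append(ns / fps)

    def segment_velocity(i):
        # Average velocity over segment i; zero for skipped (zero-length) segments.
        if durations[i] > 0:
            return (pts[i + 1] - pts[i]) / durations[i]
        return np.zeros(dim)

    # Zero velocity at both endpoints; interior waypoints use the mean of the
    # velocities of the two adjacent segments.
    vels = [np.zeros(dim)]
    vels += [0.5 * (segment_velocity(i - 1) + segment_velocity(i)) for i in range(1, n - 1)]
    vels.append(np.zeros(dim))

    samples = []
    for seg, ns in enumerate(steps):
        if ns == 0:
            continue
        p0, p1 = pts[seg], pts[seg + 1]
        # Convert velocities (units/s) to tangents in the normalized parameter t in [0, 1].
        m0 = vels[seg] * durations[seg]
        m1 = vels[seg + 1] * durations[seg]
        for step in range(1, ns + 1):
            t = step / ns
            h00 = 2 * t**3 - 3 * t**2 + 1
            h10 = t**3 - 2 * t**2 + t
            h01 = -2 * t**3 + 3 * t**2
            h11 = t**3 - t**2
            samples.append(h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1)
    return np.array(samples)


# Two joints, three waypoints, both segments at 30 units/s and 30 fps.
waypoints = [[0.0, 0.0], [30.0, 10.0], [30.0, 40.0]]
trajectory = hermite_trajectory(waypoints, speeds=[30.0, 30.0], fps=30)
print(trajectory.shape)
```

Because the Hermite basis evaluates to `h01 = 1` and zero elsewhere at `t = 1`, the last sample of every segment lands exactly on its waypoint, which makes it easy to assert that an episode ends at the intended pose.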
@@ -0,0 +1,267 @@
+#!/usr/bin/env python3
+"""
+Auto-reset and cube-grab utility for the OMX robot arm.
+
+Provides:
+  - grab_cube(robot): sweep workspace, center cube, close gripper
+  - place_cube(robot): carry cube to a random position, release
+
+Standalone usage (run from repo root):
+    python -m examples.omx.reset_environment --port /dev/ttyACM1 --mode grab
+    python -m examples.omx.reset_environment --port /dev/ttyACM1 --mode grab_and_place
+
+Joint range: -100 to 100 for arm joints; gripper: 50 = closed, 80 = open.
+
+To read current joint values for calibration, add after robot.connect():
+    obs = robot.get_observation()
+    print({k: round(obs[k], 1) for k in JOINT_NAMES})
+    robot.disconnect(); raise SystemExit
+
+Parallel-to-ground IK: wrist_flex = WRIST_HORIZONTAL_OFFSET - shoulder_lift - elbow_flex.
+Linear interpolation preserves this constraint between any two poses that satisfy it.
+"""
+
+import argparse
+import logging
+
+import numpy as np
+
+from lerobot.robots.omx_follower import OmxFollower, OmxFollowerConfig
+from lerobot.robots.robot import Robot
+from lerobot.utils.robot_utils import precise_sleep
+
+logger = logging.getLogger(__name__)
+
+# ── Poses ─────────────────────────────────────────────────────────────────────
+
+HOME_POSE = {
+    "shoulder_pan.pos": 0.0,
+    "shoulder_lift.pos": -50.0,
+    "elbow_flex.pos": 50.0,
+    "wrist_flex.pos": 0.0,
+    "wrist_roll.pos": 0.0,
+    "gripper.pos": 60.0,
+}
+
+SWEEP_WAYPOINTS = [
+    {
+        "shoulder_pan.pos": -60.0,
+        "shoulder_lift.pos": 50.0,
+        "elbow_flex.pos": -60.0,
+        "wrist_flex.pos": -20.0,
+        "wrist_roll.pos": 0.0,
+        "gripper.pos": 60.0,
+    },
+    {
+        "shoulder_pan.pos": -30.0,
+        "shoulder_lift.pos": 50.0,
+        "elbow_flex.pos": -60.0,
+        "wrist_flex.pos": -5.0,
+        "wrist_roll.pos": 0.0,
+        "gripper.pos": 60.0,
+    },
+    {
+        "shoulder_pan.pos": 20.0,
+        "shoulder_lift.pos": 50.0,
+        "elbow_flex.pos": -55.0,
+        "wrist_flex.pos": -5.0,
+        "wrist_roll.pos": 0.0,
+        "gripper.pos": 60.0,
+    },
+]
+
+# ── Motion parameters ─────────────────────────────────────────────────────────
+
+CONTROL_HZ = 30
+APPROACH_SPEED = 50.0
+SWEEP_SPEED = 40.0
+
+# ── Grab-sequence parameters ──────────────────────────────────────────────────
+
+GRAB_PAN = 0.0
+SWEEP_LEFT_PAN = -60.0
+SWEEP_RIGHT_PAN = 60.0
+SWEEP_END_OFFSET = 5.0  # stop before center so the cube isn't pushed past GRAB_PAN
+SWEEP_END_PAN_RANGE = (15.0, 20.0)
+
+SWEEP_LOW_SHOULDER_LIFT = 50.0
+SWEEP_LOW_ELBOW_FLEX_START = -60.0
+SWEEP_LOW_ELBOW_FLEX_END = -55.0
+
+SWEEP_HIGH_WRIST_FLEX = -20.0  # wrist tilted up during high approach to clear obstacles
+
+PUSH_START_SHOULDER_LIFT = 0.0
+PUSH_START_ELBOW_FLEX = 45.0
+PUSH_END_SHOULDER_LIFT = 50.0
+PUSH_END_ELBOW_FLEX = -50.0
+# Subtracted from shoulder_lift during the push sweep to clear the platform surface.
+# Does not affect the grab-target interpolation in record_grab.py.
+PUSH_RAISE_OFFSET = 5.0
+
+WRIST_HORIZONTAL_OFFSET = 0.0  # tune if gripper tilts during push: + tilts nose up, - down
+GRIPPER_CLOSE_POS = 50.0
+
+PLACE_LEFT_PAN_RANGE = (5.0, 30.0)  # random pan range for cube placement on the left side
+PLACE_REACH_RANGE = (0.1, 0.7)  # 0 = arm retracted (PUSH_START), 1 = fully extended (PUSH_END)
+
+JOINT_NAMES = [
+    "shoulder_pan.pos",
+    "shoulder_lift.pos",
+    "elbow_flex.pos",
+    "wrist_flex.pos",
+    "wrist_roll.pos",
+    "gripper.pos",
+]
+
+# ── Helpers ───────────────────────────────────────────────────────────────────
+
+
+def pose_to_array(pose: dict) -> np.ndarray:
+    return np.array([pose[k] for k in JOINT_NAMES])
+
+
+def array_to_pose(arr: np.ndarray) -> dict:
+    return {k: float(arr[i]) for i, k in enumerate(JOINT_NAMES)}
+
+
+def horizontal_wrist_flex(shoulder_lift: float, elbow_flex: float) -> float:
+    return WRIST_HORIZONTAL_OFFSET - shoulder_lift - elbow_flex
+
+
+def _low_sweep_pose(pan: float, elbow_flex: float, wrist_flex: float | None = None) -> dict:
+    sl = SWEEP_LOW_SHOULDER_LIFT
+    return {
+        "shoulder_pan.pos": pan,
+        "shoulder_lift.pos": sl,
+        "elbow_flex.pos": elbow_flex,
+        "wrist_flex.pos": horizontal_wrist_flex(sl, elbow_flex) if wrist_flex is None else wrist_flex,
+        "wrist_roll.pos": 0.0,
+        "gripper.pos": 60.0,
+    }
+
+
+def _high_sweep_pose(pan: float) -> dict:
+    return {**HOME_POSE, "shoulder_pan.pos": pan, "wrist_flex.pos": SWEEP_HIGH_WRIST_FLEX}
+
+
+def _push_pose(shoulder_lift: float, elbow_flex: float, pan: float = GRAB_PAN, gripper: float = 70.0) -> dict:
+    return {
+        "shoulder_pan.pos": pan,
+        "shoulder_lift.pos": shoulder_lift,
+        "elbow_flex.pos": elbow_flex,
+        "wrist_flex.pos": horizontal_wrist_flex(shoulder_lift, elbow_flex),
+        "wrist_roll.pos": 0.0,
+        "gripper.pos": gripper,
+    }
+
+
+def move_to_pose(robot: Robot, target: dict, speed: float) -> None:
+    """Linearly interpolate to target; the joint with the largest delta moves at `speed` (units/s)."""
+    obs = robot.get_observation()
+    current = np.array([obs[k] for k in JOINT_NAMES])
+    goal = pose_to_array(target)
+
+    max_distance = float(np.max(np.abs(goal - current)))
+    if max_distance < 0.5:
+        return
+
+    n_steps = max(1, int(max_distance / speed * CONTROL_HZ))
+    dt = 1.0 / CONTROL_HZ
+    for step in range(1, n_steps + 1):
+        t = step / n_steps
+        robot.send_action(array_to_pose(current + t * (goal - current)))
+        precise_sleep(dt)
+
+
+# ── Sequences ─────────────────────────────────────────────────────────────────
+
+
+def grab_cube(robot: Robot) -> None:
+    """Left sweep → right sweep → extend arm parallel to ground → close gripper."""
+    move_to_pose(robot, HOME_POSE, APPROACH_SPEED)
+
+    for pan, end_pan in [
+        (SWEEP_LEFT_PAN, GRAB_PAN - SWEEP_END_OFFSET),
+        (SWEEP_RIGHT_PAN, GRAB_PAN + SWEEP_END_OFFSET),
+    ]:
+        logger.info(f"Sweeping {'left' if pan < 0 else 'right'} → center...")
+        move_to_pose(robot, _high_sweep_pose(pan), APPROACH_SPEED)
+        move_to_pose(
+            robot, _low_sweep_pose(pan, SWEEP_LOW_ELBOW_FLEX_START, wrist_flex=-20.0), APPROACH_SPEED
+        )
+        move_to_pose(robot, _low_sweep_pose(end_pan, SWEEP_LOW_ELBOW_FLEX_END, wrist_flex=0.0), SWEEP_SPEED)
+        move_to_pose(robot, HOME_POSE, APPROACH_SPEED)
+
+    logger.info("Extending to push cube into gripper...")
+    move_to_pose(
+        robot,
+        _push_pose(PUSH_START_SHOULDER_LIFT - PUSH_RAISE_OFFSET, PUSH_START_ELBOW_FLEX),
+        APPROACH_SPEED,
+    )
+    move_to_pose(
+        robot,
+        _push_pose(PUSH_END_SHOULDER_LIFT - PUSH_RAISE_OFFSET, PUSH_END_ELBOW_FLEX),
+        SWEEP_SPEED,
+    )
+
+    logger.info("Closing gripper...")
+    move_to_pose(
+        robot,
+        _push_pose(PUSH_END_SHOULDER_LIFT, PUSH_END_ELBOW_FLEX, gripper=GRIPPER_CLOSE_POS),
+        APPROACH_SPEED,
+    )
+
+    logger.info("Grab complete.")
+
+
+def place_cube(robot: Robot) -> tuple[float, float]:
+    """Carry the cube (gripper closed) to a random position on the left side, then release.
+
+    Returns:
+        (pan, t): pan angle and reach scalar in [0, 1] (0 = PUSH_START, 1 = PUSH_END) of the placement.
+    """
+    pan = float(np.random.uniform(*PLACE_LEFT_PAN_RANGE))
+    t = float(np.random.uniform(*PLACE_REACH_RANGE))
+    sl = PUSH_START_SHOULDER_LIFT + t * (PUSH_END_SHOULDER_LIFT - PUSH_START_SHOULDER_LIFT)
+    ef = PUSH_START_ELBOW_FLEX + t * (PUSH_END_ELBOW_FLEX - PUSH_START_ELBOW_FLEX)
+    logger.info(f"Placing cube at pan={pan:.1f}, reach={t:.2f}...")
+
+    move_to_pose(robot, {**HOME_POSE, "gripper.pos": GRIPPER_CLOSE_POS}, APPROACH_SPEED)
+    move_to_pose(
+        robot, {**HOME_POSE, "shoulder_pan.pos": pan, "gripper.pos": GRIPPER_CLOSE_POS}, APPROACH_SPEED
+    )
+    move_to_pose(robot, _push_pose(sl, ef, pan=pan, gripper=GRIPPER_CLOSE_POS), APPROACH_SPEED)
+    move_to_pose(robot, _push_pose(sl, ef, pan=pan, gripper=80.0), APPROACH_SPEED)
+    move_to_pose(robot, HOME_POSE, APPROACH_SPEED)
+    logger.info("Place complete.")
+    return pan, t
+
+
+# ── Entry point ───────────────────────────────────────────────────────────────
+
+
+def main():
+    parser = argparse.ArgumentParser(description="OMX arm reset / grab script")
+    parser.add_argument("--port", default="/dev/ttyACM1")
+    parser.add_argument("--robot_id", default="omx_follower")
+    parser.add_argument("--mode", choices=["grab", "grab_and_place"], default="grab_and_place")
+    args = parser.parse_args()
+
+    logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+
+    robot = OmxFollower(OmxFollowerConfig(port=args.port, id=args.robot_id))
+    robot.connect(calibrate=True)
+
+    try:
+        if args.mode == "grab":
+            grab_cube(robot)
+        elif args.mode == "grab_and_place":
+            grab_cube(robot)
+            place_cube(robot)
+
+    finally:
+        robot.disconnect()
+
+
+if __name__ == "__main__":
+    main()
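
The module docstring of `reset_environment.py` states that linear interpolation preserves the parallel-to-ground constraint `wrist_flex = WRIST_HORIZONTAL_OFFSET - shoulder_lift - elbow_flex`. This holds because the constraint is affine in the joint values. A small numerical check, written as a sketch that only borrows the constants from the script (the `push_pose` and `constraint_residual` helpers here are illustrative and operate on a reduced three-joint vector):

```python
import numpy as np

WRIST_HORIZONTAL_OFFSET = 0.0  # same default as in the script


def horizontal_wrist_flex(shoulder_lift, elbow_flex):
    return WRIST_HORIZONTAL_OFFSET - shoulder_lift - elbow_flex


def push_pose(shoulder_lift, elbow_flex):
    # Reduced pose: [shoulder_lift, elbow_flex, wrist_flex], wrist held horizontal.
    return np.array([shoulder_lift, elbow_flex, horizontal_wrist_flex(shoulder_lift, elbow_flex)])


def constraint_residual(pose):
    # Zero when the wrist satisfies the parallel-to-ground relation.
    return pose[2] - horizontal_wrist_flex(pose[0], pose[1])


start = push_pose(0.0, 45.0)   # PUSH_START_* values
end = push_pose(50.0, -50.0)   # PUSH_END_* values

# Evaluate the constraint along the straight joint-space line between the poses.
residuals = [constraint_residual(start + t * (end - start)) for t in np.linspace(0.0, 1.0, 11)]
max_residual = max(abs(r) for r in residuals)
print(max_residual)
```

The same argument does not carry over to the spline in `record_grab.py` automatically; it holds there only as long as the tangents at constrained waypoints also satisfy the relation.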
@@ -59,8 +59,8 @@ keywords = ["lerobot", "huggingface", "robotics",  "machine learning", "artifici

 dependencies = [
    # Core ML
-    "torch>=2.7,<2.11.0",
-    "torchvision>=0.22.0,<0.26.0",
+    "torch>=2.7,<2.12.0",
+    "torchvision>=0.22.0,<0.27.0",
    "numpy>=2.0.0,<2.3.0", # NOTE: Explicitly listing numpy helps the resolver converge faster. Upper bound imposed by opencv-python-headless.
    "opencv-python-headless>=4.9.0,<4.14.0",
    "Pillow>=10.0.0,<13.0.0",
@@ -99,7 +99,7 @@ dataset = [
    "pandas>=2.0.0,<3.0.0", # NOTE: Transitive dependency of datasets
    "pyarrow>=21.0.0,<30.0.0", # NOTE: Transitive dependency of datasets
    "lerobot[av-dep]",
-    "torchcodec>=0.3.0,<0.11.0; sys_platform != 'win32' and (sys_platform != 'linux' or (platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l')) and (sys_platform != 'darwin' or platform_machine != 'x86_64')", # NOTE: Windows support starts at version 0.7 (needs torch==2.8), ffmpeg>=8 support starts at version 0.8.1 (needs torch==2.9), system-wide ffmpeg support starts at version 0.10 (needs torch==2.10).
+    "torchcodec>=0.3.0,<0.12.0; sys_platform != 'win32' and (sys_platform != 'linux' or (platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l')) and (sys_platform != 'darwin' or platform_machine != 'x86_64')", # NOTE: Windows support starts at version 0.7 (needs torch==2.8), ffmpeg>=8 support starts at version 0.8.1 (needs torch==2.9), system-wide ffmpeg support starts at version 0.10 (needs torch==2.10), 0.11 needs torch==2.11, 0.12 needs torch==2.12.
    "jsonlines>=4.0.0,<5.0.0",
 ]
 training = [
@@ -128,7 +128,7 @@ dataset_viz = ["lerobot[dataset]", "lerobot[viz]"]
 av-dep = ["av>=15.0.0,<16.0.0"]
 pygame-dep = ["pygame>=2.5.1,<2.7.0"]
 placo-dep = ["placo>=0.9.6,<0.9.17"]
-transformers-dep = ["transformers==5.3.0"] # TODO(Steven): https://github.com/huggingface/lerobot/pull/3249
+transformers-dep = ["transformers>=5.4.0,<5.6.0"]
 grpcio-dep = ["grpcio==1.73.1", "protobuf>=6.31.1,<6.32.0"]
 can-dep = ["python-can>=4.2.0,<5.0.0"]
 peft-dep = ["peft>=0.18.0,<1.0.0"]
@@ -194,6 +194,8 @@ groot = [
 ]
 sarm = ["lerobot[transformers-dep]", "pydantic>=2.0.0,<3.0.0", "faker>=33.0.0,<35.0.0", "lerobot[matplotlib-dep]", "lerobot[qwen-vl-utils-dep]"]
 xvla = ["lerobot[transformers-dep]"]
+eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
+evo1 = ["lerobot[transformers-dep]", "timm>=1.0.0,<1.1.0"]
 hilserl = ["lerobot[transformers-dep]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]

 # Features
@@ -257,6 +259,7 @@ all = [
    "lerobot[smolvla]",
    # "lerobot[groot]", TODO(Steven): Gr00t requires specific installation instructions for flash-attn
    "lerobot[xvla]",
+    "lerobot[evo1]",
    "lerobot[hilserl]",
    "lerobot[async]",
    "lerobot[dev]",
@@ -292,6 +295,20 @@ lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
 lerobot-rollout="lerobot.scripts.lerobot_rollout:main"

 # ---------------- Tool Configurations ----------------
+
+# cu128 wheels keep broad hardware reach; the driver floor is 570.86.
+# To use a different CUDA variant, reinstall torch with an explicit index, e.g.:
+#   uv pip install --force-reinstall torch torchvision \
+#       --index-url https://download.pytorch.org/whl/cu130
+[[tool.uv.index]]
+name = "pytorch-cu128"
+url = "https://download.pytorch.org/whl/cu128"
+explicit = true
+
+[tool.uv.sources]
+torch = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]
+torchvision = [{ index = "pytorch-cu128", marker = "sys_platform == 'linux'" }]
+
 [tool.setuptools.package-data]
 lerobot = ["envs/*.json"]

@@ -333,6 +350,7 @@ ignore = [
 # E402: conditional-import guards (TYPE_CHECKING / is_package_available) must precede the imports they protect
 "src/lerobot/scripts/convert_dataset_v21_to_v30.py" = ["E402"]
 "src/lerobot/policies/wall_x/**" = ["N801", "N812", "SIM102", "SIM108", "SIM210", "SIM211", "B006", "B007", "SIM118"] # Suppress these as they are coming from original Qwen2_5_vl code TODO(pepijn): refactor original
+"src/lerobot/policies/evo1/**" = ["N801", "N812"]

 [tool.ruff.lint.isort]
 combine-as-imports = true
@@ -133,9 +133,6 @@ class RealSenseCamera(Camera):

        self.rs_pipeline: rs.pipeline | None = None
        self.rs_profile: rs.pipeline_profile | None = None
-        # Meters per uint16 unit on the depth stream. Queried from the device
-        # at connect() time. Typical D-series value is 0.001 (= 1 mm/unit).
-        self.depth_scale: float | None = None

        self.thread: Thread | None = None
        self.stop_event: Event | None = None
@@ -193,17 +190,6 @@ class RealSenseCamera(Camera):
            ) from e

        self._configure_capture_settings()
-
-        # Query depth scale (meters per uint16 unit) when depth is enabled so
-        # consumers can convert the raw z16 stream to metric distances.
-        if self.use_depth and self.rs_profile is not None:
-            try:
-                depth_sensor = self.rs_profile.get_device().first_depth_sensor()
-                self.depth_scale = float(depth_sensor.get_depth_scale())
-            except RuntimeError as e:
-                logger.warning(f"{self}: failed to query depth scale ({e}); falling back to 0.001 m/unit.")
-                self.depth_scale = 0.001
-
        self._start_read_thread()

        # NOTE(Steven/Caroline): Enforcing at least one second of warmup as RS cameras need a bit of time before the first read. If we don't wait, the first read from the warmup will raise.
@@ -546,6 +532,7 @@ class RealSenseCamera(Camera):
            self.latest_timestamp = None
            self.new_frame_event.clear()

+    # NOTE(Steven): Missing implementation for depth for now
    @check_if_not_connected
    def async_read(self, timeout_ms: float = 200) -> NDArray[Any]:
        """
@@ -588,6 +575,7 @@ class RealSenseCamera(Camera):

        return frame

+    # NOTE(Steven): Missing implementation for depth for now
    @check_if_not_connected
    def read_latest(self, max_age_ms: int = 500) -> NDArray[Any]:
        """Return the most recent (color) frame captured immediately (Peeking).
@@ -623,78 +611,6 @@ class RealSenseCamera(Camera):

        return frame

-
-    @check_if_not_connected
-    def async_read_depth(self, timeout_ms: float = 200) -> NDArray[Any]:
-        """Read the latest depth frame asynchronously, in metric meters.
-
-        Mirrors :meth:`async_read` but returns the depth stream rather than the
-        color stream. Output is ``np.uint16`` of shape ``(H, W)``.
-
-        Raises:
-            DeviceNotConnectedError: If the camera is not connected.
-            RuntimeError: If ``use_depth`` is ``False`` for this camera, or if
-                the background read thread is not running.
-            TimeoutError: If no frame becomes available within ``timeout_ms``.
-        """
-        if not self.use_depth:
-            raise RuntimeError(
-                f"{self}: cannot read depth — camera was configured with use_depth=False."
-            )
-
-        if self.thread is None or not self.thread.is_alive():
-            raise RuntimeError(f"{self} read thread is not running.")
-
-        if not self.new_frame_event.wait(timeout=timeout_ms / 1000.0):
-            raise TimeoutError(
-                f"Timed out waiting for depth frame from camera {self} after {timeout_ms} ms."
-            )
-
-        with self.frame_lock:
-            depth_frame = self.latest_depth_frame
-            self.new_frame_event.clear()
-
-        if depth_frame is None:
-            raise RuntimeError(f"Internal error: Event set but no depth frame available for {self}.")
-
-        return depth_frame
-
-    @check_if_not_connected
-    def read_latest_depth(self, max_age_ms: int = 500) -> NDArray[Any]:
-        """Return the most recent depth frame in metric meters (peeking).
-
-        Non-blocking counterpart of :meth:`read_latest` for the depth stream.
-        Output is ``np.float32`` of shape ``(H, W)`` in meters.
-
-        Raises:
-            DeviceNotConnectedError: If the camera is not connected.
-            RuntimeError: If ``use_depth`` is ``False`` for this camera, or if
-                no depth frame has been captured yet.
-            TimeoutError: If the latest depth frame is older than ``max_age_ms``.
-        """
-        if not self.use_depth:
-            raise RuntimeError(
-                f"{self}: cannot read depth — camera was configured with use_depth=False."
-            )
-
-        if self.thread is None or not self.thread.is_alive():
-            raise RuntimeError(f"{self} read thread is not running.")
-
-        with self.frame_lock:
-            depth_frame = self.latest_depth_frame
-            timestamp = self.latest_timestamp
-
-        if depth_frame is None or timestamp is None:
-            raise RuntimeError(f"{self} has not captured any depth frames yet.")
-
-        age_ms = (time.perf_counter() - timestamp) * 1e3
-        if age_ms > max_age_ms:
-            raise TimeoutError(
-                f"{self} latest depth frame is too old: {age_ms:.1f} ms (max allowed: {max_age_ms} ms)."
-            )
-
-        return depth_frame
-
    def disconnect(self) -> None:
        """
        Disconnects from the camera, stops the pipeline, and cleans up resources.
@@ -718,8 +634,6 @@ class RealSenseCamera(Camera):
            self.rs_pipeline = None
            self.rs_profile = None

-        self.depth_scale = None
-
        with self.frame_lock:
            self.latest_color_frame = None
            self.latest_depth_frame = None
@@ -17,7 +17,7 @@
 from dataclasses import dataclass, field

 from lerobot.transforms import ImageTransformsConfig
-from lerobot.utils.import_utils import get_safe_default_video_backend
+from lerobot.utils.import_utils import get_safe_default_codec


@dataclass
@@ -34,7 +34,7 @@ class DatasetConfig:
    image_transforms: ImageTransformsConfig = field(default_factory=ImageTransformsConfig)
    revision: str | None = None
    use_imagenet_stats: bool = True
-    video_backend: str = field(default_factory=get_safe_default_video_backend)
+    video_backend: str = field(default_factory=get_safe_default_codec)
    # When True, video frames are returned as uint8 tensors (0-255) instead of float32 (0.0-1.0).
    # This reduces memory and speeds up DataLoader IPC. The training pipeline handles the conversion.
    return_uint8: bool = False
@@ -256,7 +256,9 @@ class TrainPipelineConfig(HubMixin):
                ) from e

        cli_args = kwargs.pop("cli_args", [])
-        if config_file is not None:
+        # Legacy RA-BC migration only applies to framework-saved checkpoints (always JSON).
+        # Hand-written YAML/TOML configs are expected to use the current sample_weighting schema.
+        if config_file is not None and config_file.endswith(".json"):
            with open(config_file) as f:
                config = json.load(f)
            migrated_config = _migrate_legacy_rabc_fields(config)
@@ -40,21 +40,10 @@ from .io_utils import load_episodes, write_stats
 from .lerobot_dataset import LeRobotDataset
 from .multi_dataset import MultiLeRobotDataset
 from .pipeline_features import aggregate_pipeline_dataset_features, create_initial_features
-from .pyav_utils import (
-    check_video_encoder_config_pyav,
-    detect_available_encoders_pyav,
-    get_codec,
-)
 from .sampler import EpisodeAwareSampler
 from .streaming_dataset import StreamingLeRobotDataset
 from .utils import DEFAULT_EPISODES_PATH, create_lerobot_dataset_card
-from .video_utils import (
-    DepthEncoderConfig,
-    VideoEncoderConfig,
-    VideoEncodingManager,
-    camera_encoder_defaults,
-    depth_encoder_defaults,
-)
+from .video_utils import VideoEncodingManager

 # NOTE: Low-level I/O functions (cast_stats_to_numpy, get_parquet_file_size_in_mb, etc.)
 # and legacy migration constants are intentionally NOT re-exported here.
@@ -69,22 +58,15 @@ __all__ = [
    "LeRobotDatasetMetadata",
    "MultiLeRobotDataset",
    "StreamingLeRobotDataset",
-    "DepthEncoderConfig",
-    "VideoEncoderConfig",
    "VideoEncodingManager",
-    "camera_encoder_defaults",
-    "depth_encoder_defaults",
    "add_features",
    "aggregate_datasets",
    "aggregate_pipeline_dataset_features",
    "aggregate_stats",
-    "check_video_encoder_config_pyav",
    "convert_image_to_video_dataset",
    "create_initial_features",
    "create_lerobot_dataset_card",
    "delete_episodes",
-    "detect_available_encoders_pyav",
-    "get_codec",
    "get_feature_stats",
    "load_episodes",
    "make_dataset",
@@ -332,6 +332,7 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
        videos_idx: Dictionary tracking video chunk and file indices.
        video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
        chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
+
    Returns:
        dict: Updated videos_idx with current chunk and file indices.
    """
@@ -416,7 +417,6 @@ def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chu
                concatenate_video_files(
                    [dst_path, src_path],
                    dst_path,
-                    compatibility_check=True,
                )
                # Update duration of this destination file
                dst_file_durations[dst_key] = current_dst_duration + src_duration
@@ -48,7 +48,7 @@ from .utils import (
    is_valid_version,
    update_chunk_file_indices,
 )
-from .video_utils import VideoEncoderConfig, get_video_info
+from .video_utils import get_video_info

 CODEBASE_VERSION = "v3.0"

@@ -313,20 +313,6 @@ class LeRobotDatasetMetadata:
        """Keys to access visual modalities stored as videos."""
        return [key for key, ft in self.features.items() if ft["dtype"] == "video"]

-    @property
-    def depth_keys(self) -> list[str]:
-        """Keys to access depth-map modalities stored as videos.
-
-        A depth video key is a feature whose ``info`` dict carries
-        ``"video.is_depth_map": True`` (set either at creation time by the user
-        or after the first encoded episode by :meth:`update_video_info`).
-        """
-        return [
-            key
-            for key, ft in self.features.items()
-            if ft["dtype"] == "video" and ft.get("info", {}).get("video.is_depth_map", False)
-        ]
-
    @property
    def camera_keys(self) -> list[str]:
        """Keys to access visual modalities (regardless of their storage method)."""
@@ -524,38 +510,19 @@ class LeRobotDatasetMetadata:
        self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats is not None else episode_stats
        write_stats(self.stats, self.root)

-    def update_video_info(
-        self,
-        video_key: str | None = None,
-        camera_encoder_config: VideoEncoderConfig | None = None,
-    ) -> None:
-        """Populate per-feature video info in ``info.json``.
-
+    def update_video_info(self, video_key: str | None = None) -> None:
+        """
        Warning: this function writes info from first episode videos, implicitly assuming that all videos have
        been encoded the same way. Also, this means it assumes the first episode exists.
-
-        Args:
-            video_key: If provided, only update this video key. Otherwise update
-                all video keys in the dataset.
-            camera_encoder_config: Encoder configuration used to produce the
-                videos. When provided, its fields are recorded as
-                ``video.<field>`` entries alongside the stream-derived
-                ``video.*`` entries (see :func:`get_video_info`).
        """
        if video_key is not None and video_key not in self.video_keys:
            raise ValueError(f"Video key {video_key} not found in dataset")

        video_keys = [video_key] if video_key is not None else self.video_keys
        for key in video_keys:
-            existing = self.features[key].get("info") or {}
-            # Repopulate when codec metadata is missing — preserves user-provided
-            # markers like ``video.is_depth_map`` while still recording stream
-            # info on the first episode.
-            if not existing or "video.codec" not in existing:
+            if not self.features[key].get("info", None):
                video_path = self.root / self.video_path.format(video_key=key, chunk_index=0, file_index=0)
-                stream_info = get_video_info(video_path, camera_encoder_config=camera_encoder_config)
-                merged = {**existing, **stream_info}
-                self.info.features[key]["info"] = merged
+                self.info.features[key]["info"] = get_video_info(video_path)

    def update_chunk_settings(
        self,
@@ -32,13 +32,7 @@ from .io_utils import (
    hf_transform_to_torch,
    load_nested_dataset,
 )
-from .video_utils import decode_depth_frames, decode_video_frames
-from .depth_utils import (
-    DEFAULT_DEPTH_MIN, 
-    DEFAULT_DEPTH_MAX, 
-    DEFAULT_DEPTH_SHIFT, 
-    DEFAULT_DEPTH_USE_LOG,
-)
+from .video_utils import decode_video_frames


 class DatasetReader:
@@ -243,31 +237,17 @@ class DatasetReader:
        """
        ep = self._meta.episodes[ep_idx]

-        depth_keys = set(self._meta.depth_keys)
-
        def _decode_single(vid_key: str, query_ts: list[float]) -> tuple[str, torch.Tensor]:
            from_timestamp = ep[f"videos/{vid_key}/from_timestamp"]
            shifted_query_ts = [from_timestamp + ts for ts in query_ts]
            video_path = self.root / self._meta.get_video_file_path(ep_idx, vid_key)
-            if vid_key in depth_keys:
-                feature_info = self._meta.features[vid_key].get("info") or {}
-                frames = decode_depth_frames(
-                    video_path,
-                    shifted_query_ts,
-                    self._tolerance_s,
-                    depth_min=feature_info.get("video.depth_min", DEFAULT_DEPTH_MIN),
-                    depth_max=feature_info.get("video.depth_max", DEFAULT_DEPTH_MAX),
-                    shift=feature_info.get("video.shift", DEFAULT_DEPTH_SHIFT),
-                    use_log=feature_info.get("video.use_log", DEFAULT_DEPTH_USE_LOG),
-                )
-            else:
-                frames = decode_video_frames(
-                    video_path,
-                    shifted_query_ts,
-                    self._tolerance_s,
-                    self._video_backend,
-                    return_uint8=self._return_uint8,
-                )
+            frames = decode_video_frames(
+                video_path,
+                shifted_query_ts,
+                self._tolerance_s,
+                self._video_backend,
+                return_uint8=self._return_uint8,
+            )
            return vid_key, frames.squeeze(0)

        items = list(query_timestamps.items())
@@ -62,7 +62,7 @@ from .utils import (
    DEFAULT_EPISODES_PATH,
    update_chunk_file_indices,
 )
-from .video_utils import VideoEncoderConfig, encode_video_frames, get_video_info
+from .video_utils import encode_video_frames, get_video_info


 def _load_episode_with_stats(src_dataset: LeRobotDataset, episode_idx: int) -> dict:
@@ -92,7 +92,6 @@ def delete_episodes(
    episode_indices: list[int],
    output_dir: str | Path | None = None,
    repo_id: str | None = None,
-    camera_encoder_config: VideoEncoderConfig | None = None,
 ) -> LeRobotDataset:
    """Delete episodes from a LeRobotDataset and create a new dataset.

@@ -101,7 +100,6 @@ def delete_episodes(
        episode_indices: List of episode indices to delete.
        output_dir: Root directory where the edited dataset will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id. Equivalent to new_root in EditDatasetConfig.
        repo_id: Edited dataset identifier. Equivalent to new_repo_id in EditDatasetConfig.
-        camera_encoder_config: Video encoder settings used when re-encoding video segments (default: :class:`VideoEncoderConfig()`).
    """
    if not episode_indices:
        raise ValueError("No episodes to delete")
@@ -134,7 +132,7 @@ def delete_episodes(

    video_metadata = None
    if dataset.meta.video_keys:
-        video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping, camera_encoder_config)
+        video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping)

    data_metadata = _copy_and_reindex_data(dataset, new_meta, episode_mapping)

@@ -156,7 +154,6 @@ def split_dataset(
    dataset: LeRobotDataset,
    splits: dict[str, float | list[int]],
    output_dir: str | Path | None = None,
-    camera_encoder_config: VideoEncoderConfig | None = None,
 ) -> dict[str, LeRobotDataset]:
    """Split a LeRobotDataset into multiple smaller datasets.

@@ -165,7 +162,6 @@ def split_dataset(
        splits: Either a dict mapping split names to episode indices, or a dict mapping
                split names to fractions (must sum to <= 1.0).
        output_dir: Root directory where the split datasets will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id.
-        camera_encoder_config: Video encoder settings used when re-encoding video segments (default: :class:`VideoEncoderConfig()`).

    Examples:
      Split by specific episodes
@@ -226,9 +222,7 @@ def split_dataset(

        video_metadata = None
        if dataset.meta.video_keys:
-            video_metadata = _copy_and_reindex_videos(
-                dataset, new_meta, episode_mapping, camera_encoder_config
-            )
+            video_metadata = _copy_and_reindex_videos(dataset, new_meta, episode_mapping)

        data_metadata = _copy_and_reindex_data(dataset, new_meta, episode_mapping)

@@ -584,7 +578,8 @@ def _keep_episodes_from_video_with_av(
    output_path: Path,
    episodes_to_keep: list[tuple[int, int]],
    fps: float,
-    camera_encoder_config: VideoEncoderConfig | None = None,
+    vcodec: str = "libsvtav1",
+    pix_fmt: str = "yuv420p",
 ) -> None:
    """Keep only specified episodes from a video file using PyAV.

@@ -598,10 +593,9 @@ def _keep_episodes_from_video_with_av(
            Ranges are half-open intervals: [start_frame, end_frame), where start_frame
            is inclusive and end_frame is exclusive.
        fps: Frame rate of the video.
-        camera_encoder_config: Video encoder settings (default: :class:`VideoEncoderConfig()`).
+        vcodec: Video codec to use for encoding.
+        pix_fmt: Pixel format for output video.
    """
-    if camera_encoder_config is None:
-        camera_encoder_config = VideoEncoderConfig()
    from fractions import Fraction

    import av
@@ -625,12 +619,12 @@ def _keep_episodes_from_video_with_av(

    # Convert fps to Fraction for PyAV compatibility.
    fps_fraction = Fraction(fps).limit_denominator(1000)
-    v_out = out.add_stream(camera_encoder_config.vcodec, rate=fps_fraction)
+    v_out = out.add_stream(vcodec, rate=fps_fraction)

    # PyAV type stubs don't distinguish video streams from audio/subtitle streams.
    v_out.width = v_in.codec_context.width
    v_out.height = v_in.codec_context.height
-    v_out.pix_fmt = camera_encoder_config.pix_fmt
+    v_out.pix_fmt = pix_fmt

    # Set time_base to match the frame rate for proper timestamp handling.
    v_out.time_base = Fraction(1, int(fps))
@@ -693,7 +687,8 @@ def _copy_and_reindex_videos(
    src_dataset: LeRobotDataset,
    dst_meta: LeRobotDatasetMetadata,
    episode_mapping: dict[int, int],
-    camera_encoder_config: VideoEncoderConfig | None = None,
+    vcodec: str = "libsvtav1",
+    pix_fmt: str = "yuv420p",
 ) -> dict[int, dict]:
    """Copy and filter video files, only re-encoding files with deleted episodes.

@@ -705,13 +700,10 @@ def _copy_and_reindex_videos(
        src_dataset: Source dataset to copy from
        dst_meta: Destination metadata object
        episode_mapping: Mapping from old episode indices to new indices
-        camera_encoder_config: Video encoder settings used when re-encoding segments (default: :class:`VideoEncoderConfig()`).

    Returns:
        dict mapping episode index to its video metadata (chunk_index, file_index, timestamps)
    """
-    if camera_encoder_config is None:
-        camera_encoder_config = VideoEncoderConfig()
    if src_dataset.meta.episodes is None:
        src_dataset.meta.episodes = load_episodes(src_dataset.meta.root)

@@ -800,7 +792,8 @@ def _copy_and_reindex_videos(
                    dst_video_path,
                    episodes_to_keep_ranges,
                    src_dataset.meta.fps,
-                    camera_encoder_config,
+                    vcodec,
+                    pix_fmt,
                )

                cumulative_ts = 0.0
@@ -1271,7 +1264,11 @@ def _estimate_frame_size_via_calibration(
    episode_indices: list[int],
    temp_dir: Path,
    fps: int,
-    camera_encoder_config: VideoEncoderConfig,
+    vcodec: str,
+    pix_fmt: str,
+    g: int,
+    crf: int,
+    fast_decode: int,
    num_calibration_frames: int = 30,
 ) -> float:
    """Estimate MB per frame by encoding a small calibration sample.
@@ -1285,7 +1282,11 @@ def _estimate_frame_size_via_calibration(
        episode_indices: List of episode indices being processed.
        temp_dir: Temporary directory for calibration files.
        fps: Frames per second for video encoding.
-        camera_encoder_config: Video encoder settings used for calibration encoding.
+        vcodec: Video codec (libsvtav1, h264, hevc).
+        pix_fmt: Pixel format (yuv420p, etc.).
+        g: GOP size (group of pictures).
+        crf: Constant Rate Factor (quality).
+        fast_decode: Fast decode tuning parameter.
        num_calibration_frames: Number of frames to use for calibration (default: 30).

    Returns:
@@ -1321,7 +1322,11 @@ def _estimate_frame_size_via_calibration(
            imgs_dir=calibration_dir,
            video_path=calibration_video_path,
            fps=fps,
-            camera_encoder_config=camera_encoder_config,
+            vcodec=vcodec,
+            pix_fmt=pix_fmt,
+            g=g,
+            crf=crf,
+            fast_decode=fast_decode,
            overwrite=True,
        )

@@ -1639,7 +1644,11 @@ def convert_image_to_video_dataset(
    dataset: LeRobotDataset,
    output_dir: Path | None = None,
    repo_id: str | None = None,
-    camera_encoder_config: VideoEncoderConfig | None = None,
+    vcodec: str = "libsvtav1",
+    pix_fmt: str = "yuv420p",
+    g: int = 2,
+    crf: int = 30,
+    fast_decode: int = 0,
    episode_indices: list[int] | None = None,
    num_workers: int = 4,
    max_episodes_per_batch: int | None = None,
@@ -1654,7 +1663,11 @@ def convert_image_to_video_dataset(
        dataset: The source LeRobot dataset with images
        output_dir: Root directory where the edited dataset will be stored. If not specified, defaults to $HF_LEROBOT_HOME/repo_id. Equivalent to new_root in EditDatasetConfig.
        repo_id: Edited dataset identifier. Equivalent to new_repo_id in EditDatasetConfig.
-        camera_encoder_config: Video encoder settings (default: :class:`VideoEncoderConfig()`).
+        vcodec: Video codec (default: libsvtav1)
+        pix_fmt: Pixel format (default: yuv420p)
+        g: Group of pictures size (default: 2)
+        crf: Constant rate factor (default: 30)
+        fast_decode: Fast decode tuning (default: 0)
        episode_indices: List of episode indices to convert (None = all episodes)
        num_workers: Number of threads for parallel processing (default: 4)
        max_episodes_per_batch: Maximum episodes per video batch to avoid memory issues (None = no limit)
@@ -1663,9 +1676,6 @@ def convert_image_to_video_dataset(
    Returns:
        New LeRobotDataset with images encoded as videos
    """
-    if camera_encoder_config is None:
-        camera_encoder_config = VideoEncoderConfig()
-
    # Check that it's an image dataset
    if len(dataset.meta.video_keys) > 0:
        raise ValueError(
@@ -1689,10 +1699,7 @@ def convert_image_to_video_dataset(
    logging.info(
        f"Converting {len(episode_indices)} episodes with {len(img_keys)} cameras from {dataset.repo_id}"
    )
-    logging.info(
-        f"Video codec: {camera_encoder_config.vcodec}, pixel format: {camera_encoder_config.pix_fmt}, "
-        f"GOP: {camera_encoder_config.g}, CRF: {camera_encoder_config.crf}"
-    )
+    logging.info(f"Video codec: {vcodec}, pixel format: {pix_fmt}, GOP: {g}, CRF: {crf}")

    # Create new features dict, converting image features to video features
    new_features = {}
@@ -1762,7 +1769,11 @@ def convert_image_to_video_dataset(
                episode_indices=episode_indices,
                temp_dir=temp_dir,
                fps=fps,
-                camera_encoder_config=camera_encoder_config,
+                vcodec=vcodec,
+                pix_fmt=pix_fmt,
+                g=g,
+                crf=crf,
+                fast_decode=fast_decode,
            )

            logging.info(f"Processing camera: {img_key}")
@@ -1804,7 +1815,11 @@ def convert_image_to_video_dataset(
                    imgs_dir=imgs_dir,
                    video_path=video_path,
                    fps=fps,
-                    camera_encoder_config=camera_encoder_config,
+                    vcodec=vcodec,
+                    pix_fmt=pix_fmt,
+                    g=g,
+                    crf=crf,
+                    fast_decode=fast_decode,
                    overwrite=True,
                )

@@ -1850,9 +1865,7 @@ def convert_image_to_video_dataset(
                video_path = new_meta.root / new_meta.video_path.format(
                    video_key=img_key, chunk_index=0, file_index=0
                )
-                new_meta.info.features[img_key]["info"] = get_video_info(
-                    video_path, camera_encoder_config=camera_encoder_config
-                )
+                new_meta.info.features[img_key]["info"] = get_video_info(video_path)

        write_info(new_meta.info, new_meta.root)

@@ -46,19 +46,15 @@ from .io_utils import (
    write_info,
 )
 from .utils import (
-    DEFAULT_DEPTH_PATH,
    DEFAULT_EPISODES_PATH,
    DEFAULT_IMAGE_PATH,
    update_chunk_file_indices,
 )
 from .video_utils import (
-    DepthEncoderConfig,
    StreamingVideoEncoder,
-    VideoEncoderConfig,
    concatenate_video_files,
    encode_video_frames,
    get_video_duration_in_s,
-    is_depth_feature,
 )

 logger = logging.getLogger(__name__)
@@ -69,19 +65,14 @@ def _encode_video_worker(
    episode_index: int,
    root: Path,
    fps: int,
-    camera_encoder_config: VideoEncoderConfig | None = None,
+    vcodec: str = "libsvtav1",
    encoder_threads: int | None = None,
 ) -> Path:
    temp_path = Path(tempfile.mkdtemp(dir=root)) / f"{video_key}_{episode_index:03d}.mp4"
    fpath = DEFAULT_IMAGE_PATH.format(image_key=video_key, episode_index=episode_index, frame_index=0)
    img_dir = (root / fpath).parent
    encode_video_frames(
-        img_dir,
-        temp_path,
-        fps,
-        camera_encoder_config=camera_encoder_config,
-        encoder_threads=encoder_threads,
-        overwrite=True,
+        img_dir, temp_path, fps, vcodec=vcodec, overwrite=True, encoder_threads=encoder_threads
    )
    shutil.rmtree(img_dir)
    return temp_path
@@ -98,40 +89,33 @@ class DatasetWriter:
        self,
        meta: LeRobotDatasetMetadata,
        root: Path,
-        camera_encoder_config: VideoEncoderConfig,
+        vcodec: str,
        encoder_threads: int | None,
        batch_encoding_size: int,
        streaming_encoder: StreamingVideoEncoder | None = None,
        initial_frames: int = 0,
-        depth_encoder_config: DepthEncoderConfig | None = None,
    ):
-        """Initialize the writer with metadata, codec, and encoder config.
+        """Initialize the writer with metadata, codec, and encoding config.

        Args:
            meta: Dataset metadata instance (used for feature schema, chunk
                settings, and episode persistence).
            root: Local dataset root directory.
-            camera_encoder_config: Video encoder settings applied to all cameras.
-            encoder_threads: Number of encoder threads (global). ``None``
-                lets the codec decide.
+            vcodec: Video codec for encoding (e.g. ``'libsvtav1'``, ``'h264'``).
+            encoder_threads: Threads per encoder instance. ``None`` for auto.
            batch_encoding_size: Number of episodes to accumulate before
                batch-encoding videos.
            streaming_encoder: Optional pre-built :class:`StreamingVideoEncoder`
                for real-time encoding. ``None`` disables streaming mode.
            initial_frames: Starting frame count (non-zero when resuming).
-            depth_encoder_config: Optional depth-map encoder config used in
-                place of ``camera_encoder_config`` for keys present in
-                ``meta.depth_keys``.
        """
        self._meta = meta
        self._root = root
-        self._camera_encoder_config = camera_encoder_config
-        self._depth_encoder_config = depth_encoder_config
+        self._vcodec = vcodec
        self._encoder_threads = encoder_threads
        self._batch_encoding_size = batch_encoding_size
        self._streaming_encoder = streaming_encoder

-
        # Writer state
        self.image_writer: AsyncImageWriter | None = None
        self.episode_buffer: dict = self._create_episode_buffer()
@@ -151,16 +135,8 @@ class DatasetWriter:
            ep_buffer[key] = current_ep_idx if key == "episode_index" else []
        return ep_buffer

-    def _is_depth_image_key(self, image_key: str) -> bool:
-        """Whether *image_key* is a depth feature stored as per-frame images."""
-        ft = self._meta.features.get(image_key)
-        if ft is None or ft.get("dtype") != "image":
-            return False
-        return is_depth_feature(ft.get("info") or {})
-
    def _get_image_file_path(self, episode_index: int, image_key: str, frame_index: int) -> Path:
-        path_template = DEFAULT_DEPTH_PATH if self._is_depth_image_key(image_key) else DEFAULT_IMAGE_PATH
-        fpath = path_template.format(
+        fpath = DEFAULT_IMAGE_PATH.format(
            image_key=image_key, episode_index=episode_index, frame_index=frame_index
        )
        return self._root / fpath
@@ -308,7 +284,7 @@ class DatasetWriter:
                            episode_index,
                            self._root,
                            self._meta.fps,
-                            self._camera_encoder_config,
+                            self._vcodec,
                            self._encoder_threads,
                        ): video_key
                        for video_key in self._meta.video_keys
@@ -519,13 +495,7 @@ class DatasetWriter:

        # Update video info (only needed when first episode is encoded)
        if episode_index == 0:
-            is_depth_key = video_key in set(self._meta.depth_keys)
-            cfg_for_info = (
-                self._depth_encoder_config
-                if is_depth_key and self._depth_encoder_config is not None
-                else self._camera_encoder_config
-            )
-            self._meta.update_video_info(video_key, camera_encoder_config=cfg_for_info)
+            self._meta.update_video_info(video_key)
            write_info(self._meta.info, self._meta.root)

        metadata = {
@@ -594,12 +564,7 @@ class DatasetWriter:
    def _encode_temporary_episode_video(self, video_key: str, episode_index: int) -> Path:
        """Use ffmpeg to convert frames stored as png into mp4 videos."""
        return _encode_video_worker(
-            video_key,
-            episode_index,
-            self._root,
-            self._meta.fps,
-            self._camera_encoder_config,
-            self._encoder_threads,
+            video_key, episode_index, self._root, self._meta.fps, self._vcodec, self._encoder_threads
        )

    def close_writer(self) -> None:
@@ -1,189 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""
-Depth encoding/decoding helpers for :class:`VideoEncoderConfig`.
-"""
-
-import math
-from typing import Literal
-
-import numpy as np
-import torch
-from numpy.typing import NDArray
-
-DEPTH_QUANT_BITS: int = 12
-DEPTH_QMAX: int = (1 << DEPTH_QUANT_BITS) - 1  # 4095
-_MM_PER_METRE: float = 1000.0
-_UINT16_MAX: int = 65535
-
-DEFAULT_DEPTH_MIN: float = 0.01
-DEFAULT_DEPTH_MAX: float = 10.0
-DEFAULT_DEPTH_SHIFT: float = 3.5
-DEFAULT_DEPTH_USE_LOG: bool = True
-
-
-def _validate_log_quant_params(depth_min: float, shift: float) -> None:
-    """Ensure ``log(depth_min + shift)`` is finite."""
-    if depth_min + shift <= 0:
-        raise ValueError(
-            f"depth_min + shift must be positive for logarithmic quantization, "
-            f"got depth_min={depth_min} + shift={shift} = {depth_min + shift}"
-        )
-
-
-def _depth_input_to_float32_and_unit(
-    depth: NDArray[np.uint16] | NDArray[np.floating] | torch.Tensor,
-    input_unit: Literal["auto", "m", "mm"],
-) -> tuple[NDArray[np.float32], Literal["m", "mm"]]:
-    """Depth as float32 in the chosen unit, plus the resolved unit."""
-    if isinstance(depth, torch.Tensor):
-        t = depth.detach().cpu()
-        arr = t.numpy()
-        is_floating = t.is_floating_point()
-    else:
-        arr = np.asarray(depth)
-        is_floating = np.issubdtype(arr.dtype, np.floating)
-
-    resolved_unit: Literal["m", "mm"]
-    if input_unit == "auto":
-        resolved_unit = "m" if is_floating else "mm"
-    else:
-        resolved_unit = input_unit
-
-    # Convert to float32 to keep typing consistency
-    return np.asarray(arr, dtype=np.float32, order="K"), resolved_unit
-
-
-def quantize_depth(
-    depth: NDArray[np.uint16] | NDArray[np.floating] | torch.Tensor,
-    depth_min: float = DEFAULT_DEPTH_MIN,
-    depth_max: float = DEFAULT_DEPTH_MAX,
-    shift: float = DEFAULT_DEPTH_SHIFT,
-    use_log: bool = DEFAULT_DEPTH_USE_LOG,
-    *,
-    input_unit: Literal["auto", "m", "mm"] = "auto",
-) -> NDArray[np.uint16]:
-    """Quantize depth to 12-bit codes (``uint16``, values ``0…DEPTH_QMAX``).
-
-    Depth maps are packed into 12-bit integer frames so they fit in standard
-    high-bit-depth pixel formats (e.g. ``yuv420p12le`` / ``gray12le``)
-    and can be encoded by widely supported video codecs (HEVC Main 12, ffv1).
-    Logarithmic quantization is the default because it allocates more quanta
-    to near-range depth, which matches the (1/depth) error profile of typical
-    depth sensors. Math is ported from BEHAVIOR-1K's ``obs_utils.py``.
-
-    **Input units**:
-
-    - ``input_unit="auto"`` (default): infer from dtype (floating = m, non-floating = mm).
-    - ``input_unit="mm"``: interpret input values as millimetres.
-    - ``input_unit="m"``: interpret input values as metres.
-
-    Quantization math runs in the **resolved input unit**. 
-    
-    ``depth_min``, ``depth_max``, and ``shift`` are always in **metres**.
-
-    Args:
-        depth: Depth map; ``torch.Tensor`` is moved to CPU for conversion.
-        depth_min: Depth (metres) at quantum ``0``.
-        depth_max: Depth (metres) at quantum :data:`DEPTH_QMAX`.
-        shift: Depth shift (metres); used in log mode. Must satisfy ``depth_min + shift > 0``.
-        use_log: If ``True`` (default), quantize in log space.
-        input_unit: Input unit policy (``"auto"``, ``"mm"``, ``"m"``).
-
-    Returns:
-        ``numpy.ndarray``, ``dtype=uint16``, same shape as ``depth``, values in
-        ``[0, DEPTH_QMAX]``.
-
-    Raises:
-        ValueError: If ``input_unit`` is not ``"auto"``, ``"mm"``, or ``"m"``.
-        ValueError: If ``use_log=True`` and ``depth_min + shift <= 0``.
-    """
-    if input_unit not in ("auto", "m", "mm"):
-        raise ValueError(f"input_unit must be 'auto', 'm', or 'mm', got {input_unit!r}")
-
-    depth_f, resolved_unit = _depth_input_to_float32_and_unit(depth, input_unit=input_unit)
-    depth_min_u = np.float32(depth_min) if resolved_unit == "m" else np.float32(depth_min * _MM_PER_METRE)
-    depth_max_u = np.float32(depth_max) if resolved_unit == "m" else np.float32(depth_max * _MM_PER_METRE)
-    shift_u = np.float32(shift) if resolved_unit == "m" else np.float32(shift * _MM_PER_METRE)
-
-    if use_log:
-        _validate_log_quant_params(depth_min, shift)
-        log_min = math.log(float(depth_min_u + shift_u))
-        log_max = math.log(float(depth_max_u + shift_u))
-        norm = (np.log(depth_f + shift_u) - log_min) / (log_max - log_min)
-    else:
-        norm = (depth_f - depth_min_u) / (depth_max_u - depth_min_u)
-
-    out = np.rint(norm * DEPTH_QMAX).clip(0, DEPTH_QMAX)
-    return out.astype(np.uint16, copy=False)
-
-
-def dequantize_depth(
-    quantized: NDArray[np.uint16] | torch.Tensor,
-    depth_min: float = DEFAULT_DEPTH_MIN,
-    depth_max: float = DEFAULT_DEPTH_MAX,
-    shift: float = DEFAULT_DEPTH_SHIFT,
-    use_log: bool = DEFAULT_DEPTH_USE_LOG,
-    *,
-    output_unit: Literal["m", "mm"] = "mm",
-) -> NDArray[np.uint16] | NDArray[np.float32]:
-    """Inverse of :func:`quantize_depth`.
-
-    Tuning arguments **must match** :func:`quantize_depth`.
-
-    Decoding inverts the same normalized code mapping as :func:`quantize_depth`
-    using ``depth_min`` / ``depth_max`` / ``shift`` (in metres), then returns
-    the requested output unit.
-
-    Args:
-        quantized: 12-bit codes ``[0, DEPTH_QMAX]``, ``dtype=uint16``.
-        depth_min, depth_max, shift, use_log: Same as :func:`quantize_depth` (metres).
-        output_unit: ``\"mm\"`` returns ``uint16`` millimetres (``rint``, clip
-            ``[0, 65535]``). ``\"m\"`` returns ``float32`` metres in
-            ``[depth_min, depth_max]``.
-
-    Returns:
-        Depth map in the requested unit and dtype.
-
-    Raises:
-        ValueError: If ``use_log=True`` and ``depth_min + shift <= 0``.
-        ValueError: If ``output_unit`` is not ``\"m\"`` or ``\"mm\"``.
-    """
-    if output_unit not in ("m", "mm"):
-        raise ValueError(f"output_unit must be 'm' or 'mm', got {output_unit!r}")
-
-    if isinstance(quantized, torch.Tensor):
-        quantized = quantized.detach().cpu().numpy()
-    q = np.asarray(quantized, dtype=np.uint16, order="K")
-    norm = q.astype(np.float32, copy=False) / DEPTH_QMAX
-
-    depth_min_mm = np.float32(depth_min * _MM_PER_METRE)
-    depth_max_mm = np.float32(depth_max * _MM_PER_METRE)
-    shift_mm = np.float32(shift * _MM_PER_METRE)
-
-    if use_log:
-        _validate_log_quant_params(depth_min, shift)
-        log_min = math.log(float(depth_min_mm + shift_mm))
-        log_max = math.log(float(depth_max_mm + shift_mm))
-        depth_mm = np.exp(norm * (log_max - log_min) + log_min) - shift_mm
-    else:
-        depth_mm = norm * (depth_max_mm - depth_min_mm) + depth_min_mm
-
-    depth_mm = np.clip(depth_mm, depth_min_mm, depth_max_mm).astype(np.float32, copy=False)
-    if output_unit == "m":
-        return (depth_mm / np.float32(_MM_PER_METRE)).astype(np.float32, copy=False)
-    mm = np.rint(depth_mm).clip(0, _UINT16_MAX)
-    return mm.astype(np.uint16, copy=False)
@@ -294,20 +294,10 @@ def validate_feature_image_or_video(
    # Note: The check of pixels range ([0,1] for float and [0,255] for uint8) is done by the image writer threads.
    error_message = ""
    if isinstance(value, np.ndarray):
-        actual_shape = tuple(value.shape)
-        expected = tuple(expected_shape)
-        if len(expected) == 2:
-            # Single-channel features (e.g. depth maps) — accept (H,W), (1,H,W), (H,W,1)
-            h, w = expected
-            valid = actual_shape in {(h, w), (1, h, w), (h, w, 1)}
-            if not valid:
-                error_message += f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{(h, w)}', '{(1, h, w)}', or '{(h, w, 1)}'.\n"
-        elif len(expected) == 3:
-            c, h, w = expected
-            if len(actual_shape) != 3 or (actual_shape != (c, h, w) and actual_shape != (h, w, c)):
-                error_message += f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{(c, h, w)}' or '{(h, w, c)}'.\n"
-        else:
-            error_message += f"The feature '{name}' has an unsupported expected_shape '{expected}'.\n"
+        actual_shape = value.shape
+        c, h, w = expected_shape
+        if len(actual_shape) != 3 or (actual_shape != (c, h, w) and actual_shape != (h, w, c)):
+            error_message += f"The feature '{name}' of shape '{actual_shape}' does not have the expected shape '{(c, h, w)}' or '{(h, w, c)}'.\n"
    elif isinstance(value, PILImage.Image):
        pass
    else:
@@ -41,56 +41,15 @@ def safe_stop_image_writer(func):
    return wrapper


-# Single-channel dtypes that PIL natively maps to the matching mode
-# (``uint8`` → ``L``, ``uint16`` → ``I;16``, ``float32`` → ``F``).
-GRAYSCALE_DTYPES: tuple[np.dtype, ...] = (
-    np.dtype("uint8"),
-    np.dtype("uint16"),
-    np.dtype("float32"),
-)
-
-
 def image_array_to_pil_image(image_array: np.ndarray, range_check: bool = True) -> PIL.Image.Image:
-    """Convert a NumPy array to a PIL Image, preserving precision for grayscale.
+    # TODO(aliberts): handle 1-channel and 4-channel arrays for depth images
+    if image_array.ndim != 3:
+        raise ValueError(f"The array has {image_array.ndim} dimensions, but 3 is expected for an image.")

-    Behaviour by shape:
-
-    - ``(H, W)`` or ``(1, H, W)`` / ``(H, W, 1)``: single-channel grayscale.
-      The native dtype is preserved using the matching PIL mode
-      (``L`` / ``I;16`` / ``F``). This is the path used for raw depth maps (no rescaling, clamping, or downcasting)
-    - ``(3, H, W)`` / ``(H, W, 3)``: RGB. Channels-first inputs are transposed
-      to channels-last. Float inputs in ``[0, 1]`` are scaled to ``uint8``
-      (existing behaviour, gated by ``range_check``).
-
-    Other shapes / channel counts raise ``NotImplementedError`` or
-    ``ValueError``.
-    """
-    if image_array.ndim not in (2, 3):
-        raise ValueError(
-            f"The array has {image_array.ndim} dimensions, but 2 or 3 is expected for an image."
-        )
-
-    # Squeeze 3D single-channel inputs to 2D so depth maps work whether the
-    # caller emits (H, W), (1, H, W), or (H, W, 1).
-    if image_array.ndim == 3:
-        if image_array.shape[0] == 1:
-            image_array = image_array[0]
-        elif image_array.shape[-1] == 1:
-            image_array = image_array[..., 0]
-
-    if image_array.ndim == 2:
-        if image_array.dtype not in GRAYSCALE_DTYPES:
-            raise ValueError(
-                f"Unsupported single-channel image dtype: {image_array.dtype}. "
-                f"Supported dtypes: {sorted(str(d) for d in GRAYSCALE_DTYPES)}."
-            )
-
-        return PIL.Image.fromarray(np.ascontiguousarray(image_array))
-
-    # 3D path: must be RGB (3 channels), channels-first or channels-last.
    if image_array.shape[0] == 3:
        # Transpose from pytorch convention (C, H, W) to (H, W, C)
        image_array = image_array.transpose(1, 2, 0)
+
    elif image_array.shape[-1] != 3:
        raise NotImplementedError(
            f"The image has {image_array.shape[-1]} channels, but 3 is required for now."
@@ -112,28 +71,13 @@ def image_array_to_pil_image(image_array: np.ndarray, range_check: bool = True)
    return PIL.Image.fromarray(image_array)


-def save_kwargs_for_path(fpath: Path, compress_level: int) -> dict:
-    """Pick the right format-specific kwargs for :meth:`PIL.Image.Image.save`.
-
-    PNG uses ``compress_level`` (0–9, zlib). TIFF uses ``compression`` (raw) for lossless raw depth maps.
-    """
-    suffix = Path(fpath).suffix.lower()
-    if suffix == ".png":
-        return {"compress_level": compress_level}
-    if suffix in (".tif", ".tiff"):
-        return {"compression": "raw"}
-    return {}
-
-
 def write_image(image: np.ndarray | PIL.Image.Image, fpath: Path, compress_level: int = 1):
    """
    Saves a NumPy array or PIL Image to a file.

    This function handles both NumPy arrays and PIL Image objects, converting
    the former to a PIL Image before saving. It includes error handling for
-    the save operation. The output format is inferred from the *fpath*
-    extension: ``.png`` → PNG with ``compress_level``, ``.tiff`` / ``.tif``
-    → lossless raw depth maps (TIFF).
+    the save operation.

    Args:
        image (np.ndarray | PIL.Image.Image): The image data to save.
@@ -157,7 +101,7 @@ def write_image(image: np.ndarray | PIL.Image.Image, fpath: Path, compress_level
            img = image
        else:
            raise TypeError(f"Unsupported image type: {type(image)}")
-        img.save(fpath, **save_kwargs_for_path(Path(fpath), compress_level))
+        img.save(fpath, compress_level=compress_level)
    except Exception as e:
        logger.error("Error writing image %s: %s", fpath, e)

@@ -35,11 +35,9 @@ from .utils import (
    is_valid_version,
 )
 from .video_utils import (
-    DepthEncoderConfig,
    StreamingVideoEncoder,
-    VideoEncoderConfig,
-    get_safe_default_video_backend,
-    seed_depth_feature_info,
+    get_safe_default_codec,
+    resolve_vcodec,
 )

 logger = logging.getLogger(__name__)
@@ -60,11 +58,10 @@ class LeRobotDataset(torch.utils.data.Dataset):
        video_backend: str | None = None,
        return_uint8: bool = False,
        batch_encoding_size: int = 1,
-        camera_encoder_config: VideoEncoderConfig | None = None,
-        depth_encoder_config: DepthEncoderConfig | None = None,
-        encoder_threads: int | None = None,
+        vcodec: str = "libsvtav1",
        streaming_encoding: bool = False,
        encoder_queue_maxsize: int = 30,
+        encoder_threads: int | None = None,
    ):
        """
        2 modes are available for instantiating this class, depending on 2 different use cases:
@@ -180,15 +177,16 @@ class LeRobotDataset(torch.utils.data.Dataset):
                You can also use the 'pyav' decoder used by Torchvision, which used to be the default option, or 'video_reader' which is another decoder of Torchvision.
            batch_encoding_size (int, optional): Number of episodes to accumulate before batch encoding videos.
                Set to 1 for immediate encoding (default), or higher for batched encoding. Defaults to 1.
-            camera_encoder_config (VideoEncoderConfig | None, optional): Video encoder settings for cameras
-                (codec, quality, etc.). Defaults to
-                :class:`~lerobot.datasets.video_utils.VideoEncoderConfig` defaults when ``None``.
-            encoder_threads (int | None, optional): Number of encoder threads (global). ``None`` lets the
-                codec decide.
+            vcodec (str, optional): Video codec for encoding videos during recording. Options: 'h264', 'hevc',
+                'libsvtav1', 'auto', or hardware-specific codecs like 'h264_videotoolbox', 'h264_nvenc'.
+                Defaults to 'libsvtav1'. Use 'auto' to auto-detect the best available hardware encoder.
            streaming_encoding (bool, optional): If True, encode video frames in real-time during capture
                instead of writing PNG images first. This makes save_episode() near-instant. Defaults to False.
            encoder_queue_maxsize (int, optional): Maximum number of frames to buffer per camera when using
                streaming encoding. Defaults to 30 (~1s at 30fps).
+            encoder_threads (int | None, optional): Number of threads per encoder instance. None lets the
+                codec auto-detect (default). Lower values reduce CPU usage per encoder. Maps to 'lp' (via svtav1-params) for
+                libsvtav1 and 'threads' for h264/hevc.

        Note:
            Write-mode parameters (``streaming_encoding``, ``batch_encoding_size``) passed to
@@ -204,13 +202,10 @@ class LeRobotDataset(torch.utils.data.Dataset):
        self.episodes = episodes
        self.tolerance_s = tolerance_s
        self.revision = revision if revision else CODEBASE_VERSION
-        self._video_backend = video_backend if video_backend else get_safe_default_video_backend()
+        self._video_backend = video_backend if video_backend else get_safe_default_codec()
        self._return_uint8 = return_uint8
        self._batch_encoding_size = batch_encoding_size
-        if camera_encoder_config is None:
-            camera_encoder_config = VideoEncoderConfig()
-        self._camera_encoder_config = camera_encoder_config
-        self._depth_encoder_config = depth_encoder_config
+        self._vcodec = resolve_vcodec(vcodec)
        self._encoder_threads = encoder_threads

        if self._requested_root is not None:
@@ -253,23 +248,16 @@ class LeRobotDataset(torch.utils.data.Dataset):
                DeprecationWarning,
                stacklevel=2,
            )
-            seed_depth_feature_info(self.meta.features, self._depth_encoder_config)
            streaming_enc = None
            if streaming_encoding and len(self.meta.video_keys) > 0:
                streaming_enc = self._build_streaming_encoder(
-                    self.meta.fps,
-                    self._camera_encoder_config,
-                    self._encoder_threads,
-                    encoder_queue_maxsize,
-                    depth_encoder_config=self._depth_encoder_config,
-                    depth_keys=self.meta.depth_keys,
+                    self.meta.fps, self._vcodec, encoder_queue_maxsize, encoder_threads
                )
            self.writer = DatasetWriter(
                meta=self.meta,
                root=self.root,
-                camera_encoder_config=self._camera_encoder_config,
-                depth_encoder_config=self._depth_encoder_config,
-                encoder_threads=self._encoder_threads,
+                vcodec=self._vcodec,
+                encoder_threads=encoder_threads,
                batch_encoding_size=batch_encoding_size,
                streaming_encoder=streaming_enc,
                initial_frames=self.meta.total_frames,
@@ -310,20 +298,19 @@ class LeRobotDataset(torch.utils.data.Dataset):
    @staticmethod
    def _build_streaming_encoder(
        fps: int,
-        camera_encoder_config: VideoEncoderConfig,
-        encoder_threads: int | None,
+        vcodec: str,
        encoder_queue_maxsize: int,
-        *,
-        depth_encoder_config: DepthEncoderConfig | None = None,
-        depth_keys: list[str] | None = None,
+        encoder_threads: int | None,
    ) -> StreamingVideoEncoder:
        return StreamingVideoEncoder(
            fps=fps,
-            camera_encoder_config=camera_encoder_config,
-            encoder_threads=encoder_threads,
+            vcodec=vcodec,
+            pix_fmt="yuv420p",
+            g=2,
+            crf=30,
+            preset=None,
            queue_maxsize=encoder_queue_maxsize,
-            depth_encoder_config=depth_encoder_config,
-            depth_keys=depth_keys,
+            encoder_threads=encoder_threads,
        )

    # ── Metadata properties ───────────────────────────────────────────
@@ -638,8 +625,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
        image_writer_threads: int = 0,
        video_backend: str | None = None,
        batch_encoding_size: int = 1,
-        camera_encoder_config: VideoEncoderConfig | None = None,
-        depth_encoder_config: DepthEncoderConfig | None = None,
+        vcodec: str = "libsvtav1",
        metadata_buffer_size: int = 10,
        streaming_encoding: bool = False,
        encoder_queue_maxsize: int = 30,
@@ -670,23 +656,20 @@ class LeRobotDataset(torch.utils.data.Dataset):
            video_backend: Video decoding backend (used when reading back).
            batch_encoding_size: Number of episodes to accumulate before
                batch-encoding videos. ``1`` means encode immediately.
-            camera_encoder_config: Video encoder settings for cameras; defaults
-                match :class:`~lerobot.datasets.video_utils.VideoEncoderConfig`
-                when ``None``.
-            encoder_threads: Number of encoder threads (global). ``None``
-                lets the codec decide.
+            vcodec: Video codec for encoding. Options include ``'libsvtav1'``,
+                ``'h264'``, ``'hevc'``, ``'auto'``.
            metadata_buffer_size: Number of episode metadata records to buffer
                before flushing to parquet.
            streaming_encoding: If ``True``, encode video frames in real-time
                during capture instead of writing images first.
            encoder_queue_maxsize: Max buffered frames per camera when using
                streaming encoding.
+            encoder_threads: Threads per encoder instance. ``None`` for auto.

        Returns:
            A new :class:`LeRobotDataset` in write mode.
        """
-        if camera_encoder_config is None:
-            camera_encoder_config = VideoEncoderConfig()
+        vcodec = resolve_vcodec(vcodec)
        obj = cls.__new__(cls)
        obj.meta = LeRobotDatasetMetadata.create(
            repo_id=repo_id,
@@ -707,32 +690,23 @@ class LeRobotDataset(torch.utils.data.Dataset):
        obj.image_transforms = None
        obj.delta_timestamps = None
        obj.episodes = None
-        obj._video_backend = video_backend if video_backend is not None else get_safe_default_video_backend()
+        obj._video_backend = video_backend if video_backend is not None else get_safe_default_codec()
        obj._return_uint8 = False
        obj._batch_encoding_size = batch_encoding_size
-        obj._camera_encoder_config = camera_encoder_config
-        obj._depth_encoder_config = depth_encoder_config
+        obj._vcodec = vcodec
        obj._encoder_threads = encoder_threads
-        seed_depth_feature_info(obj.meta.features, depth_encoder_config)

        # Reader is lazily created on first access (write-only mode)
        obj.reader = None

+        # Create writer
        streaming_enc = None
        if streaming_encoding and len(obj.meta.video_keys) > 0:
-            streaming_enc = cls._build_streaming_encoder(
-                fps,
-                camera_encoder_config,
-                encoder_threads,
-                encoder_queue_maxsize,
-                depth_encoder_config=depth_encoder_config,
-                depth_keys=obj.meta.depth_keys,
-            )
+            streaming_enc = cls._build_streaming_encoder(fps, vcodec, encoder_queue_maxsize, encoder_threads)
        obj.writer = DatasetWriter(
            meta=obj.meta,
            root=obj.root,
-            camera_encoder_config=camera_encoder_config,
-            depth_encoder_config=depth_encoder_config,
+            vcodec=vcodec,
            encoder_threads=encoder_threads,
            batch_encoding_size=batch_encoding_size,
            streaming_encoder=streaming_enc,
@@ -755,13 +729,12 @@ class LeRobotDataset(torch.utils.data.Dataset):
        force_cache_sync: bool = False,
        video_backend: str | None = None,
        batch_encoding_size: int = 1,
-        camera_encoder_config: VideoEncoderConfig | None = None,
-        depth_encoder_config: DepthEncoderConfig | None = None,
-        encoder_threads: int | None = None,
+        vcodec: str = "libsvtav1",
        image_writer_processes: int = 0,
        image_writer_threads: int = 0,
        streaming_encoding: bool = False,
        encoder_queue_maxsize: int = 30,
+        encoder_threads: int | None = None,
    ) -> "LeRobotDataset":
        """Resume recording on an existing dataset.

@@ -784,16 +757,13 @@ class LeRobotDataset(torch.utils.data.Dataset):
            video_backend: Video decoding backend for reading back data.
            batch_encoding_size: Number of episodes to accumulate before
                batch-encoding videos.
-            camera_encoder_config: Video encoder settings for cameras; defaults
-                match :class:`~lerobot.datasets.video_utils.VideoEncoderConfig`
-                when ``None``.
-            encoder_threads: Number of encoder threads (global). ``None``
-                lets the codec decide.
+            vcodec: Video codec for encoding.
            image_writer_processes: Subprocesses for async image writing.
            image_writer_threads: Threads for async image writing.
            streaming_encoding: If ``True``, encode video in real-time during
                capture.
            encoder_queue_maxsize: Max buffered frames per camera for streaming.
+            encoder_threads: Threads per encoder instance. ``None`` for auto.

        Returns:
            A :class:`LeRobotDataset` in write mode, ready to append episodes.
@@ -804,6 +774,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
                "Writing into the revision-safe Hub snapshot cache (used when root=None) would corrupt "
                "the shared cache. Please provide a local directory path."
            )
+        vcodec = resolve_vcodec(vcodec)
        obj = cls.__new__(cls)
        obj.repo_id = repo_id
        obj._requested_root = Path(root)
@@ -812,9 +783,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
        obj.image_transforms = None
        obj.delta_timestamps = None
        obj.episodes = None
-        obj._video_backend = video_backend if video_backend else get_safe_default_video_backend()
+        obj._video_backend = video_backend if video_backend else get_safe_default_codec()
        obj._return_uint8 = False
        obj._batch_encoding_size = batch_encoding_size
+        obj._vcodec = vcodec
+        obj._encoder_threads = encoder_threads

        if obj._requested_root is not None:
            obj._requested_root.mkdir(exist_ok=True, parents=True)
@@ -823,33 +796,21 @@ class LeRobotDataset(torch.utils.data.Dataset):
        obj.meta = LeRobotDatasetMetadata(
            obj.repo_id, obj._requested_root, obj.revision, force_cache_sync=force_cache_sync
        )
-
-        if camera_encoder_config is None:
-            camera_encoder_config = VideoEncoderConfig()
-        obj._camera_encoder_config = camera_encoder_config
-        obj._depth_encoder_config = depth_encoder_config
-        obj._encoder_threads = encoder_threads
        obj.root = obj.meta.root
-        seed_depth_feature_info(obj.meta.features, depth_encoder_config)

        # Reader is lazily created on first access (write-only mode)
        obj.reader = None

+        # Create writer for appending
        streaming_enc = None
        if streaming_encoding and len(obj.meta.video_keys) > 0:
            streaming_enc = cls._build_streaming_encoder(
-                obj.meta.fps,
-                camera_encoder_config,
-                encoder_threads,
-                encoder_queue_maxsize,
-                depth_encoder_config=depth_encoder_config,
-                depth_keys=obj.meta.depth_keys,
+                obj.meta.fps, vcodec, encoder_queue_maxsize, encoder_threads
            )
        obj.writer = DatasetWriter(
            meta=obj.meta,
            root=obj.root,
-            camera_encoder_config=camera_encoder_config,
-            depth_encoder_config=depth_encoder_config,
+            vcodec=vcodec,
            encoder_threads=encoder_threads,
            batch_encoding_size=batch_encoding_size,
            streaming_encoder=streaming_enc,
@@ -1,311 +0,0 @@
-#!/usr/bin/env python
-
-# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""PyAV-based compatibility checks for :class:`VideoEncoderConfig`.
-
-Centralises all :mod:`av` introspection of the bundled FFmpeg build.
-Checks degrade to a no-op when the target codec isn't available locally.
-"""
-
-from __future__ import annotations
-
-import functools
-import logging
-from typing import TYPE_CHECKING, Any, Literal
-
-import av
-import numpy as np
-import torch
-
-from lerobot.datasets.depth_utils import (
-    DEFAULT_DEPTH_MAX,
-    DEFAULT_DEPTH_MIN,
-    DEFAULT_DEPTH_SHIFT,
-    DEFAULT_DEPTH_USE_LOG,
-    quantize_depth,
-    dequantize_depth,
-)
-
-if TYPE_CHECKING:
-    from lerobot.datasets.video_utils import VideoEncoderConfig
-
-logger = logging.getLogger(__name__)
-
-# Pixel formats supported by the depth encode/decode helpers below. Both are
-# 16-bit-word formats that carry 12 significant bits per sample, matching the
-# ``DEPTH_QMAX = 4095`` quantization range.
-DEPTH_PIX_FMTS: tuple[str, ...] = ("yuv420p12le", "gray12le")
-
-# Neutral chroma for 12-bit YUV (the midpoint of [0, 4095]). Filling the U/V
-# planes with this value keeps the encoder from spending bits on chroma noise
-# when only the Y plane carries information.
-_NEUTRAL_CHROMA_12BIT: int = 2048
-
-FFMPEG_NUMERIC_OPTION_TYPES = ("INT", "INT64", "UINT64", "FLOAT", "DOUBLE")
-FFMPEG_INTEGER_OPTION_TYPES = ("INT", "INT64", "UINT64")
-
-
-def _write_u16_plane(plane: av.video.plane.VideoPlane, src: np.ndarray, fill_value: int | None = None) -> None:
-    """Copy ``src`` into a uint16 plane respecting FFmpeg line padding."""
-    height, width = src.shape
-    stride_u16 = plane.line_size // np.dtype(np.uint16).itemsize
-    dst = np.frombuffer(plane, dtype=np.uint16).reshape(height, stride_u16)
-    if fill_value is not None:
-        dst.fill(fill_value)
-    dst[:, :width] = src
-
-
-def encode_depth_frame_pyav(
-    depth: np.ndarray | torch.Tensor,
-    *,
-    pix_fmt: str = "yuv420p12le",
-    depth_min: float = DEFAULT_DEPTH_MIN,
-    depth_max: float = DEFAULT_DEPTH_MAX,
-    shift: float = DEFAULT_DEPTH_SHIFT,
-    use_log: bool = DEFAULT_DEPTH_USE_LOG,
-    input_unit: Literal["auto", "m", "mm"] = "auto",
-) -> av.VideoFrame:
-    """Quantize depth and pack it into a 12-bit PyAV video frame.
-
-    Args:
-        depth: Depth frame to encode (H, W). Unit handling follows
-            :func:`lerobot.datasets.depth_utils.quantize_depth`.
-        pix_fmt: Target pixel format. Must be one of :data:`DEPTH_PIX_FMTS`.
-        depth_min, depth_max, shift, use_log, input_unit: Forwarded to
-            :func:`quantize_depth`.
-
-    Returns:
-        An :class:`av.VideoFrame` in ``pix_fmt`` with quantized depth in the
-        luminance plane.
-    """
-    if pix_fmt not in DEPTH_PIX_FMTS:
-        raise ValueError(f"Unsupported depth pix_fmt={pix_fmt!r}; expected one of {DEPTH_PIX_FMTS}")
-
-    quantized_depth = quantize_depth(
-        depth,
-        depth_min=depth_min,
-        depth_max=depth_max,
-        shift=shift,
-        use_log=use_log,
-        input_unit=input_unit,
-    )
-    if quantized_depth.ndim != 2:
-        raise ValueError(f"depth must be a 2D frame; got shape {quantized_depth.shape}")
-
-    quantized_depth = np.ascontiguousarray(quantized_depth, dtype=np.uint16)
-    height, width = quantized_depth.shape
-
-    if pix_fmt == "gray12le":
-        frame = av.VideoFrame(width=width, height=height, format="gray12le")
-        _write_u16_plane(frame.planes[0], quantized_depth)
-        return frame
-
-    if height % 2 != 0 or width % 2 != 0:
-        raise ValueError("yuv420p12le requires even H and W")
-
-    frame = av.VideoFrame(width=width, height=height, format="yuv420p12le")
-    _write_u16_plane(frame.planes[0], quantized_depth)
-    neutral_chroma = np.full((height // 2, width // 2), _NEUTRAL_CHROMA_12BIT, dtype=np.uint16)
-    _write_u16_plane(frame.planes[1], neutral_chroma, fill_value=_NEUTRAL_CHROMA_12BIT)
-    _write_u16_plane(frame.planes[2], neutral_chroma, fill_value=_NEUTRAL_CHROMA_12BIT)
-    return frame
-
-
-def decode_depth_frame_pyav(
-    frame: av.VideoFrame | list[av.VideoFrame],
-    *,
-    depth_min: float = DEFAULT_DEPTH_MIN,
-    depth_max: float = DEFAULT_DEPTH_MAX,
-    shift: float = DEFAULT_DEPTH_SHIFT,
-    use_log: bool = DEFAULT_DEPTH_USE_LOG,
-    return_quantized: bool = False,
-    output_unit: Literal["m", "mm"] = "m",
-) -> np.ndarray:
-    """Decode one or many depth video frames to quantized or metric depth.
-
-    Args:
-        frame: A single depth frame or a list of depth frames.
-        depth_min, depth_max, shift, use_log: Forwarded to
-            :func:`dequantize_depth`.
-        return_quantized: If ``True``, return raw 12-bit quanta as ``uint16``.
-        output_unit: Unit for dequantized output (``"m"`` or ``"mm"``).
-
-    Returns:
-        ``(H, W)`` array for a single frame, or ``(N, H, W)`` for a list.
-    """
-    frames = frame if isinstance(frame, list) else [frame]
-    quantized = np.stack([f.reformat(format="gray12le").to_ndarray() for f in frames]).astype(np.uint16, copy=False)
-    if return_quantized:
-        return quantized[0] if len(frames) == 1 else quantized
-
-    decoded = dequantize_depth(
-        quantized,
-        depth_min=depth_min,
-        depth_max=depth_max,
-        shift=shift,
-        use_log=use_log,
-        output_unit=output_unit,
-    )
-    return decoded[0] if len(frames) == 1 else decoded
-
-
-@functools.cache
-def get_codec(vcodec: str) -> av.codec.Codec | None:
-    """PyAV write-mode ``Codec`` for *vcodec*, or ``None`` if unavailable."""
-    try:
-        return av.codec.Codec(vcodec, "w")
-    except Exception:
-        return None
-
-
-@functools.cache
-def _get_codec_options_by_name(vcodec: str) -> dict[str, av.option.Option]:
-    """Private-option name → PyAV ``Option`` for *vcodec* (empty if unavailable)."""
-    codec = get_codec(vcodec)
-    if codec is None:
-        return {}
-    return {opt.name: opt for opt in codec.descriptor.options}
-
-
-@functools.cache
-def _get_codec_video_formats(vcodec: str) -> tuple[str, ...]:
-    """Pixel formats accepted by *vcodec* in PyAV's preferred order (empty if unknown)."""
-    codec = get_codec(vcodec)
-    if codec is None:
-        return ()
-    return tuple(fmt.name for fmt in (codec.video_formats or []))
-
-
-def detect_available_encoders_pyav(encoders: list[str] | str) -> list[str]:
-    """Return the subset of *encoders* available as video encoders in the local FFmpeg build.
-
-    Each name is probed directly via :func:`get_codec`; input order is preserved.
-    """
-    if isinstance(encoders, str):
-        encoders = [encoders]
-
-    available: list[str] = []
-    for name in encoders:
-        codec = get_codec(name)
-        if codec is not None and codec.type == "video":
-            available.append(name)
-        else:
-            logger.debug("encoder '%s' not available as video encoder", name)
-    return available
-
-
-def _check_option_value(vcodec: str, label: str, value: Any, opt: av.option.Option) -> None:
-    """Range-check numeric *value* and choice-check string *value* against *opt*."""
-    type_name = opt.type.name
-    if type_name in FFMPEG_NUMERIC_OPTION_TYPES:
-        if isinstance(value, bool):
-            raise ValueError(
-                f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
-            )
-        elif isinstance(value, str):
-            try:
-                num_val = float(value)
-            except ValueError as e:
-                raise ValueError(
-                    f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
-                ) from e
-        elif isinstance(value, (float, int)):
-            num_val = value
-        else:
-            raise ValueError(
-                f"{label}={value!r} is not numeric; codec {vcodec!r} expects a number for this option."
-            )
-
-        # Check integer type compatibility
-        if type_name in FFMPEG_INTEGER_OPTION_TYPES and not float(num_val).is_integer():
-            raise ValueError(
-                f"{label}={num_val!r} must be an integer for codec {vcodec!r} "
-                f"(FFmpeg option {opt.name!r} is {type_name}); float values are not allowed."
-            )
-
-        # Check numeric range compatibility
-        lo, hi = float(opt.min), float(opt.max)
-        if lo < hi and not (lo <= num_val <= hi):
-            raise ValueError(
-                f"{label}={num_val} is out of range for codec {vcodec!r}; must be in [{lo}, {hi}]"
-            )
-
-    elif type_name == "STRING":
-        if isinstance(value, bool):
-            raise ValueError(f"{label}={value!r} is not a valid string value for codec {vcodec!r}.")
-        if isinstance(value, str):
-            str_val = value
-        elif isinstance(value, (int, float)):
-            str_val = str(value)
-        else:
-            raise ValueError(f"{label}={value!r} has unsupported type for STRING option on codec {vcodec!r}")
-
-        # Check string choice compatibility
-        choices = [c.name for c in (opt.choices or [])]
-        if choices and str_val not in choices:
-            raise ValueError(
-                f"{label}={str_val!r} is not a supported choice for codec "
-                f"{vcodec!r}; valid choices: {choices}"
-            )
-    else:
-        return
-
-
-def _check_pixel_format(vcodec: str, pix_fmt: str) -> None:
-    formats = _get_codec_video_formats(vcodec)
-    if formats and pix_fmt not in formats:
-        raise ValueError(
-            f"pix_fmt={pix_fmt!r} is not supported by codec {vcodec!r}; "
-            f"supported pixel formats: {list(formats)}"
-        )
-
-
-def _check_codec_options(vcodec: str, codec_options: dict[str, Any], config: VideoEncoderConfig) -> None:
-    """Validate merged encoder options (typed) against the codec's published AVOptions."""
-    supported_options = _get_codec_options_by_name(vcodec)
-    for key, value in codec_options.items():
-        # GOP size is not a codec-specific option, it has to be validated separately.
-        if key == "g":
-            if isinstance(value, bool) or not isinstance(value, int) or value < 1:
-                raise ValueError(f"g={value!r} must be a positive integer for codec {vcodec!r}")
-            continue
-        if key not in supported_options:
-            continue
-        opt = supported_options[key]
-        label = f"extra_options[{key!r}]" if key in config.extra_options else key
-        _check_option_value(vcodec, label, value, opt)
-
-
-def check_video_encoder_config_pyav(config: VideoEncoderConfig) -> None:
-    """Verify *config* is compatible with the bundled FFmpeg build.
-
-    Checks pixel format, abstract tuning-field compatibility, and each merged
-    encoder option from :meth:`~lerobot.datasets.video_utils.VideoEncoderConfig.get_codec_options`
-    against PyAV (including numeric ``extra_options`` present in that dict).
-    No-op when ``config.vcodec`` isn't in the local FFmpeg build.
-
-    Raises:
-        ValueError: on the first incompatibility encountered.
-    """
-    vcodec = config.vcodec
-    options = _get_codec_options_by_name(vcodec)
-    if not options:
-        logger.warning(
-            "Codec %r is not available in the bundled FFmpeg build; skipping compatibility checks.",
-            vcodec,
-        )
-        return
-    _check_pixel_format(config.vcodec, config.pix_fmt)
-    _check_codec_options(config.vcodec, config.get_codec_options(), config)
@@ -93,10 +93,6 @@ DEFAULT_EPISODES_PATH = EPISODES_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
 DEFAULT_DATA_PATH = DATA_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
 DEFAULT_VIDEO_PATH = VIDEO_DIR + "/{video_key}/" + CHUNK_FILE_PATTERN + ".mp4"
 DEFAULT_IMAGE_PATH = "images/{image_key}/episode-{episode_index:06d}/frame-{frame_index:06d}.png"
-# Depth maps live alongside images on disk but use TIFF instead of PNG: PNG
-# cannot natively round-trip float32, and several common loaders silently
-# downcast 16-bit grayscale.
-DEFAULT_DEPTH_PATH = "images/{image_key}/episode-{episode_index:06d}/frame-{frame_index:06d}.tiff"

 LEGACY_EPISODES_PATH = "meta/episodes.jsonl"
 LEGACY_EPISODES_STATS_PATH = "meta/episodes_stats.jsonl"
@@ -17,13 +17,12 @@ import contextlib
 import glob
 import importlib
 import logging
-import math
 import queue
 import shutil
 import tempfile
 import threading
 import warnings
-from dataclasses import asdict, dataclass, field
+from dataclasses import dataclass, field
 from fractions import Fraction
 from pathlib import Path
 from threading import Lock
@@ -38,23 +37,7 @@ import torchvision
 from datasets.features.features import register_feature
 from PIL import Image

-from lerobot.datasets.pyav_utils import (
-    check_video_encoder_config_pyav,
-    depth_to_video_frame,
-    detect_available_encoders_pyav,
-    decode_depth_frame,
-    encode_depth_frame_pyav,
-    decode_depth_frame_pyav,
-)
-from lerobot.datasets.depth_utils import (
-    quantize_depth,
-    dequantize_depth,
-    DEFAULT_DEPTH_MIN,
-    DEFAULT_DEPTH_MAX,
-    DEFAULT_DEPTH_SHIFT,
-    DEFAULT_DEPTH_USE_LOG,
-)
-from lerobot.utils.import_utils import get_safe_default_video_backend
+from lerobot.utils.import_utils import get_safe_default_codec

 logger = logging.getLogger(__name__)

@@ -69,226 +52,70 @@ HW_ENCODERS = [
    "h264_qsv",  # Intel Quick Sync
 ]

-VALID_VIDEO_CODECS = {"h264", "hevc", "libsvtav1", "ffv1", "auto"} | set(HW_ENCODERS)
-
-LIBSVTAV1_DEFAULT_PRESET: int = 12
+VALID_VIDEO_CODECS = {"h264", "hevc", "libsvtav1", "auto"} | set(HW_ENCODERS)


-@dataclass
-class VideoEncoderConfig:
-    """Video encoder configuration.
+def _get_codec_options(
+    vcodec: str,
+    g: int | None = 2,
+    crf: int | None = 30,
+    preset: int | None = None,
+) -> dict:
+    """Build codec-specific options dict for video encoding."""
+    options = {}

-    Attributes:
-        vcodec: FFmpeg encoder name. ``"auto"`` is resolved during
-            construction (HW encoder if available, else ``libsvtav1``).
-        pix_fmt: Pixel format (e.g. ``"yuv420p"``).
-        g: GOP size (keyframe interval).
-        crf: Quality level — mapped to the native quality parameter of the
-            codec (``crf`` for software, ``qp`` for NVENC/VAAPI,
-            ``q:v`` for VideoToolbox, ``global_quality`` for QSV).
-        preset: Speed/quality preset. Accepted type is per-codec.
-        fast_decode: Fast-decode tuning. For ``libsvtav1`` this is a level (0-2)
-            embedded in ``svtav1-params``. For ``h264`` and ``hevc`` non-zero values
-            set ``tune=fastdecode``. Ignored for other codecs.
-        video_backend: Python library driving FFmpeg for encoding. Only ``"pyav"``
-            is currently supported.
-        extra_options: Free-form dictionary of additional FFmpeg options
-            (e.g. ``{"tune": "film", "profile:v": "high", "bf": 2}``).
-    """
+    # GOP size (keyframe interval) - supported by VideoToolbox and software encoders
+    if g is not None and (vcodec in ("h264_videotoolbox", "hevc_videotoolbox") or vcodec not in HW_ENCODERS):
+        options["g"] = str(g)

-    vcodec: str = "libsvtav1"
-    pix_fmt: str = "yuv420p"
-    g: int | None = 2
-    crf: int | None = 30
-    preset: int | str | None = None
-    fast_decode: int = 0
-    # TODO(CarolinePascal): add torchcodec support + find a way to unify the
-    # two backends (encoding and decoding).
-    video_backend: str = "pyav"
-    extra_options: dict[str, Any] = field(default_factory=dict)
+    # Quality control (codec-specific parameter names)
+    if crf is not None:
+        if vcodec in ("h264", "hevc", "libsvtav1"):
+            options["crf"] = str(crf)
+        elif vcodec in ("h264_videotoolbox", "hevc_videotoolbox"):
+            quality = max(1, min(100, int(100 - crf * 2)))
+            options["q:v"] = str(quality)
+        elif vcodec in ("h264_nvenc", "hevc_nvenc"):
+            options["rc"] = "constqp"
+            options["qp"] = str(crf)
+        elif vcodec in ("h264_vaapi",):
+            options["qp"] = str(crf)
+        elif vcodec in ("h264_qsv",):
+            options["global_quality"] = str(crf)

-    # Class-level marker persisted to ``info.json`` (via ``asdict``) so the
-    # reader can tell depth datasets from RGB ones without a separate dispatch
-    # path. ``init=False`` keeps it out of CLI/constructor surface; subclasses
-    # flip the default (see :class:`DepthEncoderConfig`).
-    is_depth_map: bool = field(default=False, init=False)
+    # Preset (only for libsvtav1)
+    if vcodec == "libsvtav1":
+        options["preset"] = str(preset) if preset is not None else "12"

-    def __post_init__(self) -> None:
-        self.resolve_vcodec()
-
-        # Empty-constructor ergonomics: ``VideoEncoderConfig()`` must "just work".
-        if self.preset is None and self.vcodec == "libsvtav1":
-            self.preset = LIBSVTAV1_DEFAULT_PRESET
-
-        self.validate()
-
-    def detect_available_encoders(self, encoders: list[str] | str) -> list[str]:
-        """Detect available encoders based on the video backend."""
-        if self.video_backend == "pyav":
-            return detect_available_encoders_pyav(encoders)
-        else:
-            return []
-
-    def validate(self) -> None:
-        """Validate the video encoder config."""
-        if self.video_backend == "pyav":
-            check_video_encoder_config_pyav(self)
-
-    def resolve_vcodec(self) -> None:
-        """Validate vcodec and resolve 'auto' to best available HW encoder, fallback to libsvtav1.
-
-        Any explicitly-requested codec that isn't in the local FFmpeg build is
-        also silently rewritten to ``libsvtav1`` so encoding never hard-fails on
-        a host missing the requested encoder.
-        """
-        # Backward compatibility: older datasets persist ``vcodec="av1"`` in
-        # ``info.json``. Rewrite to the canonical encoder name *before* the
-        # validation check below so loading those datasets keeps working.
-        if self.vcodec == "av1":
-            self.vcodec = "libsvtav1"
-
-        if self.vcodec not in VALID_VIDEO_CODECS:
-            raise ValueError(f"Invalid vcodec '{self.vcodec}'. Must be one of: {sorted(VALID_VIDEO_CODECS)}")
-        if self.vcodec == "auto":
-            available = self.detect_available_encoders(HW_ENCODERS)
-            for encoder in HW_ENCODERS:
-                if encoder in available:
-                    logger.info(f"Auto-selected video codec: {encoder}")
-                    self.vcodec = encoder
-                    return
-            logger.info("No hardware encoder available, falling back to software encoder 'libsvtav1'")
-            self.vcodec = "libsvtav1"
-
-        if self.detect_available_encoders(self.vcodec):
-            logger.info(f"Using video codec: {self.vcodec}")
-            self.vcodec = self.vcodec
-            return
-        raise ValueError(f"Unsupported video codec: {self.vcodec} with video backend {self.video_backend}")
-
-    def get_codec_options(
-        self, encoder_threads: int | None = None, as_strings: bool = False
-    ) -> dict[str, str]:
-        """Translate the tuning fields to codec-specific FFmpeg options.
-
-        ``VideoEncoderConfig.extra_options`` are merged last but never override a structured field.
-
-        Args:
-            encoder_threads: Number of encoder threads set globally for all VideoEncoderConfigs.
-                For libsvtav1, this is mapped to ``lp`` via ``svtav1-params``.
-                For h264/hevc, this is mapped to ``threads``.
-                Hardware encoders ignore this parameter.
-            as_strings: If ``True``, casts values to strings.
-        """
-        opts: dict[str, Any] = {}
-
-        def set_if(key: str, value: Any) -> None:
-            if value is not None:
-                opts[key] = value if not as_strings else str(value)
-
-        # GOP size is not a codec-specific option, so it is always set.
-        set_if("g", self.g)
-
-        if self.vcodec == "libsvtav1":
-            set_if("crf", self.crf)
-            set_if("preset", self.preset)
-            svtav1_parts: list[str] = []
-            if self.fast_decode is not None:
-                svtav1_parts.append(f"fast-decode={max(0, min(2, self.fast_decode))}")
-            if encoder_threads is not None:
-                svtav1_parts.append(f"lp={encoder_threads}")
-            if svtav1_parts:
-                opts["svtav1-params"] = ":".join(svtav1_parts)
-        elif self.vcodec in ("h264", "hevc"):
-            set_if("crf", self.crf)
-            set_if("preset", self.preset)
-            if self.fast_decode:
-                opts["tune"] = "fastdecode"
-            set_if("threads", encoder_threads)
-        elif self.vcodec in ("h264_videotoolbox", "hevc_videotoolbox"):
-            if self.crf is not None:
-                opts["q:v"] = max(1, min(100, 100 - self.crf * 2))
-        elif self.vcodec in ("h264_nvenc", "hevc_nvenc"):
-            opts["rc"] = "constqp"
-            set_if("qp", self.crf)
-            set_if("preset", self.preset)
-        elif self.vcodec == "h264_vaapi":
-            set_if("qp", self.crf)
-        elif self.vcodec == "h264_qsv":
-            set_if("global_quality", self.crf)
-            set_if("preset", self.preset)
-        elif self.vcodec == "ffv1":
-            # Lossless intra-frame codec. ``crf``/``preset``/``fast_decode`` 
-            # are not meaningful.
-            set_if("threads", encoder_threads)
-        else:
-            set_if("crf", self.crf)
-            set_if("preset", self.preset)
-
-        # Extra options are merged last but never override structured fields (values are kept as given).
-        for k, v in self.extra_options.items():
-            if k not in opts:
-                set_if(k, v)
-
-        return opts
+    return options


-@dataclass
-class DepthEncoderConfig(VideoEncoderConfig):
-    """Encoder configuration for depth-map streams.
-
-    Inherits the full :class:`VideoEncoderConfig` surface (codec, GOP, CRF,
-    preset, ``extra_options``…) and adds the four parameters of the depth
-    quantization pipeline (:func:`quantize_depth`). Inheritance — rather
-    than composition — keeps the CLI flat: ``--dataset.depth_encoder_config.<field>``
-    works identically to its RGB counterpart.
-
-    Defaults flip ``vcodec`` to ``"hevc"`` (Main 12 profile) and ``pix_fmt``
-    to ``"yuv420p12le"``, the most widely available 12-bit pixel format.
-    For archive-grade lossless storage use ``vcodec="ffv1"`` together with
-    ``pix_fmt="gray12le"`` (and clear ``crf``/``preset`` to ``None`` since
-    ``ffv1`` doesn't expose those tuning knobs).
-
-    The :attr:`is_depth_map` marker is class-fixed to ``True`` (``init=False``,
-    so it's hidden from CLI and constructor args) and is what the reader
-    side keys on to tell depth datasets from RGB ones.
-
-    Attributes:
-        depth_min: Minimum depth in physical units (e.g. metres) represented
-            by quantum ``0``.
-        depth_max: Maximum depth represented by quantum :data:`DEPTH_QMAX`.
-        shift: Pre-log offset for numerical stability near zero.
-        use_log: ``True`` for logarithmic quantization (default; matches
-            sensor error profile), ``False`` for linear.
-    """
-
-    vcodec: str = "hevc"
-    pix_fmt: str = "yuv420p12le"
-
-    depth_min: float = DEFAULT_DEPTH_MIN
-    depth_max: float = DEFAULT_DEPTH_MAX
-    shift: float = DEFAULT_DEPTH_SHIFT
-    use_log: bool = DEFAULT_DEPTH_USE_LOG
-
-    # Class invariant — kept out of ``__init__`` (and CLI) but persisted
-    # via ``asdict`` into ``info.json`` for the reader to detect depth.
-    is_depth_map: bool = field(default=True, init=False)
-
-    def quantize(self, depth: torch.Tensor | np.ndarray) -> torch.Tensor:
-        """Apply :func:`quantize_depth` bound to this config's parameters."""
-        return quantize_depth(depth, self.depth_min, self.depth_max, self.shift, self.use_log)
-
-    def dequantize(self, quantized: torch.Tensor | np.ndarray) -> torch.Tensor:
-        """Apply :func:`dequantize_depth` bound to this config's parameters."""
-        return dequantize_depth(quantized, self.depth_min, self.depth_max, self.shift, self.use_log)
+def detect_available_hw_encoders() -> list[str]:
+    """Probe PyAV/FFmpeg for available hardware video encoders."""
+    available = []
+    for codec_name in HW_ENCODERS:
+        try:
+            av.codec.Codec(codec_name, "w")
+            available.append(codec_name)
+        except Exception:  # nosec B110
+            logger.debug("HW encoder '%s' not available", codec_name)  # nosec B110
+    return available


-def depth_encoder_defaults() -> DepthEncoderConfig:
-    """Return a :class:`DepthEncoderConfig` with depth-camera defaults."""
-    return DepthEncoderConfig()
-
-def camera_encoder_defaults() -> VideoEncoderConfig:
-    """Return a :class:`VideoEncoderConfig` with RGB-camera defaults."""
-    return VideoEncoderConfig()
+def resolve_vcodec(vcodec: str) -> str:
+    """Validate vcodec and resolve 'auto' to best available HW encoder, fallback to libsvtav1."""
+    if vcodec not in VALID_VIDEO_CODECS:
+        raise ValueError(f"Invalid vcodec '{vcodec}'. Must be one of: {sorted(VALID_VIDEO_CODECS)}")
+    if vcodec != "auto":
+        logger.info(f"Using video codec: {vcodec}")
+        return vcodec
+    available = detect_available_hw_encoders()
+    for encoder in HW_ENCODERS:
+        if encoder in available:
+            logger.info(f"Auto-selected video codec: {encoder}")
+            return encoder
+    logger.info("No hardware encoder available, falling back to software encoder 'libsvtav1'")
+    return "libsvtav1"


 def decode_video_frames(
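
Review note on the hunk above: the restored `_get_codec_options` applies `preset` only for `libsvtav1`, whereas the removed `VideoEncoderConfig.get_codec_options` also passed it to `h264`, `hevc`, NVENC and QSV. The sketch below is a standalone restatement of the added mapping, for illustration only. The name `codec_options_sketch` is hypothetical, and the `HW_ENCODERS` list is an assumption built from the encoder names this diff references, since the real list is only partly visible here.

```python
# Assumed list: only the encoder names referenced in this diff.
HW_ENCODERS = [
    "h264_videotoolbox",
    "hevc_videotoolbox",
    "h264_nvenc",
    "hevc_nvenc",
    "h264_vaapi",
    "h264_qsv",
]


def codec_options_sketch(vcodec, g=2, crf=30, preset=None):
    """Hypothetical mirror of the added _get_codec_options logic."""
    options = {}
    is_videotoolbox = vcodec in ("h264_videotoolbox", "hevc_videotoolbox")

    # GOP size: VideoToolbox and software encoders only.
    if g is not None and (is_videotoolbox or vcodec not in HW_ENCODERS):
        options["g"] = str(g)

    # Quality: each encoder family uses its own parameter name.
    if crf is not None:
        if vcodec in ("h264", "hevc", "libsvtav1"):
            options["crf"] = str(crf)
        elif is_videotoolbox:
            # CRF (lower is better) mapped onto a 1-100 quality scale.
            options["q:v"] = str(max(1, min(100, int(100 - crf * 2))))
        elif vcodec in ("h264_nvenc", "hevc_nvenc"):
            options["rc"] = "constqp"
            options["qp"] = str(crf)
        elif vcodec == "h264_vaapi":
            options["qp"] = str(crf)
        elif vcodec == "h264_qsv":
            options["global_quality"] = str(crf)

    # Preset: libsvtav1 only, defaulting to "12".
    if vcodec == "libsvtav1":
        options["preset"] = str(preset) if preset is not None else "12"

    return options


print(codec_options_sketch("libsvtav1"))
print(codec_options_sketch("h264_videotoolbox"))
print(codec_options_sketch("h264", preset=5))
```

With these defaults, a `preset` passed for `h264` is silently dropped, which callers relying on the removed config class may not expect.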
@@ -315,7 +142,7 @@ def decode_video_frames(
    Currently supports torchcodec on cpu and pyav.
    """
    if backend is None:
-        backend = get_safe_default_video_backend()
+        backend = get_safe_default_codec()
    if backend == "torchcodec":
        return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s, return_uint8=return_uint8)
    elif backend in ["pyav", "video_reader"]:
@@ -455,7 +282,11 @@ class VideoDecoderCache:
        with self._lock:
            if video_path not in self._cache:
                file_handle = fsspec.open(video_path).__enter__()
-                decoder = VideoDecoder(file_handle, seek_mode="approximate")
+                try:
+                    decoder = VideoDecoder(file_handle, seek_mode="approximate")
+                except Exception:
+                    file_handle.close()
+                    raise
                self._cache[video_path] = (decoder, file_handle)

            return self._cache[video_path][0]
@@ -569,136 +400,22 @@ def decode_video_frames_torchcodec(
    return closest_frames


-def decode_depth_frames(
-    video_path: Path | str,
-    timestamps: list[float],
-    tolerance_s: float,
-    *,
-    depth_min: float = DEFAULT_DEPTH_MIN,
-    depth_max: float = DEFAULT_DEPTH_MAX,
-    shift: float = DEFAULT_DEPTH_SHIFT,
-    use_log: bool = DEFAULT_DEPTH_USE_LOG,
-    return_quantized: bool = False,
-    log_loaded_timestamps: bool = False,
-) -> torch.Tensor:
-    """Decode depth-map frames at the requested timestamps using PyAV.
-
-    Mirrors the timestamp-tolerance / closest-frame contract of
-    :func:`decode_video_frames` but operates entirely through PyAV (the
-    ``torchvision`` and ``torchcodec`` backends don't currently round-trip
-    12-bit pixel formats reliably).
-
-    Each decoded frame is reformatted to ``gray12le`` so the same path
-    handles ``yuv420p12le`` (HEVC default) and ``gray12le`` (ffv1 archive)
-    sources transparently.
-
-    Args:
-        video_path: Path to a depth video produced with a
-            :class:`DepthEncoderConfig`.
-        timestamps: Frame timestamps to retrieve, in seconds.
-        tolerance_s: Maximum allowed deviation between the queried and the
-            actually-decoded timestamps.
-        depth_min, depth_max, shift, use_log: Parameters used at quantization
-            time. Should match :func:`info_to_depth_kwargs` extracted from
-            ``info.json`` for the source dataset.
-        return_quantized: If ``True``, skip the dequantization step and
-            return raw 12-bit ``uint16`` quanta.
-        log_loaded_timestamps: Debug logging.
-
-    Returns:
-        ``torch.Tensor`` of shape ``(N, H, W)``:
-
-        * ``dtype=torch.float32`` (metric depth, default)
-        * ``dtype=torch.uint16`` when ``return_quantized=True``.
-
-    Raises:
-        FrameTimestampError: If a query timestamp can't be matched within
-            *tolerance_s*, or if no frames are decoded.
-    """
-    video_path_str = str(video_path)
-    first_ts = min(timestamps)
-    last_ts = max(timestamps)
-
-    loaded_frames: list[np.ndarray] = []
-    loaded_ts: list[float] = []
-
-    av.logging.set_level(av.logging.WARNING)
-    with av.open(video_path_str, "r") as container:
-        try:
-            stream = container.streams.video[0]
-        except IndexError as e:
-            raise FrameTimestampError(f"No video stream in {video_path_str}") from e
-
-        # Seek to the keyframe at-or-before first_ts (PyAV doesn't do
-        # accurate seek, so we still iterate forward to the requested range).
-        seek_pts = int(first_ts / stream.time_base)
-        container.seek(seek_pts, stream=stream, any_frame=False, backward=True)
-
-        for frame in container.decode(stream):
-            if frame.pts is None:
-                continue
-            current_ts = float(frame.pts * stream.time_base)
-            if log_loaded_timestamps:
-                logger.info(f"depth frame loaded at timestamp={current_ts:.4f}")
-            loaded_frames.append(
-                decode_depth_frame(
-                    frame,
-                    depth_min=depth_min,
-                    depth_max=depth_max,
-                    shift=shift,
-                    use_log=use_log,
-                    return_quantized=True,
-                )
-            )
-            loaded_ts.append(current_ts)
-            if current_ts >= last_ts:
-                break
-
-    av.logging.restore_default_callback()
-
-    if not loaded_frames:
-        raise FrameTimestampError(
-            f"No depth frames decoded from {video_path_str} for timestamps {timestamps}"
-        )
-
-    query_ts = torch.tensor(timestamps)
-    loaded_ts_t = torch.tensor(loaded_ts)
-    dist = torch.cdist(query_ts[:, None], loaded_ts_t[:, None], p=1)
-    min_, argmin_ = dist.min(1)
-
-    is_within_tol = min_ < tolerance_s
-    if not is_within_tol.all():
-        raise FrameTimestampError(
-            f"One or several query timestamps violate the tolerance "
-            f"({min_[~is_within_tol]} > {tolerance_s=})."
-            f"\nqueried timestamps: {query_ts}"
-            f"\nloaded timestamps: {loaded_ts_t}"
-            f"\nvideo: {video_path_str}"
-        )
-
-    closest = np.stack([loaded_frames[i] for i in argmin_])  # (N, H, W) uint16
-    quantized = torch.from_numpy(closest)
-
-    if return_quantized:
-        return quantized
-    return dequantize_depth(quantized, depth_min, depth_max, shift, use_log)
-
-
 def encode_video_frames(
    imgs_dir: Path | str,
    video_path: Path | str,
    fps: int,
-    camera_encoder_config: VideoEncoderConfig | None = None,
-    encoder_threads: int | None = None,
-    *,
+    vcodec: str = "libsvtav1",
+    pix_fmt: str = "yuv420p",
+    g: int | None = 2,
+    crf: int | None = 30,
+    fast_decode: int = 0,
    log_level: int | None = av.logging.WARNING,
    overwrite: bool = False,
+    preset: int | None = None,
+    encoder_threads: int | None = None,
 ) -> None:
    """More info on ffmpeg arguments tuning on `benchmark/video/README.md`"""
-    if camera_encoder_config is None:
-        camera_encoder_config = VideoEncoderConfig()
-    vcodec = camera_encoder_config.vcodec
-    pix_fmt = camera_encoder_config.pix_fmt
+    vcodec = resolve_vcodec(vcodec)

    video_path = Path(video_path)
    imgs_dir = Path(imgs_dir)
@@ -709,18 +426,42 @@ def encode_video_frames(

    video_path.parent.mkdir(parents=True, exist_ok=True)

+    # Encoders/pixel formats incompatibility check
+    if (vcodec == "libsvtav1" or vcodec == "hevc") and pix_fmt == "yuv444p":
+        logger.warning(
+            f"Incompatible pixel format 'yuv444p' for codec {vcodec}, auto-selecting format 'yuv420p'"
+        )
+        pix_fmt = "yuv420p"
+
    # Get input frames
    template = "frame-" + ("[0-9]" * 6) + ".png"
    input_list = sorted(
        glob.glob(str(imgs_dir / template)), key=lambda x: int(x.split("-")[-1].split(".")[0])
    )

+    # Define video output frame size (assuming all input frames are the same size)
    if len(input_list) == 0:
        raise FileNotFoundError(f"No images found in {imgs_dir}.")
    with Image.open(input_list[0]) as dummy_image:
        width, height = dummy_image.size

-    video_options = camera_encoder_config.get_codec_options(encoder_threads, as_strings=True)
+    # Define video codec options
+    video_options = _get_codec_options(vcodec, g, crf, preset)
+
+    if fast_decode:
+        key = "svtav1-params" if vcodec == "libsvtav1" else "tune"
+        value = f"fast-decode={fast_decode}" if vcodec == "libsvtav1" else "fastdecode"
+        video_options[key] = value
+
+    if encoder_threads is not None:
+        if vcodec == "libsvtav1":
+            lp_param = f"lp={encoder_threads}"
+            if "svtav1-params" in video_options:
+                video_options["svtav1-params"] += f":{lp_param}"
+            else:
+                video_options["svtav1-params"] = lp_param
+        else:
+            video_options["threads"] = str(encoder_threads)

    # Set logging level
    if log_level is not None:
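
Review note on the hunk above: the `encoder_threads` handling added to `encode_video_frames` appears again, line for line, in `_CameraEncoderThread.run`. One possible way to keep the two copies in sync is a small shared helper. This is a sketch only: `apply_encoder_threads` is a hypothetical name and is not part of this change.

```python
def apply_encoder_threads(video_options, vcodec, encoder_threads):
    """Hypothetical helper: merge a thread count into codec options in place."""
    if encoder_threads is None:
        return video_options

    if vcodec == "libsvtav1":
        # SVT-AV1 takes its thread count as `lp` inside the colon-separated
        # `svtav1-params` string, so append rather than overwrite.
        lp_param = f"lp={encoder_threads}"
        existing = video_options.get("svtav1-params")
        video_options["svtav1-params"] = f"{existing}:{lp_param}" if existing else lp_param
    else:
        video_options["threads"] = str(encoder_threads)

    return video_options


# Example: fast-decode was already set, so `lp` is appended after it.
opts = apply_encoder_threads({"svtav1-params": "fast-decode=1"}, "libsvtav1", 4)
print(opts)
```

Appending keeps any `fast-decode` value set earlier in `encode_video_frames`; overwriting the key would discard it.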
@@ -757,10 +498,7 @@ def encode_video_frames(


 def concatenate_video_files(
-    input_video_paths: list[Path | str],
-    output_video_path: Path,
-    overwrite: bool = True,
-    compatibility_check: bool = False,
+    input_video_paths: list[Path | str], output_video_path: Path, overwrite: bool = True
 ):
    """
    Concatenate multiple video files into a single video file using pyav.
@@ -773,7 +511,6 @@ def concatenate_video_files(
        input_video_paths: Ordered list of input video file paths to concatenate.
        output_video_path: Path to the output video file.
        overwrite: Whether to overwrite the output video file if it already exists. Default is True.
-        compatibility_check: Whether to check if the input videos are compatible. Default is False.

    Note:
        - Creates a temporary directory for intermediate files that is cleaned up after use.
@@ -792,22 +529,6 @@ def concatenate_video_files(
    if len(input_video_paths) == 0:
        raise FileNotFoundError("No input video paths provided.")

-    # This check may be skipped at recording time as videos are encoded with the same encoder config.
-    if compatibility_check:
-        reference_video_info = get_video_info(input_video_paths[0])
-        for input_path in input_video_paths[1:]:
-            video_info = get_video_info(input_path)
-            if (
-                video_info["video.height"] != reference_video_info["video.height"]
-                or video_info["video.width"] != reference_video_info["video.width"]
-                or video_info["video.fps"] != reference_video_info["video.fps"]
-                or video_info["video.codec"] != reference_video_info["video.codec"]
-                or video_info["video.pix_fmt"] != reference_video_info["video.pix_fmt"]
-            ):
-                raise ValueError(
-                    f"Input video {input_path} is not compatible with the reference video {input_video_paths[0]}."
-                )
-
    # Create a temporary .ffconcat file to list the input video paths
    with tempfile.NamedTemporaryFile(mode="w", suffix=".ffconcat", delete=False) as tmp_concatenate_file:
        tmp_concatenate_file.write("ffconcat version 1.0\n")
@@ -874,31 +595,33 @@ class _CameraEncoderThread(threading.Thread):
        fps: int,
        vcodec: str,
        pix_fmt: str,
-        codec_options: dict[str, str],
+        g: int | None,
+        crf: int | None,
+        preset: int | None,
        frame_queue: queue.Queue,
        result_queue: queue.Queue,
        stop_event: threading.Event,
-        depth_encoder_config: "DepthEncoderConfig | None" = None,
+        encoder_threads: int | None = None,
    ):
        super().__init__(daemon=True)
        self.video_path = video_path
        self.fps = fps
        self.vcodec = vcodec
        self.pix_fmt = pix_fmt
-        self.codec_options = codec_options
+        self.g = g
+        self.crf = crf
+        self.preset = preset
        self.frame_queue = frame_queue
        self.result_queue = result_queue
        self.stop_event = stop_event
-        self.depth_encoder_config = depth_encoder_config
-
+        self.encoder_threads = encoder_threads

    def run(self) -> None:
        from .compute_stats import RunningQuantileStats, auto_downsample_height_width

        container = None
        output_stream = None
-        is_depth = self.depth_encoder_config is not None
-        stats_tracker = RunningQuantileStats() if not is_depth else None
+        stats_tracker = RunningQuantileStats()
        frame_count = 0

        try:
@@ -916,45 +639,51 @@ class _CameraEncoderThread(threading.Thread):
                    # Sentinel: flush and close
                    break

-                # Ensure HWC (RGB or depth) uint8 (RGB only) numpy array
+                # Ensure HWC uint8 numpy array
                if isinstance(frame_data, np.ndarray):
                    if frame_data.ndim == 3 and frame_data.shape[0] == 3:
                        # CHW -> HWC
                        frame_data = frame_data.transpose(1, 2, 0)
-                    if frame_data.dtype != np.uint8 and not is_depth:
+                    if frame_data.dtype != np.uint8:
                        frame_data = (frame_data * 255).astype(np.uint8)

                # Open container on first frame (to get width/height)
                if container is None:
                    height, width = frame_data.shape[:2]
+                    video_options = _get_codec_options(self.vcodec, self.g, self.crf, self.preset)
+                    if self.encoder_threads is not None:
+                        if self.vcodec == "libsvtav1":
+                            lp_param = f"lp={self.encoder_threads}"
+                            if "svtav1-params" in video_options:
+                                video_options["svtav1-params"] += f":{lp_param}"
+                            else:
+                                video_options["svtav1-params"] = lp_param
+                        else:
+                            video_options["threads"] = str(self.encoder_threads)
                    Path(self.video_path).parent.mkdir(parents=True, exist_ok=True)
                    container = av.open(str(self.video_path), "w")
-                    output_stream = container.add_stream(self.vcodec, self.fps, options=self.codec_options)
+                    output_stream = container.add_stream(self.vcodec, self.fps, options=video_options)
                    output_stream.pix_fmt = self.pix_fmt
                    output_stream.width = width
                    output_stream.height = height
                    output_stream.time_base = Fraction(1, self.fps)

                # Encode frame with explicit timestamps
-                if is_depth:
-                    video_frame = encode_depth_frame_pyav(frame_data, pix_fmt=self.pix_fmt, depth_min=self.depth_encoder_config.depth_min, depth_max=self.depth_encoder_config.depth_max, shift=self.depth_encoder_config.shift, use_log=self.depth_encoder_config.use_log)
-                else:
-                    pil_img = Image.fromarray(frame_data)
-                    video_frame = av.VideoFrame.from_image(pil_img)
+                pil_img = Image.fromarray(frame_data)
+                video_frame = av.VideoFrame.from_image(pil_img)
                video_frame.pts = frame_count
                video_frame.time_base = Fraction(1, self.fps)
                packet = output_stream.encode(video_frame)
                if packet:
                    container.mux(packet)

-                if not is_depth:
-                    # Update stats with downsampled frame (per-channel stats like compute_episode_stats)
-                    img_chw = frame_data.transpose(2, 0, 1)  # HWC -> CHW
-                    img_downsampled = auto_downsample_height_width(img_chw)
-                    # Reshape CHW to (H*W, C) for per-channel stats
-                    channels = img_downsampled.shape[0]
-                    img_for_stats = img_downsampled.transpose(1, 2, 0).reshape(-1, channels)
-                    stats_tracker.update(img_for_stats)
+                # Update stats with downsampled frame (per-channel stats like compute_episode_stats)
+                img_chw = frame_data.transpose(2, 0, 1)  # HWC -> CHW
+                img_downsampled = auto_downsample_height_width(img_chw)
+                # Reshape CHW to (H*W, C) for per-channel stats
+                channels = img_downsampled.shape[0]
+                img_for_stats = img_downsampled.transpose(1, 2, 0).reshape(-1, channels)
+                stats_tracker.update(img_for_stats)

                frame_count += 1

@@ -969,10 +698,8 @@ class _CameraEncoderThread(threading.Thread):

            av.logging.restore_default_callback()

-            # Get stats and put on result queue (depth streams skip stats)
-            if is_depth:
-                self.result_queue.put(("ok", None))
-            elif frame_count >= 2:
+            # Get stats and put on result queue
+            if frame_count >= 2:
                stats = stats_tracker.get_statistics()
                self.result_queue.put(("ok", stats))
            else:
@@ -1001,40 +728,22 @@ class StreamingVideoEncoder:
    def __init__(
        self,
        fps: int,
-        camera_encoder_config: VideoEncoderConfig | None = None,
-        encoder_threads: int | None = None,
-        *,
+        vcodec: str = "libsvtav1",
+        pix_fmt: str = "yuv420p",
+        g: int | None = 2,
+        crf: int | None = 30,
+        preset: int | None = None,
        queue_maxsize: int = 30,
-        depth_encoder_config: "DepthEncoderConfig | None" = None,
-        depth_keys: list[str] | None = None,
+        encoder_threads: int | None = None,
    ):
-        """
-        Args:
-            fps: Frames per second for the output videos.
-            camera_encoder_config: Video encoder settings applied to all cameras.
-                When ``None``, :class:`VideoEncoderConfig` defaults are used.
-            encoder_threads: Number of encoder threads (global setting).
-                ``None`` lets the codec decide.
-            queue_maxsize: Max frames to buffer per camera before
-                back-pressure drops frames.
-            depth_encoder_config: Optional depth encoder configuration applied
-                to all depth video keys listed in ``depth_keys``.
-            depth_keys: Video keys (matching the dataset feature names) that
-                must be encoded as quantized depth maps using
-                ``depth_encoder_config``. Required when ``depth_encoder_config``
-                is provided.
-        """
        self.fps = fps
-        self._camera_encoder_config = camera_encoder_config or VideoEncoderConfig()
-        self._encoder_threads = encoder_threads
+        self.vcodec = resolve_vcodec(vcodec)
+        self.pix_fmt = pix_fmt
+        self.g = g
+        self.crf = crf
+        self.preset = preset
        self.queue_maxsize = queue_maxsize
-        self._depth_encoder_config = depth_encoder_config
-        self._depth_keys: set[str] = set(depth_keys or [])
-        if self._depth_keys and self._depth_encoder_config is None:
-            raise ValueError(
-                "StreamingVideoEncoder received depth_keys without a depth_encoder_config; "
-                "either pass a DepthEncoderConfig or remove depth_keys."
-            )
+        self.encoder_threads = encoder_threads

        self._frame_queues: dict[str, queue.Queue] = {}
        self._result_queues: dict[str, queue.Queue] = {}
@@ -1065,28 +774,18 @@ class StreamingVideoEncoder:
            temp_video_dir = Path(tempfile.mkdtemp(dir=temp_dir))
            video_path = temp_video_dir / f"{video_key.replace('/', '_')}_streaming.mp4"

-            is_depth_key = video_key in self._depth_keys
-            encoder_cfg: VideoEncoderConfig
-            depth_cfg = None
-            if is_depth_key:
-                assert self._depth_encoder_config is not None  # guaranteed by __init__
-                encoder_cfg = self._depth_encoder_config
-                depth_cfg = self._depth_encoder_config
-            else:
-                encoder_cfg = self._camera_encoder_config
-
-            vcodec = encoder_cfg.vcodec
-            codec_options = encoder_cfg.get_codec_options(self._encoder_threads)
            encoder_thread = _CameraEncoderThread(
                video_path=video_path,
                fps=self.fps,
-                vcodec=vcodec,
-                pix_fmt=encoder_cfg.pix_fmt,
-                codec_options=codec_options,
+                vcodec=self.vcodec,
+                pix_fmt=self.pix_fmt,
+                g=self.g,
+                crf=self.crf,
+                preset=self.preset,
                frame_queue=frame_queue,
                result_queue=result_queue,
                stop_event=stop_event,
-                depth_encoder_config=depth_cfg,
+                encoder_threads=self.encoder_threads,
            )
            encoder_thread.start()

@@ -1291,18 +990,8 @@ def get_audio_info(video_path: Path | str) -> dict:
    return audio_info


-def get_video_info(
-    video_path: Path | str,
-    video_encoder_config: "VideoEncoderConfig | None" = None,
-) -> dict:
-    """Build the ``video.*`` / ``audio.*`` info dict persisted in ``info.json``.
-
-    Args:
-        video_path: Path to the encoded video file to probe.
-        video_encoder_config: If provided, record the exact encoder settings used to encode this
-            video. Stream-derived values take precedence — encoder fields are only written for keys
-            not already populated from the video file itself.
-    """
+def get_video_info(video_path: Path | str) -> dict:
+    # Set logging level
    logging.getLogger("libav").setLevel(av.logging.WARNING)

    # Getting video stream information
@@ -1319,6 +1008,7 @@ def get_video_info(
        video_info["video.width"] = video_stream.width
        video_info["video.codec"] = video_stream.codec.canonical_name
        video_info["video.pix_fmt"] = video_stream.pix_fmt
+        video_info["video.is_depth_map"] = False

        # Calculate fps from r_frame_rate
        video_info["video.fps"] = int(video_stream.base_rate)
@@ -1332,67 +1022,9 @@ def get_video_info(
    # Adding audio stream information
    video_info.update(**get_audio_info(video_path))

-    # Add additional encoder configuration if provided (no override of stream-derived values)
-    # Depth related fields flow naturally through this path.
-    if video_encoder_config is not None:
-        for field_name, field_value in asdict(video_encoder_config).items():
-            video_info.setdefault(f"video.{field_name}", field_value)
-
-    # Fallback case where no encoder config is provided or the video is not a depth map.
-    video_info.setdefault("video.is_depth_map", False)
-
    return video_info


-# ─── Depth metadata helpers (reader side) ────────────────────────────
-
-
-_DEPTH_INFO_KEYS: tuple[str, ...] = (
-    "video.depth_min",
-    "video.depth_max",
-    "video.shift",
-    "video.use_log",
-)
-
-
-def seed_depth_feature_info(
-    features: dict[str, dict],
-    depth_encoder_config: "DepthEncoderConfig | None",
-) -> None:
-    """Pre-populate per-feature ``video.<field>`` entries from *depth_encoder_config*.
-
-    ``update_video_info`` only runs after the first episode video is encoded,
-    so without this seeding step ``features[key]["info"]`` carries no
-    quantization range until then. Consumers that read the dataset feature
-    spec mid-recording (e.g. the rerun visualizer pinning the depth colormap
-    to ``video.depth_min`` / ``video.depth_max``) would otherwise see no
-    range during episode 1 and re-normalize per frame.
-
-    Stream-derived values written later by :func:`get_video_info` /
-    ``update_video_info`` win over these seeds (the merge is
-    ``{**existing, **stream_info}``), so callers can safely re-run this on
-    a partially-populated info dict.
-
-    No-op when ``depth_encoder_config`` is ``None`` or no feature is flagged
-    as a depth map.
-    """
-    if depth_encoder_config is None:
-        return
-    encoder_fields = {
-        f"video.{name}": value for name, value in asdict(depth_encoder_config).items()
-    }
-    for ft in features.values():
-        if ft.get("dtype") != "video":
-            continue
-        info = ft.get("info") or {}
-        if not info.get("video.is_depth_map", False):
-            continue
-        # Only fill fields not already set, so explicit user-provided info is preserved.
-        for k, v in encoder_fields.items():
-            info.setdefault(k, v)
-        ft["info"] = info
-
-
 def get_video_pixel_channels(pix_fmt: str) -> int:
    if "gray" in pix_fmt or "depth" in pix_fmt or "monochrome" in pix_fmt:
        return 1
@@ -24,7 +24,12 @@ import gymnasium as gym
 from gymnasium.envs.registration import registry as gym_registry

 from lerobot.configs import FeatureType, PolicyFeature
-from lerobot.processor import IsaaclabArenaProcessorStep, LiberoProcessorStep, PolicyProcessorPipeline
+from lerobot.processor import (
+    IsaaclabArenaProcessorStep,
+    LiberoActionProcessorStep,
+    LiberoProcessorStep,
+    PolicyProcessorPipeline,
+)
 from lerobot.robots import RobotConfig
 from lerobot.teleoperators.config import TeleoperatorConfig
 from lerobot.utils.constants import (
@@ -123,7 +128,7 @@ class EnvConfig(draccus.ChoiceRegistry, abc.ABC):
            vec = env_cls([_make_one for _ in range(n_envs)], **extra_kwargs)
        return {self.type: {0: vec}}

-    def get_env_processors(self):
+    def get_env_processors(self, policy_cfg: Any | None = None):
        """Return (preprocessor, postprocessor) for this env. Default: identity."""
        return PolicyProcessorPipeline(steps=[]), PolicyProcessorPipeline(steps=[])

@@ -436,10 +441,13 @@ class LiberoEnv(EnvConfig):
            is_libero_plus=self.is_libero_plus,
        )

-    def get_env_processors(self):
+    def get_env_processors(self, policy_cfg: Any | None = None):
+        max_state_dim = getattr(policy_cfg, "max_state_dim", None) if getattr(policy_cfg, "type", None) == "evo1" else None
+        action_feature = self.features.get(ACTION)
+        action_dim = int(action_feature.shape[0]) if action_feature is not None else 7
        return (
-            PolicyProcessorPipeline(steps=[LiberoProcessorStep()]),
-            PolicyProcessorPipeline(steps=[]),
+            PolicyProcessorPipeline(steps=[LiberoProcessorStep(max_state_dim=max_state_dim)]),
+            PolicyProcessorPipeline(steps=[LiberoActionProcessorStep(action_dim=action_dim)]),
        )


@@ -705,7 +713,7 @@ class IsaaclabArenaEnv(HubEnvConfig):
    def gym_kwargs(self) -> dict:
        return {}

-    def get_env_processors(self):
+    def get_env_processors(self, policy_cfg: Any | None = None):
        state_keys = tuple(k.strip() for k in (self.state_keys or "").split(",") if k.strip())
        camera_keys = tuple(k.strip() for k in (self.camera_keys or "").split(",") if k.strip())
        if not state_keys and not camera_keys:
@@ -15,6 +15,7 @@
 # limitations under the License.
 from __future__ import annotations

+import inspect
 from typing import Any

 import gymnasium as gym
@@ -52,7 +53,14 @@ def make_env_pre_post_processors(

        return make_xvla_libero_pre_post_processors()

-    return env_cfg.get_env_processors()
+    get_processors = env_cfg.get_env_processors
+    signature = inspect.signature(get_processors)
+    supports_policy_cfg = "policy_cfg" in signature.parameters or any(
+        param.kind is inspect.Parameter.VAR_KEYWORD for param in signature.parameters.values()
+    )
+    if supports_policy_cfg:
+        return get_processors(policy_cfg=policy_cfg)
+    return get_processors()


 def make_env(
@@ -16,6 +16,8 @@ from lerobot.utils.action_interpolator import ActionInterpolator as ActionInterp

 from .act.configuration_act import ACTConfig as ACTConfig
 from .diffusion.configuration_diffusion import DiffusionConfig as DiffusionConfig
+from .eo1.configuration_eo1 import EO1Config as EO1Config
+from .evo1.configuration_evo1 import Evo1Config as Evo1Config
 from .factory import get_policy_class, make_policy, make_policy_config, make_pre_post_processors
 from .groot.configuration_groot import GrootConfig as GrootConfig
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
@@ -39,8 +41,10 @@ __all__ = [
    # Configuration classes
    "ACTConfig",
    "DiffusionConfig",
+    "Evo1Config",
    "GrootConfig",
    "MultiTaskDiTConfig",
+    "EO1Config",
    "PI0Config",
    "PI0FastConfig",
    "PI05Config",
@@ -100,8 +100,8 @@ class DiffusionConfig(PreTrainedConfig):

    # Inputs / output structure.
    n_obs_steps: int = 2
-    horizon: int = 16
-    n_action_steps: int = 8
+    horizon: int = 64
+    n_action_steps: int = 32

    normalization_mapping: dict[str, NormalizationMode] = field(
        default_factory=lambda: {
@@ -122,10 +122,10 @@ class DiffusionConfig(PreTrainedConfig):
    crop_ratio: float = 1.0
    crop_shape: tuple[int, int] | None = None
    crop_is_random: bool = True
-    pretrained_backbone_weights: str | None = None
-    use_group_norm: bool = True
+    pretrained_backbone_weights: str | None = "ResNet18_Weights.IMAGENET1K_V1"
+    use_group_norm: bool = False
    spatial_softmax_num_keypoints: int = 32
-    use_separate_rgb_encoder_per_camera: bool = False
+    use_separate_rgb_encoder_per_camera: bool = True
    # Unet.
    down_dims: tuple[int, ...] = (512, 1024, 2048)
    kernel_size: int = 5
@@ -0,0 +1 @@
+../../../../docs/source/eo1.mdx
@@ -0,0 +1,7 @@
+#!/usr/bin/env python
+
+from .configuration_eo1 import EO1Config
+from .modeling_eo1 import EO1Policy
+from .processor_eo1 import make_eo1_pre_post_processors
+
+__all__ = ["EO1Config", "EO1Policy", "make_eo1_pre_post_processors"]
@@ -0,0 +1,193 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from copy import deepcopy
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING
+
+from lerobot.configs.policies import PreTrainedConfig
+from lerobot.configs.types import FeatureType, NormalizationMode, PolicyFeature
+from lerobot.optim.optimizers import AdamWConfig
+from lerobot.optim.schedulers import CosineDecayWithWarmupSchedulerConfig
+from lerobot.utils.constants import ACTION, OBS_STATE
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers.models.qwen2_5_vl.configuration_qwen2_5_vl import (
+        Qwen2_5_VLConfig,
+        Qwen2_5_VLTextConfig,
+        Qwen2_5_VLVisionConfig,
+    )
+else:
+    Qwen2_5_VLConfig = None
+    Qwen2_5_VLTextConfig = None
+    Qwen2_5_VLVisionConfig = None
+
+
+@PreTrainedConfig.register_subclass("eo1")
+@dataclass
+class EO1Config(PreTrainedConfig):
+    """Configuration for native EO1 policy integration in LeRobot."""
+
+    vlm_base: str = "Qwen/Qwen2.5-VL-3B-Instruct"
+    vlm_config: dict | None = None
+
+    # Vision processor settings.
+    image_min_pixels: int | None = 64 * 28 * 28
+    image_max_pixels: int | None = 128 * 28 * 28
+    use_fast_processor: bool = False
+
+    # Execution and action horizon.
+    n_obs_steps: int = 1
+    chunk_size: int = 8
+    n_action_steps: int = 8
+
+    # State/action padding to match EO1 flow head dimensionality.
+    max_state_dim: int = 32
+    max_action_dim: int = 32
+
+    # Flow matching sampling.
+    num_denoise_steps: int = 10
+    num_action_layers: int = 2
+    action_act: str = "linear"
+    time_sampling_beta_alpha: float = 1.5
+    time_sampling_beta_beta: float = 1.0
+    time_sampling_scale: float = 0.999
+    time_sampling_offset: float = 0.001
+    min_period: float = 4e-3
+    max_period: float = 4.0
+    supervise_padding_action_dims: bool = True
+    supervise_padding_actions: bool = True
+
+    # Policy-level dtype request for the Qwen backbone.
+    # - "auto": follow the backbone config/checkpoint default dtype. For Qwen2.5-VL this resolves to bf16.
+    #           The EO1 flow-matching head still keeps its own parameters in fp32.
+    # - "bfloat16": force the backbone to initialize/load in bf16 regardless of the saved config default.
+    # - "float32": force the backbone to initialize/load in fp32 for maximum numerical conservatism.
+    dtype: str = "auto"  # Options: "auto", "bfloat16", "float32"
+    force_fp32_autocast: bool = True
+
+    # Optional attention backend request passed through to the Qwen backbone.
+    # Common values: None, "eager", "sdpa", "flash_attention_2".
+    attn_implementation: str | None = None
+
+    # Training settings.
+    gradient_checkpointing: bool = False  # Enable gradient checkpointing for memory optimization
+
+    normalization_mapping: dict[str, NormalizationMode] = field(
+        default_factory=lambda: {
+            "VISUAL": NormalizationMode.IDENTITY,
+            "STATE": NormalizationMode.MEAN_STD,
+            "ACTION": NormalizationMode.MEAN_STD,
+        }
+    )
+
+    # Optimizer settings aligned with EO1/experiments/2_libero/train.sh and EO1 TrainPipelineConfig defaults.
+    optimizer_lr: float = 1e-4
+    optimizer_betas: tuple[float, float] = (0.9, 0.999)
+    optimizer_eps: float = 1e-8
+    optimizer_weight_decay: float = 0.1
+    optimizer_grad_clip_norm: float = 1.0
+
+    # Scheduler settings aligned with EO1 train.sh: cosine schedule with warmup_ratio=0.03.
+    # Note: These will auto-scale if --steps < scheduler_decay_steps
+    # For example, --steps=3000 will scale warmup to 90 (900 * 3000 / 30_000) and decay to 3000
+    scheduler_warmup_steps: int = 900  # 0.03 * 30_000 long-run steps
+    scheduler_decay_steps: int = 30_000
+    scheduler_decay_lr: float = 0.0
+
+    def __post_init__(self):
+        super().__post_init__()
+
+        if self.n_action_steps > self.chunk_size:
+            raise ValueError(
+                f"n_action_steps ({self.n_action_steps}) cannot be greater than chunk_size ({self.chunk_size})"
+            )
+
+        # Populate the serialized backbone config only when the caller did not provide one.
+        if self.vlm_config is None:
+            require_package("transformers", extra="eo1")
+            self.vlm_config = Qwen2_5_VLConfig.from_pretrained(self.vlm_base).to_dict()
+
+    @property
+    def vlm_backbone_config(self) -> Qwen2_5_VLConfig:
+        require_package("transformers", extra="eo1")
+        config_dict = deepcopy(self.vlm_config)
+        if self.attn_implementation is not None:
+            config_dict["attn_implementation"] = self.attn_implementation
+        return Qwen2_5_VLConfig(**config_dict)
+
+    @property
+    def text_config(self) -> Qwen2_5_VLTextConfig:
+        return self.vlm_backbone_config.text_config
+
+    @property
+    def vision_config(self) -> Qwen2_5_VLVisionConfig:
+        return self.vlm_backbone_config.vision_config
+
+    def validate_features(self) -> None:
+        """Validate and set up EO1 input and output features."""
+        image_features = [key for key, feat in self.input_features.items() if feat.type == FeatureType.VISUAL]
+        if not image_features:
+            raise ValueError(
+                "EO1 policy requires at least one visual input feature. "
+                "No features of type FeatureType.VISUAL found in input_features."
+            )
+
+        if OBS_STATE not in self.input_features:
+            state_feature = PolicyFeature(
+                type=FeatureType.STATE,
+                shape=(self.max_state_dim,),
+            )
+            self.input_features[OBS_STATE] = state_feature
+
+        if ACTION not in self.output_features:
+            action_feature = PolicyFeature(
+                type=FeatureType.ACTION,
+                shape=(self.max_action_dim,),
+            )
+            self.output_features[ACTION] = action_feature
+
+    def get_optimizer_preset(self) -> AdamWConfig:
+        return AdamWConfig(
+            lr=self.optimizer_lr,
+            betas=self.optimizer_betas,
+            eps=self.optimizer_eps,
+            weight_decay=self.optimizer_weight_decay,
+            grad_clip_norm=self.optimizer_grad_clip_norm,
+        )
+
+    def get_scheduler_preset(self):
+        return CosineDecayWithWarmupSchedulerConfig(
+            peak_lr=self.optimizer_lr,
+            decay_lr=self.scheduler_decay_lr,
+            num_warmup_steps=self.scheduler_warmup_steps,
+            num_decay_steps=self.scheduler_decay_steps,
+        )
+
+    @property
+    def observation_delta_indices(self) -> None:
+        return None
+
+    @property
+    def action_delta_indices(self) -> list[int]:
+        return list(range(self.chunk_size))
+
+    @property
+    def reward_delta_indices(self) -> None:
+        return None
@@ -0,0 +1,620 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import contextlib
+import logging
+import math
+from collections import deque
+from typing import TYPE_CHECKING, Any
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F  # noqa: N812
+import torch.utils.checkpoint
+from torch import Tensor
+
+from lerobot.policies.eo1.configuration_eo1 import EO1Config
+from lerobot.policies.pretrained import PreTrainedPolicy
+from lerobot.utils.constants import ACTION, OBS_STATE
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers.activations import ACT2FN
+    from transformers.models.qwen2_5_vl import Qwen2_5_VLForConditionalGeneration
+    from transformers.utils import torch_compilable_check
+else:
+    ACT2FN = None
+    Qwen2_5_VLForConditionalGeneration = None
+    torch_compilable_check = None
+
+logger = logging.getLogger(__name__)
+
+
+def pad_vector(vector, new_dim):
+    """Pad the last dimension of a vector to new_dim with zeros.
+
+    Can be (batch_size x sequence_length x features_dimension)
+    or (batch_size x features_dimension)
+    """
+    if vector.shape[-1] >= new_dim:
+        return vector
+    return F.pad(vector, (0, new_dim - vector.shape[-1]))
+
+
+class EO1Policy(PreTrainedPolicy):
+    """EO1 policy wrapper for LeRobot robot-only training/evaluation."""
+
+    config_class = EO1Config
+    name = "eo1"
+
+    def __init__(self, config: EO1Config, **kwargs):
+        require_package("transformers", extra="eo1")
+        super().__init__(config)
+        config.validate_features()
+        self.config = config
+
+        if config.pretrained_path is None:
+            # Initialize from pretrained VLM
+            vlm_backbone = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+                config.vlm_base,
+                dtype=config.dtype,
+                attn_implementation=config.attn_implementation,
+            )
+        else:
+            vlm_backbone = Qwen2_5_VLForConditionalGeneration._from_config(
+                config.vlm_backbone_config,
+                dtype=config.vlm_backbone_config.dtype if config.dtype == "auto" else config.dtype,
+            )
+
+        self.model = EO1VisionFlowMatchingModel(config, vlm_backbone)
+        if config.gradient_checkpointing:
+            self.model.gradient_checkpointing_enable()
+
+        self.model.to(config.device)
+        self.reset()
+
+    def reset(self):
+        self._action_queue = deque(maxlen=self.config.n_action_steps)
+
+    @staticmethod
+    def _get_model_inputs(batch: dict[str, Tensor], excluded_keys: set[str]) -> dict[str, Tensor]:
+        return {key: value for key, value in batch.items() if key not in excluded_keys}
+
+    def forward(self, batch: dict[str, Tensor]) -> tuple[Tensor, dict]:
+        state = self.prepare_state(batch[OBS_STATE])
+        actions = self.prepare_action(batch[ACTION])
+        model_inputs = self._get_model_inputs(batch, {OBS_STATE, ACTION})
+        loss = self.model(states=state, action=actions, **model_inputs)
+
+        loss_dict = {"loss": loss.item()}
+        return loss, loss_dict
+
+    @torch.no_grad()
+    def predict_action_chunk(self, batch: dict[str, Tensor], **kwargs) -> Tensor:
+        self.eval()
+
+        states = self.prepare_state(batch[OBS_STATE])
+        model_inputs = self._get_model_inputs(batch, {OBS_STATE})
+        actions = self.model.sample_actions(states=states, **model_inputs).to(torch.float32)
+
+        original_action_dim = self.config.output_features[ACTION].shape[0]
+        return actions[:, :, :original_action_dim]
+
+    def prepare_state(self, state: Tensor) -> Tensor:
+        return pad_vector(state, self.config.max_state_dim)
+
+    def prepare_action(self, action: Tensor) -> Tensor:
+        return pad_vector(action, self.config.max_action_dim)
+
+    @torch.no_grad()
+    def select_action(self, batch: dict[str, Tensor]) -> Tensor:
+        self.eval()
+
+        if len(self._action_queue) == 0:
+            actions = self.predict_action_chunk(batch)[:, : self.config.n_action_steps]
+            self._action_queue.extend(actions.transpose(0, 1))
+
+        return self._action_queue.popleft()
+
+    def get_optim_params(self) -> dict:
+        return self.parameters()
+
+
+def get_safe_dtype(target_dtype, device_type):
+    """Get a safe dtype for the given device type."""
+    if device_type == "mps" and target_dtype == torch.float64:
+        return torch.float32
+    if device_type == "cpu":
+        # bfloat16 is not reliably supported on CPU here, so fall back to float32
+        if target_dtype == torch.bfloat16:
+            return torch.float32
+        if target_dtype == torch.float64:
+            return torch.float64
+    return target_dtype
+
+
+def create_sinusoidal_pos_embedding(  # see openpi `create_sinusoidal_pos_embedding` (exact copy)
+    time: torch.Tensor, dimension: int, min_period: float, max_period: float, device="cpu"
+) -> Tensor:
+    """Computes sine-cosine positional embedding vectors for scalar positions."""
+    if dimension % 2 != 0:
+        raise ValueError(f"dimension ({dimension}) must be divisible by 2")
+
+    if time.ndim != 1:
+        raise ValueError("The time tensor is expected to be of shape `(batch_size, )`.")
+
+    dtype = get_safe_dtype(torch.float64, torch.device(device).type)  # accepts str or torch.device
+    fraction = torch.linspace(0.0, 1.0, dimension // 2, dtype=dtype, device=device)
+    period = min_period * (max_period / min_period) ** fraction
+
+    # Compute the outer product
+    scaling_factor = 1.0 / period * 2 * math.pi
+    sin_input = scaling_factor[None, :] * time[:, None]
+    return torch.cat([torch.sin(sin_input), torch.cos(sin_input)], dim=1)
+
+
+def sample_beta(alpha, beta, bsize, device):  # see openpi `sample_beta` (exact copy)
+    # Beta sampling uses _sample_dirichlet which isn't implemented for MPS, so sample on CPU
+    alpha_t = torch.tensor(alpha, dtype=torch.float32)
+    beta_t = torch.tensor(beta, dtype=torch.float32)
+    dist = torch.distributions.Beta(alpha_t, beta_t)
+    return dist.sample((bsize,)).to(device)
+
+
+class EO1VisionActionProjector(torch.nn.Sequential):
+    """This block implements the multi-layer perceptron (MLP) module."""
+
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        num_layers: int = 2,
+        activation_layer: str = "linear",
+        bias: bool = True,
+        device: Any = None,
+        dtype: torch.dtype = torch.float32,
+    ):
+        layers = []
+        in_dim = in_channels
+        hidden_channels = [in_dim] * (num_layers - 1) + [out_channels]
+        for hidden_dim in hidden_channels[:-1]:
+            layers.append(torch.nn.Linear(in_dim, hidden_dim, bias=bias, dtype=dtype, device=device))
+            layers.append(ACT2FN[activation_layer])
+            in_dim = hidden_dim
+        layers.append(torch.nn.Linear(in_dim, hidden_channels[-1], bias=bias, dtype=dtype, device=device))
+        super().__init__(*layers)
+
+    @property
+    def dtype(self):
+        return self[0].weight.dtype
+
+
+class EO1VisionFlowMatchingModel(nn.Module):
+    def __init__(
+        self,
+        config: EO1Config,
+        vlm_backbone: Qwen2_5_VLForConditionalGeneration | None = None,
+    ):
+        require_package("transformers", extra="eo1")
+        super().__init__()
+
+        self.config = config
+        # Preserve the backbone dtype selected at construction time so Qwen's fp32 rotary buffers stay intact.
+        self.vlm_backbone = vlm_backbone
+        self.hidden_size = self.vlm_backbone.config.text_config.hidden_size
+        max_state_dim = config.max_state_dim
+        max_action_dim = config.max_action_dim
+        self.state_proj = nn.Linear(max_state_dim, self.hidden_size, dtype=torch.float32)
+        self.action_in_proj = nn.Linear(max_action_dim, self.hidden_size, dtype=torch.float32)
+        self.action_out_proj = EO1VisionActionProjector(
+            self.hidden_size,
+            max_action_dim,
+            config.num_action_layers,
+            config.action_act,
+            dtype=torch.float32,
+        )
+        self.action_time_mlp_in = nn.Linear(self.hidden_size * 2, self.hidden_size, dtype=torch.float32)
+        self.action_time_mlp_out = nn.Linear(self.hidden_size, self.hidden_size, dtype=torch.float32)
+        self.gradient_checkpointing_enabled = False
+
+    def get_input_embeddings(self):
+        return self.vlm_backbone.get_input_embeddings()
+
+    def flow_head_autocast_context(self):
+        if self.config.force_fp32_autocast:
+            return torch.autocast(
+                device_type=self.state_proj.weight.device.type,
+                enabled=False,
+            )
+        return contextlib.nullcontext()
+
+    def gradient_checkpointing_enable(self):
+        """Enable gradient checkpointing for the Qwen2.5-VL backbone."""
+        self.gradient_checkpointing_enabled = True
+        self.vlm_backbone.gradient_checkpointing_enable(
+            gradient_checkpointing_kwargs={"use_reentrant": False}
+        )
+        logger.info("Enabled gradient checkpointing for EO1VisionFlowMatchingModel")
+
+    def gradient_checkpointing_disable(self):
+        """Disable gradient checkpointing for the Qwen2.5-VL backbone."""
+        self.gradient_checkpointing_enabled = False
+        self.vlm_backbone.gradient_checkpointing_disable()
+        logger.info("Disabled gradient checkpointing for EO1VisionFlowMatchingModel")
+
+    def _apply_checkpoint(self, func, *args, **kwargs):
+        """Apply manual gradient checkpointing to EO1 flow-head computations when training."""
+        if self.gradient_checkpointing_enabled and self.training and torch.is_grad_enabled():
+            return torch.utils.checkpoint.checkpoint(
+                func, *args, use_reentrant=False, preserve_rng_state=False, **kwargs
+            )
+        return func(*args, **kwargs)
+
+    def sample_noise(self, shape, device):
+        noise = torch.normal(
+            mean=0.0,
+            std=1.0,
+            size=shape,
+            dtype=torch.float32,
+            device=device,
+        )
+        return noise
+
+    def sample_time(self, bsize, device):
+        time_beta = sample_beta(
+            self.config.time_sampling_beta_alpha, self.config.time_sampling_beta_beta, bsize, device
+        )
+        time = time_beta * self.config.time_sampling_scale + self.config.time_sampling_offset
+        return time.to(dtype=torch.float32, device=device)
+
+    def get_placeholder_mask(
+        self,
+        input_ids: torch.LongTensor | None,
+        inputs_embeds: torch.FloatTensor | None,
+        state_features: torch.FloatTensor | None = None,
+        action_features: torch.FloatTensor | None = None,
+        *,
+        state_token_id: int,
+        action_token_id: int,
+    ) -> tuple[torch.BoolTensor, torch.BoolTensor]:
+        """Return EO1 state/action placeholder masks, following Qwen's multimodal mask style."""
+        if input_ids is None:
+            special_state_mask = inputs_embeds == self.get_input_embeddings()(
+                torch.tensor(state_token_id, dtype=torch.long, device=inputs_embeds.device)
+            )
+            special_state_mask = special_state_mask.all(-1)
+            special_action_mask = inputs_embeds == self.get_input_embeddings()(
+                torch.tensor(action_token_id, dtype=torch.long, device=inputs_embeds.device)
+            )
+            special_action_mask = special_action_mask.all(-1)
+        else:
+            special_state_mask = input_ids == state_token_id
+            special_action_mask = input_ids == action_token_id
+
+        n_state_tokens = special_state_mask.sum()
+        special_state_mask = (
+            special_state_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
+        )
+        if state_features is not None:
+            torch_compilable_check(
+                inputs_embeds[special_state_mask].numel() == state_features.numel(),
+                f"State features and state tokens do not match, tokens: {n_state_tokens}, features: {state_features.shape[0]}",
+            )
+
+        n_action_tokens = special_action_mask.sum()
+        special_action_mask = (
+            special_action_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
+        )
+        if action_features is not None:
+            torch_compilable_check(
+                inputs_embeds[special_action_mask].numel() == action_features.numel(),
+                f"Action features and action tokens do not match, tokens: {n_action_tokens}, features: {action_features.shape[0]}",
+            )
+
+        return special_state_mask, special_action_mask
+
+    def embed_prefix(
+        self,
+        input_ids: torch.LongTensor,
+        states: torch.Tensor,
+        *,
+        state_token_id: int,
+        action_token_id: int,
+    ) -> torch.FloatTensor:
+        """Embed the EO1 prefix tokens before native Qwen injects multimodal features."""
+
+        # Get the input embeddings for the input IDs
+        def input_embed_func(input_ids: torch.LongTensor) -> torch.FloatTensor:
+            return self.get_input_embeddings()(input_ids)
+
+        inputs_embeds = self._apply_checkpoint(input_embed_func, input_ids)
+
+        # Project the states to the hidden size
+        def state_proj_func(states: torch.Tensor) -> torch.FloatTensor:
+            with self.flow_head_autocast_context():
+                states = states.to(dtype=self.state_proj.weight.dtype)
+                return self.state_proj(states)
+
+        state_embs = self._apply_checkpoint(state_proj_func, states)
+        state_mask, _ = self.get_placeholder_mask(
+            input_ids,
+            inputs_embeds,
+            state_features=state_embs,
+            state_token_id=state_token_id,
+            action_token_id=action_token_id,
+        )
+        state_embs = state_embs.to(inputs_embeds.device, inputs_embeds.dtype)
+        inputs_embeds = inputs_embeds.masked_scatter(state_mask, state_embs)
+        return inputs_embeds
+
+    def embed_suffix(
+        self,
+        timestep: torch.Tensor,
+        noisy_actions: torch.Tensor,
+    ) -> torch.FloatTensor:
+        """Embed noisy actions and the flow timestep into action-token embeddings."""
+
+        def action_proj_func(noisy_actions: torch.Tensor) -> torch.FloatTensor:
+            with self.flow_head_autocast_context():
+                noisy_actions = noisy_actions.to(dtype=self.action_in_proj.weight.dtype)
+                return self.action_in_proj(noisy_actions)
+
+        action_embs = self._apply_checkpoint(action_proj_func, noisy_actions)
+        time_embs = create_sinusoidal_pos_embedding(
+            timestep,
+            self.hidden_size,
+            min_period=self.config.min_period,
+            max_period=self.config.max_period,
+            device=action_embs.device,
+        )
+        time_embs = time_embs.to(dtype=action_embs.dtype)
+        time_embs = time_embs[:, None, :].expand_as(action_embs)
+        action_time_embs = torch.cat([action_embs, time_embs], dim=2)
+
+        def mlp_func(action_time_embs: torch.Tensor) -> torch.FloatTensor:
+            with self.flow_head_autocast_context():
+                action_time_embs = action_time_embs.to(dtype=self.action_time_mlp_in.weight.dtype)
+                action_time_embs = self.action_time_mlp_in(action_time_embs)
+                action_time_embs = F.silu(action_time_embs)
+                return self.action_time_mlp_out(action_time_embs)
+
+        action_time_embs = self._apply_checkpoint(mlp_func, action_time_embs)
+        return action_time_embs
+
+    def forward(
+        self,
+        input_ids: torch.LongTensor | None = None,
+        attention_mask: torch.LongTensor | None = None,
+        pixel_values: torch.FloatTensor | None = None,
+        image_grid_thw: torch.LongTensor | None = None,
+        mm_token_type_ids: torch.IntTensor | None = None,
+        states: torch.FloatTensor | None = None,
+        action: torch.FloatTensor | None = None,
+        action_is_pad: torch.BoolTensor | None = None,
+        *,
+        state_token_id: int,
+        action_token_id: int,
+        **kwargs,
+    ) -> Tensor:
+        """Run the EO1 training forward pass and compute the flow-matching loss."""
+
+        # 1. Build the EO1 prefix with state placeholders resolved.
+        inputs_embeds = self.embed_prefix(
+            input_ids,
+            states=states,
+            state_token_id=state_token_id,
+            action_token_id=action_token_id,
+        )
+
+        # 2. Sample the flow-matching target and replace the action placeholders.
+        time = self.sample_time(action.shape[0], inputs_embeds.device)
+        noise = self.sample_noise(action.shape, inputs_embeds.device)
+
+        time_expanded = time[:, None, None]
+        x_t = time_expanded * noise + (1 - time_expanded) * action
+        u_t = noise - action
+        action_time_embs = self.embed_suffix(time, x_t)
+        _, action_mask = self.get_placeholder_mask(
+            input_ids,
+            inputs_embeds,
+            action_features=action_time_embs,
+            state_token_id=state_token_id,
+            action_token_id=action_token_id,
+        )
+        action_time_embs = action_time_embs.to(inputs_embeds.device, inputs_embeds.dtype)
+        inputs_embeds = inputs_embeds.masked_scatter(action_mask, action_time_embs)
+
+        # 3. Optionally drop padded action tokens from backbone attention.
+        if attention_mask is not None:
+            attention_mask = attention_mask.to(inputs_embeds.device)
+
+        if not self.config.supervise_padding_actions:
+            action_is_pad = action_is_pad.to(device=inputs_embeds.device, dtype=torch.bool)
+            action_token_mask = action_mask[..., 0]
+            action_padding_mask = torch.zeros_like(action_token_mask)
+            action_padding_mask = action_padding_mask.masked_scatter(
+                action_token_mask,
+                action_is_pad.reshape(-1),
+            )
+            attention_mask = attention_mask.masked_fill(action_padding_mask, 0)
+
+        # 4. Run the Qwen backbone on the fused EO1 sequence.
+        def vlm_forward_func(
+            input_ids: torch.LongTensor,
+            attention_mask: torch.Tensor | None,
+            inputs_embeds: torch.FloatTensor,
+            pixel_values: torch.Tensor | None,
+            image_grid_thw: torch.LongTensor | None,
+            mm_token_type_ids: torch.IntTensor | None,
+        ) -> torch.FloatTensor:
+            outputs = self.vlm_backbone.model(
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+                inputs_embeds=inputs_embeds,
+                pixel_values=pixel_values,
+                image_grid_thw=image_grid_thw,
+                mm_token_type_ids=mm_token_type_ids,
+                use_cache=False,
+                output_hidden_states=False,
+                return_dict=True,
+            )
+            return outputs.last_hidden_state
+
+        hidden_states = self._apply_checkpoint(
+            vlm_forward_func,
+            input_ids,
+            attention_mask,
+            inputs_embeds,
+            pixel_values,
+            image_grid_thw,
+            mm_token_type_ids,
+        )
+        action_hidden_states = hidden_states[action_mask[..., 0]]
+
+        # 5. Project the action-token hidden states back to the flow target space.
+        def action_out_proj_func(action_hidden_states: torch.FloatTensor) -> torch.FloatTensor:
+            with self.flow_head_autocast_context():
+                action_hidden_states = action_hidden_states.to(dtype=self.action_out_proj.dtype)
+                return self.action_out_proj(action_hidden_states)
+
+        v_t = self._apply_checkpoint(action_out_proj_func, action_hidden_states)
+        v_t = v_t.reshape(u_t.shape).to(dtype=u_t.dtype)
+        losses = F.mse_loss(u_t, v_t, reduction="none")
+
+        # 6. Apply the configured supervision mask and reduce the loss.
+        if not self.config.supervise_padding_action_dims:
+            original_action_dim = self.config.output_features[ACTION].shape[0]
+            losses = losses[..., :original_action_dim]
+
+        if not self.config.supervise_padding_actions:
+            losses = losses[~action_is_pad]
+
+        return losses.mean()
+
+    @torch.no_grad()
+    def sample_actions(
+        self,
+        input_ids: torch.LongTensor | None = None,
+        attention_mask: torch.Tensor | None = None,
+        pixel_values: torch.Tensor | None = None,
+        image_grid_thw: torch.LongTensor | None = None,
+        mm_token_type_ids: torch.IntTensor | None = None,
+        states: torch.Tensor | None = None,
+        *,
+        state_token_id: int,
+        action_token_id: int,
+        **kwargs,
+    ) -> Tensor:
+        """Sample an action chunk by Euler integration of the flow from noise (t=1) to t=0."""
+        if states is None:
+            raise ValueError("states are required for EO1 action sampling.")
+        if mm_token_type_ids is None:
+            raise ValueError("mm_token_type_ids are required for EO1 action sampling.")
+
+        # 1. Resolve the left-padded rollout prompt and locate the action span.
+        chunk_size = self.config.chunk_size
+
+        inputs_embeds = self.embed_prefix(
+            input_ids,
+            states=states,
+            state_token_id=state_token_id,
+            action_token_id=action_token_id,
+        ).clone()
+        _, action_placeholder_mask = self.get_placeholder_mask(
+            input_ids,
+            inputs_embeds,
+            state_token_id=state_token_id,
+            action_token_id=action_token_id,
+        )
+        action_mask = action_placeholder_mask[..., 0]
+        token_counts = action_mask.sum(dim=1)
+        if not torch.all(token_counts == chunk_size):
+            raise ValueError(
+                f"Each sample must contain exactly {chunk_size} action tokens, got {token_counts.tolist()}."
+            )
+        if action_mask.ne(action_mask[:1]).any():
+            raise ValueError(
+                "Batch inference expects all samples to share the same action token mask after left padding."
+            )
+        act_start = int(action_mask[0].to(torch.int64).argmax().item())
+        act_end = act_start + chunk_size
+        if not torch.all(action_mask[:, act_start:act_end]):
+            raise ValueError("Action tokens must form a contiguous chunk of length chunk_size.")
+        act_slice = slice(act_start, act_end)
+
+        # 2. Encode the fixed prefix once and cache its KV state.
+        batch_size = input_ids.shape[0]
+        device = inputs_embeds.device
+        attention_mask = attention_mask.to(device)
+        mm_token_type_ids = mm_token_type_ids.to(device)
+        position_ids, _ = self.vlm_backbone.model.get_rope_index(
+            input_ids,
+            image_grid_thw=image_grid_thw,
+            attention_mask=attention_mask,
+            mm_token_type_ids=mm_token_type_ids,
+        )
+        position_ids = position_ids.to(device)
+
+        outputs = self.vlm_backbone.model(
+            input_ids=input_ids[:, :act_start],
+            attention_mask=attention_mask[:, :act_start],
+            position_ids=position_ids[..., :act_start],
+            inputs_embeds=inputs_embeds[:, :act_start],
+            pixel_values=pixel_values,
+            image_grid_thw=image_grid_thw,
+            mm_token_type_ids=mm_token_type_ids[:, :act_start],
+            use_cache=True,
+            return_dict=True,
+        )
+
+        x_t = self.sample_noise(
+            (batch_size, chunk_size, self.config.max_action_dim),
+            device,
+        ).to(dtype=self.action_in_proj.weight.dtype)
+        dt = -1.0 / self.config.num_denoise_steps
+        past_key_values = outputs.past_key_values
+
+        # 3. Denoise only the action chunk while keeping the prefix cache invariant.
+        for step in range(self.config.num_denoise_steps):
+            time = torch.full(
+                (batch_size,),
+                1.0 + step * dt,
+                device=device,
+                dtype=torch.float32,
+            )
+            action_time_embs = self.embed_suffix(time, x_t)
+            inputs_embeds[:, act_slice] = action_time_embs.to(inputs_embeds.dtype)
+
+            # Keep the prefix KV cache invariant across denoising steps.
+            past_key_values.crop(act_start)
+            outputs = self.vlm_backbone.model(
+                attention_mask=attention_mask[:, :act_end],
+                past_key_values=past_key_values,
+                inputs_embeds=inputs_embeds[:, act_slice],
+                position_ids=position_ids[..., act_slice],
+                use_cache=True,
+                return_dict=True,
+            )
+            with self.flow_head_autocast_context():
+                hidden_states = outputs.last_hidden_state[:, :chunk_size]
+                hidden_states = hidden_states.to(dtype=self.action_out_proj.dtype)
+                v_t = self.action_out_proj(hidden_states)
+
+            x_t += dt * v_t.reshape(x_t.shape)
+
+        return x_t
@@ -0,0 +1,282 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+import torch
+
+from lerobot.configs.types import FeatureType, PipelineFeatureType, PolicyFeature
+from lerobot.policies.eo1.configuration_eo1 import EO1Config
+from lerobot.processor import (
+    AddBatchDimensionProcessorStep,
+    ComplementaryDataProcessorStep,
+    DeviceProcessorStep,
+    NormalizerProcessorStep,
+    PolicyAction,
+    PolicyProcessorPipeline,
+    ProcessorStep,
+    ProcessorStepRegistry,
+    RenameObservationsProcessorStep,
+    UnnormalizerProcessorStep,
+)
+from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
+from lerobot.types import TransitionKey
+from lerobot.utils.constants import (
+    OBS_STATE,
+    POLICY_POSTPROCESSOR_DEFAULT_NAME,
+    POLICY_PREPROCESSOR_DEFAULT_NAME,
+)
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers.models.qwen2_5_vl import Qwen2_5_VLProcessor
+else:
+    Qwen2_5_VLProcessor = None
+
+SYSTEM_MESSAGE = "You are a helpful physical assistant."
+
+# EO-1 special tokens
+ACTION_START_TOKEN = "<|action_start|>"  # nosec B105
+DEFAULT_ACTION_TOKEN = "<|action_pad|>"  # nosec B105
+ACTION_END_TOKEN = "<|action_end|>"  # nosec B105
+STATE_START_TOKEN = "<|state_start|>"  # nosec B105
+DEFAULT_STATE_TOKEN = "<|state_pad|>"  # nosec B105
+STATE_END_TOKEN = "<|state_end|>"  # nosec B105
+TASK_VLA_TOKEN = "<|vla|>"  # nosec B105
+
+EO1_SPECIAL_TOKENS = [
+    ACTION_START_TOKEN,
+    DEFAULT_ACTION_TOKEN,
+    ACTION_END_TOKEN,
+    STATE_START_TOKEN,
+    DEFAULT_STATE_TOKEN,
+    STATE_END_TOKEN,
+    TASK_VLA_TOKEN,
+]
+
+
+@dataclass
+@ProcessorStepRegistry.register(name="eo1_conversation_template_processor")
+class EO1ConversationTemplateStep(ComplementaryDataProcessorStep):
+    input_features: dict[str, PolicyFeature] | dict[str, dict[str, Any]]
+    chunk_size: int
+
+    _image_keys: list[str] = field(default_factory=list, init=False, repr=False)
+
+    def __post_init__(self):
+        # Rebuild PolicyFeature objects when the config was deserialized from JSON (skip empty maps).
+        if self.input_features:
+            first_val = next(iter(self.input_features.values()))
+            if isinstance(first_val, dict):
+                reconstructed = {}
+                for key, ft_dict in self.input_features.items():
+                    reconstructed[key] = PolicyFeature(
+                        type=FeatureType(ft_dict["type"]), shape=tuple(ft_dict["shape"])
+                    )
+                self.input_features = reconstructed
+
+        self._image_keys = [
+            key for key, value in self.input_features.items() if value.type == FeatureType.VISUAL
+        ]
+
+    def complementary_data(self, complementary_data):
+        tasks = complementary_data.get("task")
+        if tasks is None:
+            raise ValueError("Task is required for EO1ConversationTemplateStep.")
+
+        observation = self.transition.get(TransitionKey.OBSERVATION)
+        if observation is None:
+            raise ValueError("Observation is required for EO1ConversationTemplateStep.")
+
+        if OBS_STATE in observation and observation[OBS_STATE].shape[0] != len(tasks):
+            raise ValueError("Batch size mismatch between observation.state and task list.")
+
+        # LeRobot visual observations reach the processor as float32 tensors in [0, 1].
+        # Convert to uint8 in [0, 255] to meet the input requirement of Qwen2.5-VL-3B-Instruct.
+        images = {
+            key: observation[key].clamp(0, 1).mul(255.0).round().to(torch.uint8) for key in self._image_keys
+        }
+        messages = []
+        for i in range(len(tasks)):
+            content = [
+                *[{"type": "image", "image": images[key][i]} for key in self._image_keys],
+                {
+                    "type": "text",
+                    "text": (
+                        f"{STATE_START_TOKEN}{DEFAULT_STATE_TOKEN}{STATE_END_TOKEN}{tasks[i]}{TASK_VLA_TOKEN}"
+                    ),
+                },
+            ]
+            messages.append(
+                [
+                    {"role": "system", "content": [{"type": "text", "text": SYSTEM_MESSAGE}]},
+                    {"role": "user", "content": content},
+                    {
+                        "role": "assistant",
+                        "content": [
+                            {
+                                "type": "text",
+                                "text": f"{ACTION_START_TOKEN}{DEFAULT_ACTION_TOKEN * self.chunk_size}{ACTION_END_TOKEN}",
+                            }
+                        ],
+                    },
+                ]
+            )
+
+        complementary_data["messages"] = messages
+
+        return complementary_data
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        """
+        This step only materializes EO1-specific message objects in complementary_data.
+        PipelineFeatureType tracks only ACTION and OBSERVATION, so there is no static
+        feature contract change to record here.
+        """
+        return features
+
+    def get_config(self) -> dict[str, Any]:
+        return {
+            "input_features": {
+                key: {"type": ft.type.value, "shape": ft.shape} for key, ft in self.input_features.items()
+            },
+            "chunk_size": self.chunk_size,
+        }
+
+
+@dataclass
+@ProcessorStepRegistry.register(name="eo1_qwen_processor")
+class EO1QwenProcessorStep(ComplementaryDataProcessorStep):
+    processor_name: str = "Qwen/Qwen2.5-VL-3B-Instruct"
+    image_min_pixels: int | None = 64 * 28 * 28
+    image_max_pixels: int | None = 128 * 28 * 28
+    use_fast_processor: bool = False
+
+    _processor: Qwen2_5_VLProcessor | None = field(default=None, init=False, repr=False)
+    _state_token_id: int | None = field(default=None, init=False, repr=False)
+    _action_token_id: int | None = field(default=None, init=False, repr=False)
+
+    def __post_init__(self):
+        require_package("transformers", extra="eo1")
+        self._processor = Qwen2_5_VLProcessor.from_pretrained(
+            self.processor_name,
+            use_fast=self.use_fast_processor,
+        )
+        self._processor.tokenizer.add_tokens(EO1_SPECIAL_TOKENS, special_tokens=True)
+        self._state_token_id = self._processor.tokenizer.convert_tokens_to_ids(DEFAULT_STATE_TOKEN)
+        self._action_token_id = self._processor.tokenizer.convert_tokens_to_ids(DEFAULT_ACTION_TOKEN)
+
+    def complementary_data(self, complementary_data):
+        messages = complementary_data.pop("messages", None)
+        if messages is None:
+            raise ValueError("Messages are required for EO1QwenProcessorStep.")
+
+        # Rollout batches use left padding so action spans stay aligned across samples.
+        # Supervised batches use right padding to match standard training collation.
+        padding_side = "right" if self.transition.get(TransitionKey.ACTION) is not None else "left"
+
+        inputs = self._processor.apply_chat_template(
+            messages,
+            tokenize=True,
+            padding=True,
+            padding_side=padding_side,
+            min_pixels=self.image_min_pixels,
+            max_pixels=self.image_max_pixels,
+            add_generation_prompt=False,
+            return_dict=True,
+            return_tensors="pt",
+        )
+
+        complementary_data["input_ids"] = inputs["input_ids"]
+        complementary_data["pixel_values"] = inputs["pixel_values"]
+        complementary_data["image_grid_thw"] = inputs["image_grid_thw"]
+        complementary_data["attention_mask"] = inputs["attention_mask"]
+        complementary_data["mm_token_type_ids"] = inputs["mm_token_type_ids"]
+        complementary_data["state_token_id"] = self._state_token_id
+        complementary_data["action_token_id"] = self._action_token_id
+
+        return complementary_data
+
+    def get_config(self) -> dict[str, Any]:
+        return {
+            "processor_name": self.processor_name,
+            "image_min_pixels": self.image_min_pixels,
+            "image_max_pixels": self.image_max_pixels,
+            "use_fast_processor": self.use_fast_processor,
+        }
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        """
+        This step only converts the messages to the model input format.
+        """
+        return features
+
+
+def make_eo1_pre_post_processors(
+    config: EO1Config,
+    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
+) -> tuple[
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    """Build pre/post processor pipelines for EO1."""
+
+    input_steps: list[ProcessorStep] = [
+        RenameObservationsProcessorStep(rename_map={}),
+        AddBatchDimensionProcessorStep(),
+        NormalizerProcessorStep(
+            features={**config.input_features, **config.output_features},
+            norm_map=config.normalization_mapping,
+            stats=dataset_stats,
+        ),
+        EO1ConversationTemplateStep(input_features=config.input_features, chunk_size=config.chunk_size),
+        EO1QwenProcessorStep(
+            processor_name=config.vlm_base,
+            image_min_pixels=config.image_min_pixels,
+            image_max_pixels=config.image_max_pixels,
+            use_fast_processor=config.use_fast_processor,
+        ),
+        DeviceProcessorStep(device=config.device),
+    ]
+
+    output_steps: list[ProcessorStep] = [
+        UnnormalizerProcessorStep(
+            features=config.output_features,
+            norm_map=config.normalization_mapping,
+            stats=dataset_stats,
+        ),
+        DeviceProcessorStep(device="cpu"),
+    ]
+
+    return (
+        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+            steps=input_steps,
+            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+        ),
+        PolicyProcessorPipeline[PolicyAction, PolicyAction](
+            steps=output_steps,
+            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+            to_transition=policy_action_to_transition,
+            to_output=transition_to_policy_action,
+        ),
+    )
@@ -0,0 +1 @@
+../../../../docs/source/policy_evo1_README.md
@@ -0,0 +1,19 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .configuration_evo1 import Evo1Config
+from .modeling_evo1 import EVO1Policy
+from .processor_evo1 import make_evo1_pre_post_processors
+
+__all__ = ["Evo1Config", "EVO1Policy", "make_evo1_pre_post_processors"]
@@ -0,0 +1,225 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import math
+from dataclasses import dataclass, field
+
+from torch.optim import Optimizer
+from torch.optim.lr_scheduler import LambdaLR
+
+from lerobot.configs.policies import PreTrainedConfig
+from lerobot.configs.types import FeatureType, NormalizationMode, PolicyFeature
+from lerobot.optim.optimizers import AdamWConfig
+from lerobot.optim.schedulers import LRSchedulerConfig
+from lerobot.utils.constants import ACTION, OBS_IMAGES, OBS_STATE
+
+
+@LRSchedulerConfig.register_subclass("evo1_exact")
+@dataclass
+class Evo1SchedulerConfig(LRSchedulerConfig):
+    num_warmup_steps: int
+
+    def build(self, optimizer: Optimizer, num_training_steps: int) -> LambdaLR:
+        def lr_lambda(current_step: int) -> float:
+            if current_step < self.num_warmup_steps:
+                return current_step / max(1, self.num_warmup_steps)
+            progress = (current_step - self.num_warmup_steps) / max(
+                1, num_training_steps - self.num_warmup_steps
+            )
+            return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
+
+        return LambdaLR(optimizer, lr_lambda, -1)
+
+
+@PreTrainedConfig.register_subclass("evo1")
+@dataclass
+class Evo1Config(PreTrainedConfig):
+    training_stage: str = "stage1"
+    use_amp: bool = True
+
+    n_obs_steps: int = 1
+    chunk_size: int = 50
+    n_action_steps: int = 50
+
+    max_state_dim: int = 24
+    max_action_dim: int = 24
+    max_views: int = 3
+    image_resolution: tuple[int, int] = (448, 448)
+    empty_cameras: int = 0
+
+    normalization_mapping: dict[str, NormalizationMode] = field(
+        default_factory=lambda: {
+            "VISUAL": NormalizationMode.IDENTITY,
+            "STATE": NormalizationMode.MIN_MAX,
+            "ACTION": NormalizationMode.MIN_MAX,
+        }
+    )
+
+    vlm_model_name: str = "OpenGVLab/InternVL3-1B"
+    vlm_num_layers: int | None = 14
+    vlm_dtype: str = "bfloat16"
+    use_flash_attn: bool = True
+    action_head: str = "flowmatching"
+    embed_dim: int = 896
+    hidden_dim: int = 1024
+    state_hidden_dim: int = 1024
+    num_heads: int = 8
+    num_layers: int = 8
+    dropout: float = 0.0
+    num_inference_timesteps: int = 32
+    num_categories: int = 1
+    return_cls_only: bool = False
+    enable_gradient_checkpointing: bool = True
+    gradient_checkpointing_use_reentrant: bool = False
+
+    finetune_vlm: bool | None = None
+    finetune_language_model: bool | None = None
+    finetune_vision_model: bool | None = None
+    finetune_action_head: bool | None = None
+    # Reapply stage defaults after loading checkpoint configs so stage2 cannot
+    # accidentally inherit the frozen VLM flags stored by a stage1 checkpoint.
+    apply_training_stage_defaults: bool = True
+
+    task_field: str = "task"
+    embodiment_id_field: str | None = None
+    default_embodiment_id: int = 0
+
+    optimizer_lr: float = 1e-5
+    optimizer_betas: tuple[float, float] = (0.9, 0.999)
+    optimizer_eps: float = 1e-8
+    optimizer_weight_decay: float = 1e-5
+    optimizer_grad_clip_norm: float = 1.0
+
+    scheduler_warmup_steps: int = 300
+    drop_last: bool = True
+
+    def __post_init__(self):
+        super().__post_init__()
+        if self.training_stage not in {"stage1", "stage2"}:
+            raise ValueError(
+                f"Unsupported EVO1 training_stage '{self.training_stage}', expected 'stage1' or 'stage2'"
+            )
+
+        if self.apply_training_stage_defaults:
+            if self.training_stage == "stage1":
+                self.finetune_vlm = False
+                self.finetune_language_model = False
+                self.finetune_vision_model = False
+                self.finetune_action_head = True
+            elif self.training_stage == "stage2":
+                self.finetune_vlm = True
+                self.finetune_language_model = True
+                self.finetune_vision_model = True
+                self.finetune_action_head = True
+        elif self.training_stage == "stage1":
+            if self.finetune_vlm is None:
+                self.finetune_vlm = False
+            if self.finetune_language_model is None:
+                self.finetune_language_model = False
+            if self.finetune_vision_model is None:
+                self.finetune_vision_model = False
+            if self.finetune_action_head is None:
+                self.finetune_action_head = True
+        elif self.training_stage == "stage2":
+            has_explicit_branch_flags = any(
+                flag is not None for flag in (self.finetune_language_model, self.finetune_vision_model)
+            )
+            if not has_explicit_branch_flags:
+                if self.finetune_vlm is None:
+                    self.finetune_vlm = True
+                if self.finetune_language_model is None:
+                    self.finetune_language_model = True
+                if self.finetune_vision_model is None:
+                    self.finetune_vision_model = True
+            elif self.finetune_vlm is None:
+                self.finetune_vlm = bool(self.finetune_language_model or self.finetune_vision_model)
+            if self.finetune_action_head is None:
+                self.finetune_action_head = True
+
+        if self.finetune_vlm is None:
+            self.finetune_vlm = False
+        if self.finetune_language_model is None:
+            self.finetune_language_model = False
+        if self.finetune_vision_model is None:
+            self.finetune_vision_model = False
+        if self.finetune_action_head is None:
+            self.finetune_action_head = False
+
+        branch_vlm = self.finetune_language_model or self.finetune_vision_model
+        if self.finetune_vlm != branch_vlm:
+            raise ValueError(
+                "Inconsistent EVO1 finetune config: "
+                f"finetune_vlm={self.finetune_vlm} but "
+                f"(finetune_language_model or finetune_vision_model)={branch_vlm}. "
+                "When branch-level flags are used, finetune_vlm must match their effective union."
+            )
+
+        if self.n_action_steps > self.chunk_size:
+            raise ValueError(
+                f"n_action_steps ({self.n_action_steps}) must be <= chunk_size ({self.chunk_size})"
+            )
+
+    def validate_features(self) -> None:
+        if self.input_features is None:
+            self.input_features = {}
+        if self.output_features is None:
+            self.output_features = {}
+
+        for i in range(self.empty_cameras):
+            key = OBS_IMAGES + f".empty_camera_{i}"
+            if key not in self.input_features:
+                self.input_features[key] = PolicyFeature(
+                    type=FeatureType.VISUAL,
+                    shape=(3, *self.image_resolution),
+                )
+
+        if OBS_STATE not in self.input_features:
+            self.input_features[OBS_STATE] = PolicyFeature(
+                type=FeatureType.STATE,
+                shape=(self.max_state_dim,),
+            )
+
+        if ACTION not in self.output_features:
+            self.output_features[ACTION] = PolicyFeature(
+                type=FeatureType.ACTION,
+                shape=(self.max_action_dim,),
+            )
+
+    def get_optimizer_preset(self) -> AdamWConfig:
+        return AdamWConfig(
+            lr=self.optimizer_lr,
+            betas=self.optimizer_betas,
+            eps=self.optimizer_eps,
+            weight_decay=self.optimizer_weight_decay,
+            grad_clip_norm=self.optimizer_grad_clip_norm,
+        )
+
+    def get_scheduler_preset(self):
+        return Evo1SchedulerConfig(
+            num_warmup_steps=self.scheduler_warmup_steps,
+        )
+
+    @property
+    def observation_delta_indices(self) -> list[int]:
+        return [0]
+
+    @property
+    def action_delta_indices(self) -> list[int]:
+        return list(range(self.chunk_size))
+
+    @property
+    def reward_delta_indices(self) -> None:
+        return None
@@ -0,0 +1,234 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from collections.abc import Sequence
+from typing import Any
+
+import torch
+import torch.nn as nn
+from PIL import Image
+
+from lerobot.policies.evo1.flow_matching import FlowmatchingActionHead
+from lerobot.policies.evo1.internvl3_embedder import InternVL3Embedder
+
+
+def _cfgget(config: Any, key: str, default=None):
+    if isinstance(config, dict):
+        return config.get(key, default)
+    return getattr(config, key, default)
+
+
+class EVO1(nn.Module):
+    def __init__(self, config: dict):
+        super().__init__()
+        self.config = config
+        self._device = _cfgget(config, "device", "cuda")
+        self.return_cls_only = _cfgget(config, "return_cls_only", False)
+        vlm_name = _cfgget(config, "vlm_name", "OpenGVLab/InternVL3-1B")
+        image_size = _cfgget(config, "image_size")
+        if image_size is None:
+            image_resolution = _cfgget(config, "image_resolution", (448, 448))
+            image_size = int(image_resolution[0])
+
+        self.embedder = InternVL3Embedder(
+            model_name=vlm_name,
+            image_size=image_size,
+            device=self._device,
+            num_language_layers=_cfgget(config, "vlm_num_layers", 14),
+            model_dtype=_cfgget(config, "vlm_dtype", "bfloat16"),
+            use_flash_attn=_cfgget(config, "use_flash_attn", True),
+            enable_gradient_checkpointing=_cfgget(config, "enable_gradient_checkpointing", True),
+            gradient_checkpointing_use_reentrant=_cfgget(
+                config, "gradient_checkpointing_use_reentrant", False
+            ),
+        )
+
+        action_head_type = _cfgget(config, "action_head", "flowmatching").lower()
+        if action_head_type != "flowmatching":
+            raise NotImplementedError(f"Unknown action_head: {action_head_type}")
+
+        horizon = _cfgget(config, "action_horizon", _cfgget(config, "horizon", 16))
+        per_action_dim = _cfgget(config, "per_action_dim", 7)
+        action_dim = horizon * per_action_dim
+
+        if isinstance(config, dict):
+            config["horizon"] = horizon
+            config["per_action_dim"] = per_action_dim
+            config["action_dim"] = action_dim
+
+        self.horizon = horizon
+        self.per_action_dim = per_action_dim
+        self.action_head = FlowmatchingActionHead(config=config).to(self._device)
+
+    def _normalize_image_batches(
+        self,
+        images: Sequence[Image.Image | torch.Tensor] | Sequence[Sequence[Image.Image | torch.Tensor]],
+        prompt: str | list[str] | None,
+        image_mask: torch.Tensor,
+    ) -> tuple[list[list[Image.Image | torch.Tensor]], list[str], torch.Tensor]:
+        if not images:
+            raise ValueError("EVO1 expects at least one image per sample.")
+
+        first = images[0]
+        if isinstance(first, (Image.Image, torch.Tensor)):
+            image_batches = [list(images)]  # type: ignore[arg-type]
+        else:
+            image_batches = [list(sample) for sample in images]  # type: ignore[arg-type]
+
+        batch_size = len(image_batches)
+        if prompt is None:
+            prompts = [""] * batch_size
+        elif isinstance(prompt, str):
+            prompts = [prompt] * batch_size
+        else:
+            prompts = [str(p) for p in prompt]
+            if len(prompts) != batch_size:
+                raise ValueError(
+                    f"Prompt batch size {len(prompts)} does not match image batch size {batch_size}"
+                )
+
+        if image_mask.dim() == 1:
+            image_mask = image_mask.unsqueeze(0)
+        if image_mask.shape[0] != batch_size:
+            raise ValueError(
+                f"image_mask batch size {image_mask.shape[0]} does not match image batch size {batch_size}"
+            )
+
+        return image_batches, prompts, image_mask
+
+    def get_vl_embeddings(
+        self,
+        images: list[Image.Image | torch.Tensor] | list[list[Image.Image | torch.Tensor]],
+        image_mask: torch.Tensor,
+        prompt: str | list[str] | None = None,
+        return_cls_only: bool | None = None,
+    ) -> torch.Tensor:
+        if return_cls_only is None:
+            return_cls_only = self.return_cls_only
+
+        image_batches, prompts, image_mask = self._normalize_image_batches(images, prompt, image_mask)
+        return self.embedder.get_fused_image_text_embedding_from_tensor_images(
+            image_tensors_batch=image_batches,
+            image_masks=image_mask,
+            text_prompts=prompts,
+            return_cls_only=return_cls_only,
+        )
+
+    def prepare_state(self, state_input: list | torch.Tensor) -> torch.Tensor:
+        if isinstance(state_input, list):
+            state_tensor = torch.tensor(state_input, dtype=torch.float32)
+        elif isinstance(state_input, torch.Tensor):
+            state_tensor = state_input
+        else:
+            raise TypeError(f"Unsupported state input type: {type(state_input)}")
+
+        if state_tensor.ndim == 1:
+            state_tensor = state_tensor.unsqueeze(0)
+
+        return state_tensor.to(self._device)
+
+    def predict_action(
+        self,
+        fused_tokens: torch.Tensor,
+        state: torch.Tensor,
+        actions_gt: torch.Tensor | None = None,
+        action_mask: torch.Tensor | None = None,
+        embodiment_ids: torch.Tensor | None = None,
+    ):
+        if actions_gt is None:
+            return self.action_head.get_action(
+                fused_tokens,
+                state=state,
+                action_mask=action_mask,
+                embodiment_id=embodiment_ids,
+            )
+        return self.action_head(
+            fused_tokens,
+            state=state,
+            actions_gt=actions_gt,
+            action_mask=action_mask,
+            embodiment_id=embodiment_ids,
+        )
+
+    @torch.no_grad()
+    def run_inference(
+        self,
+        images: list[Image.Image | torch.Tensor],
+        image_mask: torch.Tensor,
+        prompt: str,
+        state_input: list | torch.Tensor,
+        return_cls_only: bool | None = None,
+        action_mask: torch.Tensor | None = None,
+        embodiment_ids: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        if image_mask.dim() == 1:
+            image_mask = image_mask.unsqueeze(0)
+
+        fused_tokens = self.get_vl_embeddings(
+            images=[images],
+            image_mask=image_mask,
+            prompt=[prompt],
+            return_cls_only=return_cls_only,
+        )
+        state_tensor = self.prepare_state(state_input)
+        action = self.predict_action(
+            fused_tokens,
+            state_tensor,
+            action_mask=action_mask,
+            embodiment_ids=embodiment_ids,
+        )
+        if isinstance(action, torch.Tensor) and action.dtype == torch.bfloat16:
+            action = action.to(torch.float32)
+        return action
+
+    def forward(
+        self,
+        fused_tokens: torch.Tensor,
+        state: torch.Tensor | None = None,
+        actions_gt: torch.Tensor | None = None,
+        action_mask: torch.Tensor | None = None,
+        embodiment_ids: torch.Tensor | None = None,
+    ):
+        return self.predict_action(fused_tokens, state, actions_gt, action_mask, embodiment_ids)
+
+    def _set_module_trainable(self, module: nn.Module, trainable: bool):
+        for param in module.parameters():
+            param.requires_grad = trainable
+
+    def set_finetune_flags(self):
+        finetune_vlm = _cfgget(self.config, "finetune_vlm", False)
+        finetune_language_model = _cfgget(self.config, "finetune_language_model", None)
+        finetune_vision_model = _cfgget(self.config, "finetune_vision_model", None)
+        has_explicit_branch_flags = any(
+            flag is not None for flag in (finetune_language_model, finetune_vision_model)
+        )
+        finetune_language_model = bool(finetune_language_model)
+        finetune_vision_model = bool(finetune_vision_model)
+        finetune_vlm = bool(finetune_vlm)
+
+        if has_explicit_branch_flags:
+            self._set_module_trainable(self.embedder, False)
+            if hasattr(self.embedder.model, "language_model"):
+                self._set_module_trainable(self.embedder.model.language_model, finetune_language_model)
+            if hasattr(self.embedder.model, "vision_model"):
+                self._set_module_trainable(self.embedder.model.vision_model, finetune_vision_model)
+            if hasattr(self.embedder.model, "mlp1"):
+                self._set_module_trainable(self.embedder.model.mlp1, finetune_vision_model)
+        elif not finetune_vlm:
+            self._set_module_trainable(self.embedder, False)
+
+        if not _cfgget(self.config, "finetune_action_head", False):
+            self._set_module_trainable(self.action_head, False)
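
Review note: the branch-flag logic in `set_finetune_flags` is easier to check as a pure function. The sketch below is illustrative only. `resolve_trainable` is a hypothetical helper that is not part of this PR, and it assumes that unset branch flags arrive as `None` and that parameters are trainable unless explicitly frozen.

```python
from typing import Dict, Optional


def resolve_trainable(
    finetune_vlm: bool = False,
    finetune_language_model: Optional[bool] = None,
    finetune_vision_model: Optional[bool] = None,
    finetune_action_head: bool = False,
) -> Dict[str, bool]:
    """Hypothetical helper: which sub-modules end up trainable for a given flag set."""
    # A branch flag counts as explicit only when it is not None.
    explicit = finetune_language_model is not None or finetune_vision_model is not None
    if explicit:
        # Explicit branch flags override finetune_vlm entirely.
        language = bool(finetune_language_model)
        vision = bool(finetune_vision_model)
    else:
        # Without branch flags, finetune_vlm decides for the whole embedder.
        language = vision = bool(finetune_vlm)
    return {
        "language_model": language,
        "vision_model": vision,
        "mlp1": vision,  # the projector follows the vision flag
        "action_head": bool(finetune_action_head),
    }
```

For example, `resolve_trainable(finetune_vlm=True, finetune_vision_model=True)` leaves the language model frozen, because one explicit branch flag is enough to take the explicit path.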
@@ -0,0 +1,456 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import logging
+import math
+from types import SimpleNamespace
+
+import torch
+import torch.nn as nn
+
+logger = logging.getLogger(__name__)
+
+
+def _cfgget(config, key: str, default=None):
+    if isinstance(config, dict):
+        return config.get(key, default)
+    return getattr(config, key, default)
+
+
+class SinusoidalPositionalEncoding(nn.Module):
+    def __init__(self, dim: int, max_len: int = 1000):
+        super().__init__()
+        pe = torch.zeros(max_len, dim)
+        position = torch.arange(0, max_len).unsqueeze(1)
+        div_term = torch.exp(torch.arange(0, dim, 2) * -(math.log(10000.0) / dim))
+        pe[:, 0::2] = torch.sin(position * div_term)
+        pe[:, 1::2] = torch.cos(position * div_term)
+        pe = pe.unsqueeze(0)
+        self.register_buffer("pe", pe)
+
+    def forward(self, seq_len: int):
+        if seq_len > self.pe.size(1):
+            self._extend_pe(seq_len)
+        return self.pe[:, :seq_len, :]
+
+    def _extend_pe(self, new_max_len):
+        old_max_len, dim = self.pe.size(1), self.pe.size(2)
+        if new_max_len <= old_max_len:
+            return
+        extra_positions = torch.arange(old_max_len, new_max_len, dtype=torch.float).unsqueeze(1)
+        div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * -(math.log(10000.0) / dim))
+        extra_pe = torch.zeros(new_max_len - old_max_len, dim)
+        extra_pe[:, 0::2] = torch.sin(extra_positions * div_term)
+        extra_pe[:, 1::2] = torch.cos(extra_positions * div_term)
+        extra_pe = extra_pe.unsqueeze(0)
+        new_pe = torch.cat([self.pe, extra_pe.to(self.pe.device)], dim=1)
+        self.pe = new_pe
+
+
+class CategorySpecificLinear(nn.Module):
+    def __init__(self, in_dim: int, out_dim: int, num_categories: int = 1):
+        super().__init__()
+        self.num_categories = num_categories
+        if num_categories <= 1:
+            self.linear = nn.Linear(in_dim, out_dim)
+        else:
+            self.weight = nn.Parameter(torch.empty(num_categories, in_dim, out_dim))
+            self.bias = nn.Parameter(torch.zeros(num_categories, out_dim))
+            nn.init.xavier_uniform_(self.weight)
+
+    def forward(self, x: torch.Tensor, category_id: torch.LongTensor):
+        if self.num_categories <= 1:
+            if x.dtype != self.linear.weight.dtype:
+                x = x.to(dtype=self.linear.weight.dtype)
+            return self.linear(x)
+
+        if x.dtype != self.weight.dtype:
+            x = x.to(dtype=self.weight.dtype)
+
+        orig_shape = x.shape
+        x_flat = x.reshape(-1, orig_shape[-1])
+        if category_id.dim() == 0:
+            cid = category_id.item()
+            out = x_flat @ self.weight[cid] + self.bias[cid]
+        else:
+            category_id = category_id.reshape(-1)
+            if category_id.numel() != x_flat.size(0):
+                raise ValueError(
+                    f"category_id length {category_id.numel()} does not match flattened batch {x_flat.size(0)}"
+                )
+            weight_selected = self.weight[category_id]
+            bias_selected = self.bias[category_id]
+            out = torch.bmm(x_flat.unsqueeze(1), weight_selected).squeeze(1) + bias_selected
+        out_shape = orig_shape[:-1] + (out.shape[-1],)
+        return out.view(out_shape)
+
+
+class CategorySpecificMLP(nn.Module):
+    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, num_categories: int = 1):
+        super().__init__()
+        self.fc1 = CategorySpecificLinear(input_dim, hidden_dim, num_categories)
+        self.fc2 = CategorySpecificLinear(hidden_dim, output_dim, num_categories)
+        self.activation = nn.ReLU(inplace=True)
+
+    def forward(self, x: torch.Tensor, category_id: torch.LongTensor):
+        out = self.activation(self.fc1(x, category_id))
+        out = self.fc2(out, category_id)
+        return out
+
+
+class MultiEmbodimentActionEncoder(nn.Module):
+    def __init__(
+        self, action_dim: int, embed_dim: int, hidden_dim: int, horizon: int, num_categories: int = 1
+    ):
+        super().__init__()
+        self.horizon = horizon
+        self.embed_dim = embed_dim
+        self.num_categories = num_categories
+
+        self.W1 = CategorySpecificLinear(action_dim, hidden_dim, num_categories)
+        self.W2 = CategorySpecificLinear(hidden_dim, hidden_dim, num_categories)
+        self.W3 = CategorySpecificLinear(hidden_dim, embed_dim, num_categories)
+
+        self.pos_encoding = SinusoidalPositionalEncoding(hidden_dim, max_len=horizon)
+        self.activation = nn.ReLU(inplace=True)
+
+    def forward(self, action_seq: torch.Tensor, category_id: torch.LongTensor):
+        batch_size, horizon, action_dim = action_seq.shape
+        assert self.horizon == horizon, "Action sequence length must match horizon"
+
+        x = action_seq.reshape(batch_size * horizon, action_dim)
+        if category_id.dim() == 0:
+            cat_ids = category_id.expand(horizon * batch_size)
+        else:
+            cat_ids = category_id.unsqueeze(1).expand(batch_size, horizon).reshape(batch_size * horizon)
+
+        out = self.activation(self.W1(x, cat_ids))
+        pos_enc = self.pos_encoding(horizon).to(device=out.device, dtype=out.dtype)
+        out = out.view(batch_size, horizon, -1) + pos_enc
+        out = out.view(batch_size * horizon, -1)
+        out = self.activation(self.W2(out, cat_ids))
+        out = self.W3(out, cat_ids)
+        return out.view(batch_size, horizon, self.embed_dim)
+
+
+class BasicTransformerBlock(nn.Module):
+    def __init__(self, embed_dim: int, num_heads: int, hidden_dim: int, dropout: float = 0.0):
+        super().__init__()
+        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
+        self.norm1 = nn.LayerNorm(embed_dim)
+        self.norm2 = nn.LayerNorm(embed_dim)
+        self.ff = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, embed_dim))
+
+    def forward(self, action_tokens: torch.Tensor, context_tokens: torch.Tensor, time_emb: torch.Tensor):
+        x = self.norm1(action_tokens)
+        attn_out, _ = self.attn(x, context_tokens, context_tokens)
+        x = action_tokens + attn_out
+        x2 = self.norm2(x)
+        if time_emb is not None:
+            x2 = x2 + time_emb.unsqueeze(1)
+        ff_out = self.ff(x2)
+        return x + ff_out
+
+
+class FlowmatchingActionHead(nn.Module):
+    def __init__(
+        self,
+        config=None,
+        embed_dim: int = 896,
+        hidden_dim: int = 1024,
+        action_dim: int = 16 * 7,
+        horizon: int = 16,
+        per_action_dim: int = 7,
+        num_heads: int = 8,
+        num_layers: int = 8,
+        dropout: float = 0.0,
+        num_inference_timesteps: int = 20,
+        num_categories: int = 1,
+    ):
+        super().__init__()
+
+        if config is not None:
+            embed_dim = _cfgget(config, "embed_dim", embed_dim)
+            hidden_dim = _cfgget(config, "hidden_dim", hidden_dim)
+            action_dim = _cfgget(config, "action_dim", action_dim)
+            horizon = _cfgget(config, "horizon", horizon)
+            per_action_dim = _cfgget(config, "per_action_dim", per_action_dim)
+            num_heads = _cfgget(config, "num_heads", num_heads)
+            num_layers = _cfgget(config, "num_layers", num_layers)
+            dropout = _cfgget(config, "dropout", dropout)
+            num_inference_timesteps = _cfgget(config, "num_inference_timesteps", num_inference_timesteps)
+            num_categories = _cfgget(config, "num_categories", num_categories)
+            self.config = config
+        else:
+            self.config = SimpleNamespace(
+                embed_dim=embed_dim,
+                hidden_dim=hidden_dim,
+                action_dim=action_dim,
+                horizon=horizon,
+                per_action_dim=per_action_dim,
+                num_heads=num_heads,
+                num_layers=num_layers,
+                dropout=dropout,
+                num_inference_timesteps=num_inference_timesteps,
+                num_categories=num_categories,
+            )
+
+        logger.info("FlowmatchingActionHead num_inference_timesteps=%s", num_inference_timesteps)
+        self.embed_dim = embed_dim
+        self.horizon = horizon
+        self.per_action_dim = _cfgget(self.config, "per_action_dim", per_action_dim)
+        self.action_dim = _cfgget(self.config, "action_dim", action_dim)
+
+        self.time_pos_enc = SinusoidalPositionalEncoding(embed_dim, max_len=1000)
+        self.transformer_blocks = nn.ModuleList(
+            [
+                BasicTransformerBlock(
+                    embed_dim=embed_dim,
+                    num_heads=num_heads,
+                    hidden_dim=embed_dim * 4,
+                    dropout=dropout,
+                )
+                for _ in range(num_layers)
+            ]
+        )
+        self.norm_out = nn.LayerNorm(embed_dim)
+        self.seq_pool_proj = nn.Linear(self.horizon * self.embed_dim, self.embed_dim)
+        self.mlp_head = CategorySpecificMLP(
+            input_dim=embed_dim,
+            hidden_dim=hidden_dim,
+            output_dim=action_dim,
+            num_categories=num_categories,
+        )
+
+        self.state_encoder = None
+        state_dim = _cfgget(self.config, "state_dim")
+        if state_dim is not None:
+            state_hidden = _cfgget(self.config, "state_hidden_dim", embed_dim)
+            self.state_encoder = CategorySpecificMLP(
+                input_dim=state_dim,
+                hidden_dim=state_hidden,
+                output_dim=embed_dim,
+                num_categories=num_categories,
+            )
+
+        if horizon > 1:
+            self.action_encoder = MultiEmbodimentActionEncoder(
+                action_dim=self.per_action_dim,
+                embed_dim=embed_dim,
+                hidden_dim=embed_dim,
+                horizon=horizon,
+                num_categories=num_categories,
+            )
+            self.single_action_proj = None
+        else:
+            self.action_encoder = None
+            self.single_action_proj = nn.Linear(self.per_action_dim, self.embed_dim)
+
+    def _project_actions(self, action_seq: torch.Tensor, embodiment_id: torch.LongTensor) -> torch.Tensor:
+        if self.horizon > 1 and self.action_encoder is not None:
+            return self.action_encoder(action_seq, embodiment_id)
+        if self.single_action_proj is None:
+            raise RuntimeError("single_action_proj is not initialized for horizon <= 1.")
+        return self.single_action_proj(action_seq)
+
+    def _expand_action_mask(
+        self,
+        action_mask: torch.Tensor | None,
+        batch_size: int,
+        per_action_dim: int,
+        device: torch.device,
+        dtype: torch.dtype,
+    ) -> torch.Tensor:
+        if action_mask is None:
+            raise ValueError("action_mask must be provided for flow matching inference.")
+
+        if action_mask.dim() == 2:
+            expected_last_dim = self.horizon * per_action_dim
+            if action_mask.shape == (batch_size, expected_last_dim):
+                expanded_mask = action_mask.reshape(batch_size, self.horizon, per_action_dim)
+            elif action_mask.shape == (batch_size, per_action_dim):
+                expanded_mask = action_mask.unsqueeze(1).expand(batch_size, self.horizon, per_action_dim)
+            else:
+                raise ValueError(
+                    f"Expected action_mask shape {(batch_size, expected_last_dim)} or "
+                    f"{(batch_size, per_action_dim)}, got {tuple(action_mask.shape)}"
+                )
+        elif action_mask.dim() == 3:
+            expected_shape = (batch_size, self.horizon, per_action_dim)
+            if tuple(action_mask.shape) != expected_shape:
+                raise ValueError(
+                    f"Expected action_mask shape {expected_shape}, got {tuple(action_mask.shape)}"
+                )
+            expanded_mask = action_mask
+        else:
+            raise ValueError(f"Unsupported action_mask rank: {action_mask.dim()}")
+
+        return expanded_mask.to(device=device, dtype=dtype)
+
+    def forward(
+        self,
+        fused_tokens: torch.Tensor,
+        state: torch.Tensor | None = None,
+        actions_gt: torch.Tensor | None = None,
+        embodiment_id: torch.LongTensor | None = None,
+        state_mask: torch.Tensor | None = None,
+        action_mask: torch.Tensor | None = None,
+    ):
+        if actions_gt is None:
+            return self.get_action(
+                fused_tokens, state=state, embodiment_id=embodiment_id, action_mask=action_mask
+            )
+
+        batch_size = fused_tokens.size(0)
+        device = fused_tokens.device
+        if embodiment_id is None:
+            embodiment_id = torch.zeros(batch_size, dtype=torch.long, device=device)
+
+        context_tokens = fused_tokens
+        if state is not None and self.state_encoder is not None:
+            state_emb = self.state_encoder(state, embodiment_id).unsqueeze(1)
+            context_tokens = torch.cat([context_tokens, state_emb], dim=1)
+
+        t = (
+            torch.distributions.Beta(2, 2)
+            .sample((batch_size,))
+            .clamp(0.02, 0.98)
+            .to(device)
+            .to(dtype=self.dtype)
+        )
+        time_index = (t * 999).long().clamp_(0, 999)
+        time_emb = self.time_pos_enc(1000)[:, time_index, :].squeeze(0).to(dtype=context_tokens.dtype)
+
+        actions_gt_seq = actions_gt
+        noise = torch.rand_like(actions_gt) * 2 - 1
+        if action_mask is not None:
+            action_mask = action_mask.to(dtype=noise.dtype, device=noise.device)
+            if action_mask.shape != noise.shape:
+                raise ValueError(f"action_mask shape {action_mask.shape} != noise shape {noise.shape}")
+            actions_gt_seq = actions_gt_seq * action_mask
+            noise = noise * action_mask
+
+        if self.horizon > 1:
+            noise_seq = noise.view(batch_size, self.horizon, self.per_action_dim)
+        else:
+            noise_seq = noise if noise.dim() == 3 else noise.unsqueeze(1)
+        t_broadcast = t.view(batch_size, 1, 1)
+        action_intermediate_seq = (1 - t_broadcast) * noise_seq + t_broadcast * actions_gt_seq
+
+        action_tokens = self._project_actions(action_intermediate_seq, embodiment_id)
+        target_dtype = self.dtype
+        action_tokens = action_tokens.to(dtype=target_dtype)
+        context_tokens = context_tokens.to(dtype=target_dtype)
+        time_emb = time_emb.to(dtype=target_dtype)
+
+        x = action_tokens
+        for block in self.transformer_blocks:
+            x = block(x, context_tokens, time_emb)
+        x = self.norm_out(x)
+
+        if self.horizon > 1:
+            x_flat = x.reshape(batch_size, -1)
+            x_pooled = self.seq_pool_proj(x_flat)
+        else:
+            x_pooled = x.squeeze(1)
+
+        pred_velocity = self.mlp_head(x_pooled, embodiment_id)
+        return pred_velocity, noise
+
+    def get_action(
+        self,
+        fused_tokens: torch.Tensor,
+        state: torch.Tensor | None = None,
+        embodiment_id: torch.LongTensor | None = None,
+        action_mask: torch.Tensor | None = None,
+    ):
+        batch_size = fused_tokens.size(0)
+        device = fused_tokens.device
+        if embodiment_id is None:
+            embodiment_id = torch.zeros(batch_size, dtype=torch.long, device=device)
+
+        context_tokens = fused_tokens
+        if state is not None and self.state_encoder is not None:
+            state_emb = self.state_encoder(state, embodiment_id).unsqueeze(1)
+            context_tokens = torch.cat([context_tokens, state_emb], dim=1)
+
+        action_dim_total = _cfgget(self.config, "action_dim", self.action_dim)
+        per_action_dim = _cfgget(self.config, "per_action_dim", action_dim_total // max(self.horizon, 1))
+
+        action = torch.rand(batch_size, action_dim_total, device=device, dtype=context_tokens.dtype) * 2 - 1
+        action_seq = (
+            action.view(batch_size, self.horizon, per_action_dim)
+            if self.horizon > 1
+            else action.view(batch_size, 1, per_action_dim)
+        )
+        action_mask = self._expand_action_mask(
+            action_mask,
+            batch_size=batch_size,
+            per_action_dim=per_action_dim,
+            device=action_seq.device,
+            dtype=action_seq.dtype,
+        )
+        action_seq = action_seq * action_mask
+
+        target_dtype = self.dtype
+        context_tokens = context_tokens.to(dtype=target_dtype)
+
+        num_steps = int(_cfgget(self.config, "num_inference_timesteps", 20))
+        if num_steps <= 0:
+            raise ValueError(f"num_inference_timesteps must be positive, got {num_steps}")
+        dt = 1.0 / num_steps
+
+        for i in range(num_steps):
+            t = i / num_steps
+            time_index = min(int(t * 999), 999)
+            time_emb = (
+                self.time_pos_enc(1000)[:, time_index, :].to(device).squeeze(0).to(dtype=context_tokens.dtype)
+            )
+            time_emb = time_emb.unsqueeze(0).repeat(batch_size, 1)
+
+            action_seq = action_seq * action_mask
+            action_tokens = self._project_actions(action_seq, embodiment_id).to(dtype=target_dtype)
+            time_emb = time_emb.to(dtype=target_dtype)
+
+            x = action_tokens
+            for block in self.transformer_blocks:
+                x = block(x, context_tokens, time_emb)
+            x = self.norm_out(x)
+
+            if self.horizon > 1:
+                x_flat = x.reshape(batch_size, -1)
+                x_pooled = self.seq_pool_proj(x_flat)
+            else:
+                x_pooled = x.squeeze(1)
+
+            pred = self.mlp_head(x_pooled, embodiment_id)
+            action = action + dt * pred
+            action_seq = (
+                action.view(batch_size, self.horizon, per_action_dim)
+                if self.horizon > 1
+                else action.view(batch_size, 1, per_action_dim)
+            )
+
+        action_seq = action_seq * action_mask
+        return action_seq.reshape(batch_size, -1)
+
+    @property
+    def device(self):
+        return next(self.parameters()).device
+
+    @property
+    def dtype(self):
+        return next(self.parameters()).dtype
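
Review note: `forward` builds a point on the straight path `(1 - t) * noise + t * actions_gt`, and `get_action` integrates the predicted velocity with fixed-step Euler (`action = action + dt * pred`). The dependency-free sketch below shows why the two fit together. It is not the PR's implementation: `interpolate`, `euler_sample` and `oracle_velocity` are hypothetical names, plain lists stand in for tensors, and the oracle replaces the learned network with the exact velocity of the straight path (`target - noise`). The loss is not shown in this file, so whether training regresses onto exactly that target is an assumption.

```python
def interpolate(noise, target, t):
    """Point on the straight path from noise (t=0) to target (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, target)]


def euler_sample(velocity_fn, start, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    if num_steps <= 0:
        raise ValueError(f"num_steps must be positive, got {num_steps}")
    dt = 1.0 / num_steps
    x = list(start)
    for i in range(num_steps):
        t = i / num_steps
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.5, -0.25, 0.0]
target = [1.0, 0.5, -1.0]


def oracle_velocity(x, t):
    """Stand-in for the network: the constant velocity of the straight path."""
    return [a - n for n, a in zip(noise, target)]


# With the exact velocity, Euler lands on the target up to float rounding.
result = euler_sample(oracle_velocity, noise, num_steps=20)
```

Because the path is linear, its velocity is constant, so even a small number of Euler steps is exact for the oracle. With a learned velocity field, the step count trades accuracy against inference latency.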
@@ -0,0 +1,435 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import functools
+import logging
+import types
+from collections.abc import Sequence
+from contextlib import contextmanager
+from typing import TYPE_CHECKING
+
+import torch
+import torch.nn as nn
+import torch.utils.checkpoint
+import torchvision.transforms.functional as TF
+from PIL import Image
+from torchvision.transforms.functional import to_pil_image
+
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import AutoModel, AutoTokenizer
+else:
+    AutoModel = None
+    AutoTokenizer = None
+
+IMAGENET_MEAN = (0.485, 0.456, 0.406)
+IMAGENET_STD = (0.229, 0.224, 0.225)
+IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"  # nosec B105
+IMG_START_TOKEN = "<img>"  # nosec B105
+IMG_END_TOKEN = "</img>"  # nosec B105
+
+logger = logging.getLogger(__name__)
+
+
+def _patch_vision_encoder_checkpointing(encoder: nn.Module, use_reentrant: bool) -> None:
+    if getattr(encoder, "_evo1_checkpoint_patch_applied", False):
+        encoder.gradient_checkpointing_use_reentrant = use_reentrant
+        return
+
+    original_forward = encoder.forward
+
+    def forward_with_checkpoint_kwargs(self, *args, **kwargs):
+        original_checkpoint = torch.utils.checkpoint.checkpoint
+
+        def checkpoint(function, *checkpoint_args, **checkpoint_kwargs):
+            checkpoint_kwargs.setdefault("use_reentrant", self.gradient_checkpointing_use_reentrant)
+            return original_checkpoint(function, *checkpoint_args, **checkpoint_kwargs)
+
+        torch.utils.checkpoint.checkpoint = checkpoint
+        try:
+            return original_forward(*args, **kwargs)
+        finally:
+            torch.utils.checkpoint.checkpoint = original_checkpoint
+
+    encoder.gradient_checkpointing_use_reentrant = use_reentrant
+    encoder.forward = types.MethodType(forward_with_checkpoint_kwargs, encoder)
+    encoder._evo1_checkpoint_patch_applied = True
+
+
+def flash_attn_is_available() -> bool:
+    try:
+        import flash_attn  # noqa: F401
+    except ImportError:
+        return False
+    return True
+
+
+@contextmanager
+def _internvl_transformers5_load_compatibility():
+    from transformers.modeling_utils import PreTrainedModel
+
+    original_linspace = torch.linspace
+    original_mark_tied = PreTrainedModel.mark_tied_weights_as_initialized
+
+    def linspace(*args, **kwargs):
+        if kwargs.get("device") is None:
+            kwargs["device"] = torch.device("cpu")
+        return original_linspace(*args, **kwargs)
+
+    def mark_tied_weights_as_initialized(self, loading_info):
+        if not hasattr(self, "all_tied_weights_keys"):
+            self.all_tied_weights_keys = {}
+        return original_mark_tied(self, loading_info)
+
+    torch.linspace = linspace
+    PreTrainedModel.mark_tied_weights_as_initialized = mark_tied_weights_as_initialized
+    try:
+        yield
+    finally:
+        torch.linspace = original_linspace
+        PreTrainedModel.mark_tied_weights_as_initialized = original_mark_tied
+
+
+@functools.lru_cache(maxsize=10000)
+def get_target_aspect_ratio(orig_width: int, orig_height: int, image_size: int, min_num: int, max_num: int):
+    aspect_ratio = orig_width / orig_height
+    target_ratios = {
+        (i, j)
+        for n in range(min_num, max_num + 1)
+        for i in range(1, n + 1)
+        for j in range(1, n + 1)
+        if i * j <= max_num and i * j >= min_num
+    }
+    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+    best_ratio_diff = float("inf")
+    best_ratio = (1, 1)
+    area = orig_width * orig_height
+    for ratio in target_ratios:
+        target_ar = ratio[0] / ratio[1]
+        diff = abs(aspect_ratio - target_ar)
+        if diff < best_ratio_diff:
+            best_ratio_diff = diff
+            best_ratio = ratio
+        elif diff == best_ratio_diff and area > 0.5 * image_size**2 * ratio[0] * ratio[1]:
+            best_ratio = ratio
+    return best_ratio
+
+
+def dynamic_preprocess(image, min_num=1, max_num=1, image_size=448, use_thumbnail=False):
+    orig_width, orig_height = image.size
+    ratio_w, ratio_h = get_target_aspect_ratio(orig_width, orig_height, image_size, min_num, max_num)
+    target_width = image_size * ratio_w
+    target_height = image_size * ratio_h
+    blocks = ratio_w * ratio_h
+    resized_img = image.resize((target_width, target_height))
+    processed_images = []
+    for i in range(blocks):
+        box = (
+            (i % (target_width // image_size)) * image_size,
+            (i // (target_width // image_size)) * image_size,
+            ((i % (target_width // image_size)) + 1) * image_size,
+            ((i // (target_width // image_size)) + 1) * image_size,
+        )
+        processed_images.append(resized_img.crop(box))
+    if use_thumbnail and len(processed_images) != 1:
+        processed_images.append(image.resize((image_size, image_size)))
+    return processed_images
+
+
+class InternVL3Embedder(nn.Module):
+    def __init__(
+        self,
+        model_name="OpenGVLab/InternVL3-1B",
+        image_size=448,
+        device="cuda",
+        num_language_layers: int | None = 14,
+        model_dtype: str | torch.dtype = "bfloat16",
+        use_flash_attn: bool = True,
+        enable_gradient_checkpointing: bool = True,
+        gradient_checkpointing_use_reentrant: bool = False,
+    ):
+        super().__init__()
+        self._requested_device = device
+        self.image_size = image_size
+        self.num_language_layers = num_language_layers
+        self.max_text_length = 1024
+        self.enable_gradient_checkpointing = bool(enable_gradient_checkpointing)
+        self.gradient_checkpointing_use_reentrant = bool(gradient_checkpointing_use_reentrant)
+
+        require_package("transformers", extra="evo1")
+
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)
+        if isinstance(model_dtype, str):
+            resolved_dtype = getattr(torch, model_dtype, None)
+            if not isinstance(resolved_dtype, torch.dtype):  # rejects names like "nn" that are not dtypes
+                raise ValueError(f"Unsupported EVO1 vlm_dtype '{model_dtype}'")
+            model_dtype = resolved_dtype
+
+        resolved_use_flash_attn = bool(use_flash_attn and flash_attn_is_available())
+        if use_flash_attn and not resolved_use_flash_attn:
+            logger.warning("flash_attn could not be imported. Falling back to standard attention.")
+
+        # InternVL3 remote code predates Transformers 5 post-init conventions:
+        # it computes stochastic-depth scalars via torch.linspace(...).item()
+        # while Transformers initializes under torch.device("meta"), and it
+        # does not populate all_tied_weights_keys before loading finalization.
+        with _internvl_transformers5_load_compatibility():
+            self.model = AutoModel.from_pretrained(
+                model_name,
+                torch_dtype=model_dtype,
+                trust_remote_code=True,
+                use_flash_attn=resolved_use_flash_attn,
+                low_cpu_mem_usage=True,
+                _fast_init=False,
+            ).to(self._requested_device)
+
+        if hasattr(self.model.language_model, "model"):
+            layers = self.model.language_model.model.layers
+        else:
+            layers = self.model.language_model.layers
+        if self.num_language_layers is not None:
+            layers = layers[: self.num_language_layers]
+
+        if hasattr(self.model.language_model, "model"):
+            self.model.language_model.model.layers = torch.nn.ModuleList(layers)
+        else:
+            self.model.language_model.layers = torch.nn.ModuleList(layers)
+        self.model.language_model.lm_head = torch.nn.Identity()
+
+        self._configure_memory_features()
+        self.img_context_token_id = self.tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
+
+    def _configure_memory_features(self) -> None:
+        checkpoint_kwargs = {"use_reentrant": self.gradient_checkpointing_use_reentrant}
+
+        if not self.enable_gradient_checkpointing:
+            if hasattr(self.model, "vision_model") and hasattr(self.model.vision_model, "encoder"):
+                self.model.vision_model.encoder.gradient_checkpointing = False
+            language_model = getattr(self.model, "language_model", None)
+            if language_model is not None:
+                if hasattr(language_model, "gradient_checkpointing_disable"):
+                    language_model.gradient_checkpointing_disable()
+                elif hasattr(language_model, "gradient_checkpointing"):
+                    language_model.gradient_checkpointing = False
+                if hasattr(language_model, "model"):
+                    inner = language_model.model
+                    if hasattr(inner, "gradient_checkpointing_disable"):
+                        inner.gradient_checkpointing_disable()
+                    elif hasattr(inner, "gradient_checkpointing"):
+                        inner.gradient_checkpointing = False
+            return
+
+        def _enable_ckpt(module: nn.Module | None) -> bool:
+            if module is None:
+                return False
+            if hasattr(module, "gradient_checkpointing_enable"):
+                try:
+                    module.gradient_checkpointing_enable(gradient_checkpointing_kwargs=checkpoint_kwargs)
+                except TypeError:
+                    module.gradient_checkpointing_enable()
+                return True
+            if hasattr(module, "gradient_checkpointing"):
+                module.gradient_checkpointing = True
+                return True
+            return False
+
+        enabled_any = _enable_ckpt(self.model)
+
+        if hasattr(self.model, "vision_model") and hasattr(self.model.vision_model, "encoder"):
+            encoder = self.model.vision_model.encoder
+            encoder.gradient_checkpointing = True
+            _patch_vision_encoder_checkpointing(
+                encoder, use_reentrant=self.gradient_checkpointing_use_reentrant
+            )
+            enabled_any = True
+
+        language_model = getattr(self.model, "language_model", None)
+        if language_model is not None:
+            enabled_any = _enable_ckpt(language_model) or enabled_any
+            if hasattr(language_model, "model"):
+                enabled_any = _enable_ckpt(language_model.model) or enabled_any
+            if hasattr(language_model, "config"):
+                language_model.config.use_cache = False
+
+        if hasattr(self.model, "config"):
+            self.model.config.use_cache = False
+        if hasattr(self.model, "enable_input_require_grads"):
+            self.model.enable_input_require_grads()
+
+        if enabled_any:
+            logger.info("Gradient checkpointing enabled for InternVL3 embedder.")
+        else:
+            logger.warning(
+                "Requested gradient checkpointing, but model does not expose checkpointing controls."
+            )
+
+    def _preprocess_single_image(self, image: Image.Image | torch.Tensor) -> torch.Tensor:
+        if isinstance(image, torch.Tensor):
+            pil_image = to_pil_image(image.detach().cpu())
+        else:
+            pil_image = image.convert("RGB")
+        tiles = dynamic_preprocess(pil_image, image_size=self.image_size)
+        dtype = next(self.model.parameters()).dtype  # follow the loaded VLM dtype, not a fixed bfloat16
+        tile_tensors = torch.stack([TF.to_tensor(tile) for tile in tiles]).to(device=self.device, dtype=dtype)
+        # ImageNet statistics, shaped to broadcast over (tiles, C, H, W).
+        mean = torch.tensor(IMAGENET_MEAN, device=self.device, dtype=dtype).view(1, 3, 1, 1)
+        std = torch.tensor(IMAGENET_STD, device=self.device, dtype=dtype).view(1, 3, 1, 1)
+        return (tile_tensors - mean) / std
+
+    def _preprocess_images(
+        self,
+        image_tensors_batch: Sequence[Sequence[Image.Image | torch.Tensor]],
+    ) -> tuple[torch.Tensor, list[list[int]]]:
+        pixel_values_list = []
+        batch_num_tiles_list: list[list[int]] = []
+
+        for image_tensors in image_tensors_batch:
+            num_tiles_list: list[int] = []
+            for image in image_tensors:
+                tiles = self._preprocess_single_image(image)
+                pixel_values_list.append(tiles)
+                num_tiles_list.append(int(tiles.shape[0]))
+            batch_num_tiles_list.append(num_tiles_list)
+
+        if pixel_values_list:
+            pixel_values = torch.cat(pixel_values_list, dim=0)
+        else:
+            pixel_values = torch.empty(
+                0, 3, self.image_size, self.image_size, dtype=torch.bfloat16, device=self.device
+            )
+        return pixel_values, batch_num_tiles_list
+
+    def _build_multimodal_prompts(
+        self,
+        batch_num_tiles_list: list[list[int]],
+        text_prompts: Sequence[str],
+    ) -> list[str]:
+        prompts = []
+        for num_tiles_list, text_prompt in zip(batch_num_tiles_list, text_prompts, strict=True):
+            prompt_segments = []
+            for i, tile_count in enumerate(num_tiles_list):
+                token_count = self.model.num_image_token * tile_count
+                image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * token_count + IMG_END_TOKEN
+                prompt_segments.append(f"Image-{i + 1}: {image_tokens}\n")
+            prompts.append("".join(prompt_segments) + text_prompt.strip())
+        return prompts
+
+    def _prepare_and_fuse_embeddings(
+        self,
+        prompts: Sequence[str],
+        vit_embeds: torch.Tensor,
+        image_masks: torch.Tensor,
+        batch_num_tiles_list: list[list[int]],
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        untruncated_ids = self.tokenizer(list(prompts), padding=False, truncation=False)["input_ids"]
+        true_sequence_length = max((len(ids) for ids in untruncated_ids), default=0)
+        if true_sequence_length > self.max_text_length:
+            logger.warning(
+                "InternVL3 prompt truncated in batch: max_length=%s actual_max_length=%s",
+                self.max_text_length,
+                true_sequence_length,
+            )
+
+        model_inputs = self.tokenizer(
+            list(prompts),
+            return_tensors="pt",
+            padding="max_length",
+            truncation=True,
+            max_length=self.max_text_length,
+        ).to(self.device)
+        input_ids = model_inputs["input_ids"]
+        attention_mask = model_inputs["attention_mask"]
+
+        img_token_mask = input_ids == self.img_context_token_id
+        input_embeds = self.model.language_model.get_input_embeddings()(input_ids).clone()
+
+        batch_size, _, channels = input_embeds.shape
+        vit_embeds = vit_embeds.reshape(-1, channels).to(dtype=input_embeds.dtype, device=input_embeds.device)
+        tokens_per_tile = self.model.num_image_token
+        actual_vis_tokens_list = img_token_mask.sum(dim=1).tolist()
+
+        vit_idx = 0
+        for batch_index in range(batch_size):
+            expected_vis_tokens = sum(batch_num_tiles_list[batch_index]) * tokens_per_tile
+            mask_b = img_token_mask[batch_index]
+            actual_vis_tokens = actual_vis_tokens_list[batch_index]
+
+            item_vit_embeds = vit_embeds[vit_idx : vit_idx + expected_vis_tokens]
+            vit_idx += expected_vis_tokens
+            if actual_vis_tokens > 0:
+                if item_vit_embeds.shape[0] < actual_vis_tokens:
+                    raise ValueError(
+                        f"InternVL3 produced fewer image tokens than expected for sample {batch_index}: "
+                        f"got {item_vit_embeds.shape[0]}, need {actual_vis_tokens}"
+                    )
+                input_embeds[batch_index, mask_b] = item_vit_embeds[:actual_vis_tokens]
+
+            current_token_idx = 0
+            img_token_locations = torch.where(mask_b)[0]
+            for image_index, num_tiles in enumerate(batch_num_tiles_list[batch_index]):
+                num_tokens_for_image = num_tiles * tokens_per_tile
+                if not bool(image_masks[batch_index, image_index].item()):
+                    start_offset = current_token_idx
+                    end_offset = min(current_token_idx + num_tokens_for_image, len(img_token_locations))
+                    if start_offset < end_offset:
+                        idxs = img_token_locations[start_offset:end_offset]
+                        attention_mask[batch_index, idxs] = 0
+                current_token_idx += num_tokens_for_image
+
+        return input_embeds, attention_mask
+
+    def get_fused_image_text_embedding_from_tensor_images(
+        self,
+        image_tensors_batch: Sequence[Sequence[Image.Image | torch.Tensor]],
+        image_masks: torch.Tensor,
+        text_prompts: Sequence[str],
+        return_cls_only: bool = True,
+    ):
+        pixel_values, batch_num_tiles_list = self._preprocess_images(image_tensors_batch)
+        if pixel_values.shape[0] == 0:
+            logger.warning("InternVL3 received an empty image batch after preprocessing.")
+            hidden_size = getattr(self.model.config, "hidden_size", None)
+            if hidden_size is None and hasattr(self.model.language_model, "config"):
+                hidden_size = getattr(self.model.language_model.config, "hidden_size", None)
+            if hidden_size is None:
+                raise RuntimeError("Unable to infer hidden size for empty InternVL3 batch.")
+            empty = torch.empty(0, hidden_size, device=self.device, dtype=torch.float32)
+            return empty
+
+        prompts = self._build_multimodal_prompts(batch_num_tiles_list, text_prompts)
+        vit_embeds = self.model.extract_feature(pixel_values)
+        inputs_embeds, attention_mask = self._prepare_and_fuse_embeddings(
+            prompts,
+            vit_embeds,
+            image_masks.to(device=self.device),
+            batch_num_tiles_list,
+        )
+
+        outputs = self.model.language_model(
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            output_hidden_states=True,
+            use_cache=False,
+            return_dict=True,
+        )
+        fused_hidden = outputs.hidden_states[-1].to(torch.float32)
+        return fused_hidden[:, 0, :] if return_cls_only else fused_hidden
+
+    @property
+    def device(self) -> torch.device:
+        return next(self.model.parameters()).device
@@ -0,0 +1,450 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import builtins
+from collections import deque
+from contextlib import nullcontext
+from pathlib import Path
+
+import torch
+from torch import Tensor
+
+from lerobot.configs.policies import PreTrainedConfig
+from lerobot.policies.evo1.configuration_evo1 import Evo1Config
+from lerobot.policies.evo1.evo1_model import EVO1
+from lerobot.policies.pretrained import PreTrainedPolicy, T
+from lerobot.utils.constants import ACTION, OBS_IMAGES, OBS_STATE
+
+
+class EVO1Policy(PreTrainedPolicy):
+    config_class = Evo1Config
+    name = "evo1"
+
+    def __init__(self, config: Evo1Config, **kwargs):
+        super().__init__(config)
+        config.validate_features()
+
+        if len(config.image_features) > config.max_views:
+            raise ValueError(
+                f"EVO1 supports at most {config.max_views} camera streams, got {len(config.image_features)}"
+            )
+
+        self.config = config
+        self.model = EVO1(self._build_model_config(config))
+        self.model.set_finetune_flags()
+        self.reset()
+
+    @classmethod
+    def from_pretrained(
+        cls: builtins.type[T],
+        pretrained_name_or_path: str | Path,
+        *,
+        config: PreTrainedConfig | None = None,
+        force_download: bool = False,
+        resume_download: bool | None = None,
+        proxies: dict | None = None,
+        token: str | bool | None = None,
+        cache_dir: str | Path | None = None,
+        local_files_only: bool = False,
+        revision: str | None = None,
+        strict: bool | None = None,
+        **kwargs,
+    ) -> T:
+        if strict is None:
+            strict = not (config is not None and getattr(config, "training_stage", None) == "stage2")
+        return super().from_pretrained(
+            pretrained_name_or_path=pretrained_name_or_path,
+            config=config,
+            force_download=force_download,
+            resume_download=resume_download,
+            proxies=proxies,
+            token=token,
+            cache_dir=cache_dir,
+            local_files_only=local_files_only,
+            revision=revision,
+            strict=strict,
+            **kwargs,
+        )
+
+    @staticmethod
+    def _build_model_config(config: Evo1Config) -> dict:
+        return {
+            "device": config.device,
+            "return_cls_only": config.return_cls_only,
+            "vlm_name": config.vlm_model_name,
+            "vlm_num_layers": config.vlm_num_layers,
+            "vlm_dtype": config.vlm_dtype,
+            "use_flash_attn": config.use_flash_attn,
+            "action_head": config.action_head,
+            "action_horizon": config.chunk_size,
+            "per_action_dim": config.max_action_dim,
+            "state_dim": config.max_state_dim,
+            "embed_dim": config.embed_dim,
+            "hidden_dim": config.hidden_dim,
+            "state_hidden_dim": config.state_hidden_dim,
+            "num_heads": config.num_heads,
+            "num_layers": config.num_layers,
+            "dropout": config.dropout,
+            "num_inference_timesteps": config.num_inference_timesteps,
+            "num_categories": config.num_categories,
+            "enable_gradient_checkpointing": config.enable_gradient_checkpointing,
+            "gradient_checkpointing_use_reentrant": config.gradient_checkpointing_use_reentrant,
+            "finetune_vlm": config.finetune_vlm,
+            "finetune_language_model": config.finetune_language_model,
+            "finetune_vision_model": config.finetune_vision_model,
+            "finetune_action_head": config.finetune_action_head,
+        }
+
+    @property
+    def _camera_keys(self) -> list[str]:
+        return list(self.config.image_features)
+
+    @property
+    def _env_action_dim(self) -> int:
+        action_feature = self.config.action_feature
+        if action_feature is None:
+            return self.config.max_action_dim
+        return int(action_feature.shape[0])
+
+    @property
+    def _compute_dtype(self) -> torch.dtype:
+        return next(self.model.action_head.parameters()).dtype
+
+    @property
+    def _training_compute_dtype(self) -> torch.dtype:
+        if str(self.config.device).startswith("cuda"):
+            return torch.bfloat16
+        return self._compute_dtype
+
+    @property
+    def _inference_compute_dtype(self) -> torch.dtype:
+        if str(self.config.device).startswith("cuda") and self.config.use_amp:
+            return torch.bfloat16
+        return self._compute_dtype
+
+    def get_optim_params(self) -> list[dict]:
+        decay, no_decay = [], []
+        for name, param in self.named_parameters():
+            if not param.requires_grad:
+                continue
+            is_bias = name.endswith("bias") or ".bias" in name
+            is_norm = param.dim() == 1 or "norm" in name.lower()
+            if is_bias or is_norm:
+                no_decay.append(param)
+            else:
+                decay.append(param)
+        return [
+            {"params": decay, "weight_decay": self.config.optimizer_weight_decay},
+            {"params": no_decay, "weight_decay": 0.0},
+        ]
+
+    def reset(self):
+        self._action_queue = deque([], maxlen=self.config.n_action_steps)
+
+    def _normalize_task_batch(self, batch: dict[str, Tensor | list[str] | str]) -> list[str]:
+        prompts = batch.get(self.config.task_field)
+        if prompts is None and self.config.task_field != "task":
+            prompts = batch.get("task")
+        if prompts is None:
+            raise ValueError(f"EVO1 expects a '{self.config.task_field}' text field in the batch.")
+        if isinstance(prompts, str):
+            return [prompts]
+        if isinstance(prompts, (list, tuple)):
+            return [str(prompt) for prompt in prompts]
+        raise TypeError(f"Unsupported prompt batch type: {type(prompts)}")
+
+    def _prepare_state(self, batch: dict[str, Tensor]) -> tuple[Tensor, Tensor]:
+        if OBS_STATE not in batch:
+            raise ValueError(f"EVO1 requires '{OBS_STATE}' in the batch.")
+        state = batch[OBS_STATE]
+        if state.dim() == 1:
+            state = state.unsqueeze(0)
+        elif state.dim() == 3:
+            state = state[:, -1]
+        elif state.dim() != 2:
+            raise ValueError(f"Unsupported state tensor shape for EVO1: {tuple(state.shape)}")
+        batch_size, state_dim = state.shape
+        if state_dim > self.config.max_state_dim:
+            raise ValueError(
+                f"State dim {state_dim} exceeds configured max_state_dim {self.config.max_state_dim}"
+            )
+        explicit_mask = batch.get("state_mask")
+        if explicit_mask is not None:
+            if explicit_mask.dim() == 1:
+                explicit_mask = explicit_mask.unsqueeze(0)
+            elif explicit_mask.dim() == 3:
+                explicit_mask = explicit_mask[:, -1]
+            elif explicit_mask.dim() != 2:
+                raise ValueError(
+                    f"Unsupported state_mask tensor shape for EVO1: {tuple(explicit_mask.shape)}"
+                )
+            if explicit_mask.shape != (batch_size, state_dim):
+                raise ValueError(
+                    f"state_mask shape {tuple(explicit_mask.shape)} does not match state shape {(batch_size, state_dim)}"
+                )
+        padded = torch.zeros(
+            batch_size,
+            self.config.max_state_dim,
+            dtype=state.dtype,
+            device=self.config.device,
+        )
+        padded[:, :state_dim] = state.to(device=self.config.device)
+        mask = torch.zeros(
+            batch_size,
+            self.config.max_state_dim,
+            dtype=torch.bool,
+            device=self.config.device,
+        )
+        if explicit_mask is None:
+            mask[:, :state_dim] = True
+        else:
+            mask[:, :state_dim] = explicit_mask.to(device=self.config.device, dtype=torch.bool)
+        return padded.to(dtype=self._compute_dtype), mask
+
+    def _prepare_actions(self, batch: dict[str, Tensor]) -> tuple[Tensor, Tensor]:
+        if ACTION not in batch:
+            raise ValueError(f"EVO1 requires '{ACTION}' in the batch for training.")
+        action = batch[ACTION]
+        if action.dim() == 2:
+            action = action.unsqueeze(1)
+        batch_size, horizon, action_dim = action.shape
+        if horizon != self.config.chunk_size:
+            raise ValueError(
+                f"EVO1 expects chunk_size={self.config.chunk_size}, got action horizon {horizon}"
+            )
+        if action_dim > self.config.max_action_dim:
+            raise ValueError(
+                f"Action dim {action_dim} exceeds configured max_action_dim {self.config.max_action_dim}"
+            )
+        explicit_mask = batch.get("action_mask")
+        if explicit_mask is not None:
+            if explicit_mask.dim() == 2:
+                if horizon == 1:
+                    explicit_mask = explicit_mask.unsqueeze(1)
+                else:
+                    raise ValueError(
+                        f"2D action_mask is only supported when chunk_size=1, got action horizon {horizon}"
+                    )
+            elif explicit_mask.dim() != 3:
+                raise ValueError(
+                    f"Unsupported action_mask tensor shape for EVO1: {tuple(explicit_mask.shape)}"
+                )
+            if explicit_mask.shape != (batch_size, horizon, action_dim):
+                raise ValueError(
+                    "action_mask shape "
+                    f"{tuple(explicit_mask.shape)} does not match action shape {(batch_size, horizon, action_dim)}"
+                )
+        padded = torch.zeros(
+            batch_size,
+            horizon,
+            self.config.max_action_dim,
+            dtype=action.dtype,
+            device=self.config.device,
+        )
+        padded[:, :, :action_dim] = action.to(device=self.config.device)
+        mask = torch.zeros(
+            batch_size,
+            horizon,
+            self.config.max_action_dim,
+            dtype=torch.bool,
+            device=self.config.device,
+        )
+        if explicit_mask is None:
+            mask[:, :, :action_dim] = True
+        else:
+            mask[:, :, :action_dim] = explicit_mask.to(device=self.config.device, dtype=torch.bool)
+        return padded.to(dtype=self._compute_dtype), mask
+
+    def _prepare_inference_action_mask(self, batch_size: int) -> Tensor:
+        mask = torch.zeros(
+            batch_size,
+            self.config.max_action_dim,
+            dtype=torch.bool,
+            device=self.config.device,
+        )
+        mask[:, : self._env_action_dim] = True
+        return mask
+
+    def _get_embodiment_ids(self, batch: dict[str, Tensor], batch_size: int) -> Tensor:
+        embodiment_ids = batch.get("embodiment_id")
+        if embodiment_ids is None and self.config.embodiment_id_field:
+            embodiment_ids = batch.get(self.config.embodiment_id_field)
+        if embodiment_ids is None:
+            return torch.full(
+                (batch_size,),
+                self.config.default_embodiment_id,
+                dtype=torch.long,
+                device=self.config.device,
+            )
+        if embodiment_ids.dim() == 0:
+            embodiment_ids = embodiment_ids.unsqueeze(0)
+        elif embodiment_ids.dim() > 1:
+            embodiment_ids = embodiment_ids[:, -1]
+        return embodiment_ids.to(device=self.config.device, dtype=torch.long)
+
+    @property
+    def _tracks_vlm_gradients(self) -> bool:
+        return bool(
+            self.config.finetune_vlm
+            or self.config.finetune_language_model
+            or self.config.finetune_vision_model
+        )
+
+    def _collect_image_batches(self, batch: dict[str, Tensor]) -> tuple[list[list[Tensor]], Tensor]:
+        camera_keys = self._camera_keys or sorted(key for key in batch if key.startswith(f"{OBS_IMAGES}."))
+        if not camera_keys:
+            raise ValueError("EVO1 requires at least one visual observation feature.")
+
+        # Normalize each camera tensor to (B, C, H, W) up-front so that batch_size is read
+        # from a real batch dim and not from C in the unbatched (C, H, W) case.
+        normalized: dict[str, Tensor] = {}
+        for camera_key in camera_keys[: self.config.max_views]:
+            image = batch[camera_key]
+            if image.dim() == 3:
+                image = image.unsqueeze(0)
+            elif image.dim() == 5:
+                image = image[:, -1]
+            elif image.dim() != 4:
+                raise ValueError(
+                    f"Unsupported image tensor shape for EVO1: key={camera_key} shape={tuple(image.shape)}"
+                )
+            normalized[camera_key] = image
+
+        batch_size = normalized[camera_keys[0]].shape[0]
+        image_batches: list[list[Tensor]] = []
+        image_masks = torch.zeros(batch_size, self.config.max_views, dtype=torch.bool)
+
+        for batch_index in range(batch_size):
+            sample_images: list[Tensor] = []
+            for camera_key in camera_keys[: self.config.max_views]:
+                sample_images.append(normalized[camera_key][batch_index].detach().cpu())
+            if not sample_images:
+                raise ValueError("EVO1 received a batch without any image tensor.")
+            while len(sample_images) < self.config.max_views:
+                sample_images.append(torch.zeros_like(sample_images[0]))
+            image_batches.append(sample_images[: self.config.max_views])
+            image_masks[batch_index, : min(len(camera_keys), self.config.max_views)] = True
+
+        return image_batches, image_masks
+
+    def _compute_fused_tokens(
+        self,
+        prompts: list[str],
+        image_batches: list[list[Tensor]],
+        image_masks: Tensor,
+    ) -> Tensor:
+        track_vlm_gradients = self._tracks_vlm_gradients
+        grad_context = nullcontext() if track_vlm_gradients else torch.no_grad()
+        embedder = getattr(self.model, "embedder", None)
+        embedder_was_training = embedder.training if embedder is not None else None
+
+        if not track_vlm_gradients and embedder is not None:
+            embedder.eval()
+
+        try:
+            with grad_context:
+                fused_tokens = self.model.get_vl_embeddings(
+                    images=image_batches,
+                    image_mask=image_masks,
+                    prompt=prompts,
+                    return_cls_only=self.config.return_cls_only,
+                )
+        finally:
+            if not track_vlm_gradients and embedder is not None and embedder_was_training is not None:
+                embedder.train(embedder_was_training)
+
+        if not track_vlm_gradients:
+            fused_tokens = fused_tokens.detach()
+        return fused_tokens.to(device=self.config.device, dtype=self._compute_dtype)
+
+    def _compute_masked_loss(
+        self,
+        pred_velocity: Tensor,
+        target_velocity: Tensor,
+        action_mask: Tensor,
+        reduction: str,
+    ) -> Tensor:
+        flat_mask = action_mask.view(action_mask.shape[0], -1).to(dtype=pred_velocity.dtype)
+        sq_error = ((pred_velocity - target_velocity) * flat_mask).pow(2)
+        active = flat_mask.sum(dim=1).clamp_min(1.0)
+        per_sample_loss = sq_error.sum(dim=1) / active
+        if reduction == "none":
+            return per_sample_loss
+        if reduction != "mean":
+            raise ValueError(f"Unsupported reduction '{reduction}'")
+        return sq_error.sum() / active.sum()
+
+    def forward(self, batch: dict[str, Tensor], reduction: str = "mean") -> tuple[Tensor, dict]:
+        prompts = self._normalize_task_batch(batch)
+        image_batches, image_masks = self._collect_image_batches(batch)
+        states, _state_mask = self._prepare_state(batch)
+        actions_gt, action_mask = self._prepare_actions(batch)
+        fused_tokens = self._compute_fused_tokens(prompts, image_batches, image_masks)
+        states = states.to(dtype=self._training_compute_dtype)
+        actions_gt = actions_gt.to(dtype=self._training_compute_dtype)
+        fused_tokens = fused_tokens.to(dtype=self._training_compute_dtype)
+        embodiment_ids = self._get_embodiment_ids(batch, states.shape[0])
+
+        pred_velocity, noise = self.model(
+            fused_tokens,
+            state=states,
+            actions_gt=actions_gt,
+            action_mask=action_mask.to(device=self.config.device, dtype=self._compute_dtype),
+            embodiment_ids=embodiment_ids,
+        )
+        flat_action_mask = action_mask.view(action_mask.shape[0], -1).to(dtype=actions_gt.dtype)
+        target_velocity = (actions_gt - noise).view(actions_gt.shape[0], -1) * flat_action_mask
+        loss = self._compute_masked_loss(pred_velocity, target_velocity, action_mask, reduction)
+        loss_mean = loss.mean().item() if loss.ndim > 0 else loss.item()
+        return loss, {
+            "loss": loss_mean,
+            "active_action_dims": float(action_mask.sum(dim=(1, 2)).float().mean().item()),
+        }
+
+    @torch.no_grad()
+    def predict_action_chunk(self, batch: dict[str, Tensor], **kwargs) -> Tensor:
+        self.eval()
+
+        prompts = self._normalize_task_batch(batch)
+        image_batches, image_masks = self._collect_image_batches(batch)
+        states, _state_mask = self._prepare_state(batch)
+        fused_tokens = self._compute_fused_tokens(prompts, image_batches, image_masks)
+        states = states.to(dtype=self._inference_compute_dtype)
+        fused_tokens = fused_tokens.to(dtype=self._inference_compute_dtype)
+        embodiment_ids = self._get_embodiment_ids(batch, states.shape[0])
+        action_mask = self._prepare_inference_action_mask(states.shape[0])
+
+        with (
+            torch.autocast(device_type="cuda", dtype=torch.bfloat16)
+            if self.config.use_amp and str(self.config.device).startswith("cuda")
+            else nullcontext()
+        ):
+            actions = self.model(
+                fused_tokens,
+                state=states,
+                action_mask=action_mask,
+                embodiment_ids=embodiment_ids,
+            )
+        actions = actions.view(states.shape[0], self.config.chunk_size, self.config.max_action_dim)
+        return actions[:, :, : self._env_action_dim]
+
+    @torch.no_grad()
+    def select_action(self, batch: dict[str, Tensor], **kwargs) -> Tensor:
+        self.eval()
+        if len(self._action_queue) == 0:
+            action_chunk = self.predict_action_chunk(batch)[:, : self.config.n_action_steps]
+            self._action_queue.extend(action_chunk.transpose(0, 1))
+        return self._action_queue.popleft()
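The chunked execution in `select_action` above can be illustrated without PyTorch. This is a minimal sketch, not the policy's implementation: `fake_predict_chunk` and `ChunkedActionQueue` are hypothetical names, and nested lists shaped `[batch][time][dim]` stand in for tensors. It shows the same idea: predict a chunk only when the queue is empty, keep the first `n_action_steps` steps, transpose so each queue entry holds one time step for the whole batch, then pop one step per call.

```python
from collections import deque


def fake_predict_chunk(batch_size: int, chunk_size: int, action_dim: int) -> list:
    """Hypothetical stand-in for predict_action_chunk; returns lists shaped [B][T][D]."""
    return [
        [[100 * b + 10 * t + d for d in range(action_dim)] for t in range(chunk_size)]
        for b in range(batch_size)
    ]


class ChunkedActionQueue:
    """Illustrative queue that refills from a predicted chunk only when empty."""

    def __init__(self, n_action_steps: int) -> None:
        self.n_action_steps = n_action_steps
        self.queue: deque = deque()
        self.predict_calls = 0

    def select(self, predict) -> list:
        if len(self.queue) == 0:
            chunk = predict()  # [B][T][D]
            self.predict_calls += 1
            # Keep only the first n_action_steps time steps of every sample.
            truncated = [sample[: self.n_action_steps] for sample in chunk]
            # Transpose [B][T][D] -> [T][B][D] so each entry is one step for the batch.
            self.queue.extend(list(step) for step in zip(*truncated))
        return self.queue.popleft()


queue = ChunkedActionQueue(n_action_steps=2)
predict = lambda: fake_predict_chunk(batch_size=2, chunk_size=3, action_dim=1)
first = queue.select(predict)   # triggers a prediction
second = queue.select(predict)  # served from the queue
third = queue.select(predict)   # queue drained, triggers a second prediction
```

With `n_action_steps=2` and a chunk of three steps, the third predicted step is discarded and a fresh chunk is requested on every third call.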
@@ -0,0 +1,106 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from typing import Any
+
+import torch
+
+from lerobot.policies.evo1.configuration_evo1 import Evo1Config
+from lerobot.processor import (
+    AddBatchDimensionProcessorStep,
+    DeviceProcessorStep,
+    NormalizerProcessorStep,
+    PolicyAction,
+    PolicyProcessorPipeline,
+    RenameObservationsProcessorStep,
+    UnnormalizerProcessorStep,
+)
+from lerobot.processor.converters import (
+    batch_to_transition,
+    create_transition,
+    policy_action_to_transition,
+    transition_to_policy_action,
+)
+from lerobot.utils.constants import (
+    ACTION,
+    DONE,
+    INFO,
+    OBS_PREFIX,
+    POLICY_POSTPROCESSOR_DEFAULT_NAME,
+    POLICY_PREPROCESSOR_DEFAULT_NAME,
+    REWARD,
+    TRUNCATED,
+)
+
+
+def evo1_batch_to_transition(batch: dict[str, Any]):
+    transition = batch_to_transition(batch)
+    complementary_data = dict(transition.get("complementary_data") or {})
+    reserved = {ACTION, REWARD, DONE, TRUNCATED, INFO}
+    for key, value in batch.items():
+        if key in reserved or key.startswith(OBS_PREFIX):
+            continue
+        complementary_data.setdefault(key, value)
+    return create_transition(
+        observation=transition.get("observation"),
+        action=transition.get("action"),
+        reward=transition.get("reward", 0.0),
+        done=transition.get("done", False),
+        truncated=transition.get("truncated", False),
+        info=transition.get("info", {}),
+        complementary_data=complementary_data,
+    )
+
+
+def make_evo1_pre_post_processors(
+    config: Evo1Config,
+    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
+) -> tuple[
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    input_steps = [
+        RenameObservationsProcessorStep(rename_map={}),
+        AddBatchDimensionProcessorStep(),
+        NormalizerProcessorStep(
+            features={**config.input_features, **config.output_features},
+            norm_map=config.normalization_mapping,
+            stats=dataset_stats,
+        ),
+        DeviceProcessorStep(device=config.device),
+    ]
+    output_steps = [
+        UnnormalizerProcessorStep(
+            features=config.output_features,
+            norm_map=config.normalization_mapping,
+            stats=dataset_stats,
+        ),
+        DeviceProcessorStep(device="cpu"),
+    ]
+
+    return (
+        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+            steps=input_steps,
+            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+            to_transition=evo1_batch_to_transition,
+        ),
+        PolicyProcessorPipeline[PolicyAction, PolicyAction](
+            steps=output_steps,
+            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+            to_transition=policy_action_to_transition,
+            to_output=transition_to_policy_action,
+        ),
+    )
@@ -46,6 +46,8 @@ from lerobot.utils.feature_utils import dataset_to_policy_features

 from .act.configuration_act import ACTConfig
 from .diffusion.configuration_diffusion import DiffusionConfig
+from .eo1.configuration_eo1 import EO1Config
+from .evo1.configuration_evo1 import Evo1Config
 from .groot.configuration_groot import GrootConfig
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config
@@ -87,7 +89,7 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:

    Args:
        name: The name of the policy. Supported names are "tdmpc", "diffusion", "act",
-            "multi_task_dit", "vqbet", "pi0", "pi05", "sac", "smolvla", "wall_x".
+            "multi_task_dit", "vqbet", "pi0", "pi05", "sac", "smolvla", "wall_x", "eo1", "evo1".
    Returns:
        The policy class corresponding to the given name.

@@ -146,6 +148,14 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
        from .wall_x.modeling_wall_x import WallXPolicy

        return WallXPolicy
+    elif name == "eo1":
+        from .eo1.modeling_eo1 import EO1Policy
+
+        return EO1Policy
+    elif name == "evo1":
+        from .evo1.modeling_evo1 import EVO1Policy
+
+        return EVO1Policy
    else:
        try:
            return _get_policy_cls_from_policy_name(name=name)
@@ -163,7 +173,7 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
    Args:
        policy_type: The type of the policy. Supported types include "tdmpc",
                     "multi_task_dit", "diffusion", "act", "vqbet", "pi0", "pi05", "sac",
-                     "smolvla", "wall_x".
+                     "smolvla", "wall_x", "eo1", "evo1".
        **kwargs: Keyword arguments to be passed to the configuration class constructor.

    Returns:
@@ -196,6 +206,10 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
        return XVLAConfig(**kwargs)
    elif policy_type == "wall_x":
        return WallXConfig(**kwargs)
+    elif policy_type == "eo1":
+        return EO1Config(**kwargs)
+    elif policy_type == "evo1":
+        return Evo1Config(**kwargs)
    else:
        try:
            config_cls = PreTrainedConfig.get_choice_class(policy_type)
@@ -399,6 +413,20 @@ def make_pre_post_processors(
            config=policy_cfg,
            dataset_stats=kwargs.get("dataset_stats"),
        )
+    elif isinstance(policy_cfg, EO1Config):
+        from .eo1.processor_eo1 import make_eo1_pre_post_processors
+
+        processors = make_eo1_pre_post_processors(
+            config=policy_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+        )
+    elif isinstance(policy_cfg, Evo1Config):
+        from .evo1.processor_evo1 import make_evo1_pre_post_processors
+
+        processors = make_evo1_pre_post_processors(
+            config=policy_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+        )

    else:
        try:
@@ -514,7 +542,7 @@ def make_policy(

        logging.info("Loading policy's PEFT adapter.")

-        peft_pretrained_path = cfg.pretrained_path
+        peft_pretrained_path = str(cfg.pretrained_path)
        peft_config = PeftConfig.from_pretrained(peft_pretrained_path)

        kwargs["pretrained_name_or_path"] = peft_config.base_model_name_or_path
@@ -527,7 +555,9 @@ def make_policy(
            )

        policy = policy_cls.from_pretrained(**kwargs)
-        policy = PeftModel.from_pretrained(policy, peft_pretrained_path, config=peft_config)
+        policy = PeftModel.from_pretrained(
+            policy, peft_pretrained_path, config=peft_config, is_trainable=True
+        )

    else:
        # Make a fresh policy.
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from dataclasses import dataclass, field
+from dataclasses import field
 from typing import TYPE_CHECKING

 import torch
@@ -109,7 +109,6 @@ class MultiEmbodimentActionEncoder(nn.Module):
        return x


-@dataclass
 class FlowmatchingActionHeadConfig(PretrainedConfig):
    """NOTE: N1.5 uses XEmbFlowmatchingPolicyHeadConfig as action head"""

@@ -444,13 +444,13 @@ class PaliGemmaWithExpertModel(
        if image.dtype != torch.float32:
            image = image.to(torch.float32)
        image_outputs = self.paligemma.model.get_image_features(image)
-        features = image_outputs.pooler_output * self.paligemma.config.text_config.hidden_size**0.5
+        features = image_outputs.pooler_output
        if features.dtype != out_dtype:
            features = features.to(out_dtype)
        return features

    def embed_language_tokens(self, tokens: torch.Tensor):
-        return self.paligemma.model.language_model.embed_tokens(tokens)
+        return self.paligemma.model.language_model.get_input_embeddings()(tokens)

    def forward(
        self,
@@ -666,8 +666,7 @@ class PI0Pytorch(nn.Module):  # see openpi `PI0Pytorch`
        # Process language tokens
        def lang_embed_func(lang_tokens):
            lang_emb = self.paligemma_with_expert.embed_language_tokens(lang_tokens)
-            lang_emb_dim = lang_emb.shape[-1]
-            return lang_emb * math.sqrt(lang_emb_dim)
+            return lang_emb

        lang_emb = self._apply_checkpoint(lang_embed_func, lang_tokens)
        embs.append(lang_emb)
@@ -748,16 +747,8 @@ class PI0Pytorch(nn.Module):  # see openpi `PI0Pytorch`

        return embs, pad_masks, att_masks, adarms_cond

-    def forward(
-        self, images, img_masks, lang_tokens, lang_masks, state, actions, noise=None, time=None
-    ) -> Tensor:
+    def forward(self, images, img_masks, lang_tokens, lang_masks, state, actions, noise, time) -> Tensor:
        """Do a full training forward pass and compute the loss."""
-        if noise is None:
-            noise = self.sample_noise(actions.shape, actions.device)
-
-        if time is None:
-            time = self.sample_time(actions.shape[0], actions.device)
-
        time_expanded = time[:, None, None]
        x_t = time_expanded * noise + (1 - time_expanded) * actions
        u_t = noise - actions
@@ -1292,8 +1283,11 @@ class PI0Policy(PreTrainedPolicy):
        state = self.prepare_state(batch)
        actions = self.prepare_action(batch)

+        noise = self.model.sample_noise(actions.shape, actions.device)
+        time = self.model.sample_time(actions.shape[0], actions.device)
+
        # Compute loss
-        losses = self.model.forward(images, img_masks, lang_tokens, lang_masks, state, actions)
+        losses = self.model.forward(images, img_masks, lang_tokens, lang_masks, state, actions, noise, time)

        # Truncate losses to actual action dimensions
        original_action_dim = self.config.output_features[ACTION].shape[0]
@@ -728,14 +728,8 @@ class PI05Pytorch(nn.Module):  # see openpi `PI0Pytorch`

        return embs, pad_masks, att_masks, adarms_cond

-    def forward(self, images, img_masks, tokens, masks, actions, noise=None, time=None) -> Tensor:
+    def forward(self, images, img_masks, tokens, masks, actions, noise, time) -> Tensor:
        """Do a full training forward pass and compute the loss."""
-        if noise is None:
-            noise = self.sample_noise(actions.shape, actions.device)
-
-        if time is None:
-            time = self.sample_time(actions.shape[0], actions.device)
-
        time_expanded = time[:, None, None]
        x_t = time_expanded * noise + (1 - time_expanded) * actions
        u_t = noise - actions
@@ -1262,8 +1256,11 @@ class PI05Policy(PreTrainedPolicy):

        actions = self.prepare_action(batch)

+        noise = self.model.sample_noise(actions.shape, actions.device)
+        time = self.model.sample_time(actions.shape[0], actions.device)
+
        # Compute loss (no separate state needed for PI05)
-        losses = self.model.forward(images, img_masks, tokens, masks, actions)
+        losses = self.model.forward(images, img_masks, tokens, masks, actions, noise, time)

        # Truncate losses to actual action dimensions
        original_action_dim = self.config.output_features[ACTION].shape[0]
@@ -16,7 +16,6 @@

 import builtins
 import logging
-import math
 from collections import deque
 from pathlib import Path
 from typing import TYPE_CHECKING, Literal, TypedDict, Unpack
@@ -261,13 +260,15 @@ class PI0FastPaliGemma(nn.Module):
        if image.dtype != torch.float32:
            image = image.to(torch.float32)
        image_outputs = self.paligemma.model.get_image_features(image)
-        features = image_outputs.pooler_output * self.paligemma.config.text_config.hidden_size**0.5
+        features = image_outputs.pooler_output
+        norm = 2048**0.5
+        features = features / norm * norm
        if features.dtype != out_dtype:
            features = features.to(out_dtype)
        return features

    def embed_language_tokens(self, tokens: torch.Tensor):
-        return self.paligemma.model.language_model.embed_tokens(tokens)
+        return self.paligemma.model.language_model.get_input_embeddings()(tokens)

    def forward(
        self,
@@ -417,8 +418,7 @@ class PI0FastPytorch(nn.Module):  # see openpi `PI0Pytorch`
        # Process language instruction tokens
        def lang_embed_func(tokens):
            lang_emb = self.paligemma_with_expert.embed_language_tokens(tokens)
-            lang_emb_dim = lang_emb.shape[-1]
-            return lang_emb * math.sqrt(lang_emb_dim)
+            return lang_emb

        lang_emb = self._apply_checkpoint(lang_embed_func, tokens)
        embs.append(lang_emb)
@@ -432,8 +432,7 @@ class PI0FastPytorch(nn.Module):  # see openpi `PI0Pytorch`

            def fast_action_embed_func(fast_action_tokens):
                fast_emb = self.paligemma_with_expert.embed_language_tokens(fast_action_tokens)
-                fast_emb_dim = fast_emb.shape[-1]
-                return fast_emb * math.sqrt(fast_emb_dim)
+                return fast_emb

            fast_action_emb = self._apply_checkpoint(fast_action_embed_func, fast_action_tokens)
            embs.append(fast_action_emb)
@@ -666,7 +665,6 @@ class PI0FastPytorch(nn.Module):  # see openpi `PI0Pytorch`
            if t < max_decoding_steps - 1:
                # embed the newly generated token
                next_token_emb = self.paligemma_with_expert.embed_language_tokens(next_token)
-                next_token_emb = next_token_emb * math.sqrt(next_token_emb.shape[-1])
                if prefix_embs.dtype == torch.bfloat16:
                    next_token_emb = next_token_emb.to(dtype=torch.bfloat16)

@@ -771,7 +769,6 @@ class PI0FastPytorch(nn.Module):  # see openpi `PI0Pytorch`
            # Embed the single previous token
            # We use embed_language_tokens directly to avoid overhead of full prefix embedding
            next_token_emb = self.paligemma_with_expert.embed_language_tokens(next_token)
-            next_token_emb = next_token_emb * math.sqrt(next_token_emb.shape[-1])
            if prefix_embs.dtype == torch.bfloat16:
                next_token_emb = next_token_emb.to(dtype=torch.bfloat16)

@@ -97,8 +97,8 @@ class VQBeTConfig(PreTrainedConfig):
    vision_backbone: str = "resnet18"
    crop_shape: tuple[int, int] | None = (84, 84)
    crop_is_random: bool = True
-    pretrained_backbone_weights: str | None = None
-    use_group_norm: bool = True
+    pretrained_backbone_weights: str | None = "ResNet18_Weights.IMAGENET1K_V1"
+    use_group_norm: bool = False
    spatial_softmax_num_keypoints: int = 32
    # VQ-VAE
    n_vqvae_training_steps: int = 20000
@@ -22,7 +22,7 @@ from transformers.utils import (
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    is_flash_attn_2_available,
-    is_flash_attn_greater_or_equal_2_10,
+    is_flash_attn_greater_or_equal,
    is_torchdynamo_compiling,
    logging,
    replace_return_docstrings,
@@ -890,7 +890,7 @@ class Qwen2_5_VLFlashAttention2(Qwen2_5_VLAttention):
        # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
        # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
        # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
-        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal("2.1.0")

    def forward(
        self,
@@ -939,7 +939,7 @@ class Qwen2_5_VLFlashAttention2(Qwen2_5_VLAttention):
        input_dtype = query_states.dtype
        if input_dtype == torch.float32:
            if torch.is_autocast_enabled():
-                target_dtype = torch.get_autocast_gpu_dtype()
+                target_dtype = torch.get_autocast_dtype(query_states.device.type)
            # Handle the case where the model is quantized
            elif hasattr(self.config, "_pre_quantization_dtype"):
                target_dtype = self.config._pre_quantization_dtype
@@ -45,7 +45,7 @@ from transformers.utils import (
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    is_flash_attn_2_available,
-    is_flash_attn_greater_or_equal_2_10,
+    is_flash_attn_greater_or_equal,
    logging,
    replace_return_docstrings,
 )
@@ -909,7 +909,7 @@ class Florence2FlashAttention2(Florence2Attention):
        # TODO: Should be removed once Flash Attention for RoCm is bumped to 2.1.
        # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
        # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
-        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
+        self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal("2.1.0")

    def _reshape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim)
@@ -985,7 +985,7 @@ class Florence2FlashAttention2(Florence2Attention):
        input_dtype = query_states.dtype
        if input_dtype == torch.float32:
            if torch.is_autocast_enabled():
-                target_dtype = torch.get_autocast_gpu_dtype()
+                target_dtype = torch.get_autocast_dtype(query_states.device.type)
            # Handle the case where the model is quantized
            elif hasattr(self.config, "_pre_quantization_dtype"):
                target_dtype = self.config._pre_quantization_dtype
@@ -40,7 +40,7 @@ from .converters import (
 )
 from .delta_action_processor import MapDeltaActionToRobotActionStep, MapTensorToDeltaActionDictStep
 from .device_processor import DeviceProcessorStep
-from .env_processor import IsaaclabArenaProcessorStep, LiberoProcessorStep
+from .env_processor import IsaaclabArenaProcessorStep, LiberoActionProcessorStep, LiberoProcessorStep
 from .factory import (
    make_default_processors,
    make_default_robot_action_processor,
@@ -149,6 +149,7 @@ __all__ = [
    "RewardProcessorStep",
    "DataProcessorPipeline",
    "IsaaclabArenaProcessorStep",
+    "LiberoActionProcessorStep",
    "LiberoProcessorStep",
    "TimeLimitProcessorStep",
    "AddBatchDimensionProcessorStep",
@@ -18,9 +18,9 @@ from dataclasses import dataclass
 import torch

 from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
-from lerobot.utils.constants import OBS_IMAGES, OBS_PREFIX, OBS_STATE, OBS_STR
+from lerobot.utils.constants import ACTION, OBS_IMAGES, OBS_PREFIX, OBS_STATE, OBS_STR

-from .pipeline import ObservationProcessorStep, ProcessorStepRegistry
+from .pipeline import ActionProcessorStep, ObservationProcessorStep, ProcessorStepRegistry


 @dataclass
@@ -46,6 +46,8 @@ class LiberoProcessorStep(ObservationProcessorStep):
    -   This accounts for the HuggingFaceVLA/libero camera orientation convention.
    """

+    max_state_dim: int | None = None
+
    def _process_observation(self, observation):
        """
        Processes both image and robot_state observations from LIBERO.
@@ -78,6 +80,15 @@ class LiberoProcessorStep(ObservationProcessorStep):
            state = state.float()
            if state.dim() == 1:
                state = state.unsqueeze(0)
+            if self.max_state_dim is not None:
+                if state.shape[-1] > self.max_state_dim:
+                    raise ValueError(
+                        f"LIBERO state has {state.shape[-1]} dims, which is larger than "
+                        f"configured max_state_dim={self.max_state_dim}."
+                    )
+                if state.shape[-1] < self.max_state_dim:
+                    pad_width = self.max_state_dim - state.shape[-1]
+                    state = torch.nn.functional.pad(state, (0, pad_width))

            processed_obs[OBS_STATE] = state
        return processed_obs
@@ -101,7 +112,7 @@ class LiberoProcessorStep(ObservationProcessorStep):
        # add our new flattened state
        state_feats[OBS_STATE] = PolicyFeature(
            type=FeatureType.STATE,
-            shape=(8,),  # [eef_pos(3), axis_angle(3), gripper(2)]
+            shape=(self.max_state_dim or 8,),  # [eef_pos(3), axis_angle(3), gripper(2)] plus padding
        )

        new_features[FeatureType.STATE] = state_feats
@@ -111,6 +122,9 @@ class LiberoProcessorStep(ObservationProcessorStep):
    def observation(self, observation):
        return self._process_observation(observation)

+    def get_config(self) -> dict:
+        return {"max_state_dim": self.max_state_dim}
+
    def _quat2axisangle(self, quat: torch.Tensor) -> torch.Tensor:
        """
        Convert batched quaternions to axis-angle format.
@@ -153,6 +167,32 @@ class LiberoProcessorStep(ObservationProcessorStep):
        return result


+@dataclass
+@ProcessorStepRegistry.register(name="libero_action_processor")
+class LiberoActionProcessorStep(ActionProcessorStep):
+    """Slices padded policy actions back to the executable LIBERO action space."""
+
+    action_dim: int = 7
+
+    def action(self, action):
+        if action.shape[-1] < self.action_dim:
+            raise ValueError(
+                f"LIBERO action has {action.shape[-1]} dims, which is smaller than action_dim={self.action_dim}."
+            )
+        return action[..., : self.action_dim]
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        new_features = {ft: feats.copy() for ft, feats in features.items()}
+        action_feats = new_features.setdefault(FeatureType.ACTION, {})
+        action_feats[ACTION] = PolicyFeature(type=FeatureType.ACTION, shape=(self.action_dim,))
+        return new_features
+
+    def get_config(self) -> dict:
+        return {"action_dim": self.action_dim}
+
+
 @dataclass
 @ProcessorStepRegistry.register(name="isaaclab_arena_processor")
 class IsaaclabArenaProcessorStep(ObservationProcessorStep):
@@ -54,6 +54,7 @@ class BiOpenArmFollower(Robot):
            calibration_dir=config.calibration_dir,
            port=config.left_arm_config.port,
            disable_torque_on_disconnect=config.left_arm_config.disable_torque_on_disconnect,
+            use_velocity_and_torque=config.left_arm_config.use_velocity_and_torque,
            max_relative_target=config.left_arm_config.max_relative_target,
            cameras=left_cameras,
            side=config.left_arm_config.side,
@@ -72,6 +73,7 @@ class BiOpenArmFollower(Robot):
            calibration_dir=config.calibration_dir,
            port=config.right_arm_config.port,
            disable_torque_on_disconnect=config.right_arm_config.disable_torque_on_disconnect,
+            use_velocity_and_torque=config.right_arm_config.use_velocity_and_torque,
            max_relative_target=config.right_arm_config.max_relative_target,
            cameras=right_cameras,
            side=config.right_arm_config.side,
@@ -46,7 +46,7 @@ class LeKiwiConfig(RobotConfig):
    cameras: dict[str, CameraConfig] = field(default_factory=lekiwi_cameras_config)

    # Set to `True` for backward compatibility with previous policies/dataset
-    use_degrees: bool = False
+    use_degrees: bool = True


 @dataclass
@@ -66,6 +66,10 @@ class OpenArmFollowerConfigBase:
    # Whether to disable torque when disconnecting
    disable_torque_on_disconnect: bool = True

+    # When True, expose `.vel` and `.torque` per motor in observation features.
+    # Default False for compatibility with the position-only openarm_mini teleoperator.
+    use_velocity_and_torque: bool = False
+
    # Safety limit for relative target positions
    # Set to a positive scalar for all motors, or a dict mapping motor names to limits
    max_relative_target: float | dict[str, float] | None = None
@@ -93,8 +93,9 @@ class OpenArmFollower(Robot):
        features: dict[str, type] = {}
        for motor in self.bus.motors:
            features[f"{motor}.pos"] = float
-            features[f"{motor}.vel"] = float  # Add this
-            features[f"{motor}.torque"] = float  # Add this
+            if self.config.use_velocity_and_torque:
+                features[f"{motor}.vel"] = float
+                features[f"{motor}.torque"] = float
        return features

    @property
@@ -235,8 +236,9 @@ class OpenArmFollower(Robot):
        for motor in self.bus.motors:
            state = states.get(motor, {})
            obs_dict[f"{motor}.pos"] = state.get("position", 0.0)
-            obs_dict[f"{motor}.vel"] = state.get("velocity", 0.0)
-            obs_dict[f"{motor}.torque"] = state.get("torque", 0.0)
+            if self.config.use_velocity_and_torque:
+                obs_dict[f"{motor}.vel"] = state.get("velocity", 0.0)
+                obs_dict[f"{motor}.torque"] = state.get("torque", 0.0)

        # Capture images from cameras
        for cam_key, cam in self.cameras.items():
@@ -68,16 +68,9 @@ class SOFollower(Robot):

    @property
    def _cameras_ft(self) -> dict[str, tuple]:
-        features: dict[str, tuple] = {}
-        for cam in self.cameras:
-            cam_cfg = self.config.cameras[cam]
-            features[cam] = (cam_cfg.height, cam_cfg.width, 3)
-            # Cameras with a depth stream (e.g. RealSense with use_depth=True) also
-            # emit a 2D depth feature; hw_to_dataset_features routes 2D shapes to
-            # ``observation.depth.<bare>`` with the depth-map marker.
-            if getattr(cam_cfg, "use_depth", False):
-                features[f"{cam}_depth"] = (cam_cfg.height, cam_cfg.width)
-        return features
+        return {
+            cam: (self.config.cameras[cam].height, self.config.cameras[cam].width, 3) for cam in self.cameras
+        }

    @cached_property
    def observation_features(self) -> dict[str, type | tuple]:
@@ -197,14 +190,6 @@ class SOFollower(Robot):
            dt_ms = (time.perf_counter() - start) * 1e3
            logger.debug(f"{self} read {cam_key}: {dt_ms:.1f}ms")

-            # Cameras with a depth stream populate a sibling ``<cam>_depth`` key
-            # (consumed by hw_to_dataset_features / build_dataset_frame).
-            if getattr(self.config.cameras[cam_key], "use_depth", False):
-                start = time.perf_counter()
-                obs_dict[f"{cam_key}_depth"] = cam.read_latest_depth()
-                dt_ms = (time.perf_counter() - start) * 1e3
-                logger.debug(f"{self} read {cam_key} depth: {dt_ms:.1f}ms")
-
        return obs_dict

    @check_if_not_connected
@@ -33,12 +33,13 @@ Recording modes:
    ``record_autonomous=False``: Only correction windows are recorded.
        Each correction (start to stop) becomes one episode.

-Teleoperator expectations:
-    The user is responsible for keeping the leader arm aligned with the
-    follower arm at the moment a correction begins.  Programmatic motor
-    handover (``enable_torque`` / ``disable_torque`` / ``write_goal_positions``)
-    is intentionally not invoked here — see the TODO in
-    :func:`DAggerStrategy._apply_transition` for the open design decision.
+Teleoperator handover:
+    On AUTONOMOUS → PAUSED, actuated teleops (those with non-empty
+    ``feedback_features``, e.g. SO-101, OpenArmMini) are smoothly driven to
+    the follower's last position via ``send_feedback`` so the operator takes
+    over without a jerk.  Non-actuated teleops cannot be driven,
+    so on PAUSED → CORRECTING the follower is instead slid to the teleop's
+    current pose before the correction begins.
 """

 from __future__ import annotations
@@ -175,17 +176,27 @@ class DAggerEvents:
 # ---------------------------------------------------------------------------


-# TODO(Steven): re-enable programmatic teleop alignment once we decide whether
-# to enforce motor-control methods on every Teleoperator.  Until then the user
-# is responsible for moving the leader arm to the follower's pose at the moment
-# a correction begins.
-def _teleop_smooth_move_to(
-    teleop: Teleoperator, target_pos: dict, duration_s: float = 2.0, fps: int = 50
-) -> None:
-    """Smoothly move teleop to target position via linear interpolation.
+def _teleop_supports_feedback(teleop: Teleoperator) -> bool:
+    """Return True when the teleop can receive position feedback (is actuated).
+    TODO(Maxime): See if it is possible to unify this interface across teleops instead of duck-typing.
+    """
+    return (
+        bool(teleop.feedback_features)
+        and hasattr(teleop, "disable_torque")
+        and hasattr(teleop, "enable_torque")
+    )

-    Requires the teleoperator to support motor control methods
-    (``enable_torque``, ``write_goal_positions``, ``get_action``).
+
+def _teleop_smooth_move_to(
+    teleop: Teleoperator, target_pos: dict, duration_s: float = 2.0, fps: int = 30
+) -> None:
+    """Smoothly move an actuated teleop to ``target_pos`` via linear interpolation.
+
+    Requires the teleoperator to support feedback
+    (i.e. have non-empty ``feedback_features`` and implement ``disable_torque`` / ``enable_torque``).
+
+    TODO(Maxime): This blocks for up to ``duration_s`` seconds. During that time the
+    follower robot receives no new actions, which could be an issue on LeKiwi.
    """
    teleop.enable_torque()
    current = teleop.get_action()
@@ -193,13 +204,28 @@ def _teleop_smooth_move_to(

    for step in range(steps + 1):
        t = step / steps
-        interp = {}
-        for k in current:
-            if k in target_pos:
-                interp[k] = current[k] * (1 - t) + target_pos[k] * t
-            else:
-                interp[k] = current[k]
-        teleop.write_goal_positions(interp)
+        interp = {
+            k: current[k] * (1 - t) + target_pos[k] * t if k in target_pos else current[k] for k in current
+        }
+        teleop.send_feedback(interp)
+        time.sleep(1 / fps)
+
+
+def _follower_smooth_move_to(
+    robot: ThreadSafeRobot, current: dict, target: dict, duration_s: float = 1.0, fps: int = 30
+) -> None:
+    """Smoothly move the follower robot from ``current`` to ``target`` action.
+
+    Used when the teleop is non-actuated: instead of driving the leader arm
+    to the follower, we bring the follower to the teleop's current pose.
+    Both ``current`` and ``target`` must be in robot-action key space.
+    """
+    steps = max(int(duration_s * fps), 1)
+
+    for step in range(steps + 1):
+        t = step / steps
+        interp = {k: current[k] * (1 - t) + target[k] * t if k in target else current[k] for k in current}
+        robot.send_action(interp)
        time.sleep(1 / fps)
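
Review note: both smooth-move helpers above share the same per-key linear interpolation. The sketch below isolates that logic without any hardware calls. The name `interpolation_frames` is hypothetical and not part of the diff; the step count and blend formula are copied from `_follower_smooth_move_to`.

```python
def interpolation_frames(
    current: dict[str, float],
    target: dict[str, float],
    duration_s: float = 1.0,
    fps: int = 30,
) -> list[dict[str, float]]:
    """Return the sequence of blended dicts the helpers would send, one per tick."""
    # Same step count as in the diff: at least one step, inclusive of both endpoints.
    steps = max(int(duration_s * fps), 1)
    frames = []
    for step in range(steps + 1):
        t = step / steps
        # Keys missing from ``target`` keep their current value unchanged.
        frames.append(
            {k: current[k] * (1 - t) + target[k] * t if k in target else current[k] for k in current}
        )
    return frames


if __name__ == "__main__":
    for frame in interpolation_frames({"a.pos": 0.0, "b.pos": 5.0}, {"a.pos": 10.0}, 1.0, 4):
        print(frame)
```

Because only keys present in both dicts are blended, a target expressed in a different key space leaves every frame equal to `current`, which is the silent no-op described in the TODO inside `_apply_transition`.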


@@ -415,9 +441,6 @@ class DAggerStrategy(RolloutStrategy):
        engine.reset()
        interpolator.reset()
        events.reset()
-        # TODO(Steven): re-enable once Teleoperator motor-control methods are
-        # standardised; until then the user pre-aligns the leader by hand.
-        # teleop.disable_torque()
        engine.resume()

        last_action: dict[str, Any] | None = None
@@ -441,8 +464,16 @@ class DAggerStrategy(RolloutStrategy):
                    transition = events.consume_transition()
                    if transition is not None:
                        old_phase, new_phase = transition
-                        self._apply_transition(old_phase, new_phase, engine, interpolator, robot, teleop)
-                        last_action = None
+                        self._apply_transition(
+                            old_phase,
+                            new_phase,
+                            engine,
+                            interpolator,
+                            ctx,
+                            last_action,
+                        )
+                        if new_phase == DAggerPhase.AUTONOMOUS:
+                            last_action = None

                    phase = events.phase
                    obs = robot.get_observation()
@@ -532,9 +563,6 @@ class DAggerStrategy(RolloutStrategy):
            finally:
                logger.info("DAgger continuous control loop ended — pausing engine")
                engine.pause()
-                # TODO(Steven): re-enable once Teleoperator motor-control methods
-                # are standardised across all teleop implementations.
-                # teleop.disable_torque()
                with contextlib.suppress(Exception):
                    with self._episode_lock:
                        dataset.save_episode()
@@ -570,9 +598,6 @@ class DAggerStrategy(RolloutStrategy):
        engine.reset()
        interpolator.reset()
        events.reset()
-        # TODO(Steven): re-enable once Teleoperator motor-control methods are
-        # standardised; until then the user pre-aligns the leader by hand.
-        # teleop.disable_torque()
        engine.resume()

        last_action: dict[str, Any] | None = None
@@ -600,8 +625,16 @@ class DAggerStrategy(RolloutStrategy):
                    transition = events.consume_transition()
                    if transition is not None:
                        old_phase, new_phase = transition
-                        self._apply_transition(old_phase, new_phase, engine, interpolator, robot, teleop)
-                        last_action = None
+                        self._apply_transition(
+                            old_phase,
+                            new_phase,
+                            engine,
+                            interpolator,
+                            ctx,
+                            last_action,
+                        )
+                        if new_phase == DAggerPhase.AUTONOMOUS:
+                            last_action = None

                        # Correction ended -> save episode (blocking if not streaming)
                        if old_phase == DAggerPhase.CORRECTING and new_phase == DAggerPhase.PAUSED:
@@ -679,9 +712,6 @@ class DAggerStrategy(RolloutStrategy):
            finally:
                logger.info("DAgger corrections-only loop ended — pausing engine")
                engine.pause()
-                # TODO(Steven): re-enable once Teleoperator motor-control methods
-                # are standardised across all teleop implementations.
-                # teleop.disable_torque()
                with contextlib.suppress(Exception):
                    with self._episode_lock:
                        dataset.save_episode()
@@ -698,36 +728,71 @@ class DAggerStrategy(RolloutStrategy):
        new_phase: DAggerPhase,
        engine,
        interpolator,
-        robot: ThreadSafeRobot,
-        teleop: Teleoperator,
+        ctx: RolloutContext,
+        prev_action: dict | None,
    ) -> None:
-        """Execute side-effects for a validated phase transition."""
+        """Execute side-effects for a validated phase transition, including smooth handovers.
+
+        AUTONOMOUS -> PAUSED (actuated teleop):
+            Pause the engine, then drive the leader arm to the follower's last
+            commanded position so the operator takes over without a jerk.
+
+        PAUSED -> CORRECTING (non-actuated teleop):
+            Slide the follower to the teleop's current pose so the robot meets
+            the operator's hand rather than jumping to it on the first frame.
+
+        CORRECTING -> PAUSED (actuated teleop):
+            Re-enable torque so the leader holds its position after the correction.
+            This may be useful if the correction recording is cancelled.
+
+        PAUSED -> AUTONOMOUS:
+            Reset and resume the inference engine.
+        """
+        teleop = ctx.hardware.teleop
+        robot = ctx.hardware.robot_wrapper
+
        logger.info("Phase transition: %s -> %s", old_phase.value, new_phase.value)
        if old_phase == DAggerPhase.AUTONOMOUS and new_phase == DAggerPhase.PAUSED:
-            logger.info("Pausing engine — robot holds position")
+            logger.info("Pausing engine - robot holds position")
            engine.pause()
-            obs = robot.get_observation()
-            _robot_pos = {
-                k: v for k, v in obs.items() if k.endswith(".pos") and k in robot.observation_features
-            }
-            # TODO(Steven): once Teleoperator motor-control methods are
-            # standardised, drive the leader to the follower's pose here so the
-            # operator does not need to pre-align the arm by hand.  Until then
-            # the user is responsible for the alignment.
-            # _teleop_smooth_move_to(teleop, _robot_pos, duration_s=2.0, fps=50)

-        elif new_phase == DAggerPhase.CORRECTING:
-            logger.info("Entering correction mode — human teleop control")
-            # TODO(Steven): re-enable once Teleoperator motor-control methods
-            # are standardised across all teleop implementations.
-            # teleop.disable_torque()
+            if _teleop_supports_feedback(teleop) and prev_action is not None:
+                # TODO(Maxime): prev_action is in robot action key space (output of robot_action_processor).
+                # send_feedback expects teleop feedback key space. For homogeneous setups (e.g. SO-101
+                # leader + SO-101 follower) the keys are identical so this works. If the processor pipeline
+                # does non-trivial key renaming (e.g. a rename_map on action keys), the interpolation in
+                # _teleop_smooth_move_to silently no-ops and the arm doesn't move.
+                logger.info("Smooth handover: moving leader arm to follower position")
+                _teleop_smooth_move_to(teleop, prev_action)
+
+        elif old_phase == DAggerPhase.PAUSED and new_phase == DAggerPhase.CORRECTING:
+            logger.info("Entering correction mode - human teleop control")
+            if not _teleop_supports_feedback(teleop) and prev_action is not None:
+                logger.info("Smooth handover: sliding follower to teleop position")
+                obs = robot.get_observation()
+                teleop_action = teleop.get_action()
+                processed = ctx.processors.teleop_action_processor((teleop_action, obs))
+                target = ctx.processors.robot_action_processor((processed, obs))
+                _follower_smooth_move_to(robot, prev_action, target)
+
+            # unlock the teleop for human control
+            if _teleop_supports_feedback(teleop):
+                teleop.disable_torque()
+
+        elif old_phase == DAggerPhase.CORRECTING and new_phase == DAggerPhase.PAUSED:
+            if _teleop_supports_feedback(teleop):
+                teleop.enable_torque()

        elif new_phase == DAggerPhase.AUTONOMOUS:
-            logger.info("Resuming autonomous mode — resetting engine and interpolator")
+            logger.info("Resuming autonomous mode - resetting engine and interpolator")
            interpolator.reset()
            engine.reset()
            engine.resume()

+            # release teleop before resuming the policy
+            if _teleop_supports_feedback(teleop):
+                teleop.disable_torque()
+
    # ------------------------------------------------------------------
    # Background push (shared by both modes)
    # ------------------------------------------------------------------
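
Review note: the branching in `_apply_transition` can be read as a small decision table. The sketch below restates it as a pure function that returns the planned side-effects as strings, so the table can be checked without an engine, robot, or teleop. `Phase`, `planned_side_effects`, and the effect names are hypothetical stand-ins for illustration, not identifiers from the diff.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical stand-in for DAggerPhase; the real enum values are not shown in the diff.
    AUTONOMOUS = "autonomous"
    PAUSED = "paused"
    CORRECTING = "correcting"


def planned_side_effects(old: Phase, new: Phase, actuated: bool, has_prev_action: bool) -> list[str]:
    """Mirror the if/elif chain of ``_apply_transition`` as a list of effect names."""
    effects: list[str] = []
    if old is Phase.AUTONOMOUS and new is Phase.PAUSED:
        effects.append("pause_engine")
        if actuated and has_prev_action:
            effects.append("move_leader_to_follower")
    elif old is Phase.PAUSED and new is Phase.CORRECTING:
        if not actuated and has_prev_action:
            effects.append("move_follower_to_leader")
        if actuated:
            effects.append("disable_leader_torque")
    elif old is Phase.CORRECTING and new is Phase.PAUSED:
        if actuated:
            effects.append("enable_leader_torque")
    elif new is Phase.AUTONOMOUS:
        effects += ["reset_interpolator", "reset_engine", "resume_engine"]
        if actuated:
            effects.append("disable_leader_torque")
    return effects


if __name__ == "__main__":
    print(planned_side_effects(Phase.AUTONOMOUS, Phase.PAUSED, actuated=True, has_prev_action=True))
```

Both smooth handovers are skipped when no previous action is available. This is presumably why the callers now clear `last_action` only on a transition to AUTONOMOUS.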
@@ -49,14 +49,6 @@ Delete episodes and save to a new dataset at a specific path and with a new repo
        --operation.type delete_episodes \
        --operation.episode_indices "[0, 2, 5]"

-Delete episodes and re-encode video segments with h264:
-    lerobot-edit-dataset \
-        --repo_id lerobot/pusht \
-        --operation.type delete_episodes \
-        --operation.episode_indices "[0, 2, 5]" \
-        --operation.camera_encoder_config.vcodec h264 \
-        --operation.camera_encoder_config.crf 23
-
 Split dataset by fractions (pusht_train, pusht_val):
    lerobot-edit-dataset \
        --repo_id lerobot/pusht \
@@ -82,14 +74,6 @@ Split into more than two splits:
        --operation.type split \
        --operation.splits '{"train": 0.6, "val": 0.2, "test": 0.2}'

-Split dataset and re-encode video segments with h264:
-    lerobot-edit-dataset \
-        --repo_id lerobot/pusht \
-        --operation.type split \
-        --operation.splits '{"train": 0.8, "val": 0.2}' \
-        --operation.camera_encoder_config.vcodec h264 \
-        --operation.camera_encoder_config.crf 23
-
 Merge multiple datasets:
    lerobot-edit-dataset \
        --new_repo_id lerobot/pusht_merged \
@@ -203,7 +187,7 @@ import abc
 import logging
 import shutil
 import sys
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 from pathlib import Path

 import draccus
@@ -211,8 +195,6 @@ import draccus
 from lerobot.configs import parser
 from lerobot.datasets import (
    LeRobotDataset,
-    VideoEncoderConfig,
-    camera_encoder_defaults,
    convert_image_to_video_dataset,
    delete_episodes,
    merge_datasets,
@@ -236,14 +218,12 @@ class OperationConfig(draccus.ChoiceRegistry, abc.ABC):
@dataclass
 class DeleteEpisodesConfig(OperationConfig):
    episode_indices: list[int] | None = None
-    camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)


@OperationConfig.register_subclass("split")
@dataclass
 class SplitConfig(OperationConfig):
    splits: dict[str, float | list[int]] | None = None
-    camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)


@OperationConfig.register_subclass("merge")
@@ -270,7 +250,11 @@ class ModifyTasksConfig(OperationConfig):
@dataclass
 class ConvertImageToVideoConfig(OperationConfig):
    output_dir: str | None = None
-    camera_encoder_config: VideoEncoderConfig = field(default_factory=camera_encoder_defaults)
+    vcodec: str = "libsvtav1"
+    pix_fmt: str = "yuv420p"
+    g: int = 2
+    crf: int = 30
+    fast_decode: int = 0
    episode_indices: list[int] | None = None
    num_workers: int = 4
    max_episodes_per_batch: int | None = None
@@ -372,7 +356,6 @@ def handle_delete_episodes(cfg: EditDatasetConfig) -> None:
        episode_indices=cfg.operation.episode_indices,
        output_dir=output_dir,
        repo_id=output_repo_id,
-        camera_encoder_config=cfg.operation.camera_encoder_config,
    )

    logging.info(f"Dataset saved to {output_dir}")
@@ -404,7 +387,6 @@ def handle_split(cfg: EditDatasetConfig) -> None:
        dataset,
        splits=cfg.operation.splits,
        output_dir=cfg.new_root,
-        camera_encoder_config=cfg.operation.camera_encoder_config,
    )

    for split_name, split_ds in split_datasets.items():
@@ -575,8 +557,11 @@ def handle_convert_image_to_video(cfg: EditDatasetConfig) -> None:
        dataset=dataset,
        output_dir=output_dir,
        repo_id=output_repo_id,
-        camera_encoder_config=getattr(cfg.operation, "camera_encoder_config", None)
-        or camera_encoder_defaults(),
+        vcodec=getattr(cfg.operation, "vcodec", "libsvtav1"),
+        pix_fmt=getattr(cfg.operation, "pix_fmt", "yuv420p"),
+        g=getattr(cfg.operation, "g", 2),
+        crf=getattr(cfg.operation, "crf", 30),
+        fast_decode=getattr(cfg.operation, "fast_decode", 0),
        episode_indices=getattr(cfg.operation, "episode_indices", None),
        num_workers=getattr(cfg.operation, "num_workers", 4),
        max_episodes_per_batch=getattr(cfg.operation, "max_episodes_per_batch", None),
@@ -63,27 +63,6 @@ lerobot-record \\
  --dataset.streaming_encoding=true \\
  --dataset.encoder_threads=2
 ```
-
-Example recording with custom video encoding parameters:
-```shell
-lerobot-record \\
-    --robot.type=so100_follower \\
-    --robot.port=/dev/tty.usbmodem58760431541 \\
-    --robot.cameras="{laptop: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \\
-    --robot.id=black \\
-    --teleop.type=so100_leader \\
-    --teleop.port=/dev/tty.usbmodem58760431551 \\
-    --teleop.id=blue \\
-    --dataset.repo_id=<my_username>/<my_dataset_name> \\
-    --dataset.num_episodes=2 \\
-    --dataset.single_task="Grab the cube" \\
-    --dataset.streaming_encoding=true \\
-    --dataset.encoder_threads=2 \\
-    --dataset.camera_encoder_config.vcodec=h264 \\
-    --dataset.camera_encoder_config.preset=fast \\
-    --dataset.camera_encoder_config.extra_options={"tune": "film", "profile:v": "high", "bf": 2} \\
-    --display_data=true
-```
 """

 import logging
@@ -104,12 +83,10 @@ from lerobot.common.control_utils import (
 from lerobot.configs import parser
 from lerobot.configs.dataset import DatasetRecordConfig
 from lerobot.datasets import (
-    DepthEncoderConfig,
    LeRobotDataset,
    VideoEncodingManager,
    aggregate_pipeline_dataset_features,
    create_initial_features,
-    depth_encoder_defaults,
    safe_stop_image_writer,
 )
 from lerobot.processor import (
@@ -328,10 +305,7 @@ def record_loop(

        if display_data:
            log_rerun_data(
-                observation=obs_processed,
-                action=action_values,
-                compress_images=display_compressed_images,
-                features=dataset.features if dataset is not None else None,
+                observation=obs_processed, action=action_values, compress_images=display_compressed_images
            )

        dt_s = time.perf_counter() - start_loop_t
@@ -403,11 +377,10 @@ def record(
                cfg.dataset.repo_id,
                root=cfg.dataset.root,
                batch_encoding_size=cfg.dataset.video_encoding_batch_size,
-                camera_encoder_config=cfg.dataset.camera_encoder_config,
-                depth_encoder_config=cfg.dataset.depth_encoder_config,
-                encoder_threads=cfg.dataset.encoder_threads,
+                vcodec=cfg.dataset.vcodec,
                streaming_encoding=cfg.dataset.streaming_encoding,
                encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
+                encoder_threads=cfg.dataset.encoder_threads,
                image_writer_processes=cfg.dataset.num_image_writer_processes if num_cameras > 0 else 0,
                image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * num_cameras
                if num_cameras > 0
@@ -433,11 +406,10 @@ def record(
                image_writer_processes=cfg.dataset.num_image_writer_processes,
                image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * len(robot.cameras),
                batch_encoding_size=cfg.dataset.video_encoding_batch_size,
-                camera_encoder_config=cfg.dataset.camera_encoder_config,
-                depth_encoder_config=cfg.dataset.depth_encoder_config,
-                encoder_threads=cfg.dataset.encoder_threads,
+                vcodec=cfg.dataset.vcodec,
                streaming_encoding=cfg.dataset.streaming_encoding,
                encoder_queue_maxsize=cfg.dataset.encoder_queue_maxsize,
+                encoder_threads=cfg.dataset.encoder_threads,
            )

        robot.connect()
@@ -448,7 +420,7 @@ def record(

        if not cfg.dataset.streaming_encoding:
            logging.info(
-                "Streaming encoding is disabled. If you have capable hardware, consider enabling it for way faster episode saving. --dataset.streaming_encoding=true --dataset.encoder_threads=2 # --dataset.camera_encoder_config.vcodec=auto. More info in the documentation: https://huggingface.co/docs/lerobot/streaming_video_encoding"
+                "Streaming encoding is disabled. If you have capable hardware, consider enabling it for way faster episode saving. --dataset.streaming_encoding=true --dataset.encoder_threads=2 # --dataset.vcodec=auto. More info in the documentation: https://huggingface.co/docs/lerobot/streaming_video_encoding"
            )

        with VideoEncodingManager(dataset):
@@ -277,9 +277,14 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    if cfg.peft is not None:
        if cfg.is_reward_model_training:
            raise ValueError("PEFT is only supported for policy training. ")
-        logging.info("Using PEFT! Wrapping model.")
-        peft_cli_overrides = dataclasses.asdict(cfg.peft)
-        policy = policy.wrap_with_peft(peft_cli_overrides=peft_cli_overrides)
+        from peft import PeftModel
+
+        if isinstance(policy, PeftModel):
+            logging.info("PEFT adapter already loaded from checkpoint, skipping wrap_with_peft.")
+        else:
+            logging.info("Using PEFT! Wrapping model.")
+            peft_cli_overrides = dataclasses.asdict(cfg.peft)
+            policy = policy.wrap_with_peft(peft_cli_overrides=peft_cli_overrides)

    # Wait for all processes to finish model creation before continuing
    accelerator.wait_for_everyone()
@@ -49,6 +49,7 @@ class BiOpenArmLeader(Teleoperator):
            can_data_bitrate=config.left_arm_config.can_data_bitrate,
            motor_config=config.left_arm_config.motor_config,
            manual_control=config.left_arm_config.manual_control,
+            use_velocity_and_torque=config.left_arm_config.use_velocity_and_torque,
            position_kd=config.left_arm_config.position_kd,
            position_kp=config.left_arm_config.position_kp,
        )
@@ -63,6 +64,7 @@ class BiOpenArmLeader(Teleoperator):
            can_data_bitrate=config.right_arm_config.can_data_bitrate,
            motor_config=config.right_arm_config.motor_config,
            manual_control=config.right_arm_config.manual_control,
+            use_velocity_and_torque=config.right_arm_config.use_velocity_and_torque,
            position_kd=config.right_arm_config.position_kd,
            position_kp=config.right_arm_config.position_kp,
        )
@@ -60,6 +60,10 @@ class OpenArmLeaderConfigBase:
    # When enabled, motors have torque disabled for manual movement
    manual_control: bool = True

+    # When True, expose `.vel` and `.torque` per motor in action features.
+    # Default False for compatibility with the position-only openarm_mini teleoperator.
+    use_velocity_and_torque: bool = False
+
    # TODO(Steven, Pepijn): Not used ... ?
    # MIT control parameters (used when manual_control=False for torque control)
    # List of 8 values: [joint_1, joint_2, joint_3, joint_4, joint_5, joint_6, joint_7, gripper]
@@ -70,8 +70,9 @@ class OpenArmLeader(Teleoperator):
        features: dict[str, type] = {}
        for motor in self.bus.motors:
            features[f"{motor}.pos"] = float
-            features[f"{motor}.vel"] = float
-            features[f"{motor}.torque"] = float
+            if self.config.use_velocity_and_torque:
+                features[f"{motor}.vel"] = float
+                features[f"{motor}.torque"] = float
        return features

    @property
@@ -201,8 +202,9 @@ class OpenArmLeader(Teleoperator):
        for motor in self.bus.motors:
            state = states.get(motor, {})
            action_dict[f"{motor}.pos"] = state.get("position")
-            action_dict[f"{motor}.vel"] = state.get("velocity")
-            action_dict[f"{motor}.torque"] = state.get("torque")
+            if self.config.use_velocity_and_torque:
+                action_dict[f"{motor}.vel"] = state.get("velocity")
+                action_dict[f"{motor}.torque"] = state.get("torque")

        dt_ms = (time.perf_counter() - start) * 1e3
        logger.debug(f"{self} read state: {dt_ms:.1f}ms")
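
Review note: the two `OpenArmLeader` hunks above gate the `.vel` and `.torque` keys on the new `use_velocity_and_torque` flag. A minimal sketch of the resulting key set follows. `action_feature_keys` is a hypothetical free function written for illustration; in the diff this logic lives inside the class.

```python
def action_feature_keys(motors: list[str], use_velocity_and_torque: bool = False) -> dict[str, type]:
    """Build the per-motor feature dict the way the gated loop in the diff does."""
    features: dict[str, type] = {}
    for motor in motors:
        features[f"{motor}.pos"] = float
        if use_velocity_and_torque:
            # Only exposed when the flag is set; the default is position-only.
            features[f"{motor}.vel"] = float
            features[f"{motor}.torque"] = float
    return features


if __name__ == "__main__":
    print(list(action_feature_keys(["joint_1", "gripper"])))
    print(list(action_feature_keys(["joint_1", "gripper"], use_velocity_and_torque=True)))
```

With the default `False`, the action keys are position-only, which is what the config comment describes as compatible with the openarm_mini teleoperator.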
@@ -112,7 +112,7 @@ class OpenArmMini(Teleoperator):

    @property
    def feedback_features(self) -> dict[str, type]:
-        return {}
+        return self.action_features

    @property
    def is_connected(self) -> bool:
@@ -348,8 +348,9 @@ class OpenArmMini(Teleoperator):
        if left_goals:
            self.bus_left.sync_write("Goal_Position", left_goals)

+    @check_if_not_connected
    def send_feedback(self, feedback: dict[str, float]) -> None:
-        raise NotImplementedError("Feedback is not yet implemented for OpenArm Mini.")
+        self.write_goal_positions(feedback)

    @check_if_not_connected
    def disconnect(self) -> None:
@@ -59,7 +59,7 @@ class SOLeader(Teleoperator):

    @property
    def feedback_features(self) -> dict[str, type]:
-        return {}
+        return self.action_features

    @property
    def is_connected(self) -> bool:
@@ -130,6 +130,12 @@ class SOLeader(Teleoperator):
        for motor in self.bus.motors:
            self.bus.write("Operating_Mode", motor, OperatingMode.POSITION.value)

+    def enable_torque(self) -> None:
+        self.bus.enable_torque()
+
+    def disable_torque(self) -> None:
+        self.bus.disable_torque()
+
    def setup_motors(self) -> None:
        for motor in reversed(self.bus.motors):
            input(f"Connect the controller board to the '{motor}' motor only and press enter.")
@@ -145,9 +151,11 @@ class SOLeader(Teleoperator):
        logger.debug(f"{self} read action: {dt_ms:.1f}ms")
        return action

+    @check_if_not_connected
    def send_feedback(self, feedback: dict[str, float]) -> None:
-        # TODO: Implement force feedback
-        raise NotImplementedError
+        goals = {k.removesuffix(".pos"): v for k, v in feedback.items() if k.endswith(".pos")}
+        if goals:
+            self.bus.sync_write("Goal_Position", goals)

    @check_if_not_connected
    def disconnect(self) -> None:
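
Review note: the new `SOLeader.send_feedback` translates feedback keys into motor names before writing goal positions. The sketch below isolates that key mapping. `feedback_to_goal_positions` is a hypothetical helper name; the comprehension is the one from the diff, and `str.removesuffix` requires Python 3.9 or later.

```python
def feedback_to_goal_positions(feedback: dict[str, float]) -> dict[str, float]:
    """Keep only ``<motor>.pos`` entries and strip the suffix to get motor names."""
    return {k.removesuffix(".pos"): v for k, v in feedback.items() if k.endswith(".pos")}


if __name__ == "__main__":
    feedback = {"shoulder_pan.pos": 12.5, "shoulder_pan.vel": 0.1, "gripper.pos": 40.0}
    print(feedback_to_goal_positions(feedback))
```

Entries without the `.pos` suffix are dropped, and an empty result skips the bus write entirely because of the `if goals:` guard in the diff.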
@@ -86,24 +86,11 @@ def hw_to_dataset_features(
        }

    for key, shape in cam_fts.items():
-        if len(shape) == 2:
-            # Single-channel feature (e.g. depth map). The hardware-side key is
-            # expected to use a "_depth" suffix to disambiguate from its color
-            # counterpart; we strip it so the dataset feature is published as
-            # ``{prefix}.depth.<bare>`` and aligned with ``observation.images.<bare>``.
-            bare = key.removesuffix("_depth") if key.endswith("_depth") else key
-            features[f"{prefix}.depth.{bare}"] = {
-                "dtype": "video" if use_video else "image",
-                "shape": shape,
-                "names": ["height", "width"],
-                "info": {"video.is_depth_map": True},
-            }
-        else:
-            features[f"{prefix}.images.{key}"] = {
-                "dtype": "video" if use_video else "image",
-                "shape": shape,
-                "names": ["height", "width", "channels"],
-            }
+        features[f"{prefix}.images.{key}"] = {
+            "dtype": "video" if use_video else "image",
+            "shape": shape,
+            "names": ["height", "width", "channels"],
+        }

    _validate_feature_names(features)
    return features
@@ -133,14 +120,7 @@ def build_dataset_frame(
        elif ft["dtype"] == "float32" and len(ft["shape"]) == 1:
            frame[key] = np.array([values[name] for name in ft["names"]], dtype=np.float32)
        elif ft["dtype"] in ["image", "video"]:
-            if key.startswith(f"{prefix}.depth."):
-                bare = key.removeprefix(f"{prefix}.depth.")
-                # Hardware emits depth values under "<bare>_depth" to disambiguate
-                # from the color stream stored at "<bare>" — fall back to the bare
-                # name when the producer already uses dataset-style keys.
-                frame[key] = values.get(f"{bare}_depth", values.get(bare))
-            else:
-                frame[key] = values[key.removeprefix(f"{prefix}.images.")]
+            frame[key] = values[key.removeprefix(f"{prefix}.images.")]

    return frame

@@ -69,7 +69,7 @@ def is_package_available(
        return package_exists


-def get_safe_default_video_backend():
+def get_safe_default_codec():
    logger = logging.getLogger(__name__)
    if importlib.util.find_spec("torchcodec"):
        return "torchcodec"
@@ -63,56 +63,10 @@ def _is_scalar(x):
    )


-def _derive_depth_obs_ranges(
-    features: dict[str, dict] | None,
-) -> dict[str, tuple[float, float] | None]:
-    """Map observation keys of depth features to their ``(depth_min, depth_max)`` range.
-
-    A feature is considered a depth map when its ``info`` dict carries
-    ``video.is_depth_map=True`` (the marker set by ``hw_to_dataset_features``
-    and persisted in ``info.json``). For each such feature, we record both
-    the fully-namespaced dataset key (e.g. ``observation.depth.front``) and
-    the corresponding raw observation key forms the robot is likely to emit
-    (``front`` and ``front_depth``) so a single membership check covers all
-    call sites.
-
-    The mapped value is the ``(depth_min, depth_max)`` range stored on the
-    feature (matching the quantization range used at encoding time), or
-    ``None`` when the metadata doesn't expose a range — in which case the
-    caller should let Rerun auto-normalize. Anchoring the colormap to a
-    fixed range avoids per-frame re-normalization, which otherwise looks
-    like flicker on near-static scenes.
-    """
-    ranges: dict[str, tuple[float, float] | None] = {}
-    if not features:
-        return ranges
-    depth_prefix = f"{OBS_STR}.depth."
-    for fk, fv in features.items():
-        info = fv.get("info") if isinstance(fv, dict) else None
-        if not isinstance(info, dict) or not info.get("video.is_depth_map", False):
-            continue
-        depth_min = info.get("video.depth_min")
-        depth_max = info.get("video.depth_max")
-        rng: tuple[float, float] | None = None
-        if (
-            isinstance(depth_min, (int, float))
-            and isinstance(depth_max, (int, float))
-            and depth_max > depth_min
-        ):
-            rng = (float(depth_min), float(depth_max))
-        ranges[fk] = rng
-        if fk.startswith(depth_prefix):
-            bare = fk[len(depth_prefix) :]
-            ranges[bare] = rng
-            ranges[f"{bare}_depth"] = rng
-    return ranges
-
-
 def log_rerun_data(
    observation: RobotObservation | None = None,
    action: RobotAction | None = None,
    compress_images: bool = False,
-    features: dict[str, dict] | None = None,
 ) -> None:
    """
    Logs observation and action data to Rerun for real-time visualization.
@@ -122,13 +76,6 @@ def log_rerun_data(
    - Scalars values (floats, ints) are logged as `rr.Scalars`.
    - 3D NumPy arrays that resemble images (e.g., with 1, 3, or 4 channels first) are transposed
      from CHW to HWC format, (optionally) compressed to JPEG and logged as `rr.Image` or `rr.EncodedImage`.
-    - 2D NumPy arrays whose key matches a depth feature in ``features`` (i.e. carrying
-      ``video.is_depth_map=True``) are logged as ``rr.DepthImage`` with the Viridis
-      colormap and ``meter=1.0`` (depth values are expected in metric meters). When
-      the feature exposes ``video.depth_min`` / ``video.depth_max`` (the encoder
-      quantization range, persisted in ``info.json``), the colormap is anchored to
-      that range via ``depth_range`` to keep the visualization stable across frames.
-      Depth images are never JPEG-compressed regardless of ``compress_images``.
    - 1D NumPy arrays are logged as a series of individual scalars, with each element indexed.
    - Other multi-dimensional arrays are flattened and logged as individual scalars.

@@ -138,16 +85,11 @@ def log_rerun_data(
        observation: An optional dictionary containing observation data to log.
        action: An optional dictionary containing action data to log.
        compress_images: Whether to compress images before logging to save bandwidth & memory in exchange for cpu and quality.
-        features: Optional dataset feature spec (e.g. ``LeRobotDataset.features``). When
-            provided, observation entries matching a depth-map feature are rendered with
-            ``rr.DepthImage`` instead of the generic ``rr.Image`` path.
    """

    require_package("rerun-sdk", extra="viz", import_name="rerun")
    import rerun as rr

-    depth_obs_ranges = _derive_depth_obs_ranges(features)
-
    if observation:
        for k, v in observation.items():
            if v is None:
@@ -158,20 +100,6 @@ def log_rerun_data(
                rr.log(key, rr.Scalars(float(v)))
            elif isinstance(v, np.ndarray):
                arr = v
-                is_depth = bool(depth_obs_ranges) and (k in depth_obs_ranges or key in depth_obs_ranges)
-                if is_depth and arr.ndim == 2:
-                    # Viridis-colormapped DepthImage; never JPEG-compress (lossy on float metric depth).
-                    # Anchor the colormap to the encoder range when available, so the
-                    # visualization doesn't flicker as per-frame min/max drift.
-                    depth_range = depth_obs_ranges.get(k) or depth_obs_ranges.get(key)
-                    depth_kwargs: dict = {
-                        "meter": 1.0,
-                        "colormap": rr.components.Colormap.Viridis,
-                    }
-                    if depth_range is not None:
-                        depth_kwargs["depth_range"] = depth_range
-                    rr.log(key, rr.DepthImage(arr, **depth_kwargs), static=True)
-                    continue
                # Convert CHW -> HWC when needed
                if arr.ndim == 3 and arr.shape[0] in (1, 3, 4) and arr.shape[-1] not in (1, 3, 4):
                    arr = np.transpose(arr, (1, 2, 0))
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:54aecbc1af72a4cd5e9261492f5e7601890517516257aacdf2a0ffb3ce281f1b
+oid sha256:51effd76b73e972f10d31f5084ab906386134b600c87b2668767d30232a902bd
 size 992
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:88a9c3775a2aa1e90a08850521970070a4fcf0f6b82aab43cd8ccc5cf77e0013
-size 47424
+oid sha256:d4d7a16ca67f9adefac0e0620a7b2e9c822f2db42faaaced7a89fbad60e5ead4
+size 47680
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:91a2635e05a75fe187a5081504c5f35ce3417378813fa2deaf9ca4e8200e1819
+oid sha256:796c439ee8a64bf9901ff8325e7419bda8bd316360ee95e6304e8e1ae0f4c36c
 size 68
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:645bff922ac7bea63ad018ebf77c303c0e4cd2c1c0dc5ef3192865281bef3dc6
-size 47424
+oid sha256:ad33a8b47c39c2e1374567ff9da43cdb95e2dbe904c1b02a35051346d3043095
+size 47680
@@ -202,31 +202,6 @@ def test_read_latest_too_old():
            _ = camera.read_latest(max_age_ms=0)  # immediately too old


-def test_async_read_depth_without_use_depth_raises():
-    """``async_read_depth`` must reject cameras configured without ``use_depth=True``."""
-    config = RealSenseCameraConfig(serial_number_or_name="042", warmup_s=0)
-    with RealSenseCamera(config) as camera, pytest.raises(RuntimeError, match="use_depth=False"):
-        _ = camera.async_read_depth()
-
-
-def test_read_latest_depth_without_use_depth_raises():
-    """``read_latest_depth`` must reject cameras configured without ``use_depth=True``."""
-    config = RealSenseCameraConfig(serial_number_or_name="042", warmup_s=0)
-    with RealSenseCamera(config) as camera, pytest.raises(RuntimeError, match="use_depth=False"):
-        _ = camera.read_latest_depth()
-
-
-def test_depth_to_meters_uses_depth_scale():
-    """``_depth_to_meters`` must scale uint16 raw depth into float32 metric meters."""
-    config = RealSenseCameraConfig(serial_number_or_name="042", warmup_s=0)
-    camera = RealSenseCamera(config)
-    camera.depth_scale = 0.001  # typical D-series scale (1 mm/unit)
-    raw = np.array([[0, 1000, 2500], [4095, 65535, 0]], dtype=np.uint16)
-    meters = camera._depth_to_meters(raw)
-    assert meters.dtype == np.float32
-    np.testing.assert_allclose(meters, raw.astype(np.float32) * 0.001)
-
-
@pytest.mark.parametrize(
    "rotation",
    [
@@ -142,36 +142,6 @@ def test_create_without_videos_has_no_video_path(tmp_path):
    assert meta.video_keys == []


-def test_depth_keys_property_filters_by_marker(tmp_path):
-    """``depth_keys`` selects only video features carrying ``video.is_depth_map=True``."""
-    features = {
-        **SIMPLE_FEATURES,
-        "observation.images.cam": {
-            "dtype": "video",
-            "shape": (64, 96, 3),
-            "names": ["height", "width", "channels"],
-            "info": None,
-        },
-        "observation.depth.cam": {
-            "dtype": "video",
-            "shape": (64, 96),
-            "names": ["height", "width"],
-            "info": {"video.is_depth_map": True},
-        },
-    }
-    meta = LeRobotDatasetMetadata.create(
-        repo_id="test/depth_keys", fps=DEFAULT_FPS, features=features, root=tmp_path / "depth_keys"
-    )
-
-    assert set(meta.video_keys) == {"observation.images.cam", "observation.depth.cam"}
-    assert meta.depth_keys == ["observation.depth.cam"]
-    
-def test_depth_keys_empty_when_no_marker(tmp_path):
-    meta = LeRobotDatasetMetadata.create(
-        repo_id="test/no_depth", fps=DEFAULT_FPS, features=VIDEO_FEATURES, root=tmp_path / "no_depth"
-    )
-    assert meta.depth_keys == []
-
 def test_create_raises_on_existing_directory(tmp_path):
    """create() raises if root directory already exists."""
    root = tmp_path / "existing"
@@ -20,7 +20,7 @@ import pytest
 pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")

 from lerobot.datasets.dataset_reader import DatasetReader
-from lerobot.utils.import_utils import get_safe_default_video_backend
+from lerobot.utils.import_utils import get_safe_default_codec

 # ── Loading ──────────────────────────────────────────────────────────

@@ -35,7 +35,7 @@ def test_try_load_returns_true_when_data_exists(tmp_path, lerobot_dataset_factor
        root=dataset.root,
        episodes=None,
        tolerance_s=1e-4,
-        video_backend=get_safe_default_video_backend(),
+        video_backend=get_safe_default_codec(),
        delta_timestamps=None,
        image_transforms=None,
    )
@@ -58,7 +58,7 @@ def test_try_load_returns_false_when_no_data(tmp_path):
        root=meta.root,
        episodes=None,
        tolerance_s=1e-4,
-        video_backend=get_safe_default_video_backend(),
+        video_backend=get_safe_default_codec(),
        delta_timestamps=None,
        image_transforms=None,
    )
@@ -25,7 +25,6 @@ pytest.importorskip("datasets", reason="datasets is required (install lerobot[da

 from lerobot.datasets.dataset_tools import (
    add_features,
-    convert_image_to_video_dataset,
    delete_episodes,
    merge_datasets,
    modify_features,
@@ -33,7 +32,7 @@ from lerobot.datasets.dataset_tools import (
    remove_feature,
    split_dataset,
 )
-from lerobot.datasets.video_utils import VideoEncoderConfig
+from lerobot.scripts.lerobot_edit_dataset import convert_image_to_video_dataset


@pytest.fixture
@@ -1247,12 +1246,10 @@ def test_convert_image_to_video_dataset(tmp_path):
            dataset=source_dataset,
            output_dir=output_dir,
            repo_id="lerobot/pusht_video",
-            camera_encoder_config=VideoEncoderConfig(
-                vcodec="libsvtav1",
-                pix_fmt="yuv420p",
-                g=2,
-                crf=30,
-            ),
+            vcodec="libsvtav1",
+            pix_fmt="yuv420p",
+            g=2,
+            crf=30,
            episode_indices=[0, 1],
            num_workers=2,
        )
@@ -28,7 +28,6 @@ pytest.importorskip("datasets", reason="datasets is required (install lerobot[da
 from lerobot.datasets.dataset_writer import _encode_video_worker
 from lerobot.datasets.lerobot_dataset import LeRobotDataset
 from lerobot.datasets.utils import DEFAULT_IMAGE_PATH
-from lerobot.datasets.video_utils import VideoEncoderConfig
 from tests.fixtures.constants import DEFAULT_FPS, DUMMY_REPO_ID

 SIMPLE_FEATURES = {
@@ -53,8 +52,8 @@ def _make_frame(features: dict, task: str = "Dummy task") -> dict:
 # ── Existing encode_video_worker tests ───────────────────────────────


-def test_encode_video_worker_forwards_camera_encoder_config(tmp_path):
-    """_encode_video_worker forwards camera_encoder_config to encode_video_frames."""
+def test_encode_video_worker_forwards_vcodec(tmp_path):
+    """_encode_video_worker correctly forwards the vcodec parameter."""
    video_key = "observation.images.laptop"
    fpath = DEFAULT_IMAGE_PATH.format(image_key=video_key, episode_index=0, frame_index=0)
    img_dir = tmp_path / Path(fpath).parent
@@ -69,21 +68,13 @@ def test_encode_video_worker_forwards_camera_encoder_config(tmp_path):
        Path(video_path).touch()

    with patch("lerobot.datasets.dataset_writer.encode_video_frames", side_effect=mock_encode):
-        _encode_video_worker(
-            video_key,
-            0,
-            tmp_path,
-            fps=30,
-            camera_encoder_config=VideoEncoderConfig(vcodec="h264", preset=None),
-            encoder_threads=4,
-        )
+        _encode_video_worker(video_key, 0, tmp_path, fps=30, vcodec="h264")

-    assert captured_kwargs["camera_encoder_config"].vcodec == "h264"
-    assert captured_kwargs["encoder_threads"] == 4
+    assert captured_kwargs["vcodec"] == "h264"


-def test_encode_video_worker_default_camera_encoder_config(tmp_path):
-    """_encode_video_worker passes None camera_encoder_config which encode_video_frames defaults."""
+def test_encode_video_worker_default_vcodec(tmp_path):
+    """_encode_video_worker uses libsvtav1 as the default codec."""
    video_key = "observation.images.laptop"
    fpath = DEFAULT_IMAGE_PATH.format(image_key=video_key, episode_index=0, frame_index=0)
    img_dir = tmp_path / Path(fpath).parent
@@ -100,8 +91,7 @@ def test_encode_video_worker_default_camera_encoder_config(tmp_path):
    with patch("lerobot.datasets.dataset_writer.encode_video_frames", side_effect=mock_encode):
        _encode_video_worker(video_key, 0, tmp_path, fps=30)

-    assert captured_kwargs["camera_encoder_config"] is None
-    assert captured_kwargs["encoder_threads"] is None
+    assert captured_kwargs["vcodec"] == "libsvtav1"


 # ── add_frame contracts ──────────────────────────────────────────────
@@ -43,7 +43,7 @@ from lerobot.datasets.utils import (
    DEFAULT_VIDEO_FILE_SIZE_IN_MB,
    create_branch,
 )
-from lerobot.datasets.video_utils import VALID_VIDEO_CODECS, VideoEncoderConfig
+from lerobot.datasets.video_utils import VALID_VIDEO_CODECS
 from lerobot.envs.factory import make_env_config
 from lerobot.policies.factory import make_policy_config
 from lerobot.robots import make_robot_from_config
@@ -1470,9 +1470,17 @@ def test_frames_in_current_file_calculation(tmp_path, empty_lerobot_dataset_fact


 def test_lerobot_dataset_vcodec_validation():
-    """Invalid vcodec in encoder config is rejected at construction time."""
+    """Test that LeRobotDataset validates the vcodec parameter."""
+    # Test that invalid vcodec raises ValueError
    with pytest.raises(ValueError, match="Invalid vcodec"):
-        VideoEncoderConfig(vcodec="invalid_codec")
+        LeRobotDataset.__new__(LeRobotDataset)  # bypass __init__ to test validation directly
+        # Actually test via create since it's easier
+        LeRobotDataset.create(
+            repo_id="test/invalid_codec",
+            fps=30,
+            features={"observation.state": {"dtype": "float32", "shape": (2,), "names": ["x", "y"]}},
+            vcodec="invalid_codec",
+        )


 def test_valid_video_codecs_constant():
@@ -1483,8 +1491,7 @@ def test_valid_video_codecs_constant():
    assert "auto" in VALID_VIDEO_CODECS
    assert "h264_videotoolbox" in VALID_VIDEO_CODECS
    assert "h264_nvenc" in VALID_VIDEO_CODECS
-    assert "ffv1" in VALID_VIDEO_CODECS
-    assert len(VALID_VIDEO_CODECS) == 11
+    assert len(VALID_VIDEO_CODECS) == 10


 def test_delta_timestamps_with_episodes_filter(tmp_path, empty_lerobot_dataset_factory):
@@ -93,32 +93,9 @@ def test_image_array_to_pil_image_pytorch_format(img_array_factory):


 def test_image_array_to_pil_image_single_channel(img_array_factory):
-    # Single-channel inputs are routed to grayscale mode for raw depth maps.
    img_array = img_array_factory(channels=1)
-    result_image = image_array_to_pil_image(img_array)
-    assert isinstance(result_image, Image.Image)
-    assert result_image.size == (100, 100)
-    assert result_image.mode == "L"
-    assert np.array_equal(np.array(result_image), img_array.squeeze(-1))
-
-
-def test_image_array_to_pil_image_single_channel_uint16(img_array_factory):
-    img_array = img_array_factory(channels=1, dtype=np.uint16)
-    result_image = image_array_to_pil_image(img_array)
-    assert isinstance(result_image, Image.Image)
-    assert result_image.size == (100, 100)
-    assert result_image.mode == "I;16"
-    # Bit-perfect: no rescaling, no clipping.
-    assert np.array_equal(np.array(result_image), img_array.squeeze(-1))
-
-
-def test_image_array_to_pil_image_single_channel_float32(img_array_factory):
-    img_array = img_array_factory(channels=1, dtype=np.float32)
-    result_image = image_array_to_pil_image(img_array)
-    assert isinstance(result_image, Image.Image)
-    assert result_image.size == (100, 100)
-    assert result_image.mode == "F"
-    assert np.array_equal(np.array(result_image), img_array.squeeze(-1))
+    with pytest.raises(NotImplementedError):
+        image_array_to_pil_image(img_array)


 def test_image_array_to_pil_image_4_channels(img_array_factory):
@@ -164,28 +141,6 @@ def test_write_image_image(tmp_path, img_factory):
    assert np.array_equal(image_pil, saved_image)


-def test_write_image_tiff_uint16_bitperfect(tmp_path):
-    """16-bit grayscale TIFF round-trips bit-perfectly (raw depth maps)."""
-    image_array = np.random.randint(0, 65535, size=(32, 48), dtype=np.uint16)
-    fpath = tmp_path / "depth.tiff"
-    write_image(image_array, fpath)
-    assert fpath.exists()
-    saved = np.array(Image.open(fpath))
-    assert saved.dtype == np.uint16
-    assert np.array_equal(saved, image_array)
-
-
-def test_write_image_tiff_float32_bitperfect(tmp_path):
-    """Float32 TIFF round-trips bit-perfectly (metric depth in meters)."""
-    image_array = np.random.uniform(0.05, 4.0, size=(32, 48)).astype(np.float32)
-    fpath = tmp_path / "depth.tiff"
-    write_image(image_array, fpath)
-    assert fpath.exists()
-    saved = np.array(Image.open(fpath))
-    assert saved.dtype == np.float32
-    assert np.array_equal(saved, image_array)
-
-
 def test_write_image_exception(tmp_path):
    image_array = "invalid data"
    fpath = tmp_path / DUMMY_IMAGE
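
Note on the codec check exercised by `test_lerobot_dataset_vcodec_validation` above: the test expects a `ValueError` whose message matches `Invalid vcodec`. A minimal sketch of that kind of check could look like the following. The helper name `validate_vcodec` and the set `EXAMPLE_CODECS` are hypothetical; the set is only an illustrative subset built from codec names that appear in this diff, not the real 10-entry `VALID_VIDEO_CODECS`.

```python
# Hypothetical illustration only; this is not the lerobot implementation.
# EXAMPLE_CODECS is an assumed subset, not the real VALID_VIDEO_CODECS.
EXAMPLE_CODECS = frozenset(
    {"auto", "h264", "libsvtav1", "h264_videotoolbox", "h264_nvenc"}
)


def validate_vcodec(vcodec: str, valid: frozenset = EXAMPLE_CODECS) -> str:
    """Return `vcodec` unchanged if it is allowed, otherwise raise ValueError."""
    if vcodec not in valid:
        raise ValueError(
            f"Invalid vcodec '{vcodec}'. Expected one of: {sorted(valid)}"
        )
    return vcodec
```

Failing at the point where the dataset is created, rather than when the first episode is encoded, surfaces a typo in the codec name before any frames are recorded.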
Author	SHA1	Message	Date
javadcc_mac	f9b8f297b4	Fix EVO1 LIBERO rollout processors	2026-06-09 15:10:10 +08:00
javadcc_mac	95527f6051	Merge remote-tracking branch 'upstream/main' into codex/add-evo1-policy	2026-05-12 17:40:59 +08:00
javadcc_mac	407ee867b9	docs(evo1): format results table	2026-05-12 17:40:18 +08:00
Steven Palma	26ff40ddd7	chore(deps): cap torch ceiling at <2.12, pin Linux wheels to cu128 (#3570 ) * chore(deps): ceiling + cuda * ci: bump cuda version docker image * ci: add cpu wheel to release workflow * chore(deps): update uv.lock * docs: update installation with cuda note	2026-05-11 19:47:55 +02:00
javadcc_mac	a5e6409985	fix(evo1): finalize policy guide alignment	2026-05-11 21:51:41 +08:00
Maxime Ellerbach	6d269b28c8	docs(omx): adding some examples and scripts (#3566 ) * docs(omx): adding some examples and scripts * cleaning up and reviewing the cli args * adding __init__.py to example folder, adjusting the examples * adding reference to pretrained act policy * moving `.send_action` before `dataset.add_frame` for consistency Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * adjusting docstring Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * adressing hardcoded dataset fps * removed init as it worked without --------- Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>	2026-05-11 15:36:32 +02:00
Steven Palma	b607c8458e	docs: add policy & compute guide (#3534 ) * docs(policy): contributing a policy guide * docs(training): HW compute guide * chore(docs): add to readme and index * Apply suggestions from code review Co-authored-by: Haoming Song <1847575517@qq.com> Signed-off-by: Steven Palma <imstevenpmwork@ieee.org> * chore(docs): slight improvements * refactor(docs): consolidate add policy docs * chore(style): fix pre-commit --------- Signed-off-by: Steven Palma <imstevenpmwork@ieee.org> Co-authored-by: Haoming Song <1847575517@qq.com>	2026-05-11 15:19:12 +02:00
Jash Shah	9e83510c99	fix(datasets): close file handle on VideoDecoder init failure in cache (#3542 ) If VideoDecoder() raises during initialization, the fsspec file handle was leaked since it was opened via __enter__() but never closed on the exception path. Now explicitly closes the handle before re-raising.	2026-05-10 17:30:37 +02:00
javadcc_mac	1c9fbba9a9	chore(evo1): align with policy contribution guide conventions - Add `src/lerobot/policies/evo1/README.md` symlink into `docs/source/evo1.mdx` to match the in-tree README convention (mirroring the EO-1 layout). - Convert `transformers` import in `internvl3_embedder.py` to the standard `TYPE_CHECKING + _transformers_available` two-step gating used by other optional-backbone policies (e.g. diffusion). The previous lazy-in-`__init__` import was functionally equivalent for runtime gating but didn't expose the real symbols to type checkers. - Add `lerobot[evo1]` to the `all` extra in `pyproject.toml` so `pip install 'lerobot[all]'` keeps installing every optional policy. Per the guidance in https://moon-ci-docs.huggingface.co/docs/lerobot/pr_3534/en/contributing_a_policy.	2026-05-10 23:14:23 +08:00
javadcc_mac	6a1b5ceb9d	Merge remote-tracking branch 'upstream/main' into codex/add-evo1-policy # Conflicts: # uv.lock	2026-05-10 22:48:17 +08:00
javadcc_mac	daa4c4dd30	chore(lock): regenerate uv.lock for evo1 extra Adds the `evo1` entry to `[package.metadata.requires-dist]` and the `provides-extras` list so that `uv sync --locked --extra test` (used by fast_tests.yml) no longer reports the lockfile as stale. Generated with `uv 0.8.0` (matching `UV_VERSION` in fast_tests.yml). The non-evo1 marker tweaks are produced by `uv lock` re-resolving the existing dep graph and are not introduced by this PR.	2026-05-10 22:43:26 +08:00
Anthony Shoumikhin	1f7b03f5f2	chore(deps): allow torch 2.11/2.12 and fix autocast deprecation (#3435 ) * chore(deps): allow torch 2.11/2.12 and fix autocast deprecation - Bump torch to >=2.7,<2.13 (was <2.11), torchvision to <0.28 (was <0.26), and torchcodec to <0.13 (was <0.11) to allow installs against the latest stable torch 2.11 and the upcoming 2.12 line. - Replace removed torch.get_autocast_gpu_dtype() with torch.get_autocast_dtype("cuda") in Florence2 and Qwen2.5-VL-MoE FlashAttention paths (the former is removed in 2.11+). - Refresh uv.lock for the new resolution (torch 2.11.0+cu130, torchvision 0.26.0+cu130, torchcodec 0.11.1, full CUDA 13 stack). Verified locally with `uv sync --locked` from a clean .venv and the lerobot test suite (pytest -n 8 --dist=loadfile --timeout=300). Failure set is identical to the pre-bump baseline: 18 pre-existing failures (test_sac_policy, test_pi0_rtc, test_pi05_rtc, test_replay_buffer), 0 new, 0 fixed. AI assistance: this change was authored with Claude Code per AI_POLICY.md. * fix(policies): use device-agnostic autocast dtype lookup Pass query_states.device.type to torch.get_autocast_dtype() instead of hardcoding 'cuda', so the cast matches the active autocast context when running under CPU/MPS/XPU autocast. --------- Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>	2026-05-10 13:05:35 +02:00
Yiming Wang	ff992a7a1d	Merge branch 'main' into codex/add-evo1-policy	2026-05-10 18:54:35 +08:00
Steven Palma	cb8edf17e6	chore(dependencies): update uv.lock (#3475 )	2026-05-10 12:24:22 +02:00
Steven Palma	5699f6cbf4	chore(ci): disable auto-stale (#3550 )	2026-05-10 11:49:31 +02:00
javadcc_mac	48269dddb3	fix(evo1): infer batch size after normalizing image dims `_collect_image_batches` read `batch_size = batch[camera_keys[0]].shape[0]` before normalizing per-camera tensors to `(B, C, H, W)`. For an unbatched `(C, H, W)` input (which the function tries to support via the `image.dim() == 3` branch), this picked up the channel count `C` instead of the real batch size, making the subsequent per-sample loop iterate `C` times and indexing go out of bounds. Normalize each camera tensor up-front, then read `batch_size` from the normalized batch dim. Adds `test_collect_image_batches_handles_unbatched_chw` covering the regression. Reported by Copilot review on huggingface/lerobot#3545.	2026-05-10 11:29:23 +08:00
javadcc_mac	8df8d3d866	feat(policies): add EVO1 policy	2026-05-09 21:39:19 +08:00
masato-ka	0e6114ac36	fix(train): restrict legacy RA-BC migration to JSON checkpoints only (#3490 ) * fix(train): restrict legacy RA-BC migration to JSON checkpoints only _migrate_legacy_rabc_fields was called for all config files, causing json.load to raise DecodeError when a YAML/TOML config was passed to lerobot-train for a new training run. Guard the block with an .endswith(".json") check so migration only runs when resuming from a JSON checkpoint.	2026-05-08 20:27:01 +02:00
Steven Palma	c8ce413d73	fix(robots): allign lekiwi default with so100 use_degrees (#3531 )	2026-05-07 17:52:34 +02:00
Pepijn	82dffde7fa	fix(ci): speed up multi-task benchmark evals (parallelize + cap VLABench steps) (#3529 ) * fix(ci): run multi-task benchmark evals 5-at-a-time in parallel The eval script supports running tasks concurrently via a ThreadPoolExecutor (env.max_parallel_tasks). Apply it to the four multi-task benchmark CI jobs (RoboTwin, RoboCasa, RoboMME, LIBERO-plus — 8-10 tasks/task_ids each) so they finish in ~2 waves of 5 instead of running sequentially. Single-task jobs (Libero, MetaWorld, RoboCerebra) are unchanged. * fix(ci): cap VLABench smoke eval at 50 steps per task VLABench's default episode_length is 500 steps; with 10 tasks at ~1 it/s the smoke eval took ~80 minutes of rollouts on top of the image build. The eval is a pipeline smoke test (running_success_rate stays at 0% on this short rollout anyway), so we don't need full episodes — cap each task at 50 steps to bring total rollout time down ~10x. * fix(ci): run VLABench tasks 5-at-a-time in parallel The eval script already supports running multiple tasks concurrently via a ThreadPoolExecutor (env.max_parallel_tasks). Set it to 5 so the 10 VLABench tasks finish in ~2 waves instead of running sequentially.	2026-05-07 13:37:16 +02:00
Ville Kuosmanen	eaf0218bc8	feat(policy): use pretrained vision encoder weights by default for diffusion and vqbet (#3202 ) * feat: add pretrained vision encoder weights for diffusion and vqbet * fix test by re-generating artifacts --------- Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>	2026-05-07 12:10:38 +02:00
Pepijn	a0e52d52fe	fix(ci): bump robotwin benchmark image to CUDA 12.6 (#3525 ) The robotwin benchmark Dockerfile still installed cuda-nvcc-12-4 and cuda-cudart-dev-12-4 after #3505 upgraded the base image to CUDA 12.6.3 on Ubuntu 24.04. Those packages aren't available in the ubuntu2404 CUDA repo, so the build failed at apt-get install. Bumping both to -12-6 to match the base image.	2026-05-07 11:11:12 +02:00
Haoming Song	e99c55af4b	feat(policies): add EO-1 model (#3403 ) * feat(policies): add EO-1 model * chore(eo1): adjust policy_eo1_README.md to to avoid duplicate with eo1.mdx * chore(eo1): remove policy_eo1_README.md, link eo1.mdx in policy folder --------- Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-05-06 18:01:16 +02:00
Steven Palma	408e0ca763	fix(robots): openarm features with openarmmini (#3524 )	2026-05-06 17:03:09 +02:00
Maxime Ellerbach	ce24063efd	feat(dagger): adding smooth handover (#3506 ) * feat(dagger): adding smooth handover * update docstring * small phase fix and documenting potential issues * cleaning up	2026-05-05 14:44:32 +02:00
Steven Palma	82934719db	chore(dep): bump transformers to 5.4.0 (#3374 ) * fix(deps): breaking change from transformers 5.4.0 * Update src/lerobot/policies/xvla/modeling_florence2.py Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * Update src/lerobot/policies/wall_x/qwen_model/qwen2_5_vl_moe.py Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> * removing dataclass * bumping transformers 5.4.0 * weird i can't even pass the test on main * oops, typo * chore(style): fix pre-commit run * chore: update uv.lock * seems like a weird numerical precision issue, lets check in runners * chore: update uv.lock * chore(dependecies): adjust transformers version * chore: update uv.lock --------- Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net> Co-authored-by: Maximellerbach <maxime.ellerbach@huggingface.co> Co-authored-by: raushan <raushan@huggingface.co>	2026-05-05 14:19:09 +02:00
Steven Palma	401a217597	chore(ci): increase time stale (#3507 )	2026-05-04 22:35:16 +02:00
Steven Palma	40094b0464	chore(ci): upgrade docker internal (#3505 )	2026-05-04 21:28:52 +02:00
Jash Shah	fdbfc015a2	fix(peft): fix LoRA resume from Hub (PosixPath + double wrap) (#3485 )	2026-05-04 10:52:37 +02:00
Haoming Song	d656da8ccc	fix(pi): keep training sampling outside compiled forwards (#3487 ) Move PI0 and PI0.5 noise/time sampling into the policy wrappers so the compiled PyTorch cores receive them as tensor inputs. This keeps Beta sampling out of torch.compile on MPS, avoiding aten::_sample_dirichlet compilation errors while preserving the CUDA training path. Validation: .venv/bin/python -m pre_commit run --files src/lerobot/policies/pi0/modeling_pi0.py src/lerobot/policies/pi05/modeling_pi05.py; .venv/bin/python -m pytest -sv -rs tests/policies/pi0_pi05/test_pi0.py tests/policies/pi0_pi05/test_pi05.py tests/policies/pi0_pi05/test_pi0_rtc.py tests/policies/pi0_pi05/test_pi05_rtc.py Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-04-30 13:21:17 +02:00
				`@@ -0,0 +1 @@`
				`../../../../docs/source/policy_evo1_README.md`
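
Commit `48269dddb3` in the list above describes an ordering bug: `batch_size` was read from `shape[0]` before unbatched `(C, H, W)` inputs were normalized to `(B, C, H, W)`, so it picked up the channel count instead. The sketch below reproduces that reasoning on plain shape tuples rather than tensors. `normalize_to_bchw` and `infer_batch_size` are hypothetical names, not functions from the EVO1 code, and the cross-camera consistency check is an added assumption.

```python
def normalize_to_bchw(shape: tuple) -> tuple:
    """Map a (C, H, W) or (B, C, H, W) shape to its (B, C, H, W) form."""
    if len(shape) == 3:
        # Unbatched image: prepend a batch dimension of size 1.
        return (1, *shape)
    if len(shape) == 4:
        return tuple(shape)
    raise ValueError(f"Expected 3 or 4 dims, got shape {shape}")


def infer_batch_size(camera_shapes: dict) -> int:
    """Normalize every camera shape first, then read the batch dimension."""
    normalized = {key: normalize_to_bchw(s) for key, s in camera_shapes.items()}
    sizes = {s[0] for s in normalized.values()}
    if len(sizes) != 1:
        # Assumption for this sketch: all cameras must share one batch size.
        raise ValueError(f"Inconsistent batch sizes across cameras: {sorted(sizes)}")
    return sizes.pop()


# Reading shape[0] before normalizing returns the channel count for (C, H, W).
unbatched = {"observation.images.cam": (3, 224, 224)}
naive_batch_size = unbatched["observation.images.cam"][0]
fixed_batch_size = infer_batch_size(unbatched)
```

With an unbatched `(3, 224, 224)` input, the naive read yields 3 while the normalized read yields 1, which is the difference the commit message attributes to the out-of-bounds indexing.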