Merge remote-tracking branch 'origin/main' into worktree-lingbot-va-port

# Conflicts: # docs/source/_toctree.yml # src/lerobot/policies/factory.py # uv.lock
fix(save codec options): making sure codec options are always set via set_if (#3910 )
2026-07-02 23:57:24 +00:00 · 2026-07-02 14:15:09 +00:00 · 2026-07-02 15:29:14 +02:00 · 2026-07-02 11:53:13 +02:00 · 2026-06-25 11:28:32 +00:00 · 2026-06-25 11:23:35 +00:00
36 changed files with 3342 additions and 50 deletions
@@ -22,6 +22,10 @@ outputs
 rl
 media

+# Local virtualenvs (the image provides its own)
+.venv
+venv
+

 # Logging
 logs
@@ -69,6 +69,8 @@
    title: VLA-JEPA
  - local: eo1
    title: EO-1
+  - local: lingbot_va
+    title: LingBot-VA
  - local: fastwam
    title: FastWAM
  - local: groot
@@ -0,0 +1,187 @@
+# LingBot-VA
+
+LingBot-VA is an **autoregressive video-action world-model policy** built on the **Wan2.2**
+video-diffusion stack. It interleaves, in one autoregressive sequence, the prediction of
+future **video latents** and **robot actions** ("VA" = Video-Action). The LeRobot
+integration wires LingBot-VA into the standard training, evaluation and processor
+interfaces.
+
+## Model Overview
+
+LingBot-VA is a **dual-stream "mixture-of-transformers"**: a video/latent stream
+(`patch_embedding_mlp → blocks → proj_out`) and an action stream
+(`action_embedder → blocks → action_proj_out`) share the same 30 transformer blocks and
+text conditioning.
+
+| Component                | Class                   | Role                                                        |
+| ------------------------ | ----------------------- | ----------------------------------------------------------- |
+| DiT backbone (trainable) | `WanTransformer3DModel` | ~5B-param dual-stream transformer.                          |
+| VAE (frozen)             | `AutoencoderKLWan`      | Wan2.2 VAE, `z_dim=48`. Lazy-pulled from the source repo.   |
+| Text encoder (frozen)    | `UMT5EncoderModel`      | UMT5-XXL, `d_model=4096`. Lazy-pulled from the source repo. |
+
+At inference the policy runs an autoregressive loop per chunk: it denoises the video-latent
+stream (CFG, ~20 steps) and the action stream (~50 steps) with two independent
+flow-matching schedulers, maintaining a KV cache across chunks. Real observed keyframes are
+fed back into the KV cache as the chunk is executed (closed-loop world modeling).
+
+### What the LeRobot Integration Covers
+
+- Standard `policy.type=lingbot_va` configuration through LeRobot.
+- Ready-to-use LeRobot-format checkpoints on the Hub (converted from the released upstream ones).
+- Autoregressive dual-stream inference behind the standard `select_action` interface
+  (single-environment eval, `--eval.batch_size=1`).
+- Opt-in saving of the policy's **predicted (imagined) videos** during eval / training.
+- Evaluation with `lerobot-eval` on LIBERO and RoboTwin.
+- Training / fine-tuning via the dual-stream flow-matching loss (`policy.forward`), see below.
+
+## Installation
+
+1. Install LeRobot by following the [Installation Guide](./installation).
+2. Install the LingBot-VA extra:
+
+```bash
+pip install -e ".[lingbot_va]"
+```
+
+## Checkpoints
+
+The released upstream checkpoints have been converted to LeRobot format and pushed to the Hub:
+
+| Variant                | LeRobot checkpoint               |
+| ---------------------- | -------------------------------- |
+| LIBERO-Long post-train | `lerobot/lingbot_va_libero_long` |
+| RoboTwin post-train    | `lerobot/lingbot_va_robotwin`    |
+| Pretrained base        | `lerobot/lingbot_va_base`        |
+
+Only the trainable ~5B transformer is stored in the LeRobot
+`model.safetensors`. The frozen VAE + UMT5 + tokenizer (~20 GB) are pulled from
+`config.wan_pretrained_path` at load time (defaults to the source `robbyant/*` repo). The
+UMT5-XXL text encoder runs on CPU by default (`config.text_encoder_device`) so the 5B
+transformer + VAE fit on a single 24–32 GB GPU.
+
+## Evaluation (LIBERO)
+
+```bash
+lerobot-eval \
+    --policy.path=lerobot/lingbot_va_libero_long \
+    --policy.device=cuda \
+    --env.type=libero --env.task=libero_10 \
+    --env.observation_height=128 --env.observation_width=128 \
+    --eval.n_episodes=50 --eval.batch_size=1 \
+    --output_dir=outputs/eval/lingbot_va_libero
+```
+
+LingBot-VA's streaming inference (KV cache + observed-keyframe feedback) is implemented for
+single-environment eval; use `--eval.batch_size=1`.
+
+## Evaluation (RoboTwin)
+
+RoboTwin 2.0 needs the SAPIEN + CuRobo simulator stack. You can use the benchmark Docker image
+(`docker/Dockerfile.benchmark.robotwin`, which also needs `warp-lang==1.3.1` and CuRobo built
+with the GPU's compute capability in `TORCH_CUDA_ARCH_LIST`). RoboTwin uses **end-effector-pose
+control**, so run with `--env.action_mode=ee`: the policy predicts per-arm `xyz+quaternion+gripper`
+deltas (`robotwin_tshape` latent layout) that are composed onto the episode's initial eef pose and
+executed via CuRobo IK.
+
+```bash
+lerobot-eval \
+    --policy.path=lerobot/lingbot_va_robotwin \
+    --policy.device=cuda \
+    --env.type=robotwin --env.task=beat_block_hammer --env.action_mode=ee \
+    --eval.n_episodes=10 --eval.batch_size=1 \
+    --output_dir=outputs/eval/lingbot_va_robotwin
+```
+
+### Saving predicted (imagined) videos
+
+Set `--policy.save_predicted_video=true` to additionally VAE-decode the predicted video
+latents and write `pred_episode_*.mp4` next to the env-rendered `eval_episode_*.mp4` videos.
+The same flag works for the periodic eval during `lerobot-train`.
+
+## Training / fine-tuning
+
+`LingBotVAPolicy.forward(batch)` implements the dual-stream **flow-matching** loss
+(`latent_loss + action_loss`, timestep-weighted, action-masked) from the paper: it VAE-encodes
+the camera clips into video latents, UMT5-encodes the task, noises both streams, runs the
+transformer's block-causal training pass and returns `(loss, metrics)`. Optimizer preset is AdamW
+with a linear-warmup-then-constant schedule (matching upstream).
+
+Requirements:
+
+- The block-causal masks use PyTorch **flex-attention**, so build the policy with
+  `--policy.attn_mode=flex` for training (the default `torch` SDPA is inference-only).
+- The full 5B DiT does not fit a single 24–32 GB GPU under AdamW; fine-tune with **LoRA**
+  (`--policy.use_peft=true`) and/or optimizer offload. `get_optim_params` returns only the
+  trainable (e.g. adapter) parameters; the VAE + UMT5 text encoder stay frozen.
+
+```bash
+lerobot-train \
+  --policy.path=lerobot/lingbot_va_libero_long --policy.attn_mode=flex \
+  --policy.use_peft=true \
+  --dataset.repo_id=<your LeRobot-format dataset> \
+  --batch_size=1 --steps=... --output_dir=outputs/train/lingbot_va
+```
+
+The dataset must provide camera clips (a temporal window per camera, VAE-encoded to
+`frame_chunk_size` latent frames) and `frame_chunk_size * action_per_frame` action steps per item.
+
+## Data format (action channels & camera order)
+
+LingBot-VA is an **end-effector (Cartesian) pose** policy, it predicts EEF poses + gripper, not
+joint positions. Actions live in a fixed multi-embodiment **30-dim** layout; map your robot's
+action dimensions into these channels and pad the rest with `0` (`used_action_channel_ids` selects
+the channels a given checkpoint actually uses):
+
+| channels | meaning                                               |
+| -------- | ----------------------------------------------------- |
+| 0–6      | Left-arm end-effector pose                            |
+| 7–13     | Right-arm end-effector pose                           |
+| 14–20    | Left-arm joints (unused by the released checkpoints)  |
+| 21–27    | Right-arm joints (unused by the released checkpoints) |
+| 28       | Left gripper                                          |
+| 29       | Right gripper                                         |
+
+- **LIBERO** uses channels `0–6`: a 6-DoF EEF delta (xyz + rotation) + gripper (single arm).
+- **RoboTwin** uses channels `[0–6, 28, 7–13, 29]`: left EEF (xyz + quaternion) + left gripper +
+  right EEF + right gripper (16 dims). The env converts these poses to joint trajectories via
+  CuRobo IK — joints are never predicted.
+
+Joint-space datasets (or a different EEF convention) must be remapped into this schema before
+fine-tuning these checkpoints.
+
+**Camera order is fixed and order-sensitive**, per-camera latents are concatenated spatially in
+`obs_cam_keys` order, so the physical camera→slot mapping must match training:
+
+| benchmark | `obs_cam_keys` (in order)                                                                             | `camera_layout`                                                     |
+| --------- | ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
+| LIBERO    | `observation.images.image` (agentview / 3rd-person), `observation.images.image2` (eye-in-hand wrist)  | `width_concat` (latents concatenated on width)                      |
+| RoboTwin  | `observation.images.head_camera`, `observation.images.left_camera`, `observation.images.right_camera` | `robotwin_tshape` (full-res head below, two half-res wrists on top) |
+
+The first camera is the exterior/head view and the rest are wrist views.
+
+## Inference Hyperparameters (LIBERO)
+
+| Key                                    | Value                                                                             |
+| -------------------------------------- | --------------------------------------------------------------------------------- |
+| height × width                         | 128 × 128                                                                         |
+| cameras                                | `observation.images.image` (agentview), `observation.images.image2` (eye-in-hand) |
+| action channels used                   | 0–6 (7-DoF arm + gripper)                                                         |
+| action_per_frame / frame_chunk_size    | 4 / 4                                                                             |
+| attn_window                            | 30                                                                                |
+| video / action denoising steps         | 20 / 50                                                                           |
+| guidance_scale / action_guidance_scale | 5 / 1                                                                             |
+| snr_shift / action_snr_shift           | 5.0 / 0.05                                                                        |
+
+These are the defaults of `LingBotVAConfig`; override any of them via `--policy.<name>=...`.
+
+## Notes
+
+- **Attention backend:** inference uses the `torch` SDPA backend (always available). The
+  `flashattn` and `flex` backends are optional; `flex` is only needed for training.
+- **Model size:** the DiT is ~5B params and the frozen VAE+UMT5 add ~20 GB; inference needs
+  roughly 18–24 GB of VRAM.
+
+## License
+
+LingBot-VA is released under Apache-2.0. See the
+[upstream repository](https://github.com/Robbyant/lingbot-va).
@@ -155,7 +155,8 @@ accelerate-dep = ["accelerate>=1.14.0,<2.0.0"]
 can-dep = ["python-can>=4.2.0,<5.0.0"]
 peft-dep = ["peft>=0.18.0,<1.0.0"]
 scipy-dep = ["scipy>=1.14.0,<2.0.0"]
-diffusers-dep = ["diffusers>=0.27.2,<0.36.0"]
+diffusers-dep = ["diffusers>=0.27.2,<0.37.0"]
+imageio-dep = ["imageio[ffmpeg]>=2.34.0,<3.0.0"]
 qwen-vl-utils-dep = ["qwen-vl-utils>=0.0.11,<0.1.0"]
 matplotlib-dep = ["matplotlib>=3.10.3,<4.0.0", "contourpy>=1.3.0,<2.0.0"] # NOTE: Explicitly listing contourpy helps the resolver converge faster.
 pyserial-dep = ["pyserial>=3.5,<4.0"]
@@ -236,6 +237,7 @@ fastwam = [
 ]
 hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.14,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
 vla_jepa = ["lerobot[transformers-dep]", "lerobot[diffusers-dep]", "lerobot[qwen-vl-utils-dep]"]
+lingbot_va = ["lerobot[transformers-dep]", "diffusers>=0.36.0,<0.37.0", "lerobot[imageio-dep]", "accelerate>=1.10.0,<2.0.0", "ftfy>=6.0.0,<7.0.0"]

 # Features
 async = ["lerobot[grpcio-dep]", "lerobot[matplotlib-dep]"]
@@ -318,6 +320,7 @@ all = [
    "lerobot[xvla]",
    "lerobot[hilserl]",
    "lerobot[vla_jepa]",
+    "lerobot[lingbot_va]",
    "lerobot[async]",
    "lerobot[dev]",
    "lerobot[test]",
@@ -410,6 +413,9 @@ ignore = [
 # E402: conditional-import guards (TYPE_CHECKING / is_package_available) must precede the imports they protect
 "src/lerobot/scripts/convert_dataset_v21_to_v30.py" = ["E402"]
 "src/lerobot/policies/wall_x/**" = ["N801", "N812", "SIM102", "SIM108", "SIM210", "SIM211", "B006", "B007", "SIM118"] # Supprese these as they are coming from original Qwen2_5_vl code TODO(pepijn): refactor original
+# Vendored Wan2.2 / LingBot-VA model code uses tensor-dimension names (B, F, H, W) and `F` for
+# torch.nn.functional.
+"src/lerobot/policies/lingbot_va/**" = ["N803", "N806", "N812", "SIM102"]

 [tool.ruff.lint.isort]
 combine-as-imports = true
@@ -34,6 +34,8 @@ from .types import (
 )
 from .video import (
    DEFAULT_DEPTH_UNIT,
+    DEPTH_METER_UNIT,
+    DEPTH_MILLIMETER_UNIT,
    VALID_VIDEO_CODECS,
    VIDEO_ENCODER_INFO_KEYS,
    DepthEncoderConfig,
@@ -41,6 +43,7 @@ from .video import (
    VideoEncoderConfig,
    depth_encoder_defaults,
    encoder_config_from_video_info,
+    infer_depth_unit,
    rgb_encoder_defaults,
 )

@@ -70,8 +73,11 @@ __all__ = [
    "depth_encoder_defaults",
    # Factories
    "encoder_config_from_video_info",
+    "infer_depth_unit",
    # Constants
    "DEFAULT_DEPTH_UNIT",
+    "DEPTH_METER_UNIT",
+    "DEPTH_MILLIMETER_UNIT",
    "VALID_VIDEO_CODECS",
    "VIDEO_ENCODER_INFO_KEYS",
 ]
@@ -22,6 +22,8 @@ import logging
 from dataclasses import dataclass, field
 from typing import Any, ClassVar, Self

+import numpy as np
+
 from lerobot.utils.import_utils import require_package

 logger = logging.getLogger(__name__)
@@ -67,6 +69,15 @@ DEPTH_METER_UNIT: str = "m"
 DEPTH_MILLIMETER_UNIT: str = "mm"
 DEFAULT_DEPTH_UNIT: str = DEPTH_MILLIMETER_UNIT

+
+def infer_depth_unit(dtype: np.dtype | type) -> str:
+    """Infer the physical unit of raw depth frames from their dtype.
+
+    Floating-point frames are assumed to be in metres, integer frames in millimetres.
+    """
+    return DEPTH_METER_UNIT if np.issubdtype(np.dtype(dtype), np.floating) else DEPTH_MILLIMETER_UNIT
+
+
 # Depth-specific tuning fields persisted under ``features[*]["info"]`` as ``video.<name>``.
 DEPTH_ENCODER_INFO_FIELD_NAMES: frozenset[str] = frozenset({"depth_min", "depth_max", "shift", "use_log"})

@@ -215,24 +226,24 @@ class VideoEncoderConfig:
            if encoder_threads is not None:
                svtav1_parts.append(f"lp={encoder_threads}")
            if svtav1_parts:
-                opts["svtav1-params"] = ":".join(svtav1_parts)
+                set_if("svtav1-params", ":".join(svtav1_parts))
        elif self.vcodec in ("h264", "hevc"):
            set_if("crf", self.crf)
            set_if("preset", self.preset)
            if self.fast_decode:
-                opts["tune"] = "fastdecode"
+                set_if("tune", "fastdecode")
            set_if("threads", encoder_threads)
        elif self.vcodec == "libaom-av1":
            set_if("crf", self.crf)
            set_if("preset", self.preset)
            if encoder_threads is not None:
-                opts["threads"] = encoder_threads
-                opts["row-mt"] = 1
+                set_if("threads", encoder_threads)
+                set_if("row-mt", 1)
        elif self.vcodec in ("h264_videotoolbox", "hevc_videotoolbox"):
            if self.crf is not None:
-                opts["q:v"] = max(1, min(100, 100 - self.crf * 2))
+                set_if("q:v", max(1, min(100, 100 - self.crf * 2)))
        elif self.vcodec in ("h264_nvenc", "hevc_nvenc"):
-            opts["rc"] = 0
+            set_if("rc", 0)
            set_if("qp", self.crf)
            set_if("preset", self.preset)
        elif self.vcodec == "h264_vaapi":
@@ -509,7 +509,7 @@ def compute_episode_stats(
        For 'image'/'video' features, stats are computed per channel and kept with a
        leading channel axis (e.g. shape (3, 1, 1) for RGB). RGB stats are divided by
        255 to land in [0, 1]; depth maps (features flagged with ``is_depth_map``) skip
-        this rescaling and remain in their stored units.
+        this rescaling and remain in their stored units (stored in ``depth_unit``).
    """
    if quantile_list is None:
        quantile_list = DEFAULT_QUANTILES
@@ -26,12 +26,13 @@ import pyarrow as pa
 import pyarrow.parquet as pq
 from huggingface_hub import snapshot_download

-from lerobot.configs import VideoEncoderConfig
+from lerobot.configs import DEPTH_METER_UNIT, VideoEncoderConfig
 from lerobot.utils.constants import DEFAULT_FEATURES, HF_LEROBOT_HOME, HF_LEROBOT_HUB_CACHE
 from lerobot.utils.feature_utils import _validate_feature_names
 from lerobot.utils.utils import flatten_dict

 from .compute_stats import aggregate_stats
+from .depth_utils import MM_PER_METRE
 from .feature_utils import create_empty_dataset_info
 from .io_utils import (
    get_file_size_in_mb,
@@ -358,6 +359,35 @@ class LeRobotDatasetMetadata:

        return [key for key, ft in self.features.items() if _is_depth(ft)]

+    def rescale_depth_stats(self, output_unit: str) -> None:
+        """Rescale depth feature stats in place from their recorded unit to ``output_unit``.
+
+        Depth stats are stored in the unit the frames were recorded in
+        (``features[key]["info"]["depth_unit"]``), while frames are returned in
+        ``output_unit`` on read. This converts the unit-bearing stat entries so
+        stats match the frames consumers see.
+        """
+        missing_unit_keys = [
+            key for key in self.depth_keys if (self.features[key].get("info") or {}).get("depth_unit") is None
+        ]
+        if missing_unit_keys:
+            logging.warning(
+                f"Depth feature(s) {missing_unit_keys} have no recorded 'depth_unit' in their info. "
+                f"Depth maps and stats for these keys will be returned AS IS, with no unit conversion "
+                f"to the requested output unit {output_unit!r}. Re-record the dataset or set 'depth_unit' "
+                f"in the feature info (meta/info.json) to enable conversion."
+            )
+        if self.stats is None:
+            return
+        for key in self.depth_keys:
+            stored_unit = (self.features[key].get("info") or {}).get("depth_unit")
+            if stored_unit is None or stored_unit == output_unit or key not in self.stats:
+                continue
+            factor = MM_PER_METRE if stored_unit == DEPTH_METER_UNIT else 1.0 / MM_PER_METRE
+            self.stats[key] = {
+                stat: value if stat == "count" else value * factor for stat, value in self.stats[key].items()
+            }
+
    @property
    def camera_keys(self) -> list[str]:
        """Keys to access visual modalities (regardless of their storage method)."""
@@ -22,10 +22,14 @@ from pathlib import Path
 import datasets
 import torch

-from lerobot.configs import DEFAULT_DEPTH_UNIT, DepthEncoderConfig
+from lerobot.configs import (
+    DEFAULT_DEPTH_UNIT,
+    DEPTH_METER_UNIT,
+    DepthEncoderConfig,
+)

 from .dataset_metadata import LeRobotDatasetMetadata
-from .depth_utils import dequantize_depth
+from .depth_utils import MM_PER_METRE, dequantize_depth
 from .feature_utils import (
    check_delta_timestamps,
    get_delta_indices,
@@ -102,6 +106,13 @@ class DatasetReader:
            for vid_key in self._meta.depth_keys
        }

+        # Get the input unit of each depth feature stored as raw images.
+        self._image_depth_units: dict[str, str | None] = {
+            key: (self._meta.features[key].get("info") or {}).get("depth_unit")
+            for key in self._meta.depth_keys
+            if key in self._meta.image_keys
+        }
+
    def set_image_transforms(self, image_transforms: Callable | None) -> None:
        """Replace the transform applied to visual observations."""
        if image_transforms is not None and not callable(image_transforms):
@@ -329,6 +340,13 @@ class DatasetReader:
                    continue
                item[cam] = self._image_transforms(item[cam])

+        # Convert depth features to the output unit.
+        for key, stored_unit in self._image_depth_units.items():
+            if key in item and stored_unit is not None and stored_unit != self._depth_output_unit:
+                item[key] = (
+                    item[key] * MM_PER_METRE if stored_unit == DEPTH_METER_UNIT else item[key] / MM_PER_METRE
+                )
+
        # Add task as a string
        task_idx = item["task_index"].item()
        item["task"] = self._meta.tasks.iloc[task_idx].name
@@ -36,6 +36,7 @@ from lerobot.configs import (
    RGBEncoderConfig,
    VideoEncoderConfig,
    depth_encoder_defaults,
+    infer_depth_unit,
    rgb_encoder_defaults,
 )

@@ -209,6 +210,15 @@ class DatasetWriter:
        self.episode_buffer["timestamp"].append(timestamp)
        self.episode_buffer["task"].append(frame.pop("task"))

+        # Record each depth feature's input unit once, inferred from the first frame's dtype.
+        if frame_index == 0:
+            for depth_key in self._meta.depth_keys:
+                if depth_key not in frame:
+                    continue
+                info = self._meta.features[depth_key].setdefault("info", {})
+                if info.get("depth_unit") is None:
+                    info["depth_unit"] = infer_depth_unit(np.asarray(frame[depth_key]).dtype)
+
        # Start streaming encoder on first frame of episode
        if frame_index == 0 and self._streaming_encoder is not None:
            self._streaming_encoder.start_episode(
@@ -34,12 +34,13 @@ from lerobot.configs.video import (
    DEPTH_METER_UNIT,
    DEPTH_MILLIMETER_UNIT,
    DEPTH_QMAX,
+    infer_depth_unit,
 )

 from .image_writer import squeeze_single_channel
 from .pyav_utils import write_u16_plane

-_MM_PER_METRE = 1000.0
+MM_PER_METRE = 1000.0
 _UINT16_MAX = 65535


@@ -57,11 +58,7 @@ def _depth_input_to_float32_and_unit(
    input_unit: Literal["auto", DEPTH_METER_UNIT, DEPTH_MILLIMETER_UNIT],
 ) -> tuple[NDArray[np.float32], Literal[DEPTH_METER_UNIT, DEPTH_MILLIMETER_UNIT]]:
    """Convert depth to float32 in the chosen unit, and return the resolved unit."""
-    resolved_unit = (
-        (DEPTH_METER_UNIT if np.issubdtype(depth.dtype, np.floating) else DEPTH_MILLIMETER_UNIT)
-        if input_unit == "auto"
-        else input_unit
-    )
+    resolved_unit = infer_depth_unit(depth.dtype) if input_unit == "auto" else input_unit
    return depth.astype(np.float32, order="K"), resolved_unit


@@ -126,12 +123,12 @@ def quantize_depth(

    # Convert depth_min, depth_max, and shift to the resolved input unit.
    depth_min_u = (
-        np.float32(depth_min) if resolved_unit == DEPTH_METER_UNIT else np.float32(depth_min * _MM_PER_METRE)
+        np.float32(depth_min) if resolved_unit == DEPTH_METER_UNIT else np.float32(depth_min * MM_PER_METRE)
    )
    depth_max_u = (
-        np.float32(depth_max) if resolved_unit == DEPTH_METER_UNIT else np.float32(depth_max * _MM_PER_METRE)
+        np.float32(depth_max) if resolved_unit == DEPTH_METER_UNIT else np.float32(depth_max * MM_PER_METRE)
    )
-    shift_u = np.float32(shift) if resolved_unit == DEPTH_METER_UNIT else np.float32(shift * _MM_PER_METRE)
+    shift_u = np.float32(shift) if resolved_unit == DEPTH_METER_UNIT else np.float32(shift * MM_PER_METRE)

    # Normalization and quantization is performed in the resolved input unit.
    if use_log:
@@ -236,7 +233,7 @@ def dequantize_depth(

        # mm path: round + clamp in float32, skipping the uint16 round-trip
        # when returning a tensor (torch.uint16 is poorly supported).
-        buf.mul_(_MM_PER_METRE).round_().clamp_(0.0, _UINT16_MAX)
+        buf.mul_(MM_PER_METRE).round_().clamp_(0.0, _UINT16_MAX)
        if output_tensor:
            return buf
        return buf.cpu().numpy().astype(np.uint16, copy=False)
@@ -259,7 +256,7 @@ def dequantize_depth(
    if output_unit == DEPTH_METER_UNIT:
        return torch.from_numpy(buf) if output_tensor else buf

-    np.multiply(buf, _MM_PER_METRE, out=buf)
+    np.multiply(buf, MM_PER_METRE, out=buf)
    np.rint(buf, out=buf)
    np.clip(buf, 0.0, _UINT16_MAX, out=buf)
    if output_tensor:
@@ -224,6 +224,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
        )
        self.root = self.meta.root
        self.revision = self.meta.revision
+        self.meta.rescale_depth_stats(self._depth_output_unit)

        if episodes is not None and any(
            episode >= self.meta.total_episodes or episode < 0 for episode in episodes
@@ -350,6 +351,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
        """Frames per second used during data collection."""
        return self.meta.fps

+    @property
+    def depth_output_unit(self) -> str:
+        """Physical unit (``"m"`` or ``"mm"``) depth maps and statistics are returned in on read."""
+        return self._depth_output_unit
+
    @property
    def num_frames(self) -> int:
        """Number of frames in selected episodes."""
@@ -22,11 +22,11 @@ import numpy as np
 import torch
 from datasets import load_dataset

-from lerobot.configs import DEFAULT_DEPTH_UNIT, DepthEncoderConfig
+from lerobot.configs import DEFAULT_DEPTH_UNIT, DEPTH_METER_UNIT, DepthEncoderConfig
 from lerobot.utils.constants import HF_LEROBOT_HOME, LOOKAHEAD_BACKTRACKTABLE, LOOKBACK_BACKTRACKTABLE

 from .dataset_metadata import CODEBASE_VERSION, LeRobotDatasetMetadata
-from .depth_utils import dequantize_depth
+from .depth_utils import MM_PER_METRE, dequantize_depth
 from .feature_utils import get_delta_indices
 from .io_utils import item_to_torch
 from .utils import (
@@ -310,6 +310,7 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
        )
        self.root = self.meta.root
        self.revision = self.meta.revision
+        self.meta.rescale_depth_stats(self._depth_output_unit)
        # Check version
        check_version_compatibility(self.repo_id, self.meta._version, CODEBASE_VERSION)

@@ -318,6 +319,13 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
            for vid_key in self.meta.depth_keys
        }

+        # Input unit of each depth feature stored as raw images (dequantized separately from videos).
+        self._image_depth_units: dict[str, str | None] = {
+            key: (self.meta.features[key].get("info") or {}).get("depth_unit")
+            for key in self.meta.depth_keys
+            if key in self.meta.image_keys
+        }
+
        self.delta_timestamps = None
        self.delta_indices = None

@@ -348,6 +356,11 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
    def fps(self):
        return self.meta.fps

+    @property
+    def depth_output_unit(self) -> str:
+        """Physical unit (``"m"`` or ``"mm"``) depth maps are returned in on read."""
+        return self._depth_output_unit
+
    @staticmethod
    def _iter_random_indices(
        rng: np.random.Generator, buffer_size: int, random_batch_size=100
@@ -530,6 +543,15 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
        for update in updates:
            result.update(update)

+        # Convert raw-image depth features to the output unit (video depth is already converted).
+        for key, stored_unit in self._image_depth_units.items():
+            if key in result and stored_unit is not None and stored_unit != self._depth_output_unit:
+                result[key] = (
+                    result[key] * MM_PER_METRE
+                    if stored_unit == DEPTH_METER_UNIT
+                    else result[key] / MM_PER_METRE
+                )
+
        result["task"] = self.meta.tasks.iloc[item["task_index"]].name

        yield result
@@ -757,7 +757,7 @@ class RoboTwinEnvConfig(EnvConfig):

    task: str = "beat_block_hammer"  # single task or comma-separated list
    fps: int = 25
-    episode_length: int = 300
+    episode_length: int = 1200
    obs_type: str = "pixels_agent_pos"
    render_mode: str = "rgb_array"
    # Available cameras from RoboTwin's aloha-agilex embodiment: head_camera
@@ -768,6 +768,9 @@ class RoboTwinEnvConfig(EnvConfig):
    # must equal what SAPIEN actually renders.
    observation_height: int = 240
    observation_width: int = 320
+    # "joint": 14-d joint-space control. "ee": 16-d end-effector-pose deltas executed via CuRobo IK
+    # (for world-model policies like LingBot-VA that predict per-arm xyz+quaternion+gripper poses).
+    action_mode: str = "joint"
    features: dict[str, PolicyFeature] = field(
        default_factory=lambda: {
            ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(14,)),
@@ -784,6 +787,8 @@ class RoboTwinEnvConfig(EnvConfig):
    )

    def __post_init__(self):
+        if self.action_mode == "ee":
+            self.features[ACTION] = PolicyFeature(type=FeatureType.ACTION, shape=(16,))
        cam_list = [c.strip() for c in self.camera_names.split(",") if c.strip()]
        for cam in cam_list:
            self.features[f"pixels/{cam}"] = PolicyFeature(
@@ -826,6 +831,7 @@ class RoboTwinEnvConfig(EnvConfig):
            observation_height=self.observation_height,
            observation_width=self.observation_width,
            episode_length=self.episode_length,
+            action_mode=self.action_mode,
        )


@@ -17,6 +17,7 @@ from __future__ import annotations

 import importlib
 import logging
+import os
 from collections import defaultdict
 from collections.abc import Callable, Sequence
 from functools import partial
@@ -41,10 +42,123 @@ ROBOTWIN_CAMERA_NAMES: tuple[str, ...] = (
    "right_camera",
 )

-ACTION_DIM = 14  # 7 DOF × 2 arms
+ACTION_DIM = 14  # 7 DOF × 2 arms (joint-space control mode)
+# End-effector-pose control mode: per arm [x, y, z, qx, qy, qz, qw, gripper] = 8, dual-arm = 16.
+# Used by world-model policies (e.g. LingBot-VA) that predict eef-pose deltas executed via CuRobo IK.
+EEF_ACTION_DIM = 16
 ACTION_LOW = -1.0
 ACTION_HIGH = 1.0
-DEFAULT_EPISODE_LENGTH = 300
+DEFAULT_EPISODE_LENGTH = 1200
+OFFICIAL_INSTRUCTION_ENV = "LEROBOT_ROBOTWIN_OFFICIAL_INSTRUCTION"
+OFFICIAL_INSTRUCTION_TYPE_ENV = "LEROBOT_ROBOTWIN_INSTRUCTION_TYPE"
+OFFICIAL_INSTRUCTION_MAX_ENV = "LEROBOT_ROBOTWIN_INSTRUCTION_MAX"
+
+
+def _compose_eef_pose(new_pose: np.ndarray, init_pose: np.ndarray) -> np.ndarray:
+    """Compose a single-arm predicted delta pose onto the initial pose.
+
+    ``new_pose`` / ``init_pose`` are 8-vectors ``[x, y, z, qx, qy, qz, qw, gripper]``. Translation
+    is added, rotation is composed (``init_R * new_R``), and the gripper is taken from the
+    prediction. Mirrors ``add_eef_pose`` in the upstream LingBot-VA RoboTwin client.
+    """
+    from scipy.spatial.transform import Rotation
+
+    new_r = Rotation.from_quat(new_pose[3:7])
+    init_r = Rotation.from_quat(init_pose[3:7])
+    out_rot = (init_r * new_r).as_quat()
+    out_trans = new_pose[:3] + init_pose[:3]
+    return np.concatenate([out_trans, out_rot, new_pose[7:8]])
+
+
+def _add_init_eef_pose(delta_pose: np.ndarray, init_pose: np.ndarray) -> np.ndarray:
+    """Compose a dual-arm (16-d) predicted delta pose onto the initial eef pose, normalizing quats."""
+    left = _compose_eef_pose(delta_pose[:8], init_pose[:8])
+    right = _compose_eef_pose(delta_pose[8:], init_pose[8:])
+    out = np.concatenate([left, right])
+    # Normalize the two quaternions (indices 3:7 and 11:15) as the upstream client does.
+    out[3:7] = out[3:7] / (np.linalg.norm(out[3:7]) + 1e-8)
+    out[11:15] = out[11:15] / (np.linalg.norm(out[11:15]) + 1e-8)
+    return out
+
+
+def _env_flag(name: str, default: bool = False) -> bool:
+    raw = os.environ.get(name)
+    if raw is None:
+        return default
+    return raw.strip().lower() in {"1", "true", "yes", "on"}
+
+
+def _arm_for_block(block: Any) -> str:
+    return "left" if float(block.get_pose().p[0]) < 0 else "right"
+
+
+def _robotwin_blocks_episode_info(task_name: str, env: Any) -> dict[str, str] | None:
+    """Infer the episode-info dict used by RoboTwin's official instruction generator for block ranking."""
+    if task_name == "blocks_ranking_rgb":
+        return {
+            "{A}": "red block",
+            "{B}": "green block",
+            "{C}": "blue block",
+            "{a}": _arm_for_block(env.block1),
+            "{b}": _arm_for_block(env.block2),
+            "{c}": _arm_for_block(env.block3),
+        }
+    if task_name == "blocks_ranking_size":
+        return {
+            "{A}": "large block",
+            "{B}": "medium block",
+            "{C}": "small block",
+            "{a}": _arm_for_block(env.block1),
+            "{b}": _arm_for_block(env.block2),
+            "{c}": _arm_for_block(env.block3),
+        }
+    return None
+
+
+def _generate_robotwin_official_instruction(task_name: str, env: Any) -> str:
+    """Generate language with RoboTwin's official task templates, matching its eval client."""
+    fallback = task_name.replace("_", " ")
+    episode_info = _robotwin_blocks_episode_info(task_name, env)
+    if episode_info is None:
+        logger.warning(
+            "Official RoboTwin instruction is not implemented for task=%s; using %r.", task_name, fallback
+        )
+        return fallback
+
+    try:
+        from description.utils.generate_episode_instructions import generate_episode_descriptions
+    except Exception:
+        logger.warning(
+            "Failed to import RoboTwin official instruction generator; using %r.", fallback, exc_info=True
+        )
+        return fallback
+
+    instruction_type = os.environ.get(OFFICIAL_INSTRUCTION_TYPE_ENV, "seen")
+    try:
+        max_descriptions = int(os.environ.get(OFFICIAL_INSTRUCTION_MAX_ENV, "1000000"))
+    except ValueError:
+        max_descriptions = 1000000
+
+    results = generate_episode_descriptions(task_name, [episode_info], max_descriptions=max_descriptions)
+    if not results:
+        logger.warning(
+            "RoboTwin generated no official instructions for task=%s; using %r.", task_name, fallback
+        )
+        return fallback
+
+    options = results[0].get(instruction_type) or results[0].get("seen") or results[0].get("unseen")
+    if not options:
+        logger.warning(
+            "RoboTwin generated no %s official instructions for task=%s; using %r.",
+            instruction_type,
+            task_name,
+            fallback,
+        )
+        return fallback
+
+    return str(np.random.choice(options))
+
+
 # D435 dims from task_config/_camera_config.yml (what demo_clean.yml selects).
 DEFAULT_CAMERA_H = 240
 DEFAULT_CAMERA_W = 320
@@ -234,6 +348,7 @@ class RoboTwinEnv(gym.Env):
        observation_width: int | None = None,
        episode_length: int = DEFAULT_EPISODE_LENGTH,
        render_mode: str = "rgb_array",
+        action_mode: str = "joint",
    ):
        super().__init__()
        self.task_name = task_name
@@ -241,6 +356,13 @@ class RoboTwinEnv(gym.Env):
        self.task_description = task_name.replace("_", " ")
        self.episode_index = episode_index
        self._reset_stride = n_envs
+        # "joint": 14-d joint-space actions via take_action(action). "ee": 16-d end-effector-pose
+        # deltas (added onto the episode's initial eef pose) executed via take_action(.., "ee") + IK.
+        if action_mode not in ("joint", "ee"):
+            raise ValueError(f"action_mode must be 'joint' or 'ee'; got {action_mode!r}")
+        self.action_mode = action_mode
+        self._action_dim = EEF_ACTION_DIM if action_mode == "ee" else ACTION_DIM
+        self._init_eef_pose: np.ndarray | None = None
        self.camera_names = list(camera_names)
        # Default to D435 dims (the camera type baked into task_config/demo_clean.yml).
        # The YAML-driven lookup is deferred to reset() so construction doesn't
@@ -271,7 +393,7 @@ class RoboTwinEnv(gym.Env):
            }
        )
        self.action_space = spaces.Box(
-            low=ACTION_LOW, high=ACTION_HIGH, shape=(ACTION_DIM,), dtype=np.float32
+            low=ACTION_LOW, high=ACTION_HIGH, shape=(self._action_dim,), dtype=np.float32
        )

    def _ensure_env(self) -> None:
@@ -317,6 +439,18 @@ class RoboTwinEnv(gym.Env):

        return {"pixels": images, "agent_pos": joint_state}

+    def _read_eef_pose(self) -> np.ndarray:
+        """Read the current 16-d dual-arm eef pose [left(xyz+quat)+grip, right(xyz+quat)+grip]."""
+        assert self._env is not None, "_read_eef_pose called before _ensure_env()"
+        ep = self._env.get_obs()["endpose"]
+        pose = (
+            list(ep["left_endpose"])
+            + [ep["left_gripper"]]
+            + list(ep["right_endpose"])
+            + [ep["right_gripper"]]
+        )
+        return np.asarray(pose, dtype=np.float64)
+
    def reset(self, seed: int | None = None, **kwargs) -> tuple[RobotObservation, dict]:
        self._ensure_env()
        super().reset(seed=seed)
@@ -330,16 +464,32 @@ class RoboTwinEnv(gym.Env):
        self.episode_index += self._reset_stride
        self._step_count = 0

+        use_official_instruction = self.task_name in {"blocks_ranking_rgb", "blocks_ranking_size"}
+        if _env_flag(OFFICIAL_INSTRUCTION_ENV, default=use_official_instruction):
+            self.task_description = _generate_robotwin_official_instruction(self.task_name, self._env)
+            if hasattr(self._env, "set_instruction"):
+                self._env.set_instruction(instruction=self.task_description)
+            logger.info("RoboTwin official instruction | task=%s | %s", self.task_name, self.task_description)
+        else:
+            self.task_description = self.task_name.replace("_", " ")
+
+        # In eef mode the policy predicts pose deltas relative to the initial eef pose.
+        if self.action_mode == "ee":
+            self._init_eef_pose = self._read_eef_pose()
+
        obs = self._get_obs()
        return obs, {"is_success": False, "task": self.task_name}

    def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]:
        assert self._env is not None, "step() called before reset()"
-        if action.ndim != 1 or action.shape[0] != ACTION_DIM:
-            raise ValueError(f"Expected 1-D action of shape ({ACTION_DIM},), got {action.shape}")
+        if action.ndim != 1 or action.shape[0] != self._action_dim:
+            raise ValueError(f"Expected 1-D action of shape ({self._action_dim},), got {action.shape}")

        with torch.enable_grad():
-            if hasattr(self._env, "take_action"):
+            if self.action_mode == "ee":
+                ee_action = _add_init_eef_pose(np.asarray(action, dtype=np.float64), self._init_eef_pose)
+                self._env.take_action(ee_action, action_type="ee")
+            elif hasattr(self._env, "take_action"):
                self._env.take_action(action)
            else:
                self._env.step(action)
@@ -398,6 +548,7 @@ def _make_env_fns(
    observation_height: int,
    observation_width: int,
    episode_length: int,
+    action_mode: str = "joint",
 ) -> list[Callable[[], RoboTwinEnv]]:
    """Return n_envs factory callables for a single task."""

@@ -410,6 +561,7 @@ def _make_env_fns(
            observation_height=observation_height,
            observation_width=observation_width,
            episode_length=episode_length,
+            action_mode=action_mode,
        )

    return [partial(_make_one, i) for i in range(n_envs)]
@@ -423,6 +575,7 @@ def create_robotwin_envs(
    observation_height: int = DEFAULT_CAMERA_H,
    observation_width: int = DEFAULT_CAMERA_W,
    episode_length: int = DEFAULT_EPISODE_LENGTH,
+    action_mode: str = "joint",
 ) -> dict[str, dict[int, Any]]:
    """Create vectorized RoboTwin 2.0 environments.

@@ -473,6 +626,7 @@ def create_robotwin_envs(
            observation_height=observation_height,
            observation_width=observation_width,
            episode_length=episode_length,
+            action_mode=action_mode,
        )
        if is_async:
            lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
@@ -83,6 +83,28 @@ class VQBeTSchedulerConfig(LRSchedulerConfig):
        return LambdaLR(optimizer, lr_lambda, -1)


+@LRSchedulerConfig.register_subclass("constant_with_warmup")
+@dataclass
+class ConstantWithWarmupSchedulerConfig(LRSchedulerConfig):
+    """Linear warmup followed by a constant learning rate.
+
+    Mirrors the ``warmup_constant_lambda`` used by LingBot-VA (upstream ``wan_va/train.py``):
+    the LR ramps linearly from 0 to the peak over ``num_warmup_steps`` steps, then stays flat.
+    """
+
+    num_warmup_steps: int = 1000
+
+    def build(self, optimizer: Optimizer, num_training_steps: int) -> LambdaLR:
+        warmup_steps = self.num_warmup_steps or 0
+
+        def lr_lambda(current_step):
+            if current_step < warmup_steps:
+                return float(current_step) / float(max(1, warmup_steps))
+            return 1.0
+
+        return LambdaLR(optimizer, lr_lambda, -1)
+
+
@LRSchedulerConfig.register_subclass("cosine_decay_with_warmup")
@dataclass
 class CosineDecayWithWarmupSchedulerConfig(LRSchedulerConfig):
@@ -21,6 +21,7 @@ from .factory import get_policy_class, make_policy, make_policy_config, make_pre
 from .fastwam.configuration_fastwam import FastWAMConfig as FastWAMConfig
 from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig as GaussianActorConfig
 from .groot.configuration_groot import GrootConfig as GrootConfig
+from .lingbot_va.configuration_lingbot_va import LingBotVAConfig as LingBotVAConfig
 from .molmoact2.configuration_molmoact2 import MolmoAct2Config as MolmoAct2Config
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config as PI0Config
@@ -46,6 +47,7 @@ __all__ = [
    "FastWAMConfig",
    "GaussianActorConfig",
    "GrootConfig",
+    "LingBotVAConfig",
    "MolmoAct2Config",
    "MultiTaskDiTConfig",
    "PI0Config",
@@ -50,6 +50,7 @@ from .eo1.configuration_eo1 import EO1Config
 from .fastwam.configuration_fastwam import FastWAMConfig
 from .gaussian_actor.configuration_gaussian_actor import GaussianActorConfig
 from .groot.configuration_groot import GrootConfig
+from .lingbot_va.configuration_lingbot_va import LingBotVAConfig
 from .molmoact2.configuration_molmoact2 import MolmoAct2Config
 from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig
 from .pi0.configuration_pi0 import PI0Config
@@ -163,6 +164,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
        from .vla_jepa.modeling_vla_jepa import VLAJEPAPolicy

        return VLAJEPAPolicy
+    elif name == "lingbot_va":
+        from .lingbot_va.modeling_lingbot_va import LingBotVAPolicy
+
+        return LingBotVAPolicy
    elif name == "fastwam":
        from .fastwam.modeling_fastwam import FastWAMPolicy

@@ -223,6 +228,8 @@ def make_policy_config(policy_type: str, **kwargs) -> PreTrainedConfig:
        return MolmoAct2Config(**kwargs)
    elif policy_type == "vla_jepa":
        return VLAJEPAConfig(**kwargs)
+    elif policy_type == "lingbot_va":
+        return LingBotVAConfig(**kwargs)
    elif policy_type == "fastwam":
        return FastWAMConfig(**kwargs)
    else:
@@ -458,6 +465,14 @@ def make_pre_post_processors(
            dataset_stats=kwargs.get("dataset_stats"),
        )

+    elif isinstance(policy_cfg, LingBotVAConfig):
+        from .lingbot_va.processor_lingbot_va import make_lingbot_va_pre_post_processors
+
+        processors = make_lingbot_va_pre_post_processors(
+            config=policy_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+        )
+
    elif isinstance(policy_cfg, FastWAMConfig):
        from .fastwam.processor_fastwam import make_fastwam_pre_post_processors

@@ -0,0 +1 @@
+../../../../docs/source/lingbot_va.mdx
@@ -0,0 +1,33 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# NOTE: ``LingBotVAPolicy`` (and the Wan transformer it owns) imports ``diffusers`` as a
+# hard dependency at class-definition time (it subclasses diffusers' ModelMixin/ConfigMixin).
+# To keep base ``import lerobot`` working without the optional ``lingbot_va`` extra, the
+# policy is exposed lazily via module ``__getattr__`` — the heavy import only happens when
+# ``LingBotVAPolicy`` is actually accessed (mirroring the lazy import in policies/factory.py).
+from .configuration_lingbot_va import LingBotVAConfig
+from .processor_lingbot_va import make_lingbot_va_pre_post_processors
+
+__all__ = ["LingBotVAConfig", "LingBotVAPolicy", "make_lingbot_va_pre_post_processors"]
+
+
+def __getattr__(name):
+    if name == "LingBotVAPolicy":
+        from .modeling_lingbot_va import LingBotVAPolicy
+
+        return LingBotVAPolicy
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
@@ -0,0 +1,170 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Configuration for the LingBot-VA policy.
+
+LingBot-VA is an autoregressive video-action world-model policy built on the Wan2.2
+video-diffusion stack. It interleaves prediction of future video latents and robot
+actions in a single dual-stream transformer. See ``docs/source/lingbot_va.mdx`` and the
+upstream repository (https://github.com/Robbyant/lingbot-va).
+
+Defaults below match the upstream LIBERO configuration (``wan_va/configs/va_libero_cfg.py``)
+and the ``transformer/config.json`` of the released checkpoints.
+"""
+
+from dataclasses import dataclass, field
+
+from lerobot.configs.policies import PreTrainedConfig
+from lerobot.configs.types import FeatureType, NormalizationMode, PolicyFeature
+from lerobot.optim.optimizers import AdamWConfig
+from lerobot.optim.schedulers import LRSchedulerConfig
+from lerobot.utils.constants import ACTION
+
+
+@PreTrainedConfig.register_subclass("lingbot_va")
+@dataclass
+class LingBotVAConfig(PreTrainedConfig):
+    """Configuration for the native LingBot-VA policy integration in LeRobot."""
+
+    # Wan transformer architecture
+    patch_size: tuple[int, int, int] = (1, 2, 2)
+    num_attention_heads: int = 24
+    attention_head_dim: int = 128
+    in_channels: int = 48
+    out_channels: int = 48
+    action_dim: int = 30
+    text_dim: int = 4096
+    freq_dim: int = 256
+    ffn_dim: int = 14336
+    num_layers: int = 30
+    cross_attn_norm: bool = True
+    eps: float = 1e-6
+    rope_max_seq_len: int = 1024
+    # "flex" = training only (needs recent torch); inference uses "torch" SDPA or "flashattn".
+    attn_mode: str = "torch"
+
+    # Frozen sub-models (VAE + UMT5 text encoder + tokenizer)
+    # ~20 GB of frozen weights, NOT bundled in the checkpoint; lazily pulled from this HF repo /
+    # local dir (must hold diffusers-style ``vae/``, ``text_encoder/``, ``tokenizer/`` sub-folders).
+    wan_pretrained_path: str = "robbyant/lingbot-va-base"
+    dtype: str = "bfloat16"  # transformer / VAE / text-encoder dtype: "bfloat16", "float16", "float32"
+    # Frozen UMT5-XXL encoder device; "cpu" frees ~11 GB VRAM (it runs once per episode).
+    text_encoder_device: str = "cpu"
+
+    # Observation cameras (order matters: latents are concatenated on width; LIBERO defaults)
+    obs_cam_keys: list[str] = field(
+        default_factory=lambda: ["observation.images.image", "observation.images.image2"]
+    )
+    # Undo the LIBERO env processor's extra horizontal flip to match the model's training orientation.
+    image_hflip: bool = False
+    # Camera latent layout: "width_concat" (cameras concatenated on width; LIBERO) or
+    # "robotwin_tshape" (full-res head + half-res wrists in a "T"; RoboTwin).
+    camera_layout: str = "width_concat"
+
+    # Inference hyperparameters (LIBERO defaults)
+    n_obs_steps: int = 1
+    height: int = 128
+    width: int = 128
+    action_per_frame: int = 4
+    frame_chunk_size: int = 4
+    attn_window: int = 30
+    num_inference_steps: int = 20
+    video_exec_step: int = -1
+    action_num_inference_steps: int = 50
+    guidance_scale: float = 5.0
+    action_guidance_scale: float = 1.0
+    snr_shift: float = 5.0
+    action_snr_shift: float = 0.05
+    max_sequence_length: int = 512  # UMT5 prompt length
+
+    # Subset of the 30-d action space used by the benchmark (LIBERO = 7-DoF). The action
+    # (un)normalization quantiles live in the checkpoint's ``policy_postprocessor.json``, not here.
+    used_action_channel_ids: list[int] = field(default_factory=lambda: list(range(7)))
+
+    # Opt-in: VAE-decode predicted video latents to ``self.last_predicted_frames`` for saving MP4s.
+    save_predicted_video: bool = False
+
+    # Normalization: IDENTITY here; images are scaled + VAE-encoded and actions are
+    # quantile-(un)normalized inside the policy / dedicated processor steps.
+    normalization_mapping: dict[str, NormalizationMode] = field(
+        default_factory=lambda: {
+            "VISUAL": NormalizationMode.IDENTITY,
+            "STATE": NormalizationMode.IDENTITY,
+            "ACTION": NormalizationMode.IDENTITY,
+        }
+    )
+
+    # Optimizer / scheduler (training; AdamW + warmup-constant per upstream train.py)
+    optimizer_lr: float = 1e-5
+    optimizer_betas: tuple[float, float] = (0.9, 0.95)
+    optimizer_eps: float = 1e-8
+    optimizer_weight_decay: float = 1e-4
+    optimizer_grad_clip_norm: float = 1.0
+    scheduler_warmup_steps: int = 1000
+
+    def __post_init__(self):
+        super().__post_init__()
+        if self.attn_mode not in ("torch", "flashattn", "flex"):
+            raise ValueError(f"attn_mode must be one of 'torch', 'flashattn', 'flex'; got {self.attn_mode!r}")
+
+    @property
+    def chunk_size(self) -> int:
+        """Number of single-step actions produced per autoregressive chunk."""
+        return self.frame_chunk_size * self.action_per_frame
+
+    @property
+    def n_action_steps(self) -> int:
+        """Number of actions executed before refilling (the whole chunk)."""
+        return self.chunk_size
+
+    def validate_features(self) -> None:
+        image_features = [key for key, feat in self.input_features.items() if feat.type == FeatureType.VISUAL]
+        if not image_features:
+            raise ValueError(
+                "LingBot-VA requires at least one visual input feature. "
+                "No features of type FeatureType.VISUAL found in input_features."
+            )
+        if ACTION not in self.output_features:
+            self.output_features[ACTION] = PolicyFeature(
+                type=FeatureType.ACTION, shape=(len(self.used_action_channel_ids),)
+            )
+
+    def get_optimizer_preset(self) -> AdamWConfig:
+        return AdamWConfig(
+            lr=self.optimizer_lr,
+            betas=self.optimizer_betas,
+            eps=self.optimizer_eps,
+            weight_decay=self.optimizer_weight_decay,
+            grad_clip_norm=self.optimizer_grad_clip_norm,
+        )
+
+    def get_scheduler_preset(self) -> LRSchedulerConfig | None:
+        # Upstream uses a linear warmup followed by a constant LR (warmup_constant_lambda).
+        from lerobot.optim.schedulers import ConstantWithWarmupSchedulerConfig
+
+        return ConstantWithWarmupSchedulerConfig(num_warmup_steps=self.scheduler_warmup_steps)
+
+    @property
+    def observation_delta_indices(self) -> list[int]:
+        temporal_downsample = 4
+        stride = max(1, self.action_per_frame // temporal_downsample)
+        return list(range(0, self.frame_chunk_size * temporal_downsample * stride, stride))
+
+    @property
+    def action_delta_indices(self) -> list[int]:
+        return list(range(self.chunk_size))
+
+    @property
+    def reward_delta_indices(self) -> None:
+        return None
@@ -0,0 +1,87 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Pre/post-processor pipelines for the LingBot-VA policy.
+
+The preprocessor passes inputs through (IDENTITY) and the postprocessor maps the policy's
+``[-1, 1]`` actions back to physical units with the built-in ``UnnormalizerProcessorStep``
+(QUANTILES) using per-channel q01/q99 restored from the checkpoint.
+"""
+
+from typing import Any
+
+import torch
+
+from lerobot.configs.types import FeatureType, NormalizationMode
+from lerobot.processor import (
+    AddBatchDimensionProcessorStep,
+    DeviceProcessorStep,
+    NormalizerProcessorStep,
+    PolicyAction,
+    PolicyProcessorPipeline,
+    ProcessorStep,
+    RenameObservationsProcessorStep,
+    UnnormalizerProcessorStep,
+)
+from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
+from lerobot.utils.constants import (
+    POLICY_POSTPROCESSOR_DEFAULT_NAME,
+    POLICY_PREPROCESSOR_DEFAULT_NAME,
+)
+
+from .configuration_lingbot_va import LingBotVAConfig
+
+
+def make_lingbot_va_pre_post_processors(
+    config: LingBotVAConfig,
+    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
+) -> tuple[
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    """Build the pre/post processor pipelines for LingBot-VA."""
+
+    input_steps: list[ProcessorStep] = [
+        RenameObservationsProcessorStep(rename_map={}),
+        AddBatchDimensionProcessorStep(),
+        NormalizerProcessorStep(
+            features={**config.input_features, **config.output_features},
+            norm_map=config.normalization_mapping,
+            stats=dataset_stats,
+        ),
+        DeviceProcessorStep(device=config.device),
+    ]
+
+    # Unnormalize actions from [-1, 1] to physical units (QUANTILES) using q01/q99 restored from the checkpoint.
+    output_steps: list[ProcessorStep] = [
+        UnnormalizerProcessorStep(
+            features=config.output_features,
+            norm_map={FeatureType.ACTION: NormalizationMode.QUANTILES},
+            stats=dataset_stats,
+        ),
+        DeviceProcessorStep(device="cpu"),
+    ]
+
+    return (
+        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+            steps=input_steps,
+            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+        ),
+        PolicyProcessorPipeline[PolicyAction, PolicyAction](
+            steps=output_steps,
+            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+            to_transition=policy_action_to_transition,
+            to_output=transition_to_policy_action,
+        ),
+    )
@@ -84,6 +84,7 @@ import torch
 import torch.utils.data
 import tqdm

+from lerobot.configs import DEPTH_MILLIMETER_UNIT
 from lerobot.datasets import LeRobotDataset
 from lerobot.utils.constants import ACTION, DONE, OBS_STATE, REWARD, SUCCESS
 from lerobot.utils.utils import init_logging
@@ -228,6 +229,9 @@ def visualize_dataset(

    logging.info("Logging to Rerun")

+    # Depth frames and stats are dequantized to the dataset's depth_output_unit on load.
+    depth_meter = 1000.0 if dataset.depth_output_unit == DEPTH_MILLIMETER_UNIT else 1.0
+
    # Use the dataset's q01/q99 depth statistics for robust depth range bounds
    depth_ranges = {}
    for key in dataset.meta.depth_keys:
@@ -254,6 +258,7 @@ def visualize_dataset(
                    depth = to_hwc_float32_numpy(batch[key][i])
                    depth_entity = rr.DepthImage(
                        depth,
+                        meter=depth_meter,
                        colormap=rr.components.Colormap.Viridis,
                        depth_range=depth_ranges.get(key),
                    )
@@ -169,6 +169,7 @@ def rollout(
    env_features: dict | None = None,
    recording_repo_id: str | None = None,
    recording_private: bool = False,
+    predicted_latents_callback: Callable[[PreTrainedPolicy], None] | None = None,
 ) -> dict:
    """Run a batched policy rollout once through a batch of environments.

@@ -198,6 +199,9 @@ def rollout(
            are returned optionally because they typically take more memory to cache. Defaults to False.
        render_callback: Optional rendering callback to be used after the environments are reset, and after
            every step.
+        predicted_latents_callback: Optional callback invoked after every ``select_action`` with the policy
+            itself. World-model policies (e.g. LingBot-VA) stash predicted video latents on
+            ``policy.last_predicted_latents``; this lets the caller concatenate chunks and decode once.
    Returns:
        The dictionary described above.
    """
@@ -276,6 +280,8 @@ def rollout(
            observation = preprocessor(observation)
            with torch.inference_mode():
                action = policy.select_action(observation)
+            if predicted_latents_callback is not None:
+                predicted_latents_callback(policy)
            action = postprocessor(action)

            action_transition = {ACTION: action}
@@ -295,12 +301,22 @@ def rollout(
            # available if none of the envs finished.
            if "final_info" in info:
                final_info = info["final_info"]
-                if not isinstance(final_info, dict):
-                    raise RuntimeError(
-                        "Unsupported `final_info` format: expected dict (Gymnasium >= 1.0). "
-                        "You're likely using an older version of gymnasium (< 1.0). Please upgrade."
+                if isinstance(final_info, dict):
+                    is_success = final_info.get("is_success", [False] * env.num_envs)
+                    successes = (
+                        is_success.tolist()
+                        if hasattr(is_success, "tolist")
+                        else [bool(is_success)] * env.num_envs
                    )
-                successes = final_info["is_success"].tolist()
+                else:
+                    # Gymnasium < 1.0 returns final_info as a per-env sequence/object array,
+                    # with entries set to a dict only for envs that just finished.
+                    successes = []
+                    for item in final_info:
+                        if isinstance(item, dict) and "is_success" in item:
+                            successes.append(bool(item["is_success"]))
+                        else:
+                            successes.append(False)
            elif "is_success" in info:
                is_success = info["is_success"]
                successes = (
@@ -400,6 +416,7 @@ def eval_policy(
    env_features: dict | None = None,
    recording_repo_id: str | None = None,
    recording_private: bool = False,
+    save_predicted_video: bool = False,
 ) -> dict:
    """
    Args:
@@ -418,6 +435,11 @@ def eval_policy(
    if max_episodes_rendered > 0 and not videos_dir:
        raise ValueError("If max_episodes_rendered > 0, videos_dir must be provided.")

+    # World-model policies (e.g. LingBot-VA) opt into predicted-video saving via their config.
+    save_predicted_video = save_predicted_video or bool(
+        getattr(getattr(policy, "config", None), "save_predicted_video", False)
+    )
+
    if not isinstance(policy, PreTrainedPolicy):
        exc = ValueError(
            f"Policy of type 'PreTrainedPolicy' is expected, but type '{type(policy)}' was provided."
@@ -461,6 +483,22 @@ def eval_policy(
    if max_episodes_rendered > 0:
        video_paths: list[str] = []

+    if save_predicted_video:
+        if not videos_dir:
+            raise ValueError("If save_predicted_video is True, videos_dir must be provided.")
+        predicted_video_paths: list[str] = []
+        n_predicted_rendered = 0
+
+    # Collect predicted-video latents across a rollout (world-model policies only). The latents are
+    # concatenated and decoded once after the rollout, matching upstream LingBot-VA's visualization path.
+    def collect_predicted_latents(policy: PreTrainedPolicy):
+        latents = getattr(policy, "last_predicted_latents", None)
+        if latents is not None:
+            pred_latents.append(
+                latents.detach().to("cpu") if hasattr(latents, "detach") else torch.as_tensor(latents).cpu()
+            )
+            policy.last_predicted_latents = None
+
    if return_episode_data:
        episode_data: dict | None = None

@@ -472,6 +510,9 @@ def eval_policy(
        if max_episodes_rendered > 0:
            ep_frames: list[np.ndarray] = []

+        if save_predicted_video:
+            pred_latents: list[torch.Tensor] = []
+
        if start_seed is None:
            seeds = None
        else:
@@ -492,6 +533,7 @@ def eval_policy(
            env_features=env_features,
            recording_repo_id=recording_repo_id,
            recording_private=recording_private,
+            predicted_latents_callback=collect_predicted_latents if save_predicted_video else None,
        )

        # Figure out where in each rollout sequence the first done condition was encountered (results after
@@ -557,6 +599,35 @@ def eval_policy(
                threads.append(thread)
                n_episodes_rendered += 1

+        # Maybe save the policy's predicted (imagined) video for this batch's rollout.
+        if save_predicted_video and len(pred_latents) > 0:
+            predicted_latent = torch.cat(pred_latents, dim=2)
+            decoder = getattr(policy, "decode_predicted_latents", None) or getattr(
+                policy, "_decode_predicted_video", None
+            )
+            if decoder is None:
+                raise AttributeError(
+                    "Policy config requested predicted-video saving, but the policy does not expose "
+                    "`decode_predicted_latents` or `_decode_predicted_video`."
+                )
+            predicted_video = decoder(predicted_latent)
+            if hasattr(predicted_video, "detach"):
+                predicted_video = predicted_video.detach().to("cpu").numpy()
+            videos_dir.mkdir(parents=True, exist_ok=True)
+            predicted_video_path = videos_dir / f"pred_episode_{n_predicted_rendered}.mp4"
+            predicted_video_paths.append(str(predicted_video_path))
+            thread = threading.Thread(
+                target=write_video,
+                args=(
+                    str(predicted_video_path),
+                    predicted_video,
+                    env.unwrapped.metadata["render_fps"],
+                ),
+            )
+            thread.start()
+            threads.append(thread)
+            n_predicted_rendered += 1
+
        progbar.set_postfix(
            {"running_success_rate": f"{np.mean(all_successes[:n_episodes]).item() * 100:.1f}%"}
        )
@@ -600,6 +671,9 @@ def eval_policy(
    if max_episodes_rendered > 0:
        info["video_paths"] = video_paths

+    if save_predicted_video:
+        info["predicted_video_paths"] = predicted_video_paths
+
    return info


@@ -740,9 +814,10 @@ class TaskMetrics(TypedDict):
    max_rewards: list[float]
    successes: list[bool]
    video_paths: list[str]
+    predicted_video_paths: list[str]


-ACC_KEYS = ("sum_rewards", "max_rewards", "successes", "video_paths")
+ACC_KEYS = ("sum_rewards", "max_rewards", "successes", "video_paths", "predicted_video_paths")


 def eval_one(
@@ -791,6 +866,7 @@ def eval_one(
        max_rewards=[ep["max_reward"] for ep in per_episode],
        successes=[ep["success"] for ep in per_episode],
        video_paths=task_result.get("video_paths", []),
+        predicted_video_paths=task_result.get("predicted_video_paths", []),
    )


@@ -851,6 +927,7 @@ def run_one(

    if max_episodes_rendered > 0:
        metrics.setdefault("video_paths", [])
+    metrics.setdefault("predicted_video_paths", [])
    return task_group, task_id, metrics


@@ -908,11 +985,11 @@ def eval_policy_all(
        _append("sum_rewards", metrics.get("sum_rewards"))
        _append("max_rewards", metrics.get("max_rewards"))
        _append("successes", metrics.get("successes"))
-        # video_paths is list-like
-        paths = metrics.get("video_paths", [])
-        if paths:
-            group_acc[group]["video_paths"].extend(paths)
-            overall["video_paths"].extend(paths)
+        for key in ("video_paths", "predicted_video_paths"):
+            paths = metrics.get(key, [])
+            if paths:
+                group_acc[group][key].extend(paths)
+                overall[key].extend(paths)

    # Choose runner (sequential vs threaded)
    task_runner = partial(
@@ -984,6 +1061,7 @@ def eval_policy_all(
            "pc_success": _agg_from_list(acc["successes"]) * 100 if acc["successes"] else float("nan"),
            "n_episodes": len(acc["sum_rewards"]),
            "video_paths": list(acc["video_paths"]),
+            "predicted_video_paths": list(acc["predicted_video_paths"]),
        }

    # overall aggregates
@@ -995,6 +1073,7 @@ def eval_policy_all(
        "eval_s": time.time() - start_t,
        "eval_ep_s": (time.time() - start_t) / max(1, len(overall["sum_rewards"])),
        "video_paths": list(overall["video_paths"]),
+        "predicted_video_paths": list(overall["predicted_video_paths"]),
    }

    return {
@@ -24,6 +24,7 @@ import os

 import numpy as np

+from lerobot.configs import DEPTH_MILLIMETER_UNIT, infer_depth_unit
 from lerobot.types import RobotAction, RobotObservation

 from .constants import ACTION, ACTION_PREFIX, OBS_PREFIX, OBS_STR
@@ -161,7 +162,13 @@ def log_rerun_data(
                    observation_paths.add(key)
                else:
                    if arr.shape[-1] == 1:
-                        img_entity = rr.DepthImage(arr, colormap=rr.components.Colormap.Viridis)
+                        # At record time, the depth unit is inferred from the frame type.
+                        depth_unit = infer_depth_unit(arr.dtype)
+                        img_entity = rr.DepthImage(
+                            arr,
+                            meter=1000.0 if depth_unit == DEPTH_MILLIMETER_UNIT else 1.0,
+                            colormap=rr.components.Colormap.Viridis,
+                        )
                    else:
                        img_entity = rr.Image(arr).compress() if compress_images else rr.Image(arr)
                    rr.log(key, entity=img_entity, static=True)
@@ -32,6 +32,7 @@ from lerobot.configs.video import (
 )
 from lerobot.datasets.depth_utils import dequantize_depth, quantize_depth
 from lerobot.datasets.image_writer import image_array_to_pil_image, write_image
+from lerobot.utils.constants import DEFAULT_FEATURES
 from tests.fixtures.constants import (
    DEFAULT_FPS,
    DUMMY_CAMERA_FEATURES,
@@ -245,3 +246,91 @@ class TestFeatureFileRouting:

        dataset.save_episode()
        dataset.finalize()
+
+
+class TestDepthUnitMetadata:
+    """The depth unit is inferred once from dtype, stored in ``info``, and drives stats + reads."""
+
+    NUM_FRAMES = 4
+
+    def _record(self, root, features_factory, depth_dtype, value, use_videos):
+        from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+        features = features_factory(camera_features=DUMMY_CAMERA_FEATURES_WITH_DEPTH, use_videos=use_videos)
+        dataset = LeRobotDataset.create(
+            repo_id=DUMMY_REPO_ID,
+            fps=DEFAULT_FPS,
+            features=features,
+            root=root,
+            use_videos=use_videos,
+            streaming_encoding=use_videos,
+        )
+        for _ in range(self.NUM_FRAMES):
+            frame: dict = {"task": "test"}
+            for key, ft in dataset.meta.features.items():
+                if key in DEFAULT_FEATURES:
+                    continue
+                if key in dataset.meta.depth_keys:
+                    frame[key] = np.full(ft["shape"], value, dtype=depth_dtype)
+                elif key in dataset.meta.camera_keys:
+                    frame[key] = np.random.randint(0, 256, ft["shape"], dtype=np.uint8)
+                else:
+                    frame[key] = np.zeros(ft["shape"], dtype=np.float32)
+            dataset.add_frame(frame)
+        return dataset
+
+    @pytest.mark.parametrize("use_videos", [False, True])
+    @pytest.mark.parametrize(
+        ("depth_dtype", "value", "expected_unit"),
+        [(np.float32, 2.0, DEPTH_METER_UNIT), (np.uint16, 2000, DEPTH_MILLIMETER_UNIT)],
+    )
+    def test_recorded_unit_inferred_persisted_and_kept_in_stats(
+        self, tmp_path, features_factory, use_videos, depth_dtype, value, expected_unit
+    ):
+        """Unit is inferred from the first frame's dtype, drives stats (raw, never canonicalized), and survives a reload."""
+        from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+        dataset = self._record(tmp_path / "ds", features_factory, depth_dtype, value, use_videos)
+        assert dataset.meta.features[DEPTH_KEY]["info"]["depth_unit"] == expected_unit
+        dataset.save_episode()
+        mean = float(np.asarray(dataset.meta.stats[DEPTH_KEY]["mean"]).reshape(-1)[0])
+        np.testing.assert_allclose(mean, value, rtol=0.05)
+        dataset.finalize()
+
+        reloaded = LeRobotDataset(repo_id=DUMMY_REPO_ID, root=tmp_path / "ds")
+        assert reloaded.meta.features[DEPTH_KEY]["info"]["depth_unit"] == expected_unit
+
+    @pytest.mark.parametrize("use_videos", [False, True])
+    @pytest.mark.parametrize(
+        ("output_unit", "expected"),
+        [(DEPTH_MILLIMETER_UNIT, 2000.0), (DEPTH_METER_UNIT, 2.0)],
+    )
+    def test_read_honors_output_unit_for_frames_and_stats(
+        self, tmp_path, features_factory, use_videos, output_unit, expected
+    ):
+        """Reloading with a ``depth_output_unit`` converts metre frames (image mode) and rescales stats while preserving count."""
+        from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+        dataset = self._record(tmp_path / "ds", features_factory, np.float32, 2.0, use_videos=use_videos)
+        dataset.save_episode()
+        count = float(np.asarray(dataset.meta.stats[DEPTH_KEY]["count"]).reshape(-1)[0])
+        dataset.finalize()
+
+        read_dataset = LeRobotDataset(
+            repo_id=DUMMY_REPO_ID, root=tmp_path / "ds", depth_output_unit=output_unit
+        )
+        stats = read_dataset.meta.stats[DEPTH_KEY]
+        np.testing.assert_allclose(float(np.asarray(stats["mean"]).reshape(-1)[0]), expected, rtol=0.05)
+        np.testing.assert_allclose(float(np.asarray(stats["count"]).reshape(-1)[0]), count)
+
+        if not use_videos:
+            depth = read_dataset[0][DEPTH_KEY]
+            assert torch.allclose(depth, torch.full_like(depth, expected))
+
+            from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
+
+            stream_dataset = StreamingLeRobotDataset(
+                repo_id=DUMMY_REPO_ID, root=tmp_path / "ds", depth_output_unit=output_unit
+            )
+            stream_depth = next(iter(stream_dataset))[DEPTH_KEY]
+            assert torch.allclose(stream_depth, torch.full_like(stream_depth, expected))
@@ -345,7 +345,9 @@ class TestExtraOptions:
        opts = cfg.get_codec_options()
        assert opts["qp"] == 20
        assert isinstance(opts["qp"], int)
-        assert cfg.get_codec_options(as_strings=True)["qp"] == "20"
+        str_opts = cfg.get_codec_options(as_strings=True)
+        assert str_opts["qp"] == "20"
+        assert all(isinstance(v, str) for v in str_opts.values())

    @require_libsvtav1
    def test_structured_fields_win_on_collision(self):
@@ -26,6 +26,7 @@ import pytest
 import torch
 from datasets import Dataset

+from lerobot.configs.video import infer_depth_unit
 from lerobot.datasets.dataset_metadata import CODEBASE_VERSION, LeRobotDatasetMetadata
 from lerobot.datasets.feature_utils import get_hf_features_from_features
 from lerobot.datasets.io_utils import flatten_dict, hf_transform_to_torch
@@ -535,6 +536,13 @@ def lerobot_dataset_factory(
                chunks_size=chunks_size,
                **info_kwargs,
            )
+            # This synthetic path skips add_frame, so record the depth unit the writer would
+            # have stored (dummy depth is uint16) to keep ``depth_unit`` present in info.json.
+            # Reassign a fresh info dict to avoid mutating the shared feature constants.
+            for ft in info.features.values():
+                ft_info = ft.get("info")
+                if ft_info is not None and ft_info.get("is_depth_map") and "depth_unit" not in ft_info:
+                    ft["info"] = {**ft_info, "depth_unit": infer_depth_unit(np.uint16)}
        if stats is None:
            stats = stats_factory(features=info.features)
        if tasks is None:
@@ -0,0 +1,13 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
@@ -0,0 +1,78 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import pytest
+
+from lerobot.configs.policies import PreTrainedConfig
+from lerobot.configs.types import FeatureType, PolicyFeature
+from lerobot.policies.lingbot_va.configuration_lingbot_va import LingBotVAConfig
+from lerobot.utils.constants import ACTION, OBS_IMAGES
+
+
+def make_config(**overrides) -> LingBotVAConfig:
+    kwargs = {"device": "cpu"}
+    kwargs.update(overrides)
+    return LingBotVAConfig(**kwargs)
+
+
+def test_registered_in_choice_registry() -> None:
+    assert "lingbot_va" in PreTrainedConfig.get_known_choices()
+    assert PreTrainedConfig.get_choice_class("lingbot_va") is LingBotVAConfig
+
+
+def test_type_property() -> None:
+    assert make_config().type == "lingbot_va"
+
+
+def test_chunk_size_and_action_steps() -> None:
+    cfg = make_config(frame_chunk_size=4, action_per_frame=4)
+    assert cfg.chunk_size == 16
+    assert cfg.n_action_steps == 16
+    assert cfg.action_delta_indices == list(range(16))
+    assert cfg.observation_delta_indices == list(range(16))
+    assert cfg.reward_delta_indices is None
+
+
+def test_optimizer_and_scheduler_presets() -> None:
+    cfg = make_config()
+    opt = cfg.get_optimizer_preset()
+    assert opt.lr == cfg.optimizer_lr
+    sched = cfg.get_scheduler_preset()
+    assert sched.num_warmup_steps == cfg.scheduler_warmup_steps
+
+
+def test_validate_features_sets_action_feature() -> None:
+    cfg = make_config()
+    cfg.input_features = {f"{OBS_IMAGES}.image": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 128, 128))}
+    cfg.output_features = {}
+    cfg.validate_features()
+    assert ACTION in cfg.output_features
+    assert cfg.output_features[ACTION].shape == (len(cfg.used_action_channel_ids),)
+
+
+def test_validate_features_no_visual_raises() -> None:
+    cfg = make_config()
+    cfg.input_features = {}
+    cfg.output_features = {}
+    with pytest.raises(ValueError, match="at least one visual input feature"):
+        cfg.validate_features()
+
+
+def test_invalid_attn_mode_raises() -> None:
+    with pytest.raises(ValueError, match="attn_mode"):
+        make_config(attn_mode="banana")
@@ -0,0 +1,38 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import pytest
+
+from lerobot.policies.factory import make_policy_config
+from lerobot.policies.lingbot_va.configuration_lingbot_va import LingBotVAConfig
+
+
+def test_make_policy_config_returns_lingbot_va() -> None:
+    cfg = make_policy_config("lingbot_va", device="cpu")
+    assert isinstance(cfg, LingBotVAConfig)
+
+
+def test_get_policy_class_resolves_lazily() -> None:
+    # Importing the policy class pulls in diffusers (Wan2.2 stack); skip if unavailable.
+    pytest.importorskip("diffusers")
+    pytest.importorskip("transformers")
+    from lerobot.policies.factory import get_policy_class
+
+    cls = get_policy_class("lingbot_va")
+    assert cls.name == "lingbot_va"
+    assert cls.config_class is LingBotVAConfig
@@ -0,0 +1,131 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the vendored LingBot-VA helper code (scheduler + grid utilities)."""
+
+from __future__ import annotations
+
+import pytest
+import torch
+
+pytest.importorskip("diffusers")  # the model code lives in modeling_lingbot_va, which imports diffusers
+
+from lerobot.policies.lingbot_va.modeling_lingbot_va import (  # noqa: E402
+    FlowMatchScheduler,
+    data_seq_to_patch,
+    get_mesh_id,
+)
+
+
+def test_flow_match_scheduler_timesteps_monotone_decreasing() -> None:
+    sch = FlowMatchScheduler(shift=5.0, sigma_min=0.0, extra_one_step=True)
+    sch.set_timesteps(20)
+    assert sch.timesteps.shape == (20,)
+    diffs = sch.timesteps[1:] - sch.timesteps[:-1]
+    assert torch.all(diffs <= 0)  # decreasing
+
+
+def test_flow_match_scheduler_step_preserves_shape() -> None:
+    sch = FlowMatchScheduler(shift=5.0, sigma_min=0.0, extra_one_step=True)
+    sch.set_timesteps(20)
+    sample = torch.zeros(1, 48, 4, 8, 16)
+    out = sch.step(torch.ones_like(sample), sch.timesteps[0], sample)
+    assert out.shape == sample.shape
+
+
+def test_flow_match_scheduler_add_noise() -> None:
+    sch = FlowMatchScheduler(shift=5.0, sigma_min=0.0, extra_one_step=True)
+    sch.set_timesteps(20)
+    sample = torch.randn(1, 48, 4, 8, 16)
+    noise = torch.randn_like(sample)
+    noisy = sch.add_noise(sample, noise, sch.timesteps[:4], t_dim=2)
+    assert noisy.shape == sample.shape
+
+
+def test_get_mesh_id_latent_shape() -> None:
+    grid = get_mesh_id(4, 8, 16, 0, 1, 0)
+    assert grid.shape == (4, 4 * 8 * 16)  # (f, h, w, stream) x tokens
+
+
+def test_get_mesh_id_action_shape() -> None:
+    grid = get_mesh_id(4, 4, 1, 1, 1, 0, action=True)
+    assert grid.shape == (4, 4 * 4 * 1)
+    # Action rows for h/w are sentinel -1.
+    assert torch.all(grid[1] < 0)
+    assert torch.all(grid[2] < 0)
+
+
+def test_data_seq_to_patch_roundtrip_shape() -> None:
+    b, f, h, w, c = 1, 4, 8, 16, 48
+    seq = torch.arange(b * f * h * w * c, dtype=torch.float32).reshape(b, f * h * w, c)
+    out = data_seq_to_patch((1, 2, 2), seq, f, h, w, batch_size=b)
+    assert out.shape == (b, c, f, h, w)
+
+
+def test_training_step_reduces_loss_tiny_flex() -> None:
+    """End-to-end single training step (flow-matching loss -> backward -> AdamW) on a tiny config.
+
+    Exercises the flex-attention training path; requires a CUDA GPU with flex-attention support.
+    """
+    if not torch.cuda.is_available():
+        import pytest
+
+        pytest.skip("training step test requires a CUDA GPU (flex-attention)")
+
+    from lerobot.configs.types import FeatureType, PolicyFeature
+    from lerobot.policies.lingbot_va.configuration_lingbot_va import LingBotVAConfig
+    from lerobot.policies.lingbot_va.modeling_lingbot_va import LingBotVAPolicy
+    from lerobot.utils.constants import ACTION, OBS_IMAGES
+
+    cfg = LingBotVAConfig(
+        attn_mode="flex",
+        dtype="bfloat16",
+        in_channels=16,
+        out_channels=16,
+        action_dim=8,
+        text_dim=32,
+        freq_dim=64,
+        ffn_dim=64,
+        num_attention_heads=2,
+        attention_head_dim=24,
+        num_layers=2,
+        frame_chunk_size=2,
+        action_per_frame=4,
+        used_action_channel_ids=[0, 1, 2, 3],
+        obs_cam_keys=[f"{OBS_IMAGES}.image"],
+        device="cuda",
+    )
+    cfg.input_features = {f"{OBS_IMAGES}.image": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 64, 64))}
+    cfg.output_features = {ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(4,))}
+    cfg.validate_features()
+
+    policy = LingBotVAPolicy(cfg).to("cuda")
+    policy.train()
+    opt = torch.optim.AdamW(policy.get_optim_params(), lr=1e-4)
+
+    b, fc, apf = 1, cfg.frame_chunk_size, cfg.action_per_frame
+    latents = torch.randn(b, cfg.in_channels, fc, 4, 4, device="cuda", dtype=torch.bfloat16)
+    actions = torch.randn(b, cfg.action_dim, fc, apf, 1, device="cuda", dtype=torch.bfloat16)
+    amask = torch.zeros(cfg.action_dim, device="cuda")
+    amask[cfg.used_action_channel_ids] = 1.0
+    actions_mask = amask.view(1, -1, 1, 1, 1).expand_as(actions)
+    text_emb = torch.randn(b, cfg.max_sequence_length, cfg.text_dim, device="cuda", dtype=torch.bfloat16)
+
+    loss, metrics = policy.training_loss_from_streams(latents, actions, actions_mask, text_emb)
+    assert torch.isfinite(loss) and {"latent_loss", "action_loss"} <= set(metrics)
+    loss.backward()
+    assert any(p.grad is not None and torch.isfinite(p.grad).all() for p in policy.get_optim_params())
+    opt.step()
@@ -0,0 +1,88 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+import torch
+
+from lerobot.configs.types import FeatureType, PolicyFeature
+from lerobot.policies.lingbot_va.configuration_lingbot_va import LingBotVAConfig
+from lerobot.policies.lingbot_va.processor_lingbot_va import make_lingbot_va_pre_post_processors
+from lerobot.processor import PolicyProcessorPipeline, UnnormalizerProcessorStep
+from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
+from lerobot.utils.constants import (
+    ACTION,
+    OBS_IMAGES,
+    POLICY_POSTPROCESSOR_DEFAULT_NAME,
+    POLICY_PREPROCESSOR_DEFAULT_NAME,
+)
+
+
+def _make_config() -> LingBotVAConfig:
+    cfg = LingBotVAConfig(device="cpu")
+    cfg.input_features = {f"{OBS_IMAGES}.image": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 128, 128))}
+    cfg.output_features = {}
+    cfg.validate_features()
+    return cfg
+
+
+def test_make_pre_post_processors_names_and_steps() -> None:
+    cfg = _make_config()
+    pre, post = make_lingbot_va_pre_post_processors(cfg, dataset_stats=None)
+    assert pre.name == POLICY_PREPROCESSOR_DEFAULT_NAME
+    assert post.name == POLICY_POSTPROCESSOR_DEFAULT_NAME
+    # Actions are unnormalized by the standard built-in quantile unnormalizer.
+    assert any(isinstance(s, UnnormalizerProcessorStep) for s in post.steps)
+
+
+def test_freshly_built_postprocessor_is_identity() -> None:
+    # Without action stats the quantile unnormalizer is a no-op (identity passthrough): the real
+    # per-benchmark q01/q99 are restored from the saved checkpoint on load, not hardcoded here.
+    cfg = _make_config()
+    _, post = make_lingbot_va_pre_post_processors(cfg, dataset_stats=None)
+    normed = torch.tensor([[0.3, -0.5, 1.0, -1.0, 0.0, 0.7, -0.2]])
+    assert torch.allclose(post(normed), normed, atol=1e-6)
+
+
+def test_postprocessor_quantile_unnormalization() -> None:
+    # QUANTILES unnormalize maps [-1, 1] -> [q01, q99]: -1 -> q01, +1 -> q99.
+    cfg = _make_config()
+    q01 = [-1.0, -0.5, 0.0, -1.0, -1.0, -1.0, -1.0]
+    q99 = [1.0, 0.5, 2.0, 1.0, 1.0, 1.0, 1.0]
+    stats = {ACTION: {"q01": q01, "q99": q99}}
+    _, post = make_lingbot_va_pre_post_processors(cfg, dataset_stats=stats)
+    out_lo = post(torch.full((1, 7), -1.0))
+    out_hi = post(torch.full((1, 7), 1.0))
+    assert torch.allclose(out_lo, torch.tensor(q01).unsqueeze(0), atol=1e-4)
+    assert torch.allclose(out_hi, torch.tensor(q99).unsqueeze(0), atol=1e-4)
+
+
+def test_postprocessor_stats_survive_save_load(tmp_path) -> None:
+    # Regression guard for the Hub mechanism: the q01/q99 stats live in the saved post-processor
+    # state and must round-trip through save_pretrained / from_pretrained.
+    cfg = _make_config()
+    q01 = [-0.6, -0.8, -0.9, -0.1, -0.15, -0.25, -1.0]
+    q99 = [0.9, 0.85, 0.9, 0.17, 0.18, 0.34, 1.0]
+    _, post = make_lingbot_va_pre_post_processors(cfg, dataset_stats={ACTION: {"q01": q01, "q99": q99}})
+    post.save_pretrained(tmp_path)
+    loaded = PolicyProcessorPipeline.from_pretrained(
+        tmp_path,
+        config_filename=f"{POLICY_POSTPROCESSOR_DEFAULT_NAME}.json",
+        to_transition=policy_action_to_transition,
+        to_output=transition_to_policy_action,
+    )
+    out = loaded(torch.full((1, 7), -1.0))
+    assert torch.allclose(out, torch.tensor(q01).unsqueeze(0), atol=1e-4)
@@ -50,8 +50,9 @@ def mock_rerun(monkeypatch):
            return self

    class DummyDepthImage:
-        def __init__(self, arr, colormap=None):
+        def __init__(self, arr, meter=None, colormap=None):
            self.arr = arr
+            self.meter = meter
            self.colormap = colormap

    def dummy_log(key, obj=None, **kwargs):
@@ -1189,10 +1189,11 @@ wheels = [

 [[package]]
 name = "diffusers"
-version = "0.35.2"
+version = "0.36.0"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "filelock" },
+    { name = "httpx" },
    { name = "huggingface-hub" },
    { name = "importlib-metadata" },
    { name = "numpy" },
@@ -1201,9 +1202,9 @@ dependencies = [
    { name = "requests" },
    { name = "safetensors" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/03/68/288ca23c7c05c73e87ffe5efffc282400ac9b017f7a9bb03883f4310ea15/diffusers-0.35.2.tar.gz", hash = "sha256:30ecd552303edfcfe1724573c3918a8462ee3ab4d529bdbd4c0045f763affded", size = 3366711, upload-time = "2025-10-15T04:05:17.213Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/88/45/ccb2e2180ddf475a0f931dac6a50346310e4c464ce3cccb8a65d1fc1e16d/diffusers-0.36.0.tar.gz", hash = "sha256:a9cde8721b415bde6a678f2d02abb85396487e1b0e0d2b4abb462d14a9825ab0", size = 3795088, upload-time = "2025-12-08T10:14:34.255Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/2a/2e/38d9824f8c6bb048c5ba21c6d4da54c29c162a46b58b3ef907a360a76d3e/diffusers-0.35.2-py3-none-any.whl", hash = "sha256:d50d5e74fdd6dcf55e5c1d304bc52cc7c2659abd1752740d736d7b54078b4db5", size = 4121649, upload-time = "2025-10-15T04:05:14.391Z" },
+    { url = "https://files.pythonhosted.org/packages/35/50/281f92cb1f83854dbd79b6e958b3bc5018607e2542971d41604ba7a14b2f/diffusers-0.36.0-py3-none-any.whl", hash = "sha256:525d42abc74bfc3b2db594999961295c054b48ef40a11724dacf50e6abd1af98", size = 4597884, upload-time = "2025-12-08T10:14:31.979Z" },
 ]

 [[package]]
@@ -1682,6 +1683,18 @@ http = [
    { name = "aiohttp" },
 ]

+[[package]]
+name = "ftfy"
+version = "6.3.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "wcwidth" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/a5/d3/8650919bc3c7c6e90ee3fa7fd618bf373cbbe55dff043bd67353dbb20cd8/ftfy-6.3.1.tar.gz", hash = "sha256:9b3c3d90f84fb267fe64d375a07b7f8912d817cf86009ae134aa03e1819506ec", size = 308927, upload-time = "2024-10-26T00:50:35.149Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/ab/6e/81d47999aebc1b155f81eca4477a616a70f238a2549848c38983f3c22a82/ftfy-6.3.1-py3-none-any.whl", hash = "sha256:7c70eb532015cd2f9adb53f101fb6c7945988d023a085d127d1573dc49dd0083", size = 44821, upload-time = "2024-10-26T00:50:33.425Z" },
+]
+
 [[package]]
 name = "future"
 version = "1.0.0"
@@ -2832,6 +2845,7 @@ all = [
    { name = "fastapi" },
    { name = "feetech-servo-sdk" },
    { name = "foxglove-sdk" },
+    { name = "ftfy" },
    { name = "grpcio" },
    { name = "grpcio-tools" },
    { name = "gym-aloha" },
@@ -2840,6 +2854,7 @@ all = [
    { name = "hebi-py" },
    { name = "hf-libero", marker = "sys_platform == 'linux'" },
    { name = "hidapi" },
+    { name = "imageio", extra = ["ffmpeg"] },
    { name = "ipykernel" },
    { name = "jsonlines" },
    { name = "jupyter" },
@@ -3031,6 +3046,9 @@ hopejr = [
    { name = "pygame" },
    { name = "pyserial" },
 ]
+imageio-dep = [
+    { name = "imageio", extra = ["ffmpeg"] },
+]
 intelrealsense = [
    { name = "pyrealsense2", marker = "sys_platform != 'darwin'" },
    { name = "pyrealsense2-macosx", marker = "sys_platform == 'darwin'" },
@@ -3057,6 +3075,13 @@ libero = [
    { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" },
    { name = "transformers" },
 ]
+lingbot-va = [
+    { name = "accelerate" },
+    { name = "diffusers" },
+    { name = "ftfy" },
+    { name = "imageio", extra = ["ffmpeg"] },
+    { name = "transformers" },
+]
 matplotlib-dep = [
    { name = "contourpy" },
    { name = "matplotlib" },
@@ -3232,6 +3257,7 @@ xvla = [
 [package.metadata]
 requires-dist = [
    { name = "accelerate", marker = "extra == 'accelerate-dep'", specifier = ">=1.14.0,<2.0.0" },
+    { name = "accelerate", marker = "extra == 'lingbot-va'", specifier = ">=1.10.0,<2.0.0" },
    { name = "av", marker = "extra == 'av-dep'", specifier = ">=15.0.0,<16.0.0" },
    { name = "cmake", specifier = ">=3.29.0.1,<4.2.0" },
    { name = "cmeel-tinyxml2", marker = "extra == 'placo-dep'", specifier = "<11" },
@@ -3241,7 +3267,8 @@ requires-dist = [
    { name = "debugpy", marker = "extra == 'dev'", specifier = ">=1.8.1,<1.9.0" },
    { name = "decord", marker = "(platform_machine == 'AMD64' and extra == 'groot') or (platform_machine == 'x86_64' and extra == 'groot')", specifier = ">=0.6.0,<1.0.0" },
    { name = "deepdiff", marker = "extra == 'deepdiff-dep'", specifier = ">=7.0.1,<9.0.0" },
-    { name = "diffusers", marker = "extra == 'diffusers-dep'", specifier = ">=0.27.2,<0.36.0" },
+    { name = "diffusers", marker = "extra == 'diffusers-dep'", specifier = ">=0.27.2,<0.37.0" },
+    { name = "diffusers", marker = "extra == 'lingbot-va'", specifier = ">=0.36.0,<0.37.0" },
    { name = "dm-tree", marker = "extra == 'groot'", specifier = ">=0.1.8,<1.0.0" },
    { name = "draccus", specifier = "==0.10.0" },
    { name = "dynamixel-sdk", marker = "extra == 'dynamixel'", specifier = ">=3.7.31,<3.9.0" },
@@ -3251,6 +3278,7 @@ requires-dist = [
    { name = "feetech-servo-sdk", marker = "extra == 'feetech'", specifier = ">=1.0.0,<2.0.0" },
    { name = "flash-attn", marker = "sys_platform != 'darwin' and extra == 'groot'", specifier = ">=2.5.9,<3.0.0" },
    { name = "foxglove-sdk", marker = "extra == 'viz'", specifier = ">=0.25.1,<0.26.0" },
+    { name = "ftfy", marker = "extra == 'lingbot-va'", specifier = ">=6.0.0,<7.0.0" },
    { name = "grpcio", marker = "extra == 'grpcio-dep'", specifier = ">=1.73.1,<2.0.0" },
    { name = "grpcio", marker = "extra == 'reachy2'", specifier = "<=1.73.1" },
    { name = "grpcio-tools", marker = "extra == 'dev'", specifier = ">=1.73.1,<2.0.0" },
@@ -3262,6 +3290,7 @@ requires-dist = [
    { name = "hf-libero", marker = "sys_platform == 'linux' and extra == 'libero'", specifier = ">=0.1.4,<0.2.0" },
    { name = "hidapi", marker = "extra == 'gamepad'", specifier = ">=0.14.0,<0.15.0" },
    { name = "huggingface-hub", specifier = ">=1.0.0,<2.0.0" },
+    { name = "imageio", extras = ["ffmpeg"], marker = "extra == 'imageio-dep'", specifier = ">=2.34.0,<3.0.0" },
    { name = "ipykernel", marker = "extra == 'notebook'", specifier = ">=6.0.0,<7.0.0" },
    { name = "jsonlines", marker = "extra == 'dataset'", specifier = ">=4.0.0,<5.0.0" },
    { name = "jupyter", marker = "extra == 'notebook'", specifier = ">=1.0.0,<2.0.0" },
@@ -3308,10 +3337,12 @@ requires-dist = [
    { name = "lerobot", extras = ["hardware"], marker = "extra == 'core-scripts'" },
    { name = "lerobot", extras = ["hilserl"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["hopejr"], marker = "extra == 'all'" },
+    { name = "lerobot", extras = ["imageio-dep"], marker = "extra == 'lingbot-va'" },
    { name = "lerobot", extras = ["intelrealsense"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["kinematics"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["lekiwi"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["libero"], marker = "sys_platform == 'linux' and extra == 'all'" },
+    { name = "lerobot", extras = ["lingbot-va"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'async'" },
    { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'sarm'" },
    { name = "lerobot", extras = ["matplotlib-dep"], marker = "extra == 'unitree-g1'" },
@@ -3370,6 +3401,7 @@ requires-dist = [
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'groot'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'hilserl'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'libero'" },
+    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'lingbot-va'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'molmoact2'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'multi-task-dit'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'peft'" },
@@ -3449,7 +3481,7 @@ requires-dist = [
    { name = "transformers", marker = "extra == 'transformers-dep'", specifier = ">=5.4.0,<5.6.0" },
    { name = "wandb", marker = "extra == 'training'", specifier = ">=0.24.0,<0.28.0" },
 ]
-provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "accelerate-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "molmoact2", "smolvla", "multi-task-dit", "groot", "sarm", "robometer", "topreward", "xvla", "eo1", "fastwam", "hilserl", "vla-jepa", "async", "peft", "annotations", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
+provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "accelerate-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "imageio-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "molmoact2", "smolvla", "multi-task-dit", "groot", "sarm", "robometer", "topreward", "xvla", "eo1", "fastwam", "hilserl", "vla-jepa", "lingbot-va", "async", "peft", "annotations", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]

 [[package]]
 name = "librt"
Author	SHA1	Message	Date
Maxime Ellerbach	105aeab1bc	Merge remote-tracking branch 'origin/main' into worktree-lingbot-va-port # Conflicts: # docs/source/_toctree.yml # src/lerobot/policies/factory.py # uv.lock	2026-07-02 14:15:09 +00:00
Caroline Pascal	7ae12124b0	fix(save codec options): making sure codec options are always set via `set_if` (#3910 ) * fix(save codec options): making sure codec options are always safely set through `set_if` * tests(update): updating tests	2026-07-02 15:29:14 +02:00
Caroline Pascal	c746ca2df2	fix(depth unit): adding input depth unit storage in the dataset metadata (#3899 ) * fix(depth unit): storing raw depth units in the dataset metadata for correct depth statistics and depth raw frames handling. The unit is stored as a string ("m","mm") under "depth_unit" at the same level as "is_depth_map". Unit is inferred from the depth frame type. * feat(raw frame unit): adapting dataset reader so that raw depth frames are scaled according to the requested unit * feat(stats units): rescaling stats when loading a dataset so that the stats are given in the requested unit * tests(unit): adapting and extending depth tests to units manipulations * chore(format): formating code * feat(warning): adding a warning when depth unit is not specified in the dataset * chore(infer_depth_unit): moving the depth unit inference utility in a more accessible location * feat(rerun unit): adding correct depth unit display for rerun (foxglove does not support units yet) * feat(unit getter): adding a proper output_depth_unit getter to LeRobotDataset for cleaner integration * fix(streaming dataset): extending support for depth units to streaming datasets * test(rerun): fixing rerun tests	2026-07-02 11:53:13 +02:00
Maxime Ellerbach	a66c7761a5	adjusting test to match expected values	2026-06-25 11:28:32 +00:00
Maxime Ellerbach	14bd51f28f	updating uv lock and linting	2026-06-25 11:23:35 +00:00
Maxime Ellerbach	5ee83f17a1	applying fixes	2026-06-24 09:28:07 +00:00
Gangwei XU	e50308789c	fix(lingbot-va): align RoboTwin evaluation (#3784 ) Thank you for the RoboTwin fix, and alignment!	2026-06-23 17:34:40 +00:00
Pepijn	b5d3a5a5d3	docs(lingbot_va): condense processor normalization comments Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 17:31:37 +00:00
Pepijn	6c1220b8f0	docs(lingbot_va): point checkpoint paths at the lerobot org The LeRobot-format checkpoints moved from pepijn223/* to lerobot/* (libero_long, robotwin, base). Update the eval/train --policy.path examples accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 17:31:36 +00:00
Pepijn	3061ca6661	refactor(lingbot_va): use built-in UnnormalizerProcessorStep for actions Replace the bespoke LingBotVAActionUnnormalizeStep with the standard UnnormalizerProcessorStep in QUANTILES mode, which computes the identical (action + 1) / 2 * (q99 - q01) + q01 mapping. The per-channel q01/q99 are stored as the step's saved state (a safetensors file) and restored on load; a fresh build has no action stats so the step is an identity passthrough. The 3 Hub checkpoints (lerobot/lingbot_va_{libero_long,robotwin,base}) have been re-uploaded with the new post-processor (policy_postprocessor.json + *_unnormalizer_processor.safetensors); reloading from the Hub round-trips q01/q99. - processor_lingbot_va.py: drop the custom step + registry; build the post-processor with UnnormalizerProcessorStep (explicit ACTION->QUANTILES norm_map so the preprocessor / training path is unchanged). - tests: assert the built-in step is used, identity-when-no-stats, correct quantile unnormalization, and a save_pretrained/from_pretrained stats round-trip. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 17:31:35 +00:00
Pepijn	2a7b7ea744	docs(lingbot_va): trim provenance comments; default wan path to base repo - configuration_lingbot_va.py: drop the "──" decorations and the "(from transformer/config.json)" note; default wan_pretrained_path to robbyant/lingbot-va-base (has the frozen vae/text_encoder/tokenizer subfolders). - modeling_lingbot_va.py: remove the vendored-code banner and the "(upstream wan_va/...)" section-header provenance/dash decorations; condense the transformer-dtype comment to one line. No code changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 17:31:35 +00:00
Pepijn	50b20c5bf1	docs(lingbot_va): trim verbose comments - configuration_lingbot_va.py: condense multi-line field comments to one-liners (keep the ── section headers). - processor_lingbot_va.py: shorten the action-quantile explanation block. - modeling_lingbot_va.py: drop the bare "# ----" separator rules, keeping the one-line section headers. No code changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 17:31:34 +00:00
Pepijn	c764afb8ef	refactor(lingbot_va): drop hardcoded action quantiles; source from checkpoint The LIBERO/RoboTwin action (un)normalization quantiles were hardcoded as module constants in processor_lingbot_va.py. They are already serialized into each checkpoint's policy_postprocessor.json (via LingBotVAActionUnnormalizeStep.get_config) and restored on load by PolicyProcessorPipeline.from_pretrained, so the constants are dead at eval/load time for the released checkpoints (verified: libero_long/robotwin/base all carry their quantiles on the Hub). - Remove LIBERO_ACTION_Q01/Q99, ROBOTWIN_ACTION_Q01/Q99 and _default_action_quantiles. - make_lingbot_va_pre_post_processors now defaults a fresh (unconverted) build to a neutral [-1, 1] mapping (identity rescale); real per-benchmark stats come from the saved checkpoint (or postprocessor_overrides), analogous to dataset-stats normalization. - Update the config doc comment to point at the checkpoint as the source of truth. - Tests: replace the LIBERO-default assertion with a neutral-default check, and add a save_pretrained/from_pretrained round-trip guard for the quantile serialization. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 17:31:33 +00:00
Pepijn	fa875eafb7	Update pyproject.toml Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-06-23 17:31:32 +00:00
Pepijn	54e4926312	Update pyproject.toml Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-06-23 17:31:31 +00:00
Pepijn	2471c23af5	Update lingbot_va.mdx Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-06-23 17:31:30 +00:00
pepijn223	5422c99682	docs(lingbot_va): document EEF action-channel schema + camera order Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-23 17:31:30 +00:00
pepijn223	5131e6aa37	fix(lingbot_va): CI quality gate + fast-test collection - Add tests/policies/lingbot_va/__init__.py so the test files don't clash by basename with tests/policies/vla_jepa/* under pytest's default import mode (fast-test collection error). - Fix vendored typos flagged by the typos hook (pach_scale->patch_scale, total_tolen-> total_token_len, stablized->stabilized) and a mypy union-attr in RoboTwinEnv._read_eef_pose. - Apply Prettier formatting to docs/source/lingbot_va.mdx. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-23 17:31:29 +00:00
pepijn223	98ee5cdc22	feat(lingbot_va): implement training / fine-tuning (flow-matching loss) - Implement LingBotVAPolicy.forward(): dual-stream flow-matching training loss (latent + action, timestep-weighted, action-masked) ported from upstream train.py; VAE-encodes camera clips, UMT5-encodes the task, noises both streams, runs the block-causal flex-attention training pass (forward_train). - training_loss_from_streams() core + _build_training_streams() data prep (action scatter into the 30-d space, multi-frame VAE encode incl. robotwin_tshape). - get_optim_params returns only trainable transformer params (LoRA/PEFT friendly); VAE/UMT5 stay frozen. Training needs attn_mode='flex'. - Add a tiny-config single-training-step test (forward->loss->backward->AdamW) and a Training/fine-tuning section in the docs. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-23 17:31:28 +00:00
pepijn223	b81909fc28	feat(lingbot_va): RoboTwin eef-pose eval, single-file model, Hub checkpoints Make the LingBot-VA port runnable on both LIBERO and RoboTwin and clean up the package to LeRobot conventions. - Consolidate all vendored Wan2.2 model code (transformer, attention, VAE helpers, flow-matching scheduler, grid utils, flex-attention) into a single modeling_lingbot_va.py; remove the separate wan_*/schedulers modules. - Move the fixed action (un)normalization quantiles out of the config and into the post-processor (LIBERO 7-DoF + RoboTwin 16-d eef); remove the conversion script in favour of ready-to-use LeRobot-format checkpoints on the Hub. - Fixes found via on-sim validation: undo LIBERO's 180-degree image flip (image_hflip), encode obs as a multi-frame streaming-VAE clip, reset the streaming VAE cache between episodes, run the transformer in config.dtype, lazy-load frozen VAE/UMT5 by subfolder with the text encoder on CPU. - RoboTwin: add an end-effector-pose action mode to RoboTwinEnv (16-d per-arm xyz+quat+gripper deltas composed onto the initial eef pose, executed via CuRobo IK) and the robotwin_tshape latent layout (full-res head + half-res wrists via a second streaming VAE) with the upstream RoboTwin action quantiles + camera mapping. - Predicted-video saving works for both benchmarks; docs + tests updated. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-23 17:31:27 +00:00
Pepijn	d600a52943	feat(policies): add LingBot-VA autoregressive video-action world model Port the LingBot-VA policy (Wan2.2 dual-stream video+action world model) into LeRobot, following the EO-1 / VLA-JEPA conventions. Covers inference, checkpoint conversion, and predicted-video saving (training is deferred to a follow-up PR). - Vendored Wan transformer/attention/flex/VAE/scheduler modules (key names preserved for near-identity conversion); torch SDPA default, flashattn/flex lazy-guarded. - LingBotVAConfig (registered "lingbot_va") + processor with fixed-quantile action unnormalization; full dual-stream sampling loop with CFG, two flow-matching schedulers and KV cache, mapped onto select_action with observed-keyframe feedback. - convert_lingbot_va_checkpoints.py (libero/robotwin variants): bundles the ~5B transformer, lazy-pulls the frozen VAE+UMT5 from the source repo. - Predicted-video plumbing in lerobot_eval (predicted_frames_callback; opt-in via --policy.save_predicted_video) and ConstantWithWarmupSchedulerConfig. - pyproject: widen diffusers-dep to <0.37, add lingbot_va + imageio-dep extras, add lingbot_va and (missing) eo1 to `all`. - Factory + policies/__init__ wiring, docs page + toctree, and tests. Note: the LIBERO success-rate correctness gate must be validated on a CUDA GPU with the converted checkpoint. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-23 17:30:31 +00:00