chore(dependencies): update uv.lock

fix(train): drive Accelerate mixed precision from policy.dtype (#3912)

* fix(train): drive Accelerate mixed precision from policy.dtype

  `accelerator.autocast()` was always a no-op because `mixed_precision` was never set, so `--policy.dtype=bfloat16` only cast the model params (via the policy) while autocast-eligible ops still ran in fp32/tf32.

  Map the active policy's `dtype` onto Accelerate's `mixed_precision` (bfloat16 -> bf16, float16 -> fp16, float32 -> no) so autocast is active for bf16/fp16 and stays full precision for float32. Policies without a string `dtype` field fall back to Accelerate's launcher default, so existing behavior is preserved.

* style(train): condense mixed-precision comment to one line
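
The mapping described in this commit message can be sketched in isolation as below. This is a minimal illustration and not the code in the PR: `resolve_mixed_precision` and `DTYPE_TO_MIXED_PRECISION` are hypothetical names introduced here, and the config object is stood in for by a `SimpleNamespace`. A return value of `None` is meant to be passed through as `mixed_precision=None`, which leaves the choice to Accelerate's launcher default.

```python
from __future__ import annotations

from types import SimpleNamespace

# Hypothetical table mirroring the commit message:
# bfloat16 -> bf16, float16 -> fp16, float32 -> no.
DTYPE_TO_MIXED_PRECISION = {"bfloat16": "bf16", "float16": "fp16", "float32": "no"}


def resolve_mixed_precision(trainable_config: object) -> str | None:
    """Return the Accelerate mixed_precision string for a config, or None.

    None means "no opinion": the caller lets Accelerate fall back to its
    launcher default, which keeps the previous behavior for configs that
    have no string `dtype` field.
    """
    dtype = getattr(trainable_config, "dtype", None)
    if not isinstance(dtype, str):
        return None
    return DTYPE_TO_MIXED_PRECISION.get(dtype)


# Illustrative configs (assumed shapes, not real LeRobot config classes).
bf16_cfg = SimpleNamespace(dtype="bfloat16")
fp32_cfg = SimpleNamespace(dtype="float32")
no_dtype_cfg = SimpleNamespace(device="cpu")

print(resolve_mixed_precision(bf16_cfg))      # bf16
print(resolve_mixed_precision(fp32_cfg))      # no
print(resolve_mixed_precision(no_dtype_cfg))  # None
```

Returning `None` rather than `"no"` for an absent `dtype` is what distinguishes "explicitly full precision" from "unspecified" in this sketch.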
2026-07-03 16:17:15 +00:00 · 2026-07-03 08:28:33 +00:00 · 2026-07-02 19:15:19 +02:00 · 2026-07-02 15:29:14 +02:00 · 2026-07-02 11:53:13 +02:00 · 2026-07-02 11:03:41 +02:00
19 changed files with 1151 additions and 1162 deletions
@@ -6,11 +6,12 @@ Encoding frames into an MP4 is a full FFmpeg pipeline: choice of encoder, pixel

 You can set these parameters from the CLI with `--dataset.rgb_encoder.<field>` (e.g. with `lerobot-record` or `lerobot-rollout`). The same block applies to every camera video stream in that run.

-> [!TIP]
-> Video storage must be on for `rgb_encoder` to have any effect —
-> `use_videos=True` in Python APIs, or `--dataset.video=true` on the CLI (the
-> recording default). With video off, inputs stay as images and `rgb_encoder` is
-> ignored.
+<Tip>
+  Video storage must be on for `rgb_encoder` to have any effect —
+  `use_videos=True` in Python APIs, or `--dataset.video=true` on the CLI (the
+  recording default). With video off, inputs stay as images and `rgb_encoder` is
+  ignored.
+</Tip>

 For details on **when** frames are written vs. encoded (streaming vs. post-episode), queues, and other top-level `--dataset.*` switches, see [Streaming Video Encoding](./streaming_video_encoding). For an encoding-parameter comparison and experiments, see the [video-benchmark Space](https://huggingface.co/spaces/lerobot/video-benchmark).

@@ -42,10 +43,12 @@ lerobot-record \

 ## Tuning parameters

-> [!WARNING]
-> The defaults are tuned to balance **compression ratio**, **visual quality**, and **decoding/seek speed** for typical robotics datasets. Changing them can affect both recording (CPU load, frame drops) and training (decoding throughput, image quality).
->
-> Only override these parameters if you have a specific reason to, and measure the impact on your pipeline before relying on the new settings.
+<Tip warning={true}>
+The defaults are tuned to balance **compression ratio**, **visual quality**, and **decoding/seek speed** for typical robotics datasets. Changing them can affect both recording (CPU load, frame drops) and training (decoding throughput, image quality).
+
+Only override these parameters if you have a specific reason to, and measure the impact on your pipeline before relying on the new settings.
+
+</Tip>

 All flags below are prefixed with `--dataset.rgb_encoder.` on the CLI.

@@ -66,92 +69,25 @@ All flags below are prefixed with `--dataset.rgb_encoder.` on the CLI.

 Depth maps (Intel RealSense, Reachy 2) are stored as their **own video streams** alongside the RGB streams. Raw depth (`uint16` millimetres or `float32` metres) can't survive an 8-bit codec, so LeRobot **quantizes** each map to a 12-bit code (`[0, 4095]`) — logarithmically by default, to match the `1/depth` error profile of depth sensors — then packs it into a high-bit-depth pixel format (`gray12le`) and encodes it with a 12-bit codec.

-<div style="margin:28px 0;padding:14px 0;">
-  <div style="margin:0 auto;display:flex;flex-wrap:wrap;justify-content:center;align-items:stretch;gap:6px;font-family:'Source Sans 3',ui-sans-serif,system-ui,sans-serif;font-size:14px;font-weight:600;color:#1B1B1D;">
-    <span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#DBEAFE;color:#1D4ED8;border-radius:9px;padding:8px 12px;">
-      <span>Raw depth</span>
-      <span style="font-size:11px;font-weight:400;color:#3B6FD4;white-space:nowrap;">
-        uint16 mm
-        <br />
-        float32 m
-      </span>
-    </span>
-    <span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">
-      →
-    </span>
-    <div style="border:2px dashed #C4B5FD;border-radius:13px;padding:18px 12px 12px;position:relative;display:flex;align-items:stretch;gap:6px;">
-      <span style="position:absolute;top:-10px;left:12px;background:#fff;padding:0 6px;font-size:11px;font-weight:700;color:#7E22CE;text-transform:uppercase;letter-spacing:0.5px;white-space:nowrap;">
-        Record time
-      </span>
-      <span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#F3E8FF;color:#7E22CE;border-radius:9px;padding:8px 12px;">
-        <span>Clip</span>
-        <span style="font-size:11px;font-weight:400;color:#9061C2;white-space:nowrap;">
-          to [depth_min,
-          <br />
-          depth_max]
-        </span>
-      </span>
-      <span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">
-        →
-      </span>
-      <span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#F3E8FF;color:#7E22CE;border-radius:9px;padding:8px 12px;">
-        <span>Quantize</span>
-        <span style="font-size:11px;font-weight:400;color:#9061C2;white-space:nowrap;">
-          12-bit codes 0–4095
-          <br />
-          log (default) or linear
-        </span>
-      </span>
-      <span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">
-        →
-      </span>
-      <span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#F3E8FF;color:#7E22CE;border-radius:9px;padding:8px 12px;">
-        <span>Pack</span>
-        <span style="font-size:11px;font-weight:400;color:#9061C2;white-space:nowrap;">
-          into gray12le
-          <br />
-          plane
-        </span>
-      </span>
-      <span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">
-        →
-      </span>
-      <span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#F3E8FF;color:#7E22CE;border-radius:9px;padding:8px 12px;">
-        <span>Encode</span>
-        <span style="font-size:11px;font-weight:400;color:#9061C2;white-space:nowrap;">
-          HEVC
-          <br />
-          Main 12
-        </span>
-      </span>
-    </div>
-    <span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">
-      →
-    </span>
-    <span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#FEF3C7;color:#B45309;border-radius:9px;padding:8px 12px;">
-      <span>MP4</span>
-      <span style="font-size:11px;font-weight:400;color:#C77D18;white-space:nowrap;">
-        stored
-        <br />
-        stream
-      </span>
-    </span>
-    <span style="display:flex;align-items:center;font-size:16px;color:#34A06B;">
-      →
-    </span>
-    <div style="border:2px dashed #6EE7B7;border-radius:13px;padding:18px 12px 12px;position:relative;display:flex;align-items:center;gap:6px;">
-      <span style="position:absolute;top:-10px;left:12px;background:#fff;padding:0 6px;font-size:11px;font-weight:700;color:#047857;text-transform:uppercase;letter-spacing:0.5px;white-space:nowrap;">
-        Load time
-      </span>
-      <span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#D1FAE5;color:#047857;border-radius:9px;padding:8px 12px;">
-        <span>Dequantize</span>
-        <span style="font-size:11px;font-weight:400;color:#059669;white-space:nowrap;">
-          to mm / m
-        </span>
-      </span>
-    </div>
-  </div>
-</div>
+```mermaid
+flowchart LR
+    A["Raw depth (uint16 mm / float32 m)"] --> B["Clip to depth_min, depth_max"]
+    B --> C["Quantize to 12-bit code 0–4095 (log or linear)"]
+    C --> D["Pack into gray12le"]
+    D --> E["Encode video (hevc Main 12)"]
+    E --> F[("MP4 + metadata: depth_min/max, shift, use_log")]
+    F -. "load time (depth_output_unit)" .-> G["Dequantize to mm or m"]
+
+    classDef input fill:#e3f2fd,stroke:#1565c0,color:#0d47a1;
+    classDef encode fill:#ede7f6,stroke:#5e35b1,color:#311b92;
+    classDef store fill:#fff8e1,stroke:#f9a825,color:#e65100;
+    classDef load fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20;
+
+    class A input;
+    class B,C,D,E encode;
+    class F store;
+    class G load;
+```

 Configure the depth pipeline through a parallel **`depth_encoder`** block (`DepthEncoderConfig`). It shares every `RGBEncoderConfig` field (`vcodec`, `pix_fmt`, `crf`, …) and adds four quantizer knobs, set via `--dataset.depth_encoder.<field>`:

@@ -232,81 +168,15 @@ After the first episode of a video stream is encoded, the encoder configuration

 Two sources contribute to the `info` block:

-<div style="display:flex;flex-wrap:wrap;gap:14px;margin:20px 0;font-family:'Source Sans 3',ui-sans-serif,system-ui,sans-serif;">
-  <div style="flex:1 1 280px;border:1px solid #BFDBFE;border-radius:12px;overflow:hidden;">
-    <div style="background:#DBEAFE;color:#1D4ED8;font-weight:700;font-size:14px;padding:8px 14px;">
-      Stream-derived
-    </div>
-    <div style="padding:12px 14px;">
-      <div style="font-size:13px;color:#4B5563;margin-bottom:10px;">
-        Read back from the encoded MP4 with PyAV.
-      </div>
-      <div style="display:flex;flex-wrap:wrap;gap:6px;">
-        <code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.height
-        </code>
-        <code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.width
-        </code>
-        <code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.codec
-        </code>
-        <code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.pix_fmt
-        </code>
-        <code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.fps
-        </code>
-        <code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.channels
-        </code>
-        <code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">
-          is_depth_map
-        </code>
-        <code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">
-          audio.*
-        </code>
-      </div>
-    </div>
-  </div>
-  <div style="flex:1 1 280px;border:1px solid #DDD6FE;border-radius:12px;overflow:hidden;">
-    <div style="background:#F3E8FF;color:#7E22CE;font-weight:700;font-size:14px;padding:8px 14px;">
-      Encoder-derived
-    </div>
-    <div style="padding:12px 14px;">
-      <div style="font-size:13px;color:#4B5563;margin-bottom:10px;">
-        Taken from <code style="font-size:12px;">RGBEncoderConfig</code> /{" "}
-        <code style="font-size:12px;">DepthEncoderConfig</code>.
-      </div>
-      <div style="display:flex;flex-wrap:wrap;gap:6px;">
-        <code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.g
-        </code>
-        <code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.crf
-        </code>
-        <code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.preset
-        </code>
-        <code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.fast_decode
-        </code>
-        <code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.video_backend
-        </code>
-        <code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">
-          video.extra_options
-        </code>
-      </div>
-    </div>
-  </div>
-</div>
+- **Stream-derived** (read back from the encoded MP4 with PyAV): `video.height`, `video.width`, `video.codec`, `video.pix_fmt`, `video.fps`, `video.channels`, `is_depth_map`, plus `audio.*` if an audio stream is present.
+- **Encoder-derived** (taken from `RGBEncoderConfig` or `DepthEncoderConfig`): `video.g`, `video.crf`, `video.preset`, `video.fast_decode`, `video.video_backend`, `video.extra_options`.

-> [!IMPORTANT]
-> This block is populated **once**, from the **first** episode. It assumes every
-> episode in the dataset was encoded with the same `rgb_encoder`. Changing
-> encoder settings partway through a recording is not supported — the
-> `info.json` will only reflect the parameters used for the first episode.
+<Tip>
+  This block is populated **once**, from the **first** episode. It assumes every
+  episode in the dataset was encoded with the same `rgb_encoder`. Changing
+  encoder settings partway through a recording is not supported — the
+  `info.json` will only reflect the parameters used for the first episode.
+</Tip>

 ---

@@ -314,35 +184,5 @@ Two sources contribute to the `info` block:

 When aggregating datasets with `merge_datasets`, video files are concatenated as-is (no re-encoding), and encoder fields in `info.json` are merged per-key:

-<div style="display:flex;flex-direction:column;gap:12px;margin:20px 0;font-family:'Source Sans 3',ui-sans-serif,system-ui,sans-serif;">
-  <div style="display:flex;gap:12px;align-items:flex-start;border-left:3px solid #F87171;background:#FEF2F2;border-radius:0 10px 10px 0;padding:12px 14px;">
-    <span style="flex:none;background:#FEE2E2;color:#B91C1C;font-weight:700;font-size:11px;text-transform:uppercase;letter-spacing:0.4px;border-radius:6px;padding:3px 8px;margin-top:1px;white-space:nowrap;">
-      Must match
-    </span>
-    <span style="font-size:14px;color:#1B1B1D;">
-      Stream-derived fields — <code style="font-size:12px;">video.codec</code>,{" "}
-      <code style="font-size:12px;">video.pix_fmt</code>,{" "}
-      <code style="font-size:12px;">video.height</code>,{" "}
-      <code style="font-size:12px;">video.width</code>,{" "}
-      <code style="font-size:12px;">video.fps</code> — must match across
-      sources, otherwise FFmpeg's concat demuxer fails.
-    </span>
-  </div>
-  <div style="display:flex;gap:12px;align-items:flex-start;border-left:3px solid #34D399;background:#ECFDF5;border-radius:0 10px 10px 0;padding:12px 14px;">
-    <span style="flex:none;background:#D1FAE5;color:#047857;font-weight:700;font-size:11px;text-transform:uppercase;letter-spacing:0.4px;border-radius:6px;padding:3px 8px;margin-top:1px;white-space:nowrap;">
-      Merged loosely
-    </span>
-    <span style="font-size:14px;color:#1B1B1D;">
-      Encoder-tuning fields — <code style="font-size:12px;">video.g</code>,{" "}
-      <code style="font-size:12px;">video.crf</code>,{" "}
-      <code style="font-size:12px;">video.preset</code>,{" "}
-      <code style="font-size:12px;">video.fast_decode</code>,{" "}
-      <code style="font-size:12px;">video.extra_options</code>. If every source
-      agrees, the value is kept; if not, it's set to{" "}
-      <code style="font-size:12px;">null</code> (or{" "}
-      <code style="font-size:12px;">{}</code> for{" "}
-      <code style="font-size:12px;">video.extra_options</code>) and a warning is
-      logged.
-    </span>
-  </div>
-</div>
+- **Stream-derived fields must match** across sources: `video.codec`, `video.pix_fmt`, `video.height`, `video.width`, `video.fps`. Otherwise FFmpeg's concat demuxer fails.
+- **Encoder-tuning fields are merged loosely**: `video.g`, `video.crf`, `video.preset`, `video.fast_decode`, `video.extra_options`. If every source agrees, the value is kept; if not, it's set to `null` (or `{}` for `video.extra_options`) and a warning is logged.
@@ -34,6 +34,8 @@ from .types import (
 )
 from .video import (
    DEFAULT_DEPTH_UNIT,
+    DEPTH_METER_UNIT,
+    DEPTH_MILLIMETER_UNIT,
    VALID_VIDEO_CODECS,
    VIDEO_ENCODER_INFO_KEYS,
    DepthEncoderConfig,
@@ -41,6 +43,7 @@ from .video import (
    VideoEncoderConfig,
    depth_encoder_defaults,
    encoder_config_from_video_info,
+    infer_depth_unit,
    rgb_encoder_defaults,
 )

@@ -70,8 +73,11 @@ __all__ = [
    "depth_encoder_defaults",
    # Factories
    "encoder_config_from_video_info",
+    "infer_depth_unit",
    # Constants
    "DEFAULT_DEPTH_UNIT",
+    "DEPTH_METER_UNIT",
+    "DEPTH_MILLIMETER_UNIT",
    "VALID_VIDEO_CODECS",
    "VIDEO_ENCODER_INFO_KEYS",
 ]
@@ -22,6 +22,8 @@ import logging
 from dataclasses import dataclass, field
 from typing import Any, ClassVar, Self

+import numpy as np
+
 from lerobot.utils.import_utils import require_package

 logger = logging.getLogger(__name__)
@@ -36,7 +38,9 @@ HW_VIDEO_CODECS = [
    "h264_vaapi",  # Linux Intel/AMD
    "h264_qsv",  # Intel Quick Sync
 ]
-VALID_VIDEO_CODECS: frozenset[str] = frozenset({"h264", "hevc", "libsvtav1", "auto", *HW_VIDEO_CODECS})
+VALID_VIDEO_CODECS: frozenset[str] = frozenset(
+    {"h264", "hevc", "libsvtav1", "libaom-av1", "auto", *HW_VIDEO_CODECS}
+)
 # Aliases for legacy video codec names.
 VIDEO_CODECS_ALIASES: dict[str, str] = {"av1": "libsvtav1"}

@@ -65,6 +69,15 @@ DEPTH_METER_UNIT: str = "m"
 DEPTH_MILLIMETER_UNIT: str = "mm"
 DEFAULT_DEPTH_UNIT: str = DEPTH_MILLIMETER_UNIT

+
+def infer_depth_unit(dtype: np.dtype | type) -> str:
+    """Infer the physical unit of raw depth frames from their dtype.
+
+    Floating-point frames are assumed to be in metres, integer frames in millimetres.
+    """
+    return DEPTH_METER_UNIT if np.issubdtype(np.dtype(dtype), np.floating) else DEPTH_MILLIMETER_UNIT
+
+
 # Depth-specific tuning fields persisted under ``features[*]["info"]`` as ``video.<name>``.
 DEPTH_ENCODER_INFO_FIELD_NAMES: frozenset[str] = frozenset({"depth_min", "depth_max", "shift", "use_log"})

@@ -213,18 +226,24 @@ class VideoEncoderConfig:
            if encoder_threads is not None:
                svtav1_parts.append(f"lp={encoder_threads}")
            if svtav1_parts:
-                opts["svtav1-params"] = ":".join(svtav1_parts)
+                set_if("svtav1-params", ":".join(svtav1_parts))
        elif self.vcodec in ("h264", "hevc"):
            set_if("crf", self.crf)
            set_if("preset", self.preset)
            if self.fast_decode:
-                opts["tune"] = "fastdecode"
+                set_if("tune", "fastdecode")
            set_if("threads", encoder_threads)
+        elif self.vcodec == "libaom-av1":
+            set_if("crf", self.crf)
+            set_if("preset", self.preset)
+            if encoder_threads is not None:
+                set_if("threads", encoder_threads)
+                set_if("row-mt", 1)
        elif self.vcodec in ("h264_videotoolbox", "hevc_videotoolbox"):
            if self.crf is not None:
-                opts["q:v"] = max(1, min(100, 100 - self.crf * 2))
+                set_if("q:v", max(1, min(100, 100 - self.crf * 2)))
        elif self.vcodec in ("h264_nvenc", "hevc_nvenc"):
-            opts["rc"] = 0
+            set_if("rc", 0)
            set_if("qp", self.crf)
            set_if("preset", self.preset)
        elif self.vcodec == "h264_vaapi":
@@ -509,7 +509,7 @@ def compute_episode_stats(
        For 'image'/'video' features, stats are computed per channel and kept with a
        leading channel axis (e.g. shape (3, 1, 1) for RGB). RGB stats are divided by
        255 to land in [0, 1]; depth maps (features flagged with ``is_depth_map``) skip
-        this rescaling and remain in their stored units.
+        this rescaling and remain in the unit they were recorded in (see ``depth_unit``).
    """
    if quantile_list is None:
        quantile_list = DEFAULT_QUANTILES
@@ -26,12 +26,13 @@ import pyarrow as pa
 import pyarrow.parquet as pq
 from huggingface_hub import snapshot_download

-from lerobot.configs import VideoEncoderConfig
+from lerobot.configs import DEPTH_METER_UNIT, VideoEncoderConfig
 from lerobot.utils.constants import DEFAULT_FEATURES, HF_LEROBOT_HOME, HF_LEROBOT_HUB_CACHE
 from lerobot.utils.feature_utils import _validate_feature_names
 from lerobot.utils.utils import flatten_dict

 from .compute_stats import aggregate_stats
+from .depth_utils import MM_PER_METRE
 from .feature_utils import create_empty_dataset_info
 from .io_utils import (
    get_file_size_in_mb,
@@ -358,6 +359,35 @@ class LeRobotDatasetMetadata:

        return [key for key, ft in self.features.items() if _is_depth(ft)]

+    def rescale_depth_stats(self, output_unit: str) -> None:
+        """Rescale depth feature stats in place from their recorded unit to ``output_unit``.
+
+        Depth stats are stored in the unit the frames were recorded in
+        (``features[key]["info"]["depth_unit"]``), while frames are returned in
+        ``output_unit`` on read. This converts the unit-bearing stat entries so
+        stats match the frames consumers see.
+        """
+        missing_unit_keys = [
+            key for key in self.depth_keys if (self.features[key].get("info") or {}).get("depth_unit") is None
+        ]
+        if missing_unit_keys:
+            logging.warning(
+                f"Depth feature(s) {missing_unit_keys} have no recorded 'depth_unit' in their info. "
+                f"Depth maps and stats for these keys will be returned AS IS, with no unit conversion "
+                f"to the requested output unit {output_unit!r}. Re-record the dataset or set 'depth_unit' "
+                f"in the feature info (meta/info.json) to enable conversion."
+            )
+        if self.stats is None:
+            return
+        for key in self.depth_keys:
+            stored_unit = (self.features[key].get("info") or {}).get("depth_unit")
+            if stored_unit is None or stored_unit == output_unit or key not in self.stats:
+                continue
+            factor = MM_PER_METRE if stored_unit == DEPTH_METER_UNIT else 1.0 / MM_PER_METRE
+            self.stats[key] = {
+                stat: value if stat == "count" else value * factor for stat, value in self.stats[key].items()
+            }
+
    @property
    def camera_keys(self) -> list[str]:
        """Keys to access visual modalities (regardless of their storage method)."""
@@ -22,10 +22,14 @@ from pathlib import Path
 import datasets
 import torch

-from lerobot.configs import DEFAULT_DEPTH_UNIT, DepthEncoderConfig
+from lerobot.configs import (
+    DEFAULT_DEPTH_UNIT,
+    DEPTH_METER_UNIT,
+    DepthEncoderConfig,
+)

 from .dataset_metadata import LeRobotDatasetMetadata
-from .depth_utils import dequantize_depth
+from .depth_utils import MM_PER_METRE, dequantize_depth
 from .feature_utils import (
    check_delta_timestamps,
    get_delta_indices,
@@ -102,6 +106,13 @@ class DatasetReader:
            for vid_key in self._meta.depth_keys
        }

+        # Get the input unit of each depth feature stored as raw images.
+        self._image_depth_units: dict[str, str | None] = {
+            key: (self._meta.features[key].get("info") or {}).get("depth_unit")
+            for key in self._meta.depth_keys
+            if key in self._meta.image_keys
+        }
+
    def set_image_transforms(self, image_transforms: Callable | None) -> None:
        """Replace the transform applied to visual observations."""
        if image_transforms is not None and not callable(image_transforms):
@@ -329,6 +340,13 @@ class DatasetReader:
                    continue
                item[cam] = self._image_transforms(item[cam])

+        # Convert depth features to the output unit.
+        for key, stored_unit in self._image_depth_units.items():
+            if key in item and stored_unit is not None and stored_unit != self._depth_output_unit:
+                item[key] = (
+                    item[key] * MM_PER_METRE if stored_unit == DEPTH_METER_UNIT else item[key] / MM_PER_METRE
+                )
+
        # Add task as a string
        task_idx = item["task_index"].item()
        item["task"] = self._meta.tasks.iloc[task_idx].name
@@ -36,6 +36,7 @@ from lerobot.configs import (
    RGBEncoderConfig,
    VideoEncoderConfig,
    depth_encoder_defaults,
+    infer_depth_unit,
    rgb_encoder_defaults,
 )

@@ -209,6 +210,15 @@ class DatasetWriter:
        self.episode_buffer["timestamp"].append(timestamp)
        self.episode_buffer["task"].append(frame.pop("task"))

+        # Record each depth feature's input unit once, inferred from the first frame's dtype.
+        if frame_index == 0:
+            for depth_key in self._meta.depth_keys:
+                if depth_key not in frame:
+                    continue
+                info = self._meta.features[depth_key].setdefault("info", {})
+                if info.get("depth_unit") is None:
+                    info["depth_unit"] = infer_depth_unit(np.asarray(frame[depth_key]).dtype)
+
        # Start streaming encoder on first frame of episode
        if frame_index == 0 and self._streaming_encoder is not None:
            self._streaming_encoder.start_episode(
@@ -34,12 +34,13 @@ from lerobot.configs.video import (
    DEPTH_METER_UNIT,
    DEPTH_MILLIMETER_UNIT,
    DEPTH_QMAX,
+    infer_depth_unit,
 )

 from .image_writer import squeeze_single_channel
 from .pyav_utils import write_u16_plane

-_MM_PER_METRE = 1000.0
+MM_PER_METRE = 1000.0
 _UINT16_MAX = 65535


@@ -57,11 +58,7 @@ def _depth_input_to_float32_and_unit(
    input_unit: Literal["auto", DEPTH_METER_UNIT, DEPTH_MILLIMETER_UNIT],
 ) -> tuple[NDArray[np.float32], Literal[DEPTH_METER_UNIT, DEPTH_MILLIMETER_UNIT]]:
    """Convert depth to float32 in the chosen unit, and return the resolved unit."""
-    resolved_unit = (
-        (DEPTH_METER_UNIT if np.issubdtype(depth.dtype, np.floating) else DEPTH_MILLIMETER_UNIT)
-        if input_unit == "auto"
-        else input_unit
-    )
+    resolved_unit = infer_depth_unit(depth.dtype) if input_unit == "auto" else input_unit
    return depth.astype(np.float32, order="K"), resolved_unit


@@ -126,12 +123,12 @@ def quantize_depth(

    # Convert depth_min, depth_max, and shift to the resolved input unit.
    depth_min_u = (
-        np.float32(depth_min) if resolved_unit == DEPTH_METER_UNIT else np.float32(depth_min * _MM_PER_METRE)
+        np.float32(depth_min) if resolved_unit == DEPTH_METER_UNIT else np.float32(depth_min * MM_PER_METRE)
    )
    depth_max_u = (
-        np.float32(depth_max) if resolved_unit == DEPTH_METER_UNIT else np.float32(depth_max * _MM_PER_METRE)
+        np.float32(depth_max) if resolved_unit == DEPTH_METER_UNIT else np.float32(depth_max * MM_PER_METRE)
    )
-    shift_u = np.float32(shift) if resolved_unit == DEPTH_METER_UNIT else np.float32(shift * _MM_PER_METRE)
+    shift_u = np.float32(shift) if resolved_unit == DEPTH_METER_UNIT else np.float32(shift * MM_PER_METRE)

    # Normalization and quantization is performed in the resolved input unit.
    if use_log:
@@ -236,7 +233,7 @@ def dequantize_depth(

        # mm path: round + clamp in float32, skipping the uint16 round-trip
        # when returning a tensor (torch.uint16 is poorly supported).
-        buf.mul_(_MM_PER_METRE).round_().clamp_(0.0, _UINT16_MAX)
+        buf.mul_(MM_PER_METRE).round_().clamp_(0.0, _UINT16_MAX)
        if output_tensor:
            return buf
        return buf.cpu().numpy().astype(np.uint16, copy=False)
@@ -259,7 +256,7 @@ def dequantize_depth(
    if output_unit == DEPTH_METER_UNIT:
        return torch.from_numpy(buf) if output_tensor else buf

-    np.multiply(buf, _MM_PER_METRE, out=buf)
+    np.multiply(buf, MM_PER_METRE, out=buf)
    np.rint(buf, out=buf)
    np.clip(buf, 0.0, _UINT16_MAX, out=buf)
    if output_tensor:
@@ -224,6 +224,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
        )
        self.root = self.meta.root
        self.revision = self.meta.revision
+        self.meta.rescale_depth_stats(self._depth_output_unit)

        if episodes is not None and any(
            episode >= self.meta.total_episodes or episode < 0 for episode in episodes
@@ -350,6 +351,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
        """Frames per second used during data collection."""
        return self.meta.fps

+    @property
+    def depth_output_unit(self) -> str:
+        """Physical unit (``"m"`` or ``"mm"``) depth maps and statistics are returned in on read."""
+        return self._depth_output_unit
+
    @property
    def num_frames(self) -> int:
        """Number of frames in selected episodes."""
@@ -22,11 +22,11 @@ import numpy as np
 import torch
 from datasets import load_dataset

-from lerobot.configs import DEFAULT_DEPTH_UNIT, DepthEncoderConfig
+from lerobot.configs import DEFAULT_DEPTH_UNIT, DEPTH_METER_UNIT, DepthEncoderConfig
 from lerobot.utils.constants import HF_LEROBOT_HOME, LOOKAHEAD_BACKTRACKTABLE, LOOKBACK_BACKTRACKTABLE

 from .dataset_metadata import CODEBASE_VERSION, LeRobotDatasetMetadata
-from .depth_utils import dequantize_depth
+from .depth_utils import MM_PER_METRE, dequantize_depth
 from .feature_utils import get_delta_indices
 from .io_utils import item_to_torch
 from .utils import (
@@ -310,6 +310,7 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
        )
        self.root = self.meta.root
        self.revision = self.meta.revision
+        self.meta.rescale_depth_stats(self._depth_output_unit)
        # Check version
        check_version_compatibility(self.repo_id, self.meta._version, CODEBASE_VERSION)

@@ -318,6 +319,13 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
            for vid_key in self.meta.depth_keys
        }

+        # Input unit of each depth feature stored as raw images (dequantized separately from videos).
+        self._image_depth_units: dict[str, str | None] = {
+            key: (self.meta.features[key].get("info") or {}).get("depth_unit")
+            for key in self.meta.depth_keys
+            if key in self.meta.image_keys
+        }
+
        self.delta_timestamps = None
        self.delta_indices = None

@@ -348,6 +356,11 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
    def fps(self):
        return self.meta.fps

+    @property
+    def depth_output_unit(self) -> str:
+        """Physical unit (``"m"`` or ``"mm"``) depth maps and statistics are returned in on read."""
+        return self._depth_output_unit
+
    @staticmethod
    def _iter_random_indices(
        rng: np.random.Generator, buffer_size: int, random_batch_size=100
@@ -530,6 +543,15 @@ class StreamingLeRobotDataset(torch.utils.data.IterableDataset):
        for update in updates:
            result.update(update)

+        # Convert raw-image depth features to the output unit (video depth is already converted).
+        for key, stored_unit in self._image_depth_units.items():
+            if key in result and stored_unit is not None and stored_unit != self._depth_output_unit:
+                result[key] = (
+                    result[key] * MM_PER_METRE
+                    if stored_unit == DEPTH_METER_UNIT
+                    else result[key] / MM_PER_METRE
+                )
+
        result["task"] = self.meta.tasks.iloc[item["task_index"]].name

        yield result
@@ -84,6 +84,7 @@ import torch
 import torch.utils.data
 import tqdm

+from lerobot.configs import DEPTH_MILLIMETER_UNIT
 from lerobot.datasets import LeRobotDataset
 from lerobot.utils.constants import ACTION, DONE, OBS_STATE, REWARD, SUCCESS
 from lerobot.utils.utils import init_logging
@@ -228,6 +229,9 @@ def visualize_dataset(

    logging.info("Logging to Rerun")

+    # Depth frames and stats are dequantized to the dataset's depth_output_unit on load.
+    depth_meter = 1000.0 if dataset.depth_output_unit == DEPTH_MILLIMETER_UNIT else 1.0
+
    # Use the dataset's q01/q99 depth statistics for robust depth range bounds
    depth_ranges = {}
    for key in dataset.meta.depth_keys:
@@ -254,6 +258,7 @@ def visualize_dataset(
                    depth = to_hwc_float32_numpy(batch[key][i])
                    depth_entity = rr.DepthImage(
                        depth,
+                        meter=depth_meter,
                        colormap=rr.components.Colormap.Viridis,
                        depth_range=depth_ranges.get(key),
                    )
@@ -211,8 +211,12 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        # Accelerate auto-detects the device based on the available hardware and ignores the policy.device setting.
        # Force the device to be CPU when the active config's device is set to CPU (works for both policy and reward model training).
        force_cpu = cfg.trainable_config.device == "cpu"
+        # Drive Accelerate's autocast from policy.dtype (bf16/fp16 -> on, float32 -> off, absent -> launcher default).
+        policy_dtype = getattr(cfg.trainable_config, "dtype", None)
+        mixed_precision = {"bfloat16": "bf16", "float16": "fp16", "float32": "no"}.get(policy_dtype)
        accelerator = Accelerator(
            step_scheduler_with_optimizer=False,
+            mixed_precision=mixed_precision,
            kwargs_handlers=[ddp_kwargs],
            cpu=force_cpu,
        )
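
The dtype lookup in the hunk above can be exercised without Accelerate. This is a minimal sketch under stated assumptions: `resolve_mixed_precision` is a hypothetical wrapper around the two added lines, and `SimpleNamespace` stands in for the real config object. It shows the three outcomes described in the commit message: a mapped string for bf16/fp16, an explicit `"no"` for float32, and `None` when the config has no string `dtype`.

```python
from types import SimpleNamespace

# Same table as in the hunk above.
DTYPE_TO_MIXED_PRECISION = {"bfloat16": "bf16", "float16": "fp16", "float32": "no"}


def resolve_mixed_precision(config):
    """Hypothetical wrapper: returns the value passed as `mixed_precision`."""
    policy_dtype = getattr(config, "dtype", None)
    # Unknown or missing dtypes yield None rather than raising.
    return DTYPE_TO_MIXED_PRECISION.get(policy_dtype)


bf16 = resolve_mixed_precision(SimpleNamespace(dtype="bfloat16"))
fp32 = resolve_mixed_precision(SimpleNamespace(dtype="float32"))
absent = resolve_mixed_precision(SimpleNamespace())
```

Note that `"float32"` and a missing `dtype` are not equivalent: the first forces `"no"`, while the second yields `None`, which per the commit message leaves the choice to Accelerate's launcher default.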
@@ -24,6 +24,7 @@ import os

 import numpy as np

+from lerobot.configs import DEPTH_MILLIMETER_UNIT, infer_depth_unit
 from lerobot.types import RobotAction, RobotObservation

 from .constants import ACTION, ACTION_PREFIX, OBS_PREFIX, OBS_STR
@@ -161,7 +162,13 @@ def log_rerun_data(
                    observation_paths.add(key)
                else:
                    if arr.shape[-1] == 1:
-                        img_entity = rr.DepthImage(arr, colormap=rr.components.Colormap.Viridis)
+                        # At record time, the depth unit is inferred from the frame type.
+                        depth_unit = infer_depth_unit(arr.dtype)
+                        img_entity = rr.DepthImage(
+                            arr,
+                            meter=1000.0 if depth_unit == DEPTH_MILLIMETER_UNIT else 1.0,
+                            colormap=rr.components.Colormap.Viridis,
+                        )
                    else:
                        img_entity = rr.Image(arr).compress() if compress_images else rr.Image(arr)
                    rr.log(key, entity=img_entity, static=True)
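
Rerun's `meter` argument expresses how many native depth units make up one metre, so millimetre frames need `1000.0` and metre frames need `1.0`. The sketch below illustrates that choice. `guess_depth_unit` is a hypothetical stand-in for `infer_depth_unit`, whose exact rule is not shown in this diff; the assumption here, consistent with the tests in this change (float32 gives metres, uint16 gives millimetres), is that floating-point frames are metres and integer frames are millimetres. The constant values are likewise assumed.

```python
import numpy as np

# Assumed values: the commit message says units are stored as "m" and "mm".
DEPTH_METER_UNIT = "m"
DEPTH_MILLIMETER_UNIT = "mm"


def guess_depth_unit(dtype):
    """Hypothetical stand-in for infer_depth_unit (assumed rule, see lead-in)."""
    if np.issubdtype(np.dtype(dtype), np.floating):
        return DEPTH_METER_UNIT
    return DEPTH_MILLIMETER_UNIT


def rerun_meter(dtype):
    """Number of native depth units per metre, as expected by rr.DepthImage(meter=...)."""
    return 1000.0 if guess_depth_unit(dtype) == DEPTH_MILLIMETER_UNIT else 1.0


frame = np.zeros((4, 4, 1), dtype=np.uint16)
meter_for_frame = rerun_meter(frame.dtype)
```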
@@ -1531,6 +1531,7 @@ def test_valid_video_codecs_constant():
    assert "h264" in VALID_VIDEO_CODECS
    assert "hevc" in VALID_VIDEO_CODECS
    assert "libsvtav1" in VALID_VIDEO_CODECS
+    assert "libaom-av1" in VALID_VIDEO_CODECS
    assert "auto" in VALID_VIDEO_CODECS
    assert "h264_videotoolbox" in VALID_VIDEO_CODECS
    assert "h264_nvenc" in VALID_VIDEO_CODECS
@@ -1538,7 +1539,7 @@ def test_valid_video_codecs_constant():
    assert "h264_qsv" in VALID_VIDEO_CODECS
    assert "hevc_videotoolbox" in VALID_VIDEO_CODECS
    assert "hevc_nvenc" in VALID_VIDEO_CODECS
-    assert len(VALID_VIDEO_CODECS) == 10
+    assert len(VALID_VIDEO_CODECS) == 11


 def test_delta_timestamps_with_episodes_filter(tmp_path, empty_lerobot_dataset_factory):
@@ -32,6 +32,7 @@ from lerobot.configs.video import (
 )
 from lerobot.datasets.depth_utils import dequantize_depth, quantize_depth
 from lerobot.datasets.image_writer import image_array_to_pil_image, write_image
+from lerobot.utils.constants import DEFAULT_FEATURES
 from tests.fixtures.constants import (
    DEFAULT_FPS,
    DUMMY_CAMERA_FEATURES,
@@ -245,3 +246,91 @@ class TestFeatureFileRouting:

        dataset.save_episode()
        dataset.finalize()
+
+
+class TestDepthUnitMetadata:
+    """The depth unit is inferred once from dtype, stored in ``info``, and drives stats + reads."""
+
+    NUM_FRAMES = 4
+
+    def _record(self, root, features_factory, depth_dtype, value, use_videos):
+        from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+        features = features_factory(camera_features=DUMMY_CAMERA_FEATURES_WITH_DEPTH, use_videos=use_videos)
+        dataset = LeRobotDataset.create(
+            repo_id=DUMMY_REPO_ID,
+            fps=DEFAULT_FPS,
+            features=features,
+            root=root,
+            use_videos=use_videos,
+            streaming_encoding=use_videos,
+        )
+        for _ in range(self.NUM_FRAMES):
+            frame: dict = {"task": "test"}
+            for key, ft in dataset.meta.features.items():
+                if key in DEFAULT_FEATURES:
+                    continue
+                if key in dataset.meta.depth_keys:
+                    frame[key] = np.full(ft["shape"], value, dtype=depth_dtype)
+                elif key in dataset.meta.camera_keys:
+                    frame[key] = np.random.randint(0, 256, ft["shape"], dtype=np.uint8)
+                else:
+                    frame[key] = np.zeros(ft["shape"], dtype=np.float32)
+            dataset.add_frame(frame)
+        return dataset
+
+    @pytest.mark.parametrize("use_videos", [False, True])
+    @pytest.mark.parametrize(
+        ("depth_dtype", "value", "expected_unit"),
+        [(np.float32, 2.0, DEPTH_METER_UNIT), (np.uint16, 2000, DEPTH_MILLIMETER_UNIT)],
+    )
+    def test_recorded_unit_inferred_persisted_and_kept_in_stats(
+        self, tmp_path, features_factory, use_videos, depth_dtype, value, expected_unit
+    ):
+        """Unit is inferred from the first frame's dtype, drives stats (raw, never canonicalized), and survives a reload."""
+        from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+        dataset = self._record(tmp_path / "ds", features_factory, depth_dtype, value, use_videos)
+        assert dataset.meta.features[DEPTH_KEY]["info"]["depth_unit"] == expected_unit
+        dataset.save_episode()
+        mean = float(np.asarray(dataset.meta.stats[DEPTH_KEY]["mean"]).reshape(-1)[0])
+        np.testing.assert_allclose(mean, value, rtol=0.05)
+        dataset.finalize()
+
+        reloaded = LeRobotDataset(repo_id=DUMMY_REPO_ID, root=tmp_path / "ds")
+        assert reloaded.meta.features[DEPTH_KEY]["info"]["depth_unit"] == expected_unit
+
+    @pytest.mark.parametrize("use_videos", [False, True])
+    @pytest.mark.parametrize(
+        ("output_unit", "expected"),
+        [(DEPTH_MILLIMETER_UNIT, 2000.0), (DEPTH_METER_UNIT, 2.0)],
+    )
+    def test_read_honors_output_unit_for_frames_and_stats(
+        self, tmp_path, features_factory, use_videos, output_unit, expected
+    ):
+        """Reloading with a ``depth_output_unit`` converts metre frames (image mode) and rescales stats while preserving count."""
+        from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+        dataset = self._record(tmp_path / "ds", features_factory, np.float32, 2.0, use_videos=use_videos)
+        dataset.save_episode()
+        count = float(np.asarray(dataset.meta.stats[DEPTH_KEY]["count"]).reshape(-1)[0])
+        dataset.finalize()
+
+        read_dataset = LeRobotDataset(
+            repo_id=DUMMY_REPO_ID, root=tmp_path / "ds", depth_output_unit=output_unit
+        )
+        stats = read_dataset.meta.stats[DEPTH_KEY]
+        np.testing.assert_allclose(float(np.asarray(stats["mean"]).reshape(-1)[0]), expected, rtol=0.05)
+        np.testing.assert_allclose(float(np.asarray(stats["count"]).reshape(-1)[0]), count)
+
+        if not use_videos:
+            depth = read_dataset[0][DEPTH_KEY]
+            assert torch.allclose(depth, torch.full_like(depth, expected))
+
+            from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
+
+            stream_dataset = StreamingLeRobotDataset(
+                repo_id=DUMMY_REPO_ID, root=tmp_path / "ds", depth_output_unit=output_unit
+            )
+            stream_depth = next(iter(stream_dataset))[DEPTH_KEY]
+            assert torch.allclose(stream_depth, torch.full_like(stream_depth, expected))
@@ -345,7 +345,9 @@ class TestExtraOptions:
        opts = cfg.get_codec_options()
        assert opts["qp"] == 20
        assert isinstance(opts["qp"], int)
-        assert cfg.get_codec_options(as_strings=True)["qp"] == "20"
+        str_opts = cfg.get_codec_options(as_strings=True)
+        assert str_opts["qp"] == "20"
+        assert all(isinstance(v, str) for v in str_opts.values())

    @require_libsvtav1
    def test_structured_fields_win_on_collision(self):
@@ -26,6 +26,7 @@ import pytest
 import torch
 from datasets import Dataset

+from lerobot.configs.video import infer_depth_unit
 from lerobot.datasets.dataset_metadata import CODEBASE_VERSION, LeRobotDatasetMetadata
 from lerobot.datasets.feature_utils import get_hf_features_from_features
 from lerobot.datasets.io_utils import flatten_dict, hf_transform_to_torch
@@ -535,6 +536,13 @@ def lerobot_dataset_factory(
                chunks_size=chunks_size,
                **info_kwargs,
            )
+            # This synthetic path skips add_frame, so record the depth unit the writer would
+            # have stored (dummy depth is uint16) to keep ``depth_unit`` present in info.json.
+            # Reassign a fresh info dict to avoid mutating the shared feature constants.
+            for ft in info.features.values():
+                ft_info = ft.get("info")
+                if ft_info is not None and ft_info.get("is_depth_map") and "depth_unit" not in ft_info:
+                    ft["info"] = {**ft_info, "depth_unit": infer_depth_unit(np.uint16)}
        if stats is None:
            stats = stats_factory(features=info.features)
        if tasks is None:
@@ -50,8 +50,9 @@ def mock_rerun(monkeypatch):
            return self

    class DummyDepthImage:
-        def __init__(self, arr, colormap=None):
+        def __init__(self, arr, meter=None, colormap=None):
            self.arr = arr
+            self.meter = meter
            self.colormap = colormap

    def dummy_log(key, obj=None, **kwargs):
Author	SHA1	Message	Date
github-actions[bot]	62067f8eb9	chore(dependencies): update uv.lock	2026-07-03 08:28:33 +00:00
Pepijn	07285677a3	fix(train): drive Accelerate mixed precision from policy.dtype (#3912) * fix(train): drive Accelerate mixed precision from policy.dtype `accelerator.autocast()` was always a no-op because `mixed_precision` was never set, so `--policy.dtype=bfloat16` only cast the model params (via the policy) while autocast-eligible ops still ran in fp32/tf32. Map the active policy's `dtype` onto Accelerate's `mixed_precision` (bfloat16 -> bf16, float16 -> fp16, float32 -> no) so autocast is active for bf16/fp16 and stays full precision for float32. Policies without a string `dtype` field fall back to Accelerate's launcher default, so existing behavior is preserved. * style(train): condense mixed-precision comment to one line	2026-07-02 19:15:19 +02:00
Caroline Pascal	7ae12124b0	fix(save codec options): making sure codec options are always set via `set_if` (#3910) * fix(save codec options): making sure codec options are always safely set through `set_if` * tests(update): updating tests	2026-07-02 15:29:14 +02:00
Caroline Pascal	c746ca2df2	fix(depth unit): adding input depth unit storage in the dataset metadata (#3899) * fix(depth unit): storing raw depth units in the dataset metadata for correct depth statistics and depth raw frames handling. The unit is stored as a string ("m","mm") under "depth_unit" at the same level as "is_depth_map". Unit is inferred from the depth frame type. * feat(raw frame unit): adapting dataset reader so that raw depth frames are scaled according to the requested unit * feat(stats units): rescaling stats when loading a dataset so that the stats are given in the requested unit * tests(unit): adapting and extending depth tests to units manipulations * chore(format): formating code * feat(warning): adding a warning when depth unit is not specified in the dataset * chore(infer_depth_unit): moving the depth unit inference utility in a more accessible location * feat(rerun unit): adding correct depth unit display for rerun (foxglove does not support units yet) * feat(unit getter): adding a proper output_depth_unit getter to LeRobotDataset for cleaner integration * fix(streaming dataset): extending support for depth units to streaming datasets * test(rerun): fixing rerun tests	2026-07-02 11:53:13 +02:00
Caroline Pascal	b961d2a8c5	feat(libaom-av1): adding support for libaom-av1 codec (#3898)	2026-07-02 11:03:41 +02:00