feat(depth maps): adding support for depth in LeRobot (#3644)

* feat(depth): add depth quantization helpers and tests
* feat(video): add ffv1 to supported codecs
* feat(depth): persist depth metadata
* feat(depth): extend quantization tools to better fit the encoding/decoding pipeline
* feat(depth): plumb DepthEncoderConfig through LeRobotDataset and DatasetWriter
* feat(depth): wire StreamingVideoEncoder + writer to depth encoder
* feat(depth): wire DatasetReader to decode_depth_frames
* feat(cameras/realsense): expose async depth in metric meters
* feat(features): route 2D camera shapes to observation.depth.<key>
* feat(robots/so_follower): emit + populate depth keys when use_depth
* feat(record): plumb DepthEncoderConfig through lerobot-record
* feat(viz): render depth observations as rr.DepthImage in Viridis
* feat(depth maps writer): adding support for raw depth maps recording with image writer
* chore(format): format code
* feat(depth shape): ensuring depth map shapes always include the channel
* feat(is_depth): simplifying is_depth nested name + legacy support
* fix(stop_event): fixing stop_event race condition in camera classes
* fix(plumbing): fixing missing parts in the depth maps pipeline
* chore(typos): fixing typos
* test(fix): fixing existing tests to still work with latest features
* tests(depth): adding new tests for depth integration validation
* feat(pix_fmt channels): use PyAV to get the number of channels of a pixel format
* feat(refactor): refactor DepthEncoderConfig quantization pipeline, so that the methods do not live in the config class. Add pixel format - channels validation. Move the default pixel format for depth into the config file.
* fix(pre-commit): fixing mutable default value
* fix(info): fixing info metadata update when is_depth_map was set
* tests(typos): fixing typos in tests
* fix(realsense): fixing typo in realsense serial number
* fix(normalization): restricting 255 normalization to non depth/uint8 images only
* fix(typo): fixing typo
* fix(TIFF): add missing quantization and cleanup for TIFF files
* feat(batched dequantization): optimizing dequantize_depth for torch based batched dequantization
* feat(tools): adding depth support in LeRobotDataset edition tools
* test(aggregate): extending aggregation tests to depth frames
* test(cleaning): cleaning up tests
* fix(from_video_info): fixing early validation issue in from_video_info
* fix(typo): fixing typo
* fix(is_depth): adding missing docstrings and is_depth arguments in video decoding functions
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
* fix(depth units): fixing depth units output for the realsense cameras
* feat(output unit): adding support for output unit specification at dataset reading/training time
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
* test(depth): cleaning up depth tests
* test(depth encoding): updating and cleaning video/depth encoding tests
* chore(format): formatting code
* docs(depth): improving depth maps docs
* test(fix): fixing depth tests
* test(dataset tools): adding missing tests for new dataset edition tools features
* chore(format): formatting code
* fix(pyav check): fixing PyAV option validation for integer codec options by normalizing numeric values before calling `is_integer()`
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
* docs(mermaid): fixing mermaid diagram
* fix(rebase): rebase follow up corrections
* feat(dataset tools): adding missing docstrings and features for depth fill support in dataset edition tools
* docs(docstring): updating docstrings
* docs(dataset tools): updating docs
* fix(save images): fixing image saving in dataset tools
* fix(update video info): fixing update video info logic to match the recording and editing use cases
* test(reencode): fixing reencoding monkeypatch
* fix(review): add Claude review
* chore(format): format code
* fix(update video info): ditching the differentiated approaches for video info update - video info is always updated, except for preserved keys.
* chore(rebase): fixing rebase merge conflicts
* test(visualization): fixing visualization tests
* feat(docstrings): adding explicit docstring for encoding parameters. Docstrings will now show up as description in the CLI --help.
* feat(mm as default): adding a global DEFAULT_DEPTH_UNIT variable setting mm as default depth unit
* fix(RGB <-> camera): renaming camera_encoder to rgb_encoder for clarity
* chore(TODO): removing deprecated TODO
* doc(write_u16_plane): improving docstrings for write_u16_plane
* feat(units): adding constants for depth frames units (m and mm)
* fix(spam): replacing spamming warning with a debug log
* feat(legacy metadata): adding automatic metadata update for legacy 'video.is_depth_map' feature
* fix(copy&reindex): fixing metadata reshaping for single channel frames
* fix(ImageNet): excluding depth frames from ImageNet stats
* fix(PyAV container seek): fixing initial PyAV container seek to be robust against codec choice
* feat(lerobot-dataset-viz): adding support for depth in lerobot-dataset-viz
* fix(compress): removing rerun compression for DepthImages
* fix(single channel squeeze): fixing single channel squeezing
* chore(format): format code
* fix(streaming): adding support for dequantization in streaming_dataset.py
* refactor(read depth): factorizing depth reading methods for realsense camera and adding support for depth-only usage
* chore(renaming): fixing missed RGBEncoderConfig renamings
* docs(renaming): reflecting renamings in a clearer way in the docs
* chore(annotation): excluding depth from the annotation pipeline
* feat(robots): adding depth support in compatible follower robots
* feat(LeSadKiwi): excluding LeKiwi from depth support (for now)
* chore(fail): removing misplaced file
* chore(fail): removing misplaced file
* fix(remove ffv1): removing ffv1 as it does not support MP4
* docs(cheat sheet): adding depth and video encoding to the cheat sheet
* fix(lossless): tuning depth encoding parameters for lossless depth storage
* test(fix): fixing failing tests
* depth(ZMQ): excluding ZMQ from depth support
* Revert "depth(ZMQ): excluding ZMQ from depth support"
This reverts commit b95cf4e4c2.
* fix(image transforms): excluding depth frames from images transforms
* fix(typo): typo
* fix(stats): fixing stats computation for depth frames
* fix(TIFF vs. pytorch): adding an extra uint16 to float32 conversion for depth maps stored as raw TIFF images
* fix(typos): fixing typos
* test(dtype): fixing stats computation typing tests

---------

Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Wensi Ai <wsai@stanford.edu>
2026-06-30 22:57:00 +00:00 · 2026-06-27 14:21:21 +02:00
parent 6a788fbdb0
commit 3dd19d043e
69 changed files with 2740 additions and 679 deletions
@@ -157,6 +157,14 @@ finally:
 </hfoption>
 </hfoptions>

+### Working with depth
+
+The Intel RealSense and Reachy 2 cameras can capture both color and depth in lockstep. Calling `read()` returns the **color** frame as `(H, W, 3)` `uint8`. Calling `read_depth()` returns the **depth map** as `(H, W, 1)` `uint16`, where each pixel value is the distance from the sensor expressed in **millimetres**. A pixel value of `0` typically means "no measurement available" (out-of-range, occluded, or low-confidence).
+
+During recording, the control loop reads the most recent buffered frames without blocking, via `read_latest()` (color) and `read_latest_depth()` (depth), and stores the depth map as a sibling feature (e.g. `front_depth` next to `front`).
+
+For how depth streams are stored and encoded when recording a dataset, see the [Depth streams](./video_encoding_parameters#depth-streams) section of the video encoding guide.
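
To make the depth convention above concrete, here is a minimal Python sketch of turning raw depth values (as returned by `read_depth()`) into metres while skipping invalid pixels. The helper names `depth_mm_to_m` and `valid_depths_m` are hypothetical and not part of LeRobot. The sketch works on plain Python lists so it runs without a camera; real code would more likely apply the same logic to the whole `(H, W, 1)` array at once.

```python
# Hypothetical helpers for illustration only; not part of the LeRobot API.
INVALID_DEPTH = 0  # a raw value of 0 typically means "no measurement available"


def depth_mm_to_m(raw_mm):
    """Convert one raw uint16 depth value (millimetres) to metres.

    Returns None for invalid pixels rather than a misleading 0.0 m.
    """
    if raw_mm == INVALID_DEPTH:
        return None
    return raw_mm / 1000.0


def valid_depths_m(depth_row):
    """Convert a row of raw depth values to metres, dropping invalid pixels."""
    return [depth_mm_to_m(v) for v in depth_row if v != INVALID_DEPTH]


row = [0, 500, 1250, 0, 3000]  # example raw values in millimetres
print(valid_depths_m(row))  # [0.5, 1.25, 3.0]
```

Keeping invalid pixels out of the conversion matters because a literal `0.0` m would otherwise look like an object touching the sensor.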
+
 ## Use your phone's camera

 <hfoptions id="use phone">
@@ -89,6 +89,36 @@ Control the data recording flow using keyboard shortcuts:
 - Press **Left Arrow (`←`)**: Delete current episode and retry.
 - Press **Escape (`ESC`)**: Stop, encode videos, and upload.

+### Recording depth
+
+Intel RealSense cameras (`type: intelrealsense`) record a depth stream when you set `use_depth: true`. Depth is quantized to 12-bit codes and stored as its own video.
+
+```bash
+lerobot-record \
+    ... \
+    --robot.cameras="{ head: {type: intelrealsense, serial_number_or_name: \"0123456789\", width: 640, height: 480, fps: 30, use_depth: true} }" \
+    --dataset.repo_id=${HF_USER}/so101_depth_test \
+    --dataset.single_task="put the red brick in a bowl" \
+    --dataset.depth_encoder.depth_min=0.01 \
+    --dataset.depth_encoder.depth_max=10.0 \
+    --dataset.depth_encoder.shift=0.0 \
+    --dataset.depth_encoder.use_log=true
+```
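
As a rough illustration of what the `depth_encoder` quantization flags control, the sketch below maps a depth in metres to a 12-bit code and back. The formula is an assumption made for illustration. In particular, `shift` is treated as an offset added before taking the logarithm. LeRobot's actual implementation may differ, and the function names are hypothetical.

```python
import math

MAX_CODE = 4095  # largest 12-bit code


def quantize_depth(depth_m, depth_min=0.01, depth_max=10.0, shift=0.0, use_log=True):
    """Map a depth in metres to an integer code in [0, 4095] (assumed formula)."""
    d = min(max(depth_m, depth_min), depth_max)  # clip to the configured range
    if use_log:
        lo = math.log(depth_min + shift)
        hi = math.log(depth_max + shift)
        t = (math.log(d + shift) - lo) / (hi - lo)
    else:
        t = (d - depth_min) / (depth_max - depth_min)
    return round(t * MAX_CODE)


def dequantize_depth(code, depth_min=0.01, depth_max=10.0, shift=0.0, use_log=True):
    """Invert quantize_depth, returning an approximate depth in metres."""
    t = code / MAX_CODE
    if use_log:
        lo = math.log(depth_min + shift)
        hi = math.log(depth_max + shift)
        return math.exp(lo + t * (hi - lo)) - shift
    return depth_min + t * (depth_max - depth_min)


code = quantize_depth(1.0)
print(code, dequantize_depth(code))  # a code near the upper third and a depth close to 1.0 m
```

With a logarithmic mapping the step between adjacent codes grows in proportion to depth, so near-range measurements keep finer absolute resolution than far-range ones.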
+
+### Video encoding parameters
+
+RGB and depth streams are encoded independently via the `--dataset.rgb_encoder.*` and `--dataset.depth_encoder.*` keys.
+
+```bash
+lerobot-record \
+    ... \
+    --dataset.rgb_encoder.vcodec=h264 \
+    --dataset.rgb_encoder.pix_fmt=yuv420p \
+    --dataset.rgb_encoder.crf=23 \
+    --dataset.depth_encoder.vcodec=hevc \
+    --dataset.depth_encoder.extra_options='{"x265-params": "lossless=1"}'
+```
+
 ### Training

 Depending on your hardware, training the policy might take a few hours. Here is how to train a simple `ACT` policy:
@@ -194,7 +194,7 @@ lerobot-record \
    --dataset.single_task="Navigate around obstacles" \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    # --dataset.camera_encoder.vcodec=auto \
+    # --dataset.rgb_encoder.vcodec=auto \
    --display_data=true
 ```

@@ -124,7 +124,7 @@ lerobot-rollout\
  --dataset.single_task="Grab and handover the red cube to the other arm" \
  --dataset.streaming_encoding=true \
  --dataset.encoder_threads=2 \
-  # --dataset.camera_encoder.vcodec=auto \
+  # --dataset.rgb_encoder.vcodec=auto \
  --policy.path=<user>/groot-bimanual \ # your trained model
  --duration=600
 ```
@@ -232,7 +232,7 @@ lerobot-record \
    --dataset.private=true \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    # --dataset.camera_encoder.vcodec=auto \
+    # --dataset.rgb_encoder.vcodec=auto \
    --display_data=true
 ```

@@ -278,6 +278,6 @@ lerobot-record \
  --dataset.num_episodes=10 \
  --dataset.streaming_encoding=true \
  --dataset.encoder_threads=2 \
-  # --dataset.camera_encoder.vcodec=auto \
+  # --dataset.rgb_encoder.vcodec=auto \
  --policy.path=outputs/train/hopejr_hand/checkpoints/last/pretrained_model
 ```
@@ -207,7 +207,7 @@ lerobot-record \
    --dataset.num_episodes=5 \
    --dataset.single_task="Grab the black cube" \
    --dataset.streaming_encoding=true \
-    # --dataset.camera_encoder.vcodec=auto \
+    # --dataset.rgb_encoder.vcodec=auto \
    --dataset.encoder_threads=2
 ```
 </hfoption>
@@ -44,7 +44,7 @@ lerobot-record \
  --dataset.num_episodes=5 \
  --dataset.single_task="Grab the black cube" \
  --dataset.streaming_encoding=true \
-  # --dataset.camera_encoder.vcodec=auto \
+  # --dataset.rgb_encoder.vcodec=auto \
  --dataset.encoder_threads=2
 ```

@@ -161,7 +161,7 @@ lerobot-record \
    --dataset.private=true \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    # --dataset.camera_encoder.vcodec=auto \
+    # --dataset.rgb_encoder.vcodec=auto \
    --display_data=true
 ```

@@ -203,7 +203,7 @@ lerobot-record \
    --dataset.private=true \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    # --dataset.camera_encoder.vcodec=auto \
+    # --dataset.rgb_encoder.vcodec=auto \
    --display_data=true
 ```

@@ -17,7 +17,7 @@ This makes `save_episode()` near-instant (the video is already encoded by the ti
 | Parameter               | CLI Flag                          | Type          | Default       | Description                                                       |
 | ----------------------- | --------------------------------- | ------------- | ------------- | ----------------------------------------------------------------- |
 | `streaming_encoding`    | `--dataset.streaming_encoding`    | `bool`        | `True`        | Enable real-time encoding during capture                          |
-| `vcodec`                | `--dataset.camera_encoder.vcodec` | `str`         | `"libsvtav1"` | Video codec. `"auto"` detects best HW encoder                     |
+| `vcodec`                | `--dataset.rgb_encoder.vcodec`    | `str`         | `"libsvtav1"` | Video codec. `"auto"` detects best HW encoder                     |
 | `encoder_threads`       | `--dataset.encoder_threads`       | `int \| None` | `None` (auto) | Threads per encoder instance. `None` will let the codec decide    |
 | `encoder_queue_maxsize` | `--dataset.encoder_queue_maxsize` | `int`         | `30`          | Max buffered frames per camera (~1s at 30fps). Consumes RAM       |

@@ -82,15 +82,15 @@ Use HW encoding when:

 ### Available HW Encoders

-| Encoder             | Platform      | Hardware                                                                                         | CLI Value                                           |
-| ------------------- | ------------- | ------------------------------------------------------------------------------------------------ | --------------------------------------------------- |
-| `h264_videotoolbox` | macOS         | Apple Silicon / Intel                                                                            | `--dataset.camera_encoder.vcodec=h264_videotoolbox` |
-| `hevc_videotoolbox` | macOS         | Apple Silicon / Intel                                                                            | `--dataset.camera_encoder.vcodec=hevc_videotoolbox` |
-| `h264_nvenc`        | Linux/Windows | NVIDIA GPU                                                                                       | `--dataset.camera_encoder.vcodec=h264_nvenc`        |
-| `hevc_nvenc`        | Linux/Windows | NVIDIA GPU                                                                                       | `--dataset.camera_encoder.vcodec=hevc_nvenc`        |
-| `h264_vaapi`        | Linux         | Intel/AMD GPU                                                                                    | `--dataset.camera_encoder.vcodec=h264_vaapi`        |
-| `h264_qsv`          | Linux/Windows | Intel Quick Sync                                                                                 | `--dataset.camera_encoder.vcodec=h264_qsv`          |
-| `auto`              | Any           | Probes the system for available HW encoders. Falls back to `libsvtav1` if no HW encoder is found | `--dataset.camera_encoder.vcodec=auto`              |
+| Encoder             | Platform      | Hardware                                                                                         | CLI Value                                        |
+| ------------------- | ------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------------------ |
+| `h264_videotoolbox` | macOS         | Apple Silicon / Intel                                                                            | `--dataset.rgb_encoder.vcodec=h264_videotoolbox` |
+| `hevc_videotoolbox` | macOS         | Apple Silicon / Intel                                                                            | `--dataset.rgb_encoder.vcodec=hevc_videotoolbox` |
+| `h264_nvenc`        | Linux/Windows | NVIDIA GPU                                                                                       | `--dataset.rgb_encoder.vcodec=h264_nvenc`        |
+| `hevc_nvenc`        | Linux/Windows | NVIDIA GPU                                                                                       | `--dataset.rgb_encoder.vcodec=hevc_nvenc`        |
+| `h264_vaapi`        | Linux         | Intel/AMD GPU                                                                                    | `--dataset.rgb_encoder.vcodec=h264_vaapi`        |
+| `h264_qsv`          | Linux/Windows | Intel Quick Sync                                                                                 | `--dataset.rgb_encoder.vcodec=h264_qsv`          |
+| `auto`              | Any           | Probes the system for available HW encoders. Falls back to `libsvtav1` if no HW encoder is found | `--dataset.rgb_encoder.vcodec=auto`              |

 > [!NOTE]
 > In order to use the HW accelerated encoders you might need to upgrade your GPU drivers.
@@ -100,15 +100,15 @@ Use HW encoding when:

 ## 5. Troubleshooting

-| Symptom                                                            | Likely Cause                                 | Fix                                                                                                                                                                                                                                                                                                 |
-| ------------------------------------------------------------------ | -------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| System freezes or choppy robot movement or Rerun visualization lag | CPU starved (100% load usage)                | Close other apps, reduce encoding throughput, lower `encoder_threads`, use `h264`, use `display_data=False`. If the CPU continues to be at 100% then it might be insufficient for your setup, consider `--dataset.streaming_encoding=false` or HW encoding (`--dataset.camera_encoder.vcodec=auto`) |
-| "Encoder queue full" warnings or dropped frames in dataset         | Encoder can't keep up (Queue overflow)       | If CPU is not at 100%: Increase `encoder_threads`, increase `encoder_queue_maxsize` or use HW encoding (`--dataset.camera_encoder.vcodec=auto`).                                                                                                                                                    |
-| High RAM usage                                                     | Queue filling faster than encoding           | `encoder_threads` too low or CPU insufficient. Reduce `encoder_queue_maxsize` or use HW encoding                                                                                                                                                                                                    |
-| Large video files                                                  | Using HW encoder or H.264                    | Expected trade-off. Switch to `libsvtav1` if CPU allows                                                                                                                                                                                                                                             |
-| `save_episode()` still slow                                        | `streaming_encoding` is `False`              | Set `--dataset.streaming_encoding=true`                                                                                                                                                                                                                                                             |
-| Encoder thread crash                                               | Codec not available or invalid settings      | Check `vcodec` is installed, try `--dataset.camera_encoder.vcodec=auto`                                                                                                                                                                                                                             |
-| Recorded dataset is missing frames                                 | CPU/GPU starvation or occasional load spikes | If ~5% of frames are missing, your system is likely overloaded — follow the recommendations above. If fewer frames are missing (~2%), they are probably due to occasional transient load spikes (often at startup) and can be considered expected.                                                  |
+| Symptom                                                            | Likely Cause                                 | Fix                                                                                                                                                                                                                                                                                              |
+| ------------------------------------------------------------------ | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| System freezes or choppy robot movement or Rerun visualization lag | CPU starved (100% load usage)                | Close other apps, reduce encoding throughput, lower `encoder_threads`, use `h264`, use `display_data=False`. If the CPU continues to be at 100% then it might be insufficient for your setup, consider `--dataset.streaming_encoding=false` or HW encoding (`--dataset.rgb_encoder.vcodec=auto`) |
+| "Encoder queue full" warnings or dropped frames in dataset         | Encoder can't keep up (Queue overflow)       | If CPU is not at 100%: Increase `encoder_threads`, increase `encoder_queue_maxsize` or use HW encoding (`--dataset.rgb_encoder.vcodec=auto`).                                                                                                                                                    |
+| High RAM usage                                                     | Queue filling faster than encoding           | `encoder_threads` too low or CPU insufficient. Reduce `encoder_queue_maxsize` or use HW encoding                                                                                                                                                                                                 |
+| Large video files                                                  | Using HW encoder or H.264                    | Expected trade-off. Switch to `libsvtav1` if CPU allows                                                                                                                                                                                                                                          |
+| `save_episode()` still slow                                        | `streaming_encoding` is `False`              | Set `--dataset.streaming_encoding=true`                                                                                                                                                                                                                                                          |
+| Encoder thread crash                                               | Codec not available or invalid settings      | Check `vcodec` is installed, try `--dataset.rgb_encoder.vcodec=auto`                                                                                                                                                                                                                             |
+| Recorded dataset is missing frames                                 | CPU/GPU starvation or occasional load spikes | If ~5% of frames are missing, your system is likely overloaded — follow the recommendations above. If fewer frames are missing (~2%), they are probably due to occasional transient load spikes (often at startup) and can be considered expected.                                               |

 ## 6. Recommended Configurations

@@ -146,7 +146,7 @@ On very constrained systems, streaming encoding may compete too heavily with the
 # 2camsx 640x480x3 @30fps: Requires some tuning.

 # Use H.264, disable streaming, consider batching encoding
-lerobot-record --dataset.camera_encoder.vcodec=h264 --dataset.streaming_encoding=false ...
+lerobot-record --dataset.rgb_encoder.vcodec=h264 --dataset.streaming_encoding=false ...
 ```

 ## 7. Closing note
@@ -11,8 +11,9 @@ LeRobot provides several utilities for manipulating datasets:
 3. **Merge Datasets** - Combine multiple datasets into one. The datasets must have identical features, and episodes are concatenated in the order specified in `repo_ids`
 4. **Add Features** - Add new features to a dataset
 5. **Remove Features** - Remove features from a dataset
-6. **Convert to Video** - Convert image-based datasets to video format for efficient storage
-7. **Show the Info of Datasets** - Show the summary of datasets information such as number of episode etc.
+6. **Convert to Video** - Convert image-based datasets to video format for efficient storage (RGB and depth cameras are encoded with separate encoders)
+7. **Re-encode Videos** - Re-encode an existing video dataset's RGB and/or depth streams with new encoder settings
+8. **Show the Info of Datasets** - Show a summary of dataset information, such as the number of episodes.

 The core implementation is in `lerobot.datasets.dataset_tools`.
 An example script detailing how to use the tools API is available in `examples/dataset/use_dataset_tools.py`.
@@ -117,10 +118,19 @@ lerobot-edit-dataset \
    --repo_id lerobot/pusht_image \
    --operation.type convert_image_to_video \
    --operation.output_dir outputs/pusht_video \
-    --operation.camera_encoder.vcodec libsvtav1 \
-    --operation.camera_encoder.pix_fmt yuv420p \
-    --operation.camera_encoder.g 2 \
-    --operation.camera_encoder.crf 30
+    --operation.rgb_encoder.vcodec libsvtav1 \
+    --operation.rgb_encoder.pix_fmt yuv420p \
+    --operation.rgb_encoder.g 2 \
+    --operation.rgb_encoder.crf 30
+
+# Convert a dataset that includes depth maps, customizing the depth encoder
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht_image \
+    --operation.type convert_image_to_video \
+    --operation.output_dir outputs/pusht_video \
+    --operation.depth_encoder.depth_min 0.01 \
+    --operation.depth_encoder.depth_max 10.0 \
+    --operation.depth_encoder.use_log true

 # Convert only specific episodes
 lerobot-edit-dataset \
@@ -147,11 +157,42 @@ lerobot-edit-dataset \
 **Parameters:**

 - `output_dir`: Custom output directory (optional - by default uses `new_repo_id` or `{repo_id}_video`)
-- `camera_encoder`: Video encoder settings — all sub-fields accessible via `--operation.camera_encoder.<field>. See [Video Encoding Parameters](./video_encoding_parameters) for more details.
+- `rgb_encoder`: Video encoder settings applied to RGB cameras — all sub-fields accessible via `--operation.rgb_encoder.<field>`. See [Video Encoding Parameters](./video_encoding_parameters) for more details.
+- `depth_encoder`: Video encoder settings applied to depth-map cameras (e.g. from an Intel RealSense). In addition to the standard encoder fields it exposes the depth quantization knobs (`depth_min`, `depth_max`, `shift`, `use_log`), accessible via `--operation.depth_encoder.<field>`. These quantization settings are persisted to the dataset metadata so depth can be dequantized back to physical units on load. See the [Depth streams](./video_encoding_parameters#depth-streams) section for details.
 - `episode_indices`: List of specific episodes to convert (default: all episodes)
 - `num_workers`: Number of parallel workers for processing (default: 4)

-**Note:** The resulting dataset will be a proper LeRobotDataset with all cameras encoded as videos in the `videos/` directory, with parquet files containing only metadata (no raw image data). All episodes, stats, and tasks are preserved.
+**Note:** The resulting dataset will be a proper LeRobotDataset with all cameras encoded as videos in the `videos/` directory, with parquet files containing only metadata (no raw image data). Depth-map cameras are detected automatically and routed to the `depth_encoder`, while RGB cameras use the `rgb_encoder`. All episodes, stats, and tasks are preserved.
+
+#### Re-encode Videos
+
+Re-encode the videos of an existing video dataset with different encoder settings, without going back to raw frames. RGB videos use the `rgb_encoder` and depth videos use the `depth_encoder`. Provide only the encoder(s) you want to re-encode; the other stream type is left untouched.
+
+```bash
+# Re-encode all RGB videos with new settings (saves to lerobot/pusht_reencoded by default)
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht \
+    --operation.type reencode_videos \
+    --operation.rgb_encoder.vcodec h264 \
+    --operation.rgb_encoder.pix_fmt yuv420p \
+    --operation.rgb_encoder.crf 23
+
+# Re-encode both RGB and depth videos in a dataset with depth maps
+lerobot-edit-dataset \
+    --repo_id lerobot/pusht_depth \
+    --operation.type reencode_videos \
+    --operation.rgb_encoder.vcodec h264 \
+    --operation.depth_encoder.crf 50
+```
+
+**Parameters:**
+
+- `rgb_encoder`: Encoder settings applied to every RGB video. Omit to skip re-encoding RGB videos.
+- `depth_encoder`: Encoder settings applied to every depth video. Omit to skip re-encoding depth videos.
+- `num_workers`: Number of parallel workers for processing.
+
+> [!NOTE]
+> When re-encoding depth videos, the existing depth quantization parameters (`depth_min`, `depth_max`, `shift`, `use_log`) and the `is_depth_map` flag are **preserved** — re-encoding only changes the codec/quality of the stored stream, not how depth is dequantized on load.

 ### Show the information of datasets

@@ -2,15 +2,15 @@

 When video storage is enabled, LeRobot stores each camera stream as an **MP4** file instead of saving one image file per timestep. Video encoding compresses across time, which usually cuts dataset size and I/O compared to a pile of PNG, while keeping MP4 — a format every player and loader understands.

-Encoding frames into an MP4 is a full FFmpeg pipeline: choice of encoder, pixel format, GOP/keyframes, quality vs. speed, and optional extra encoder flags. Most of these knobs are user-tunable through `camera_encoder`, a nested `VideoEncoderConfig` (`lerobot.configs.video.VideoEncoderConfig`) passed through PyAV.
+Encoding frames into an MP4 is a full FFmpeg pipeline: choice of encoder, pixel format, GOP/keyframes, quality vs. speed, and optional extra encoder flags. Most of these knobs are user-tunable through `rgb_encoder`, a nested `RGBEncoderConfig` (`lerobot.configs.video.RGBEncoderConfig`) passed through PyAV.

-You can set these parameters from the CLI with `--dataset.camera_encoder.<field>` (e.g. with `lerobot-record` or `lerobot-rollout`). The same block applies to every camera video stream in that run.
+You can set these parameters from the CLI with `--dataset.rgb_encoder.<field>` (e.g. with `lerobot-record` or `lerobot-rollout`). The same block applies to every camera video stream in that run.

 <Tip>
-  Video storage must be on for `camera_encoder` to have any effect —
+  Video storage must be on for `rgb_encoder` to have any effect —
  `use_videos=True` in Python APIs, or `--dataset.video=true` on the CLI (the
-  recording default). With video off, inputs stay as images and `camera_encoder`
-  is ignored.
+  recording default). With video off, inputs stay as images and `rgb_encoder` is
+  ignored.
 </Tip>

 For details on **when** frames are written vs. encoded (streaming vs. post-episode), queues, and other top-level `--dataset.*` switches, see [Streaming Video Encoding](./streaming_video_encoding). For an encoding-parameter comparison and experiments, see the [video-benchmark Space](https://huggingface.co/spaces/lerobot/video-benchmark).
@@ -33,9 +33,9 @@ lerobot-record \
    --dataset.single_task="Grab the cube" \
    --dataset.streaming_encoding=true \
    --dataset.encoder_threads=2 \
-    --dataset.camera_encoder.vcodec=h264 \
-    --dataset.camera_encoder.preset=fast \
-    --dataset.camera_encoder.extra_options={"tune": "film", "profile:v": "high", "bf": 2} \
+    --dataset.rgb_encoder.vcodec=h264 \
+    --dataset.rgb_encoder.preset=fast \
+    --dataset.rgb_encoder.extra_options='{"tune": "film", "profile:v": "high", "bf": 2}' \
    --display_data=true
 ```

@@ -50,7 +50,7 @@ Only override these parameters if you have a specific reason to, and measure the

 </Tip>

-All flags below are prefixed with `--dataset.camera_encoder.` on the CLI.
+All flags below are prefixed with `--dataset.rgb_encoder.` on the CLI.

 | Parameter       | Type             | Default       | Description                                                                                                                                                                            |
 | --------------- | ---------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -65,6 +65,77 @@ All flags below are prefixed with `--dataset.camera_encoder.` on the CLI.

 ---

+## Depth streams
+
+Depth maps (Intel RealSense, Reachy 2) are stored as their **own video streams** alongside the RGB streams. Raw depth (`uint16` millimetres or `float32` metres) can't survive an 8-bit codec, so LeRobot **quantizes** each map to a 12-bit code (`[0, 4095]`) — logarithmically by default, to match the `1/depth` error profile of depth sensors — then packs it into a high-bit-depth pixel format (`gray12le`) and encodes it with a 12-bit codec.
+
+```mermaid
+flowchart LR
+    A["Raw depth (uint16 mm / float32 m)"] --> B["Clip to depth_min, depth_max"]
+    B --> C["Quantize to 12-bit code 0–4095 (log or linear)"]
+    C --> D["Pack into gray12le"]
+    D --> E["Encode video (hevc Main 12)"]
+    E --> F[("MP4 + metadata: depth_min/max, shift, use_log")]
+    F -. "load time (depth_output_unit)" .-> G["Dequantize to mm or m"]
+
+    classDef input fill:#e3f2fd,stroke:#1565c0,color:#0d47a1;
+    classDef encode fill:#ede7f6,stroke:#5e35b1,color:#311b92;
+    classDef store fill:#fff8e1,stroke:#f9a825,color:#e65100;
+    classDef load fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20;
+
+    class A input;
+    class B,C,D,E encode;
+    class F store;
+    class G load;
+```
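
The clip, quantize, and dequantize steps in the diagram can be sketched in a few lines of NumPy. This is an illustration, not LeRobot's implementation: the helper names `quantize_depth_sketch` and `dequantize_depth_sketch` are hypothetical, and the exact log mapping (normalizing `log(d + shift)` between its values at `depth_min` and `depth_max`) is an assumption. Only the parameter names and defaults mirror the `depth_encoder` fields.

```python
import numpy as np

MAX_CODE = 4095  # largest 12-bit code


def quantize_depth_sketch(depth_m, depth_min=0.01, depth_max=10.0, shift=3.5, use_log=True):
    """Map depth in metres to 12-bit codes (hypothetical helper, assumed formula)."""
    # Clip first, so out-of-range depths saturate at code 0 or 4095.
    d = np.clip(np.asarray(depth_m, dtype=np.float64), depth_min, depth_max)
    if use_log:
        lo, hi = np.log(depth_min + shift), np.log(depth_max + shift)
        norm = (np.log(d + shift) - lo) / (hi - lo)
    else:
        norm = (d - depth_min) / (depth_max - depth_min)
    return np.round(norm * MAX_CODE).astype(np.uint16)


def dequantize_depth_sketch(codes, depth_min=0.01, depth_max=10.0, shift=3.5, use_log=True, unit="mm"):
    """Invert quantize_depth_sketch and return depth in "mm" or "m"."""
    norm = np.asarray(codes, dtype=np.float64) / MAX_CODE
    if use_log:
        lo, hi = np.log(depth_min + shift), np.log(depth_max + shift)
        d = np.exp(lo + norm * (hi - lo)) - shift
    else:
        d = depth_min + norm * (depth_max - depth_min)
    return d * 1000.0 if unit == "mm" else d


depth = np.array([0.001, 0.5, 2.0, 25.0])  # metres; first and last are out of range
codes = quantize_depth_sketch(depth)
restored = dequantize_depth_sketch(codes, unit="m")
```

Under these assumptions, the two out-of-range inputs come back as `depth_min` and `depth_max`, while in-range values round-trip to within one quantization step. The `unit` argument only rescales the dequantized result; the codes themselves are unchanged.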
+
+Configure the depth pipeline through a parallel **`depth_encoder`** block (`DepthEncoderConfig`). It shares every `RGBEncoderConfig` field (`vcodec`, `pix_fmt`, `crf`, …) and adds four quantizer knobs, set via `--dataset.depth_encoder.<field>`:
+
+```bash
+lerobot-record \
+    ... \
+    --dataset.depth_encoder.vcodec=hevc \
+    --dataset.depth_encoder.depth_min=0.05 \
+    --dataset.depth_encoder.depth_max=5.0 \
+    --dataset.depth_encoder.use_log=true
+```
+
+| Parameter       | Type    | Default                         | Description                                                                                                                            |
+| --------------- | ------- | ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
+| `vcodec`        | `str`   | `"hevc"`                        | HEVC Main 12 (a 12-bit-capable codec, MP4-compatible).                                                                                 |
+| `extra_options` | `dict`  | `{"x265-params": "lossless=1"}` | **Depth defaults to lossless** (exact round-trip); `crf` is ignored. Pass `extra_options={}` and set `crf` for a smaller lossy stream. |
+| `pix_fmt`       | `str`   | `"gray12le"`                    | Single-channel 12-bit pixel format used to carry the quantized codes.                                                                  |
+| `depth_min`     | `float` | `0.01`                          | Depth in metres mapped to quantum `0`. Smaller depths are clipped to `depth_min` before quantization.                                  |
+| `depth_max`     | `float` | `10.0`                          | Depth in metres mapped to quantum `4095`. Larger depths are clipped to `depth_max` before quantization.                                |
+| `shift`         | `float` | `3.5`                           | Pre-log offset (metres) used in logarithmic quantization for numerical stability near zero. Must satisfy `depth_min + shift > 0`.      |
+| `use_log`       | `bool`  | `True`                          | If `true`, quantize in log-space (recommended for typical depth sensors). Set to `false` for uniform/linear quantization.              |
+
+> [!TIP]
+> `depth_min`, `depth_max`, and `shift` are always interpreted in **metres**, regardless of the input depth's unit. The input unit is inferred from the array dtype: integer arrays (e.g. `uint16` millimetres straight from a RealSense) are treated as millimetres, floating-point arrays as metres.
+> Pick `depth_min` / `depth_max` to bracket the actual working range of your sensor: depths outside that range saturate at quantum `0` or `4095`, which crushes detail at the boundaries.
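
The unit auto-detection described in the tip comes down to a dtype check. A minimal sketch, assuming a hypothetical helper `to_metres` (not a LeRobot function) that normalizes either input to metres before the quantizer bounds are applied:

```python
import numpy as np


def to_metres(depth):
    """Hypothetical helper: integer input is read as millimetres, float input as metres."""
    depth = np.asarray(depth)
    if np.issubdtype(depth.dtype, np.integer):
        return depth.astype(np.float32) / 1000.0
    return depth.astype(np.float32)


raw_mm = np.array([[500, 1200]], dtype=np.uint16)  # e.g. a RealSense frame in millimetres
raw_m = np.array([[0.5, 1.2]], dtype=np.float32)   # the same scene, already in metres

same_scene = np.allclose(to_metres(raw_mm), to_metres(raw_m))
```

Because both inputs end up in metres, a single `depth_min` / `depth_max` pair describes the working range regardless of what the camera driver returns.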
+
+Depth features are flagged with `"is_depth_map": true` in `meta/info.json`, and their quantizer settings (`video.depth_min`, `video.depth_max`, `video.shift`, `video.use_log`) are persisted — which is what lets depth be **dequantized back to physical units** on load.
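
To check which features are depth streams and which quantizer settings were stored, the metadata can be read directly. A minimal sketch, assuming each feature keeps these keys in a nested `info` dict as in the layout shown under "Persistence in dataset metadata"; `depth_quantizer_settings` and `load_info` are hypothetical helpers, and the sample dict and feature names are made up for illustration:

```python
import json
from pathlib import Path

QUANTIZER_KEYS = ("video.depth_min", "video.depth_max", "video.shift", "video.use_log")


def load_info(dataset_root):
    """Read meta/info.json from a local dataset directory."""
    return json.loads((Path(dataset_root) / "meta" / "info.json").read_text())


def depth_quantizer_settings(info):
    """Return {feature_name: quantizer settings} for features flagged as depth maps."""
    out = {}
    for name, feature in info.get("features", {}).items():
        feature_info = feature.get("info") or {}
        if feature_info.get("is_depth_map"):
            out[name] = {key: feature_info.get(key) for key in QUANTIZER_KEYS}
    return out


# Made-up metadata with one RGB stream and one depth stream (assumed layout).
example_info = {
    "features": {
        "observation.images.front": {"dtype": "video", "info": {"is_depth_map": False}},
        "observation.depth.front": {
            "dtype": "video",
            "info": {
                "is_depth_map": True,
                "video.depth_min": 0.01,
                "video.depth_max": 10.0,
                "video.shift": 3.5,
                "video.use_log": True,
            },
        },
    }
}

settings = depth_quantizer_settings(example_info)
```

With a real dataset, `depth_quantizer_settings(load_info(root))` would be the equivalent call, provided the on-disk layout matches the assumption above.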
+
+### Output unit at load time
+
+`depth_encoder` is a **record-time** concern. The unit that depth maps are dequantized to on _load_ (e.g. during training) is set separately by the read-time flag `--dataset.depth_output_unit`:
+
+```bash
+lerobot-train \
+    --dataset.repo_id=<my_username>/<my_dataset_name> \
+    --dataset.depth_output_unit=m \
+    --policy.type=act
+```
+
+| Parameter           | Type  | Default | Description                                                                                  |
+| ------------------- | ----- | ------- | -------------------------------------------------------------------------------------------- |
+| `depth_output_unit` | `str` | `"mm"`  | Physical unit depth maps are dequantized to on load: `"mm"` (millimetres) or `"m"` (metres). |
+
+> [!TIP]
+> This is purely a decode-time presentation choice — it does **not** alter the stored video or its metadata, so the same dataset can be read as `mm` or `m` without re-encoding. It has no effect on datasets without depth cameras.
+
+---
+
 ## Persistence in dataset metadata

 After the first episode of a video stream is encoded, the encoder configuration is **persisted into the dataset metadata** (`meta/info.json`) under each video feature, alongside the values probed from the file itself. For a video feature `observation.images.<camera>`, the layout in `info.json` is:
@@ -82,7 +153,7 @@ After the first episode of a video stream is encoded, the encoder configuration
        "video.pix_fmt": "yuv420p",
        "video.fps": 30,
        "video.channels": 3,
-        "video.is_depth_map": false,
+        "is_depth_map": false,
        "video.g": 2,
        "video.crf": 30,
        "video.preset": "fast",
@@ -97,12 +168,12 @@ After the first episode of a video stream is encoded, the encoder configuration

 Two sources contribute to the `info` block:

-- **Stream-derived** (read back from the encoded MP4 with PyAV): `video.height`, `video.width`, `video.codec`, `video.pix_fmt`, `video.fps`, `video.channels`, `video.is_depth_map`, plus `audio.*` if an audio stream is present.
-- **Encoder-derived** (taken from `VideoEncoderConfig`): `video.g`, `video.crf`, `video.preset`, `video.fast_decode`, `video.video_backend`, `video.extra_options`.
+- **Stream-derived** (read back from the encoded MP4 with PyAV): `video.height`, `video.width`, `video.codec`, `video.pix_fmt`, `video.fps`, `video.channels`, `is_depth_map`, plus `audio.*` if an audio stream is present.
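+- **Encoder-derived** (taken from `RGBEncoderConfig` or `DepthEncoderConfig`): `video.g`, `video.crf`, `video.preset`, `video.fast_decode`, `video.video_backend`, `video.extra_options`. Depth streams additionally persist the quantizer settings `video.depth_min`, `video.depth_max`, `video.shift`, and `video.use_log`.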

 <Tip>
  This block is populated **once**, from the **first** episode. It assumes every
-  episode in the dataset was encoded with the same `camera_encoder`. Changing
+  episode in the dataset was encoded with the same `rgb_encoder` (and the same `depth_encoder`, for depth streams). Changing
  encoder settings partway through a recording is not supported — the
  `info.json` will only reflect the parameters used for the first episode.
 </Tip>