Files
lerobot/docs/source/video_encoding_parameters.mdx
T

227 lines
19 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Video encoding parameters
When video storage is enabled, LeRobot stores each camera stream as an **MP4** file instead of saving one image file per timestep. Video encoding compresses across time, which usually cuts dataset size and I/O compared to a pile of PNG, while keeping MP4 — a format every player and loader understands.
Encoding frames into an MP4 is a full FFmpeg pipeline: choice of encoder, pixel format, GOP/keyframes, quality vs. speed, and optional extra encoder flags. Most of these knobs are user-tunable through `rgb_encoder`, a nested `RGBEncoderConfig` (`lerobot.configs.video.RGBEncoderConfig`) passed through PyAV.
You can set these parameters from the CLI with `--dataset.rgb_encoder.<field>` (e.g. with `lerobot-record` or `lerobot-rollout`). The same block applies to every camera video stream in that run.
> [!TIP]
> Video storage must be on for `rgb_encoder` to have any effect —
> `use_videos=True` in Python APIs, or `--dataset.video=true` on the CLI (the
> recording default). With video off, inputs stay as images and `rgb_encoder` is
> ignored.
For details on **when** frames are written vs. encoded (streaming vs. post-episode), queues, and other top-level `--dataset.*` switches, see [Streaming Video Encoding](./streaming_video_encoding). For an encoding-parameter comparison and experiments, see the [video-benchmark Space](https://huggingface.co/spaces/lerobot/video-benchmark).
---
## Example
```bash
lerobot-record \
--robot.type=so100_follower \
--robot.port=/dev/tty.usbmodem58760431541 \
--robot.cameras="{laptop: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
--robot.id=black \
--teleop.type=so100_leader \
--teleop.port=/dev/tty.usbmodem58760431551 \
--teleop.id=blue \
--dataset.repo_id=<my_username>/<my_dataset_name> \
--dataset.num_episodes=2 \
--dataset.single_task="Grab the cube" \
--dataset.streaming_encoding=true \
--dataset.encoder_threads=2 \
--dataset.rgb_encoder.vcodec=h264 \
--dataset.rgb_encoder.preset=fast \
--dataset.rgb_encoder.extra_options={"tune": "film", "profile:v": "high", "bf": 2} \
--display_data=true
```
---
## Tuning parameters
> [!WARNING]
> The defaults are tuned to balance **compression ratio**, **visual quality**, and **decoding/seek speed** for typical robotics datasets. Changing them can affect both recording (CPU load, frame drops) and training (decoding throughput, image quality).
>
> Only override these parameters if you have a specific reason to, and measure the impact on your pipeline before relying on the new settings.
All flags below are prefixed with `--dataset.rgb_encoder.` on the CLI.
| Parameter | Type | Default | Description |
| --------------- | ---------------- | ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vcodec` | `str` | `"libsvtav1"` | Video codec name. `"auto"` picks the first available hardware encoder from a fixed preference list, falling back to `libsvtav1`. |
| `pix_fmt` | `str` | `"yuv420p"` | Output pixel format. Must be supported by the chosen codec in your FFmpeg build. |
| `g` | `int` | `2` | GOP size — a keyframe every `g` frames. Emitted as FFmpeg option `g`. |
| `crf` | `int` or `float` | `30` | Abstract quality value, mapped per codec (see the [mapping](#mapping-videoencoderconfig--ffmpeg-options) below). Lower → higher quality / larger output where the mapping is monotone. |
| `preset` | `int` or `str` | `12` \* | Encoder speed preset; meaning depends on the codec. <br/>\* When unset and `vcodec=libsvtav1`, LeRobot defaults to `12`. |
| `fast_decode` | `int` | `0` | `libsvtav1`: `02`, passed via `svtav1-params`. <br/>`h264` / `hevc` (software): if `>0`, sets `tune=fastdecode`. <br/>Other codecs: usually unused. |
| `video_backend` | `str` | `"pyav"` | Only `"pyav"` is currently implemented for video encoding. |
| `extra_options` | `dict` | `{}` | Extra FFmpeg or codec specific options merged after the structured fields above. Cannot override keys already set by those fields. |
---
## Depth streams
Depth maps (Intel RealSense, Reachy 2) are stored as their **own video streams** alongside the RGB streams. Raw depth (`uint16` millimetres or `float32` metres) can't survive an 8-bit codec, so LeRobot **quantizes** each map to a 12-bit code (`[0, 4095]`) — logarithmically by default, to match the `1/depth` error profile of depth sensors — then packs it into a high-bit-depth pixel format (`gray12le`) and encodes it with a 12-bit codec.
<div style="margin:28px 0;padding:14px 0;">
<div style="margin:0 auto;display:flex;flex-wrap:wrap;justify-content:center;align-items:stretch;gap:6px;font-family:'Source Sans 3',ui-sans-serif,system-ui,sans-serif;font-size:14px;font-weight:600;color:#1B1B1D;">
<span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#DBEAFE;color:#1D4ED8;border-radius:9px;padding:8px 12px;"><span>Raw depth</span><span style="font-size:11px;font-weight:400;color:#3B6FD4;white-space:nowrap;">uint16 mm<br/>float32 m</span></span>
<span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">→</span>
<div style="border:2px dashed #C4B5FD;border-radius:13px;padding:18px 12px 12px;position:relative;display:flex;align-items:stretch;gap:6px;">
<span style="position:absolute;top:-10px;left:12px;background:#fff;padding:0 6px;font-size:11px;font-weight:700;color:#7E22CE;text-transform:uppercase;letter-spacing:0.5px;white-space:nowrap;">Record time</span>
<span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#F3E8FF;color:#7E22CE;border-radius:9px;padding:8px 12px;"><span>Clip</span><span style="font-size:11px;font-weight:400;color:#9061C2;white-space:nowrap;">to [depth_min,<br/>depth_max]</span></span>
<span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">→</span>
<span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#F3E8FF;color:#7E22CE;border-radius:9px;padding:8px 12px;"><span>Quantize</span><span style="font-size:11px;font-weight:400;color:#9061C2;white-space:nowrap;">12-bit codes 04095<br/>log (default) or linear</span></span>
<span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">→</span>
<span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#F3E8FF;color:#7E22CE;border-radius:9px;padding:8px 12px;"><span>Pack</span><span style="font-size:11px;font-weight:400;color:#9061C2;white-space:nowrap;">into gray12le<br/>plane</span></span>
<span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">→</span>
<span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#F3E8FF;color:#7E22CE;border-radius:9px;padding:8px 12px;"><span>Encode</span><span style="font-size:11px;font-weight:400;color:#9061C2;white-space:nowrap;">HEVC<br/>Main 12</span></span>
</div>
<span style="display:flex;align-items:center;font-size:16px;color:#C3CBD9;">→</span>
<span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#FEF3C7;color:#B45309;border-radius:9px;padding:8px 12px;"><span>MP4</span><span style="font-size:11px;font-weight:400;color:#C77D18;white-space:nowrap;">stored<br/>stream</span></span>
<span style="display:flex;align-items:center;font-size:16px;color:#34A06B;">→</span>
<div style="border:2px dashed #6EE7B7;border-radius:13px;padding:18px 12px 12px;position:relative;display:flex;align-items:center;gap:6px;">
<span style="position:absolute;top:-10px;left:12px;background:#fff;padding:0 6px;font-size:11px;font-weight:700;color:#047857;text-transform:uppercase;letter-spacing:0.5px;white-space:nowrap;">Load time</span>
<span style="display:flex;flex-direction:column;justify-content:center;align-items:center;text-align:center;gap:2px;background:#D1FAE5;color:#047857;border-radius:9px;padding:8px 12px;"><span>Dequantize</span><span style="font-size:11px;font-weight:400;color:#059669;white-space:nowrap;">to mm / m</span></span>
</div>
</div>
</div>
Configure the depth pipeline through a parallel **`depth_encoder`** block (`DepthEncoderConfig`). It shares every `RGBEncoderConfig` field (`vcodec`, `pix_fmt`, `crf`, …) and adds four quantizer knobs, set via `--dataset.depth_encoder.<field>`:
```bash
lerobot-record \
... \
--dataset.depth_encoder.vcodec=hevc \
--dataset.depth_encoder.depth_min=0.05 \
--dataset.depth_encoder.depth_max=5.0 \
--dataset.depth_encoder.use_log=true
```
| Parameter | Type | Default | Description |
| --------------- | ------- | ------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `vcodec` | `str` | `"hevc"` | HEVC Main 12 (a 12-bit-capable codec, MP4-compatible). |
| `extra_options` | `dict` | `{"x265-params": "lossless=1"}` | **Depth defaults to lossless** (exact round-trip); `crf` is ignored. Pass `extra_options={}` and set `crf` for a smaller lossy stream. |
| `pix_fmt` | `str` | `"gray12le"` | Single-channel 12-bit pixel format used to carry the quantized codes. |
| `depth_min` | `float` | `0.01` | Depth in metres mapped to quantum `0`. Values below are clipped on decode. |
| `depth_max` | `float` | `10.0` | Depth in metres mapped to quantum `4095`. Values above are clipped on decode. |
| `shift` | `float` | `3.5` | Pre-log offset (metres) used in logarithmic quantization for numerical stability near zero. Must satisfy `depth_min + shift > 0`. |
| `use_log` | `bool` | `True` | If `true`, quantize in log-space (recommended for typical depth sensors). Set to `false` for uniform/linear quantization. |
> [!TIP]
> `depth_min`, `depth_max`, and `shift` are always interpreted in **metres**, regardless of the input depth's unit. Inputs are auto-detected: integer arrays (e.g. `uint16` millimetres straight from a RealSense) are treated as millimetres, floating arrays as metres.
> Pick `depth_min` / `depth_max` to bracket the actual working range of your sensor — quanta outside that range saturate, which can crush detail at the boundaries.
Depth features are flagged with `"is_depth_map": true` in `meta/info.json`, and their quantizer settings (`video.depth_min`, `video.depth_max`, `video.shift`, `video.use_log`) are persisted — which is what lets depth be **dequantized back to physical units** on load.
### Output unit at load time
`depth_encoder` is a **record-time** concern. The unit that depth maps are dequantized to on _load_ (e.g. during training) is set separately by the read-time flag `--dataset.depth_output_unit`:
```bash
lerobot-train \
--dataset.repo_id=<my_username>/<my_dataset_name> \
--dataset.depth_output_unit=m \
--policy.type=act
```
| Parameter | Type | Default | Description |
| ------------------- | ----- | ------- | -------------------------------------------------------------------------------------------- |
| `depth_output_unit` | `str` | `"mm"` | Physical unit depth maps are dequantized to on load: `"mm"` (millimetres) or `"m"` (metres). |
> [!TIP]
> This is purely a decode-time presentation choice — it does **not** alter the stored video or its metadata, so the same dataset can be read as `mm` or `m` without re-encoding. It has no effect on datasets without depth cameras.
---
## Persistence in dataset metadata
After the first episode of a video stream is encoded, the encoder configuration is **persisted into the dataset metadata** (`meta/info.json`) under each video feature, alongside the values probed from the file itself. For a video feature `observation.images.<camera>`, the layout in `info.json` is:
```json
{
"features": {
"observation.images.laptop": {
"dtype": "video",
"shape": [480, 640, 3],
"info": {
"video.height": 480,
"video.width": 640,
"video.codec": "h264",
"video.pix_fmt": "yuv420p",
"video.fps": 30,
"video.channels": 3,
"is_depth_map": false,
"video.g": 2,
"video.crf": 30,
"video.preset": "fast",
"video.fast_decode": 0,
"video.video_backend": "pyav",
"video.extra_options": { "tune": "film", "profile:v": "high", "bf": 2 }
}
}
}
}
```
Two sources contribute to the `info` block:
<div style="display:flex;flex-wrap:wrap;gap:14px;margin:20px 0;font-family:'Source Sans 3',ui-sans-serif,system-ui,sans-serif;">
<div style="flex:1 1 280px;border:1px solid #BFDBFE;border-radius:12px;overflow:hidden;">
<div style="background:#DBEAFE;color:#1D4ED8;font-weight:700;font-size:14px;padding:8px 14px;">Stream-derived</div>
<div style="padding:12px 14px;">
<div style="font-size:13px;color:#4B5563;margin-bottom:10px;">Read back from the encoded MP4 with PyAV.</div>
<div style="display:flex;flex-wrap:wrap;gap:6px;">
<code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">video.height</code>
<code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">video.width</code>
<code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">video.codec</code>
<code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">video.pix_fmt</code>
<code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">video.fps</code>
<code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">video.channels</code>
<code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">is_depth_map</code>
<code style="background:#EFF6FF;color:#1D4ED8;border-radius:6px;padding:2px 8px;font-size:12px;">audio.*</code>
</div>
</div>
</div>
<div style="flex:1 1 280px;border:1px solid #DDD6FE;border-radius:12px;overflow:hidden;">
<div style="background:#F3E8FF;color:#7E22CE;font-weight:700;font-size:14px;padding:8px 14px;">Encoder-derived</div>
<div style="padding:12px 14px;">
<div style="font-size:13px;color:#4B5563;margin-bottom:10px;">Taken from <code style="font-size:12px;">RGBEncoderConfig</code> / <code style="font-size:12px;">DepthEncoderConfig</code>.</div>
<div style="display:flex;flex-wrap:wrap;gap:6px;">
<code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">video.g</code>
<code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">video.crf</code>
<code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">video.preset</code>
<code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">video.fast_decode</code>
<code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">video.video_backend</code>
<code style="background:#FAF5FF;color:#7E22CE;border-radius:6px;padding:2px 8px;font-size:12px;">video.extra_options</code>
</div>
</div>
</div>
</div>
> [!IMPORTANT]
> This block is populated **once**, from the **first** episode. It assumes every
> episode in the dataset was encoded with the same `rgb_encoder`. Changing
> encoder settings partway through a recording is not supported — the
> `info.json` will only reflect the parameters used for the first episode.
---
## Merging datasets
When aggregating datasets with `merge_datasets`, video files are concatenated as-is (no re-encoding), and encoder fields in `info.json` are merged per-key:
<div style="display:flex;flex-direction:column;gap:12px;margin:20px 0;font-family:'Source Sans 3',ui-sans-serif,system-ui,sans-serif;">
<div style="display:flex;gap:12px;align-items:flex-start;border-left:3px solid #F87171;background:#FEF2F2;border-radius:0 10px 10px 0;padding:12px 14px;">
<span style="flex:none;background:#FEE2E2;color:#B91C1C;font-weight:700;font-size:11px;text-transform:uppercase;letter-spacing:0.4px;border-radius:6px;padding:3px 8px;margin-top:1px;white-space:nowrap;">Must match</span>
<span style="font-size:14px;color:#1B1B1D;">Stream-derived fields — <code style="font-size:12px;">video.codec</code>, <code style="font-size:12px;">video.pix_fmt</code>, <code style="font-size:12px;">video.height</code>, <code style="font-size:12px;">video.width</code>, <code style="font-size:12px;">video.fps</code> — must match across sources, otherwise FFmpeg's concat demuxer fails.</span>
</div>
<div style="display:flex;gap:12px;align-items:flex-start;border-left:3px solid #34D399;background:#ECFDF5;border-radius:0 10px 10px 0;padding:12px 14px;">
<span style="flex:none;background:#D1FAE5;color:#047857;font-weight:700;font-size:11px;text-transform:uppercase;letter-spacing:0.4px;border-radius:6px;padding:3px 8px;margin-top:1px;white-space:nowrap;">Merged loosely</span>
<span style="font-size:14px;color:#1B1B1D;">Encoder-tuning fields — <code style="font-size:12px;">video.g</code>, <code style="font-size:12px;">video.crf</code>, <code style="font-size:12px;">video.preset</code>, <code style="font-size:12px;">video.fast_decode</code>, <code style="font-size:12px;">video.extra_options</code>. If every source agrees, the value is kept; if not, it's set to <code style="font-size:12px;">null</code> (or <code style="font-size:12px;">{}</code> for <code style="font-size:12px;">video.extra_options</code>) and a warning is logged.</span>
</div>
</div>