lerobot/STREAMING_ENCODING_PR.md
2026-02-06 11:26:27 +01:00

Streaming Video Encoding — Encode on the fly during recording

Problem

After each episode, save_episode() blocks for ~79 seconds on a 3-camera setup (3197 frames, 107s episode):

| Step | Time |
|---|---|
| Write 9591 PNGs to disk | ~19s |
| Read PNGs back → compute image stats | ~15s |
| Read PNGs again → encode 3× AV1 videos → delete PNGs | ~44.5s |
| Save parquet + metadata | ~0.6s |
| Total | ~79s |

The entire pipeline writes frames as temporary PNGs, reads them back twice (stats + encoding), then deletes them. This round-trip is the bottleneck.

Architecture

Before: sequential post-episode pipeline

  Recording loop                          save_episode() — BLOCKS ~79s
 ┌─────────────┐          ┌──────────────────────────────────────────────────────────┐
 │  30fps loop  │          │                                                          │
 │              │  frames  │  frame_buffer ──► write PNGs ──► read PNGs ──► stats     │
 │  camera ─►───┼──► list  │       (~19s)         │              (~15s)               │
 │  teleop      │          │                      ▼                                   │
 │  policy      │          │               read PNGs ──► AV1 encode ──► delete PNGs   │
 │              │          │                        (~44.5s)                           │
 └──────┬───────┘          └──────────────────────────────────────────────────────────┘
        │                                         │
        ▼                                         ▼
   episode ends                             next episode
   (~107s recording)                        (~79s blocked)

Data path: frame → list → PNGs on disk → read → stats, then PNGs on disk → read again → encode → MP4 → delete PNGs

After: streaming pipeline (encodes during recording)

  Recording loop (encoding happens HERE)            save_episode() — ~0.5s
 ┌───────────────────────────────────────┐         ┌──────────────────┐
 │  30fps control loop                   │         │                  │
 │                                       │         │  flush encoders  │
 │  camera ──► frame ─┬─► queue ──► [T1] ├── AV1 ─┤  (already done)  │
 │                    │    queue ──► [T2] ├── AV1 ─┤  ~0.16s          │
 │                    │    queue ──► [T3] ├── AV1 ─┤                  │
 │                    │                  │         │  running stats   │
 │                    └─► downsample ──► │─ stats ─┤  → finalize      │
 │                       RunningQuantile │         │  ~0.01s          │
 │  teleop / policy (never blocked)      │         │                  │
 └───────────────────────────────────────┘         │  save parquet    │
                                                   │  ~0.36s          │
        [T1] [T2] [T3] = encoder threads           └──────────────────┘
        (one per camera, GIL released by PyAV)

Data path: frame → queue → encode → MP4 (zero PNGs, zero re-reads)
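The streaming data path above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `CameraEncoderSketch`, `record_episode`, and the `encode_frame` callable are hypothetical names, and `encode_frame` stands in for the real PyAV encode call, which is not reproduced here.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a worker to drain its queue and exit


class CameraEncoderSketch(threading.Thread):
    """Hypothetical stand-in for a per-camera encoder thread."""

    def __init__(self, name, encode_frame):
        super().__init__(name=name, daemon=True)
        self.frames = queue.Queue()       # unbounded, so put() never blocks
        self.encode_frame = encode_frame  # placeholder for the real codec call
        self.encoded = 0

    def run(self):
        while True:
            frame = self.frames.get()
            if frame is _STOP:
                break
            self.encode_frame(frame)
            self.encoded += 1

    def finish(self):
        """Encode whatever is still queued, then stop the thread."""
        self.frames.put(_STOP)
        self.join()
        return self.encoded


def record_episode(num_frames, camera_names, encode_frame):
    """Feed frames to one encoder thread per camera and return frame counts."""
    encoders = {name: CameraEncoderSketch(name, encode_frame) for name in camera_names}
    for enc in encoders.values():
        enc.start()
    for i in range(num_frames):  # stands in for the 30 fps control loop
        for name, enc in encoders.items():
            enc.frames.put((name, i))  # returns immediately; encoding runs in the worker
    return {name: enc.finish() for name, enc in encoders.items()}
```

Calling `record_episode(100, ["cam_a", "cam_b", "cam_c"], lambda frame: None)` returns a count of 100 for each camera. The only work left at the end of the episode is the `finish()` call, which corresponds to the short flush step in the diagram.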

Stats computation changes

| | Before | After |
|---|---|---|
| Method | compute_episode_stats() reads all PNGs from disk, decodes them, computes min/max/mean/std/quantiles | RunningQuantileStats accumulates stats incrementally per frame during recording |
| Input | Full-resolution PNGs read back from disk | Downsampled frames (via auto_downsample_height_width, ~150×100px) directly from memory |
| When | After episode ends, inside save_episode() | During recording, inside add_frame() (~2ms per frame) |
| Output | {mean, std, min, max, q01..q99} shaped (C,1,1) in [0,1] | Identical shape and scale: RunningQuantileStats.get_statistics() → reshape (C,1,1) / 255 |
| I/O | Reads 9591 PNGs (~15s) | Zero disk I/O |
| Numeric features | Computed from episode buffer (unchanged) | Computed from episode buffer (unchanged) |

The running stats use the same auto_downsample_height_width function and produce the same statistical keys (mean, std, min, max, count, q01, q10, q50, q90, q99). Video features are excluded from the post-episode compute_episode_stats() call when streaming is active — only numeric features go through that path.
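To illustrate what incremental accumulation means here, the following is a minimal single-channel sketch in pure Python. It is not LeRobot's `RunningQuantileStats`: the class name `RunningChannelStats` is hypothetical, and the 256-bin histogram is an assumption that works only because the input is taken to be uint8 pixel values. The real class may estimate quantiles differently.

```python
class RunningChannelStats:
    """Hypothetical single-channel accumulator for uint8 pixel values."""

    def __init__(self):
        self.hist = [0] * 256  # one bin per possible uint8 value
        self.count = 0
        self.total = 0
        self.total_sq = 0

    def update(self, pixels):
        """Fold one (downsampled) frame's pixel values into the running state."""
        for v in pixels:
            self.hist[v] += 1
            self.count += 1
            self.total += v
            self.total_sq += v * v

    def quantile(self, q):
        """Smallest observed value whose cumulative count reaches q * count."""
        target = q * self.count
        seen = 0
        for value, n in enumerate(self.hist):
            seen += n
            if n and seen >= target:
                return value
        return 255

    def get_statistics(self):
        """Return stats rescaled from [0, 255] to [0, 1]; needs at least one update."""
        mean = self.total / self.count
        var = max(self.total_sq / self.count - mean * mean, 0.0)
        lowest = next(v for v, n in enumerate(self.hist) if n)
        highest = next(v for v in range(255, -1, -1) if self.hist[v])
        stats = {"min": lowest, "max": highest, "mean": mean, "std": var ** 0.5}
        for key, q in [("q01", 0.01), ("q10", 0.10), ("q50", 0.50), ("q90", 0.90), ("q99", 0.99)]:
            stats[key] = self.quantile(q)
        scaled = {k: v / 255 for k, v in stats.items()}
        scaled["count"] = self.count  # count is not rescaled
        return scaled
```

One such accumulator per channel would yield the C values that are then reshaped to (C,1,1). No frame is kept after `update()` returns, which is why this path needs no disk I/O.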

Results

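Tested on the same 3-camera setup (2028 frames, 67.6s episode). The Before figures match those of the 3197-frame episode in the Problem section, so the speedups compare episodes of different length and are indicative rather than like-for-like: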

| Step | Before | After | Speedup |
|---|---|---|---|
| Frame writing (PNGs) | ~19s | 0s | ∞ (eliminated) |
| Episode stats | ~15s | 0.01s | 1500× |
| Video encoding | ~44.5s | 0.16s | 278× |
| Parquet + meta | ~0.6s | 0.36s | ~same |
| Total save_episode() | ~79s | 0.55s | 143× |

The video encoding time drops to near-zero because most encoding already happened during recording. finish_episode() only flushes the last few buffered frames.

Per-frame overhead during recording

| Operation | Time |
|---|---|
| queue.put(frame) (non-blocking) | ~0.01ms |
| auto_downsample_height_width | ~0.5ms |
| RunningQuantileStats.update | ~1ms |
| Total per frame | ~2ms (well within 33ms budget at 30fps) |
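The budget claim in the table above can be checked with a few lines of arithmetic. The figures are the ones quoted in the table. Whether they apply per frame overall or per camera is not stated, so the three-camera line is an assumption that shows the less favorable reading.

```python
FPS = 30
frame_budget_ms = 1000 / FPS  # about 33.3 ms between control-loop ticks

# Per-frame costs from the table above, in milliseconds.
overhead_ms = {
    "queue.put": 0.01,
    "auto_downsample_height_width": 0.5,
    "RunningQuantileStats.update": 1.0,
}
itemized_sum_ms = sum(overhead_ms.values())  # 1.51 ms
quoted_total_ms = 2.0                        # rounded-up total from the table

budget_share = quoted_total_ms / frame_budget_ms

# Assumption: if the 2 ms applied to each of 3 cameras separately.
three_camera_share = 3 * quoted_total_ms / frame_budget_ms

print(f"itemized sum: {itemized_sum_ms:.2f} ms")
print(f"share of frame budget: {budget_share:.0%}")
print(f"share if paid once per camera (3 cameras): {three_camera_share:.0%}")
```

Under either reading the overhead stays well below the time available per frame, leaving the rest of the tick for teleop and policy code.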

Usage

Streaming is on by default. Users on weaker PCs can disable it to fall back to the old post-episode pipeline:

# Default (streaming ON)
lerobot-record --dataset.repo_id=user/dataset ...

# Old behavior (streaming OFF)
lerobot-record --dataset.repo_id=user/dataset --dataset.streaming_encoding=false

For the RaC data collection script, set streaming_encoding: false in the dataset config.

Files Changed

src/lerobot/datasets/video_utils.py

  • Added StreamingVideoEncoder — manages one _CameraEncoder thread per camera
  • Added _CameraEncoder — daemon thread that reads frames from a queue and encodes with PyAV
  • Non-blocking unbounded queue ensures the control loop is never delayed

src/lerobot/datasets/lerobot_dataset.py

  • create() / start_streaming_encoder(): new streaming_encoding parameter
  • add_frame(): when streaming, feeds frames to encoder + accumulates running stats instead of writing PNGs
  • save_episode(): when streaming, uses running stats and calls finish_episode() to get already-encoded video paths
  • clear_episode_buffer(): cancels in-progress encoding on re-record
  • finalize(): cleans up encoder on shutdown
  • Full backward compatibility: when streaming_encoding=False, all existing code paths are unchanged

src/lerobot/scripts/lerobot_record.py

  • Added streaming_encoding: bool = True to DatasetRecordConfig
  • Wired through to both create() and resume paths

examples/rac/rac_data_collection_openarms_rtc.py

  • Added streaming_encoding: bool = True to RaCRTCDatasetConfig
  • Frames are added inline during the control loop (streaming) or buffered for post-loop writing (old path)
  • Automatically detects mode and adjusts behavior

Design Notes

  • Why threads, not processes? PyAV/FFmpeg releases the GIL during encoding. Threads share memory (zero-copy frame passing), avoiding the serialization overhead of multiprocessing.
  • Why unbounded queue? At 30fps production vs ~72fps encoding throughput, the queue stays near-empty. Even during brief encoder stalls, memory growth is bounded by episode length. The control loop must never block.
  • Why running stats? Avoids the expensive read-back-from-disk step. RunningQuantileStats + auto_downsample_height_width compute the same statistical keys, in the same shape and scale, incrementally with ~2ms overhead per frame.
  • Backward compatible: Setting streaming_encoding=false restores the original PNG → encode pipeline exactly. Because streaming is on by default, existing users who want the old behavior must opt out with this flag.
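The reasoning behind the unbounded queue can be made concrete with a rough sizing sketch. The 30 fps and ~72 fps rates come from the notes above. The helper names are hypothetical, and the 640×480 RGB frame size in the usage lines is an assumption, since the source does not state the camera resolution.

```python
PRODUCE_FPS = 30  # control-loop rate, from the notes above
ENCODE_FPS = 72   # approximate encoder throughput, from the notes above


def backlog_after_stall(stall_s):
    """Frames queued per camera if the encoder pauses for stall_s seconds."""
    return PRODUCE_FPS * stall_s


def drain_time_s(backlog_frames):
    """Seconds to clear a backlog while new frames keep arriving."""
    return backlog_frames / (ENCODE_FPS - PRODUCE_FPS)


def worst_case_queue_bytes(frames, height, width, channels=3):
    """Upper bound per camera: every uint8 frame of the episode is still queued."""
    return frames * height * width * channels


backlog = backlog_after_stall(2)  # 60 frames after a 2 s stall
print(f"backlog: {backlog} frames, drained in {drain_time_s(backlog):.2f} s")

# Assumed resolution (not from the source): 640x480 RGB, 3197-frame episode.
bound = worst_case_queue_bytes(3197, 480, 640)
print(f"worst case per camera: {bound / 1e9:.2f} GB")
```

With these rates a short stall clears within a second or two. The worst-case bound is reached only if the encoder makes no progress for the entire episode, but it indicates how much memory the unbounded queue could hold in that case.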