# Streaming Video Encoding — Encode on the fly during recording ## Problem After each episode, `save_episode()` blocks for **~79 seconds** on a 3-camera setup (3197 frames, 107s episode): | Step | Time | |------|------| | Write 9591 PNGs to disk | ~19s | | Read PNGs back → compute image stats | ~15s | | Read PNGs again → encode 3× AV1 videos → delete PNGs | ~44.5s | | Save parquet + metadata | ~0.6s | | **Total** | **~79s** | The entire pipeline writes frames as temporary PNGs, reads them back twice (stats + encoding), then deletes them. This round-trip is the bottleneck. ## Architecture ### Before: sequential post-episode pipeline ``` Recording loop save_episode() — BLOCKS ~79s ┌─────────────┐ ┌──────────────────────────────────────────────────────────┐ │ 30fps loop │ │ │ │ │ frames │ frame_buffer ──► write PNGs ──► read PNGs ──► stats │ │ camera ─►───┼──► list │ (~19s) │ (~15s) │ │ teleop │ │ ▼ │ │ policy │ │ read PNGs ──► AV1 encode ──► delete PNGs │ │ │ │ (~44.5s) │ └──────┬───────┘ └──────────────────────────────────────────────────────────┘ │ │ ▼ ▼ episode ends next episode (~107s recording) (~79s blocked) ``` **Data path:** `frame → list → PNG disk → read → stats` + `PNG disk → read → encode → MP4 → delete PNGs` ### After: streaming pipeline (encodes during recording) ``` Recording loop (encoding happens HERE) save_episode() — ~0.5s ┌───────────────────────────────────────┐ ┌──────────────────┐ │ 30fps control loop │ │ │ │ │ │ flush encoders │ │ camera ──► frame ─┬─► queue ──► [T1] ├── AV1 ─┤ (already done) │ │ │ queue ──► [T2] ├── AV1 ─┤ ~0.16s │ │ │ queue ──► [T3] ├── AV1 ─┤ │ │ │ │ │ running stats │ │ └─► downsample ──► │─ stats ─┤ → finalize │ │ RunningQuantile │ │ ~0.01s │ │ teleop / policy (never blocked) │ │ │ └───────────────────────────────────────┘ │ save parquet │ │ ~0.36s │ [T1] [T2] [T3] = encoder threads └──────────────────┘ (one per camera, GIL released by PyAV) ``` **Data path:** `frame → queue → encode → MP4` (zero PNGs, zero re-reads) ## Stats computation changes | | Before | After | |---|---|---| | **Method** | `compute_episode_stats()` reads all PNGs from disk, decodes them, computes min/max/mean/std/quantiles | `RunningQuantileStats` accumulates stats incrementally per frame during recording | | **Input** | Full-resolution PNGs read back from disk | Downsampled frames (via `auto_downsample_height_width`, ~150×100px) directly from memory | | **When** | After episode ends, inside `save_episode()` | During recording, inside `add_frame()` (~2ms per frame) | | **Output** | `{mean, std, min, max, q01..q99}` shaped `(C,1,1)` in `[0,1]` | Identical shape and scale — `RunningQuantileStats.get_statistics()` → reshape `(C,1,1)` / 255 | | **I/O** | Reads 9591 PNGs (~15s) | Zero disk I/O | | **Numeric features** | Computed from episode buffer (unchanged) | Computed from episode buffer (unchanged) | The running stats use the same `auto_downsample_height_width` function and produce the same statistical keys (`mean`, `std`, `min`, `max`, `count`, `q01`, `q10`, `q50`, `q90`, `q99`). Video features are excluded from the post-episode `compute_episode_stats()` call when streaming is active — only numeric features go through that path. ## Results Tested on the same 3-camera setup (2028 frames, 67.6s episode): | Step | Before | After | Speedup | |------|--------|-------|---------| | Frame writing (PNGs) | ~19s | **0s** | ∞ (eliminated) | | Episode stats | ~15s | **0.01s** | 1500× | | Video encoding | ~44.5s | **0.16s** | 278× | | Parquet + meta | ~0.6s | **0.36s** | ~same | | **Total `save_episode()`** | **~79s** | **0.55s** | **143×** | The video encoding time drops to near-zero because most encoding already happened during recording. `finish_episode()` only flushes the last few buffered frames. ### Per-frame overhead during recording | Operation | Time | |-----------|------| | `queue.put(frame)` (non-blocking) | ~0.01ms | | `auto_downsample_height_width` | ~0.5ms | | `RunningQuantileStats.update` | ~1ms | | **Total per frame** | **~2ms** (well within 33ms budget at 30fps) | ## Usage Streaming is **on by default**. Users on weaker PCs can disable it to fall back to the old post-episode pipeline: ```bash # Default (streaming ON) lerobot-record --dataset.repo_id=user/dataset ... # Old behavior (streaming OFF) lerobot-record --dataset.repo_id=user/dataset --dataset.streaming_encoding=false ``` For the RaC data collection script, set `streaming_encoding: false` in the dataset config. ## Files Changed ### `src/lerobot/datasets/video_utils.py` - Added `StreamingVideoEncoder` — manages one `_CameraEncoder` thread per camera - Added `_CameraEncoder` — daemon thread that reads frames from a queue and encodes with PyAV - Non-blocking unbounded queue ensures the control loop is never delayed ### `src/lerobot/datasets/lerobot_dataset.py` - `create()` / `start_streaming_encoder()`: new `streaming_encoding` parameter - `add_frame()`: when streaming, feeds frames to encoder + accumulates running stats instead of writing PNGs - `save_episode()`: when streaming, uses running stats and calls `finish_episode()` to get already-encoded video paths - `clear_episode_buffer()`: cancels in-progress encoding on re-record - `finalize()`: cleans up encoder on shutdown - **Full backward compatibility**: when `streaming_encoding=False`, all existing code paths are unchanged ### `src/lerobot/scripts/lerobot_record.py` - Added `streaming_encoding: bool = True` to `DatasetRecordConfig` - Wired through to both `create()` and `resume` paths ### `examples/rac/rac_data_collection_openarms_rtc.py` - Added `streaming_encoding: bool = True` to `RaCRTCDatasetConfig` - Frames are added inline during the control loop (streaming) or buffered for post-loop writing (old path) - Automatically detects mode and adjusts behavior ## Design Notes - **Why threads, not processes?** PyAV/FFmpeg releases the GIL during encoding. Threads share memory (zero-copy frame passing), avoiding the serialization overhead of multiprocessing. - **Why unbounded queue?** At 30fps production vs ~72fps encoding throughput, the queue stays near-empty. Even during brief encoder stalls, memory growth is bounded by episode length. The control loop must never block. - **Why running stats?** Avoids the expensive read-back-from-disk step. `RunningQuantileStats` + `auto_downsample_height_width` compute identical statistics incrementally with ~2ms overhead per frame. - **Backward compatible**: Setting `streaming_encoding=false` restores the original PNG → encode pipeline exactly. No behavior changes for existing users who don't opt in.