mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-15 16:49:55 +00:00
141 lines
8.2 KiB
Markdown
141 lines
8.2 KiB
Markdown
# Streaming Video Encoding — Encode on the fly during recording
|
||
|
||
## Problem
|
||
|
||
After each episode, `save_episode()` blocks for **~79 seconds** on a 3-camera setup (3197 frames, 107s episode):
|
||
|
||
| Step | Time |
|
||
|------|------|
|
||
| Write 9591 PNGs to disk | ~19s |
|
||
| Read PNGs back → compute image stats | ~15s |
|
||
| Read PNGs again → encode 3× AV1 videos → delete PNGs | ~44.5s |
|
||
| Save parquet + metadata | ~0.6s |
|
||
| **Total** | **~79s** |
|
||
|
||
The entire pipeline writes frames as temporary PNGs, reads them back twice (stats + encoding), then deletes them. This round-trip is the bottleneck.
|
||
|
||
## Architecture
|
||
|
||
### Before: sequential post-episode pipeline
|
||
|
||
```
|
||
Recording loop save_episode() — BLOCKS ~79s
|
||
┌─────────────┐ ┌──────────────────────────────────────────────────────────┐
|
||
│ 30fps loop │ │ │
|
||
│ │ frames │ frame_buffer ──► write PNGs ──► read PNGs ──► stats │
|
||
│ camera ─►───┼──► list │ (~19s) │ (~15s) │
|
||
│ teleop │ │ ▼ │
|
||
│ policy │ │ read PNGs ──► AV1 encode ──► delete PNGs │
|
||
│ │ │ (~44.5s) │
|
||
└──────┬───────┘ └──────────────────────────────────────────────────────────┘
|
||
│ │
|
||
▼ ▼
|
||
episode ends next episode
|
||
(~107s recording) (~79s blocked)
|
||
```
|
||
|
||
**Data path:** `frame → list → PNG disk → read → stats` + `PNG disk → read → encode → MP4 → delete PNGs`
|
||
|
||
### After: streaming pipeline (encodes during recording)
|
||
|
||
```
|
||
Recording loop (encoding happens HERE) save_episode() — ~0.5s
|
||
┌───────────────────────────────────────┐ ┌──────────────────┐
|
||
│ 30fps control loop │ │ │
|
||
│ │ │ flush encoders │
|
||
│ camera ──► frame ─┬─► queue ──► [T1] ├── AV1 ─┤ (already done) │
|
||
│ │ queue ──► [T2] ├── AV1 ─┤ ~0.16s │
|
||
│ │ queue ──► [T3] ├── AV1 ─┤ │
|
||
│ │ │ │ running stats │
|
||
│ └─► downsample ──► │─ stats ─┤ → finalize │
|
||
│ RunningQuantile │ │ ~0.01s │
|
||
│ teleop / policy (never blocked) │ │ │
|
||
└───────────────────────────────────────┘ │ save parquet │
|
||
│ ~0.36s │
|
||
[T1] [T2] [T3] = encoder threads └──────────────────┘
|
||
(one per camera, GIL released by PyAV)
|
||
```
|
||
|
||
**Data path:** `frame → queue → encode → MP4` (zero PNGs, zero re-reads)
|
||
|
||
## Stats computation changes
|
||
|
||
| | Before | After |
|
||
|---|---|---|
|
||
| **Method** | `compute_episode_stats()` reads all PNGs from disk, decodes them, computes min/max/mean/std/quantiles | `RunningQuantileStats` accumulates stats incrementally per frame during recording |
|
||
| **Input** | Full-resolution PNGs read back from disk | Downsampled frames (via `auto_downsample_height_width`, ~150×100px) directly from memory |
|
||
| **When** | After episode ends, inside `save_episode()` | During recording, inside `add_frame()` (~2ms per frame) |
|
||
| **Output** | `{mean, std, min, max, q01..q99}` shaped `(C,1,1)` in `[0,1]` | Identical shape and scale — `RunningQuantileStats.get_statistics()` → reshape `(C,1,1)` / 255 |
|
||
| **I/O** | Reads 9591 PNGs (~15s) | Zero disk I/O |
|
||
| **Numeric features** | Computed from episode buffer (unchanged) | Computed from episode buffer (unchanged) |
|
||
|
||
The running stats use the same `auto_downsample_height_width` function and produce the same statistical keys (`mean`, `std`, `min`, `max`, `count`, `q01`, `q10`, `q50`, `q90`, `q99`). Video features are excluded from the post-episode `compute_episode_stats()` call when streaming is active — only numeric features go through that path.
|
||
|
||
## Results
|
||
|
||
Tested on the same 3-camera setup (2028 frames, 67.6s episode):
|
||
|
||
| Step | Before | After | Speedup |
|
||
|------|--------|-------|---------|
|
||
| Frame writing (PNGs) | ~19s | **0s** | ∞ (eliminated) |
|
||
| Episode stats | ~15s | **0.01s** | 1500× |
|
||
| Video encoding | ~44.5s | **0.16s** | 278× |
|
||
| Parquet + meta | ~0.6s | **0.36s** | ~same |
|
||
| **Total `save_episode()`** | **~79s** | **0.55s** | **143×** |
|
||
|
||
The video encoding time drops to near-zero because most encoding already happened during recording. `finish_episode()` only flushes the last few buffered frames.
|
||
|
||
### Per-frame overhead during recording
|
||
|
||
| Operation | Time |
|
||
|-----------|------|
|
||
| `queue.put(frame)` (non-blocking) | ~0.01ms |
|
||
| `auto_downsample_height_width` | ~0.5ms |
|
||
| `RunningQuantileStats.update` | ~1ms |
|
||
| **Total per frame** | **~2ms** (well within 33ms budget at 30fps) |
|
||
|
||
## Usage
|
||
|
||
Streaming is **on by default**. Users on weaker PCs can disable it to fall back to the old post-episode pipeline:
|
||
|
||
```bash
|
||
# Default (streaming ON)
|
||
lerobot-record --dataset.repo_id=user/dataset ...
|
||
|
||
# Old behavior (streaming OFF)
|
||
lerobot-record --dataset.repo_id=user/dataset --dataset.streaming_encoding=false
|
||
```
|
||
|
||
For the RaC data collection script, set `streaming_encoding: false` in the dataset config.
|
||
|
||
## Files Changed
|
||
|
||
### `src/lerobot/datasets/video_utils.py`
|
||
- Added `StreamingVideoEncoder` — manages one `_CameraEncoder` thread per camera
|
||
- Added `_CameraEncoder` — daemon thread that reads frames from a queue and encodes with PyAV
|
||
- Non-blocking unbounded queue ensures the control loop is never delayed
|
||
|
||
### `src/lerobot/datasets/lerobot_dataset.py`
|
||
- `create()` / `start_streaming_encoder()`: new `streaming_encoding` parameter
|
||
- `add_frame()`: when streaming, feeds frames to encoder + accumulates running stats instead of writing PNGs
|
||
- `save_episode()`: when streaming, uses running stats and calls `finish_episode()` to get already-encoded video paths
|
||
- `clear_episode_buffer()`: cancels in-progress encoding on re-record
|
||
- `finalize()`: cleans up encoder on shutdown
|
||
- **Full backward compatibility**: when `streaming_encoding=False`, all existing code paths are unchanged
|
||
|
||
### `src/lerobot/scripts/lerobot_record.py`
|
||
- Added `streaming_encoding: bool = True` to `DatasetRecordConfig`
|
||
- Wired through to both `create()` and `resume` paths
|
||
|
||
### `examples/rac/rac_data_collection_openarms_rtc.py`
|
||
- Added `streaming_encoding: bool = True` to `RaCRTCDatasetConfig`
|
||
- Frames are added inline during the control loop (streaming) or buffered for post-loop writing (old path)
|
||
- Automatically detects mode and adjusts behavior
|
||
|
||
## Design Notes
|
||
|
||
- **Why threads, not processes?** PyAV/FFmpeg releases the GIL during encoding. Threads share memory (zero-copy frame passing), avoiding the serialization overhead of multiprocessing.
|
||
- **Why unbounded queue?** At 30fps production vs ~72fps encoding throughput, the queue stays near-empty. Even during brief encoder stalls, memory growth is bounded by episode length. The control loop must never block.
|
||
- **Why running stats?** Avoids the expensive read-back-from-disk step. `RunningQuantileStats` + `auto_downsample_height_width` compute identical statistics incrementally with ~2ms overhead per frame.
|
||
- **Backward compatible**: Setting `streaming_encoding=false` restores the original PNG → encode pipeline exactly. No behavior changes for existing users who don't opt in.
|