mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-18 08:47:05 +00:00
343ecd7980
Add `video_decode_device` to StreamingLeRobotDataset and a `device` arg to VideoDecoderCache, passed to torchcodec's VideoDecoder. "cuda" offloads H.264/H.265 decode to the GPU's dedicated NVDEC engine (independent of the training SMs); requires a CUDA-enabled torchcodec build. benchmark: `--video_decode_device` flag. With cuda + num_workers>0 it forces the `spawn` start method (CUDA cannot init in forked workers) and disables CPU pin_memory (frames are already on-GPU). Decode device is recorded in results and the output filename. README documents the NVDEC option and its concurrency/IPC caveats. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
82 lines
4.3 KiB
Markdown
82 lines
4.3 KiB
Markdown
# Streaming dataloading benchmark
|
|
|
|
Measures **dataloading only** (no model) for `StreamingLeRobotDataset`: parquet read + video decode +
|
|
delta windowing + shuffle. A dummy consumer pulls batches and moves them to the device, so the numbers
|
|
isolate the data pipeline. Use it to compare sources (Hub vs. storage bucket vs. prewarmed bucket),
|
|
frame modes, and node counts, and to catch p95/p99 video-decode regressions.
|
|
|
|
## Run
|
|
|
|
```bash
|
|
python benchmarks/streaming/benchmark_streaming.py \
|
|
--repo_id pepijn223/robocasa_pretrain_human300_v4 \
|
|
--mode sarm --batch_size 64 --num_workers 12 --num_batches 200 \
|
|
--source hub --out_dir benchmarks/streaming/results
|
|
```
|
|
|
|
Multinode (per-node throughput) goes through Accelerate under SLURM:
|
|
|
|
```bash
|
|
sbatch slurm/benchmark_streaming_robocasa.sh
|
|
```
|
|
|
|
## Matrix
|
|
|
|
| Axis | Values |
|
|
| ---------- | -------------------------------------------------------------------------------------------------------------------- |
|
|
| Source | `hub` (verify now), `bucket`, `warmed_bucket` (bucket + prewarming; with user's help later) |
|
|
| Baseline | current `main` `StreamingLeRobotDataset` on Hub streaming |
|
|
| Nodes | 1 and 2 (per-node throughput should be independent) |
|
|
| Frame mode | `single` (1 frame, all cameras; target ≥ 120 frames/s/node) · `sarm` (8 steps spaced 1s; target ≥ 320 frames/s/node) |
|
|
|
|
`--source` is a label only; the actual source is whatever `--repo_id` / `--root` / `--data_files_root`
|
|
point at.
|
|
|
|
### GPU (NVDEC) decoding
|
|
|
|
By default video is decoded on the **CPU** in each DataLoader worker, so throughput is CPU-decode-bound and
|
|
scales with `--num_workers` (capped by the dataset's `num_shards`). Pass `--video_decode_device cuda` to
|
|
offload H.264/H.265 decode to the GPU's dedicated **NVDEC** engine, which runs independently of the SMs used
|
|
for training (see <https://developer.nvidia.com/video-codec-sdk>). This requires a CUDA-enabled torchcodec
|
|
build, and because CUDA cannot initialize in forked workers the benchmark switches to the `spawn` start
|
|
method automatically when `--num_workers > 0`.
|
|
|
|
```bash
|
|
# GPU/NVDEC decode, 6 workers, bucket source
|
|
python benchmarks/streaming/benchmark_streaming.py \
|
|
--repo_id pepijn223/robocasa_pretrain_human300_v4 \
|
|
--data_files_root hf://buckets/pepijn223/robocasa-stream \
|
|
--mode sarm --batch_size 64 --num_workers 6 --num_batches 200 \
|
|
--video_decode_device cuda --source bucket
|
|
```
|
|
|
|
Caveats with `cuda` + many workers: each worker creates its own CUDA context (VRAM overhead) and NVDEC has a
|
|
limited number of concurrent decode sessions per GPU; if you hit session/IPC limits, reduce `--num_workers`
|
|
or compare against `--num_workers 0` (single-process NVDEC, which often saturates the decode engine on its
|
|
own). Result files include the decode device in their name (`..._w6_cuda.json`).
|
|
|
|
Reference data root: bucket sources resolve through `--data_files_root hf://buckets/<owner>/<name>` (metadata
|
|
still loads from `--repo_id`). The local `single`/`sarm` CPU baselines on this dataset were ~176 / ~212
|
|
frames/s/node at `--num_workers 3` (3 cameras, fps 20).
|
|
|
|
## Metrics emitted (JSON + CSV)
|
|
|
|
`frames_per_s_node`, `samples_per_s`, `first_batch_latency_s`, `p50/p95/p99_sample_latency_ms`,
|
|
`wallclock_s`, and `video_decoder_cache` (`hits`, `misses`, `evictions`, `hit_rate`, `size`). A low
|
|
cache `hit_rate` with high `p99` is the decoder-thrash signature — raise `--video_decoder_cache_size`
|
|
or `--buffer_size`, or reduce `num_workers`.
|
|
|
|
## Bucket sources & prewarming (manual)
|
|
|
|
Prewarming is a **server-side** Hugging Face storage-bucket feature — there is no client script. To
|
|
benchmark the `warmed_bucket` source:
|
|
|
|
1. Attach a storage bucket to the dataset and enable it (see
|
|
<https://huggingface.co/docs/hub/storage-buckets>). Buckets resolve through `fsspec`, the same as
|
|
`hf://`, so no code change is needed — point `--repo_id`/`--revision` (or `--root`) at the bucket.
|
|
2. Enable **prewarming** in the bucket settings and wait for warm-up to complete.
|
|
3. Run the benchmark with `--source warmed_bucket`. Compare against the cold `--source bucket` and the
|
|
`--source hub` baseline.
|
|
|
|
Manual only — not run in CI.
|