mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-19 01:07:18 +00:00
a7b7f4964e
The fast-forward skip assumed every DataLoader worker delivers batches; workers that own no shards yield nothing and are stopped, so the batch round-robin runs over min(num_workers, num_shards) active workers. Use that effective count (shard-less workers skip nothing). Adds a resume test under num_workers=2 asserting exact continuation. Note: the test fixtures write a single parquet file regardless of data_files_size_in_mb, so worker-splitting tests exercise the degenerate single-shard layout; multi-shard behavior is covered by the rank-level split_dataset_by_node tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
92 lines
5.2 KiB
Markdown
92 lines
5.2 KiB
Markdown
# Streaming dataloading benchmark
|
|
|
|
Measures **dataloading only** (no model) for `StreamingLeRobotDataset`: parquet read + video decode +
|
|
delta windowing + shuffle. A dummy consumer pulls batches and moves them to the device, so the numbers
|
|
isolate the data pipeline. Use it to compare sources (Hub vs. storage bucket vs. prewarmed bucket),
|
|
frame modes, and node counts, and to catch p95/p99 video-decode regressions.
|
|
|
|
## Run
|
|
|
|
```bash
|
|
python benchmarks/streaming/benchmark_streaming.py \
|
|
--repo_id pepijn223/robocasa_pretrain_human300_v4 \
|
|
--mode sarm --batch_size 64 --num_workers 12 --num_batches 200 \
|
|
--source hub --out_dir benchmarks/streaming/results
|
|
```
|
|
|
|
Multinode (per-node throughput) goes through Accelerate under SLURM:
|
|
|
|
```bash
|
|
sbatch slurm/benchmark_streaming_robocasa.sh
|
|
```
|
|
|
|
## Matrix
|
|
|
|
| Axis | Values |
|
|
| ---------- | -------------------------------------------------------------------------------------------------------------------- |
|
|
| Source | `hub` (verify now), `bucket`, `warmed_bucket` (bucket + prewarming; with user's help later) |
|
|
| Baseline | current `main` `StreamingLeRobotDataset` on Hub streaming |
|
|
| Nodes | 1 and 2 (per-node throughput should be independent) |
|
|
| Frame mode | `single` (1 frame, all cameras; target ≥ 120 frames/s/node) · `sarm` (8 steps spaced 1s; target ≥ 320 frames/s/node) |
|
|
|
|
`--source` is a label only; the actual source is whatever `--repo_id` / `--root` / `--data_files_root`
|
|
point at.
|
|
|
|
### GPU (NVDEC) decoding
|
|
|
|
By default video is decoded on the **CPU** in each DataLoader worker, so throughput is CPU-decode-bound and
|
|
scales with `--num_workers` (capped by the dataset's `num_shards`). Pass `--video_decode_device cuda` to
|
|
offload H.264/H.265 decode to the GPU's dedicated **NVDEC** engine, which runs independently of the SMs used
|
|
for training (see <https://developer.nvidia.com/video-codec-sdk>). This requires a CUDA-enabled torchcodec
|
|
build, and because CUDA cannot initialize in forked workers the benchmark switches to the `spawn` start
|
|
method automatically when `--num_workers > 0`.
|
|
|
|
```bash
|
|
# GPU/NVDEC decode, 6 workers, bucket source
|
|
python benchmarks/streaming/benchmark_streaming.py \
|
|
--repo_id pepijn223/robocasa_pretrain_human300_v4 \
|
|
--data_files_root hf://buckets/pepijn223/robocasa-stream \
|
|
--mode sarm --batch_size 64 --num_workers 6 --num_batches 200 \
|
|
--video_decode_device cuda --source bucket
|
|
```
|
|
|
|
Caveats with `cuda` + many workers: each worker creates its own CUDA context (VRAM overhead) and NVDEC has a
|
|
limited number of concurrent decode sessions per GPU; if you hit session/IPC limits, reduce `--num_workers`
|
|
or compare against `--num_workers 0` (single-process NVDEC, which often saturates the decode engine on its
|
|
own). Result files include the decode device in their name (`..._w6_cuda.json`).
|
|
|
|
> **Codec ⇄ NVDEC compatibility (important).** NVDEC can only decode codecs its hardware supports. LeRobot's
|
|
> **default video codec is AV1** (`VideoEncoderConfig.vcodec = "libsvtav1"`), so most v3 datasets are
|
|
> AV1-encoded — and the **A100 and H100 compute GPUs have no AV1 NVDEC decoder**
|
|
> (per NVIDIA's [decode support matrix](https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new));
|
|
> only Ada (L4/L40/RTX40) and a few Ampere cards (A10/A40/A16) do. On A100/H100, AV1 must be decoded on
|
|
> **CPU**, or the dataset re-encoded to H.265/H.264 (which those GPUs' NVDEC do support). Run
|
|
> `diagnose_decode.py --video_decode_device cuda` to check your exact node before relying on `cuda` decode.
|
|
> A `cuda` torchcodec build also needs an FFmpeg with NVDEC; see
|
|
> <https://github.com/meta-pytorch/torchcodec#installing-cuda-enabled-torchcodec>.
|
|
|
|
Reference data root: bucket sources resolve through `--data_files_root hf://buckets/<owner>/<name>` (metadata
|
|
still loads from `--repo_id`). The local `single`/`sarm` CPU baselines on this dataset were ~176 / ~212
|
|
frames/s/node at `--num_workers 3` (3 cameras, fps 20).
|
|
|
|
## Metrics emitted (JSON + CSV)
|
|
|
|
`frames_per_s_node`, `samples_per_s`, `first_batch_latency_s`, `p50/p95/p99_sample_latency_ms`,
|
|
`wallclock_s`, and `video_decoder_cache` (`hits`, `misses`, `evictions`, `hit_rate`, `size`). A low
|
|
cache `hit_rate` with high `p99` is the decoder-thrash signature — raise `--video_decoder_cache_size`
|
|
or `--episode_pool_size`, or reduce `num_workers`.
|
|
|
|
## Bucket sources & prewarming (manual)
|
|
|
|
Prewarming is a **server-side** Hugging Face storage-bucket feature — there is no client script. To
|
|
benchmark the `warmed_bucket` source:
|
|
|
|
1. Attach a storage bucket to the dataset and enable it (see
|
|
<https://huggingface.co/docs/hub/storage-buckets>). Buckets resolve through `fsspec`, the same as
|
|
`hf://`, so no code change is needed — point `--repo_id`/`--revision` (or `--root`) at the bucket.
|
|
2. Enable **prewarming** in the bucket settings and wait for warm-up to complete.
|
|
3. Run the benchmark with `--source warmed_bucket`. Compare against the cold `--source bucket` and the
|
|
`--source hub` baseline.
|
|
|
|
Manual only — not run in CI.
|