mirror of https://github.com/huggingface/lerobot.git synced 2026-06-18 08:47:05 +00:00

Pepijn 1050c2fb6c feat(streaming): episode-pool iteration with decode-on-exit, video prefetch, and exact resume

Replace the shard/Backtrackable/decoded-shuffle-buffer internals with an
episode pool: each (rank x worker) consumer keeps episode_pool_size whole
episodes' tabular rows in RAM and emits uniformly random frames across
them. delta_timestamps windows become exact in-RAM slices with correct
boundary padding (the Backtrackable machinery and its lookback/lookahead
ceilings are gone), and video is decoded only when a sample is emitted,
so pool memory stays tabular-sized instead of buffer_size decoded
samples.
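The pool behaviour described above can be sketched in a few lines. This is a minimal illustration, not the code from this commit: `iter_episode_pool` and `delta_window` are hypothetical names, episodes are reduced to `(episode_id, num_frames)` pairs, deltas are given as frame offsets rather than timestamps, and padding is assumed to mean clamping to the episode boundary with an `is_pad` mask.

```python
import random


def iter_episode_pool(episodes, pool_size, seed=0):
    """Yield (episode_id, frame_index), uniformly at random over pooled frames.

    episodes: iterable of (episode_id, num_frames) pairs (hypothetical shape).
    """
    rng = random.Random(seed)
    source = iter(episodes)
    frames = []      # flat list of (episode_id, frame_index) still to emit
    remaining = {}   # episode_id -> number of frames not yet emitted

    def admit():
        # Pull the next non-empty episode into the pool; False when exhausted.
        for episode_id, num_frames in source:
            if num_frames > 0:
                remaining[episode_id] = num_frames
                frames.extend((episode_id, i) for i in range(num_frames))
                return True
        return False

    while len(remaining) < pool_size and admit():
        pass

    while frames:
        # Swap-and-pop gives a uniform draw without shifting the list.
        j = rng.randrange(len(frames))
        frames[j], frames[-1] = frames[-1], frames[j]
        episode_id, frame_index = frames.pop()
        yield episode_id, frame_index
        remaining[episode_id] -= 1
        if remaining[episode_id] == 0:
            del remaining[episode_id]   # evict the finished episode
            admit()                     # and refill the pool


def delta_window(num_frames, frame_index, frame_offsets):
    """Return (indices, is_pad) for a window, clamped to the episode bounds."""
    indices, is_pad = [], []
    for offset in frame_offsets:
        t = frame_index + offset
        is_pad.append(t < 0 or t >= num_frames)
        indices.append(min(max(t, 0), num_frames - 1))
    return indices, is_pad
```

Because a whole episode sits in the pool, the window is a plain slice of in-RAM rows; no lookback or lookahead buffer is involved in this sketch.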

- Prefetch-on-admit: when streaming from a remote source, each pooled
  episode's video files download to a local cache in the background
  (refcounted, since v3 packs several episodes per file; deleted on
  eviction), so decode-on-exit reads local bytes instead of paying
  network seek latency.
- Per-consumer RNG derived from (seed, epoch, rank, worker): consumers
  decorrelated, runs reproducible, epochs reshuffle automatically.
- Deterministic fast-forward resume: load_state_dict takes the trainer's
  {batches_consumed, batch_size}; each worker re-derives its own skip
  from the DataLoader's round-robin batch assignment and replays
  tabular-only (no decode). Exact within an epoch, works with
  num_workers > 0, and the same state file serves every rank. Replaces
  the per-shard HF state_dict approach, which lived in worker processes
  and could not be captured from the trainer.
- Shard-cap default removed (max_num_shards=None uses every parquet
  shard); runtime warnings for non-divisible world sizes (datasets
  degrades to read-everything splitting) and workers left without
  shards.
- episode_pool_size replaces buffer_size (deprecated, ignored with a
  warning); decoder cache sized to the pool working set, capped at 128.
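Two of the items above lend themselves to a short sketch: the per-consumer seed and the per-worker skip used for resume. Both functions are hypothetical illustrations, not the commit's code. The hash-based seed mix is an assumption (the commit only says the RNG is derived from seed, epoch, rank, and worker), and the skip formula assumes the DataLoader hands out batches to workers strictly round-robin, with no worker running dry early.

```python
import hashlib


def consumer_seed(seed, epoch, rank, worker):
    """Derive a 64-bit seed from (seed, epoch, rank, worker).

    Assumption: any stable hash works; SHA-256 is used here for illustration.
    """
    key = f"{seed}:{epoch}:{rank}:{worker}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def worker_skip(batches_consumed, batch_size, num_workers, worker_id):
    """Samples a worker must replay to catch up after a resume.

    With round-robin assignment, global batch i comes from worker i % num_workers.
    """
    if num_workers <= 0:
        # Single-process loading: the one consumer produced every batch.
        return batches_consumed * batch_size
    full_rounds, extra = divmod(batches_consumed, num_workers)
    batches_from_worker = full_rounds + (1 if worker_id < extra else 0)
    return batches_from_worker * batch_size
```

For example, with `batches_consumed=10`, `batch_size=64`, and four workers, workers 0 and 1 each skip 192 samples and workers 2 and 3 each skip 128, which sums to the 640 samples the trainer consumed.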

Legacy order-replication tests asserted the old buffer algorithm
step-by-step and are rewritten as behavior contracts (exactly-once
coverage, per-seed determinism, epoch reshuffle). Value-level parity
tests against the map-style dataset pass unchanged.
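A behavior contract of this kind checks properties of the output rather than its exact order. The sketch below shows the shape of such a check against a toy stand-in; `toy_epoch` and `check_contracts` are hypothetical and do not reproduce the repository's tests.

```python
import random


def toy_epoch(num_frames, seed, epoch):
    """Hypothetical stand-in for one epoch: a seeded shuffle of frame ids."""
    order = list(range(num_frames))
    # String seeds are hashed deterministically by random.Random.
    random.Random(f"{seed}:{epoch}").shuffle(order)
    return order


def check_contracts(make_epoch, num_frames, seed=0):
    """Assert the three contracts named above; return True if all hold."""
    first = make_epoch(num_frames, seed, 0)
    # Exactly-once coverage: every frame appears once, none is duplicated.
    assert sorted(first) == list(range(num_frames))
    # Per-seed determinism: same seed and epoch give the same order.
    assert make_epoch(num_frames, seed, 0) == first
    # Epoch reshuffle: the next epoch yields a different order.
    assert make_epoch(num_frames, seed, 1) != first
    return True
```

None of these assertions depends on which shuffle algorithm produced the order, so they survive a change of internals.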

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-11 15:02:15 +02:00

File	Last commit	Date
benchmark_streaming.py	feat(streaming): episode-pool iteration with decode-on-exit, video prefetch, and exact resume	2026-06-11 15:02:15 +02:00
diagnose_decode.py	docs(streaming): A100/H100 NVDEC cannot decode AV1 — correct guidance	2026-06-09 17:08:54 +02:00
README.md	docs(streaming): note AV1 is LeRobot's default codec (vcodec=libsvtav1)	2026-06-09 17:10:18 +02:00
summarize_results.py	feat(streaming): full-matrix SLURM submitter + results summarizer	2026-06-09 15:51:36 +02:00

README.md

Streaming dataloading benchmark

Measures dataloading only (no model) for StreamingLeRobotDataset: parquet read + video decode + delta windowing + shuffle. A dummy consumer pulls batches and moves them to the device, so the numbers isolate the data pipeline. Use it to compare sources (Hub vs. storage bucket vs. prewarmed bucket), frame modes, and node counts, and to catch p95/p99 video-decode regressions.

Run

python benchmarks/streaming/benchmark_streaming.py \
    --repo_id pepijn223/robocasa_pretrain_human300_v4 \
    --mode sarm --batch_size 64 --num_workers 12 --num_batches 200 \
    --source hub --out_dir benchmarks/streaming/results

Multinode (per-node throughput) goes through Accelerate under SLURM:

sbatch slurm/benchmark_streaming_robocasa.sh

Matrix

Axis	Values
Source	`hub` (verify now), `bucket`, `warmed_bucket` (bucket + prewarming; with user's help later)
Baseline	current `main` `StreamingLeRobotDataset` on Hub streaming
Nodes	1 and 2 (per-node throughput should be independent)
Frame mode	`single` (1 frame, all cameras; target ≥ 120 frames/s/node) · `sarm` (8 steps spaced 1s; target ≥ 320 frames/s/node)

--source is a label only; the actual source is whatever --repo_id / --root / --data_files_root point at.
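The matrix above can be enumerated mechanically. The sketch below is a hypothetical helper, not part of the benchmark: it uses only flags that appear in this README, assumes `--mode single` is accepted in the same way as `--mode sarm`, and builds the single-node command only (multinode runs go through the SLURM script).

```python
import shlex
from itertools import product

SOURCES = ("hub", "bucket", "warmed_bucket")
NODES = (1, 2)
MODES = ("single", "sarm")


def matrix_cells():
    """Every (source, nodes, mode) combination from the matrix table."""
    return [
        {"source": s, "nodes": n, "mode": m}
        for s, n, m in product(SOURCES, NODES, MODES)
    ]


def benchmark_args(cell, repo_id, batch_size=64, num_workers=12,
                   num_batches=200, out_dir="benchmarks/streaming/results"):
    """Argument list for one single-node run of the benchmark script."""
    return [
        "python", "benchmarks/streaming/benchmark_streaming.py",
        "--repo_id", repo_id,
        "--mode", cell["mode"],
        "--batch_size", str(batch_size),
        "--num_workers", str(num_workers),
        "--num_batches", str(num_batches),
        "--source", cell["source"],
        "--out_dir", out_dir,
    ]


if __name__ == "__main__":
    for cell in matrix_cells():
        if cell["nodes"] == 1:
            print(shlex.join(benchmark_args(cell, "pepijn223/robocasa_pretrain_human300_v4")))
```

Since `--source` is only a label, a real run of the `bucket` cells would also need the data location flags set accordingly.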

GPU (NVDEC) decoding

By default, video is decoded on the CPU in each DataLoader worker, so throughput is CPU-decode-bound and scales with --num_workers (capped by the dataset's num_shards). Pass --video_decode_device cuda to offload H.264/H.265 decode to the GPU's dedicated NVDEC engine, which runs independently of the SMs used for training (see https://developer.nvidia.com/video-codec-sdk). This requires a CUDA-enabled torchcodec build. Because CUDA cannot be initialized in forked worker processes, the benchmark automatically switches to the spawn start method when --num_workers > 0.
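The start-method switch can be pictured as below. This is a sketch under stated assumptions rather than the benchmark's code: `pick_start_method` is a hypothetical helper, and the comment shows how its result could be passed to PyTorch's `DataLoader` through the `multiprocessing_context` argument.

```python
import multiprocessing


def pick_start_method(video_decode_device, num_workers):
    """Return "spawn" when CUDA decode runs in worker processes, else None.

    None means "keep the platform default" (fork on most Linux setups).
    """
    if num_workers > 0 and video_decode_device.startswith("cuda"):
        return "spawn"
    return None


if __name__ == "__main__":
    method = pick_start_method("cuda", num_workers=6)
    ctx = multiprocessing.get_context(method) if method else None
    # With PyTorch installed, the context could be handed to the loader:
    #   DataLoader(dataset, num_workers=6, multiprocessing_context=ctx)
    print(method)
```

Spawned workers start from a fresh interpreter, so startup is slower than with fork, and the dataset object must be picklable.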

# GPU/NVDEC decode, 6 workers, bucket source
python benchmarks/streaming/benchmark_streaming.py \
    --repo_id pepijn223/robocasa_pretrain_human300_v4 \
    --data_files_root hf://buckets/pepijn223/robocasa-stream \
    --mode sarm --batch_size 64 --num_workers 6 --num_batches 200 \
    --video_decode_device cuda --source bucket

Caveats with cuda + many workers: each worker creates its own CUDA context (VRAM overhead) and NVDEC has a limited number of concurrent decode sessions per GPU; if you hit session/IPC limits, reduce --num_workers or compare against --num_workers 0 (single-process NVDEC, which often saturates the decode engine on its own). Result files include the decode device in their name (..._w6_cuda.json).

Codec ⇄ NVDEC compatibility (important). NVDEC can only decode codecs its hardware supports. LeRobot's default video codec is AV1 (VideoEncoderConfig.vcodec = "libsvtav1"), so most v3 datasets are AV1-encoded. The A100 and H100 compute GPUs have no AV1 NVDEC decoder (per NVIDIA's decode support matrix); only Ada (L4/L40/RTX40) and a few Ampere cards (A10/A40/A16) do. On A100/H100, AV1 must be decoded on the CPU, or the dataset must be re-encoded to H.265/H.264 (which NVDEC on those GPUs does support). Run diagnose_decode.py --video_decode_device cuda to check your exact node before relying on cuda decode. A cuda torchcodec build also needs an FFmpeg build with NVDEC support; see https://github.com/meta-pytorch/torchcodec#installing-cuda-enabled-torchcodec.
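Before choosing a decode device, it helps to know which codec a dataset's videos actually use. The sketch below asks `ffprobe` (assumed to be installed) for the first video stream's codec name and applies the rule stated above. It is not what diagnose_decode.py does; the helper names are hypothetical, and whether the GPU has an AV1 NVDEC decoder is passed in as a flag rather than detected.

```python
import subprocess


def ffprobe_codec_cmd(path):
    """ffprobe invocation that prints only the first video stream's codec name."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=codec_name",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ]


def video_codec(path):
    """Return the codec name, e.g. "av1", "hevc" or "h264" (requires ffprobe)."""
    result = subprocess.run(
        ffprobe_codec_cmd(path), check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def choose_decode_device(codec, gpu_has_av1_nvdec):
    """Pick "cuda" or "cpu" following the compatibility note above."""
    if codec in ("h264", "hevc"):
        return "cuda"
    if codec == "av1":
        return "cuda" if gpu_has_av1_nvdec else "cpu"
    # Unknown codec: stay on the CPU rather than guess.
    return "cpu"
```

A call such as `choose_decode_device(video_codec("episode.mp4"), gpu_has_av1_nvdec=False)` would then return "cpu" for an AV1 file; the file name here is a placeholder.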

Reference data root: bucket sources resolve through --data_files_root hf://buckets/<owner>/<name> (metadata still loads from --repo_id). The local single/sarm CPU baselines on this dataset were ~176 / ~212 frames/s/node at --num_workers 3 (3 cameras, fps 20).

Metrics emitted (JSON + CSV)

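frames_per_s_node, samples_per_s, first_batch_latency_s, p50/p95/p99_sample_latency_ms, wallclock_s, and video_decoder_cache (hits, misses, evictions, hit_rate, size). A low cache hit_rate combined with a high p99 is the decoder-thrash signature: raise --video_decoder_cache_size or --buffer_size, or reduce --num_workers. Note that the episode-pool commit deprecates buffer_size (ignored with a warning) in favor of episode_pool_size, so on builds that include it, adjust the pool size instead.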
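For reading the numbers, the sketch below shows one way such metrics can be computed from raw measurements. It is an illustration only: `summarize` is a hypothetical function, and the nearest-rank percentile is an assumption, since the README does not say which percentile method the benchmark uses.

```python
def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    # ceil(q * n / 100) in integer arithmetic, clamped to at least rank 1.
    rank = max((q * len(ordered) + 99) // 100, 1)
    return ordered[rank - 1]


def summarize(sample_latencies_ms, frames, wallclock_s, cache_hits, cache_misses):
    """Build a metrics dict using the field names listed above."""
    lookups = cache_hits + cache_misses
    return {
        "frames_per_s_node": frames / wallclock_s,
        "p50_sample_latency_ms": percentile(sample_latencies_ms, 50),
        "p95_sample_latency_ms": percentile(sample_latencies_ms, 95),
        "p99_sample_latency_ms": percentile(sample_latencies_ms, 99),
        "wallclock_s": wallclock_s,
        "video_decoder_cache": {
            "hits": cache_hits,
            "misses": cache_misses,
            "hit_rate": cache_hits / lookups if lookups else 0.0,
        },
    }
```

A wide gap between p50 and p99 with a low hit_rate is the pattern the decoder-thrash note refers to.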

Bucket sources & prewarming (manual)

Prewarming is a server-side Hugging Face storage-bucket feature — there is no client script. To benchmark the warmed_bucket source:

1. Attach a storage bucket to the dataset and enable it (see https://huggingface.co/docs/hub/storage-buckets). Buckets resolve through fsspec, the same as hf://, so no code change is needed: point --data_files_root at the bucket (hf://buckets/<owner>/<name>, with metadata still loading from --repo_id), or point --repo_id/--revision (or --root) at it.
2. Enable prewarming in the bucket settings and wait for warm-up to complete.
3. Run the benchmark with --source warmed_bucket. Compare against the cold --source bucket and the --source hub baseline.

Manual only — not run in CI.