The remote commit (2ab71231c) added an opt-in episode pool, deferred
decode in the legacy buffer path, decode/fetch timing instrumentation,
remote-IO retries (video_utils), and 32MB row-group writing
(dataset_tools). The pool rewrite on this side makes the episode pool
the only iteration path (with prefetch-on-admit, per-consumer seeding,
worker-exact fast-forward resume), so streaming_dataset.py resolves to
the rewrite with the remote instrumentation ported into it:
- 5-slot shared counters + timing_stats() (decode_s_total/fetch_s_total)
- fetch timed around episode admission, decode timed around emission
- benchmark/slurm keep the remote updates, with episode_pool_size as the
knob (buffer_size deprecated and ignored)
video_utils retries and dataset_tools row groups are taken unchanged.
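As a rough illustration of the ported instrumentation, the sketch below keeps five float slots in shared memory and times fetch and decode with a context manager. The slot order, the SharedStats class, and the timed() helper are hypothetical; only timing_stats(), decode_s_total, and fetch_s_total are named above, and the real code may use a shared tensor rather than multiprocessing.Array.

```python
import time
from contextlib import contextmanager
from multiprocessing import Array

# Hypothetical slot layout: the commit only names the two timing totals.
HITS, MISSES, EVICTIONS, DECODE_S, FETCH_S = range(5)


class SharedStats:
    """Five float slots in shared memory, updated without a lock (approximate)."""

    def __init__(self):
        # lock=False returns a raw ctypes array, zero-initialised.
        self._slots = Array("d", 5, lock=False)

    @contextmanager
    def timed(self, slot):
        """Add the wall-clock duration of the wrapped block to `slot`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self._slots[slot] += time.perf_counter() - start

    def timing_stats(self):
        return {
            "decode_s_total": self._slots[DECODE_S],
            "fetch_s_total": self._slots[FETCH_S],
        }


stats = SharedStats()
with stats.timed(FETCH_S):   # would wrap episode admission
    pass
with stats.timed(DECODE_S):  # would wrap sample emission
    pass
```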
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The fast-forward skip assumed every DataLoader worker delivers batches;
workers that own no shards yield nothing and are stopped, so the batch
round-robin runs over min(num_workers, num_shards) active workers. Use
that effective count (shard-less workers skip nothing). Adds a resume
test under num_workers=2 asserting exact continuation.
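The effective-count rule above can be sketched as follows. This is a minimal illustration, assuming batch i is delivered by active worker i % active; worker_skip and its parameter names are hypothetical and not the function in streaming_dataset.py.

```python
def worker_skip(batches_consumed, batch_size, num_workers, num_shards, worker_id):
    """Samples a worker should replay (tabular-only) before resuming.

    Assumes round-robin delivery over the workers that actually own shards.
    """
    # Workers without shards yield nothing, so they are not part of the rotation.
    active = max(1, min(num_workers, num_shards))
    if worker_id >= active:
        return 0  # shard-less worker: nothing to skip
    # Number of indices i in [0, batches_consumed) with i % active == worker_id.
    batches = (batches_consumed - worker_id + active - 1) // active
    return batches * batch_size


# Single-shard fixture with num_workers=2: only worker 0 is active.
print(worker_skip(5, 8, num_workers=2, num_shards=1, worker_id=0))  # 40
print(worker_skip(5, 8, num_workers=2, num_shards=1, worker_id=1))  # 0
```

With more shards than workers the same call splits the five consumed batches 3/2 between the two workers.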
Note: the test fixtures write a single parquet file regardless of
data_files_size_in_mb, so worker-splitting tests exercise the degenerate
single-shard layout; multi-shard behavior is covered by the rank-level
split_dataset_by_node tests.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace the shard/Backtrackable/decoded-shuffle-buffer internals with an
episode pool: each (rank x worker) consumer keeps episode_pool_size whole
episodes' tabular rows in RAM and emits uniformly random frames across
them. delta_timestamps windows become exact in-RAM slices with correct
boundary padding (the Backtrackable machinery and its lookback/lookahead
ceilings are gone), and video is decoded only when a sample is emitted,
so pool memory stays tabular-sized instead of buffer_size decoded
samples.
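A minimal sketch of the pool idea described above, under stated assumptions: episodes are reduced to (episode_id, num_frames) pairs, delta offsets are already converted to frame counts, and padding means clamping to the episode boundary plus a pad flag. iter_episode_pool and window_indices are hypothetical names, not the dataset's API.

```python
import random


def iter_episode_pool(episodes, pool_size, rng):
    """Yield (episode_id, frame_index) uniformly at random across pooled episodes."""
    source = iter(episodes)  # iterable of (episode_id, num_frames)
    pool = []                # entries: [episode_id, remaining frame indices]

    def admit():
        # Pull the next non-empty episode into the pool, if any is left.
        for ep_id, num_frames in source:
            if num_frames > 0:
                pool.append([ep_id, list(range(num_frames))])
                return

    for _ in range(pool_size):
        admit()
    while pool:
        # Weighting by remaining frames makes the draw uniform over all pooled frames.
        weights = [len(entry[1]) for entry in pool]
        slot = rng.choices(range(len(pool)), weights=weights)[0]
        ep_id, remaining = pool[slot]
        j = rng.randrange(len(remaining))
        remaining[j], remaining[-1] = remaining[-1], remaining[j]
        yield ep_id, remaining.pop()
        if not remaining:
            pool.pop(slot)  # evict the exhausted episode and admit a new one
            admit()


def window_indices(frame, offsets, num_frames):
    """Clamp frame+offset to the episode and flag out-of-range entries as padding."""
    indices = [min(max(frame + d, 0), num_frames - 1) for d in offsets]
    is_pad = [not (0 <= frame + d < num_frames) for d in offsets]
    return indices, is_pad


samples = list(iter_episode_pool([(0, 3), (1, 2), (2, 4)], 2, random.Random(0)))
```

Every frame is emitted exactly once, and the same seed reproduces the same order, which is the kind of behavior contract the rewritten tests check.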
- Prefetch-on-admit: when streaming from a remote source, each pooled
episode's video files download to a local cache in the background
(refcounted, since v3 packs several episodes per file; deleted on
eviction), so decode-on-exit reads local bytes instead of paying
network seek latency.
- Per-consumer RNG derived from (seed, epoch, rank, worker): consumers
are decorrelated, runs are reproducible, and epochs reshuffle
automatically.
- Deterministic fast-forward resume: load_state_dict takes the trainer's
{batches_consumed, batch_size}; each worker re-derives its own skip
from the DataLoader's round-robin batch assignment and replays
tabular-only (no decode). Exact within an epoch, works with
num_workers > 0, and the same state file serves every rank. Replaces
the per-shard HF state_dict approach, which lived in worker processes
and could not be captured from the trainer.
- Shard-cap default removed (max_num_shards=None uses every parquet
shard); runtime warnings for shard counts not divisible by the world
size (the datasets library then degrades to read-everything splitting)
and for workers left without shards.
- episode_pool_size replaces buffer_size (deprecated, ignored with a
warning); decoder cache sized to the pool working set, capped at 128.
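The per-consumer seeding in the list above could look roughly like this. It is a sketch only: hashing the tuple with SHA-256 is an assumption made for illustration, and consumer_rng is a hypothetical helper rather than the derivation the dataset uses.

```python
import hashlib
import random


def consumer_rng(seed, epoch, rank, worker):
    """Build an independent, reproducible RNG for one (rank x worker) consumer."""
    key = f"{seed}:{epoch}:{rank}:{worker}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as the integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


# Same tuple -> same stream; changing epoch, rank, or worker gives a different stream.
first = consumer_rng(seed=42, epoch=0, rank=0, worker=0).random()
again = consumer_rng(seed=42, epoch=0, rank=0, worker=0).random()
next_epoch = consumer_rng(seed=42, epoch=1, rank=0, worker=0).random()
```

Because the epoch is part of the key, advancing the epoch reshuffles without any extra state.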
Legacy order-replication tests asserted the old buffer algorithm
step-by-step and are rewritten as behavior contracts (exactly-once
coverage, per-seed determinism, epoch reshuffle). Value-level parity
tests against the map-style dataset pass unchanged.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- streaming_dataset: defer torchcodec decode until a sample leaves the shuffle
buffer (buffer now holds ~KB tabular rows, not MB of pixels) and add an opt-in
episode-pool shuffle (episode_pool_size) with exact in-episode delta lookups;
expose decode/fetch timing_stats.
- video_utils: retry transient hf:///fsspec/httpx transport errors during
streaming decode (LEROBOT_REMOTE_IO_MAX_RETRIES).
- dataset_tools: write multiple ~32MB row groups with a page index to bound
per-shard streaming memory.
- benchmarks/slurm: streaming benchmark + matrix submitter updates.
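For the video_utils retry item above, a hedged sketch of the pattern: the exception tuple, the default of 3 retries, the exponential backoff, and the with_retries name are all assumptions; only the LEROBOT_REMOTE_IO_MAX_RETRIES variable comes from the commit.

```python
import os
import time

# Stand-ins for the transient fsspec/httpx transport errors (assumed set).
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def with_retries(fn, *args, max_retries=None, base_delay=0.5, sleep=time.sleep, **kwargs):
    """Call fn, retrying transient errors with exponential backoff."""
    if max_retries is None:
        # Default of 3 is an assumption, not the library's documented value.
        max_retries = int(os.environ.get("LEROBOT_REMOTE_IO_MAX_RETRIES", "3"))
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except TRANSIENT_ERRORS:
            if attempt == max_retries:
                raise  # retries exhausted: surface the original error
            sleep(base_delay * 2 ** attempt)
```

Non-transient exceptions propagate immediately, so genuine decode failures are not masked by the retry loop.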
Co-authored-by: Cursor <cursoragent@cursor.com>
So the A100/H100 no-AV1-NVDEC limitation applies to most LeRobot v3 datasets, not just
RoboCasa — GPU decode needs an Ada GPU, an hevc/h264-encoded dataset, or a re-encode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NVIDIA's decode support matrix: the compute GPUs A100 (GA100) and H100 (GH100) have no
AV1 NVDEC decoder; only Ada (L4/L40/RTX40) and some Ampere (A10/A40/A16) do. So on
A100/H100 nodes, AV1 datasets must be decoded on CPU or re-encoded to H.265/H.264 — no
torchcodec build enables cuda AV1 decode there. Also distinguish that error from
"Unsupported device: cuda (variant: ffmpeg)", which is a torchcodec-built-without-CUDA
issue. Update diagnose_decode.py message + benchmark README accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- benchmark: raise SystemExit if 0 frames were measured, so a run that produces no
batches (swallowed decode error, all batches dropped) fails loudly instead of being
reported green with NaN/zero numbers (the misleading "COMPLETED" CUDA jobs).
- add benchmarks/streaming/diagnose_decode.py: isolates the streaming decode path
(resolve path -> fsspec.open -> torchcodec VideoDecoder -> get one frame) and prints
package versions + the first bytes of the handle. Pinpoints decode failures: bad/
placeholder bytes vs ffmpeg/torchcodec build issue. RoboCasa videos are AV1; the
failure message calls out AV1 decoder + NVDEC-on-Ada requirements explicitly.
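The zero-frame guard from the first item above amounts to a check like the following. The finalize function and its result keys are hypothetical; the sketch only shows the fail-loudly behavior.

```python
def finalize(frames_measured, elapsed_s):
    """Return benchmark results, or exit non-zero if nothing was measured."""
    if frames_measured == 0:
        # A string argument makes the interpreter print it and exit with status 1.
        raise SystemExit("benchmark measured 0 frames: no batches were produced")
    return {"frames": frames_measured, "frames_per_s": frames_measured / elapsed_s}
```

Raising SystemExit makes the job exit with a non-zero status, so the scheduler reports it as failed instead of completed.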
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
slurm/run_streaming_matrix.sh fans the benchmark matrix (sources {hub,bucket,
warmed_bucket} x modes {single,sarm} x decode {cpu,cuda}) out as isolated single-GPU
SLURM jobs, so an OOM in one config is contained and reported per-job by SLURM. Worker
count and shuffle buffer are bounded (lower for cuda, which holds a CUDA context + NVDEC
session per worker) to avoid host/VRAM OOM. Source/mode/decode/workers/buffer/account/
partition are env-overridable; SOURCES/MODES/DECODES select subsets.
benchmarks/streaming/summarize_results.py collapses the per-run JSONs into one comparison
table + summary.csv (frames/s/node, first-batch + p50/p95/p99 latency, cache hit-rate).
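The matrix fan-out above is done in shell; purely as an illustration, the same enumeration in Python looks like this. The space-separated format of the SOURCES/MODES/DECODES overrides is an assumption, and matrix() is a hypothetical helper.

```python
import os
from itertools import product

SOURCES = ("hub", "bucket", "warmed_bucket")
MODES = ("single", "sarm")
DECODES = ("cpu", "cuda")


def _select(name, all_values, environ):
    """Keep only the values listed in the env override, or all if unset."""
    requested = environ.get(name, "").split()
    return tuple(v for v in all_values if v in requested) if requested else all_values


def matrix(environ=None):
    """Return one (source, mode, decode) tuple per isolated single-GPU job."""
    environ = os.environ if environ is None else environ
    return list(product(
        _select("SOURCES", SOURCES, environ),
        _select("MODES", MODES, environ),
        _select("DECODES", DECODES, environ),
    ))


print(len(matrix({})))                  # 12 jobs for the full matrix
print(len(matrix({"DECODES": "cpu"})))  # 6 jobs when only CPU decode is selected
```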
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `video_decode_device` to StreamingLeRobotDataset and a `device` arg to
VideoDecoderCache, passed to torchcodec's VideoDecoder. "cuda" offloads H.264/H.265
decode to the GPU's dedicated NVDEC engine (independent of the training SMs); requires
a CUDA-enabled torchcodec build.
benchmark: `--video_decode_device` flag. With cuda + num_workers>0 it forces the
`spawn` start method (CUDA cannot init in forked workers) and disables CPU pin_memory
(frames are already on-GPU). Decode device is recorded in results and the output
filename. README documents the NVDEC option and its concurrency/IPC caveats.
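The benchmark's DataLoader settings described above can be summarised in a small helper. This is a sketch assuming pin_memory is otherwise enabled for CPU decode; loader_kwargs is a hypothetical name, while multiprocessing_context and pin_memory are real torch.utils.data.DataLoader arguments.

```python
def loader_kwargs(video_decode_device, num_workers):
    """Pick DataLoader keyword arguments for the chosen decode device."""
    use_cuda = video_decode_device == "cuda"
    kwargs = {
        "num_workers": num_workers,
        # Frames decoded by NVDEC are already on the GPU, so pinning is skipped.
        "pin_memory": not use_cuda,
    }
    if use_cuda and num_workers > 0:
        # CUDA cannot be initialised in forked workers, so force spawn.
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs


print(loader_kwargs("cuda", 4))
print(loader_kwargs("cpu", 4))
```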
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- benchmark: frames_per_s_node now measures sustained wall-clock throughput over the
post-warmup window. The previous metric summed inter-batch gaps, which collapse to ~0
under async prefetch (consumer drains a pre-filled queue) and overstated throughput ~100x.
- VideoDecoderCache gains an optional shared [hits, misses, evictions] counter tensor;
StreamingLeRobotDataset.video_decoder_cache_stats() aggregates it across DataLoader
workers (lock-free, approximate; hit_rate preserved). Fixes empty cache stats with workers.
- StreamingLeRobotDataset.data_files_root: read bulk data/ + videos/ from an fsspec root
(e.g. hf://buckets/<owner>/<name>) while metadata still loads from repo_id. Enables
bucket / prewarmed-bucket benchmark sources without copying metadata. Exposed as
benchmark --data_files_root.
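A sketch of the wall-clock throughput metric from the first item above, assuming the benchmark records a completion timestamp per batch. The function name and arguments are hypothetical.

```python
def sustained_frames_per_s(start_time, batch_end_times, batch_sizes, warmup_batches):
    """Frames delivered after warmup divided by the wall-clock span of that window."""
    if len(batch_end_times) != len(batch_sizes):
        raise ValueError("one end time is required per batch")
    if warmup_batches >= len(batch_end_times):
        raise ValueError("no post-warmup batches to measure")
    # The window opens when the last warmup batch completes.
    window_start = batch_end_times[warmup_batches - 1] if warmup_batches > 0 else start_time
    window_s = batch_end_times[-1] - window_start
    if window_s <= 0:
        raise ValueError("measurement window has no duration")
    return sum(batch_sizes[warmup_batches:]) / window_s


# Four batches of 8 frames finishing at t=1..4 s, two of them warmup:
print(sustained_frames_per_s(0.0, [1.0, 2.0, 3.0, 4.0], [8, 8, 8, 8], 2))  # 8.0
```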
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- examples/scaling/train_streaming_multinode.py: Accelerate-based distributed/
resumable streaming training (no DistributedSampler; rank/world_size auto-resolved),
checkpoints the dataset stream state, and supports a --dummy pure-dataloading path
with throughput logging. SLURM launcher in slurm/train_streaming_robocasa.sh.
- benchmarks/streaming/benchmark_streaming.py: dummy-consumer dataloading benchmark
(single / sarm frame modes) emitting frames/s/node, p50/p95/p99 sample latency,
first-batch latency, and VideoDecoderCache reuse stats as JSON + CSV. SLURM launcher
+ README documenting the source/node/mode matrix and manual bucket prewarming.
- VideoDecoderCache: add hit/miss/eviction counters and a stats() method so the
benchmark can surface decoder thrash (no new cache, no eviction-policy change).
- tests/datasets/test_streaming_distributed.py: accelerate-launch smoke test asserting
per-rank disjointness; skips (does not false-pass) when <2 processes spawn.
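To show what the hit/miss/eviction counters in the VideoDecoderCache item above report, here is a self-contained sketch. The LRU policy and the CountingLRU class are stand-ins chosen for illustration; the commit adds counters only and leaves the cache's own eviction policy unchanged.

```python
from collections import OrderedDict


class CountingLRU:
    """Tiny LRU cache that counts hits, misses, and evictions."""

    def __init__(self, max_size, factory):
        self._max_size = max_size
        self._factory = factory  # builds a value (e.g. a decoder) for a key
        self._items = OrderedDict()
        self.hits = self.misses = self.evictions = 0

    def get(self, key):
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = self._factory(key)
        self._items[key] = value
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1
        return value

    def stats(self):
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "evictions": self.evictions,
            "hit_rate": self.hits / total if total else 0.0,
        }
```

A low hit_rate together with a high eviction count is the decoder thrash the benchmark is meant to surface.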
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(deps): better versioning control for torchcodec
* refactor(video_utils): replace torchvision with pyav
* add torchcodec version to lerobot-info
* chore(benchmarks): delete video benchmark
---------
Co-authored-by: Maximellerbach <maxime.ellerbach@huggingface.co>
* fix(time benchmark): removing deprecated TimeBenchmark dependency
* fix(typo): renaming frames in an up-to-date fashion
* feat(duets): rearranging crf and g parameters into unique combinations
* fix(segfault): fixing segfault by adding a lock in ThreadPoolExecutor
* chore(update): update datasets, codecs and backends to the latest versions
* chore(unused files): removing unused files
* fix(dataset paths): fix datasets paths to live among lerobot datasets
* chore: replace hard-coded OBS values with constants throughout all the source code
* chore(tests): replace hard-coded OBS values with constants throughout all the test code