Add `video_decode_device` to StreamingLeRobotDataset and a `device` arg to
VideoDecoderCache, passed to torchcodec's VideoDecoder. "cuda" offloads H.264/H.265
decode to the GPU's dedicated NVDEC engine (independent of the training SMs); requires
a CUDA-enabled torchcodec build.
benchmark: add a `--video_decode_device` flag. With cuda + num_workers>0 it forces the
`spawn` start method (CUDA cannot initialize in forked workers) and disables CPU pin_memory
(frames are already on-GPU). Decode device is recorded in results and the output
filename. README documents the NVDEC option and its concurrency/IPC caveats.
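
A minimal sketch of the worker setup described above, assuming a hypothetical
helper named `loader_kwargs` (this is not the benchmark's actual code).
`num_workers`, `pin_memory` and `multiprocessing_context` are standard
`torch.utils.data.DataLoader` arguments; the helper only builds the keyword dict,
so it runs without torch installed. Keeping pin_memory on for CPU decode is an
assumption.

```python
def loader_kwargs(video_decode_device: str, num_workers: int) -> dict:
    """Build DataLoader keyword arguments for a decode device (hypothetical helper)."""
    on_gpu = video_decode_device.startswith("cuda")
    kwargs = {
        "num_workers": num_workers,
        # GPU-decoded frames are already device tensors, so pinning host memory
        # buys nothing. Enabling it for CPU decode is an assumption of this sketch.
        "pin_memory": not on_gpu,
    }
    if on_gpu and num_workers > 0:
        # Workers that touch CUDA have to be spawned rather than forked.
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs
```

The resulting dict would be splatted into the loader, e.g.
`DataLoader(dataset, **loader_kwargs("cuda", 4))`.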
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- benchmark: frames_per_s_node now measures sustained wall-clock throughput over the
post-warmup window. The previous metric summed inter-batch gaps, which collapse to ~0
under async prefetch (consumer drains a pre-filled queue) and overstated throughput ~100x.
- VideoDecoderCache gains an optional shared [hits, misses, evictions] counter tensor;
StreamingLeRobotDataset.video_decoder_cache_stats() aggregates it across DataLoader
workers (lock-free, approximate; hit_rate preserved). Fixes empty cache stats with workers.
- StreamingLeRobotDataset.data_files_root: read bulk data/ + videos/ from an fsspec root
(e.g. hf://buckets/<owner>/<name>) while metadata still loads from repo_id. Enables
bucket / prewarmed-bucket benchmark sources without copying metadata. Exposed as
benchmark --data_files_root.
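
The throughput fix in the first bullet can be sketched as follows. Function names
and timings are made up for illustration, and the contrast metric reflects one
reading of the old behaviour (timing only how long the consumer waits for each
batch), not the benchmark's actual code.

```python
def sustained_fps(batch_done_at, frames_per_batch, warmup_batches):
    """Frames per second over the post-warmup wall-clock window."""
    if warmup_batches < 1 or len(batch_done_at) <= warmup_batches:
        raise ValueError("need at least one warmup batch and one measured batch")
    # The window starts when the last warmup batch completes.
    window = batch_done_at[-1] - batch_done_at[warmup_batches - 1]
    frames = (len(batch_done_at) - warmup_batches) * frames_per_batch
    return frames / window


def fetch_wait_fps(fetch_waits, frames_per_batch, warmup_batches):
    """Contrast metric: divides by the time spent blocked in each fetch only."""
    waits = fetch_waits[warmup_batches:]
    return len(waits) * frames_per_batch / sum(waits)


# Illustrative numbers: one batch completes per second, but with a pre-filled
# prefetch queue each fetch after warmup returns in about 10 ms.
done_at = [1.0, 2.0, 3.0, 4.0, 5.0]
waits = [0.9, 0.01, 0.01, 0.01, 0.01]

wall_clock = sustained_fps(done_at, frames_per_batch=64, warmup_batches=1)
per_fetch = fetch_wait_fps(waits, frames_per_batch=64, warmup_batches=1)
print(f"wall-clock: {wall_clock:.1f} frames/s, per-fetch: {per_fetch:.1f} frames/s")
```

Dividing by the wall-clock window keeps the number tied to what a training loop
would actually see, regardless of how full the prefetch queue is.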
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- examples/scaling/train_streaming_multinode.py: Accelerate-based distributed/
resumable streaming training (no DistributedSampler; rank/world_size auto-resolved),
checkpoints the dataset stream state, and supports a --dummy pure-dataloading path
with throughput logging. SLURM launcher in slurm/train_streaming_robocasa.sh.
- benchmarks/streaming/benchmark_streaming.py: dummy-consumer dataloading benchmark
(single / sarm frame modes) emitting frames/s/node, p50/p95/p99 sample latency,
first-batch latency, and VideoDecoderCache reuse stats as JSON + CSV. SLURM launcher
+ README documenting the source/node/mode matrix and manual bucket prewarming.
- VideoDecoderCache: add hit/miss/eviction counters and a stats() method so the
benchmark can surface decoder thrash (no new cache, no eviction-policy change).
- tests/datasets/test_streaming_distributed.py: accelerate-launch smoke test asserting
per-rank disjointness; skips (does not false-pass) when <2 processes spawn.
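
The counter idea from the VideoDecoderCache bullet can be sketched with a small
stand-in class. `CountingLRUCache`, its `get(key, factory)` signature and the LRU
eviction policy are assumptions for illustration; the real cache's interface and
policy are not shown in this message.

```python
from collections import OrderedDict


class CountingLRUCache:
    """Bounded cache that counts hits, misses and evictions (illustrative stand-in)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0
        self.evictions = 0

    def get(self, key, factory):
        """Return the cached value for key, building it with factory(key) on a miss."""
        if key in self._items:
            self.hits += 1
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = factory(key)
        self._items[key] = value
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the least recently used entry
            self.evictions += 1
        return value

    def stats(self):
        """Snapshot of the counters plus the derived hit rate."""
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "evictions": self.evictions,
            "hit_rate": self.hits / total if total else 0.0,
        }
```

A high eviction count next to a low hit rate is the thrash signal the benchmark
is meant to surface.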
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(deps): better versioning control for torchcodec
* refactor(video_utils): replace torchvision with pyav
* adding torchcodec version to lerobot-info
* chore(benchmarks): delete video benchmark
---------
Co-authored-by: Maximellerbach <maxime.ellerbach@huggingface.co>
* fix(time benchmark): removing deprecated TimeBenchmark dependency
* fix(typo): renaming frames in an up-to-date fashion
* feat(duets): rearranging crf and g parameters so that each combination is unique
* fix(segfault): fixing segfault by adding a lock in ThreadPoolExecutor
* chore(update): update datasets, codecs and backends to the latest versions
* chore(unused files): removing unused files
* fix(dataset paths): fix dataset paths to live among lerobot datasets
* chore: replace hard-coded OBS values with constants throughout all the source code
* chore(tests): replace hard-coded OBS values with constants throughout all the test code
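
For the segfault fix listed above, the general pattern of guarding a
non-thread-safe call with a lock inside a ThreadPoolExecutor looks roughly like
this. `run_serialized` is a hypothetical helper; the message does not say which
call the actual fix guards.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_serialized(fn, items, max_workers=4):
    """Map fn over items in a thread pool, letting only one call to fn run at a time."""
    lock = threading.Lock()

    def guarded(item):
        # Only the call into fn is serialized; scheduling and result
        # collection still go through the pool.
        with lock:
            return fn(item)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(guarded, items))
```

Holding the lock around the whole call trades parallelism for safety, so it fits
best when the guarded section is a small part of each task.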