lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-19 01:07:18 +00:00

Author	SHA1	Message	Date
Pepijn	79b547de32	Merge remote episode-pool work into the full pool rewrite The remote commit (`2ab71231c`) added an opt-in episode pool, deferred decode in the legacy buffer path, decode/fetch timing instrumentation, remote-IO retries (video_utils), and 32MB row-group writing (dataset_tools). The pool rewrite on this side makes the episode pool the only iteration path (with prefetch-on-admit, per-consumer seeding, worker-exact fast-forward resume), so streaming_dataset.py resolves to the rewrite with the remote instrumentation ported into it: - 5-slot shared counters + timing_stats() (decode_s_total/fetch_s_total) - fetch timed around episode admission, decode timed around emission - benchmark/slurm keep the remote updates, with episode_pool_size as the knob (buffer_size deprecated and ignored) video_utils retries and dataset_tools row groups are taken unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:17:04 +02:00
Pepijn	a7b7f4964e	fix(streaming): worker-exact resume arithmetic and multi-worker resume test The fast-forward skip assumed every DataLoader worker delivers batches; workers that own no shards yield nothing and are stopped, so the batch round-robin runs over min(num_workers, num_shards) active workers. Use that effective count (shard-less workers skip nothing). Adds a resume test under num_workers=2 asserting exact continuation. Note: the test fixtures write a single parquet file regardless of data_files_size_in_mb, so worker-splitting tests exercise the degenerate single-shard layout; multi-shard behavior is covered by the rank-level split_dataset_by_node tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:11:00 +02:00
Pepijn	1050c2fb6c	feat(streaming): episode-pool iteration with decode-on-exit, video prefetch, and exact resume Replace the shard/Backtrackable/decoded-shuffle-buffer internals with an episode pool: each (rank x worker) consumer keeps episode_pool_size whole episodes' tabular rows in RAM and emits uniformly random frames across them. delta_timestamps windows become exact in-RAM slices with correct boundary padding (the Backtrackable machinery and its lookback/lookahead ceilings are gone), and video is decoded only when a sample is emitted, so pool memory stays tabular-sized instead of buffer_size decoded samples. - Prefetch-on-admit: when streaming from a remote source, each pooled episode's video files download to a local cache in the background (refcounted, since v3 packs several episodes per file; deleted on eviction), so decode-on-exit reads local bytes instead of paying network seek latency. - Per-consumer RNG derived from (seed, epoch, rank, worker): consumers decorrelated, runs reproducible, epochs reshuffle automatically. - Deterministic fast-forward resume: load_state_dict takes the trainer's {batches_consumed, batch_size}; each worker re-derives its own skip from the DataLoader's round-robin batch assignment and replays tabular-only (no decode). Exact within an epoch, works with num_workers > 0, and the same state file serves every rank. Replaces the per-shard HF state_dict approach, which lived in worker processes and could not be captured from the trainer. - Shard-cap default removed (max_num_shards=None uses every parquet shard); runtime warnings for non-divisible world sizes (datasets degrades to read-everything splitting) and workers left without shards. - episode_pool_size replaces buffer_size (deprecated, ignored with a warning); decoder cache sized to the pool working set, capped at 128. Legacy order-replication tests asserted the old buffer algorithm step-by-step and are rewritten as behavior contracts (exactly-once coverage, per-seed determinism, epoch reshuffle). Value-level parity tests against the map-style dataset pass unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:02:15 +02:00
pepijn	2ab71231cd	feat(streaming): defer video decode, episode-pool shuffle, and remote-IO retries - streaming_dataset: defer torchcodec decode until a sample leaves the shuffle buffer (buffer now holds ~KB tabular rows, not MB of pixels) and add an opt-in episode-pool shuffle (episode_pool_size) with exact in-episode delta lookups; expose decode/fetch timing_stats. - video_utils: retry transient hf:///fsspec/httpx transport errors during streaming decode (LEROBOT_REMOTE_IO_MAX_RETRIES). - dataset_tools: write multiple ~32MB row groups with a page index to bound per-shard streaming memory. - benchmarks/slurm: streaming benchmark + matrix submitter updates. Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-11 10:08:28 +00:00
Pepijn	2d1c17d971	docs(streaming): note AV1 is LeRobot's default codec (vcodec=libsvtav1) So the A100/H100 no-AV1-NVDEC limitation applies to most LeRobot v3 datasets, not just RoboCasa — GPU decode needs an Ada GPU, an hevc/h264-encoded dataset, or a re-encode. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 17:10:18 +02:00
Pepijn	7241f029c6	docs(streaming): A100/H100 NVDEC cannot decode AV1 — correct guidance NVIDIA's decode support matrix: the compute GPUs A100 (GA100) and H100 (GH100) have no AV1 NVDEC decoder; only Ada (L4/L40/RTX40) and some Ampere (A10/A40/A16) do. So on A100/H100 nodes, AV1 datasets must be decoded on CPU or re-encoded to H.265/H.264 — no torchcodec build enables cuda AV1 decode there. Also distinguish that error from "Unsupported device: cuda (variant: ffmpeg)", which is a torchcodec-built-without-CUDA issue. Update diagnose_decode.py message + benchmark README accordingly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 17:08:54 +02:00
Pepijn	23c58f5f9e	feat(streaming): decode diagnostic + fail benchmark on 0 frames - benchmark: raise SystemExit if 0 frames were measured, so a run that produces no batches (swallowed decode error, all batches dropped) fails loudly instead of being reported green with NaN/zero numbers (the misleading "COMPLETED" CUDA jobs). - add benchmarks/streaming/diagnose_decode.py: isolates the streaming decode path (resolve path -> fsspec.open -> torchcodec VideoDecoder -> get one frame) and prints package versions + the first bytes of the handle. Pinpoints decode failures: bad/placeholder bytes vs ffmpeg/torchcodec build issue. RoboCasa videos are AV1; the failure message calls out AV1 decoder + NVDEC-on-Ada requirements explicitly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 16:40:24 +02:00
Pepijn	a32a2c647b	feat(streaming): full-matrix SLURM submitter + results summarizer slurm/run_streaming_matrix.sh fans the benchmark matrix (sources {hub,bucket,warmed_bucket} x modes {single,sarm} x decode {cpu,cuda}) out as isolated single-GPU SLURM jobs, so an OOM in one config is contained and reported per-job by SLURM. Worker count and shuffle buffer are bounded (lower for cuda, which holds a CUDA context + NVDEC session per worker) to avoid host/VRAM OOM. Source/mode/decode/workers/buffer/account/partition are env-overridable; SOURCES/MODES/DECODES select subsets. benchmarks/streaming/summarize_results.py collapses the per-run JSONs into one comparison table + summary.csv (frames/s/node, first-batch + p50/p95/p99 latency, cache hit-rate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 15:51:36 +02:00
Pepijn	343ecd7980	feat(streaming): optional GPU (NVDEC) video decode device Add `video_decode_device` to StreamingLeRobotDataset and a `device` arg to VideoDecoderCache, passed to torchcodec's VideoDecoder. "cuda" offloads H.264/H.265 decode to the GPU's dedicated NVDEC engine (independent of the training SMs); requires a CUDA-enabled torchcodec build. benchmark: `--video_decode_device` flag. With cuda + num_workers>0 it forces the `spawn` start method (CUDA cannot init in forked workers) and disables CPU pin_memory (frames are already on-GPU). Decode device is recorded in results and the output filename. README documents the NVDEC option and its concurrency/IPC caveats. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 15:47:11 +02:00
Pepijn	f7c8a526e8	feat(streaming): wallclock benchmark throughput, cross-worker cache stats, bucket source - benchmark: frames_per_s_node now measures sustained wall-clock throughput over the post-warmup window. The previous metric summed inter-batch gaps, which collapse to ~0 under async prefetch (consumer drains a pre-filled queue) and overstated throughput ~100x. - VideoDecoderCache gains an optional shared [hits, misses, evictions] counter tensor; StreamingLeRobotDataset.video_decoder_cache_stats() aggregates it across DataLoader workers (lock-free, approximate; hit_rate preserved). Fixes empty cache stats with workers. - StreamingLeRobotDataset.data_files_root: read bulk data/ + videos/ from an fsspec root (e.g. hf://buckets/<owner>/<name>) while metadata still loads from repo_id. Enables bucket / prewarmed-bucket benchmark sources without copying metadata. Exposed as benchmark --data_files_root. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 15:25:44 +02:00
Pepijn	68fa5d80b0	feat(streaming): multinode example, dataloading benchmark, distributed smoke test - examples/scaling/train_streaming_multinode.py: Accelerate-based distributed/resumable streaming training (no DistributedSampler; rank/world_size auto-resolved), checkpoints the dataset stream state, and supports a --dummy pure-dataloading path with throughput logging. SLURM launcher in slurm/train_streaming_robocasa.sh. - benchmarks/streaming/benchmark_streaming.py: dummy-consumer dataloading benchmark (single / sarm frame modes) emitting frames/s/node, p50/p95/p99 sample latency, first-batch latency, and VideoDecoderCache reuse stats as JSON + CSV. SLURM launcher + README documenting the source/node/mode matrix and manual bucket prewarming. - VideoDecoderCache: add hit/miss/eviction counters and a stats() method so the benchmark can surface decoder thrash (no new cache, no eviction-policy change). - tests/datasets/test_streaming_distributed.py: accelerate-launch smoke test asserting per-rank disjointness; skips (does not false-pass) when <2 processes spawn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-09 13:48:23 +02:00
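
The worker-exact resume arithmetic described in a7b7f4964e can be sketched roughly as below. This is an illustration under stated assumptions, not the code in streaming_dataset.py: the helper name `batches_to_skip` is hypothetical, num_workers is assumed to be at least 1, batches are assumed to be dealt round-robin starting at worker 0, and the workers that own shards are assumed to be the lowest-numbered ones.

```python
def batches_to_skip(batches_consumed: int, worker_id: int,
                    num_workers: int, num_shards: int) -> int:
    """Hypothetical helper: batches one worker must replay on resume."""
    # Workers without a shard yield nothing, so the round-robin only
    # cycles over the workers that actually own data.
    active = min(num_workers, num_shards)
    if worker_id >= active:
        return 0  # shard-less worker: nothing to skip
    full_rounds, remainder = divmod(batches_consumed, active)
    # Assumption: the round-robin starts at worker 0, so the first
    # `remainder` workers have each delivered one extra batch.
    return full_rounds + (1 if worker_id < remainder else 0)


# 5 batches consumed, 4 workers, but only 2 shards to share out.
skips = [batches_to_skip(5, w, num_workers=4, num_shards=2) for w in range(4)]
print(skips)  # [3, 2, 0, 0]
```

The per-worker skips sum to batches_consumed, which is the property the commit relies on for exact continuation; dividing by num_workers instead of the effective count would under-skip on the two active workers.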

11 Commits
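
The per-consumer RNG mentioned in 1050c2fb6c, derived from (seed, epoch, rank, worker), might look something like the following sketch. The commit message does not say how the four values are combined, so the hashing scheme here is an assumption chosen for illustration, and `consumer_rng` and `draw` are hypothetical names.

```python
import hashlib
import random


def consumer_rng(seed: int, epoch: int, rank: int, worker: int) -> random.Random:
    """Hypothetical helper: one independent RNG per (rank, worker) consumer."""
    key = f"{seed}:{epoch}:{rank}:{worker}".encode()
    # Assumption: hash the tuple so that nearby inputs map to unrelated seeds.
    derived = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(derived)


def draw(seed: int, epoch: int, rank: int, worker: int, n: int = 8) -> list:
    """Draw n frame indices in [0, 1000) from the consumer's RNG."""
    rng = consumer_rng(seed, epoch, rank, worker)
    return [rng.randrange(1000) for _ in range(n)]


same_run = draw(0, 0, 0, 0) == draw(0, 0, 0, 0)        # reproducible
other_worker = draw(0, 0, 0, 0) != draw(0, 0, 0, 1)    # decorrelated consumers
next_epoch = draw(0, 0, 0, 0) != draw(0, 1, 0, 0)      # reshuffle per epoch
print(same_run, other_worker, next_epoch)
```

Because the epoch is part of the key, no explicit reshuffle call is needed: advancing the epoch counter alone changes every consumer's stream, while a fixed seed keeps whole runs repeatable.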