lerobot

admin/lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-18 16:57:12 +00:00

Commit Graph

Author	SHA1	Message	Date
Pepijn

d1fc8e298c

feat(streaming): distributed + resumable HF-native StreamingLeRobotDataset

Add the large-scale streaming pieces that were missing from the frame-streaming
internals, keeping the existing Backtrackable + output-reservoir frame-shuffle:

- split_dataset_by_node(rank, world_size) before the per-shard loop so each rank
  streams a disjoint set of shards (fixes duplicate data across GPUs). rank and
  world_size auto-resolve from Accelerate state / RANK,WORLD_SIZE env / (0, 1).
- get_worker_info() shard splitting so DataLoader workers within a rank don't
  yield duplicate frames.
- Dynamic Backtrackable window (dynamic_bounds=True) sized to the requested
  delta_timestamps, removing the fixed 100-frame ceiling so long horizons (e.g. a
  SARM window ~160 frames) reach real frames instead of silently padding. Fix the
  peek_back off-by-one: history = lookback + 1.
- video_decoder_cache_size knob; default (active_shards + 1) x num_cameras so the
  live decoder working set does not thrash the VideoDecoderCache LRU.
- state_dict()/load_state_dict() for resume (per-shard HF stream state + exhausted
  set + RNG). Reservoir is re-warmed, so resumption is not bit-exact (documented).
- factory.py wires buffer_size from a new DatasetConfig.streaming_buffer_size field
  instead of repurposing max_num_shards as the worker count.
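The first two bullets describe a two-level split: shards are divided across ranks, then each rank's share is divided across its DataLoader workers. The sketch below illustrates only that arithmetic in plain Python. It is not the code from this commit, and it does not call `split_dataset_by_node` or `get_worker_info`. The helper names `resolve_rank_world_size` and `shards_for` are hypothetical, the round-robin policy is an assumption, and the Accelerate-state lookup mentioned above is omitted.

```python
import os


def resolve_rank_world_size(env=None):
    """Read RANK / WORLD_SIZE from the environment, falling back to (0, 1).

    Hypothetical helper: the commit also consults Accelerate state first,
    which this sketch leaves out.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, world_size


def shards_for(shards, rank, world_size, worker_id=0, num_workers=1):
    """Two-level round-robin split (an assumed policy, for illustration).

    Level 1: each rank takes every world_size-th shard.
    Level 2: each worker takes every num_workers-th shard of its rank's share.
    """
    per_rank = shards[rank::world_size]
    return per_rank[worker_id::num_workers]


# Example: 10 shards, 2 ranks, 2 workers per rank.
shards = [f"shard-{i:03d}" for i in range(10)]
assignments = {
    (rank, worker): shards_for(shards, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
}

for key, assigned in sorted(assignments.items()):
    print(key, assigned)
```

Every shard lands in exactly one (rank, worker) slot, which is the disjointness property the bullets aim for. The real dataset may balance shards differently, for example when the shard count is not divisible by the world size.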

Tests: tests/datasets/test_streaming_native.py covers distributed disjointness,
worker de-duplication, the SARM-length window, resume, schema parity vs map-style,
local video path resolution, and shuffle decorrelation. 21 passed (13 existing + 8).
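To make the dynamic-window point concrete, here is a minimal sketch of how a backtracking window could be sized from `delta_timestamps`. The function `window_bounds` is hypothetical and is not the implementation in this commit. It assumes `delta_timestamps` maps feature keys to lists of offsets in seconds, and it reads `history = lookback + 1` as "the frames behind the current one, plus the current frame itself".

```python
def window_bounds(delta_timestamps, fps):
    """Return (history, lookahead) in frames for the requested offsets.

    Hypothetical helper. Offsets are converted from seconds to whole frames;
    negative offsets look back, positive offsets look ahead.
    """
    offsets = [
        round(t * fps)
        for timestamps in delta_timestamps.values()
        for t in timestamps
    ]
    lookback = max(0, -min(offsets, default=0))
    lookahead = max(0, max(offsets, default=0))
    # Assumed reading of the off-by-one fix: keep the current frame as well
    # as the `lookback` frames behind it.
    history = lookback + 1
    return history, lookahead


# Example: a 5-second lookback and a 1-second lookahead at 30 fps.
example = {
    "observation.state": [-5.0, -2.5, 0.0],
    "action": [0.0, 0.5, 1.0],
}
history, lookahead = window_bounds(example, fps=30)
print(history, lookahead)
```

With these example offsets the lookback is 150 frames, which would already exceed a fixed 100-frame ceiling. Sizing the window from the request means those offsets can reach real frames rather than padding.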

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-09 13:37:30 +02:00

1 Commits