mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-18 00:37:10 +00:00
a164bb97bd
Allow datasets 5.x (pin >=4.7,<6; lockfile moves to 5.0.0) and use its Arrow-native batch(by_column="episode_index") (huggingface/datasets#8194 sibling, #8172) for episode admission when available - one Arrow accumulation per episode instead of one Python dict per row - with the existing row loop as the 4.x fallback. A parity test asserts both paths group identically.

Also fixes a latent worker bug this surfaced: `datasets` detects torch DataLoader workers and re-splits its shards internally (_iter_pytorch), on top of our explicit per-worker shard assignment. That second split silently drops data whenever a per-worker stream has fewer internal shards than there are workers (masked so far by single-file test fixtures), and on datasets 5.0 it crashes by_column batching outright. The worker context is now hidden from `datasets` while draining streams we already partitioned (process-local patch, restored on exit).

The multi-shard shuffle buffer (huggingface/datasets#8194) is intentionally NOT used: frame-level shuffling upstream of episode grouping would fragment episodes and break delta windows. Its threaded multi-source prefetch idea remains a follow-up for episode admission if fetch timings warrant it.

Verified on both datasets 4.8.5 (fallback) and 5.0.0 (native): 27/27 streaming tests each; full datasets suite 469 passed under 5.0.0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
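
Illustration of the double-split worker bug described above, as a simplified model rather than the actual `datasets` or lerobot code. It assumes both splits are round-robin (worker k takes every W-th shard starting at k); the function names and shard labels are hypothetical.

```python
# Simplified model (assumption: both splits are round-robin by worker id).
# None of these names come from `datasets` or lerobot.

def assign(shards, worker_id, num_workers):
    """Worker k takes shards k, k+W, k+2W, ... of the list it is given."""
    return shards[worker_id::num_workers]


def read_with_single_split(shards, num_workers):
    """Intended behaviour: each worker drains exactly its own partition."""
    seen = []
    for worker_id in range(num_workers):
        seen.extend(assign(shards, worker_id, num_workers))
    return sorted(seen)


def read_with_double_split(shards, num_workers):
    """Buggy behaviour: the already-partitioned stream is split a second time."""
    seen = []
    for worker_id in range(num_workers):
        ours = assign(shards, worker_id, num_workers)   # explicit per-worker partition
        again = assign(ours, worker_id, num_workers)    # unintended internal re-split
        seen.extend(again)
    return sorted(seen)


shards = ["s0", "s1", "s2", "s3", "s4", "s5"]
single = read_with_single_split(shards, num_workers=4)
double = read_with_double_split(shards, num_workers=4)
dropped = sorted(set(shards) - set(double))
print(single)   # all six shards are read
print(double)   # only a subset survives the second split
print(dropped)  # shards that were silently skipped
```

With one worker the second split is a no-op, which is consistent with the bug staying hidden in small fixtures; hiding the worker context from the inner split corresponds to `read_with_single_split` in this model.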