lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-18 08:47:05 +00:00

Pepijn a164bb97bd feat(streaming): native datasets-5 episode batching and worker-split suppression

Allow datasets 5.x (pin >=4.7,<6; lockfile moves to 5.0.0) and use its
Arrow-native batch(by_column="episode_index") (huggingface/datasets#8194
sibling, #8172) for episode admission when available - one Arrow
accumulation per episode instead of one Python dict per row - with the
existing row loop as the 4.x fallback. A parity test asserts both paths
group identically.
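
For illustration only, the row-loop fallback described above can be sketched as follows. This is not the code from this commit: the function name `group_rows_by_episode` and the plain-dict row format are assumptions, and the sketch assumes rows of one episode arrive contiguously in the stream.

```python
from typing import Any, Dict, Iterable, Iterator, List


def group_rows_by_episode(
    rows: Iterable[Dict[str, Any]], key: str = "episode_index"
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of consecutive rows that share the same `key` value.

    Hypothetical helper: one Python dict per row, accumulated per episode.
    """
    current: List[Dict[str, Any]] = []
    current_key: Any = None
    for row in rows:
        # A change in the key closes the episode accumulated so far.
        if current and row[key] != current_key:
            yield current
            current = []
        current_key = row[key]
        current.append(row)
    # Flush the last episode once the stream is exhausted.
    if current:
        yield current


example_rows = [
    {"episode_index": 0, "frame_index": 0},
    {"episode_index": 0, "frame_index": 1},
    {"episode_index": 1, "frame_index": 0},
]
example_groups = list(group_rows_by_episode(example_rows))
```

A parity check in this spirit would compare the groups produced by such a loop with those produced by the native path, episode by episode.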

Also fixes a latent worker bug this surfaced: `datasets` detects torch
DataLoader workers and re-splits its shards internally (_iter_pytorch),
on top of our explicit per-worker shard assignment. That second split
silently drops data whenever a per-worker stream has fewer internal
shards than there are workers (masked so far by single-file test
fixtures), and on datasets 5.0 it crashes by_column batching outright.
The worker context is now hidden from `datasets` while draining streams
we already partitioned (process-local patch, restored on exit).
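
A minimal sketch of the "patch, then restore on exit" pattern mentioned above, under stated assumptions: the context manager name `hidden_worker_info` is hypothetical, and a `SimpleNamespace` stands in for the module that exposes the worker lookup so the example runs without torch. The actual patch in this commit may target a different attribute or use a different mechanism.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def hidden_worker_info(module, attr="get_worker_info"):
    """Temporarily make `module.<attr>()` return None, then restore it.

    Hypothetical helper: the change is local to the current process and is
    undone even if the body of the `with` block raises.
    """
    original = getattr(module, attr)
    setattr(module, attr, lambda: None)
    try:
        yield
    finally:
        setattr(module, attr, original)


# Stand-in for a module exposing a worker lookup (assumption, not torch itself).
fake_data_module = SimpleNamespace(
    get_worker_info=lambda: {"id": 1, "num_workers": 4}
)

with hidden_worker_info(fake_data_module):
    inside = fake_data_module.get_worker_info()  # worker context hidden here
after = fake_data_module.get_worker_info()  # original lookup restored
```

Code that drains an already partitioned stream inside the `with` block would see no worker context, so a library consulting that lookup would not split the shards a second time.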

The multi-shard shuffle buffer (huggingface/datasets#8194) is
intentionally NOT used: frame-level shuffling upstream of episode
grouping would fragment episodes and break delta windows. Its threaded
multi-source prefetch idea remains a follow-up for episode admission if
fetch timings warrant it.

Verified on both datasets 4.8.5 (fallback) and 5.0.0 (native): 27/27
streaming tests each; full datasets suite 469 passed under 5.0.0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

2026-06-11 16:10:53 +02:00

test_aggregate.py

feat(encoding parameters): adding support for user provided video encoding parameters (#3455)

2026-05-14 23:46:42 +02:00

test_compute_stats.py

feat(dependencies): minimal default tag install (#3362)

2026-04-12 20:03:04 +02:00

test_dataset_metadata.py

Add extensive language support (#3467)

2026-05-19 14:46:11 +02:00

test_dataset_reader.py

feat(encoding parameters): adding support for user provided video encoding parameters (#3455)