lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-16 15:57:03 +00:00

Pepijn 99840ebef3 fix(datasets): enforce one parquet row group per episode in v3 data writes

LeRobot v3 data shards must hold exactly one row group per episode so a
reader can fetch episode i with pq.ParquetFile(path).read_row_group(i)
(a byte-range read) instead of loading the whole shard. The recording
writer already does this (one write_table per episode); the aggregate
and lerobot-annotate re-write paths instead concatenated many episodes
and wrote them in one shot, collapsing the file to a single row group.

- io_utils: add write_table_one_row_group_per_episode (one ParquetWriter,
  one write_table per episode — same pattern as the recording writer);
  to_parquet_with_hf_images embeds images then writes per-episode row
  groups; to_parquet_one_row_group_per_episode wraps it for plain frames
- aggregate: route non-image data writes through the per-episode writer;
  leave the episodes-metadata parquet untouched (already one row/episode)
- annotate: rewrite shards via the per-episode writer instead of a single
  bulk pq.write_table
- tests: invariant coverage through the aggregate (image + video) and
  annotate paths

No change to on-disk schema, paths, naming, rollover thresholds, or
compression. Readers stay backward-compatible (old collapsed files load).

2026-06-15 14:53:12 +02:00

__init__.py

feat: language annotation pipeline (#3471)

2026-06-12 15:12:33 +02:00

_helpers.py

feat: language annotation pipeline (#3471)

2026-06-12 15:12:33 +02:00

conftest.py

feat: language annotation pipeline (#3471)

2026-06-12 15:12:33 +02:00

run_e2e_smoke.py

feat: language annotation pipeline (#3471)