* fix(datasets): enforce one parquet row group per episode in v3 data writes
LeRobot v3 data shards must hold exactly one row group per episode so a
reader can fetch episode i with pq.ParquetFile(path).read_row_group(i)
(a byte-range read) instead of loading the whole shard. The recording
writer already does this (one write_table per episode); the aggregate
and lerobot-annotate re-write paths instead concatenated many episodes
and wrote them in one shot, collapsing the file to a single row group.
- io_utils: add write_table_one_row_group_per_episode (one ParquetWriter,
one write_table per episode — same pattern as the recording writer);
to_parquet_with_hf_images embeds images then writes per-episode row
groups; to_parquet_one_row_group_per_episode wraps it for plain frames
- aggregate: route non-image data writes through the per-episode writer;
leave the episodes-metadata parquet untouched (already one row/episode)
- annotate: rewrite shards via the per-episode writer instead of a single
bulk pq.write_table
- tests: invariant coverage through the aggregate (image + video) and
annotate paths
No change to on-disk schema, paths, naming, rollover thresholds, or
compression. Readers stay backward-compatible (old collapsed files load).
* Update src/lerobot/datasets/io_utils.py
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update src/lerobot/datasets/io_utils.py
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* fix(datasets): correct indentation and add strict= in row-group helper
The web-edited numpy version of write_table_one_row_group_per_episode had an
over-indented line (IndentationError, breaking pre-commit + test collection)
and a zip() without strict=. Fix both; behaviour unchanged.
---------
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
* fix(images/videos): stop aggregate_pipeline_dataset_features from dropping image features when videos are not used
* fix(docstrings): improve docstrings
Signed-off-by: Caroline Pascal <caroline8.pascal@gmail.com>
---------
Signed-off-by: Caroline Pascal <caroline8.pascal@gmail.com>
* chore(robots): homogenize bimanual (bi) setups
* feat(robots): split openarm mini into single and bi
* refactor(robots): mixin for bi classes
* docs: update docs