lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-06-16 15:57:03 +00:00

Mahbod 30790de178 feat(edit-dataset): add concatenate_videos opt-out to merge (#3663)

* feat(edit-dataset): add `concatenate_videos` opt-out to merge

When merging datasets, source mp4s are concatenated into shards capped at
`video_files_size_in_mb` (default 200 MB). This is great for dataloader
throughput but destroys per-episode (or per-source) video boundaries,
which is undesirable when you want to inspect, ship, or reuse the
individual mp4s.

Add a `concatenate_videos: bool = True` knob plumbed through
`MergeConfig` → `merge_datasets` → `aggregate_datasets` → `aggregate_videos`.
When False, each source mp4 is copied 1:1 to its own destination mp4 with
no re-muxing, so the merge preserves source video boundaries.
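
The difference between the two modes can be illustrated with a minimal sketch. This is not the actual `aggregate_videos` code: `SourceVideo` and `plan_video_destinations` are hypothetical names, and the greedy packing rule is an assumption about how a size cap like `video_files_size_in_mb` might be applied.

```python
from dataclasses import dataclass


@dataclass
class SourceVideo:
    # Hypothetical record for one source mp4; not a lerobot type.
    path: str
    size_mb: float


def plan_video_destinations(sources, concatenate_videos=True, video_files_size_in_mb=200.0):
    """Return one list of source paths per destination mp4.

    Assumption: when concatenating, sources are packed greedily in order and
    a new shard starts once adding the next source would exceed the cap.
    """
    if not concatenate_videos:
        # Opt-out: every source mp4 maps 1:1 to its own destination mp4.
        return [[s.path] for s in sources]

    shards, current, current_size = [], [], 0.0
    for s in sources:
        if current and current_size + s.size_mb > video_files_size_in_mb:
            shards.append(current)
            current, current_size = [], 0.0
        current.append(s.path)
        current_size += s.size_mb
    if current:
        shards.append(current)
    return shards


sources = [
    SourceVideo("a/ep0.mp4", 120),
    SourceVideo("a/ep1.mp4", 90),
    SourceVideo("b/ep0.mp4", 60),
]
merged = plan_video_destinations(sources)
separate = plan_video_destinations(sources, concatenate_videos=False)
```

With the default, the second and third sources share a shard; with `concatenate_videos=False`, each source keeps its own file.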

Usage:

    lerobot-edit-dataset \
        --new_repo_id user/merged \
        --operation.type=merge \
        --operation.repo_ids "['user/a', 'user/b']" \
        --operation.concatenate_videos=false

Defaults are unchanged; the dataloader path is unaffected because the
`episodes.parquet` `from_timestamp`/`to_timestamp` index keeps working
regardless of whether each mp4 holds one or many episodes.
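
To see why the index is layout-agnostic, consider a rough sketch of a frame lookup. Only `from_timestamp` and `to_timestamp` come from the description above; the `video_file` key, the row dictionaries, and `locate_frame` are hypothetical and do not reflect the real `episodes.parquet` schema or the dataloader code.

```python
def locate_frame(episode_row, t_in_episode):
    """Map an offset within an episode to (video file, timestamp in that file)."""
    start = episode_row["from_timestamp"]
    end = episode_row["to_timestamp"]
    t = start + t_in_episode
    if not (start <= t < end):
        raise ValueError("offset falls outside the episode")
    return episode_row["video_file"], t


# Concatenated layout: both episodes live in one mp4, back to back.
concatenated = [
    {"video_file": "file-000.mp4", "from_timestamp": 0.0, "to_timestamp": 10.0},
    {"video_file": "file-000.mp4", "from_timestamp": 10.0, "to_timestamp": 25.0},
]

# Opt-out layout: each episode has its own mp4 starting at 0.
separate = [
    {"video_file": "file-000.mp4", "from_timestamp": 0.0, "to_timestamp": 10.0},
    {"video_file": "file-001.mp4", "from_timestamp": 0.0, "to_timestamp": 15.0},
]

in_shared_file = locate_frame(concatenated[1], 2.5)
in_own_file = locate_frame(separate[1], 2.5)
```

The lookup logic is identical in both cases; only the values stored in the index differ.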

* feat(edit-dataset): extend concatenate opt-out to data files

Following review, add a `concatenate_data` flag mirroring `concatenate_videos`,
threaded through `MergeConfig`, `merge_datasets`, `aggregate_datasets`, `aggregate_data`,
and `append_or_create_parquet_file`. Metadata index files are still always concatenated.

Also trim the verbose docstrings and comments since the names are
self-explanatory, and extend the existing merge test to cover data files.
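
As an illustration of how the two flags and the always-concatenated metadata might interact, here is a minimal sketch. `MergeConfigSketch` and `aggregate_sketch` are hypothetical stand-ins, not the real `MergeConfig` or `aggregate_datasets`; the sketch ignores size caps and only models which file kinds are grouped.

```python
from dataclasses import dataclass


@dataclass
class MergeConfigSketch:
    # Hypothetical stand-in; only the two flag names come from the commit message.
    repo_ids: list
    concatenate_videos: bool = True
    concatenate_data: bool = True


def aggregate_sketch(files_by_kind, cfg):
    """Group source files per kind: one group when concatenating, else one group per file."""
    concatenate = {
        "videos": cfg.concatenate_videos,
        "data": cfg.concatenate_data,
        "meta": True,  # metadata index files always concatenate
    }
    return {
        kind: [list(files)] if concatenate[kind] else [[f] for f in files]
        for kind, files in files_by_kind.items()
    }


cfg = MergeConfigSketch(repo_ids=["user/a", "user/b"], concatenate_data=False)
layout = aggregate_sketch(
    {
        "videos": ["a.mp4", "b.mp4"],
        "data": ["a.parquet", "b.parquet"],
        "meta": ["a_episodes.parquet", "b_episodes.parquet"],
    },
    cfg,
)
```

Here the videos and metadata end up in single groups while the data files stay separate, since only `concatenate_data` was switched off.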

2026-06-12 20:05:04 +02:00

test_edit_dataset_parsing.py

feat(edit-dataset): add concatenate_videos opt-out to merge (#3663)

2026-06-12 20:05:04 +02:00

test_lerobot_annotate.py

feat: language annotation pipeline (#3471)

2026-06-12 15:12:33 +02:00