* feat(edit-dataset): add `concatenate_videos` opt-out to merge
When merging datasets, source mp4s are concatenated into shards capped at
`video_files_size_in_mb` (default 200 MB). This is great for dataloader
throughput but destroys per-episode (or per-source) video boundaries,
which is undesirable when you want to inspect, ship, or reuse the
individual mp4s.
Add a `concatenate_videos: bool = True` knob plumbed through
`MergeConfig` → `merge_datasets` → `aggregate_datasets` → `aggregate_videos`.
When False, each source mp4 is copied 1:1 to its own destination mp4 with
no re-muxing, so the merge preserves source video boundaries.
Usage:
lerobot-edit-dataset \
--new_repo_id user/merged \
--operation.type=merge \
--operation.repo_ids "['user/a', 'user/b']" \
--operation.concatenate_videos=false
Defaults are unchanged; the dataloader path is unaffected because the
`episodes.parquet` `from_timestamp`/`to_timestamp` index keeps working
regardless of whether each mp4 holds one or many episodes.
* feat(edit-dataset): extend concatenate opt-out to data files
Following review, add a concatenate_data flag mirroring concatenate_videos,
threaded through MergeConfig, merge_datasets, aggregate_datasets, aggregate_data
and append_or_create_parquet_file. Metadata index files still always concatenate.
Also trim the verbose docstrings and comments since the names are
self-explanatory, and extend the existing merge test to cover data files.
Steerable annotation pipeline (lerobot-annotate) that populates the language_persistent and language_events columns introduced in PR 1 (#3467) directly into data/chunk-*/file-*.parquet.
This is PR 2 of the three-PR plan:
PR 1 (Add extensive language support #3467): schema + DSL + rendering, base of this PR
PR 2 (this PR): annotation pipeline writing into PR 1's columns
PR 3: model with language prediction and runtime
A VLM (Qwen-VL family, served on vLLM) watches each episode's video and emits grounded language annotations: subtasks, plans, memory, task rephrasings, interjections + speech, and per-camera VQA. The pipeline is built for production annotation at scale — single-camera grounding, embedded-frame inputs, a describe-then-segment grounding flow, and a deterministic full-episode coverage guarantee — informed by Scale's dense-captioning findings (representation > sampling, rules > reasoning, model capacity is the biggest lever, two-pass systems compound errors)
* fix(root): adding proper support for the root and new_root arguments
* feat(roots): adding a roots agrument for the merge operation
* chore(clean): cleaning up code
* chore(doctrings): updating doctrings with new features
* fix(repo_id): setting repo_id to None when not needed
* fix(roots/repo_ids): making mypy happy by using repo_ids and roots for merge operation
* fix(path): fixing path related issues
* fix(repo_id): fixing issues related to repo_id
* chore(doctrings): updating docstrings + fix typo
* chore(clean): cleaning code
* fix(split new_repo_id): reverting new_repo_id addition for split operation
* docs(dosctrings): completing docstrings
* fix(repo_ids/roots): improving checks for repo_ids/roots lengths
* fix(repo_ids): making repo_ids optional in MergeConfig but raise if not given
* fix(docstrings): fixing docstrings for split operation
* fix(hints): updating get_output_path hints to accept paths as strings too
* fix(y/N prompts): removing y/N prompts in lerobot_edit_dataset
* fix(merge repo_id): fixing merge operation to use new_repo_id instead of repo_id
* fix(typo): fixing typo in doctrings
* Add New featrue to lerobot_edit_datset.py that show dataset information.
* Fix to draccus error when happen give only --operation.type=info
* Updating test and documents regarding lerobot-edit-dataset info function.
* Updating documents regarding lerobot-edit-dataset extract function. option name in document is mistake.
* feat(datasets): Update to align formatting with pre-commit.(#2917)
Update to align formatting by pre-commit.
---------
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
- Add `tests/scripts/save_policy_to_safetensor.py` to generate test artifacts
- Add `test_backward_compatibility to test generated outputs from the policies against artifacts