* feat(edit-dataset): add `concatenate_videos` opt-out to merge
When merging datasets, source mp4s are concatenated into shards capped at
`video_files_size_in_mb` (default 200 MB). This is great for dataloader
throughput but destroys per-episode (or per-source) video boundaries,
which is undesirable when you want to inspect, ship, or reuse the
individual mp4s.
Add a `concatenate_videos: bool = True` knob plumbed through
`MergeConfig` → `merge_datasets` → `aggregate_datasets` → `aggregate_videos`.
When False, each source mp4 is copied 1:1 to its own destination mp4 with
no re-muxing, so the merge preserves source video boundaries.
Usage:
lerobot-edit-dataset \
--new_repo_id user/merged \
--operation.type=merge \
--operation.repo_ids "['user/a', 'user/b']" \
--operation.concatenate_videos=false
Defaults are unchanged; the dataloader path is unaffected because the
`episodes.parquet` `from_timestamp`/`to_timestamp` index keeps working
regardless of whether each mp4 holds one or many episodes.
* feat(edit-dataset): extend concatenate opt-out to data files
Following review, add a concatenate_data flag mirroring concatenate_videos,
threaded through MergeConfig, merge_datasets, aggregate_datasets, aggregate_data
and append_or_create_parquet_file. Metadata index files still always concatenate.
Also trim the verbose docstrings and comments since the names are
self-explanatory, and extend the existing merge test to cover data files.
* fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync
In distributed training, accelerate can only synchronize the shuffle
permutation across ranks when the sampler exposes a generator attribute.
EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch
shards relied on every rank's global CPU RNG staying in lockstep forever;
any rank-asymmetric RNG consumption (e.g. eval rollouts on the main
process only) silently desynced the permutations and ranks trained on
overlapping/missing samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(train): seed sampler generator and gate dataset download per node
- Pass a generator seeded with cfg.seed to EpisodeAwareSampler so
accelerator.prepare registers it as the synchronized RNG and the
shuffle order is reproducible.
- Gate the initial make_dataset call on is_local_main_process instead of
is_main_process: the global main process only exists on node 0, so on
every other node all local ranks were downloading the dataset and
building the Arrow cache concurrently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(datasets): add DeterministicEpisodeAwareSampler with O(1) memory and sample-exact resume
Add a sampler that never materializes frame indices: it stores only
per-episode boundaries (numpy, a few bytes per episode) and maps logical
positions to frame indices on the fly with searchsorted. Shuffling uses a
seeded Feistel permutation over [0, num_frames) (cycle-walking to the
exact domain), so the data order is a pure function of (seed, epoch):
- no RNG state to synchronize across distributed ranks,
- constant memory and zero epoch-boundary cost at any dataset size,
- O(1) seek to any position, enabling sample-exact resume.
Opt in with --deterministic_sampler=true. On resume, lerobot-train maps
the checkpointed step back to (epoch, start_index) via
compute_sampler_state and continues at the exact sample where the run
left off (up to accelerate's even_batches padding at epoch boundaries).
The shuffle is pseudo-random rather than a true uniform permutation, the
standard trade-off in large-scale training loaders.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* refactor(datasets): fold deterministic mode into EpisodeAwareSampler
Instead of a parallel DeterministicEpisodeAwareSampler class, extend the
existing EpisodeAwareSampler with a deterministic=True mode (seeded
Feistel permutation, epoch auto-advance, state_dict/load_state_dict).
The default mode is behavior-identical: same torch.randperm consumption
and the same generator contract accelerate synchronizes; the O(N) Python
index list is replaced by O(num_episodes) boundary arrays in both modes,
with `indices` kept as a back-compat property. Passing a generator
together with deterministic=True is rejected, and the state/seek methods
raise outside deterministic mode.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(train): enable deterministic_sampler by default
Deterministic data order (sample-exact resume, no cross-rank RNG sync,
O(1) sampler memory) is now the default for map-style training; set
deterministic_sampler=false to restore the legacy RNG-based shuffle.
Streaming datasets ignore the flag (the sampler path only applies to
map-style datasets), replacing the previous hard validation error so
streaming configs keep working with the new default.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(datasets): default EpisodeAwareSampler to deterministic mode and trim comments
deterministic=True is now the class default as well as the training
default; the legacy RNG path requires an explicit deterministic=False
(the train script's non-deterministic branch passes it). Docstrings and
inline comments slimmed down across the changed files.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(sampler): drain resumed trillion-frame sampler via iter() to avoid list() prealloc
list(sampler) calls PyObject_LengthHint -> __len__ (the full 10**12 epoch length) and
preallocates that many slots before iterating, OOMing even though the resumed epoch only
yields 3 frames. Collect through the iterator (no length hint) so the test exercises the
real O(1) seek/drain instead of CPython's list growth heuristic.
* fix(datasets): guard Feistel cycle-walking loop against non-convergence
Replace the unbounded while True in EpisodeAwareSampler._permute with a
bounded for loop capped at _MAX_CYCLE_WALK_STEPS (100) and raise
RuntimeError if the cycle-walk fails to land in [0, num_frames). The
loop is expected to converge in <4 steps on the chosen power-of-two
domain, so the bound is a safety net that should never trip in practice
but prevents a pathological infinite loop.
https://claude.ai/code/session_01HQ15tFrBsHYScjGWosEv22
* fix(datasets): make deterministic-sampler resume robust to world-size changes
compute_sampler_state mapped a checkpointed step back to (epoch, start_index)
using the *current* num_processes, but the number of sampler positions a step
consumes scales with the world size that produced it. Resuming on a different
GPU count therefore landed on the wrong epoch/offset, silently re-seeing or
skipping data.
Record num_processes in training_step.json at checkpoint time and feed the
checkpoint's value into compute_sampler_state on resume, so the data order
resumes at the right position regardless of the new world size. Warn when the
world size changed (the global offset is correct, but per-rank sample-exactness
needs the same topology). Old checkpoints without the field fall back to the
current world size.
Also document compute_sampler_state's assumptions explicitly: num_processes /
batch_size must match the checkpointing run, and accelerate's even_batches=True
padding is mirrored by the ceil(... / num_processes) term.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* style: apply ruff-format to lerobot_train.py
Collapse the compute_sampler_state(...) call onto one line so the
ruff-format pre-commit hook passes (fixes the failing CI check).
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(datasets): use seeded torch.randperm instead of Feistel in EpisodeAwareSampler
Drop the Feistel permutation (and its SplitMix64 hash / cycle-walking) in favor of a
torch.randperm seeded from (seed, epoch). The deterministic mode keeps its key properties
- data order is a pure function of (seed, epoch), so it reproduces on every rank with no
global-RNG synchronization, and
- state_dict / load_state_dict still resume sample-exactly, now by regenerating the epoch's
permutation and slicing from the saved offset.
Construction stays O(num_episodes) (only episode boundaries are stored, never a per-frame
index list). The trade-off vs Feistel: the per-epoch shuffle is again O(num_frames) memory
(the randperm tensor) and no longer O(1)-seekable, in exchange for ~30 fewer LOC and a truly
uniform shuffle. Tests updated: the trillion-frame O(1) test is replaced with a
boundary-storage check and a scale resume-exactness test.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(datasets): make EpisodeAwareSampler always deterministic
With Feistel gone, deterministic and legacy modes were both just torch.randperm and the
deterministic path strictly dominated (reproducible across ranks via the (seed, epoch) seed,
no accelerate generator sync, resumable). Collapse to a single path and drop the redundant
flag:
- remove the `deterministic` and `generator` constructor args, `_iter_default`, and
`_require_deterministic`; `set_epoch` / `state_dict` / `load_state_dict` are now unconditional
- remove the `deterministic_sampler` train config field and the legacy generator branch in
lerobot_train.py (non-streaming map datasets always use the sampler)
- drop the now-obsolete generator/legacy tests
Note: removes the `generator` kwarg from EpisodeAwareSampler (back-compat break vs main); the
order is now a pure function of (seed, epoch), so no cross-rank RNG sync is needed.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(datasets): address sampler review (batch_size resume guard + docs)
- Record batch_size in training_step.json alongside num_processes and feed
the checkpoint's value into compute_sampler_state on resume; warn when it
differs (per-rank sample-exactness needs the same batch size).
- Document the set_epoch vs __iter__ auto-advance coupling on EpisodeAwareSampler
(callers should rely on exactly one mechanism per run).
- Note the broadened (reproducibility-breaking) sampler guard and the no-generator
distributed sharding correctness in lerobot_train.py.
- Add load_training_batch_size + parallel tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(train): download dataset once on the global main process
Gate the training dataset download on the global is_main_process (download once to the
shared dataset root, barrier, then every other rank reads the already-populated copy)
instead of per-node is_local_main_process. LeRobotDataset skips its snapshot_download
when try_load() succeeds, so no rank re-downloads. Assumes the dataset root / HF cache is
on storage shared across nodes.
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(datasets): trim sampler comment and drop duplicate tests
Remove the verbose dataloader-guard comment and the two EpisodeAwareSampler tests
that duplicated existing validation/warning coverage (no coverage loss).
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync
In distributed training, accelerate can only synchronize the shuffle
permutation across ranks when the sampler exposes a generator attribute.
EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch
shards relied on every rank's global CPU RNG staying in lockstep forever;
any rank-asymmetric RNG consumption (e.g. eval rollouts on the main
process only) silently desynced the permutations and ranks trained on
overlapping/missing samples.
* fix(train): seed sampler generator and gate dataset download per node
- Pass a generator seeded with cfg.seed to EpisodeAwareSampler so
accelerator.prepare registers it as the synchronized RNG and the
shuffle order is reproducible.
- Gate the initial make_dataset call on is_local_main_process instead of
is_main_process: the global main process only exists on node 0, so on
every other node all local ranks were downloading the dataset and
building the Arrow cache concurrently.
VideoDecoderCache used an unbounded dict keyed on absolute path, with no
eviction in the standard LeRobotDataset path. With shuffled iteration over
datasets that have many distinct mp4 files, every DataLoader worker
accumulated one cached (VideoDecoder, fsspec file handle) pair per distinct
path it had ever touched. Per-entry cost is ~3-5 MB of host RAM plus one
open FD; at ~8 k entries this is roughly 30 GB per worker.
This was hit in the wild during a SmolVLA training run on a 4,195-episode
SO-101 dataset (8,390 mp4s, two cameras per episode). dmesg showed
anon-rss climbing to 34.9 GB on a single pt_data_worker before the OOM
killer fired ~30 min into training; with --num_workers=8 the per-worker
peak halved to 17.9 GB, which is the expected inverse-scaling signature
when the leak is per-decode and the workload is split across workers. The
working workaround on the affected platform was --dataset.video_backend=pyav,
because the pyav path opens/closes per call and never touches this cache.
Switch the backing store to an OrderedDict and evict LRU entries when the
cap is reached, closing the evicted file handle inside the lock so we do
not leak FDs either. Default cap is DEFAULT_DECODER_CACHE_SIZE = 100,
overridable via LEROBOT_VIDEO_DECODER_CACHE_SIZE or by passing max_size=
to the constructor; max_size=None restores the legacy unbounded behaviour
for callers that need it.
Validation on the original failing workload (decode_video_frames_torchcodec
called over real mp4s from the affected SO-101 dataset):
unbounded: 300 files -> +1087 MB host RSS, cache=300, still climbing
cap=50: 500 files -> +266 MB host RSS, cache=50, stable
cap=50: 2000 calls -> +312 MB host RSS, cache=50, stable
cap=100: 1000 calls -> +470 MB host RSS, cache=100, stable
Three independent seeded runs at cap=50 agreed to within 1% (263 / 266 /
265 MB delta), and the 2000-call multi-pass run shows RSS plateaus after
the cap is reached instead of drifting.
Tests in tests/datasets/test_video_decoder_cache.py cover:
default-is-bounded, size cap, LRU ordering, FD close on eviction, FD close
on clear(), cache-hit invariance, max_size=None fallback, and env-var
override. No regressions in test_video_encoding.py, test_streaming.py, or
test_dataset_reader.py (73 prior tests still pass alongside the 8 new ones).
* feat(utility): adding video re-encode utility
* feat(edit): adding a new lerobot-edit-dataset tool to re-encode all the videos of a dataset
* chore(format): formatting code
* chore(review): fix Claude reviews
* test(reencode dataset): adding missing test for reencode dataset
* Add extensive language support
* Address review: split persistent/event schemas, drop event timestamps
- recipe.py: derive _VALID_ROLES/_VALID_STREAMS from MessageRole/MessageStream Literals
- dataset_metadata.py: keep CODEBASE_VERSION at v3.0
- language.py: remove RESERVED_STYLES; split arrow/feature schemas into
persistent (with timestamp) and event (without timestamp); add docstrings
- language_render.py: events use frame-row timestamp implicitly; no
per-event timestamp filtering or sorting
- converters.py: drop unused subtask_key passthrough
- add docstrings to new public APIs (recipe, render_messages_processor, collate)
- update tests for split schemas; revert uv.lock
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add docstrings to all new helpers; revert uv.lock
Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(language): add motion (persistent) and trace (event-only) styles
Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(language): per-camera tagging on view-dependent styles
Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.
- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
to both Arrow struct types and the HF datasets feature mappings;
introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
`is_view_dependent_style` and `validate_camera_field` helpers (camera
required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
`_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
Without a `camera` filter, multi-row matches keep raising the existing
ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
`ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
the matching image block), keeping the original 0.20 budget and
documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
`is_view_dependent_style`, `validate_camera_field` (both required and
forbidden directions), per-camera `emitted_at` filtering, and the
ambiguity error when two cameras emit `(vqa, assistant)` at the same
timestamp without a `camera=` filter. RenderMessagesStep + dataset
passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
the per-camera resolver pattern, and the canonical recipe convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(language): drop motion from VIEW_DEPENDENT_STYLES
Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.
The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(language): task_aug style + automatic ${task} rephrasing rotation
Adds task-prompt diversity (Xiao 2022 / CAST) without touching
``meta/tasks.parquet`` or forcing recipes to opt in. The plan reserved
``task_aug`` as a future style; this lands it now.
- ``language.py``: add ``task_aug`` to ``CORE_STYLES`` and
``PERSISTENT_STYLES``. ``column_for_style("task_aug")`` returns
``language_persistent`` so PR 2 writers route it correctly.
- ``language_render.py``: ``_resolve_task`` now consults the persistent
slice for rows of ``style="task_aug", role="user"``. When any exist
it picks one deterministically by ``sample_idx`` (blake2b-keyed, not
Python's randomized hash) so an epoch sees every rephrasing of every
episode while the same sample still resolves identically across
reruns. Falls back to the canonical ``meta/tasks.parquet`` task when
no rephrasings are present, so existing datasets and unannotated runs
keep their behaviour. Explicit ``task=`` overrides still win.
- Tests: rephrasing coverage across samples, determinism on repeat
``sample_idx``, fallback when persistent has no ``task_aug`` rows,
and explicit override priority.
Recipes get this for free: any ``${task}`` placeholder rotates through
the available rephrasings. Recipes that want the literal canonical task
can override the binding.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(language): tool catalog in meta/info.json + LeRobotDatasetMetadata.tools
Stores OpenAI-style function schemas at ``meta/info.json["tools"]`` so
datasets can declare which tools are available (today: just ``say``;
tomorrow: per-dataset extensions). The ``DEFAULT_TOOLS`` constant
fills in for unannotated datasets so chat-template consumers don't
have to special-case anything.
Three pieces:
- ``language.py``: ``SAY_TOOL_SCHEMA`` and ``DEFAULT_TOOLS``
constants. Single source of truth — PR 2's writer and PR 3's
runtime tool registry will both import from here instead of
duplicating the dict.
- ``dataset_metadata.py``: ``LeRobotDatasetMetadata.tools`` property
reads ``info.json["tools"]`` and falls back to ``DEFAULT_TOOLS``.
Returns deep-copied dicts so callers can mutate the result safely.
- ``docs/source/tools.mdx``: spec page covering the catalog, per-row
invocations, and the three-step "how to add a new tool" workflow
(declare schema, implement, register). Linked from the docs
toctree under the Datasets section.
This lays the groundwork for PR 2's pipeline writing the catalog out
during annotation, and PR 3's ``src/lerobot/tools/`` package shipping
runnable implementations (one file per tool — first up:
``say.py`` wrapping Kyutai's pocket-tts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Apply ruff and prettier formatting after merge
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(language): unify resolver dispatch and prune redundant test scaffolding
* Drop the unused `events` kwarg from `active_at`/`nth_prev`/`nth_next`;
only `emitted_at` actually consults events. The dispatcher in
`_resolve_spec` now passes events conditionally.
* Replace the dual `_persistent_sort_key`/`_event_sort_key` pair with a
single `_row_sort_key` and drop the `sort_key` parameter from
`_select_one`. Event rows lack `timestamp` (it is implicit in the
frame) and now default to `0.0` for sort purposes — the
`(style, role)` tiebreaker is unchanged.
* Inline `_select_latest` into `active_at` (its only caller).
* Collapse `emitted_at`'s dual-branch into one `_select_one` call.
* Tighten `_validate_persistent_resolver` to a single
`column_for_style(style) != LANGUAGE_PERSISTENT` check.
* Parameterize `test_per_camera_blend_renders_both_views` over the two
cameras and factor the sub-recipe builder into `_vqa_subrecipe` so
the test no longer hand-rolls two near-identical recipe blocks.
Net -98 LOC; behavior, public resolver names, and test expectations
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(language): always raise on ambiguous resolver matches
`_select_one` previously skipped its ambiguity check whenever any of
`role`/`tool_name`/`camera` was set, on the assumption that the caller
had already pinned down a unique row. That left a real ambiguity hole
for VQA: with two cameras emitting `(vqa, assistant)` at the same
frame, `emitted_at(..., role="assistant")` silently picked the first
sorted row instead of telling the recipe to add `camera=...`. The
existing `test_emitted_at_raises_on_ambiguous_per_camera_vqa` test
already encoded the desired behavior.
Tighten the check: any time `len(rows) > 1` we now raise with the
selectors echoed back, so users see exactly which fields they passed
and that more is needed to disambiguate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: fix CI — collapse short ValueError to one line, refresh uv.lock
* `ruff format` on CI (newer version) wants the short `camera=None`
ValueError on a single line.
* `uv.lock` was stale relative to `pyproject.toml`'s `datasets>=4.7.0`
pin (and picked up upstream `s390x` marker fixes for cuda packages).
CI runs `uv sync --locked` which rejected the divergence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(language): keep base install green — drop processor re-export, gate dataset-extra tests
`lerobot.processor` re-exported `RenderMessagesStep` at the package
level, so importing anything from `lerobot.processor` pulled in
`lerobot.datasets.language` → `lerobot.datasets/__init__.py` →
`require_package("datasets")`, which fails in the Tier 1 base install
that intentionally omits the `[dataset]` extra. The chain bricked
collection for unrelated suites (`tests/policies/pi0_pi05/...`,
`tests/envs/...`, etc.).
* Stop re-exporting `RenderMessagesStep` from `lerobot.processor`. The
only consumer (the test) already imports from the submodule.
Document the deliberate omission in the module docstring.
* Add `pytest.importorskip("datasets", ...)` (and `pandas` where
needed) at the top of the four PR-added tests that exercise the
language stack:
- tests/datasets/test_language.py
- tests/datasets/test_language_render.py
- tests/processor/test_render_messages_processor.py
- tests/utils/test_collate.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(language): address review — tools accessor, motion docs, conditional collate
* **`meta.tools` actually reads `info.json["tools"]`.** `DatasetInfo`
had no `tools` field, so `from_dict` silently dropped the key (it
warned about unknown fields then discarded them) and the property
always returned `DEFAULT_TOOLS`. Added `tools: list[dict] | None`
to the dataclass; `to_dict()` drops it when unset so existing
datasets keep a clean `info.json`. Fixed the accessor to read
`self.info.tools` (the previous `.get(...)` would have raised
AttributeError on the dataclass anyway). Added regression tests:
fallback when absent, round-trip from disk, and round-trip
through `DatasetInfo.from_dict` / `to_dict`.
* **`motion` is not view-dependent — fix the docs.** The mdx claimed
rows of style `motion` must carry `camera`, but `VIEW_DEPENDENT_STYLES
= {"vqa", "trace"}` and the validator agrees: motion primitives are
joint/Cartesian-frame, not pixel-space. Updated both call-out
paragraphs in `language_and_recipes.mdx`.
* **Conditional `collate_fn` swap.** Added `meta.has_language_columns`
and gate the `lerobot_collate_fn` swap in `lerobot_train.py` on it,
so non-language datasets keep PyTorch's `default_collate`. Also
added a pass-through test in `test_collate.py` that asserts on a
plain tensor batch the custom collate matches `default_collate`
key-for-key, plus a test for the `None`-sample drop path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* review: dedupe regex, centralize column names, harden collate, more tests
* **#2 — dedupe `_PLACEHOLDER_RE`.** The same regex was compiled in
`recipe.py` and `language_render.py`. Promote to module-level
`PLACEHOLDER_RE` in `recipe.py` (its primary owner — declares
template syntax) and import from `language_render.py`.
* **#3 — centralize language column names.** `io_utils.py` had
hardcoded `{"language_persistent", "language_events"}` literals at
two sites. Replace with `LANGUAGE_COLUMNS` import so a future column
rename can't silently desync.
* **#4 — defensive collate preserved-keys.** `lerobot_collate_fn`
silently filtered language fields from samples that didn't have
them, which would hand downstream consumers a preserved list
shorter than the tensor batch. Now: if any sample carries a key,
every sample in the batch must carry it; otherwise raise a
`ValueError` so the upstream rendering bug surfaces at the boundary.
* **#5 — `_scalar` rejects non-singleton lists.** Previously a zero-
or multi-element list fell through and triggered confusing
`float([])` errors downstream. Now raises `ValueError` with the
actual length.
* **#6 — refactor `_extract_complementary_data`.** Replace 11 lines
of `key = {... if ... else {}}` plus an 11-line splat dict with a
single `_COMPLEMENTARY_KEYS` tuple iterated once.
* **#7 — document `EXTENDED_STYLES`.** Was an empty `set()` with no
comment. Add a docstring explaining it's an intentional extension
point: downstream modules append project-local styles before
`column_for_style` is called.
* **#9 — `tools.mdx` notes the runtime layer is future work.** The
page referenced `src/lerobot/tools/`, `registry.py`, and
`get_tools(meta)` — none exist in this PR. Added a callout at the
start of "How to add your own tool" plus a note on the
implementations paragraph.
* **#10 — tests for YAML round-trip, malformed rows, blend
validation.** `test_recipe.py` grew from 1 case to 12 covering:
blend-or-messages exclusivity, target-turn requirement, blend
emptiness, weight presence/positivity, nested-blend rejection,
`from_dict` with nested blends, `from_yaml` / `load_recipe`
agreement, top-level non-mapping rejection. Added a malformed-row
test for `_normalize_rows` that asserts non-dict entries raise
`TypeError`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* review: emitted_at uses 0.1s tolerance; MessageTurn requires stream at construction
* **Float tolerance in `emitted_at` for persistent styles.** The
``_timestamp(row) == t`` exact-equality check silently missed any
caller that derived ``t`` arithmetically (e.g. ``frame_idx / fps``)
even though the parquet timestamp would only differ by ULPs. Added
``EMITTED_AT_TOLERANCE_S = 0.1`` and check ``abs(...) <= tolerance``
instead, with a docstring explaining why exact equality wasn't
enough and why 0.1 s is safe at typical 30–100 Hz control rates.
Test asserts the new behavior at half-window (matches) and
double-window (no match) using the constant so it stays in sync.
* **`MessageTurn.stream` is required at construction.** It was typed
``MessageStream | None = None`` so YAML could omit ``stream:`` and
pass the dataclass invariant — but ``_validate_rendered`` rejected
``None`` streams later, surfacing the error at the first sample
instead of at recipe load. Now ``__post_init__`` raises
``ValueError`` if ``stream`` is ``None``, with the list of valid
streams in the message. The redundant late-stage check in
``_validate_rendered`` is replaced with a one-line comment that
cites the upstream invariant. Test pins the new construction-time
rejection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(tools): drop follow-up-PR references
Reword the two callouts in `tools.mdx` to describe the runtime layer
in present tense ("not part of the catalog layer shipped today",
"those modules don't yet exist in the tree") instead of pointing at a
specific follow-up PR. Keeps the doc honest about what works now
without coupling it to a particular release order.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* review: address CarolinePascal feedback
- language timestamps: float64 -> float32 to match LeRobotDataset frame
timestamps (Arrow struct + HF feature)
- dataset_metadata: hoist `.language` imports to module top — language.py
has no lerobot imports, so there is no circular-import risk
- dataset_metadata: add a `meta.tools` setter that persists the catalog to
info.json and reloads `meta.info`
- feature_utils: validate the `language` dtype instead of returning "" —
warn (non-fatal) when a non-empty value is written at record time
- centralize the scalar-unwrap helper as `lerobot.utils.utils.unwrap_scalar`,
shared by render_messages_processor and language_render
- docs: move `## Layer 2 — recipe anatomy` ahead of the resolver sections,
which describe recipe bindings rather than dataset layout
- language_render: note in EMITTED_AT_TOLERANCE_S that persistent rows change
on a human-action timescale, not the camera frame rate
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(video backend): renaming codec into video_backend in get_safe_default_video_backend()
* feat(pyav utils): adding suport for PyAV encoding parameters validation
* feat(VideoEncoderConfig): creating a VideoEncoderConfig to encapsulate encoding parameters
* feat(VideoEncoderConfig): propagating the VideoEncoderConfig in the codebase
* chore(docs): updating the docs
* feat(metadata): adding encoding parameters in dataset metadata
* fix(concatenation compatibility): adding compatibility check when concatenating video files
* feat(VideoEncoderConfig init): making VideoEncoderConfig more robust and adaptable to multiple backends
* feat(pyav checks): making pyav parameters checks more robust
* chore(duplicate): removing duplicate get_codec_options definition
* test(existing): adapting existing tests
* test(new): adding new tests for encoding related features
* chore(format): fixing formatting issues
* chore(PyAV): cleaning up PyAV utils and encoding parameters checks to stick to the minimun required tooling.
* chore(format): formatting code
* chore(doctrings): updating docstrings
* fix(camera_encoder_config): Removing camera_encoder_config from LeRobotDataset, as it's only required in LeRobotDatasetWriter.
* feat(default values): applying a consistent naming convention for default RGB cameras video encoder parameters
* fix(rollout): propagating VideoEncoderConfig to the latest recording modes
* chore(format): formatting code, fixing error messages and variable names
* fix(arguments order): reverting changes in arguments order in StreamingVideoEncoder
* chore(relative imports): switching to relative local imports within lerobot.datasets
* test(artifacts): cleaning up artifacts for the video encoding tests
* chore(docs): updating docs
* chore(fromat): formatting code
* fix(imports): refactoring the file architecture to avoid circular imports. VideoEncoderConfig is now defined in lerobot.configs and lazily imports av at runtime.
* fix(typos): fixing typos and small mistakes
* test(factories): updating factories
* feat(aggregate): updating dataset aggregation procedure. Encoding tuning paramters (crf, g,...) are ignored for validation and changed to None in the aggregated dataset if incompatible.
* docs(typos): fixing typos
* fix(deletion): reverting unwanted deletion
* fix(typos): fixing multiple typos
* feat(codec options): passing codec options to lerobot_edit_dataset episode deletion tool
* typo(typo): typo
* fix(typos): fixing remaining typos
* chore(rename): renaming camera_encoder_config to camera_encoder
* docs(clean): cleaning and formating docs
* docs(dataset): addind details about datasets
* chore(format): formatting code
* docs(warning): adding warning regarding encoding parameters modification
* fix(re-encoding): removing inconsistent re-encoding option in lerobot_edit_dataset
* typos(typos): typos
* chore(format): resolving prettier issues
* fix(h264_nvenc): fixing crf handling for h264_nvenc
* docs(clean): removing too technical parts of the docs
* fix(imports): fixing imports at the __init__ level
* fix(imports): fixing not very pretty imports in video config file
* feat(episode filtering): adding support for episodes filtering at initialization time in LeRobotDataset
* test(tests): adding tests
* chore(format): formatting code
* feat(performance): improving implementation for better performances on big datasets
* chores(warning): improving warnings and errors for episodes filtering
* test(invalid key): adding test for invalid filtering key
* chore(format): formatting code
* refactor(datasets): replace untyped dict with typed DatasetInfo dataclass
Introduce typed DatasetInfo dataclass to replace untyped dict representation of info.json.
Changes:
- Add DatasetInfo dataclass with explicit fields and validation
- Implement __post_init__ for shape conversion (list ↔ tuple)
- Add dict-style compatibility layer (__getitem__, __setitem__, .get())
- Add from_dict() and to_dict() for JSON serialization
- Update io_utils to use load_info/write_info with DatasetInfo
- Update dataset utilities and metadata to use attribute access
- Remove aggregate.py dict-style field access
- Add tests fixture support for DatasetInfo
Benefits:
- Type safety with IDE auto-completion
- Validation at construction time
- Explicit schema documentation
* fix pre-commit
* update docstring inside DatasetInfo.from_dict()
* sorts the unknown to have deterministic output
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
* refactoring the last few old fieds
* fix crop dataset roi type mismatch
* use consistantly int for data and video_files_size_in_mb
---------
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Co-authored-by: jjolla93 <jjolla93@gmail.com>
* fix(one shot load): adding metadata loading when reading from a dataset after writing
* refactor(one shot load): move metadata reload to ensure_readable() on LeRobotDatasetMetadata
Move the metadata reload from DatasetReader.load_and_activate() to a new
public ensure_readable() method on LeRobotDatasetMetadata, called from
LeRobotDataset._ensure_reader(). This places lifecycle management in the
right layer: metadata owns its readiness check, the dataset orchestrates
the write-to-read transition, and the reader stays clean.
Also adds a regression test using delta_timestamps to exercise the
meta.episodes access path in the create -> write -> finalize -> read flow.
Co-authored-by: Steven Palma <imstevenpmwork@users.noreply.github.com>
---------
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Steven Palma <imstevenpmwork@users.noreply.github.com>
* add: a flexible transformation registry
* fix: image transforms can be set both at init and after
* add: tests
* fix: take in review
* feat(datasets): add image transform setters
* fix: pre-commit
* fix: CI
---------
Signed-off-by: Francesco Capuano <74058581+fracapuano@users.noreply.github.com>
* refactor(dataset): enhance dataset root directory handling and introduce hub cache support
- Updated DatasetConfig and LeRobotDatasetMetadata to clarify root directory behavior and introduce a dedicated hub cache for downloads.
- Refactored LeRobotDataset and StreamingLeRobotDataset to utilize the new hub cache and improve directory management.
- Added tests to ensure correct behavior when using the hub cache and handling different revisions without a specified root directory.
* refactor(dataset): improve root directory handling in LeRobotDataset
- Updated LeRobotDataset to store the requested root path separately from the actual root path.
- Adjusted metadata loading to use the requested root, enhancing clarity and consistency in directory management.
* refactor(dataset): minor improvements for hub cache support
* chore(datasets): guard in resume + assertion test
---------
Co-authored-by: AdilZouitine <adilzouitinegm@gmail.com>
Co-authored-by: mickaelChen <mickael.chen.levinson@gmail.com>
* refactor(dataset): split reader and writer
* chore(dataset): remove proxys
* refactor(dataset): better reader & writer encapsulation
* refactor(datasets): clean API + reduce leaky implementations
* refactor(dataset): API cleaning for writer, reader and meta
* refactor(dataset): expose writer & reader + other minor improvements
* refactor(dataset): improve teardown routine
* refactor(dataset): add hf_dataset property at the facade level
* chore(dataset): add init for datasset module
* docs(dataset): add docstrings for public API of the dataset classes
* tests(dataset): add tests for new classes
* fix(dataset): remove circular dependecy
* feat(datasets): add modify_tasks function for in-place task editing
Add a new utility function to modify tasks in LeRobotDataset in-place.
This allows users to:
- Set a single task for all episodes
- Set specific tasks for individual episodes
- Combine a default task with per-episode overrides
* feat(edit-dataset): add CLI support for modify_tasks operation
Integrate the modify_tasks function into lerobot_edit_dataset CLI.
Users can now modify dataset tasks via command line:
Supports setting a default task, per-episode tasks, or both combined.
* test(datasets): add tests for modify_tasks function
Add comprehensive test coverage for the modify_tasks utility:
- Single task for all episodes
- Episode-specific task assignment
- Default task with per-episode overrides
- Error handling for missing/invalid arguments
- Verification of task_index correctness
- In-place modification behavior
- Metadata preservation
* respond to copilot review
* Fix aggeregation of datasets when subdatasets are already a result of a previous merge
* docstring
* respond to copilot review + add regression test
* Remove unnecessary int conversion for indicies
* improve image2video
* add episodes video encoding
* fix mypy failing
* iterate on review
* nit
* remove max, and let it be optional
* iterate more
* update docs
* fix test
---------
Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>
* fix: use features when aggregating image based datasets
* add: test asserting for data type
* add: features param to writing dataset
---------
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>