1533 Commits

Author SHA1 Message Date
Steven Palma 028a2e047b fix(test): style check 2026-06-29 15:07:16 +02:00
Steven Palma b2abb1e996 chore(utils): add guard for blueprint 2026-06-29 15:06:20 +02:00
Steven Palma 86098ccc2f chore(scripts): recover comments 2026-06-29 15:05:35 +02:00
Steven Palma 4f29cb7a2e chore(dependencies): update rerun ceil version 2026-06-29 15:05:16 +02:00
Steven Palma 000c716dee Merge branch 'main' into feat/bump_rerun
Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>
2026-06-29 13:08:34 +02:00
Caroline Pascal 3dd19d043e feat(depth maps): adding support for depth in LeRobot (#3644)
* feat(depth): add depth quantization helpers and tests

* feat(video): add ffv1 to supported codecs

* feat(depth): persist depth metadata

* feat(depth): extend quantization tools to better fit the encoding/decoding pipeline

* feat(depth): plumb DepthEncoderConfig through LeRobotDataset and DatasetWriter

* feat(depth): wire StreamingVideoEncoder + writer to depth encoder

* feat(depth): wire DatasetReader to decode_depth_frames

* feat(cameras/realsense): expose async depth in metric meters

* feat(features): route 2D camera shapes to observation.depth.<key>

* feat(robots/so_follower): emit + populate depth keys when use_depth

* feat(record): plumb DepthEncoderConfig through lerobot-record

* feat(viz): render depth observations as rr.DepthImage in Viridis

* feat(depth maps writer): adding support for raw depth maps recording with image writer

* chore(format): format code

* feat(depth shape): ensuring depth map shapes always include the channel

* feat(is_depth): simplifying is_depth nested name + legacy support

* fix(stop_event): fixing stop_event race condition in camera classes

* fix(plumbing): fixing missing parts in the depth maps pipeline

* chore(typos): fixing typos

* test(fix): fixing existing tests to still work with latest features

* tests(depth): adding new tests for depth integration validation

* feat(pix_fmt channels): use PyAV to get the number of channels of a pixel format

* feat(refactor): refactor DepthEncoderConfig quantization pipeline, so that the methods do not live in the config class. Add pixel format - channels validation. Move the default pixel format for depth in the config file.

* fix(pre-commit): fixing mutable default value

* fix(info): fixing info metadata update when is_depth_map was set

* tests(typos): fixing typos in tests

* fix(realsense): fixing typo in realsense serial number

* fix(normalization): restricting 255 normalization to non depth/uint8 images only

* fix(typo): fixing typo

* fix(TIFF): add missing quantization and cleanup for TIFF files

* feat(batched dequantization): optimizing dequantize_depth for torch based batched dequantization

* feat(tools): adding depth support in LeRobotDataset edition tools

* test(aggregate): extending aggregation tests to depth frames

* test(cleaning): cleaning up tests

* fix(from_video_info): fixing early validation issue in from_video_info

* fix(typo): fixing typo

* fix(is_depth): adding missing docstrings and is_depth arguments in video decoding functions

Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>

* fix(depth units): fixing depth units output for the realsense cameras

* feat(output unit): adding support for output unit specification at dataset reading/training time

Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>

* test(depth): cleaning up depth tests

* test(depth encoding): updating and cleaning video/depth encoding tests

* chore(format): formatting code

* docs(depth): improving depth maps docs

* test(fix): fixing depth tests

* test(dataset tools): adding missing tests for new dataset edition tools features

* chore(format): formatting code

* fix(pyav check): fixing PyAV option validation for integer codec options by normalizing
numeric values before calling `is_integer()`

Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>

* docs(mermaid): fixing mermaid diagram

* fix(rebase): rebase follow up corrections

* feat(dataset tools): adding missing docstrings and features for depth fill support in dataset edition tools

* docs(docstring): updating docstrings

* docs(dataset tools): updating docs

* fix(save images): fixing image saving in dataset tools

* fix(update video info): fixing update video info logic to match the recording and editing use cases

* test(reencode): fixing reencoding monkeypatch

* fix(review): add Claude review

* chore(format): format code

* fix(update video info): ditching the differentiated approaches for video info update - video info is always updated except for preserved keys.

* chore(rebase): fixing rebase merge conflicts

* test(visualization): fixing visualization tests

* feat(docstrings): adding explicit docstrings for encoding parameters. Docstrings will now show up as descriptions in the CLI --help.

* feat(mm as default): adding a global DEFAULT_DEPTH_UNIT variable setting mm as default depth unit

* fix(RGB <-> camera): renaming camera_encoder to rgb_encoder for clarity

* chore(TODO): removing deprecated TODO

* doc(write_u16_plane): improving docstrings for write_u16_plane

* feat(units): adding constants for depth frames units (m and mm)

* fix(spam): replacing spamming warning with a debug log

* feat(legacy metadata): adding automatic metadata update for legacy 'video.is_depth_map' feature

* fix(copy&reindex): fixing metadata reshaping for single channel frames

* fix(ImageNet): excluding depth frames from ImageNet stats

* fix(PyAV container seek): fixing initial PyAV container seek to be robust against codec choice

* feat(lerobot-dataset-viz): adding support for depth in lerobot-dataset-viz

* fix(compress): removing rerun compression for DepthImages

* fix(single channel squeeze): fixing single channel squeezing

* chore(format): format code

* fix(streaming): adding support for dequantization in streaming_dataset.py

* refactor(read depth): factorizing depth reading methods for realsense camera and adding support for depth-only usage

* chore(renaming): fixing missed RGBEncoderConfig renamings

* docs(renaming): reflecting renamings in a clearer way in the docs

* chore(annotation): excluding depth from the annotation pipeline

* feat(robots): adding depth support in compatible follower robots

* feat(LeSadKiwi): excluding LeKiwi from depth support (for now)

* chore(fail): removing misplaced file

* chore(fail): removing misplaced file

* fix(remove ffv1): removing ffv1 as it does not support MP4

* docs(cheat sheet): adding depth and video encoding to the cheat sheet

* fix(lossless): tuning depth encoding parameters for lossless depth storage

* test(fix): fixing failing tests

* depth(ZMQ): excluding ZMQ from depth support

* Revert "depth(ZMQ): excluding ZMQ from depth support"

This reverts commit b95cf4e4c2.

* fix(image transforms): excluding depth frames from images transforms

* fix(typo): typo

* fix(stats): fixing stats computation for depth frames

* fix(TIFF vs. pytorch): adding an extra uint16 to float32 conversion for depth maps stored as raw TIFF images

* fix(typos): fixing typos

* test(dtype): fixing stats computation typing tests

---------

Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Wensi (Vince) Ai <59036629+wensi-ai@users.noreply.github.com>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Wensi Ai <wsai@stanford.edu>
2026-06-27 14:21:21 +02:00
Khalil Meftah 6a788fbdb0 Add inline offline validation with train/eval split (#3824)
* refactor(training): rename eval_freq to env_eval_freq

- Rename eval_freq to env_eval_freq to distinguish sim environment evaluation from offline loss evaluation.

* feat(training): add inline offline validation with train/eval split

- Add eval_split config for balanced per-task holdout
- Add eval_steps for periodic inline eval loss computation
- Add max_eval_samples to cap eval cost

* fix(datasets): remap absolute indices in __getitem__ for filtered datasets

* fix(train): vectorize eval subset selection for max_eval_samples

* fix(datasets): Move the remapping into EpisodeAwareSampler via absolute_to_relative_idx

* fix(validation): add eval_split range check and eval_steps warning

Validate eval_split is in [0.0, 1.0) to prevent garbage splits from
out-of-range values. Raise when eval_steps > 0 but eval_split is 0.0
since no offline eval will run.
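
A minimal sketch of the two checks described above. The function name `validate_eval_config` is hypothetical and the real config validation may be structured differently; only the `eval_split` range and the `eval_steps`/`eval_split` interaction come from the message.

```python
def validate_eval_config(eval_split: float, eval_steps: int) -> None:
    """Hypothetical helper mirroring the checks described in the commit message."""
    # eval_split is a holdout fraction: 0.0 disables the split, 1.0 would leave
    # nothing to train on, so the valid range is half-open.
    if not 0.0 <= eval_split < 1.0:
        raise ValueError(f"eval_split must be in [0.0, 1.0), got {eval_split}")
    # Periodic offline eval was requested, but there is no held-out data to run it on.
    if eval_steps > 0 and eval_split == 0.0:
        raise ValueError("eval_steps > 0 requires eval_split > 0.0; no offline eval would run")
```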

* fix(train): prepare eval dataloader with accelerator for multi-GPU

Prepare eval_dataloader through accelerator.prepare() so eval data is
sharded across ranks instead of duplicated. Reduce eval_loss across
ranks with mean reduction for consistent logging.

* fix(test): rename eval_freq to env_eval_freq for multi-GPU training
2026-06-25 15:31:24 +02:00
Khalil Meftah c3f180e115 refactor(policies): clean MolmoAct2 to follow EO1/TOPReward patterns (#3724)
Align the MolmoAct2 implementation with lerobot codebase conventions:

- Rename hf_model/ to molmoact2_hf_model/
- Slim config: move all I/O and runtime logic to modeling
- Remove blanket  from 8 vendored files, fix 66 lint issues
- Deduplicate _hf_token() and _resolve_checkpoint_location()
- Make huggingface_hub imports lazy
- Remove custom MolmoAct2CosineDecayWithWarmupSchedulerConfig, use base class
- Extract 13 static/classmethods from MolmoAct2Policy to free functions
- Replace print() with logger in vendored action_tokenizer
- Add module docstrings, class docstring, and key method docstrings
- Add module-level loggers to modeling and processor
- Fix docs: pip to uv install, deduplicate README symlink
- Remove shebangs from all files
2026-06-25 14:19:35 +02:00
Eric Chan 324086abc3 Update follower arm description in documentation (#3780)
Signed-off-by: Eric Chan <hazzelnut@pm.me>
2026-06-25 13:58:08 +02:00
Steven Palma b4e454c0ff feat(utils): display-independent keyboard controls for recording (Wayland / headless / macOS) (#3875)
* feat(utils): headless keyboard control

* refactor(utils): consolidate keyboard listener creation

* fix(rollout): remove import require guard for pynput

---------

Co-authored-by: Leo Toff <leo@toff.dev>
Co-authored-by: Stefano Maestri <stefano.maestri@javalinux.it>
Co-authored-by: Sahil Chande <85823961+SahilChande@users.noreply.github.com>
Co-authored-by: Vinayak Agarwal <63502278+Vinayak-Agarwal-2004@users.noreply.github.com>
Co-authored-by: Abdul Rahim Mirani <abdulrahimmirani@gmail.com>
2026-06-25 10:58:39 +02:00
someone114514 508d18f8a1 Fix ACT policy type examples in docs (#3792) 2026-06-25 08:59:07 +02:00
Alexandre Edmond 536b9621b2 Fix pi0fast model id in docs (#3855) 2026-06-24 11:44:03 +02:00
Jiwen Cai 79d4976ae2 fix(deps): pin cmeel-urdfdom <5 and cmeel-tinyxml2 <11 in placo-dep (#3873)
placo pulls in pin (Pinocchio), whose binary wheels dlopen specific cmeel
sonames (liburdfdom_sensor.so.4.0, libtinyxml2.so.10) but declare only `>=`
floors on their cmeel packages. The 2026-05-21 major bumps (cmeel-urdfdom
6.0.0 -> .so.6, cmeel-tinyxml2 11.0.0 -> .so.11) ship newer sonames, so left
unpinned the resolver grabs them and `import placo` fails at load with
"liburdfdom_sensor.so.4.0: cannot open shared object file".

#3647 capped placo and hardened the kinematics import, but the guard only
defers the failure: constructing RobotKinematics still raises. Pin the cmeel
packages to the 4.x / 10.x ABI the placo/pin wheels are built against (there
is no cmeel-urdfdom 5.x; <5 selects 4.x). Regenerated uv.lock with uv 0.8.0
to match CI; the only resolution change is the two cmeel versions (plus a
deterministic decord platform-marker cascade from 4.0.1's wider wheel set).

Fixes #3755
2026-06-24 11:23:25 +02:00
Khalil Meftah 6f0ba4be38 Record eval rollouts as LeRobot datasets (#3825)
* feat(eval): record eval rollouts as raw LeRobot datasets

- Record raw env observations inline during rollout(), before
preprocess_observation() transforms them. Uses LeRobotDataset.create()
with add_frame()/save_episode().

- Supports vectorized envs: each env in the batch records independently,
with save_episode() called per env on termination. Each task gets its
own dataset under output_dir/recordings/{task_group}_{task_id}/.

Enabled via --eval.recording=true; disabled by default.

* fix(eval): use FeatureType enum comparison instead of string value

* refactor(eval): per-env datasets recording, no double reset

- Extract _infer_shape_from_obs() to reduce nesting in feature conversion
- Move dataset creation into rollout() using its own env.reset() observation,
  eliminating the extra reset in run_one()
- Replace deepcopy with _shallow_copy_obs() for raw observation stashing
- Support batch_size > 1: each parallel env records to its own dataset
  (single env skips the env_0/ nesting for simplicity)
- One-time warning for env_features keys missing from observations
- Pass recording_dir + env_features through the call chain instead of
  a pre-built recording_dataset object

* refactor(eval): remove shape inference and shallow copy helpers

* feat(eval): optionally push recorded eval datasets to the Hub

* fix(eval): address review comments

- Wrap rollout loop in try/finally so finalize() runs on crash/interrupt
- Guard push_to_hub with num_episodes > 0 to avoid pushing empty datasets
- Hoist loop-invariant multi_env and base_repo_id out of creation loop
2026-06-23 14:03:57 +02:00
Maxime Ellerbach 73782447f2 feat(train): FSDP checkpoint saving (#3810)
* feat(train): FSDP checkpoint saving

* adding docs for FSDP

* adding a test for the fsdp checkpoint path

* cleanup

* fixing final upload to hub

* refactored initial implementation to use torch fsdp api and adding new tests
2026-06-22 13:51:21 +02:00
Khalil Meftah 2d7a42011a fix(policies): support offline batch inference for ACT and Diffusion (#3822)
- Guard ACT's KL divergence computation against None latent params to
prevent crashes during eval when use_vae is set but the forward path
returns no VAE outputs.
- Add offline batch fallback to Diffusion's predict_action_chunk() so
it works with dataloader batches (empty queues) in addition to the
existing online rollout path (populated queues). This enables batched
action prediction for offline evaluation.
2026-06-21 11:48:45 +02:00
Khalil Meftah b06ad40888 feat(hub): add pretrained_revision to pin Hub model versions (#3820)
- Add pretrained_revision field to PreTrainedConfig (policies) and
RewardModelConfig (reward models), and thread it through make_policy(),
make_pre_post_processors(), and make_reward_model() so that weights and
processor configs can be loaded from a specific Hub commit, branch, or
tag. Defaults to None (latest version, preserving current behavior).
Dataset and env hub loading already supported revision pinning.

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-06-19 18:32:47 +02:00
Khalil Meftah b3d74f80f0 Fix batch wandb logging metrics and handle scalar stats (#3821)
* fix(logging): batch wandb metrics

- Batch all metrics into a single wandb.log() call instead of one per
key, reducing API overhead.

- Add support for list-valued metrics by expanding them to indexed keys (e.g.
metric_0, metric_1).

* fix(stats): handle scalar stats robustly

- Wrap cast_stats_to_numpy with np.atleast_1d to prevent 0-d arrays
from scalar stats causing shape mismatches downstream.

* fix(logging): remove unused list-valued metric expansion

---------

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-06-19 18:31:12 +02:00
Khalil Meftah 552b4c3563 Add third-party env plugin discovery (#3823)
* feat(envs): add env plugin discovery

- Add 'lerobot_env_' to third-party plugin discovery prefixes, completing
the plugin system for all component types (robots, cameras, teleoperators,
policies, and now environments). External packages named lerobot_env_*
can self-register EnvConfig subclasses on import, enabling --env.type=
resolution without lerobot code changes.

* feat(envs): add generic observation passthrough

- Add generic observation passthrough in preprocess_observation() for
unhandled ndarray/tensor keys, replacing the pattern of adding per-env
hardcoded key handlers. Extra keys are forwarded as observation.<key>
and can be shaped by env-specific ProcessorSteps via get_env_processors().

---------

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
2026-06-19 18:30:00 +02:00
Nicolas Rabault 8bf6056d14 docs: add LeLab web interface to README (#3831) 2026-06-17 18:22:21 +02:00
Caroline Pascal da92db8fc0 fix(image transforms): cleaning up image_transforms implementation in LeRobotDataset (#3829) 2026-06-17 11:50:09 +02:00
Caroline Pascal 2b0834bcb8 fix(cameras): snapshot stop_event in read loops to avoid None deref (#3812)
* Do not set stop_event to None when stopping thread

* fix(cameras): snapshot stop_event in read loops to avoid None deref
The background read loops accessed self.stop_event repeatedly while
_stop_read_thread() can reassign it to None after join(). Reading the
attribute across the loop condition (and a mid-loop re-check) was a
time-of-check/time-of-use race: stop_event could flip to None between
the `is None` test and the `.is_set()` call, raising AttributeError on
the worker thread.
Snapshot self.stop_event into a local once, guard it, and loop on the
local Event. The Event object is thread-safe and lives for the thread's
lifetime; _stop_read_thread() always calls .set() before nulling the
attribute, so the local observes the stop and exits cleanly. This also
lets us drop the redundant pre-lock stop check.
Applies to OpenCVCamera, RealSenseCamera, and ZMQ camera.
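
The snapshot pattern described above can be sketched as follows. `Reader` is a hypothetical stand-in for the camera classes, not their actual code; only the attribute name `stop_event` and the set-before-null ordering are taken from the message.

```python
import threading
import time


class Reader:
    """Hypothetical stand-in for a camera class with a background read loop."""

    def __init__(self):
        self.stop_event = None
        self.thread = None
        self.frames_read = 0

    def _read_loop(self):
        # Snapshot the attribute once. The local keeps referencing the Event
        # even if the stopping thread later reassigns self.stop_event to None.
        stop_event = self.stop_event
        if stop_event is None:
            return
        while not stop_event.is_set():
            self.frames_read += 1  # placeholder for reading one frame
            time.sleep(0.001)

    def start(self):
        self.stop_event = threading.Event()
        self.thread = threading.Thread(target=self._read_loop, daemon=True)
        self.thread.start()

    def stop(self):
        # set() always happens before the attribute is nulled, so the worker's
        # local Event observes the stop request and the loop exits cleanly.
        if self.stop_event is not None:
            self.stop_event.set()
        if self.thread is not None:
            self.thread.join(timeout=2.0)
        self.thread = None
        self.stop_event = None
```

Looping on `self.stop_event.is_set()` directly would re-read the attribute on every iteration, which is where the time-of-check/time-of-use window comes from.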

---------

Co-authored-by: Anes Benmerzoug <anes.benmerzoug@gmail.com>
2026-06-17 11:40:17 +02:00
Caroline Pascal 287c823f13 fix(features copy): adding deepcopy on LeRobot dataset features to avoid shallow copy leaks (#3826)
* fix(features copy): adding deepcopy on LeRobot dataset features to avoid shallow copy leaks

* tests(test): adding new test
2026-06-16 17:58:59 +02:00
Pepijn 58ccc01508 fix(datasets): enforce one parquet row group per episode in v3 data writes (#3807)
* fix(datasets): enforce one parquet row group per episode in v3 data writes

LeRobot v3 data shards must hold exactly one row group per episode so a
reader can fetch episode i with pq.ParquetFile(path).read_row_group(i)
(a byte-range read) instead of loading the whole shard. The recording
writer already does this (one write_table per episode); the aggregate
and lerobot-annotate re-write paths instead concatenated many episodes
and wrote them in one shot, collapsing the file to a single row group.

- io_utils: add write_table_one_row_group_per_episode (one ParquetWriter,
  one write_table per episode — same pattern as the recording writer);
  to_parquet_with_hf_images embeds images then writes per-episode row
  groups; to_parquet_one_row_group_per_episode wraps it for plain frames
- aggregate: route non-image data writes through the per-episode writer;
  leave the episodes-metadata parquet untouched (already one row/episode)
- annotate: rewrite shards via the per-episode writer instead of a single
  bulk pq.write_table
- tests: invariant coverage through the aggregate (image + video) and
  annotate paths

No change to on-disk schema, paths, naming, rollover thresholds, or
compression. Readers stay backward-compatible (old collapsed files load).

* Update src/lerobot/datasets/io_utils.py

Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update src/lerobot/datasets/io_utils.py

Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* fix(datasets): correct indentation and add strict= in row-group helper

The web-edited numpy version of write_table_one_row_group_per_episode had an
over-indented line (IndentationError, breaking pre-commit + test collection)
and a zip() without strict=. Fix both; behaviour unchanged.

---------

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
2026-06-16 12:15:48 +02:00
Caroline Pascal 38327fdc84 fix(images/videos): fixing aggregate_pipeline_dataset_features to avoid unwanted images features deletion (#3783)
* fix(images/videos): fixing aggregate_pipeline_dataset_features to avoid unwanted images features deletion when videos are not used

* fix(docstrings): improving docstrings

Signed-off-by: Caroline Pascal <caroline8.pascal@gmail.com>

---------

Signed-off-by: Caroline Pascal <caroline8.pascal@gmail.com>
2026-06-15 17:55:52 +02:00
Steven Palma 9555efc02c chore(dependencies): update uv.lock (#3595)
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-06-15 16:29:44 +02:00
Steven Palma d576c59afb refactor(robots): homogenize bi-manual setups implementations (#3772)
* chore(robots): homogenize bi setups

* feat(robots): split openarm mini into single and bi

* refactor(robots): mixin for bi classes

* docs: update docs
2026-06-15 16:28:54 +02:00
Altman 8515d456be fix(datasets): avoid uint8 overflow in image stats (#3697)
* fix(datasets): avoid uint8 overflow in image stats

* fix(datasets): promote stats batches dynamically
2026-06-13 12:09:43 +02:00
Mahbod 30790de178 feat(edit-dataset): add concatenate_videos opt-out to merge (#3663)
* feat(edit-dataset): add `concatenate_videos` opt-out to merge

When merging datasets, source mp4s are concatenated into shards capped at
`video_files_size_in_mb` (default 200 MB). This is great for dataloader
throughput but destroys per-episode (or per-source) video boundaries,
which is undesirable when you want to inspect, ship, or reuse the
individual mp4s.

Add a `concatenate_videos: bool = True` knob plumbed through
`MergeConfig` → `merge_datasets` → `aggregate_datasets` → `aggregate_videos`.
When False, each source mp4 is copied 1:1 to its own destination mp4 with
no re-muxing, so the merge preserves source video boundaries.

Usage:

    lerobot-edit-dataset \
        --new_repo_id user/merged \
        --operation.type=merge \
        --operation.repo_ids "['user/a', 'user/b']" \
        --operation.concatenate_videos=false

Defaults are unchanged; the dataloader path is unaffected because the
`episodes.parquet` `from_timestamp`/`to_timestamp` index keeps working
regardless of whether each mp4 holds one or many episodes.

* feat(edit-dataset): extend concatenate opt-out to data files

Following review, add a concatenate_data flag mirroring concatenate_videos,
threaded through MergeConfig, merge_datasets, aggregate_datasets, aggregate_data
and append_or_create_parquet_file. Metadata index files still always concatenate.

Also trim the verbose docstrings and comments since the names are
self-explanatory, and extend the existing merge test to cover data files.
2026-06-12 20:05:04 +02:00
Pepijn cec8ee0be6 feat: language annotation pipeline (#3471)
Steerable annotation pipeline (lerobot-annotate) that populates the language_persistent and language_events columns introduced in PR 1 (#3467) directly into data/chunk-*/file-*.parquet.

This is PR 2 of the three-PR plan:

- PR 1 (Add extensive language support #3467): schema + DSL + rendering, base of this PR
- PR 2 (this PR): annotation pipeline writing into PR 1's columns
- PR 3: model with language prediction and runtime

A VLM (Qwen-VL family, served on vLLM) watches each episode's video and emits grounded language annotations: subtasks, plans, memory, task rephrasings, interjections + speech, and per-camera VQA. The pipeline is built for production annotation at scale (single-camera grounding, embedded-frame inputs, a describe-then-segment grounding flow, and a deterministic full-episode coverage guarantee) and is informed by Scale's dense-captioning findings (representation > sampling, rules > reasoning, model capacity is the biggest lever, two-pass systems compound errors).
2026-06-12 15:12:33 +02:00
Nikodem Bartnik 02b315ab6a Docs/model card improvements (#3634)
* update policy deployment instruction with rollout

* add port and fix formatting

* add more base models to generate model card

* updated and extended model descriptions

* fix bug

* improved and extended structure

* exclude the templates from config

* add images and visualize dataset button

* add all policies we have docs for

* remove policies without the docs

* new fields, improved examples
2026-06-12 13:26:52 +02:00
Pepijn 234c768dfb feat(datasets): deterministic, resumable shuffling for EpisodeAwareSampler (#3769)
* fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync

In distributed training, accelerate can only synchronize the shuffle
permutation across ranks when the sampler exposes a generator attribute.
EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch
shards relied on every rank's global CPU RNG staying in lockstep forever;
any rank-asymmetric RNG consumption (e.g. eval rollouts on the main
process only) silently desynced the permutations and ranks trained on
overlapping/missing samples.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(train): seed sampler generator and gate dataset download per node

- Pass a generator seeded with cfg.seed to EpisodeAwareSampler so
  accelerator.prepare registers it as the synchronized RNG and the
  shuffle order is reproducible.
- Gate the initial make_dataset call on is_local_main_process instead of
  is_main_process: the global main process only exists on node 0, so on
  every other node all local ranks were downloading the dataset and
  building the Arrow cache concurrently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(datasets): add DeterministicEpisodeAwareSampler with O(1) memory and sample-exact resume

Add a sampler that never materializes frame indices: it stores only
per-episode boundaries (numpy, a few bytes per episode) and maps logical
positions to frame indices on the fly with searchsorted. Shuffling uses a
seeded Feistel permutation over [0, num_frames) (cycle-walking to the
exact domain), so the data order is a pure function of (seed, epoch):

- no RNG state to synchronize across distributed ranks,
- constant memory and zero epoch-boundary cost at any dataset size,
- O(1) seek to any position, enabling sample-exact resume.

Opt in with --deterministic_sampler=true. On resume, lerobot-train maps
the checkpointed step back to (epoch, start_index) via
compute_sampler_state and continues at the exact sample where the run
left off (up to accelerate's even_batches padding at epoch boundaries).
The shuffle is pseudo-random rather than a true uniform permutation, the
standard trade-off in large-scale training loaders.
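
As an illustration of the technique named above, here is a minimal sketch of a seeded Feistel permutation with cycle-walking over [0, num_frames). It is not the sampler's implementation: the function names, the BLAKE2b-based round function, and the round count are assumptions chosen for the example.

```python
import hashlib


def _round_fn(value, round_idx, seed, epoch, mask):
    # Assumed keyed round function: hash (seed, epoch, round, value) and keep
    # only as many bits as one half of the Feistel block.
    payload = f"{seed}:{epoch}:{round_idx}:{value}".encode()
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "big") & mask


def feistel_permute(index, num_frames, seed, epoch, rounds=4):
    """Map index to its shuffled position; a pure function of (seed, epoch)."""
    if not 0 <= index < num_frames:
        raise IndexError(index)
    # Smallest even bit width whose power-of-two domain covers num_frames.
    bits = max(2, (num_frames - 1).bit_length())
    bits += bits % 2
    half = bits // 2
    mask = (1 << half) - 1
    x = index
    while True:
        left, right = x >> half, x & mask
        for r in range(rounds):
            # Each round is invertible, so the network is a bijection on [0, 2**bits).
            left, right = right, left ^ _round_fn(right, r, seed, epoch, mask)
        x = (left << half) | right
        # Cycle-walking: re-encrypt until the value lands inside the exact
        # domain. This terminates because the walk follows a permutation cycle
        # that contains the in-range starting index.
        if x < num_frames:
            return x
```

Because each position is computed independently, no index list is stored and any position can be evaluated directly, which is what makes seeking to a saved offset cheap in this scheme.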

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor(datasets): fold deterministic mode into EpisodeAwareSampler

Instead of a parallel DeterministicEpisodeAwareSampler class, extend the
existing EpisodeAwareSampler with a deterministic=True mode (seeded
Feistel permutation, epoch auto-advance, state_dict/load_state_dict).

The default mode is behavior-identical: same torch.randperm consumption
and the same generator contract accelerate synchronizes; the O(N) Python
index list is replaced by O(num_episodes) boundary arrays in both modes,
with `indices` kept as a back-compat property. Passing a generator
together with deterministic=True is rejected, and the state/seek methods
raise outside deterministic mode.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(train): enable deterministic_sampler by default

Deterministic data order (sample-exact resume, no cross-rank RNG sync,
O(1) sampler memory) is now the default for map-style training; set
deterministic_sampler=false to restore the legacy RNG-based shuffle.
Streaming datasets ignore the flag (the sampler path only applies to
map-style datasets), replacing the previous hard validation error so
streaming configs keep working with the new default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(datasets): default EpisodeAwareSampler to deterministic mode and trim comments

deterministic=True is now the class default as well as the training
default; the legacy RNG path requires an explicit deterministic=False
(the train script's non-deterministic branch passes it). Docstrings and
inline comments slimmed down across the changed files.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(sampler): drain resumed trillion-frame sampler via iter() to avoid list() prealloc

list(sampler) calls PyObject_LengthHint -> __len__ (the full 10**12 epoch length) and
preallocates that many slots before iterating, OOMing even though the resumed epoch only
yields 3 frames. Collect through the iterator (no length hint) so the test exercises the
real O(1) seek/drain instead of CPython's list growth heuristic.
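
A small sketch of the difference, using a hypothetical `HugeSampler` class rather than the real sampler. `operator.length_hint` exposes the same hint that `list()` consults before iterating.

```python
import operator


class HugeSampler:
    """Hypothetical sampler: reports a huge length but yields only 3 items."""

    def __len__(self):
        return 10**12

    def __iter__(self):
        yield from (7, 8, 9)


sampler = HugeSampler()

# The sampler object itself advertises its full epoch length as a hint.
hint_direct = operator.length_hint(sampler)

# The generator returned by iter() has neither __len__ nor __length_hint__,
# so no large hint is available and the default is used.
hint_iter = operator.length_hint(iter(sampler))

# Draining through the iterator collects only the items actually yielded.
frames = list(iter(sampler))
```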

* fix(datasets): guard Feistel cycle-walking loop against non-convergence

Replace the unbounded while True in EpisodeAwareSampler._permute with a
bounded for loop capped at _MAX_CYCLE_WALK_STEPS (100) and raise
RuntimeError if the cycle-walk fails to land in [0, num_frames). The
loop is expected to converge in <4 steps on the chosen power-of-two
domain, so the bound is a safety net that should never trip in practice
but prevents a pathological infinite loop.

https://claude.ai/code/session_01HQ15tFrBsHYScjGWosEv22

* fix(datasets): make deterministic-sampler resume robust to world-size changes

compute_sampler_state mapped a checkpointed step back to (epoch, start_index)
using the *current* num_processes, but the number of sampler positions a step
consumes scales with the world size that produced it. Resuming on a different
GPU count therefore landed on the wrong epoch/offset, silently re-seeing or
skipping data.

Record num_processes in training_step.json at checkpoint time and feed the
checkpoint's value into compute_sampler_state on resume, so the data order
resumes at the right position regardless of the new world size. Warn when the
world size changed (the global offset is correct, but per-rank sample-exactness
needs the same topology). Old checkpoints without the field fall back to the
current world size.

Also document compute_sampler_state's assumptions explicitly: num_processes /
batch_size must match the checkpointing run, and accelerate's even_batches=True
padding is mirrored by the ceil(... / num_processes) term.
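
To illustrate why the world size matters, here is a rough sketch of a step-to-position mapping. The function `sketch_sampler_state` and its exact arithmetic are assumptions made for this example; they are not the actual `compute_sampler_state` code, which may differ in how it handles padding and partial batches.

```python
import math


def sketch_sampler_state(step, num_frames, batch_size, num_processes):
    """Hypothetical mapping from a global step to (epoch, start_index)."""
    # Batches the sampler yields per epoch, before sharding across ranks.
    batches_per_epoch = math.ceil(num_frames / batch_size)
    # Each optimizer step consumes one batch per rank; the ceil mirrors
    # padding the last round so every rank gets a batch.
    steps_per_epoch = math.ceil(batches_per_epoch / num_processes)
    epoch, step_in_epoch = divmod(step, steps_per_epoch)
    # Sampler positions already consumed inside the current epoch.
    start_index = step_in_epoch * num_processes * batch_size
    return epoch, start_index


# Same checkpointed step, different world sizes, different resume positions.
state_4_gpus = sketch_sampler_state(step=30, num_frames=1000, batch_size=10, num_processes=4)
state_2_gpus = sketch_sampler_state(step=30, num_frames=1000, batch_size=10, num_processes=2)
```

Under these assumptions the 4-process run resumes in epoch 1 at offset 200, while the 2-process interpretation of the same step lands in epoch 0 at offset 600, which is the kind of mismatch the fix avoids by storing `num_processes` in the checkpoint.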

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

* style: apply ruff-format to lerobot_train.py

Collapse the compute_sampler_state(...) call onto one line so the
ruff-format pre-commit hook passes (fixes the failing CI check).

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(datasets): use seeded torch.randperm instead of Feistel in EpisodeAwareSampler

Drop the Feistel permutation (and its SplitMix64 hash / cycle-walking) in favor of a
torch.randperm seeded from (seed, epoch). The deterministic mode keeps its key properties:
- data order is a pure function of (seed, epoch), so it reproduces on every rank with no
  global-RNG synchronization, and
- state_dict / load_state_dict still resume sample-exactly, now by regenerating the epoch's
  permutation and slicing from the saved offset.

Construction stays O(num_episodes) (only episode boundaries are stored, never a per-frame
index list). The trade-off vs Feistel: the per-epoch shuffle is again O(num_frames) memory
(the randperm tensor) and no longer O(1)-seekable, in exchange for ~30 fewer LOC and a truly
uniform shuffle. Tests updated: the trillion-frame O(1) test is replaced with a
boundary-storage check and a scale resume-exactness test.
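
The idea can be sketched without torch. The example below swaps in Python's standard `random` module for `torch.randperm`, and the way `seed` and `epoch` are combined, the function names, and the state keys are all assumptions made for illustration.

```python
import random


def epoch_order(num_frames, seed, epoch):
    """Shuffled frame order as a pure function of (seed, epoch)."""
    # Assumed seed derivation; any fixed, collision-avoiding combination works.
    rng = random.Random(f"{seed}:{epoch}")
    order = list(range(num_frames))
    rng.shuffle(order)  # stand-in for a seeded torch.randperm
    return order


def resume_order(num_frames, seed, state):
    """Regenerate the epoch's permutation and slice from the saved offset."""
    return epoch_order(num_frames, seed, state["epoch"])[state["start_index"]:]


full_epoch = epoch_order(100, seed=0, epoch=3)
resumed = resume_order(100, seed=0, state={"epoch": 3, "start_index": 40})
```

Every rank that calls `epoch_order` with the same arguments gets the same list, so no RNG state has to be shared. The cost is that the whole permutation is materialized for each epoch.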

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(datasets): make EpisodeAwareSampler always deterministic

With Feistel gone, deterministic and legacy modes were both just torch.randperm and the
deterministic path strictly dominated (reproducible across ranks via the (seed, epoch) seed,
no accelerate generator sync, resumable). Collapse to a single path and drop the redundant
flag:

- remove the `deterministic` and `generator` constructor args, `_iter_default`, and
  `_require_deterministic`; `set_epoch` / `state_dict` / `load_state_dict` are now unconditional
- remove the `deterministic_sampler` train config field and the legacy generator branch in
  lerobot_train.py (non-streaming map datasets always use the sampler)
- drop the now-obsolete generator/legacy tests

Note: removes the `generator` kwarg from EpisodeAwareSampler (back-compat break vs main); the
order is now a pure function of (seed, epoch), so no cross-rank RNG sync is needed.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(datasets): address sampler review (batch_size resume guard + docs)

- Record batch_size in training_step.json alongside num_processes and feed
  the checkpoint's value into compute_sampler_state on resume; warn when it
  differs (per-rank sample-exactness needs the same batch size).
- Document the set_epoch vs __iter__ auto-advance coupling on EpisodeAwareSampler
  (callers should rely on exactly one mechanism per run).
- Note the broadened (reproducibility-breaking) sampler guard and the no-generator
  distributed sharding correctness in lerobot_train.py.
- Add load_training_batch_size + parallel tests.
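
A rough sketch of the resume guard in the first bullet, assuming training_step.json is a flat JSON object. The helper name and the "step" key are hypothetical, and this is not the load_training_batch_size implementation; it only illustrates "use the checkpoint's value, fall back when the field is missing, warn when it differs":

```python
import json
import logging


def resolve_saved_value(state: dict, key: str, current: int) -> int:
    """Hypothetical helper: prefer the checkpoint's value for `key`."""
    saved = state.get(key)
    if saved is None:
        # Old checkpoint without the field: fall back to the current run's value.
        return current
    if saved != current:
        logging.warning("%s changed since the checkpoint: saved=%s, current=%s", key, saved, current)
    return int(saved)


# Hypothetical checkpoint written before batch_size was recorded.
state = json.loads('{"step": 1000, "num_processes": 2}')
saved_world_size = resolve_saved_value(state, "num_processes", current=4)  # saved value, with a warning
saved_batch_size = resolve_saved_value(state, "batch_size", current=8)  # falls back to the current value
```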

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(train): download dataset once on the global main process

Gate the training dataset download on the global is_main_process (download once to the
shared dataset root, barrier, then every other rank reads the already-populated copy)
instead of per-node is_local_main_process. LeRobotDataset skips its snapshot_download
when try_load() succeeds, so no rank re-downloads. Assumes the dataset root / HF cache is
on storage shared across nodes.
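
The gating pattern can be sketched as below. The function name is hypothetical and the accelerator is duck-typed so the sketch runs without accelerate installed; is_main_process and wait_for_everyone are the accelerate Accelerator members the pattern relies on, and make_dataset is passed in as a plain callable:

```python
from typing import Callable, Protocol, TypeVar

T = TypeVar("T")


class AcceleratorLike(Protocol):
    """Minimal surface of accelerate's Accelerator used by this sketch."""

    is_main_process: bool

    def wait_for_everyone(self) -> None: ...


def load_once_then_share(accelerator: AcceleratorLike, make_dataset: Callable[[], T]) -> T:
    dataset = None
    if accelerator.is_main_process:
        # Global rank 0 downloads into the shared dataset root.
        dataset = make_dataset()
    # Barrier: no other rank proceeds until the download has finished.
    accelerator.wait_for_everyone()
    if not accelerator.is_main_process:
        # Every other rank reads the already-populated copy.
        dataset = make_dataset()
    return dataset
```

This only avoids redundant downloads if all nodes see the same dataset root, which is the shared-storage assumption stated above.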

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(datasets): trim sampler comment and drop duplicate tests

Remove the verbose dataloader-guard comment and the two EpisodeAwareSampler tests
that duplicated existing validation/warning coverage (no coverage loss).

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-12 11:47:16 +02:00
Caroline Pascal 0e9bd9e6fb feat(trim): adding optional trimming option in reencode_video (#3779)
* feat(trim): adding optional trimming option in reencode_video

* tests(trim): add trimming test

---------

Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2026-06-12 11:29:26 +02:00
Steven Palma 87242cfced chore(dependencies): relax grpc-related bounds (#3777)
Signed-off-by: Steven Palma <imstevenpmwork@ieee.org>
2026-06-11 19:13:14 +02:00
Steven Palma 1edc83a0ef feat(training): bump accelerate + use reduction types for tracked metrics in a multi rank setup (#3773)
* feat(training): bump accelerate + use reduction types for tracked metrics in a multi rank setup

* chore: address feedback
2026-06-11 19:07:28 +02:00
Steven Palma 6fbcf67249 chore: update readme (#3774)
* chore: update readme

* chore: update authors in project readme
2026-06-11 18:17:26 +02:00
Pepijn 41166b39fb fix(train): synchronize EpisodeAwareSampler shuffling across ranks and gate dataset download per node (#3768)
* fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync

In distributed training, accelerate can only synchronize the shuffle
permutation across ranks when the sampler exposes a generator attribute.
EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch
shards relied on every rank's global CPU RNG staying in lockstep forever;
any rank-asymmetric RNG consumption (e.g. eval rollouts on the main
process only) silently desynced the permutations and ranks trained on
overlapping/missing samples.

* fix(train): seed sampler generator and gate dataset download per node

- Pass a generator seeded with cfg.seed to EpisodeAwareSampler so
  accelerator.prepare registers it as the synchronized RNG and the
  shuffle order is reproducible.
- Gate the initial make_dataset call on is_local_main_process instead of
  is_main_process: the global main process only exists on node 0, so on
  every other node all local ranks were downloading the dataset and
  building the Arrow cache concurrently.
2026-06-11 11:07:42 +02:00
CarolinePascal fcd8ab5800 fix(claude): claude reviews 2026-06-10 20:25:12 +02:00
CarolinePascal ee6eb745b8 chore(imports): cleaning up imports 2026-06-10 20:00:08 +02:00
CarolinePascal 27b482adf7 chore(simplification): removing no longer needed reshape 2026-06-10 19:50:26 +02:00
CarolinePascal 21d158e066 chore(colors): removing unreliable colors 2026-06-10 19:46:04 +02:00
CarolinePascal 22991ed69a test(update): update tests 2026-06-10 19:32:14 +02:00
CarolinePascal 1adc7a0309 feat(grid): Leveraging rerun's automatic grid arrangement for improved layout 2026-06-10 19:23:55 +02:00
CarolinePascal f72fc3b4ba feat(blueprints): switching to blueprints for backwards (and forward) compatibility 2026-06-10 19:23:55 +02:00
CarolinePascal dabf88ef9f feat(blueprints): switching to blueprints for backwards (and forward) compatibility 2026-06-10 19:23:55 +02:00
CarolinePascal 2c47217825 feat(features names and color): improving features names and display colors when replaying an episode 2026-06-10 19:23:54 +02:00
CarolinePascal 9c502e204e chore(format): formatting code 2026-06-10 19:23:54 +02:00
CarolinePascal c55df19e6c chore(update): update rerun logging to use the latest features 2026-06-10 15:24:03 +02:00
ntjohnson1 c91f345092 Update upper bound to latest rerun-sdk 2026-06-10 15:24:03 +02:00
Steven Palma 79c6821407 chore(dependencies): update mujoco transitives (#3756) 2026-06-10 12:58:55 +02:00