Allow --config_path to be a Hub repo id when resuming, not only a local path.
The latest checkpoint under checkpoints/<step>/ is downloaded into a fresh local
run dir and resumed from there (optimizer, scheduler, RNG and data order
restored as for a local resume). TrainPipelineConfig.from_pretrained falls back
to the latest checkpoint's train_config.json when a repo has no root config
(an interrupted run that only pushed checkpoints). The download is skipped when
dispatching remotely so the executor (local machine or HF Jobs pod) performs it.
- add find_latest_hub_checkpoint (utils/hub) and resolve_resume_checkpoint
(common/train_utils), the symmetric download counterpart to
push_checkpoint_to_hub
- unit tests for both helpers and the from_pretrained fallback
Address review feedback on #3856: pushing a checkpoint to the Hub now
also creates a tag named after the checkpoint step, so a checkpoint can
be recovered with --policy.pretrained_revision=<step> instead of having
to look up its commit sha.
huggingface_hub is a core dependency; the in-function import was
unnecessary. Move HfApi to a module-level import and point the test
monkeypatches at lerobot.common.train_utils.HfApi.
Introduce a JobConfig draccus group on TrainPipelineConfig (--job.target/image/
timeout/detach/tags) whose is_remote property gates remote dispatch, plus a
save_checkpoint_to_hub flag and validation. Add push_checkpoint_to_hub(), which
uploads a saved checkpoint directory to the model repo under checkpoints/<step>/
and creates the repo idempotently (private propagates from policy.private).
* feat(train): FSDP checkpoint saving
* adding docs for FSDP
* adding a test for the fsdp checkpoint path
* cleanup
* fixing final upload to hub
* refactored initial implementation to use torch fsdp api and adding new tests
* fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync
In distributed training, accelerate can only synchronize the shuffle
permutation across ranks when the sampler exposes a generator attribute.
EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch
shards relied on every rank's global CPU RNG staying in lockstep forever;
any rank-asymmetric RNG consumption (e.g. eval rollouts on the main
process only) silently desynced the permutations and ranks trained on
overlapping/missing samples.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* fix(train): seed sampler generator and gate dataset download per node
- Pass a generator seeded with cfg.seed to EpisodeAwareSampler so
accelerator.prepare registers it as the synchronized RNG and the
shuffle order is reproducible.
- Gate the initial make_dataset call on is_local_main_process instead of
is_main_process: the global main process only exists on node 0, so on
every other node all local ranks were downloading the dataset and
building the Arrow cache concurrently.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(datasets): add DeterministicEpisodeAwareSampler with O(1) memory and sample-exact resume
Add a sampler that never materializes frame indices: it stores only
per-episode boundaries (numpy, a few bytes per episode) and maps logical
positions to frame indices on the fly with searchsorted. Shuffling uses a
seeded Feistel permutation over [0, num_frames) (cycle-walking to the
exact domain), so the data order is a pure function of (seed, epoch):
- no RNG state to synchronize across distributed ranks,
- constant memory and zero epoch-boundary cost at any dataset size,
- O(1) seek to any position, enabling sample-exact resume.
Opt in with --deterministic_sampler=true. On resume, lerobot-train maps
the checkpointed step back to (epoch, start_index) via
compute_sampler_state and continues at the exact sample where the run
left off (up to accelerate's even_batches padding at epoch boundaries).
The shuffle is pseudo-random rather than a true uniform permutation, the
standard trade-off in large-scale training loaders.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* refactor(datasets): fold deterministic mode into EpisodeAwareSampler
Instead of a parallel DeterministicEpisodeAwareSampler class, extend the
existing EpisodeAwareSampler with a deterministic=True mode (seeded
Feistel permutation, epoch auto-advance, state_dict/load_state_dict).
The default mode is behavior-identical: same torch.randperm consumption
and the same generator contract accelerate synchronizes; the O(N) Python
index list is replaced by O(num_episodes) boundary arrays in both modes,
with `indices` kept as a back-compat property. Passing a generator
together with deterministic=True is rejected, and the state/seek methods
raise outside deterministic mode.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(train): enable deterministic_sampler by default
Deterministic data order (sample-exact resume, no cross-rank RNG sync,
O(1) sampler memory) is now the default for map-style training; set
deterministic_sampler=false to restore the legacy RNG-based shuffle.
Streaming datasets ignore the flag (the sampler path only applies to
map-style datasets), replacing the previous hard validation error so
streaming configs keep working with the new default.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat(datasets): default EpisodeAwareSampler to deterministic mode and trim comments
deterministic=True is now the class default as well as the training
default; the legacy RNG path requires an explicit deterministic=False
(the train script's non-deterministic branch passes it). Docstrings and
inline comments slimmed down across the changed files.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test(sampler): drain resumed trillion-frame sampler via iter() to avoid list() prealloc
list(sampler) calls PyObject_LengthHint -> __len__ (the full 10**12 epoch length) and
preallocates that many slots before iterating, OOMing even though the resumed epoch only
yields 3 frames. Collect through the iterator (no length hint) so the test exercises the
real O(1) seek/drain instead of CPython's list growth heuristic.
* fix(datasets): guard Feistel cycle-walking loop against non-convergence
Replace the unbounded while True in EpisodeAwareSampler._permute with a
bounded for loop capped at _MAX_CYCLE_WALK_STEPS (100) and raise
RuntimeError if the cycle-walk fails to land in [0, num_frames). The
loop is expected to converge in <4 steps on the chosen power-of-two
domain, so the bound is a safety net that should never trip in practice
but prevents a pathological infinite loop.
https://claude.ai/code/session_01HQ15tFrBsHYScjGWosEv22
* fix(datasets): make deterministic-sampler resume robust to world-size changes
compute_sampler_state mapped a checkpointed step back to (epoch, start_index)
using the *current* num_processes, but the number of sampler positions a step
consumes scales with the world size that produced it. Resuming on a different
GPU count therefore landed on the wrong epoch/offset, silently re-seeing or
skipping data.
Record num_processes in training_step.json at checkpoint time and feed the
checkpoint's value into compute_sampler_state on resume, so the data order
resumes at the right position regardless of the new world size. Warn when the
world size changed (the global offset is correct, but per-rank sample-exactness
needs the same topology). Old checkpoints without the field fall back to the
current world size.
Also document compute_sampler_state's assumptions explicitly: num_processes /
batch_size must match the checkpointing run, and accelerate's even_batches=True
padding is mirrored by the ceil(... / num_processes) term.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* style: apply ruff-format to lerobot_train.py
Collapse the compute_sampler_state(...) call onto one line so the
ruff-format pre-commit hook passes (fixes the failing CI check).
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(datasets): use seeded torch.randperm instead of Feistel in EpisodeAwareSampler
Drop the Feistel permutation (and its SplitMix64 hash / cycle-walking) in favor of a
torch.randperm seeded from (seed, epoch). The deterministic mode keeps its key properties
- data order is a pure function of (seed, epoch), so it reproduces on every rank with no
global-RNG synchronization, and
- state_dict / load_state_dict still resume sample-exactly, now by regenerating the epoch's
permutation and slicing from the saved offset.
Construction stays O(num_episodes) (only episode boundaries are stored, never a per-frame
index list). The trade-off vs Feistel: the per-epoch shuffle is again O(num_frames) memory
(the randperm tensor) and no longer O(1)-seekable, in exchange for ~30 fewer LOC and a truly
uniform shuffle. Tests updated: the trillion-frame O(1) test is replaced with a
boundary-storage check and a scale resume-exactness test.
Co-authored-by: Cursor <cursoragent@cursor.com>
* refactor(datasets): make EpisodeAwareSampler always deterministic
With Feistel gone, deterministic and legacy modes were both just torch.randperm and the
deterministic path strictly dominated (reproducible across ranks via the (seed, epoch) seed,
no accelerate generator sync, resumable). Collapse to a single path and drop the redundant
flag:
- remove the `deterministic` and `generator` constructor args, `_iter_default`, and
`_require_deterministic`; `set_epoch` / `state_dict` / `load_state_dict` are now unconditional
- remove the `deterministic_sampler` train config field and the legacy generator branch in
lerobot_train.py (non-streaming map datasets always use the sampler)
- drop the now-obsolete generator/legacy tests
Note: removes the `generator` kwarg from EpisodeAwareSampler (back-compat break vs main); the
order is now a pure function of (seed, epoch), so no cross-rank RNG sync is needed.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(datasets): address sampler review (batch_size resume guard + docs)
- Record batch_size in training_step.json alongside num_processes and feed
the checkpoint's value into compute_sampler_state on resume; warn when it
differs (per-rank sample-exactness needs the same batch size).
- Document the set_epoch vs __iter__ auto-advance coupling on EpisodeAwareSampler
(callers should rely on exactly one mechanism per run).
- Note the broadened (reproducibility-breaking) sampler guard and the no-generator
distributed sharding correctness in lerobot_train.py.
- Add load_training_batch_size + parallel tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(train): download dataset once on the global main process
Gate the training dataset download on the global is_main_process (download once to the
shared dataset root, barrier, then every other rank reads the already-populated copy)
instead of per-node is_local_main_process. LeRobotDataset skips its snapshot_download
when try_load() succeeds, so no rank re-downloads. Assumes the dataset root / HF cache is
on storage shared across nodes.
Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(datasets): trim sampler comment and drop duplicate tests
Remove the verbose dataloader-guard comment and the two EpisodeAwareSampler tests
that duplicated existing validation/warning coverage (no coverage loss).
Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* Add basic support for PEFT adapter methods
This changes adds support for training policies with much less parameters
by applying adapter methods such as LoRA on specific parts of the policies
and therefore possibly higher learning rates / batch sizes.
To make this as accessible as possible I thought it useful to provide
defaults for `target_modules` and `modules_to_save`. Currently only SmolVLA
has such defaults but when we agree that this change is useful I will set
out to generate more such defaults. While the user can override these
settings, they are expected to only change the peft_method, rank and init_type
parameters.
* Implement loading of PEFT adapters
Loading a PEFT adapter is currently done by initializing a policy with default config
and then applying the adapter on the resulting model. This has the obvious drawback
that any configurations done during training are not applied in the adapted model.
Currently the `use_peft` attribute of `PreTrainedConfig` is only set during loading
to signal the following code that it has to deal with a PEFT adapter. However
we could imagine a scenario where this is already set at training time and stored
alongside the adapter.
* Store policy config alongside PEFT checkpoint
Before this change the PEFT-wrapped policy did not save the policy's config
alongside the adapter config / weights which prevented us from changing the
policy config. Now the policy config is saved both in full training and PEFT
training.
This change makes loading the PEFT policy adapter much easier as well.
* Add default config for ACT
* Support targets like `all-linear`
* Formatting
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix failing tests
* Remove PEFT compatibility changes in config
We'll wait for the PEFT release that fixes this for good.
* Remove `use_peft` parameter from training script
Instead we make the PEFT config optional which has the same effect.
* Log adapter config to WandB
* Better documentation for CLI arguments
* Don't unload & merge the PEFT model
This can make things hard when using quantized layers (user expects quantized base layers with
unquantized adapters for example, merging defaults to upcast the layers leading to higher
memory).
* Correct way of identifying when to save config
* Add CLI end-to-end tests
Currently there don't seem to be any way to test the CLI commands.
Since this change mostly happens in those I thought it best to add
a way to test these commands end-to-end.
More integrated commands like `lerobot-record` need patching but
standalone commands like training seem to work fine.
* Update default targets
Removed ACT since it doesn't make sense to fine-tune ACT without having it pretrained beforehand.
SmolVLA and Pi0/0.5 are much more senseful targets.
* Clean up loading code
- Centralized instantiation of the PEFT wrapper in `make_policy` for inference
(e.g. in `lerobot-record`)
- Training a PEFT policy also sets `cfg.use_peft` so that all inference code loading
the policy can rely on that attribute to identify if PEFT loading is needed
- Modified RTC example to also include PEFT policies. Mostly because this is an example
I'm currently exploring.
* Make sure push_to_hub works
Since PEFT only wraps `push_to_hub` and not `push_model_to_hub`, the reference
to `self` in `policy.push_model_to_hub` is the unwrapped policy which, of course,
doesn't know anything about PEFT.
To make the upload process aware of PEFT, we pass the unwrapped policy down to
`push_model_to_hub` as a kwarg. This is not ideal but I think it is the best way
for now.
* formatting
* Warn when encountering from-scratch-training
* Revamp pretrained model loading
There were quite a few factors that convinced me that the status quo
is able to load pretrained models from the PEFT adapter config but
in fact that didn't work.
This commit fixes the following things:
- policies wrapped in PEFT will now have a `name_or_path` attribute
containing the name or path of the pretrained model we're fine-tuning
- we further assume that SmolVLA without `pretrained_path` and
`load_vlm_weights==False` must be an user-side error
- we assume that using PEFT on from-scratch-policies must be
an user-side-error
* Make it possible to unset policy features
This is necessary to train pre-trained policies on new datasets so that the
features are inferred from the new dataset and not from the pretrained
policy.
* Use correct loading for PEFT in RTC example
* Make it possible to use PeftModels in eval
* Add test checking that PEFT actually reduces params
* Adapt state/action projections instead of full-finetuning
There doesn't seem to be a benefit to fully fine-tune these layers
over just adapting them, so we do that instead.
* Disallow PEFT training on non-pretrained policies
At first I thought it would make sense to have this feature
in case you want to fine-tune a pre-trained section but in the
end it makes more trouble than it's worth.
It's still possible to allow this in the future when a concrete
need arises.
* Add basic documentation
* Formatting
* Add peft as extra dependency, mark tests
Fast tests currently fail because of the missing dependency.
* Fix pre-commit issues
* Add walx <> peft conflict for uv
* Exclude peft from pi install for now
---------
Co-authored-by: nemo <git@ningu.net>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>