lerobot

mirror of https://github.com/huggingface/lerobot.git synced 2026-07-02 07:37:10 +00:00

Author	SHA1	Message	Date
Nicolas Rabault	5ac3b49a5f	feat(train): run training remotely on HF Jobs via --job.target (#3856 ) * feat(train): add JobConfig group, save_checkpoint_to_hub flag, Hub checkpoint helper Introduce a JobConfig draccus group on TrainPipelineConfig (--job.target/image/ timeout/detach/tags) whose is_remote property gates remote dispatch, plus a save_checkpoint_to_hub flag and validation. Add push_checkpoint_to_hub(), which uploads a saved checkpoint directory to the model repo under checkpoints/<step>/ and creates the repo idempotently (private propagates from policy.private). * feat(train): run training remotely on HF Jobs via --job.target When --job.target names a GPU flavor, train() dispatches to lerobot.jobs.submit_to_hf instead of training locally: it authenticates, ensures the dataset is on the Hub (pushing a local-only one privately), serializes a pod-compatible train_config.json (strips client-only fields, points at the model repo), submits via HfApi.run_job with HF_TOKEN/WANDB_API_KEY secrets, then streams logs and finishes when the model is pushed. Wires push_checkpoint_to_hub into the training loop behind save_checkpoint_to_hub, and tags jobs/datasets/model with 'lerobot' + --job.tags. * docs(train): document remote training on HF Jobs * test(train): skip remote-dispatch tests without the dataset extra The module imports lerobot.scripts.lerobot_train, which eagerly pulls in lerobot.datasets (dataset extra). The base fast-test CI tier runs without that extra, so collection failed there. Guard with pytest.importorskip, matching the existing tests/scripts dataset-extra tests. * refactor(jobs): hoist huggingface_hub imports to module level in hf.py huggingface_hub is a core dependency, so the per-function dynamic imports had no lazy-loading rationale. Move them to a single module-level import and update test monkeypatch targets to lerobot.jobs.hf.* accordingly. 
* refactor(jobs): build remote config dict via cfg.to_dict() TrainPipelineConfig.to_dict() already returns the canonical draccus encoding, so the StringIO + draccus.dump + json.loads round-trip was redundant. Use it directly and drop the now-unused io/draccus imports. * refactor(train): use module-level HfApi import in push_checkpoint_to_hub huggingface_hub is a core dependency; the in-function import was unnecessary. Move HfApi to a module-level import and point the test monkeypatches at lerobot.common.train_utils.HfApi. * refactor(configs): export JobConfig from the configs package Re-export JobConfig in lerobot/configs/__init__.py so external callers import it as `from lerobot.configs import JobConfig`, matching the other config classes. Adapt the train script and test imports. * refactor(jobs): check dataset presence with api.repo_exists Replace the dataset_info try/except RepositoryNotFoundError dance with a direct api.repo_exists(repo_id, repo_type="dataset") call, dropping the httpx/RepositoryNotFoundError test scaffolding. * chore(jobs): annotate ensure_dataset_available api param as HfApi Add the missing HfApi type hint via a TYPE_CHECKING import. * refactor(jobs): use HF_LEROBOT_HOME constant for the local cache root Resolve the local dataset cache via lerobot.utils.constants.HF_LEROBOT_HOME instead of re-reading the env var by hand, dropping the os/Path imports. Tests now patch the imported constant and assert on a stable message substring (the previous "neither" match only passed by accident, matching the test name embedded in the pytest tmp_path). * chore(jobs): guard LeRobotDataset import with require_package Surface a clear "install lerobot[dataset]" error if the datasets extra is missing, instead of a raw ImportError, before pushing a local dataset. 
* docs(configs): clarify the is_remote_target/is_remote split Add a comment explaining why JobConfig keeps both the staticmethod (tests a raw target string from argv before a config exists) and the property (accessor for an existing config instance). * docs(train): note how to pin a pushed model version for inference Document --policy.pretrained_revision alongside --policy.path so a specific Hub-pushed checkpoint (once --save_checkpoint_to_hub has committed several) can be selected for inference. * test(jobs): skip dataset import guard in base-deps test The fast test env installs base deps only, so require_package('datasets') raised ImportError before the mocked lerobot.datasets import was reached. Monkeypatch the guard to a no-op so the unit test exercises the upload logic. * fix(jobs): address claude review findings on remote training Resolve the claude[bot] review on #3856: - Reject reward-model training under --job.target with a clear error instead of crashing on a None policy inside build_remote_config_file. - Support --policy.path remote runs: validate() no longer requires repo_id for remote runs (it is auto-generated in submit_to_hf), and repo_id/push_to_hub are now set after validate() resolves the policy. - Narrow the bare `except Exception` in _tail_logs/_poll_until_done to (OSError, httpx.HTTPError) so programming errors surface instead of being silently retried or counted as job failures. - Install the SIGINT detach handler only on the main thread. - Generate model repo timestamps in UTC. * docs(jobs): document the model-pushed marker contract and orphaned repos Follow-up to the claude[bot] review on #3856 (non-blocking observations): - Cross-reference the "Model pushed to <url>" log line between its producer (PreTrainedPolicy.push_model_to_hub) and the remote-run consumer in submit_to_hf, noting the contract is an early-finish optimization that falls back to status polling if it drifts. 
- Note in the HF Jobs guide that a failed remote run leaves its model repo on the Hub (it is not auto-deleted) and how to remove it. * feat(train): tag each pushed checkpoint with its step Address review feedback on #3856: pushing a checkpoint to the Hub now also creates a tag named after the checkpoint step, so a checkpoint can be recovered with --policy.pretrained_revision=<step> instead of having to look up its commit sha. * fix(jobs): hoist ensure_dataset_available to a module-level import Addresses Caroline's review comment on PR #3856: the local import of ensure_dataset_available inside submit_to_hf was vestigial. dataset.py does not import hf.py, so there is no circular-import risk and no extra load cost (its heavy deps stay lazy), so make it a top-level import. * refactor(configs): untangle config_path/resume resolution in validate() Split the re-parse HACK block in TrainPipelineConfig.validate() into focused helpers (_resolve_pretrained_from_cli, _resolve_resume_checkpoint) that handle the policy path, reward-model path, and resume config_path as separate, readable units. Behavior-preserving. * feat(train): resume training from a Hub checkpoint Allow --config_path to be a Hub repo id when resuming, not only a local path. The latest checkpoint under checkpoints/<step>/ is downloaded into a fresh local run dir and resumed from there (optimizer, scheduler, RNG and data order restored as for a local resume). TrainPipelineConfig.from_pretrained falls back to the latest checkpoint's train_config.json when a repo has no root config (an interrupted run that only pushed checkpoints). The download is skipped when dispatching remotely so the executor (local machine or HF Jobs pod) performs it. 
- add find_latest_hub_checkpoint (utils/hub) and resolve_resume_checkpoint (common/train_utils), the symmetric download counterpart to push_checkpoint_to_hub - unit tests for both helpers and the from_pretrained fallback * feat(jobs): resume a run on HF Jobs from a checkpoint When --resume is set with a remote --job.target, submit_to_hf resumes from the checkpoint repo instead of staging a fresh config. A Hub config_path is resumed in place (its checkpoint config already targets that repo); a local config_path has its checkpoint uploaded to a new private repo first and the run is forced to push back to it. The pod command carries --job.target=local so the checkpoint's saved job.target can't make the pod re-dispatch itself, and the user's CLI overrides are forwarded so a remote resume matches the same local command. ensure_dataset_available is hoisted before the resume/fresh branch since it applies to both. * docs(train): document resuming from a Hub checkpoint, locally and on jobs Show that --config_path accepts a Hub repo id for --resume, and that adding --job.target resumes on HF Jobs (uploading a local checkpoint/dataset first). * fix(jobs): default remote job timeout to 2d instead of the platform default HF Jobs applies its own short 30-minute timeout when none is sent, which silently kills long training runs. Pass an explicit, generous 2d cap by default; users can still override --job.timeout to fail fast or extend it. * fix(jobs): drop --dataset.root on resume + restore keyboard-control docs Address the latest Claude review on #3856: - _build_resume_job no longer forwards --dataset.root to the pod (a host-local path it can't read); the fresh-run path already nulls it in build_remote_config_file, so this makes resume consistent. Add a unit test for _pod_forwarded_args covering the drop in both flag forms. 
- Restore the display-independent keyboard-control docs (n/r/q letter equivalents + X11/Wayland/headless Tip) in il_robots.mdx that this branch was stale on relative to main (#3875). * fix(jobs): handle str-typed job stage from huggingface_hub inspect_job's status.stage is an enum (with .value) in some huggingface_hub versions and a plain str in others. The poller assumed the enum shape, raising "'str' object has no attribute 'value'" on resume for users on the str-returning version. Read it via getattr(..., "value", ...) so both shapes work, and parametrize the poll test over enum and str stages so the str case is actually exercised (the old mock only ever simulated the enum). * refactor(jobs): use relative import for ensure_dataset_available * refactor(train): hoist submit_to_hf import to module top The `from lerobot.jobs import submit_to_hf` was a function-local import in train(); it pulls no heavy/optional deps and has no circular-import risk, so move it to the top-level import block. * refactor(train): hoist _remote_target_in_argv imports to module top Move `import sys` and `from lerobot.configs import JobConfig` out of the function body and into the top-level import block. * refactor(utils): use relative import for sibling constants in hub.py `from lerobot.utils.constants import CHECKPOINTS_DIR` was the odd one out in utils/ — sibling modules there are imported relatively (.constants, .errors, .utils, ...). Match that convention. * refactor(jobs): hoist LeRobotDataset import, guard dataset extra at package init Move the `from lerobot.datasets import LeRobotDataset` import to the top of dataset.py and relocate the `require_package("datasets", extra="dataset")` guard to the jobs package __init__, per review feedback. * test(jobs): skip test_hf if datasets extra is missing lerobot.configs.train pulls in datasets at import time, so the module fails to collect without lerobot[dataset]. 
Guard with importorskip, matching the convention in tests/training/test_multi_gpu.py. * test(jobs): skip test_dataset if datasets extra is missing tests/jobs/test_dataset.py imports lerobot.jobs.dataset, which triggers the require_package("datasets") guard in lerobot/jobs/__init__.py at import time. Without lerobot[dataset] the module fails to collect in the base CI tier. Guard with importorskip, same as test_hf.py.	2026-06-29 17:59:33 +02:00
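The str-typed job stage fix in 5ac3b49a5f reads the stage via getattr(..., "value", ...) so that both an enum and a plain str work. A minimal sketch of that pattern follows; the `Stage` enum and the `stage_name` helper are hypothetical stand-ins for illustration, not the actual huggingface_hub or lerobot names.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical stand-in for an enum-shaped job stage."""

    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"


def stage_name(stage):
    """Return the stage as a plain string for either shape.

    An enum member exposes `.value`; a plain str does not, so the
    getattr default hands the str back unchanged.
    """
    return getattr(stage, "value", stage)


if __name__ == "__main__":
    print(stage_name(Stage.COMPLETED))
    print(stage_name("COMPLETED"))
```

Parametrizing a test over both shapes, as the commit describes, then only needs to call the helper once with an enum member and once with a str.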
Maxime Ellerbach	73782447f2	feat(train): FSDP checkpoint saving (#3810) * feat(train): FSDP checkpoint saving * adding docs for FSDP * adding a test for the fsdp checkpoint path * cleanup * fixing final upload to hub * refactored initial implementation to use torch fsdp api and adding new tests	2026-06-22 13:51:21 +02:00
Pepijn	234c768dfb	feat(datasets): deterministic, resumable shuffling for EpisodeAwareSampler (#3769 ) * fix(datasets): expose a generator on EpisodeAwareSampler for distributed shuffle sync In distributed training, accelerate can only synchronize the shuffle permutation across ranks when the sampler exposes a generator attribute. EpisodeAwareSampler shuffled via the global torch RNG, so disjoint batch shards relied on every rank's global CPU RNG staying in lockstep forever; any rank-asymmetric RNG consumption (e.g. eval rollouts on the main process only) silently desynced the permutations and ranks trained on overlapping/missing samples. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(train): seed sampler generator and gate dataset download per node - Pass a generator seeded with cfg.seed to EpisodeAwareSampler so accelerator.prepare registers it as the synchronized RNG and the shuffle order is reproducible. - Gate the initial make_dataset call on is_local_main_process instead of is_main_process: the global main process only exists on node 0, so on every other node all local ranks were downloading the dataset and building the Arrow cache concurrently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(datasets): add DeterministicEpisodeAwareSampler with O(1) memory and sample-exact resume Add a sampler that never materializes frame indices: it stores only per-episode boundaries (numpy, a few bytes per episode) and maps logical positions to frame indices on the fly with searchsorted. Shuffling uses a seeded Feistel permutation over [0, num_frames) (cycle-walking to the exact domain), so the data order is a pure function of (seed, epoch): - no RNG state to synchronize across distributed ranks, - constant memory and zero epoch-boundary cost at any dataset size, - O(1) seek to any position, enabling sample-exact resume. Opt in with --deterministic_sampler=true. 
On resume, lerobot-train maps the checkpointed step back to (epoch, start_index) via compute_sampler_state and continues at the exact sample where the run left off (up to accelerate's even_batches padding at epoch boundaries). The shuffle is pseudo-random rather than a true uniform permutation, the standard trade-off in large-scale training loaders. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor(datasets): fold deterministic mode into EpisodeAwareSampler Instead of a parallel DeterministicEpisodeAwareSampler class, extend the existing EpisodeAwareSampler with a deterministic=True mode (seeded Feistel permutation, epoch auto-advance, state_dict/load_state_dict). The default mode is behavior-identical: same torch.randperm consumption and the same generator contract accelerate synchronizes; the O(N) Python index list is replaced by O(num_episodes) boundary arrays in both modes, with `indices` kept as a back-compat property. Passing a generator together with deterministic=True is rejected, and the state/seek methods raise outside deterministic mode. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(train): enable deterministic_sampler by default Deterministic data order (sample-exact resume, no cross-rank RNG sync, O(1) sampler memory) is now the default for map-style training; set deterministic_sampler=false to restore the legacy RNG-based shuffle. Streaming datasets ignore the flag (the sampler path only applies to map-style datasets), replacing the previous hard validation error so streaming configs keep working with the new default. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(datasets): default EpisodeAwareSampler to deterministic mode and trim comments deterministic=True is now the class default as well as the training default; the legacy RNG path requires an explicit deterministic=False (the train script's non-deterministic branch passes it). Docstrings and inline comments slimmed down across the changed files. 
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(sampler): drain resumed trillion-frame sampler via iter() to avoid list() prealloc list(sampler) calls PyObject_LengthHint -> __len__ (the full 10*12 epoch length) and preallocates that many slots before iterating, OOMing even though the resumed epoch only yields 3 frames. Collect through the iterator (no length hint) so the test exercises the real O(1) seek/drain instead of CPython's list growth heuristic. fix(datasets): guard Feistel cycle-walking loop against non-convergence Replace the unbounded while True in EpisodeAwareSampler._permute with a bounded for loop capped at _MAX_CYCLE_WALK_STEPS (100) and raise RuntimeError if the cycle-walk fails to land in [0, num_frames). The loop is expected to converge in <4 steps on the chosen power-of-two domain, so the bound is a safety net that should never trip in practice but prevents a pathological infinite loop. https://claude.ai/code/session_01HQ15tFrBsHYScjGWosEv22 * fix(datasets): make deterministic-sampler resume robust to world-size changes compute_sampler_state mapped a checkpointed step back to (epoch, start_index) using the current num_processes, but the number of sampler positions a step consumes scales with the world size that produced it. Resuming on a different GPU count therefore landed on the wrong epoch/offset, silently re-seeing or skipping data. Record num_processes in training_step.json at checkpoint time and feed the checkpoint's value into compute_sampler_state on resume, so the data order resumes at the right position regardless of the new world size. Warn when the world size changed (the global offset is correct, but per-rank sample-exactness needs the same topology). Old checkpoints without the field fall back to the current world size. Also document compute_sampler_state's assumptions explicitly: num_processes / batch_size must match the checkpointing run, and accelerate's even_batches=True padding is mirrored by the ceil(... 
/ num_processes) term. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply ruff-format to lerobot_train.py Collapse the compute_sampler_state(...) call onto one line so the ruff-format pre-commit hook passes (fixes the failing CI check). Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(datasets): use seeded torch.randperm instead of Feistel in EpisodeAwareSampler Drop the Feistel permutation (and its SplitMix64 hash / cycle-walking) in favor of a torch.randperm seeded from (seed, epoch). The deterministic mode keeps its key properties - data order is a pure function of (seed, epoch), so it reproduces on every rank with no global-RNG synchronization, and - state_dict / load_state_dict still resume sample-exactly, now by regenerating the epoch's permutation and slicing from the saved offset. Construction stays O(num_episodes) (only episode boundaries are stored, never a per-frame index list). The trade-off vs Feistel: the per-epoch shuffle is again O(num_frames) memory (the randperm tensor) and no longer O(1)-seekable, in exchange for ~30 fewer LOC and a truly uniform shuffle. Tests updated: the trillion-frame O(1) test is replaced with a boundary-storage check and a scale resume-exactness test. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(datasets): make EpisodeAwareSampler always deterministic With Feistel gone, deterministic and legacy modes were both just torch.randperm and the deterministic path strictly dominated (reproducible across ranks via the (seed, epoch) seed, no accelerate generator sync, resumable). 
Collapse to a single path and drop the redundant flag: - remove the `deterministic` and `generator` constructor args, `_iter_default`, and `_require_deterministic`; `set_epoch` / `state_dict` / `load_state_dict` are now unconditional - remove the `deterministic_sampler` train config field and the legacy generator branch in lerobot_train.py (non-streaming map datasets always use the sampler) - drop the now-obsolete generator/legacy tests Note: removes the `generator` kwarg from EpisodeAwareSampler (back-compat break vs main); the order is now a pure function of (seed, epoch), so no cross-rank RNG sync is needed. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(datasets): address sampler review (batch_size resume guard + docs) - Record batch_size in training_step.json alongside num_processes and feed the checkpoint's value into compute_sampler_state on resume; warn when it differs (per-rank sample-exactness needs the same batch size). - Document the set_epoch vs __iter__ auto-advance coupling on EpisodeAwareSampler (callers should rely on exactly one mechanism per run). - Note the broadened (reproducibility-breaking) sampler guard and the no-generator distributed sharding correctness in lerobot_train.py. - Add load_training_batch_size + parallel tests. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(train): download dataset once on the global main process Gate the training dataset download on the global is_main_process (download once to the shared dataset root, barrier, then every other rank reads the already-populated copy) instead of per-node is_local_main_process. LeRobotDataset skips its snapshot_download when try_load() succeeds, so no rank re-downloads. Assumes the dataset root / HF cache is on storage shared across nodes. 
Co-authored-by: Cursor <cursoragent@cursor.com> * chore(datasets): trim sampler comment and drop duplicate tests Remove the verbose dataloader-guard comment and the two EpisodeAwareSampler tests that duplicated existing validation/warning coverage (no coverage loss). Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-12 11:47:16 +02:00
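The final sampler design in 234c768dfb makes the data order a pure function of (seed, epoch) and resumes by regenerating the epoch's permutation and slicing from the saved offset. A rough sketch of that idea is below. It swaps in the standard library's random.Random for torch.randperm so it runs without dependencies, and the function names and the way seed and epoch are combined are assumptions for illustration, not the EpisodeAwareSampler code.

```python
import random


def epoch_order(num_frames, seed, epoch):
    """Return this epoch's shuffled frame order.

    The order depends only on (seed, epoch), so every rank can rebuild it
    independently with no shared RNG state. The seed/epoch mixing below is
    an assumption made for this sketch.
    """
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_frames))
    rng.shuffle(order)
    return order


def resume_order(num_frames, seed, epoch, start_index):
    """Rebuild the epoch's permutation and skip the frames already consumed."""
    return epoch_order(num_frames, seed, epoch)[start_index:]


if __name__ == "__main__":
    full = epoch_order(10, seed=42, epoch=3)
    rest = resume_order(10, seed=42, epoch=3, start_index=4)
    print(full)
    print(rest)
```

As the commit notes for the torch.randperm version, this costs O(num_frames) memory per epoch and is not O(1)-seekable, in exchange for a simpler implementation and a uniform shuffle.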
Steven Palma	df0763a2bc	feat(dependencies): minimal default tag install (#3362)	2026-04-12 20:03:04 +02:00
githubnemo	e670ac5daf	Add basic PEFT support to train script + record module (#1411 ) * Add basic support for PEFT adapter methods This changes adds support for training policies with much less parameters by applying adapter methods such as LoRA on specific parts of the policies and therefore possibly higher learning rates / batch sizes. To make this as accessible as possible I thought it useful to provide defaults for `target_modules` and `modules_to_save`. Currently only SmolVLA has such defaults but when we agree that this change is useful I will set out to generate more such defaults. While the user can override these settings, they are expected to only change the peft_method, rank and init_type parameters. * Implement loading of PEFT adapters Loading a PEFT adapter is currently done by initializing a policy with default config and then applying the adapter on the resulting model. This has the obvious drawback that any configurations done during training are not applied in the adapted model. Currently the `use_peft` attribute of `PreTrainedConfig` is only set during loading to signal the following code that it has to deal with a PEFT adapter. However we could imagine a scenario where this is already set at training time and stored alongside the adapter. * Store policy config alongside PEFT checkpoint Before this change the PEFT-wrapped policy did not save the policy's config alongside the adapter config / weights which prevented us from changing the policy config. Now the policy config is saved both in full training and PEFT training. This change makes loading the PEFT policy adapter much easier as well. * Add default config for ACT * Support targets like `all-linear` * Formatting * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix failing tests * Remove PEFT compatibility changes in config We'll wait for the PEFT release that fixes this for good. 
* Remove `use_peft` parameter from training script Instead we make the PEFT config optional which has the same effect. * Log adapter config to WandB * Better documentation for CLI arguments * Don't unload & merge the PEFT model This can make things hard when using quantized layers (user expects quantized base layers with unquantized adapters for example, merging defaults to upcast the layers leading to higher memory). * Correct way of identifying when to save config * Add CLI end-to-end tests Currently there don't seem to be any way to test the CLI commands. Since this change mostly happens in those I thought it best to add a way to test these commands end-to-end. More integrated commands like `lerobot-record` need patching but standalone commands like training seem to work fine. * Update default targets Removed ACT since it doesn't make sense to fine-tune ACT without having it pretrained beforehand. SmolVLA and Pi0/0.5 are much more senseful targets. * Clean up loading code - Centralized instantiation of the PEFT wrapper in `make_policy` for inference (e.g. in `lerobot-record`) - Training a PEFT policy also sets `cfg.use_peft` so that all inference code loading the policy can rely on that attribute to identify if PEFT loading is needed - Modified RTC example to also include PEFT policies. Mostly because this is an example I'm currently exploring. * Make sure push_to_hub works Since PEFT only wraps `push_to_hub` and not `push_model_to_hub`, the reference to `self` in `policy.push_model_to_hub` is the unwrapped policy which, of course, doesn't know anything about PEFT. To make the upload process aware of PEFT, we pass the unwrapped policy down to `push_model_to_hub` as a kwarg. This is not ideal but I think it is the best way for now. 
* formatting * Warn when encountering from-scratch-training * Revamp pretrained model loading There were quite a few factors that convinced me that the status quo is able to load pretrained models from the PEFT adapter config but in fact that didn't work. This commit fixes the following things: - policies wrapped in PEFT will now have a `name_or_path` attribute containing the name or path of the pretrained model we're fine-tuning - we further assume that SmolVLA without `pretrained_path` and `load_vlm_weights==False` must be an user-side error - we assume that using PEFT on from-scratch-policies must be an user-side-error * Make it possible to unset policy features This is necessary to train pre-trained policies on new datasets so that the features are inferred from the new dataset and not from the pretrained policy. * Use correct loading for PEFT in RTC example * Make it possible to use PeftModels in eval * Add test checking that PEFT actually reduces params * Adapt state/action projections instead of full-finetuning There doesn't seem to be a benefit to fully fine-tune these layers over just adapting them, so we do that instead. * Disallow PEFT training on non-pretrained policies At first I thought it would make sense to have this feature in case you want to fine-tune a pre-trained section but in the end it makes more trouble than it's worth. It's still possible to allow this in the future when a concrete need arises. * Add basic documentation * Formatting * Add peft as extra dependency, mark tests Fast tests currently fail because of the missing dependency. * Fix pre-commit issues * Add walx <> peft conflict for uv * Exclude peft from pi install for now --------- Co-authored-by: nemo <git@ningu.net> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>	2026-01-05 08:51:26 +01:00
Steven Palma	6c28ef894a	chore(docs): add missing license headers (#2140)	2025-10-08 14:27:52 +02:00
Steven Palma	7cf04a5ec3	chore: move constants to utils (#2016)	2025-09-24 11:11:53 +02:00
Simon Alibert	d4ee470b00	Package folder structure (#1417) * Move files * Replace imports & paths * Update relative paths * Update doc symlinks * Update instructions paths * Fix imports * Update grpc files * Update more instructions * Downgrade grpc-tools * Update manifest * Update more paths * Update config paths * Update CI paths * Update bandit exclusions * Remove walkthrough section	2025-07-01 16:34:46 +02:00
Simon Alibert	974028bd28	Organize test folders (#856) Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>	2025-03-13 14:05:55 +01:00

9 Commits