mirror of https://github.com/huggingface/lerobot.git synced 2026-07-01 23:27:08 +00:00

Nicolas Rabault 5ac3b49a5f feat(train): run training remotely on HF Jobs via --job.target (#3856)

* feat(train): add JobConfig group, save_checkpoint_to_hub flag, Hub checkpoint helper

Introduce a JobConfig draccus group on TrainPipelineConfig (--job.target/image/
timeout/detach/tags) whose is_remote property gates remote dispatch, plus a
save_checkpoint_to_hub flag and validation. Add push_checkpoint_to_hub(), which
uploads a saved checkpoint directory to the model repo under checkpoints/<step>/
and creates the repo idempotently (private propagates from policy.private).

* feat(train): run training remotely on HF Jobs via --job.target

When --job.target names a GPU flavor, train() dispatches to lerobot.jobs.submit_to_hf
instead of training locally: it authenticates, ensures the dataset is on the Hub
(pushing a local-only one privately), serializes a pod-compatible train_config.json
(strips client-only fields, points at the model repo), submits via HfApi.run_job
with HF_TOKEN/WANDB_API_KEY secrets, then streams logs and finishes when the model
is pushed. Wires push_checkpoint_to_hub into the training loop behind
save_checkpoint_to_hub, and tags jobs/datasets/model with 'lerobot' + --job.tags.

* docs(train): document remote training on HF Jobs

* test(train): skip remote-dispatch tests without the dataset extra

The module imports lerobot.scripts.lerobot_train, which eagerly pulls in
lerobot.datasets (dataset extra). The base fast-test CI tier runs without
that extra, so collection failed there. Guard with pytest.importorskip,
matching the existing tests/scripts dataset-extra tests.

* refactor(jobs): hoist huggingface_hub imports to module level in hf.py

huggingface_hub is a core dependency, so the per-function dynamic imports
had no lazy-loading rationale. Move them to a single module-level import
and update test monkeypatch targets to lerobot.jobs.hf.* accordingly.

* refactor(jobs): build remote config dict via cfg.to_dict()

TrainPipelineConfig.to_dict() already returns the canonical draccus
encoding, so the StringIO + draccus.dump + json.loads round-trip was
redundant. Use it directly and drop the now-unused io/draccus imports.

* refactor(train): use module-level HfApi import in push_checkpoint_to_hub

huggingface_hub is a core dependency; the in-function import was
unnecessary. Move HfApi to a module-level import and point the test
monkeypatches at lerobot.common.train_utils.HfApi.

* refactor(configs): export JobConfig from the configs package

Re-export JobConfig in lerobot/configs/__init__.py so external callers
import it as `from lerobot.configs import JobConfig`, matching the other
config classes. Adapt the train script and test imports.

* refactor(jobs): check dataset presence with api.repo_exists

Replace the dataset_info try/except RepositoryNotFoundError dance with a
direct api.repo_exists(repo_id, repo_type="dataset") call, dropping the
httpx/RepositoryNotFoundError test scaffolding.

* chore(jobs): annotate ensure_dataset_available api param as HfApi

Add the missing HfApi type hint via a TYPE_CHECKING import.

* refactor(jobs): use HF_LEROBOT_HOME constant for the local cache root

Resolve the local dataset cache via lerobot.utils.constants.HF_LEROBOT_HOME
instead of re-reading the env var by hand, dropping the os/Path imports.
Tests now patch the imported constant and assert on a stable message
substring (the previous "neither" match only passed by accident, matching
the test name embedded in the pytest tmp_path).

* chore(jobs): guard LeRobotDataset import with require_package

Surface a clear "install lerobot[dataset]" error if the datasets extra
is missing, instead of a raw ImportError, before pushing a local dataset.

* docs(configs): clarify the is_remote_target/is_remote split

Add a comment explaining why JobConfig keeps both the staticmethod (tests
a raw target string from argv before a config exists) and the property
(accessor for an existing config instance).

* docs(train): note how to pin a pushed model version for inference

Document --policy.pretrained_revision alongside --policy.path so a
specific Hub-pushed checkpoint (once --save_checkpoint_to_hub has
committed several) can be selected for inference.

* test(jobs): skip dataset import guard in base-deps test

The fast test env installs base deps only, so require_package('datasets')
raised ImportError before the mocked lerobot.datasets import was reached.
Monkeypatch the guard to a no-op so the unit test exercises the upload logic.

* fix(jobs): address claude review findings on remote training

Resolve the claude[bot] review on #3856:

- Reject reward-model training under --job.target with a clear error instead
  of crashing on a None policy inside build_remote_config_file.
- Support --policy.path remote runs: validate() no longer requires repo_id for
  remote runs (it is auto-generated in submit_to_hf), and repo_id/push_to_hub
  are now set after validate() resolves the policy.
- Narrow the bare `except Exception` in _tail_logs/_poll_until_done to
  (OSError, httpx.HTTPError) so programming errors surface instead of being
  silently retried or counted as job failures.
- Install the SIGINT detach handler only on the main thread.
- Generate model repo timestamps in UTC.

* docs(jobs): document the model-pushed marker contract and orphaned repos

Follow-up to the claude[bot] review on #3856 (non-blocking observations):

- Cross-reference the "Model pushed to <url>" log line between its producer
  (PreTrainedPolicy.push_model_to_hub) and the remote-run consumer in
  submit_to_hf, noting the contract is an early-finish optimization that
  falls back to status polling if it drifts.
- Note in the HF Jobs guide that a failed remote run leaves its model repo
  on the Hub (it is not auto-deleted) and how to remove it.

* feat(train): tag each pushed checkpoint with its step

Address review feedback on #3856: pushing a checkpoint to the Hub now
also creates a tag named after the checkpoint step, so a checkpoint can
be recovered with --policy.pretrained_revision=<step> instead of having
to look up its commit sha.

* fix(jobs): hoist ensure_dataset_available to a module-level import

Addresses Caroline's review comment on PR #3856: the local import of
ensure_dataset_available inside submit_to_hf was vestigial. dataset.py
does not import hf.py, so there is no circular-import risk and no extra
load cost (its heavy deps stay lazy), so make it a top-level import.

* refactor(configs): untangle config_path/resume resolution in validate()

Split the re-parse HACK block in TrainPipelineConfig.validate() into focused
helpers (_resolve_pretrained_from_cli, _resolve_resume_checkpoint) that handle
the policy path, reward-model path, and resume config_path as separate,
readable units. Behavior-preserving.

* feat(train): resume training from a Hub checkpoint

Allow --config_path to be a Hub repo id when resuming, not only a local path.
The latest checkpoint under checkpoints/<step>/ is downloaded into a fresh local
run dir and resumed from there (optimizer, scheduler, RNG and data order
restored as for a local resume). TrainPipelineConfig.from_pretrained falls back
to the latest checkpoint's train_config.json when a repo has no root config
(an interrupted run that only pushed checkpoints). The download is skipped when
dispatching remotely so the executor (local machine or HF Jobs pod) performs it.

- add find_latest_hub_checkpoint (utils/hub) and resolve_resume_checkpoint
  (common/train_utils), the symmetric download counterpart to
  push_checkpoint_to_hub
- unit tests for both helpers and the from_pretrained fallback

* feat(jobs): resume a run on HF Jobs from a checkpoint

When --resume is set with a remote --job.target, submit_to_hf resumes from the
checkpoint repo instead of staging a fresh config. A Hub config_path is resumed
in place (its checkpoint config already targets that repo); a local config_path
has its checkpoint uploaded to a new private repo first and the run is forced to
push back to it. The pod command carries --job.target=local so the checkpoint's
saved job.target can't make the pod re-dispatch itself, and the user's CLI
overrides are forwarded so a remote resume matches the same local command.
ensure_dataset_available is hoisted before the resume/fresh branch since it
applies to both.

* docs(train): document resuming from a Hub checkpoint, locally and on jobs

Show that --config_path accepts a Hub repo id for --resume, and that adding
--job.target resumes on HF Jobs (uploading a local checkpoint/dataset first).

* fix(jobs): default remote job timeout to 2d instead of the platform default

HF Jobs applies its own short 30-minute timeout when none is sent, which
silently kills long training runs. Pass an explicit, generous 2d cap by
default; users can still override --job.timeout to fail fast or extend it.

* fix(jobs): drop --dataset.root on resume + restore keyboard-control docs

Address the latest Claude review on #3856:

- _build_resume_job no longer forwards --dataset.root to the pod (a
  host-local path it can't read); the fresh-run path already nulls it in
  build_remote_config_file, so this makes resume consistent. Add a unit
  test for _pod_forwarded_args covering the drop in both flag forms.
- Restore the display-independent keyboard-control docs (n/r/q letter
  equivalents + X11/Wayland/headless Tip) in il_robots.mdx that this
  branch was stale on relative to main (#3875).

* fix(jobs): handle str-typed job stage from huggingface_hub

inspect_job's status.stage is an enum (with .value) in some
huggingface_hub versions and a plain str in others. The poller
assumed the enum shape, raising "'str' object has no attribute
'value'" on resume for users on the str-returning version.

Read it via getattr(..., "value", ...) so both shapes work, and
parametrize the poll test over enum and str stages so the str case
is actually exercised (the old mock only ever simulated the enum).

* refactor(jobs): use relative import for ensure_dataset_available

* refactor(train): hoist submit_to_hf import to module top

The `from lerobot.jobs import submit_to_hf` was a function-local import in
train(); it pulls no heavy/optional deps and has no circular-import risk, so
move it to the top-level import block.

* refactor(train): hoist _remote_target_in_argv imports to module top

Move `import sys` and `from lerobot.configs import JobConfig` out of the
function body and into the top-level import block.

* refactor(utils): use relative import for sibling constants in hub.py

`from lerobot.utils.constants import CHECKPOINTS_DIR` was the odd one out in
utils/ — sibling modules there are imported relatively (.constants, .errors,
.utils, ...). Match that convention.

* refactor(jobs): hoist LeRobotDataset import, guard dataset extra at package init

Move the `from lerobot.datasets import LeRobotDataset` import to the top of
dataset.py and relocate the `require_package("datasets", extra="dataset")`
guard to the jobs package __init__, per review feedback.

* test(jobs): skip test_hf if datasets extra is missing

lerobot.configs.train pulls in datasets at import time, so the module
fails to collect without lerobot[dataset]. Guard with importorskip,
matching the convention in tests/training/test_multi_gpu.py.

* test(jobs): skip test_dataset if datasets extra is missing

tests/jobs/test_dataset.py imports lerobot.jobs.dataset, which triggers
the require_package("datasets") guard in lerobot/jobs/__init__.py at
import time. Without lerobot[dataset] the module fails to collect in the
base CI tier. Guard with importorskip, same as test_hf.py.

2026-06-29 17:59:33 +02:00

README.md

Generating the documentation

To generate the documentation, you first have to build it. Several packages are necessary to build the docs; you can install them by running the following command at the root of the code repository:

pip install -e . -r docs-requirements.txt

You will also need Node.js. Please refer to its installation page.

NOTE

You only need to generate the documentation to inspect it locally (for instance, if you're planning changes and want to check how they look before committing). You don't have to commit the built documentation.

Building the documentation

Once you have set up doc-builder and the additional packages, you can generate the documentation with the following command:

doc-builder build lerobot docs/source/ --build_dir ~/tmp/test-build

You can adapt the --build_dir to set any temporary folder that you prefer. This command will create it and generate the MDX files that will be rendered as the documentation on the main website. You can inspect them in your favorite Markdown editor.

Previewing the documentation

To preview the docs, first install the watchdog module with:

pip install watchdog

Then run the following command:

doc-builder preview lerobot docs/source/

The docs will be viewable at http://localhost:3000. You can also preview the docs once you have opened a PR: a bot will add a comment with a link to where the documentation with your changes lives.

NOTE

The preview command only works with existing doc files. When you add a completely new file, you need to update _toctree.yml and restart the preview command (Ctrl-C to stop it, then call doc-builder preview ... again).

Adding a new element to the navigation bar

Accepted files are Markdown (.md).

Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting the filename without the extension in the _toctree.yml file.
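As a rough aid for the step above, the following sketch lists pages that exist in the source directory but are not yet linked from _toctree.yml. It assumes each entry is declared with an unquoted `local: <filename without extension>` key and that pages end in .md or .mdx; the function names are hypothetical and not part of doc-builder.

```python
import re
from pathlib import Path

# Assumption: toc-tree entries look like "- local: some/page" (unquoted value).
LOCAL_RE = re.compile(r"^\s*(?:-\s*)?local:\s*(\S+)\s*$", re.MULTILINE)


def toctree_entries(toctree_text: str) -> set:
    """Return every page name referenced by a `local:` key."""
    return set(LOCAL_RE.findall(toctree_text))


def missing_from_toctree(source_dir: Path) -> list:
    """List pages under source_dir that _toctree.yml does not reference."""
    toctree = (source_dir / "_toctree.yml").read_text(encoding="utf-8")
    listed = toctree_entries(toctree)
    pages = set()
    for suffix in (".md", ".mdx"):
        for page in source_dir.rglob(f"*{suffix}"):
            # Entries are written without the extension, relative to source_dir.
            pages.add(page.relative_to(source_dir).with_suffix("").as_posix())
    return sorted(pages - listed)
```

Calling `missing_from_toctree(Path("docs/source"))` from the repository root would then return the names still to be added to the toc-tree, assuming that is where the sources live.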

Renaming section headers and moving sections

It helps to keep the old links working when renaming a section header and/or moving sections from one document to another. Old links are likely to be used in issues, forums, and social media, and it makes for a much better user experience if people reading those months later can still easily navigate to the originally intended information.

Therefore, we simply keep a little map of moved sections at the end of the document where the original section was. The key is to preserve the original anchor.

So if you renamed a section from "Section A" to "Section B", you can add the following at the end of the file:

Sections that were moved:

[ <a href="#section-b">Section A</a><a id="section-a"></a> ]

and of course, if you moved it to another file, then:

Sections that were moved:

[ <a href="../new-file#section-b">Section A</a><a id="section-a"></a> ]

Use the relative style to link to the new file so that the versioned docs continue to work.

For an example of a rich set of moved sections, see the very end of the transformers Trainer doc.
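The moved-section entries shown above can also be generated rather than typed by hand. The sketch below is only illustrative: `slugify` assumes anchors are the lowercased heading with its alphanumeric runs joined by hyphens, which matches the "Section A" → `section-a` example but may differ from the anchor scheme the doc builder actually applies to more complex headings. Both function names are hypothetical.

```python
import re
from typing import Optional


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens (assumed anchor scheme)."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def moved_section_entry(old_title: str, new_title: str, new_file: Optional[str] = None) -> str:
    """Build one moved-section line that preserves the old anchor."""
    target = f"#{slugify(new_title)}"
    if new_file is not None:
        # Relative path to the new file, without extension, e.g. "../new-file".
        target = f"{new_file}{target}"
    return f'[ <a href="{target}">{old_title}</a><a id="{slugify(old_title)}"></a> ]'


print(moved_section_entry("Section A", "Section B"))
print(moved_section_entry("Section A", "Section B", new_file="../new-file"))
```

The two printed lines reproduce the same-file and other-file entries from the examples above.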

Adding a new tutorial

Adding a new tutorial or section is done in two steps:

1. Add a new Markdown (.md) file under ./source.
2. Link that file in ./source/_toctree.yml on the correct toc-tree.

Make sure to put your new file under the proper section. If in doubt, feel free to ask in a GitHub Issue or PR.

Writing source documentation

Values that should be put in code should be surrounded by backticks: `like so`. Note that argument names and objects like `True`, `None`, or any strings should usually be put in code.

Writing a multi-line code block

Multi-line code blocks can be useful for displaying examples. They are done between two lines of three backticks as usual in Markdown:

```
# first line of code
# second line
# etc
```

Adding an image

Because the repository is growing rapidly, it is important that no files that would significantly weigh it down are added. This includes images, videos, and other non-text files. We prefer to place these files in a dataset hosted on hf.co, like the ones hosted on hf-internal-testing, and reference them by URL. We recommend putting them in the following dataset: huggingface/documentation-images. If you are an external contributor, feel free to add the images to your PR and ask a Hugging Face member to migrate them to this dataset.
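Referencing a hosted image by URL could look like the sketch below. It assumes the usual Hub file URL layout (`/datasets/<repo>/resolve/<revision>/<path>`); the helper names and the `lerobot/example.png` path are hypothetical placeholders, not files known to exist in the dataset.

```python
from urllib.parse import quote

# Dataset recommended above for documentation images.
DOCS_IMAGES_REPO = "huggingface/documentation-images"


def hosted_image_url(path_in_repo: str, repo_id: str = DOCS_IMAGES_REPO, revision: str = "main") -> str:
    """Build the download URL of a file stored in a Hub dataset repo."""
    # quote() escapes spaces and similar characters but keeps "/" separators.
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{quote(path_in_repo)}"


def markdown_image(alt_text: str, path_in_repo: str) -> str:
    """Return a Markdown image tag pointing at the hosted file."""
    return f"![{alt_text}]({hosted_image_url(path_in_repo)})"


# Hypothetical path, for illustration only.
print(markdown_image("Example figure", "lerobot/example.png"))
```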
