From 527f7a45c28730bb004e3cbf4adb76496d13d85c Mon Sep 17 00:00:00 2001
From: Nicolas Rabault
Date: Wed, 24 Jun 2026 10:16:10 +0200
Subject: [PATCH] docs(train): document resuming from a Hub checkpoint, locally and on jobs

Show that --config_path accepts a Hub repo id for --resume, and that
adding --job.target resumes on HF Jobs (uploading a local
checkpoint/dataset first).
---
 docs/source/cheat-sheet.mdx |  6 ++++++
 docs/source/il_robots.mdx   | 22 ++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/docs/source/cheat-sheet.mdx b/docs/source/cheat-sheet.mdx
index dc24f6274..e93e69c9c 100644
--- a/docs/source/cheat-sheet.mdx
+++ b/docs/source/cheat-sheet.mdx
@@ -122,6 +122,12 @@ lerobot-train \
 
 No local GPU? Add `--job.target=` (e.g. `a10g-small`) to either command and `lerobot-train` runs it on [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs) instead — it uploads a local-only dataset for you and pushes the trained model. List flavors with `hf jobs hardware`.
 
+To resume, point `--config_path` at a checkpoint and add `--resume=true`. It accepts a local path or a Hub repo id (the latest checkpoint is fetched), and works locally or on a job by adding `--job.target=`:
+
+```bash
+lerobot-train --config_path=${HF_USER}/policy_test --resume=true --job.target=a10g-small
+```
+
 ### Inference
 
 Inference means running the trained policy/model on a robot. For that we use `lerobot-rollout`. You will need to provide a path to your policy. It can be a local path or a path to Hugging Face for example "lerobot/folding_latest". Your cameras configuration needs to match what was used when collecting the dataset. Duration is in seconds if unspecified, it will run forever.
diff --git a/docs/source/il_robots.mdx b/docs/source/il_robots.mdx
index d18d6085a..fd05839e7 100644
--- a/docs/source/il_robots.mdx
+++ b/docs/source/il_robots.mdx
@@ -506,6 +506,12 @@ lerobot-train \
   --resume=true
 ```
 
+`--config_path` also accepts a **Hub repo id**: if a run pushed its checkpoints to the Hub (with `--save_checkpoint_to_hub=true`), you can resume straight from the repo — its latest checkpoint is downloaded and training continues, restoring the optimizer, scheduler, step counter and data order:
+
+```bash
+lerobot-train --config_path=${HF_USER}/my_policy --resume=true
+```
+
 If you do not want to push your model to the hub after training use `--policy.push_to_hub=false`.
 
 Additionally you can provide extra `tags` or specify a `license` for your model or make the model repo `private` by adding this: `--policy.private=true --policy.tags=\[ppo,rl\] --policy.license=mit`
@@ -622,6 +628,22 @@ By default the job runs until training finishes, with no time limit. Cap it with
 
 **Prerequisites:** run `hf auth login` before submitting. For Weights & Biases integration, run `wandb login` or set `WANDB_API_KEY` on your machine — the key is forwarded to the job automatically.
 
+**Resuming on a job.** Adding `--job.target` to a resume command runs the resume in the cloud — the same command works locally or remotely. The checkpoint repo is the source of truth, and new checkpoints continue the lineage in the same repo:
+
+```bash
+# resume a Hub run on a job (its checkpoints are already on the Hub)
+lerobot-train --config_path=${HF_USER}/my_policy --resume=true --job.target=a10g-small
+
+# resume a LOCAL run on a job — the checkpoint is uploaded to a private Hub repo first,
+# then the job resumes from it (a local-only dataset is uploaded the same way)
+lerobot-train \
+  --config_path=outputs/train/act_so101_test/checkpoints/last/pretrained_model/train_config.json \
+  --resume=true \
+  --job.target=a10g-small
+```
+
+Job settings come from the current command, so override `--job.target`, `--job.timeout`, etc. as needed; for the resumed run to itself be resumable later, keep `--save_checkpoint_to_hub=true`.
+
 #### Upload policy checkpoints
 
 Once training is done, upload the latest checkpoint with: