docs(train): document resuming from a Hub checkpoint, locally and on jobs

Show that --config_path accepts a Hub repo id for --resume, and that adding --job.target resumes on HF Jobs (uploading a local checkpoint/dataset first).
2026-06-25 20:27:05 +00:00 · 2026-06-24 10:16:10 +02:00
parent 651c113cd3
commit 527f7a45c2
2 changed files with 28 additions and 0 deletions
@@ -122,6 +122,12 @@ lerobot-train \

 No local GPU? Add `--job.target=<flavor>` (e.g. `a10g-small`) to either command and `lerobot-train` runs it on [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs) instead — it uploads a local-only dataset for you and pushes the trained model. List flavors with `hf jobs hardware`.

+To resume, point `--config_path` at a checkpoint and add `--resume=true`. `--config_path` accepts a local path or a Hub repo id (the latest checkpoint is fetched). The command runs locally as written; add `--job.target=<flavor>` to run it on a job:
+
+```bash
+lerobot-train --config_path=${HF_USER}/policy_test --resume=true --job.target=a10g-small
+```
+
 ### Inference

 Inference means running the trained policy/model on a robot. For that we use `lerobot-rollout`. You will need to provide a path to your policy. It can be a local path or a repo on the Hugging Face Hub, for example "lerobot/folding_latest". Your camera configuration needs to match what was used when collecting the dataset. Duration is in seconds; if unspecified, it will run forever.
@@ -506,6 +506,12 @@ lerobot-train \
  --resume=true
 ```

+`--config_path` also accepts a **Hub repo id**: if a run pushed its checkpoints to the Hub (with `--save_checkpoint_to_hub=true`), you can resume straight from the repo — its latest checkpoint is downloaded and training continues, restoring the optimizer, scheduler, step counter and data order:
+
+```bash
+lerobot-train --config_path=${HF_USER}/my_policy --resume=true
+```
+
 If you do not want to push your model to the Hub after training, use `--policy.push_to_hub=false`.

 Additionally, you can provide extra `tags`, specify a `license` for your model, or make the model repo `private` by adding: `--policy.private=true --policy.tags=\[ppo,rl\] --policy.license=mit`
@@ -622,6 +628,22 @@ By default the job runs until training finishes, with no time limit. Cap it with

 **Prerequisites:** run `hf auth login` before submitting. For Weights & Biases integration, run `wandb login` or set `WANDB_API_KEY` on your machine — the key is forwarded to the job automatically.

+**Resuming on a job.** Adding `--job.target` to a resume command runs the resume in the cloud — the same command works locally or remotely. The checkpoint repo is the source of truth, and new checkpoints continue the lineage in the same repo:
+
+```bash
+# resume a Hub run on a job (its checkpoints are already on the Hub)
+lerobot-train --config_path=${HF_USER}/my_policy --resume=true --job.target=a10g-small
+
+# resume a LOCAL run on a job — the checkpoint is uploaded to a private Hub repo first,
+# then the job resumes from it (a local-only dataset is uploaded the same way)
+lerobot-train \
+  --config_path=outputs/train/act_so101_test/checkpoints/last/pretrained_model/train_config.json \
+  --resume=true \
+  --job.target=a10g-small
+```
+
+Job settings come from the current command, so override `--job.target`, `--job.timeout`, etc. as needed. To make the resumed run resumable in turn, keep `--save_checkpoint_to_hub=true`.
+
 #### Upload policy checkpoints

 Once training is done, upload the latest checkpoint with: