From 527f7a45c28730bb004e3cbf4adb76496d13d85c Mon Sep 17 00:00:00 2001
From: Nicolas Rabault <rabault.nicolas@gmail.com>
Date: Wed, 24 Jun 2026 10:16:10 +0200
Subject: [PATCH] docs(train): document resuming from a Hub checkpoint, locally
 and on jobs

Show that --config_path accepts a Hub repo id for --resume, and that adding
--job.target resumes on HF Jobs (uploading a local checkpoint/dataset first).
---
 docs/source/cheat-sheet.mdx |  6 ++++++
 docs/source/il_robots.mdx   | 22 ++++++++++++++++++++++
 2 files changed, 28 insertions(+)
diff --git a/docs/source/cheat-sheet.mdx b/docs/source/cheat-sheet.mdx
index dc24f6274..e93e69c9c 100644
--- a/docs/source/cheat-sheet.mdx
+++ b/docs/source/cheat-sheet.mdx
@@ -122,6 +122,12 @@ lerobot-train \
 
 No local GPU? Add `--job.target=<flavor>` (e.g. `a10g-small`) to either command and `lerobot-train` runs it on [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs) instead — it uploads a local-only dataset for you and pushes the trained model. List flavors with `hf jobs hardware`.
 
+To resume, point `--config_path` at a checkpoint and add `--resume=true`. `--config_path` accepts a local path or a Hub repo id (for a repo id, the latest checkpoint is fetched). The command runs locally, or on a job if you add `--job.target=<flavor>`:
+
+```bash
+lerobot-train --config_path=${HF_USER}/policy_test --resume=true --job.target=a10g-small
+```
+
 ### Inference
 
 Inference means running the trained policy/model on a robot. For that we use `lerobot-rollout`. You will need to provide a path to your policy. It can be a local path or a path to Hugging Face for example "lerobot/folding_latest". Your cameras configuration needs to match what was used when collecting the dataset. Duration is in seconds if unspecified, it will run forever.
diff --git a/docs/source/il_robots.mdx b/docs/source/il_robots.mdx
index d18d6085a..fd05839e7 100644
--- a/docs/source/il_robots.mdx
+++ b/docs/source/il_robots.mdx
@@ -506,6 +506,12 @@ lerobot-train \
   --resume=true
 ```
 
+`--config_path` also accepts a **Hub repo id**: if a run pushed its checkpoints to the Hub (with `--save_checkpoint_to_hub=true`), you can resume straight from the repo — its latest checkpoint is downloaded and training continues, restoring the optimizer, scheduler, step counter and data order:
+
+```bash
+lerobot-train --config_path=${HF_USER}/my_policy --resume=true
+```
+
 If you do not want to push your model to the hub after training use `--policy.push_to_hub=false`.
 
 Additionally you can provide extra `tags` or specify a `license` for your model or make the model repo `private` by adding this: `--policy.private=true --policy.tags=\[ppo,rl\] --policy.license=mit`
@@ -622,6 +628,22 @@ By default the job runs until training finishes, with no time limit. Cap it with
 
 **Prerequisites:** run `hf auth login` before submitting. For Weights & Biases integration, run `wandb login` or set `WANDB_API_KEY` on your machine — the key is forwarded to the job automatically.
 
+**Resuming on a job.** Adding `--job.target` to a resume command runs the resume in the cloud — the same command works locally or remotely. The checkpoint repo is the source of truth, and new checkpoints continue the lineage in the same repo:
+
+```bash
+# resume a Hub run on a job (its checkpoints are already on the Hub)
+lerobot-train --config_path=${HF_USER}/my_policy --resume=true --job.target=a10g-small
+
+# resume a LOCAL run on a job — the checkpoint is uploaded to a private Hub repo first,
+# then the job resumes from it (a local-only dataset is uploaded the same way)
+lerobot-train \
+  --config_path=outputs/train/act_so101_test/checkpoints/last/pretrained_model/train_config.json \
+  --resume=true \
+  --job.target=a10g-small
+```
+
+Job settings come from the current command, so override `--job.target`, `--job.timeout`, etc. as needed. If the resumed run should itself be resumable later, keep `--save_checkpoint_to_hub=true`.
+
 #### Upload policy checkpoints
 
 Once training is done, upload the latest checkpoint with: