mirror of
https://github.com/huggingface/lerobot.git
synced 2026-06-25 20:27:05 +00:00
docs(train): document resuming from a Hub checkpoint, locally and on jobs
Show that --config_path accepts a Hub repo id for --resume, and that adding --job.target resumes on HF Jobs (uploading a local checkpoint/dataset first).
@@ -122,6 +122,12 @@ lerobot-train \

No local GPU? Add `--job.target=<flavor>` (e.g. `a10g-small`) to either command and `lerobot-train` runs it on [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs) instead. It uploads a local-only dataset for you and pushes the trained model. List flavors with `hf jobs hardware`.

To resume, point `--config_path` at a checkpoint and add `--resume=true`. `--config_path` accepts a local path or a Hub repo id (in which case the latest checkpoint is fetched), and the command runs locally or, with `--job.target=<flavor>`, on a job:

```bash
lerobot-train --config_path=${HF_USER}/policy_test --resume=true --job.target=a10g-small
```
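
When scripting resumes, the local/job choice comes down to one optional flag. The following is a minimal sketch, assuming a hypothetical helper named `resume_cmd` (not part of lerobot) that only prints the command so you can inspect it before running it. The flags are the ones shown above; `my_user/policy_test` is a placeholder repo id.

```bash
# Hypothetical helper (not part of lerobot): print the resume command for a
# checkpoint, appending --job.target only when a flavor is given.
resume_cmd() {
  checkpoint="$1"   # local path or Hub repo id
  flavor="${2:-}"   # optional Hugging Face Jobs flavor, e.g. a10g-small
  cmd="lerobot-train --config_path=${checkpoint} --resume=true"
  if [ -n "${flavor}" ]; then
    cmd="${cmd} --job.target=${flavor}"
  fi
  printf '%s\n' "${cmd}"
}

resume_cmd "my_user/policy_test"              # command for a local resume
resume_cmd "my_user/policy_test" a10g-small   # command for a resume on a job
```

Printing instead of executing keeps the sketch safe to try on a machine without lerobot installed; run the printed line yourself once it looks right.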

### Inference

Inference means running the trained policy/model on a robot, using `lerobot-rollout`. You need to provide a path to your policy, either a local path or a Hugging Face Hub repo id such as `lerobot/folding_latest`. Your camera configuration must match the one used when collecting the dataset. Duration is given in seconds; if unspecified, the rollout runs forever.

||||
@@ -506,6 +506,12 @@ lerobot-train \
  --resume=true
```

`--config_path` also accepts a **Hub repo id**: if a run pushed its checkpoints to the Hub (with `--save_checkpoint_to_hub=true`), you can resume straight from the repo. Its latest checkpoint is downloaded and training continues, restoring the optimizer, scheduler, step counter and data order:

```bash
lerobot-train --config_path=${HF_USER}/my_policy --resume=true
```
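
Because the same flag takes either form, it can help to check what a value will be treated as before launching. The sketch below is an illustrative pre-flight check only: `config_path_kind` is a hypothetical helper, and its heuristic (an existing path is local, a `namespace/name` string is a Hub repo id) is an assumption, not lerobot's actual resolution logic.

```bash
# Hypothetical pre-flight check (NOT lerobot's own resolution logic):
# guess whether a --config_path value is a local checkpoint or a Hub repo id.
config_path_kind() {
  value="$1"
  if [ -e "${value}" ]; then
    # Anything that exists on disk is treated as a local checkpoint.
    echo "local"
  elif printf '%s' "${value}" | grep -Eq '^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$'; then
    # Exactly one slash between two name-like parts looks like a Hub repo id.
    echo "hub"
  else
    echo "unknown"
  fi
}

config_path_kind "my_user/my_policy"
config_path_kind "/nonexistent/dir/train_config.json"
```

A mistyped local path that happens to look like `namespace/name` would be reported as `hub` here, which is the case worth catching before a download attempt fails.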

If you do not want to push your model to the Hub after training, use `--policy.push_to_hub=false`.

Additionally, you can provide extra `tags`, specify a `license` for your model, or make the model repo `private` by adding: `--policy.private=true --policy.tags=\[ppo,rl\] --policy.license=mit`

@@ -622,6 +628,22 @@ By default the job runs until training finishes, with no time limit. Cap it with

**Prerequisites:** run `hf auth login` before submitting. For Weights & Biases integration, run `wandb login` or set `WANDB_API_KEY` on your machine; the key is forwarded to the job automatically.

**Resuming on a job.** Adding `--job.target` to a resume command runs the resume in the cloud, so the same command works locally or remotely. The checkpoint repo is the source of truth, and new checkpoints continue the lineage in the same repo:

```bash
# resume a Hub run on a job (its checkpoints are already on the Hub)
lerobot-train --config_path=${HF_USER}/my_policy --resume=true --job.target=a10g-small

# resume a LOCAL run on a job: the checkpoint is uploaded to a private Hub repo first,
# then the job resumes from it (a local-only dataset is uploaded the same way)
lerobot-train \
  --config_path=outputs/train/act_so101_test/checkpoints/last/pretrained_model/train_config.json \
  --resume=true \
  --job.target=a10g-small
```
Job settings come from the current command, so override `--job.target`, `--job.timeout`, etc. as needed; for the resumed run to itself be resumable later, keep `--save_checkpoint_to_hub=true`.
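
One way to guard against accidentally breaking the resume chain is to scan the override arguments before submitting. This is a minimal sketch, assuming the checkpoint setting is carried over from the saved config unless the command overrides it (an assumption based on the word "keep" above); `disables_hub_checkpoints` is a hypothetical helper, not a lerobot command.

```bash
# Hypothetical guard (not part of lerobot): succeed (exit status 0) if any
# argument explicitly turns off checkpoint upload to the Hub.
disables_hub_checkpoints() {
  for arg in "$@"; do
    if [ "${arg}" = "--save_checkpoint_to_hub=false" ]; then
      return 0
    fi
  done
  return 1
}

# Example arguments for a resume on a job; my_user/my_policy is a placeholder.
set -- --config_path=my_user/my_policy --resume=true --job.target=a10g-small
if disables_hub_checkpoints "$@"; then
  echo "warning: this run will not be resumable from the Hub"
else
  echo "ok: checkpoint upload is not disabled on the command line"
fi
```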

#### Upload policy checkpoints

Once training is done, upload the latest checkpoint with: