diff --git a/AGENT_GUIDE.md b/AGENT_GUIDE.md
index 57a33fdba..03b270dce 100644
--- a/AGENT_GUIDE.md
+++ b/AGENT_GUIDE.md
@@ -138,7 +138,7 @@ lerobot-replay --robot.type=so101_follower --robot.port= --robot.
   --dataset.repo_id=${HF_USER}/my_task --dataset.episode=0
 ```
 
-**4.9 Train** (default: ACT — fastest, lowest memory). Apple silicon: `--policy.device=mps`. See §6/§7 for policy and duration.
+**4.9 Train** (default: ACT — fastest, lowest memory). Apple silicon: `--policy.device=mps`. No local GPU? Add `--job.target=` (e.g. `a10g-small`, list them with `hf jobs hardware`) to run on Hugging Face Jobs instead. See §6/§7 for policy and duration.
 
 ```bash
 lerobot-train \
diff --git a/docs/source/cheat-sheet.mdx b/docs/source/cheat-sheet.mdx
index a6afa14c2..dc24f6274 100644
--- a/docs/source/cheat-sheet.mdx
+++ b/docs/source/cheat-sheet.mdx
@@ -120,6 +120,8 @@ lerobot-train \
   --steps=20000
 ```
 
+No local GPU? Add `--job.target=` (e.g. `a10g-small`) to either command and `lerobot-train` runs it on [Hugging Face Jobs](https://huggingface.co/docs/hub/jobs) instead — it uploads a local-only dataset for you and pushes the trained model. List flavors with `hf jobs hardware`.
+
 ### Inference
 
 Inference means running the trained policy/model on a robot. For that we use `lerobot-rollout`. You will need to provide a path to your policy. It can be a local path or a path to Hugging Face for example "lerobot/folding_latest". Your cameras configuration needs to match what was used when collecting the dataset. Duration is in seconds if unspecified, it will run forever.
diff --git a/docs/source/hardware_guide.mdx b/docs/source/hardware_guide.mdx
index 0998344ec..5f236d3e8 100644
--- a/docs/source/hardware_guide.mdx
+++ b/docs/source/hardware_guide.mdx
@@ -96,3 +96,4 @@ Notes:
 - The leading `nvidia-smi` is a quick sanity check that CUDA is visible inside the container — useful to fail fast if the flavor or driver mismatched.
 - The default Job timeout is 30 minutes; pass `--timeout 4h` (or longer) for real training.
 - `--flavor` maps onto the table above: `t4-small`/`t4-medium` (T4, ACT only), `l4x1`/`l4x4` (L4 24 GB), `a10g-small/large/largex2/largex4` (A10G 24 GB scaled out), `a100-large` (A100). For the current full catalogue + pricing see [https://huggingface.co/docs/hub/jobs](https://huggingface.co/docs/hub/jobs).
+- Prefer not to write the `hf jobs run` wrapper yourself? `lerobot-train` can submit the job for you: just add `--job.target=` to a normal training command and it handles dataset upload, log streaming, and the final model push. See the [imitation-learning training guide](./il_robots).
diff --git a/docs/source/il_robots.mdx b/docs/source/il_robots.mdx
index 53ae5af82..2670f1ad5 100644
--- a/docs/source/il_robots.mdx
+++ b/docs/source/il_robots.mdx
@@ -518,7 +518,9 @@ If your local computer doesn't have a powerful GPU you could utilize Google Cola
 
 Hugging Face jobs let's you easily select hardware and run the training in the cloud. So if you don't have a powerful GPU or you need more VRAM or just want to train a model much faster use HF Jobs! It's pay as you go and you simply pay for each second of use, you can see the pricing and additional information [here](https://huggingface.co/docs/hub/jobs).
 
-To run the training use this command:
+> **Tip:** if you just want to launch a standard training run, you can skip building the command below and use the integrated **Train on HF Jobs via `--job.target`** flow described further down — `lerobot-train` then submits the job, uploads a local-only dataset for you, and streams the logs.
+
+To run the training manually use this command:
@@ -591,6 +593,33 @@ Once the training is started you can go to [Jobs](https://huggingface.co/setting
 
 After training the model will be pushed to hub and you can use it as any other model with LeRobot.
 
+#### Train on HF Jobs via `--job.target` (integrated CLI)
+
+`lerobot-train` runs locally by default. To run on a HuggingFace GPU without constructing the Docker command yourself, pass `--job.target` with a hardware flavor name:
+
+```bash
+lerobot-train \
+  --dataset.repo_id=${HF_USER}/so101_test \
+  --policy.type=act \
+  --policy.repo_id=${HF_USER}/my_policy \
+  --job.target=a10g-small
+```
+
+List available flavors and pricing with `hf jobs hardware`. The run streams its logs to your terminal; press Ctrl-C to detach (the job keeps running in the cloud). Re-attach or cancel with:
+
+```bash
+hf jobs logs
+hf jobs cancel
+```
+
+If your dataset exists only locally (not yet on the Hub), it is automatically pushed to a **private** Hub repo so the job can download it by `repo_id` (nothing is made public). The trained model is pushed to the model repo at the end of the run. To also push every intermediate checkpoint to the Hub as it is saved (so you can monitor progress mid-run), add `--save_checkpoint_to_hub=true` — this requires a runtime image that includes this feature.
+
+Every job (and any dataset pushed by the run) is tagged `lerobot` so it's easy to find on the Hub. Add your own with `--job.tags '["my-tag"]'`.
+
+By default the job runs until training finishes, with no time limit. Cap it with an HF Jobs duration string if you want a hard ceiling, e.g. `--job.timeout=4h`.
+
+**Prerequisites:** run `hf auth login` before submitting. For Weights & Biases integration, run `wandb login` or set `WANDB_API_KEY` on your machine — the key is forwarded to the job automatically.
+
 #### Upload policy checkpoints
 
 Once training is done, upload the latest checkpoint with: