From 2a0dccdffc8b939199e67fcec993c77d72873622 Mon Sep 17 00:00:00 2001 From: Nikodem Bartnik Date: Thu, 2 Jul 2026 11:20:08 +0200 Subject: [PATCH] improve hf jobs docs --- docs/source/hardware_guide.mdx | 18 ++++---- docs/source/il_robots.mdx | 79 +--------------------------------- 2 files changed, 10 insertions(+), 87 deletions(-) diff --git a/docs/source/hardware_guide.mdx b/docs/source/hardware_guide.mdx index 5f236d3e8..fe7d2928b 100644 --- a/docs/source/hardware_guide.mdx +++ b/docs/source/hardware_guide.mdx @@ -82,18 +82,18 @@ VRAM is the first filter. Within a tier, pick by budget and availability — the ### Hugging Face Jobs -[Hugging Face Jobs](https://huggingface.co/docs/hub/jobs) lets you run training on managed HF infrastructure, billed by the second. The repo publishes a ready-to-use image: **`huggingface/lerobot-gpu:latest`**, rebuilt **every night at 02:00 UTC from `main`** ([`docker_publish.yml`](https://github.com/huggingface/lerobot/blob/main/.github/workflows/docker_publish.yml)) — so it tracks the current state of the repo, not a tagged release. +[Hugging Face Jobs](https://huggingface.co/docs/hub/jobs) lets you run training on managed HF infrastructure, billed by the second, without owning a GPU. `lerobot-train` submits and streams the job for you — just add `--job.target=` to a normal training command: ```bash -hf jobs run --flavor a10g-large huggingface/lerobot-gpu:latest \ - bash -c "nvidia-smi && lerobot-train \ - --policy.type=act --dataset.repo_id=/ \ - --policy.repo_id=/act_ --batch_size=8 --steps=50000" +lerobot-train \ + --policy.type=act --dataset.repo_id=/ \ + --policy.repo_id=/act_ \ + --job.target=a10g-large ``` Notes: -- The leading `nvidia-smi` is a quick sanity check that CUDA is visible inside the container — useful to fail fast if the flavor or driver mismatched. -- The default Job timeout is 30 minutes; pass `--timeout 4h` (or longer) for real training. 
-- `--flavor` maps onto the table above: `t4-small`/`t4-medium` (T4, ACT only), `l4x1`/`l4x4` (L4 24 GB), `a10g-small/large/largex2/largex4` (A10G 24 GB scaled out), `a100-large` (A100). For the current full catalogue + pricing see [https://huggingface.co/docs/hub/jobs](https://huggingface.co/docs/hub/jobs). -- Prefer not to write the `hf jobs run` wrapper yourself? `lerobot-train` can submit the job for you: just add `--job.target=` to a normal training command and it handles dataset upload, log streaming, and the final model push. See the [imitation-learning training guide](./il_robots). +- Run `hf auth login` once before submitting, because the job runs under your token. +- `--job.target` maps onto the table above: `t4-small`/`t4-medium` (T4, ACT only), `l4x1`/`l4x4` (L4 24 GB), `a10g-small/large/largex2/largex4` (A10G 24 GB scaled out), `a100-large` (A100). List the current catalogue with pricing via `hf jobs hardware`, or see [https://huggingface.co/docs/hub/jobs](https://huggingface.co/docs/hub/jobs). +- The job defaults to a `2d` (48h) timeout; override it with `--job.timeout=4h` (or any other duration string) to fail faster or run longer. +- For the full walkthrough — dataset upload, checkpoint streaming, resuming a run on a job — see the [imitation-learning training guide](./il_robots#train-using-hugging-face-jobs). diff --git a/docs/source/il_robots.mdx b/docs/source/il_robots.mdx index 64a39e29c..5893b93f4 100644 --- a/docs/source/il_robots.mdx +++ b/docs/source/il_robots.mdx @@ -532,84 +532,7 @@ If your local computer doesn't have a powerful GPU you could utilize Google Cola Hugging Face jobs let's you easily select hardware and run the training in the cloud. So if you don't have a powerful GPU or you need more VRAM or just want to train a model much faster use HF Jobs! It's pay as you go and you simply pay for each second of use, you can see the pricing and additional information [here](https://huggingface.co/docs/hub/jobs). 
-> **Tip:** if you just want to launch a standard training run, you can skip building the command below and use the integrated **Train on HF Jobs via `--job.target`** flow described further down — `lerobot-train` then submits the job, uploads a local-only dataset for you, and streams the logs. - -To run the training manually use this command: - - - -```bash -hf jobs run \ - --flavor a10g-small \ - --timeout 4h \ - --secrets HF_TOKEN \ - huggingface/lerobot-gpu:latest \ - -- \ - python -m lerobot.scripts.lerobot_train \ - --dataset.repo_id=username/dataset \ - --policy.type=act \ - --steps=5000 \ - --batch_size=16 \ - --policy.device=cuda \ - --policy.repo_id=username/your_policy \ - --log_freq=100 -``` - - - - -```python -from huggingface_hub import run_job, get_token - -run_name = "act_so101_hf_jobs" -dataset_id = "username/dataset" -user_hub_id = "username" - -command_args = [ - "python", "-m", "lerobot.scripts.lerobot_train", - "--dataset.repo_id", dataset_id, - "--policy.type", "act", - "--steps", "5000", - "--batch_size", "16", - "--num_workers", "4", - "--policy.device", "cuda", - "--log_freq", "100", - "--save_freq", "1000", - "--save_checkpoint", "true", - "--wandb.enable", "false", - "--policy.repo_id", f"{user_hub_id}/{run_name}" -] - -print(f"Submitting job '{run_name}' to Hugging Face Infrastructure...") - -job_info = run_job( - image="huggingface/lerobot-gpu:latest", - command=command_args, - flavor="a10g-small", - timeout="4h", - secrets={"HF_TOKEN": get_token()} -) - -print("\n🚀 Job successfully launched!") -print(f"🔹 Job ID: {job_info.id}") -print(f"🔗 Live UI Dashboard & Logs: {job_info.url}") -``` - - - - - -You can modify the `--flavor` to use different hardware, for example: `t4-small`, `a100-large`, `h200`. Use `hf jobs hardware` to see the full list with pricing. -Depending on the model you want to train and the hardware you selected you can also modify the `--batch_size` and `--number_of_workers`. 
-For longer training sessions increase the timeout. - -Once the training is started you can go to [Jobs](https://huggingface.co/settings/jobs) and see if your jobs is running as well as all the outputs. Sometimes it takes a few minutes to schedule your job so be patient. - -After training the model will be pushed to hub and you can use it as any other model with LeRobot. - -#### Train on HF Jobs via `--job.target` (integrated CLI) - -`lerobot-train` runs locally by default. To run on a HuggingFace GPU without constructing the Docker command yourself, pass `--job.target` with a hardware flavor name: +`lerobot-train` runs locally by default. To run on a Hugging Face GPU, pass `--job.target` with a hardware flavor name: ```bash lerobot-train \