add section on training using HF jobs

2026-07-23 01:41:54 +00:00 · 2026-05-18 13:27:55 +02:00
parent 6dc9391013
commit ea92b88556
1 changed files with 77 additions and 0 deletions
@@ -514,6 +514,83 @@ Additionally you can provide extra `tags` or specify a `license` for your model
 If your local computer doesn't have a powerful GPU you could utilize Google Colab to train your model by following the [ACT training notebook](./notebooks#training-act).
 #### Train using Hugging Face Jobs
 Hugging Face Jobs lets you select hardware and run training in the cloud. If you don't have a powerful GPU, need more VRAM, or just want to train a model faster, use HF Jobs! It's pay-as-you-go, billed per second of use; see the [Jobs documentation](https://huggingface.co/docs/hub/jobs) for pricing and additional information.
 To run the training, use this command:
 <hfoptions id="train_with_hf_jobs">
 <hfoption id="Command">
 ```bash
 hf jobs run \
  --flavor a10g-small \
  --timeout 4h \
  --secrets HF_TOKEN \
  huggingface/lerobot-gpu:latest \
  -- \
  python -m lerobot.scripts.lerobot_train \
    --dataset.repo_id=username/dataset \
    --policy.type=act \
    --steps=5000 \
    --batch_size=16 \
    --policy.device=cuda \
    --policy.repo_id=username/your_policy \
    --log_freq=100
 ```
 </hfoption>
 <hfoption id="API example">
 <!-- prettier-ignore-start -->
 ```python
 from huggingface_hub import run_job, get_token
 run_name = "act_so101_hf_jobs"
 dataset_id = "username/dataset"
 user_hub_id = "username"
 command_args = [
    "python", "-m", "lerobot.scripts.lerobot_train",
    "--dataset.repo_id", dataset_id,
    "--policy.type", "act",
    "--steps", "5000",
    "--batch_size", "16",
    "--num_workers", "4",
    "--policy.device", "cuda",
    "--log_freq", "100",
    "--save_freq", "1000",
    "--save_checkpoint", "true",
    "--wandb.enable", "false",
    "--policy.repo_id", f"{user_hub_id}/{run_name}"
 ]
 print(f"Submitting job '{run_name}' to Hugging Face Infrastructure...")
 job_info = run_job(
    image="huggingface/lerobot-gpu:latest",
    command=command_args,
    flavor="a10g-small",
    timeout="4h",
    secrets={"HF_TOKEN": get_token()}
 )
 print("\n🚀 Job successfully launched!")
 print(f"🔹 Job ID: {job_info.id}")
 print(f"🔗 Live UI Dashboard & Logs: {job_info.url}")
 ```
 <!-- prettier-ignore-end -->
 </hfoption>
 </hfoptions>
 You can modify the `--flavor` to use different hardware, for example: `t4-small`, `a100-large`, `h200`. Use `hf jobs hardware` to see the full list with pricing.
 Depending on the model you want to train and the hardware you selected, you can also adjust `--batch_size` and `--num_workers`.
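One way to keep these settings in one place is to build the training arguments from a dictionary and override only what changes per hardware flavor. The sketch below is a minimal illustration and not part of LeRobot: `build_train_command` and the override values are hypothetical, and it emits the `--key=value` form used in the Command tab above.

```python
# Hypothetical helper (not part of LeRobot or huggingface_hub): builds the
# argument list for the training module from a dict of settings.
BASE_ARGS = {
    "dataset.repo_id": "username/dataset",
    "policy.type": "act",
    "steps": 5000,
    "batch_size": 16,
    "num_workers": 4,
    "policy.device": "cuda",
    "policy.repo_id": "username/your_policy",
    "log_freq": 100,
}


def build_train_command(overrides=None):
    """Return the command as a list of strings, with overrides applied."""
    settings = {**BASE_ARGS, **(overrides or {})}
    command = ["python", "-m", "lerobot.scripts.lerobot_train"]
    command += [f"--{key}={value}" for key, value in settings.items()]
    return command


# Illustrative values only: a larger batch for a GPU with more VRAM.
command_args = build_train_command({"batch_size": 64, "num_workers": 8})
print(" ".join(command_args))
```

The resulting list can be passed as `command=` to `run_job` in place of the hand-written `command_args` from the API example.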
 For longer training sessions, increase the `--timeout`.
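Because billing is per second, the timeout also roughly bounds what a single job can cost. The sketch below estimates that upper bound. It assumes timeout values use the `s`, `m`, `h`, and `d` suffixes (a bare number meaning seconds); `timeout_to_seconds` and `max_job_cost` are hypothetical helpers, and the hourly rate is a placeholder, so take real prices from `hf jobs hardware`.

```python
from decimal import Decimal

# Seconds per unit for timeout strings such as "90s", "30m", "4h", "1d".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def timeout_to_seconds(timeout):
    """Convert a timeout like "4h" to seconds; a bare number means seconds."""
    text = str(timeout).strip()
    if text and text[-1] in _UNIT_SECONDS:
        return Decimal(text[:-1]) * _UNIT_SECONDS[text[-1]]
    return Decimal(text)


def max_job_cost(timeout, usd_per_hour):
    """Upper bound on cost if the job runs until the timeout is reached."""
    seconds = timeout_to_seconds(timeout)
    cost = seconds * Decimal(str(usd_per_hour)) / 3600
    return cost.quantize(Decimal("0.01"))


# Placeholder rate of 1.00 USD/hour, NOT a real price.
print(max_job_cost("4h", usd_per_hour="1.00"))
```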
 Once training has started, go to [Jobs](https://huggingface.co/settings/jobs) to check whether your job is running and to view its output. Scheduling a job can take a few minutes, so be patient.
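Job status can also be checked from Python instead of the web page. The sketch below polls until the job reaches a final stage. It assumes `huggingface_hub` exposes `inspect_job` and that the returned object carries `status.stage` with values such as `COMPLETED`, `CANCELED`, `ERROR`, and `DELETED`; verify this against your installed version. `wait_for_job` and its parameters are hypothetical.

```python
import time

# Stages treated as final (assumed names; check your huggingface_hub version).
TERMINAL_STAGES = ("COMPLETED", "CANCELED", "ERROR", "DELETED")


def wait_for_job(job_id, inspect=None, poll_seconds=30, max_polls=1000):
    """Poll a job until it reaches a terminal stage and return that stage.

    `inspect` is any callable accepting `job_id=` and returning an object
    with `.status.stage`. If omitted, huggingface_hub.inspect_job is used.
    Returns the last observed stage if `max_polls` is exhausted.
    """
    if inspect is None:
        # Imported lazily so the helper can be exercised without the library.
        from huggingface_hub import inspect_job as inspect

    stage = None
    for _ in range(max_polls):
        stage = inspect(job_id=job_id).status.stage
        if stage in TERMINAL_STAGES:
            break
        time.sleep(poll_seconds)
    return stage


# Usage, with `job_info` as returned by run_job in the API example:
# final_stage = wait_for_job(job_info.id)
```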
 After training, the model is pushed to the Hub and you can use it like any other model with LeRobot.
 #### Upload policy checkpoints
 Once training is done, upload the latest checkpoint with: