mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-21 03:30:10 +00:00
add section on training using HF jobs
@@ -514,6 +514,83 @@ Additionally you can provide extra `tags` or specify a `license` for your model
If your local computer doesn't have a powerful GPU you could utilize Google Colab to train your model by following the [ACT training notebook](./notebooks#training-act).

#### Train using Hugging Face Jobs

Hugging Face Jobs lets you select hardware and run training in the cloud. If you don't have a powerful GPU, need more VRAM, or just want to train a model faster, use HF Jobs. It is pay-as-you-go, billed per second of use; see the [Jobs documentation](https://huggingface.co/docs/hub/jobs) for pricing and additional information.

To run the training, use this command:

<hfoptions id="train_with_hf_jobs">
<hfoption id="Command">
```bash
hf jobs run \
    --flavor a10g-small \
    --timeout 4h \
    --secrets HF_TOKEN \
    huggingface/lerobot-gpu:latest \
    -- \
    python -m lerobot.scripts.lerobot_train \
    --dataset.repo_id=username/dataset \
    --policy.type=act \
    --steps=5000 \
    --batch_size=16 \
    --policy.device=cuda \
    --policy.repo_id=username/your_policy \
    --log_freq=100
```
</hfoption>
<hfoption id="API example">

<!-- prettier-ignore-start -->
```python
from huggingface_hub import run_job, get_token

run_name = "act_so101_hf_jobs"
dataset_id = "username/dataset"
user_hub_id = "username"

# Training command that runs inside the container
command_args = [
    "python", "-m", "lerobot.scripts.lerobot_train",
    "--dataset.repo_id", dataset_id,
    "--policy.type", "act",
    "--steps", "5000",
    "--batch_size", "16",
    "--num_workers", "4",
    "--policy.device", "cuda",
    "--log_freq", "100",
    "--save_freq", "1000",
    "--save_checkpoint", "true",
    "--wandb.enable", "false",
    "--policy.repo_id", f"{user_hub_id}/{run_name}"
]

print(f"Submitting job '{run_name}' to Hugging Face Infrastructure...")

# Submit the job; the local HF token is passed as a secret so the job can push to the Hub
job_info = run_job(
    image="huggingface/lerobot-gpu:latest",
    command=command_args,
    flavor="a10g-small",
    timeout="4h",
    secrets={"HF_TOKEN": get_token()}
)

print("\n🚀 Job successfully launched!")
print(f"🔹 Job ID: {job_info.id}")
print(f"🔗 Live UI Dashboard & Logs: {job_info.url}")
```
<!-- prettier-ignore-end -->

</hfoption>
</hfoptions>

You can change `--flavor` to use different hardware, for example `t4-small`, `a100-large`, or `h200`. Run `hf jobs hardware` to see the full list with pricing.
Depending on the model you want to train and the hardware you selected, you can also adjust `--batch_size` and `--num_workers`.
For longer training sessions, increase `--timeout`.

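When you try several hardware flavors, it can help to build the training command from a few parameters instead of editing it by hand. The following is a minimal sketch, not part of LeRobot: `build_train_command` is a hypothetical helper, the flag names are the ones used in the commands in this section, and the batch size and worker count in the example are illustrative values rather than recommendations.

```python
def build_train_command(
    dataset_id: str,
    policy_repo_id: str,
    *,
    policy_type: str = "act",
    steps: int = 5000,
    batch_size: int = 16,
    num_workers: int = 4,
) -> list[str]:
    """Return the training command as an argument list (hypothetical helper)."""
    options = {
        "dataset.repo_id": dataset_id,
        "policy.type": policy_type,
        "steps": steps,
        "batch_size": batch_size,
        "num_workers": num_workers,
        "policy.device": "cuda",
        "policy.repo_id": policy_repo_id,
    }
    command = ["python", "-m", "lerobot.scripts.lerobot_train"]
    # Emit each option in the --key=value form used by the CLI example.
    command += [f"--{name}={value}" for name, value in options.items()]
    return command


if __name__ == "__main__":
    # Illustrative values for a GPU with more VRAM; tune them for your model.
    cmd = build_train_command(
        "username/dataset",
        "username/your_policy",
        batch_size=32,
        num_workers=8,
    )
    print(" ".join(cmd))
```

The returned list has the same shape as `command_args` in the API example, so it could be passed as the `command` argument of `run_job`.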
Once the job is submitted, go to [Jobs](https://huggingface.co/settings/jobs) to check whether it is running and to view its output. Scheduling can take a few minutes, so be patient.

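If you would rather wait for a job from a script than refresh the Jobs page, a small polling loop is one option. This is a sketch under stated assumptions: `wait_for_job` is a hypothetical helper, the status lookup is passed in as a callable so the loop does not depend on any particular client, and the set of terminal stage names is an assumption that should be checked against the Jobs documentation.

```python
import time
from typing import Callable, Optional

# Assumed names of stages in which a job has stopped; verify against the Jobs docs.
TERMINAL_STAGES = {"COMPLETED", "ERROR", "CANCELED", "DELETED"}


def wait_for_job(
    get_stage: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_stage() until it reports a terminal stage and return that stage.

    If max_polls is reached first, the last observed stage is returned instead.
    """
    polls = 0
    while True:
        stage = get_stage()
        polls += 1
        if stage in TERMINAL_STAGES:
            return stage
        if max_polls is not None and polls >= max_polls:
            return stage
        sleep(poll_seconds)


if __name__ == "__main__":
    # Stand-in for a real status lookup: two "RUNNING" polls, then "COMPLETED".
    fake_stages = iter(["RUNNING", "RUNNING", "COMPLETED"])
    final_stage = wait_for_job(lambda: next(fake_stages), poll_seconds=0.1)
    print(f"Job finished with stage: {final_stage}")
```

With `huggingface_hub`, `get_stage` could plausibly wrap `inspect_job(job_id=job_info.id).status.stage`, assuming that field compares equal to the stage strings listed here; treat that wiring as an assumption and confirm it for your installed version.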
After training, the model is pushed to the Hub, to the repository given by `--policy.repo_id`, and you can use it like any other model with LeRobot.

#### Upload policy checkpoints
Once training is done, upload the latest checkpoint with: