# Benchmark Training & Evaluation

This guide explains how to train and evaluate policies on the simulation benchmarks integrated in LeRobot: **LIBERO**, **LIBERO-plus**, **MetaWorld**, **RoboCasa**, and **RoboMME**.

The workflow is:

1. Pick one or more benchmarks.
2. For each benchmark, train a policy on its combined dataset (multi-GPU).
3. Upload the trained policy to the Hugging Face Hub.
4. Evaluate the policy on every task suite within that benchmark.

## Prerequisites

Install the benchmark-specific dependencies for the environments you want to evaluate on:

```bash
# LIBERO (original)
pip install -e ".[libero]"

# LIBERO-plus
pip install -e ".[libero_plus]"

# MetaWorld
pip install -e ".[metaworld]"

# RoboCasa
pip install -e ".[robocasa]"

# RoboMME
pip install -e ".[robomme]"
```

`libero_plus` includes the same EGL probe dependencies as `libero`, so headless renderer setup is consistent between both installs.

If your environment has CMake build-isolation issues, use the same fallback as standard LIBERO installs:

```bash
PATH=/usr/bin:/bin:$PATH pip install --no-build-isolation -e ".[libero_plus]"
```

For multi-GPU training you also need [Accelerate](https://huggingface.co/docs/accelerate):

```bash
pip install accelerate
```

## Docker-isolated evaluation (EnvHub)

LeRobot eval now supports running the full eval worker in a Docker container while keeping policy loading compatible with local checkpoints and local code changes.

Use `lerobot-eval` with `--eval.runtime=docker`:

```bash
lerobot-eval \
    --policy.path=outputs/train/my_policy/checkpoints/050000/pretrained_model \
    --env.type=libero_plus \
    --eval.runtime=docker \
    --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1 \
    --eval.n_episodes=10 \
    --eval.batch_size=10
```

`eval.docker.envhub_ref` is optional. If omitted, LeRobot resolves a default image from `env.type`. You can also override the image directly:

```bash
--eval.docker.image=docker://ghcr.io/huggingface/lerobot-eval-libero-plus:latest
```

By default (`eval.docker.use_local_code=true`), the local repository is mounted in the container and added to `PYTHONPATH`, so edited policy/env code and local checkpoints continue to work without rebuilding the image for each change.

Common Docker runtime options:

```bash
--eval.docker.pull=true \
--eval.docker.gpus=all \
--eval.docker.shm_size=8g \
--eval.docker.use_local_code=true
```

The benchmark runner supports the same Docker eval path (extra args are forwarded to each generated `lerobot-eval` call):

```bash
lerobot-benchmark eval \
    --benchmarks libero_plus,robocasa \
    --hub-user $HF_USER \
    --n-episodes 50 \
    --eval.runtime=docker \
    --eval.docker.pull=true
```

Build benchmark images locally:

```bash
make build-eval-images
```

## Fast single-machine eval tuning

`lerobot-eval` now has three orthogonal throughput knobs:

- `eval.batch_size`: number of sub-envs per task (inside one vector env).
- `env.max_parallel_tasks`: number of tasks scheduled concurrently.
- `eval.instance_count`: number of full eval instances (process-level sharding).

Use them in this order:

1. Increase `eval.batch_size` first for per-task throughput.
2. Then increase `env.max_parallel_tasks` to overlap tasks, while monitoring RAM/VRAM.
3. Optionally increase `eval.instance_count` for process-level parallelism (best with enough CPU/RAM and small models).

The eval logs print the active scheduler mode (`sequential`, `threaded`, or `batched_lazy`) so you can verify the effective concurrency path.
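
As a rule of thumb, the upper bound on simultaneously running sub-envs is the product of the three knobs. This is an illustrative calculation, not a LeRobot API:

```python
# Illustrative arithmetic for the three throughput knobs (not a LeRobot API):
# the worst-case number of sub-envs in flight is their product, which is what
# you should keep in mind when watching RAM/VRAM.
def total_concurrent_envs(batch_size: int, max_parallel_tasks: int, instance_count: int) -> int:
    """Upper bound on sub-envs running at once across all eval instances."""
    return batch_size * max_parallel_tasks * instance_count

# batch_size=1, max_parallel_tasks=16, instance_count=2 -> up to 32 sub-envs.
print(total_concurrent_envs(1, 16, 2))  # → 32
```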

### Suggested starting points

| Benchmark | Conservative | Faster (single GPU) | Notes |
|---|---|---|---|
| `libero` / `libero_plus` | `eval.batch_size=1`, `env.max_parallel_tasks=4` | `eval.batch_size=1`, `env.max_parallel_tasks=16` | For large suite sweeps, increase `max_parallel_tasks` before `batch_size` to avoid MuJoCo memory spikes. |
| `metaworld` | `eval.batch_size=8`, `env.max_parallel_tasks=1` | `eval.batch_size=16`, `env.max_parallel_tasks=2` | Prefer larger per-task vectorization first. |
| `robocasa` | `eval.batch_size=4`, `env.max_parallel_tasks=1` | `eval.batch_size=8`, `env.max_parallel_tasks=2` | Rendering/memory can dominate at high image resolution. |
| `robomme` | `eval.batch_size=4`, `env.max_parallel_tasks=1` | `eval.batch_size=8`, `env.max_parallel_tasks=2` | Start small and scale gradually with task count. |

### Local fast eval recipe

```bash
lerobot-eval \
    --policy.path=$HF_USER/smolvla_libero_plus \
    --env.type=libero_plus \
    --eval.n_episodes=1 \
    --eval.batch_size=1 \
    --env.max_parallel_tasks=16 \
    --eval.instance_count=2 \
    --rename_map='{"observation.images.image":"observation.images.camera1","observation.images.image2":"observation.images.camera2"}' \
    --output_dir=outputs/eval/smolvla_libero_plus \
    --push_to_hub=true
```

### Docker fast eval recipe

```bash
lerobot-eval \
    --policy.path=$HF_USER/smolvla_libero_plus \
    --env.type=libero_plus \
    --eval.runtime=docker \
    --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1 \
    --eval.docker.gpus=all \
    --eval.docker.shm_size=16g \
    --eval.n_episodes=1 \
    --eval.batch_size=1 \
    --env.max_parallel_tasks=16
```

## Quick start — single benchmark

Train SmolVLA on LIBERO-plus with 4 GPUs for 50,000 steps:

```bash
lerobot-benchmark train \
    --benchmarks libero_plus \
    --policy-path lerobot/smolvla_base \
    --hub-user $HF_USER \
    --num-gpus 4 \
    --steps 50000 \
    --batch-size 32 \
    --wandb
```

This trains on the combined LIBERO-plus dataset and pushes the checkpoint to `$HF_USER/smolvla_libero_plus` on the Hub.

Then evaluate on **all four** LIBERO suites (spatial, object, goal, 10):

```bash
lerobot-benchmark eval \
    --benchmarks libero_plus \
    --hub-user $HF_USER \
    --n-episodes 50
```

This automatically runs a separate `lerobot-eval` call for each suite.

## Full sweep — multiple benchmarks

Run training **and** evaluation across all benchmarks:

```bash
lerobot-benchmark all \
    --benchmarks libero,libero_plus,metaworld,robocasa,robomme \
    --policy-path lerobot/smolvla_base \
    --hub-user $HF_USER \
    --num-gpus 4 \
    --steps 50000 \
    --batch-size 32 \
    --wandb \
    --push-eval-to-hub
```

For each benchmark the runner:

1. Trains a policy on its dataset.
2. Evaluates on every eval task in the benchmark (e.g. 4 suites for LIBERO).
3. Pushes HF-native `.eval_results` rows (and optional artifacts) to the Hub.

<Tip>

Use `--dry-run` to print the exact `lerobot-train` / `lerobot-eval` commands without executing them, so you can inspect or modify them before running.

</Tip>

## Using the CLI directly (without the benchmark runner)

You can also compose the commands yourself. The benchmark runner is a thin wrapper; here is what it does under the hood.

### Training

```bash
accelerate launch \
    --multi_gpu \
    --num_processes=4 \
    $(which lerobot-train) \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=$HF_USER/libero_plus \
    --policy.repo_id=$HF_USER/smolvla_libero_plus \
    --env.type=libero_plus \
    --env.task=libero_spatial \
    --steps=50000 \
    --batch_size=32 \
    --eval_freq=10000 \
    --save_freq=10000 \
    --output_dir=outputs/train/smolvla_libero_plus \
    --job_name=smolvla_libero_plus \
    --policy.push_to_hub=true \
    --wandb.enable=true
```

### Evaluation (run once per suite)

```bash
for SUITE in libero_spatial libero_object libero_goal libero_10; do
    lerobot-eval \
        --policy.path=$HF_USER/smolvla_libero_plus \
        --env.type=libero_plus \
        --env.task=$SUITE \
        --eval.n_episodes=50 \
        --eval.batch_size=10 \
        --output_dir=outputs/eval/smolvla_libero_plus/$SUITE \
        --policy.device=cuda \
        --push_to_hub=true \
        --benchmark_dataset_id=lerobot/sim-benchmarks
done
```

## Available benchmarks

| Benchmark | Env type | Dataset | Eval tasks | Action dim |
|---|---|---|---|---|
| `libero` | `libero` | `{hub_user}/libero` | spatial, object, goal, 10 | 7 |
| `libero_plus` | `libero_plus` | `{hub_user}/libero_plus` | spatial, object, goal, 10 | 7 |
| `metaworld` | `metaworld` | `{hub_user}/metaworld` | push-v2 | 4 |
| `robocasa` | `robocasa` | `{hub_user}/robocasa` | PickPlaceCounterToCabinet | 12 |
| `robomme` | `robomme` | `{hub_user}/robomme` | PickXtimes | 8 |

Run `lerobot-benchmark list` to see the full registry with all eval tasks.

## Policy naming convention

The benchmark runner stores trained policies under:

```
{hub_user}/{policy_name}_{benchmark}
```

The default `--policy-name` is `smolvla`, so training on `libero_plus` as user `alice` produces `alice/smolvla_libero_plus`.

You can override this, e.g. `--policy-name pi05` if training π₀.₅ instead.
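
The naming rule above can be sketched as a tiny helper (illustrative only, not part of the LeRobot API):

```python
# Illustrative helper mirroring the runner's naming rule (not a LeRobot API).
def policy_repo_id(hub_user: str, benchmark: str, policy_name: str = "smolvla") -> str:
    """Compose the Hub repo id the benchmark runner pushes to."""
    return f"{hub_user}/{policy_name}_{benchmark}"

print(policy_repo_id("alice", "libero_plus"))          # → alice/smolvla_libero_plus
print(policy_repo_id("alice", "libero_plus", "pi05"))  # → alice/pi05_libero_plus
```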

## Multi-GPU considerations

The effective batch size is `batch_size × num_gpus`. With `--batch-size=32` and `--num-gpus=4`, you train with an effective batch of 128 per step. LeRobot does **not** auto-scale the learning rate; see the [Multi-GPU Training guide](./multi_gpu_training) for details on when and how to adjust it.
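
The arithmetic, together with the common (but optional) linear learning-rate scaling heuristic, looks like this. The `base_lr` value is a made-up example, not a LeRobot default:

```python
# Effective batch size arithmetic from the paragraph above, plus the common
# (but optional) linear LR-scaling heuristic. base_lr=1e-4 is a made-up
# example value, not a LeRobot default.
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    return per_gpu_batch * num_gpus

def linear_lr_scale(base_lr: float, per_gpu_batch: int, num_gpus: int, base_batch: int) -> float:
    """Linear scaling rule: multiply the LR by effective_batch / base_batch."""
    return base_lr * effective_batch_size(per_gpu_batch, num_gpus) / base_batch

print(effective_batch_size(32, 4))        # → 128
print(linear_lr_scale(1e-4, 32, 4, 32))   # → 0.0004
```

Whether to scale the LR at all depends on the policy and optimizer; treat this as a starting point, not a rule.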

## Custom benchmarks

To add a new benchmark, edit the `BENCHMARK_REGISTRY` in `src/lerobot/scripts/lerobot_benchmark.py`:

```python
from lerobot.scripts.lerobot_benchmark import BenchmarkEntry, BENCHMARK_REGISTRY

BENCHMARK_REGISTRY["my_benchmark"] = BenchmarkEntry(
    dataset_repo_id="{hub_user}/my_dataset",
    env_type="my_env",
    env_task="MyDefaultTask",
    eval_tasks=["TaskA", "TaskB", "TaskC"],
)
```

Then use `--benchmarks my_benchmark` as usual. The runner will train once and evaluate separately on TaskA, TaskB, and TaskC.

## Outputs

After training and evaluation, your outputs directory looks like:

```
outputs/
├── train/
│   ├── smolvla_libero/
│   │   ├── checkpoints/
│   │   └── ...
│   ├── smolvla_libero_plus/
│   ├── smolvla_robocasa/
│   └── smolvla_robomme/
└── eval/
    ├── smolvla_libero/
    │   ├── libero_spatial/
    │   │   ├── eval_info.json
    │   │   └── videos/
    │   ├── libero_object/
    │   ├── libero_goal/
    │   └── libero_10/
    ├── smolvla_libero_plus/
    │   ├── libero_spatial/
    │   ├── libero_object/
    │   ├── libero_goal/
    │   └── libero_10/
    ├── smolvla_robocasa/
    └── smolvla_robomme/
```

Each `eval_info.json` contains per-episode rewards, success rates, and aggregate metrics.
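
A quick way to pull the aggregate numbers out of such a file is sketched below. The `"aggregated"` key is an assumption about the JSON schema, which varies across LeRobot versions; open one of your own files and adapt the key name if it differs:

```python
import json

def aggregate_metrics(info: dict) -> dict:
    """Return the scalar aggregate metrics from a parsed eval_info.json.

    The "aggregated" key is an assumption about the schema, not a guarantee;
    we fall back to the top level if it is absent.
    """
    aggregated = info.get("aggregated", info)
    return {k: v for k, v in aggregated.items() if isinstance(v, (int, float))}

# Example with a synthetic payload shaped like a typical eval summary:
sample = json.loads('{"aggregated": {"pc_success": 82.4, "avg_sum_reward": 41.2}, "per_episode": []}')
print(aggregate_metrics(sample))  # → {'pc_success': 82.4, 'avg_sum_reward': 41.2}
```

In practice you would replace `sample` with `json.loads(Path(".../eval_info.json").read_text())`.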

## HF Eval Results + Leaderboard

LeRobot publishes benchmark scores using Hugging Face's native `.eval_results/*.yaml` format, which powers model-page eval cards and benchmark leaderboards.

Add `--push-eval-to-hub` to push results after each eval run:

```bash
lerobot-benchmark eval \
    --benchmarks libero_plus,robocasa \
    --hub-user $HF_USER \
    --benchmark-dataset-id lerobot/sim-benchmarks \
    --push-eval-to-hub
```

This writes one or more files under `.eval_results/` in the model repo, for example:

```yaml
- dataset:
    id: lerobot/sim-benchmarks
    task_id: libero_plus/spatial
  value: 82.4
  notes: lerobot-eval
```

Notes:

- `--benchmark-dataset-id` points to your consolidated benchmark dataset repo.
- `task_id` values are derived from `env.type` and evaluated suite/task names.
- Eval artifacts (`eval_info.json`, `eval_config.json`, videos) are still uploaded for provenance, but leaderboard ranking comes from `.eval_results`.
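
A hypothetical sketch of how a `task_id` like `libero_plus/spatial` could be composed from `env.type` and a suite name; the real derivation lives in LeRobot's eval code, and this only illustrates the shape of the identifiers:

```python
# Hypothetical composition of task_id from env.type and a suite name.
# This mirrors the "libero_plus/spatial" shape shown in the YAML example,
# not LeRobot's actual implementation.
def task_id(env_type: str, suite: str) -> str:
    short = suite.removeprefix("libero_")  # "libero_spatial" -> "spatial"
    return f"{env_type}/{short}"

print(task_id("libero_plus", "libero_spatial"))  # → libero_plus/spatial
```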

## Passing extra arguments

Any arguments after the recognized flags are forwarded to `lerobot-train` or `lerobot-eval`.

Example (training): enable PEFT/LoRA fine-tuning.

```bash
lerobot-benchmark train \
    --benchmarks libero_plus \
    --policy-path lerobot/smolvla_base \
    --hub-user $HF_USER \
    --num-gpus 4 \
    --steps 50000 \
    --peft.method_type=LORA --peft.r=16
```

Example (evaluation): forward Docker runtime flags to each `lerobot-eval` call.

```bash
lerobot-benchmark eval \
    --benchmarks libero_plus \
    --hub-user $HF_USER \
    --eval.runtime=docker \
    --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1
```