# Benchmark Training & Evaluation

This guide explains how to train and evaluate policies on the simulation benchmarks integrated in LeRobot: **LIBERO**, **LIBERO-plus**, **MetaWorld**, **RoboCasa**, and **RoboMME**.

The workflow is:

1. Pick one or more benchmarks.
2. For each benchmark, train a policy on its combined dataset (multi-GPU).
3. Upload the trained policy to the Hugging Face Hub.
4. Evaluate the policy on every task suite within that benchmark.

## Prerequisites

Install the benchmark-specific dependencies for the environments you want to evaluate on:

```bash
# LIBERO (original)
pip install -e ".[libero]"

# LIBERO-plus
pip install -e ".[libero_plus]"

# MetaWorld
pip install -e ".[metaworld]"

# RoboCasa
pip install -e ".[robocasa]"

# RoboMME
pip install -e ".[robomme]"
```

`libero_plus` includes the same EGL probe dependencies as `libero`, so headless renderer setup is consistent between both installs.

If your environment has CMake build-isolation issues, use the same fallback as standard LIBERO installs:

```bash
PATH=/usr/bin:/bin:$PATH pip install --no-build-isolation -e ".[libero_plus]"
```

For multi-GPU training you also need [Accelerate](https://huggingface.co/docs/accelerate):

```bash
pip install accelerate
```

## Docker-isolated evaluation (EnvHub)

LeRobot eval now supports running the full eval worker in a Docker container while keeping policy loading compatible with local checkpoints and local code changes.

Use `lerobot-eval` with `--eval.runtime=docker`:

```bash
lerobot-eval \
  --policy.path=outputs/train/my_policy/checkpoints/050000/pretrained_model \
  --env.type=libero_plus \
  --eval.runtime=docker \
  --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1 \
  --eval.n_episodes=10 \
  --eval.batch_size=10
```

`eval.docker.envhub_ref` is optional. If omitted, LeRobot resolves a default image from `env.type`.
You can also override the image directly:

```bash
--eval.docker.image=docker://ghcr.io/huggingface/lerobot-eval-libero-plus:latest
```

By default (`eval.docker.use_local_code=true`), the local repository is mounted in the container and added to `PYTHONPATH`, so edited policy/env code and local checkpoints continue to work without rebuilding the image for each change.

Common Docker runtime options:

```bash
--eval.docker.pull=true \
--eval.docker.gpus=all \
--eval.docker.shm_size=8g \
--eval.docker.use_local_code=true
```

The benchmark runner supports the same Docker eval path (extra args are forwarded to each generated `lerobot-eval` call):

```bash
lerobot-benchmark eval \
  --benchmarks libero_plus,robocasa \
  --hub-user $HF_USER \
  --n-episodes 50 \
  --eval.runtime=docker \
  --eval.docker.pull=true
```

Build benchmark images locally:

```bash
make build-eval-images
```

## Fast single-machine eval tuning

`lerobot-eval` now has three orthogonal throughput knobs:

- `eval.batch_size`: number of sub-envs per task (inside one vector env).
- `env.max_parallel_tasks`: number of tasks scheduled concurrently.
- `eval.instance_count`: number of full eval instances (process-level sharding).

Use them in this order:

1. Increase `eval.batch_size` first for per-task throughput.
2. Then increase `env.max_parallel_tasks` to overlap tasks, while monitoring RAM/VRAM.
3. Optionally increase `eval.instance_count` for process-level parallelism (best with enough CPU/RAM and small models).

The eval logs print the active scheduler mode (`sequential`, `threaded`, or `batched_lazy`) so you can verify the effective concurrency path.

### Suggested starting points

| Benchmark | Conservative | Faster (single GPU) | Notes |
|---|---|---|---|
| `libero` / `libero_plus` | `eval.batch_size=1`, `env.max_parallel_tasks=4` | `eval.batch_size=1`, `env.max_parallel_tasks=16` | For large suite sweeps, increase `max_parallel_tasks` before `batch_size` to avoid MuJoCo memory spikes. |
| `metaworld` | `eval.batch_size=8`, `env.max_parallel_tasks=1` | `eval.batch_size=16`, `env.max_parallel_tasks=2` | Prefer larger per-task vectorization first. |
| `robocasa` | `eval.batch_size=4`, `env.max_parallel_tasks=1` | `eval.batch_size=8`, `env.max_parallel_tasks=2` | Rendering/memory can dominate at high image resolution. |
| `robomme` | `eval.batch_size=4`, `env.max_parallel_tasks=1` | `eval.batch_size=8`, `env.max_parallel_tasks=2` | Start small and scale gradually with task count. |

### Local fast eval recipe

```bash
lerobot-eval \
  --policy.path=$HF_USER/smolvla_libero_plus \
  --env.type=libero_plus \
  --eval.n_episodes=1 \
  --eval.batch_size=1 \
  --env.max_parallel_tasks=16 \
  --eval.instance_count=2 \
  --rename_map='{"observation.images.image":"observation.images.camera1","observation.images.image2":"observation.images.camera2"}' \
  --output_dir=outputs/eval/smolvla_libero_plus \
  --push_to_hub=true
```

### Docker fast eval recipe

```bash
lerobot-eval \
  --policy.path=$HF_USER/smolvla_libero_plus \
  --env.type=libero_plus \
  --eval.runtime=docker \
  --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1 \
  --eval.docker.gpus=all \
  --eval.docker.shm_size=16g \
  --eval.n_episodes=1 \
  --eval.batch_size=1 \
  --env.max_parallel_tasks=16
```

## Quick start — single benchmark

Train SmolVLA on LIBERO-plus with 4 GPUs for 50,000 steps:

```bash
lerobot-benchmark train \
  --benchmarks libero_plus \
  --policy-path lerobot/smolvla_base \
  --hub-user $HF_USER \
  --num-gpus 4 \
  --steps 50000 \
  --batch-size 32 \
  --wandb
```

This trains on the combined LIBERO-plus dataset and pushes the checkpoint to `$HF_USER/smolvla_libero_plus` on the Hub.

Then evaluate on **all four** LIBERO suites (spatial, object, goal, 10):

```bash
lerobot-benchmark eval \
  --benchmarks libero_plus \
  --hub-user $HF_USER \
  --n-episodes 50
```

This automatically runs a separate `lerobot-eval` for each suite.
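The per-suite fan-out can be sketched as a small command generator: one benchmark eval expands into one `lerobot-eval` invocation per suite. The flag names come from the commands shown in this guide; the helper itself and the way the real runner builds its commands are assumptions:

```python
# Sketch of expanding one benchmark eval into per-suite lerobot-eval commands.
# Suite names and CLI flags are taken from this guide; the helper is
# illustrative, not the runner's actual implementation.
LIBERO_PLUS_SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]


def eval_commands(policy: str, env_type: str, suites: list[str], n_episodes: int) -> list[list[str]]:
    """Build one lerobot-eval argv per eval suite."""
    return [
        [
            "lerobot-eval",
            f"--policy.path={policy}",
            f"--env.type={env_type}",
            f"--env.task={suite}",
            f"--eval.n_episodes={n_episodes}",
        ]
        for suite in suites
    ]


for cmd in eval_commands("alice/smolvla_libero_plus", "libero_plus", LIBERO_PLUS_SUITES, 50):
    print(" ".join(cmd))
```

Each argv could then be handed to `subprocess.run`, which is also roughly what `--dry-run` lets you preview before executing.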
## Full sweep — multiple benchmarks

Run training **and** evaluation across all benchmarks:

```bash
lerobot-benchmark all \
  --benchmarks libero,libero_plus,metaworld,robocasa,robomme \
  --policy-path lerobot/smolvla_base \
  --hub-user $HF_USER \
  --num-gpus 4 \
  --steps 50000 \
  --batch-size 32 \
  --wandb \
  --push-eval-to-hub
```

For each benchmark the runner:

1. Trains a policy on its dataset.
2. Evaluates on every eval task in the benchmark (e.g. 4 suites for LIBERO).
3. Pushes HF-native `.eval_results` rows (and optional artifacts) to the Hub.

Use `--dry-run` to print the exact `lerobot-train` / `lerobot-eval` commands without executing them, so you can inspect or modify them before running.

## Using the CLI directly (without the benchmark runner)

You can also compose the commands yourself. The benchmark runner is a thin wrapper; here is what it does under the hood.

### Training

```bash
accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  $(which lerobot-train) \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=$HF_USER/libero_plus \
  --policy.repo_id=$HF_USER/smolvla_libero_plus \
  --env.type=libero_plus \
  --env.task=libero_spatial \
  --steps=50000 \
  --batch_size=32 \
  --eval_freq=10000 \
  --save_freq=10000 \
  --output_dir=outputs/train/smolvla_libero_plus \
  --job_name=smolvla_libero_plus \
  --policy.push_to_hub=true \
  --wandb.enable=true
```

### Evaluation (run once per suite)

```bash
for SUITE in libero_spatial libero_object libero_goal libero_10; do
  lerobot-eval \
    --policy.path=$HF_USER/smolvla_libero_plus \
    --env.type=libero_plus \
    --env.task=$SUITE \
    --eval.n_episodes=50 \
    --eval.batch_size=10 \
    --output_dir=outputs/eval/smolvla_libero_plus/$SUITE \
    --policy.device=cuda \
    --push_to_hub=true \
    --benchmark_dataset_id=lerobot/sim-benchmarks
done
```

## Available benchmarks

| Benchmark | Env type | Dataset | Eval tasks | Action dim |
|---|---|---|---|---|
| `libero` | `libero` | `{hub_user}/libero` | spatial, object, goal, 10 | 7 |
| `libero_plus` | `libero_plus` | `{hub_user}/libero_plus` | spatial, object, goal, 10 | 7 |
| `metaworld` | `metaworld` | `{hub_user}/metaworld` | push-v2 | 4 |
| `robocasa` | `robocasa` | `{hub_user}/robocasa` | PickPlaceCounterToCabinet | 12 |
| `robomme` | `robomme` | `{hub_user}/robomme` | PickXtimes | 8 |

Run `lerobot-benchmark list` to see the full registry with all eval tasks.

## Policy naming convention

The benchmark runner stores trained policies under:

```
{hub_user}/{policy_name}_{benchmark}
```

The default `--policy-name` is `smolvla`, so training on `libero_plus` as user `alice` produces `alice/smolvla_libero_plus`. You can override this, e.g. `--policy-name pi05` if training π₀.₅ instead.

## Multi-GPU considerations

The effective batch size is `batch_size × num_gpus`. With `--batch-size=32` and `--num-gpus=4`, you train with an effective batch of 128 per step. LeRobot does **not** auto-scale the learning rate; see the [Multi-GPU Training guide](./multi_gpu_training) for details on when and how to adjust it.

## Custom benchmarks

To add a new benchmark, edit the `BENCHMARK_REGISTRY` in `src/lerobot/scripts/lerobot_benchmark.py`:

```python
from lerobot.scripts.lerobot_benchmark import BenchmarkEntry, BENCHMARK_REGISTRY

BENCHMARK_REGISTRY["my_benchmark"] = BenchmarkEntry(
    dataset_repo_id="{hub_user}/my_dataset",
    env_type="my_env",
    env_task="MyDefaultTask",
    eval_tasks=["TaskA", "TaskB", "TaskC"],
)
```

Then use `--benchmarks my_benchmark` as usual. The runner will train once and evaluate separately on TaskA, TaskB, and TaskC.

## Outputs

After training and evaluation, your outputs directory looks like:

```
outputs/
├── train/
│   ├── smolvla_libero/
│   │   ├── checkpoints/
│   │   └── ...
│   ├── smolvla_libero_plus/
│   ├── smolvla_robocasa/
│   └── smolvla_robomme/
└── eval/
    ├── smolvla_libero/
    │   ├── libero_spatial/
    │   │   ├── eval_info.json
    │   │   └── videos/
    │   ├── libero_object/
    │   ├── libero_goal/
    │   └── libero_10/
    ├── smolvla_libero_plus/
    │   ├── libero_spatial/
    │   ├── libero_object/
    │   ├── libero_goal/
    │   └── libero_10/
    ├── smolvla_robocasa/
    └── smolvla_robomme/
```

Each `eval_info.json` contains per-episode rewards, success rates, and aggregate metrics.

## HF Eval Results + Leaderboard

LeRobot publishes benchmark scores using Hugging Face's native `.eval_results/*.yaml` format, which powers model-page eval cards and benchmark leaderboards. Add `--push-eval-to-hub` to push results after each eval run:

```bash
lerobot-benchmark eval \
  --benchmarks libero_plus,robocasa \
  --hub-user $HF_USER \
  --benchmark-dataset-id lerobot/sim-benchmarks \
  --push-eval-to-hub
```

This writes one or more files under `.eval_results/` in the model repo, for example:

```yaml
- dataset:
    id: lerobot/sim-benchmarks
  task_id: libero_plus/spatial
  value: 82.4
  notes: lerobot-eval
```

Notes:

- `--benchmark-dataset-id` points to your consolidated benchmark dataset repo.
- `task_id` values are derived from `env.type` and the evaluated suite/task names.
- Eval artifacts (`eval_info.json`, `eval_config.json`, videos) are still uploaded for provenance, but leaderboard ranking comes from `.eval_results`.

## Passing extra arguments

Any arguments after the recognized flags are forwarded to `lerobot-train` or `lerobot-eval`.

Example (training): use PEFT/LoRA during training.

```bash
lerobot-benchmark train \
  --benchmarks libero_plus \
  --policy-path lerobot/smolvla_base \
  --hub-user $HF_USER \
  --num-gpus 4 \
  --steps 50000 \
  --peft.method_type=LORA --peft.r=16
```

Example (evaluation): forward Docker runtime flags to each `lerobot-eval` call.

```bash
lerobot-benchmark eval \
  --benchmarks libero_plus \
  --hub-user $HF_USER \
  --eval.runtime=docker \
  --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1
```
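The forwarding pattern above can be sketched with `argparse.parse_known_args`, which splits argv into the flags a runner recognizes and everything else to pass through. The flag names mirror this guide; the parser itself is an illustrative assumption, not the runner's real argument handling:

```python
import argparse

# Sketch of extra-argument forwarding: parse the runner's own flags, then
# append everything unrecognized to the generated lerobot-train call.
# Only a couple of runner flags are modeled here for illustration.
parser = argparse.ArgumentParser()
parser.add_argument("--benchmarks", required=True)
parser.add_argument("--hub-user", required=True)

known, extra = parser.parse_known_args(
    [
        "--benchmarks", "libero_plus",
        "--hub-user", "alice",
        # Unrecognized flags end up in `extra` and ride along unchanged:
        "--peft.method_type=LORA", "--peft.r=16",
    ]
)

forwarded = ["lerobot-train", *extra]
print(known.benchmarks)   # libero_plus
print(forwarded)
```

`parse_known_args` is the standard-library way to get this pass-through behavior without the parser erroring on flags it does not know.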