# Benchmark Training & Evaluation

This guide explains how to train and evaluate policies on the simulation benchmarks integrated in LeRobot: **LIBERO**, **LIBERO-plus**, **MetaWorld**, **RoboCasa**, and **RoboMME**.

The workflow is:

1. Pick one or more benchmarks.
2. For each benchmark, train a policy on its combined dataset (multi-GPU).
3. Upload the trained policy to the Hugging Face Hub.
4. Evaluate the policy on every task suite within that benchmark.

## Prerequisites

Install the benchmark-specific dependencies for the environments you want to evaluate on:

```bash
# LIBERO (original)
pip install -e ".[libero]"

# LIBERO-plus
pip install -e ".[libero_plus]"

# MetaWorld
pip install -e ".[metaworld]"

# RoboCasa
pip install -e ".[robocasa]"

# RoboMME
pip install -e ".[robomme]"
```

`libero_plus` includes the same EGL probe dependencies as `libero`, so headless renderer setup is consistent between both installs.

If your environment has CMake build-isolation issues, use the same fallback as standard LIBERO installs:

```bash
PATH=/usr/bin:/bin:$PATH pip install --no-build-isolation -e ".[libero_plus]"
```

For multi-GPU training you also need [Accelerate](https://huggingface.co/docs/accelerate):

```bash
pip install accelerate
```

## Docker-isolated evaluation (EnvHub)

LeRobot eval now supports running the full eval worker in a Docker container while keeping policy loading compatible with local checkpoints and local code changes.

Use `lerobot-eval` with `--eval.runtime=docker`:

```bash
lerobot-eval \
  --policy.path=outputs/train/my_policy/checkpoints/050000/pretrained_model \
  --env.type=libero_plus \
  --eval.runtime=docker \
  --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1 \
  --eval.n_episodes=10 \
  --eval.batch_size=10
```

`eval.docker.envhub_ref` is optional. If omitted, LeRobot resolves a default image from `env.type`.
You can also override the image directly:

```bash
--eval.docker.image=docker://ghcr.io/huggingface/lerobot-eval-libero-plus:latest
```

By default (`eval.docker.use_local_code=true`), the local repository is mounted in the container and added to `PYTHONPATH`, so edited policy/env code and local checkpoints continue to work without rebuilding the image for each change.

Common Docker runtime options:

```bash
--eval.docker.pull=true \
--eval.docker.gpus=all \
--eval.docker.shm_size=8g \
--eval.docker.use_local_code=true
```

The benchmark runner supports the same Docker eval path (extra args are forwarded to each generated `lerobot-eval` call):

```bash
lerobot-benchmark eval \
  --benchmarks libero_plus,robocasa \
  --hub-user $HF_USER \
  --n-episodes 50 \
  --eval.runtime=docker \
  --eval.docker.pull=true
```

Build benchmark images locally:

```bash
make build-eval-images
```

## Fast single-machine eval tuning

`lerobot-eval` now has three orthogonal throughput knobs:

- `eval.batch_size`: number of sub-envs per task (inside one vector env).
- `env.max_parallel_tasks`: number of tasks scheduled concurrently.
- `eval.instance_count`: number of full eval instances (process-level sharding).

Use them in this order:

1. Increase `eval.batch_size` first for per-task throughput.
2. Then increase `env.max_parallel_tasks` to overlap tasks, while monitoring RAM/VRAM.
3. Optionally increase `eval.instance_count` for process-level parallelism (best with enough CPU/RAM and small models).

The eval logs print the active scheduler mode (`sequential`, `threaded`, or `batched_lazy`) so you can verify the effective concurrency path.

### Suggested starting points

| Benchmark | Conservative | Faster (single GPU) | Notes |
|---|---|---|---|
| `libero` / `libero_plus` | `eval.batch_size=1`, `env.max_parallel_tasks=4` | `eval.batch_size=1`, `env.max_parallel_tasks=16` | For large suite sweeps, increase `max_parallel_tasks` before `batch_size` to avoid MuJoCo memory spikes. |
| `metaworld` | `eval.batch_size=8`, `env.max_parallel_tasks=1` | `eval.batch_size=16`, `env.max_parallel_tasks=2` | Prefer larger per-task vectorization first. |
| `robocasa` | `eval.batch_size=4`, `env.max_parallel_tasks=1` | `eval.batch_size=8`, `env.max_parallel_tasks=2` | Rendering/memory can dominate at high image resolution. |
| `robomme` | `eval.batch_size=4`, `env.max_parallel_tasks=1` | `eval.batch_size=8`, `env.max_parallel_tasks=2` | Start small and scale gradually with task count. |

### Local fast eval recipe

```bash
lerobot-eval \
  --policy.path=$HF_USER/smolvla_libero_plus \
  --env.type=libero_plus \
  --eval.n_episodes=1 \
  --eval.batch_size=1 \
  --env.max_parallel_tasks=16 \
  --eval.instance_count=2 \
  --rename_map='{"observation.images.image":"observation.images.camera1","observation.images.image2":"observation.images.camera2"}' \
  --output_dir=outputs/eval/smolvla_libero_plus \
  --push_to_hub=true
```

### Docker fast eval recipe

```bash
lerobot-eval \
  --policy.path=$HF_USER/smolvla_libero_plus \
  --env.type=libero_plus \
  --eval.runtime=docker \
  --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1 \
  --eval.docker.gpus=all \
  --eval.docker.shm_size=16g \
  --eval.n_episodes=1 \
  --eval.batch_size=1 \
  --env.max_parallel_tasks=16
```

## Quick start — single benchmark

Train SmolVLA on LIBERO-plus with 4 GPUs for 50,000 steps:

```bash
lerobot-benchmark train \
  --benchmarks libero_plus \
  --policy-path lerobot/smolvla_base \
  --hub-user $HF_USER \
  --num-gpus 4 \
  --steps 50000 \
  --batch-size 32 \
  --wandb
```

This trains on the combined LIBERO-plus dataset and pushes the checkpoint to `$HF_USER/smolvla_libero_plus` on the Hub.

Then evaluate on **all four** LIBERO suites (spatial, object, goal, 10):

```bash
lerobot-benchmark eval \
  --benchmarks libero_plus \
  --hub-user $HF_USER \
  --n-episodes 50
```

This automatically runs a separate `lerobot-eval` for each suite.
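The per-suite fan-out can be sketched as a small command generator: one benchmark eval expands into one `lerobot-eval` invocation per suite. The flag names come from the commands shown in this guide; the helper itself and the way the real runner builds its commands are assumptions:

```python
# Sketch of expanding one benchmark eval into per-suite lerobot-eval commands.
# Suite names and CLI flags are taken from this guide; the helper is
# illustrative, not the runner's actual implementation.
LIBERO_PLUS_SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]


def eval_commands(policy: str, env_type: str, suites: list[str], n_episodes: int) -> list[list[str]]:
    """Build one lerobot-eval argv per eval suite."""
    return [
        [
            "lerobot-eval",
            f"--policy.path={policy}",
            f"--env.type={env_type}",
            f"--env.task={suite}",
            f"--eval.n_episodes={n_episodes}",
        ]
        for suite in suites
    ]


for cmd in eval_commands("alice/smolvla_libero_plus", "libero_plus", LIBERO_PLUS_SUITES, 50):
    print(" ".join(cmd))
```

Each argv could then be handed to `subprocess.run`, which is also roughly what `--dry-run` lets you preview before executing.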
## Full sweep — multiple benchmarks

Run training **and** evaluation across all benchmarks:

```bash
lerobot-benchmark all \
  --benchmarks libero,libero_plus,metaworld,robocasa,robomme \
  --policy-path lerobot/smolvla_base \
  --hub-user $HF_USER \
  --num-gpus 4 \
  --steps 50000 \
  --batch-size 32 \
  --wandb \
  --push-eval-to-hub
```

For each benchmark the runner:

1. Trains a policy on its dataset.
2. Evaluates on every eval task in the benchmark (e.g. 4 suites for LIBERO).
3. Pushes HF-native `.eval_results` rows (and optional artifacts) to the Hub.

Use `--dry-run` to print the exact `lerobot-train` / `lerobot-eval` commands without executing them, so you can inspect or modify them before running.

## Using the CLI directly (without the benchmark runner)

You can also compose the commands yourself. The benchmark runner is a thin wrapper; here is what it does under the hood.

### Training

```bash
accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  $(which lerobot-train) \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=$HF_USER/libero_plus \
  --policy.repo_id=$HF_USER/smolvla_libero_plus \
  --env.type=libero_plus \
  --env.task=libero_spatial \
  --steps=50000 \
  --batch_size=32 \
  --eval_freq=10000 \
  --save_freq=10000 \
  --output_dir=outputs/train/smolvla_libero_plus \
  --job_name=smolvla_libero_plus \
  --policy.push_to_hub=true \
  --wandb.enable=true
```

### Evaluation (run once per suite)

```bash
for SUITE in libero_spatial libero_object libero_goal libero_10; do
  lerobot-eval \
    --policy.path=$HF_USER/smolvla_libero_plus \
    --env.type=libero_plus \
    --env.task=$SUITE \
    --eval.n_episodes=50 \
    --eval.batch_size=10 \
    --output_dir=outputs/eval/smolvla_libero_plus/$SUITE \
    --policy.device=cuda \
    --push_to_hub=true \
    --benchmark_dataset_id=lerobot/sim-benchmarks
done
```

## Available benchmarks

| Benchmark | Env type | Dataset | Eval tasks | Action dim |
|---|---|---|---|---|
| `libero` | `libero` | `{hub_user}/libero` | spatial, object, goal, 10 | 7 |
| `libero_plus` | `libero_plus` | `{hub_user}/libero_plus` | spatial, object, goal, 10 | 7 |
| `metaworld` | `metaworld` | `{hub_user}/metaworld` | push-v2 | 4 |
| `robocasa` | `robocasa` | `{hub_user}/robocasa` | PickPlaceCounterToCabinet | 12 |
| `robomme` | `robomme` | `{hub_user}/robomme` | PickXtimes | 8 |

Run `lerobot-benchmark list` to see the full registry with all eval tasks.

## Policy naming convention

The benchmark runner stores trained policies under:

```
{hub_user}/{policy_name}_{benchmark}
```

The default `--policy-name` is `smolvla`, so training on `libero_plus` as user `alice` produces `alice/smolvla_libero_plus`. You can override this, e.g. `--policy-name pi05` if training π₀.₅ instead.

## Multi-GPU considerations

The effective batch size is `batch_size × num_gpus`. With `--batch-size=32` and `--num-gpus=4`, you train with an effective batch of 128 per step. LeRobot does **not** auto-scale the learning rate; see the [Multi-GPU Training guide](./multi_gpu_training) for details on when and how to adjust it.

## Custom benchmarks

To add a new benchmark, edit the `BENCHMARK_REGISTRY` in `src/lerobot/scripts/lerobot_benchmark.py`:

```python
from lerobot.scripts.lerobot_benchmark import BenchmarkEntry, BENCHMARK_REGISTRY

BENCHMARK_REGISTRY["my_benchmark"] = BenchmarkEntry(
    dataset_repo_id="{hub_user}/my_dataset",
    env_type="my_env",
    env_task="MyDefaultTask",
    eval_tasks=["TaskA", "TaskB", "TaskC"],
)
```

Then use `--benchmarks my_benchmark` as usual. The runner will train once and evaluate separately on TaskA, TaskB, and TaskC.

## Outputs

After training and evaluation, your outputs directory looks like:

```
outputs/
├── train/
│   ├── smolvla_libero/
│   │   ├── checkpoints/
│   │   └── ...
│   ├── smolvla_libero_plus/
│   ├── smolvla_robocasa/
│   └── smolvla_robomme/
└── eval/
    ├── smolvla_libero/
    │   ├── libero_spatial/
    │   │   ├── eval_info.json
    │   │   └── videos/
    │   ├── libero_object/
    │   ├── libero_goal/
    │   └── libero_10/
    ├── smolvla_libero_plus/
    │   ├── libero_spatial/
    │   ├── libero_object/
    │   ├── libero_goal/
    │   └── libero_10/
    ├── smolvla_robocasa/
    └── smolvla_robomme/
```

Each `eval_info.json` contains per-episode rewards, success rates, and aggregate metrics.

## HF Eval Results + Leaderboard

LeRobot publishes benchmark scores using Hugging Face's native `.eval_results/*.yaml` format, which powers model-page eval cards and benchmark leaderboards. Add `--push-eval-to-hub` to push results after each eval run:

```bash
lerobot-benchmark eval \
  --benchmarks libero_plus,robocasa \
  --hub-user $HF_USER \
  --benchmark-dataset-id lerobot/sim-benchmarks \
  --push-eval-to-hub
```

This writes one or more files under `.eval_results/` in the model repo, for example:

```yaml
- dataset:
    id: lerobot/sim-benchmarks
  task_id: libero_plus/spatial
  value: 82.4
  notes: lerobot-eval
```

Notes:

- `--benchmark-dataset-id` points to your consolidated benchmark dataset repo.
- `task_id` values are derived from `env.type` and the evaluated suite/task names.
- Eval artifacts (`eval_info.json`, `eval_config.json`, videos) are still uploaded for provenance, but leaderboard ranking comes from `.eval_results`.

## Passing extra arguments

Any arguments after the recognized flags are forwarded to `lerobot-train` or `lerobot-eval`.

Example (training): use PEFT/LoRA during training.

```bash
lerobot-benchmark train \
  --benchmarks libero_plus \
  --policy-path lerobot/smolvla_base \
  --hub-user $HF_USER \
  --num-gpus 4 \
  --steps 50000 \
  --peft.method_type=LORA --peft.r=16
```

Example (evaluation): forward Docker runtime flags to each `lerobot-eval` call.

```bash
lerobot-benchmark eval \
  --benchmarks libero_plus \
  --hub-user $HF_USER \
  --eval.runtime=docker \
  --eval.docker.envhub_ref=envhub://lerobot/libero_plus@v1
```
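The forwarding pattern above can be sketched with `argparse.parse_known_args`, which splits argv into the flags a runner recognizes and everything else to pass through. The flag names mirror this guide; the parser itself is an illustrative assumption, not the runner's real argument handling:

```python
import argparse

# Sketch of extra-argument forwarding: parse the runner's own flags, then
# append everything unrecognized to the generated lerobot-train call.
# Only a couple of runner flags are modeled here for illustration.
parser = argparse.ArgumentParser()
parser.add_argument("--benchmarks", required=True)
parser.add_argument("--hub-user", required=True)

known, extra = parser.parse_known_args(
    [
        "--benchmarks", "libero_plus",
        "--hub-user", "alice",
        # Unrecognized flags end up in `extra` and ride along unchanged:
        "--peft.method_type=LORA", "--peft.r=16",
    ]
)

forwarded = ["lerobot-train", *extra]
print(known.benchmarks)   # libero_plus
print(forwarded)
```

`parse_known_args` is the standard-library way to get this pass-through behavior without the parser erroring on flags it does not know.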