# Evaluation

`lerobot-eval` runs a trained policy on a simulation benchmark and reports success rate, reward, and (optionally) episode videos. It handles environment creation, batched rollouts, and metric aggregation automatically.

## Quick start

Evaluate a Hub-hosted policy on LIBERO:

```bash
lerobot-eval \
    --policy.path=pepijn223/smolvla_libero \
    --env.type=libero \
    --env.task=libero_spatial \
    --eval.n_episodes=10 \
    --policy.device=cuda
```

Evaluate a local checkpoint:

```bash
lerobot-eval \
    --policy.path=outputs/train/act_pusht/checkpoints/005000/pretrained_model \
    --env.type=pusht \
    --eval.n_episodes=10
```

`batch_size` defaults to **auto** (based on CPU cores). The script picks the right number of parallel environments for your machine.

## Key flags

| Flag                    | Default        | Description                                                                           |
| ----------------------- | -------------- | ------------------------------------------------------------------------------------- |
| `--policy.path`         | required       | Hub repo ID or local path to a pretrained model                                       |
| `--env.type`            | required       | Benchmark name (`pusht`, `libero`, `metaworld`, etc.)                                 |
| `--env.task`            | varies         | Task or suite name (e.g. `libero_spatial`, `libero_10`)                               |
| `--eval.n_episodes`     | `50`           | Total episodes to run (across all tasks)                                              |
| `--eval.batch_size`     | `0` (auto)     | Number of parallel environments. `0` = auto-tune from CPU cores                       |
| `--eval.use_async_envs` | `true`         | Use `AsyncVectorEnv` (parallel stepping). Auto-downgrades to sync when `batch_size=1` |
| `--policy.device`       | `cuda`         | Inference device                                                                      |
| `--policy.use_amp`      | `false`        | Mixed-precision inference (saves VRAM, faster on Ampere+)                             |
| `--seed`                | `1000`         | Random seed for reproducibility                                                       |
| `--output_dir`          | auto-generated | Where to write results and videos                                                     |

### Environment-specific flags

Some benchmarks accept additional flags through `--env.*`:

```bash
# LIBERO: map simulator camera names to policy feature names
--env.camera_name_mapping='{"agentview_image": "camera1", "robot0_eye_in_hand_image": "camera2"}'

# Fill unused camera slots with zeros
--policy.empty_cameras=1
```

See each benchmark's documentation ([LIBERO](libero), [Meta-World](metaworld)) for benchmark-specific flags.

## How batch_size works

`batch_size` controls how many environments run in parallel within a single `VectorEnv`:

| `batch_size`  | Behavior                                                             |
| ------------- | -------------------------------------------------------------------- |
| `0` (default) | Auto-tune: `floor(cpu_cores × 0.7)`, capped by `n_episodes` and `64` |
| `1`           | Single environment, synchronous. Useful for debugging                |
| `N`           | N environments step in parallel via `AsyncVectorEnv`                 |

When `batch_size > 1` and `use_async_envs=true`, each environment runs in its own subprocess via Gymnasium's `AsyncVectorEnv`. This parallelizes the simulation stepping (the main bottleneck), while the policy runs a single batched forward pass on GPU.

**Example:** On a 16-core machine with `n_episodes=100`:

- Auto batch_size = `floor(16 × 0.7)` = `11`
- 11 environments step simultaneously → ~11× faster than sequential

## Performance

### AsyncVectorEnv (default)

`AsyncVectorEnv` spawns one subprocess per environment. Each subprocess has its own simulator instance. While the policy computes actions on GPU, all environments step in parallel on CPU:

```
GPU: [inference]....[inference]....[inference]....
CPU: [step × N]....................[step × N]......
     ↑ parallel                    ↑ parallel
```
For GPU-based simulators (LIBERO, Meta-World), the environments use **lazy initialization**: the GPU/EGL context is created inside the worker subprocess on first `reset()`, not in the parent process. This avoids `EGL_BAD_CONTEXT` crashes from inheriting stale GPU handles across `fork()`.

### Lazy task loading

For multi-task benchmarks (e.g. LIBERO with 10 tasks), environments are wrapped in `_LazyAsyncVectorEnv`, which defers worker creation until the task is actually evaluated. This keeps the peak process count at `batch_size` instead of `n_tasks × batch_size`. After each task completes, its workers are closed to free resources.

### Tuning for speed

| Situation                      | Recommendation                                        |
| ------------------------------ | ----------------------------------------------------- |
| Slow eval, low GPU utilization | Increase `batch_size` (or leave at auto)              |
| Out of memory (system RAM)     | Decrease `batch_size`                                 |
| Out of GPU memory              | Decrease `batch_size`, or use `--policy.use_amp=true` |
| Debugging / single-stepping    | `--eval.batch_size=1 --eval.use_async_envs=false`     |

## Output

Results are written to `output_dir` (default: `outputs/eval//