diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 3dcba5993..f69f6d900 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -73,6 +73,8 @@
       title: Control & Train Robots in Sim (LeIsaac)
   title: "Simulation"
 - sections:
+  - local: evaluation
+    title: Evaluation (lerobot-eval)
   - local: adding_benchmarks
     title: Adding a New Benchmark
   - local: libero
diff --git a/docs/source/adding_benchmarks.mdx b/docs/source/adding_benchmarks.mdx
index 77ccd3d4a..d8ca2f4a6 100644
--- a/docs/source/adding_benchmarks.mdx
+++ b/docs/source/adding_benchmarks.mdx
@@ -301,7 +301,7 @@ After completing the steps above, confirm that everything works:
 1. **Install** — `pip install -e ".[mybenchmark]"` and verify the dependency group installs cleanly.
 2. **Smoke test env creation** — call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys.
-3. **Run a full eval** — `lerobot-eval --env.type= --env.task= --eval.n_episodes=1 --eval.batch_size=1 --policy.path=` to exercise the full pipeline end-to-end.
+3. **Run a full eval** — `lerobot-eval --env.type= --env.task= --eval.n_episodes=1 --policy.path=` to exercise the full pipeline end-to-end. (`batch_size` defaults to auto-tuning based on CPU cores; pass `--eval.batch_size=1` to force a single environment.)
 4. **Check success detection** — verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates.
 
 ## Writing a benchmark doc page
@@ -313,7 +313,7 @@ Each benchmark `.mdx` page should include:
 - **Overview image or GIF.**
 - **Available tasks** — table of task suites with counts and brief descriptions.
 - **Installation** — `pip install -e ".[]"` plus any extra steps (env vars, system packages).
-- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` and `batch_size` for reproducible results. Include single-task and multi-task examples if applicable.
+- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` for reproducible results. `batch_size` defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable. See the [Evaluation guide](evaluation) for details.
 - **Policy inputs and outputs** — observation keys with shapes, action space description.
 - **Recommended evaluation episodes** — how many episodes per task is standard.
 - **Training** — example `lerobot-train` command.
diff --git a/docs/source/evaluation.mdx b/docs/source/evaluation.mdx
new file mode 100644
index 000000000..6ad2e7ae6
--- /dev/null
+++ b/docs/source/evaluation.mdx
@@ -0,0 +1,170 @@
+# Evaluation
+
+`lerobot-eval` runs a trained policy on a simulation benchmark and reports success rate, reward, and (optionally) episode videos. It handles environment creation, batched rollouts, and metric aggregation automatically.
+
+## Quick start
+
+Evaluate a Hub-hosted policy on LIBERO:
+
+```bash
+lerobot-eval \
+    --policy.path=pepijn223/smolvla_libero \
+    --env.type=libero \
+    --env.task=libero_spatial \
+    --eval.n_episodes=10 \
+    --policy.device=cuda
+```
+
+Evaluate a local checkpoint:
+
+```bash
+lerobot-eval \
+    --policy.path=outputs/train/act_pusht/checkpoints/005000/pretrained_model \
+    --env.type=pusht \
+    --eval.n_episodes=10
+```
+
+`batch_size` defaults to **auto** (based on CPU cores). The script picks the right number of parallel environments for your machine.
+
+## Key flags
+
+| Flag | Default | Description |
+|---|---|---|
+| `--policy.path` | required | Hub repo ID or local path to a pretrained model |
+| `--env.type` | required | Benchmark name (`pusht`, `libero`, `metaworld`, etc.) |
+| `--env.task` | varies | Task or suite name (e.g. `libero_spatial`, `libero_10`) |
+| `--eval.n_episodes` | `50` | Total episodes to run (across all tasks) |
+| `--eval.batch_size` | `0` (auto) | Number of parallel environments. `0` = auto-tune from CPU cores |
+| `--eval.use_async_envs` | `true` | Use `AsyncVectorEnv` (parallel stepping). Auto-downgrades to sync when `batch_size=1` |
+| `--policy.device` | `cuda` | Inference device |
+| `--policy.use_amp` | `false` | Mixed-precision inference (saves VRAM, faster on Ampere+) |
+| `--seed` | `1000` | Random seed for reproducibility |
+| `--output_dir` | auto-generated | Where to write results and videos |
+
+### Environment-specific flags
+
+Some benchmarks accept additional flags through `--env.*`:
+
+```bash
+# LIBERO: map simulator camera names to policy feature names
+--env.camera_name_mapping='{"agentview_image": "camera1", "robot0_eye_in_hand_image": "camera2"}'
+
+# Fill unused camera slots with zeros
+--policy.empty_cameras=1
+```
+
+See each benchmark's documentation ([LIBERO](libero), [Meta-World](metaworld)) for benchmark-specific flags.
+
+## How batch_size works
+
+`batch_size` controls how many environments run in parallel within a single `VectorEnv`:
+
+| `batch_size` | Behavior |
+|---|---|
+| `0` (default) | Auto-tune: `floor(cpu_cores × 0.7)`, capped by `n_episodes` and `64` |
+| `1` | Single environment, synchronous. Useful for debugging |
+| `N` | N environments step in parallel via `AsyncVectorEnv` |
+
+When `batch_size > 1` and `use_async_envs=true`, each environment runs in its own subprocess via Gymnasium's `AsyncVectorEnv`. This parallelizes the simulation stepping (the main bottleneck), while the policy runs a single batched forward pass on GPU.
+
+**Example:** On a 16-core machine with `n_episodes=100`:
+- Auto batch_size = `floor(16 × 0.7)` = `11`
+- 11 environments step simultaneously → ~11× faster than sequential
+
+## Performance
+
+### AsyncVectorEnv (default)
+
+`AsyncVectorEnv` spawns one subprocess per environment. Each subprocess has its own simulator instance. While the policy computes actions on GPU, all environments step in parallel on CPU:
+
+```
+GPU: [inference]....[inference]....[inference]....
+CPU: [step × N]....................[step × N]......
+        ↑ parallel                 ↑ parallel
+```
+
+For GPU-based simulators (LIBERO, Meta-World), the environments use **lazy initialization**: the GPU/EGL context is created inside the worker subprocess on first `reset()`, not in the parent process. This avoids `EGL_BAD_CONTEXT` crashes from inheriting stale GPU handles across `fork()`.
+
+### Lazy task loading
+
+For multi-task benchmarks (e.g. LIBERO with 10 tasks), environments are wrapped in `_LazyAsyncVectorEnv`, which defers worker creation until the task is actually evaluated. This keeps the peak process count at `batch_size` instead of `n_tasks × batch_size`. After each task completes, its workers are closed to free resources.
+
+### Tuning for speed
+
+| Situation | Recommendation |
+|---|---|
+| Slow eval, low GPU utilization | Increase `batch_size` (or leave at auto) |
+| Out of memory (system RAM) | Decrease `batch_size` |
+| Out of GPU memory | Decrease `batch_size`, or use `--policy.use_amp=true` |
+| Debugging / single-stepping | `--eval.batch_size=1 --eval.use_async_envs=false` |
+
+### Benchmarks
+
+Measured with `pepijn223/smolvla_libero` on `libero_spatial` (10 tasks, 100 episodes total):
+
+| Configuration | Wall time | GPU util |
+|---|---|---|
+| `batch_size=1` (sync) | ~400s | 0–8% |
+| `batch_size=10` (async) | ~189s | 0–99% |
+
+## Output
+
+Results are written to `output_dir` (default: `outputs/eval//
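The auto-tune rule the new evaluation.mdx describes (`floor(cpu_cores × 0.7)`, capped by `n_episodes` and `64`) can be sketched in a few lines of Python. This is an illustrative sketch, not code from the PR: the helper name `auto_batch_size` is hypothetical, and the floor at 1 environment is an assumption for the degenerate single-core case.

```python
import math

def auto_batch_size(cpu_cores: int, n_episodes: int, cap: int = 64) -> int:
    """Sketch of the auto-tune rule described in evaluation.mdx:
    floor(cpu_cores * 0.7), capped by n_episodes and 64.
    Flooring at 1 is an assumption for very small core counts."""
    return max(1, min(math.floor(cpu_cores * 0.7), n_episodes, cap))

# The worked example from the docs: 16 cores, 100 episodes.
print(auto_batch_size(16, 100))  # → 11
```

Running fewer episodes than the core count, or having many cores, hits the other two caps: `auto_batch_size(4, 2)` yields `2`, and `auto_batch_size(128, 1000)` yields `64`.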