Mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-15 08:39:49 +00:00)
feat(ci): add benchmark smoke tests with isolated Docker images
Each benchmark gets its own image (lerobot[<benchmark>,smolvla]) so incompatible dep trees can never collide. A 1-episode smoke eval runs per benchmark on GPU runners.

- Libero: pepijn223/smolvla_libero, libero_spatial, camera_name_mapping
- MetaWorld: pepijn223/smolvla_metaworld, metaworld-push-v2
- LIBERO config pre-created at build time to bypass interactive stdin prompt
- Triggers on envs/**, lerobot_eval.py, Dockerfiles, pyproject.toml changes
- Adds docs/source/evaluation.mdx and restores step 7 in adding_benchmarks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/source/_toctree.yml
@@ -73,6 +73,8 @@
        title: Control & Train Robots in Sim (LeIsaac)
    title: "Simulation"
  - sections:
      - local: evaluation
        title: Evaluation (lerobot-eval)
      - local: adding_benchmarks
        title: Adding a New Benchmark
      - local: libero
docs/source/adding_benchmarks.mdx
@@ -122,15 +122,17 @@ Each `EnvConfig` subclass declares two dicts that tell the policy what to expect

### Checklist

| File                                       | Required | Why                                                           |
| ------------------------------------------ | -------- | ------------------------------------------------------------- |
| `src/lerobot/envs/<benchmark>.py`          | Yes      | Wraps the simulator as a standard `gym.Env`                   |
| `src/lerobot/envs/configs.py`              | Yes      | Registers your benchmark and its `create_envs()` for the CLI  |
| `src/lerobot/processor/env_processor.py`   | Optional | Custom observation/action transforms                          |
| `src/lerobot/envs/utils.py`                | Optional | Only if you need new raw observation keys                     |
| `pyproject.toml`                           | Yes      | Declares benchmark-specific dependencies                      |
| `docs/source/<benchmark>.mdx`              | Yes      | User-facing documentation page                                |
| `docs/source/_toctree.yml`                 | Yes      | Adds your page to the docs sidebar                            |
| `docker/Dockerfile.benchmark.<benchmark>`  | Yes      | Isolated Docker image for CI smoke tests                      |
| `.github/workflows/benchmark_tests.yml`    | Yes      | CI job that builds the image and runs a 1-episode smoke eval  |

### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)
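
Before the detailed walkthrough, here is a minimal sketch of what such a wrapper looks like. Every name, space, and shape below is an illustrative placeholder rather than LeRobot's actual code; the only contracts taken from this guide are that the wrapper is a standard `gym.Env` and that `info["is_success"]` reports task completion.

```python
import gymnasium as gym
import numpy as np

class MyBenchmarkEnv(gym.Env):
    """Illustrative skeleton; names, spaces, and shapes are placeholders."""

    metadata = {"render_modes": ["rgb_array"]}

    def __init__(self, task: str):
        super().__init__()
        self.task = task
        self.observation_space = gym.spaces.Dict(
            {"pixels": gym.spaces.Box(0, 255, (256, 256, 3), dtype=np.uint8)}
        )
        self.action_space = gym.spaces.Box(-1.0, 1.0, (7,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = {"pixels": np.zeros((256, 256, 3), dtype=np.uint8)}  # from the sim
        return obs, {}

    def step(self, action):
        obs = {"pixels": np.zeros((256, 256, 3), dtype=np.uint8)}  # from the sim
        reward, terminated, truncated = 0.0, False, False
        info = {"is_success": False}  # the eval loop reads this flag
        return obs, reward, terminated, truncated, info
```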

@@ -295,6 +297,78 @@ Add your benchmark to the "Benchmarks" section:
      title: "Benchmarks"

### 7. CI smoke test (`docker/` + `.github/workflows/benchmark_tests.yml`)

Each benchmark must have an isolated Docker image and a CI job that runs a 1-episode eval. This catches install-time regressions (broken transitive deps, import errors, interactive prompts) before they reach users.

**Create `docker/Dockerfile.benchmark.<benchmark>`** — copy an existing one and change only the extra name:

```dockerfile
# Isolated benchmark image — installs lerobot[<benchmark>] only.
# Build: docker build -f docker/Dockerfile.benchmark.<benchmark> -t lerobot-benchmark-<benchmark> .
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
ARG PYTHON_VERSION=3.12
# ... (same system deps as Dockerfile.benchmark.libero) ...
RUN uv sync --locked --extra <benchmark> --no-cache
```

Each benchmark gets its own image so its dependency tree (pinned simulator packages, specific mujoco/scipy versions) cannot conflict with other benchmarks.

**Add a job to `.github/workflows/benchmark_tests.yml`** — copy an existing job block and adjust:

```yaml
<benchmark>-integration-test:
  name: <Benchmark> — build image + 1-episode eval
  runs-on:
    group: aws-g6-4xlarge-plus
  env:
    HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
  steps:
    - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
      with:
        persist-credentials: false
        lfs: true
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
      with:
        cache-binary: false
    - name: Build <Benchmark> image
      uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
      with:
        context: .
        file: docker/Dockerfile.benchmark.<benchmark>
        push: false
        load: true
        tags: lerobot-benchmark-<benchmark>:ci
        cache-from: type=local,src=/tmp/.buildx-cache-<benchmark>
        cache-to: type=local,dest=/tmp/.buildx-cache-<benchmark>,mode=max
    - name: Run <Benchmark> smoke eval (1 episode)
      run: |
        docker run --rm --gpus all \
          --shm-size=4g \
          -e HF_HOME=/tmp/hf \
          -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
          lerobot-benchmark-<benchmark>:ci \
          bash -c "
            hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
            lerobot-eval \
              --policy.path=<hub_policy_path> \
              --env.type=<benchmark> \
              --env.task=<task> \
              --eval.batch_size=1 \
              --eval.n_episodes=1 \
              --eval.use_async_envs=false \
              --policy.device=cuda
          "
```

**Tips:**

- If the benchmark library prompts for user input on import (like LIBERO asking for a dataset folder), pass the relevant env var in the `docker run` command (e.g. `-e LIBERO_DATA_FOLDER=/tmp/libero_data`).
- The job is scoped to trigger only on changes to `src/lerobot/envs/**`, `src/lerobot/scripts/lerobot_eval.py`, and the Dockerfiles — it won't run on unrelated PRs.

## Verifying your integration

After completing the steps above, confirm that everything works:

@@ -303,6 +377,7 @@ After completing the steps above, confirm that everything works:
2. **Smoke test env creation** — call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys (a sketch follows this list).
3. **Run a full eval** — `lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --policy.path=<any_compatible_policy>` to exercise the full pipeline end-to-end. (`batch_size` defaults to auto-tuning based on CPU cores; pass `--eval.batch_size=1` to force a single environment.)
4. **Check success detection** — verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates.
5. **Add CI smoke test** — follow step 7 above to add a Dockerfile and CI job. This ensures the install stays green as dependencies evolve.
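
For step 2, a minimal smoke-test sketch. The `make_env` signature and the `{suite: {task_id: VectorEnv}}` return shape are taken from this guide; `env_cfg` is assumed to be your registered `EnvConfig` instance:

```python
from lerobot.envs.factory import make_env

envs = make_env(env_cfg, n_envs=1)  # env_cfg: your registered EnvConfig instance

# Expected shape: {suite: {task_id: VectorEnv}}
for suite, tasks in envs.items():
    for task_id, venv in tasks.items():
        obs, info = venv.reset(seed=0)
        print(suite, task_id, sorted(obs.keys()))  # verify observation keys
        venv.close()
```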

## Writing a benchmark doc page

@@ -313,7 +388,7 @@ Each benchmark `.mdx` page should include:
- **Overview image or GIF.**
- **Available tasks** — table of task suites with counts and brief descriptions.
- **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` for reproducible results. `batch_size` defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable. See the [Evaluation guide](evaluation) for details.
- **Policy inputs and outputs** — observation keys with shapes, action space description.
- **Recommended evaluation episodes** — how many episodes per task is standard.
- **Training** — example `lerobot-train` command.

docs/source/evaluation.mdx (new file)
@@ -0,0 +1,162 @@

# Evaluation

`lerobot-eval` runs a trained policy on a simulation benchmark and reports success rate, reward, and (optionally) episode videos. It handles environment creation, batched rollouts, and metric aggregation automatically.

## Quick start

Evaluate a Hub-hosted policy on LIBERO:

```bash
lerobot-eval \
  --policy.path=pepijn223/smolvla_libero \
  --env.type=libero \
  --env.task=libero_spatial \
  --eval.n_episodes=10 \
  --policy.device=cuda
```

Evaluate a local checkpoint:

```bash
lerobot-eval \
  --policy.path=outputs/train/act_pusht/checkpoints/005000/pretrained_model \
  --env.type=pusht \
  --eval.n_episodes=10
```

`batch_size` defaults to **auto** (based on CPU cores). The script picks the right number of parallel environments for your machine.

## Key flags

| Flag                    | Default        | Description                                                                            |
| ----------------------- | -------------- | -------------------------------------------------------------------------------------- |
| `--policy.path`         | required       | Hub repo ID or local path to a pretrained model                                        |
| `--env.type`            | required       | Benchmark name (`pusht`, `libero`, `metaworld`, etc.)                                  |
| `--env.task`            | varies         | Task or suite name (e.g. `libero_spatial`, `libero_10`)                                |
| `--eval.n_episodes`     | `50`           | Total episodes to run (across all tasks)                                               |
| `--eval.batch_size`     | `0` (auto)     | Number of parallel environments. `0` = auto-tune from CPU cores                        |
| `--eval.use_async_envs` | `true`         | Use `AsyncVectorEnv` (parallel stepping). Auto-downgrades to sync when `batch_size=1`  |
| `--policy.device`       | `cuda`         | Inference device                                                                       |
| `--policy.use_amp`      | `false`        | Mixed-precision inference (saves VRAM, faster on Ampere+)                              |
| `--seed`                | `1000`         | Random seed for reproducibility                                                        |
| `--output_dir`          | auto-generated | Where to write results and videos                                                      |

### Environment-specific flags

Some benchmarks accept additional flags through `--env.*`:

```bash
# LIBERO: map simulator camera names to policy feature names
--env.camera_name_mapping='{"agentview_image": "camera1", "robot0_eye_in_hand_image": "camera2"}'

# Fill unused camera slots with zeros
--policy.empty_cameras=1
```

See each benchmark's documentation ([LIBERO](libero), [Meta-World](metaworld)) for benchmark-specific flags.

## How batch_size works

`batch_size` controls how many environments run in parallel within a single `VectorEnv`:

| `batch_size`  | Behavior                                                             |
| ------------- | -------------------------------------------------------------------- |
| `0` (default) | Auto-tune: `floor(cpu_cores × 0.7)`, capped by `n_episodes` and `64` |
| `1`           | Single environment, synchronous. Useful for debugging                |
| `N`           | N environments step in parallel via `AsyncVectorEnv`                 |

When `batch_size > 1` and `use_async_envs=true`, each environment runs in its own subprocess via Gymnasium's `AsyncVectorEnv`. This parallelizes the simulation stepping (the main bottleneck), while the policy runs a single batched forward pass on GPU.

**Example:** On a 16-core machine with `n_episodes=100`:

- Auto batch_size = `floor(16 × 0.7)` = `11`
- 11 environments step simultaneously → ~11× faster than sequential
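
As a sanity check, the auto-tune rule above can be written out directly. This is a sketch of the described heuristic, not the actual code in `lerobot_eval.py`:

```python
import os

def auto_batch_size(n_episodes: int, cap: int = 64) -> int:
    # floor(cpu_cores * 0.7), capped by n_episodes and 64 (the rule above).
    cores = os.cpu_count() or 1
    return max(1, min(int(cores * 0.7), n_episodes, cap))

print(auto_batch_size(100))  # -> 11 on a 16-core machine
```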

## Performance

### AsyncVectorEnv (default)

`AsyncVectorEnv` spawns one subprocess per environment. Each subprocess has its own simulator instance. While the policy computes actions on GPU, all environments step in parallel on CPU:

```
GPU: [inference]..........[inference]..........
CPU: ...........[step × N]...........[step × N]
                ↑ parallel           ↑ parallel
```

For GPU-based simulators (LIBERO, Meta-World), the environments use **lazy initialization**: the GPU/EGL context is created inside the worker subprocess on first `reset()`, not in the parent process. This avoids `EGL_BAD_CONTEXT` crashes from inheriting stale GPU handles across `fork()`.
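
To illustrate the pattern, here is a minimal example with Gymnasium's `AsyncVectorEnv`. `Pendulum-v1` stands in for a real simulator; the point is that each thunk executes inside its worker subprocess, which is where heavy imports and GPU/EGL setup belong:

```python
import gymnasium as gym

def make_one_env():
    # Heavy imports and any GPU/EGL context creation belong inside this
    # thunk: it runs in the worker subprocess, so no stale GPU handle is
    # inherited across fork(). Pendulum-v1 is a stand-in for a simulator.
    return gym.make("Pendulum-v1")

venv = gym.vector.AsyncVectorEnv([make_one_env for _ in range(4)])
obs, info = venv.reset(seed=0)
print(obs.shape)  # (4, 3): one observation per worker process
venv.close()
```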

### Lazy task loading

For multi-task benchmarks (e.g. LIBERO with 10 tasks), environments are wrapped in `_LazyAsyncVectorEnv`, which defers worker creation until the task is actually evaluated. This keeps the peak process count at `batch_size` instead of `n_tasks × batch_size`. After each task completes, its workers are closed to free resources.
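
A stripped-down sketch of the idea behind `_LazyAsyncVectorEnv` (the class name comes from the paragraph above, but this implementation is illustrative, not LeRobot's):

```python
import gymnasium as gym

class LazyVectorEnv:
    """Illustrative stand-in for _LazyAsyncVectorEnv: defer worker creation."""

    def __init__(self, env_fns):
        self.env_fns = env_fns  # worker thunks, not yet spawned
        self._venv = None

    def _ensure(self):
        if self._venv is None:  # spawn workers only when the task is evaluated
            self._venv = gym.vector.AsyncVectorEnv(self.env_fns)
        return self._venv

    def reset(self, **kwargs):
        return self._ensure().reset(**kwargs)

    def step(self, actions):
        return self._ensure().step(actions)

    def close(self):
        if self._venv is not None:  # free workers once the task completes
            self._venv.close()
            self._venv = None
```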

### Tuning for speed

| Situation                      | Recommendation                                         |
| ------------------------------ | ------------------------------------------------------ |
| Slow eval, low GPU utilization | Increase `batch_size` (or leave at auto)               |
| Out of memory (system RAM)     | Decrease `batch_size`                                  |
| Out of GPU memory              | Decrease `batch_size`, or use `--policy.use_amp=true`  |
| Debugging / single-stepping    | `--eval.batch_size=1 --eval.use_async_envs=false`      |

## Output

Results are written to `output_dir` (default: `outputs/eval/<date>/<time>_<job_name>/`):

- `eval_info.json` — full metrics: per-episode, per-task, per-group, and overall aggregates
- `videos/` — episode recordings (when `--eval.n_episodes_to_render > 0`)

### Metrics

| Metric           | Description                                                           |
| ---------------- | --------------------------------------------------------------------- |
| `pc_success`     | Success rate (%), based on `info["is_success"]` from the environment  |
| `avg_sum_reward` | Mean cumulative reward per episode                                    |
| `avg_max_reward` | Mean peak reward per episode                                          |
| `n_episodes`     | Total episodes evaluated                                              |
| `eval_s`         | Total wall-clock time                                                 |
| `eval_ep_s`      | Mean wall-clock time per episode                                      |
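
To post-process results, you can load `eval_info.json` directly. The path below uses the placeholder pattern from above, and the exact key schema is not documented here, so inspect the top-level keys first:

```python
import json
from pathlib import Path

# Replace the placeholders with your actual run directory.
run_dir = Path("outputs/eval/<date>/<time>_<job_name>")
info = json.loads((run_dir / "eval_info.json").read_text())

# The key layout is not specified in this guide; list it before digging in.
print(list(info.keys()))
```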

## Multi-task evaluation

For benchmarks with multiple tasks (LIBERO suites, Meta-World MT50), `lerobot-eval` automatically:

1. Creates environments for all tasks in the selected suite(s)
2. Evaluates each task sequentially (one task's workers at a time)
3. Aggregates metrics per-task, per-group (suite), and overall

```bash
# Evaluate all 10 tasks in libero_spatial
lerobot-eval \
  --policy.path=pepijn223/smolvla_libero \
  --env.type=libero \
  --env.task=libero_spatial \
  --eval.n_episodes=10

# Evaluate multiple suites
lerobot-eval \
  --policy.path=pepijn223/smolvla_libero \
  --env.type=libero \
  --env.task="libero_spatial,libero_object" \
  --eval.n_episodes=10
```

## API usage

You can call the eval functions directly from Python:

```python
from lerobot.envs.factory import make_env
from lerobot.policies.factory import make_policy
from lerobot.scripts.lerobot_eval import eval_policy

envs = make_env(env_cfg, n_envs=10)
policy = make_policy(cfg=policy_cfg, env_cfg=env_cfg)

metrics = eval_policy(
    env=envs["libero_spatial"][0],
    policy=policy,
    n_episodes=10,
)
print(metrics["pc_success"])
```
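
Building on the same API, a hypothetical loop that evaluates every task in every suite and averages the per-task success rates (it reuses `envs`, `policy`, and `eval_policy` from the snippet above):

```python
per_task = {}
for suite, tasks in envs.items():
    for task_id, venv in tasks.items():
        metrics = eval_policy(env=venv, policy=policy, n_episodes=10)
        per_task[(suite, task_id)] = metrics["pc_success"]
        venv.close()  # free this task's workers before moving on

print(sum(per_task.values()) / len(per_task))  # mean success rate across tasks
```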