docs(benchmarks): add benchmark integration guide and standardize benchmark docs

Add a comprehensive guide for adding new benchmarks to LeRobot, and
refactor the existing LIBERO and Meta-World docs to follow the new
standardized template.

Made-with: Cursor
Pepijn
2026-04-02 20:43:31 +02:00
parent 5de7aa5a4f
commit 69eec9c822
2 changed files with 237 additions and 166 deletions
+229 -158
# Adding a New Benchmark
This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents: follow the steps in order and use the referenced files as templates.
A "benchmark" in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments used for standardized evaluation. Each benchmark wraps a third-party simulator (e.g., LIBERO, Meta-World) behind a `gym.Env` interface, and the `lerobot-eval` script drives evaluation uniformly across all benchmarks.
## Architecture overview
### Observation and action data flow
During evaluation, observations and actions flow through a multi-stage pipeline:
```
gym.Env.reset() / step()
    ▼ raw observation (dict[str, Any])
preprocess_observation()   # envs/utils.py — numpy→tensor, key mapping
    ▼ LeRobot-format observation
add_envs_task()            # envs/utils.py — injects task description
env_preprocessor           # processor/env_processor.py — env-specific transforms
policy_preprocessor        # per-policy normalization, device transfer
policy.select_action()     # PreTrainedPolicy — returns action tensor
policy_postprocessor       # per-policy denormalization
env_postprocessor          # env-specific action transforms
    ▼ numpy action
gym.Env.step(action)
```
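As a rough sketch, the stages above compose like plain functions. This is illustrative only: `run_step`, `pre_steps`, and `post_steps` are invented names for the sketch, not LeRobot APIs, and the key mapping is reduced to its simplest form.

```python
# Illustrative sketch of the data flow above; plain functions stand in for
# LeRobot's actual components (run_step/pre_steps/post_steps are invented names).

def preprocess_observation(raw: dict) -> dict:
    # numpy→tensor conversion omitted; only the key mapping is shown
    return {"observation.state": raw["agent_pos"], "observation.image": raw["pixels"]}

def add_envs_task(obs: dict, task: str) -> dict:
    # Injects the language instruction alongside the mapped observation
    return {**obs, "task": task}

def run_step(raw_obs, task, policy, pre_steps, post_steps):
    obs = add_envs_task(preprocess_observation(raw_obs), task)
    for step in pre_steps:        # env_preprocessor, then policy_preprocessor
        obs = step(obs)
    action = policy(obs)          # policy.select_action()
    for step in post_steps:       # policy_postprocessor, then env_postprocessor
        action = step(action)
    return action                 # passed to gym.Env.step()
```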
### Environment structure
`make_env()` returns a nested dict:
```python
dict[str, dict[int, gym.vector.VectorEnv]]
# ^suite_name ^task_id ^vectorized env with n_envs parallel copies
```
For single-task environments (e.g., PushT), this is `{"pusht": {0: vec_env}}`.
For multi-task benchmarks (e.g., LIBERO), this is `{"libero_spatial": {0: vec0, 1: vec1, ...}, "libero_object": {0: ..., ...}}`.
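Illustrated concretely, with placeholder strings standing in for `VectorEnv` instances:

```python
# Placeholder strings stand in for gym.vector.VectorEnv instances.
single_task = {"pusht": {0: "vec_env"}}
multi_task = {
    "libero_spatial": {0: "vec0", 1: "vec1"},
    "libero_object": {0: "vec0"},
}

def iter_tasks(envs):
    # How a consumer like the eval loop walks the nested structure
    return [(suite, task_id) for suite, tasks in envs.items() for task_id in tasks]
```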
## What your environment must provide
LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.
### Env attributes
Your `gym.Env` must set these attributes:
| Attribute | Type | Why |
| -------------------- | ----- | ---------------------------------------------------- |
| `_max_episode_steps` | `int` | `rollout()` uses this to cap episode length |
| `task_description` | `str` | Passed to VLA policies as a language instruction |
| `task` | `str` | Fallback identifier if `task_description` is not set |
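A minimal sketch of setting these attributes (a plain class for brevity; a real implementation subclasses `gym.Env`, and the step limit and task strings here are hypothetical):

```python
# Minimal sketch; a real env subclasses gym.Env. Values are hypothetical.
class MyBenchmarkEnvSketch:
    def __init__(self, task: str, max_steps: int = 500):
        self._max_episode_steps = max_steps          # caps rollout() episode length
        self.task = task                             # fallback identifier
        self.task_description = f"please {task}"     # language instruction for VLA policies
```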
### Success reporting
Your `step()` and `reset()` must include `"is_success"` in the `info` dict:
```python
info = {"is_success": True} # or False
return observation, reward, terminated, truncated, info
```
### Observations
The simplest approach is to map your simulator's outputs to the standard keys that `preprocess_observation()` already understands. Do this inside your `gym.Env` (e.g. in a `_format_raw_obs()` helper):
| Your env should output | LeRobot maps it to | What it is |
| ------------------------- | -------------------------- | ------------------------------------- |
| `"pixels"` (single array) | `observation.image` | Single camera image, HWC uint8 |
| `"pixels"` (dict) | `observation.images.<cam>` | Multiple cameras, each HWC uint8 |
| `"agent_pos"` | `observation.state` | Proprioceptive state vector |
| `"environment_state"` | `observation.env_state` | Full environment state (e.g. PushT) |
| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g. LIBERO) |
If your simulator uses different key names, you have two options:
1. **Recommended:** Rename them to the standard keys inside your `gym.Env` wrapper.
2. **Alternative:** Write an env processor to transform observations after `preprocess_observation()` runs (see step 4 below).
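A sketch of option 1, assuming hypothetical simulator keys `camera_front`, `camera_wrist`, and `qpos` (your simulator's names will differ):

```python
# Option 1 as a sketch: rename simulator-specific keys (the names
# "camera_front", "camera_wrist", "qpos" are hypothetical) to the standard keys.
def format_raw_obs(sim_obs: dict) -> dict:
    return {
        "pixels": {
            "front": sim_obs["camera_front"],   # → observation.images.front
            "wrist": sim_obs["camera_wrist"],   # → observation.images.wrist
        },
        "agent_pos": sim_obs["qpos"],           # → observation.state
    }
```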
### Actions
Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies handle the dimension mismatch via their `input_features` / `output_features` config.
### Feature declaration
Each `EnvConfig` subclass declares:
- `features`: dict mapping feature names to `PolicyFeature(type, shape)` — tells the policy what to expect.
- `features_map`: dict mapping raw env keys to LeRobot convention keys (e.g., `"agent_pos"` → `"observation.state"`).
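For a hypothetical 7-DoF benchmark, a filled-in pair might look like this (plain tuples stand in for `PolicyFeature(type, shape)` so the sketch stays dependency-free; the shapes are illustrative):

```python
# Hypothetical 7-DoF benchmark; tuples stand in for PolicyFeature(type, shape).
features = {
    "action": ("ACTION", (7,)),
    "agent_pos": ("STATE", (8,)),
    "pixels/image": ("VISUAL", (3, 224, 224)),
}
features_map = {
    "action": "action",
    "agent_pos": "observation.state",
    "pixels/image": "observation.images.image",
}
```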
## Files to create or modify
### Checklist
| File | Required | Description |
| ---------------------------------------- | -------- | ----------------------------------------------------------------------------------- |
| `src/lerobot/envs/<benchmark>.py` | Yes | `gym.Env` subclass + `create_<benchmark>_envs()` factory |
| `src/lerobot/envs/configs.py` | Yes | `@EnvConfig.register_subclass("<name>")` dataclass |
| `src/lerobot/envs/factory.py` | Yes | Add dispatch branch in `make_env()` and optionally `make_env_pre_post_processors()` |
| `src/lerobot/processor/env_processor.py` | Optional | `ProcessorStep` subclass for env-specific observation transforms |
| `src/lerobot/envs/utils.py` | Optional | Extend `preprocess_observation()` if new raw keys are needed |
| `pyproject.toml` | Yes | Add optional dependency group |
| `docs/source/<benchmark>.mdx` | Yes | User-facing benchmark documentation |
| `docs/source/_toctree.yml` | Yes | Add entry under the "Benchmarks" section |
### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)
Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates.
Your env must implement:
```python
class MyBenchmarkEnv(gym.Env):
    def __init__(self, ...):
        ...
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        # Reset simulator, return (observation, info)
        # info must contain {"is_success": False}
        ...

    def step(self, action: np.ndarray):
        # Step simulator, return (observation, reward, terminated, truncated, info)
        # info must contain {"is_success": <bool>}
        # On termination, info must contain "final_info" with success status
        ...

    def render(self):
        # Return RGB image as numpy array
        ...

    def close(self):
        # Clean up simulator resources
        ...
```
Also provide a factory function that returns the standard nested dict:
```python
def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    ...
):
    # Build and return {suite: {task_id: VectorEnv}}
    ...
```
See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference.
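The shape of such a factory can be sketched with an injected `make_vec` helper so the example stays self-contained (the helper, suite name, and task names are all hypothetical; in LeRobot the builder would produce a `gym.vector.VectorEnv` with `n_envs` parallel copies):

```python
# Sketch only: make_vec is an injected stand-in for building a vectorized env
# with n_envs copies; suite/task names are hypothetical.
def create_mybenchmark_envs(task: str, n_envs: int, make_vec=lambda name, n: f"{name}x{n}"):
    suites = {"mybenchmark_suite": ["task_a", "task_b"]} if task == "all" else {task: [task]}
    return {
        suite: {task_id: make_vec(name, n_envs) for task_id, name in enumerate(names)}
        for suite, names in suites.items()
    }
```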
### 2. The config (`src/lerobot/envs/configs.py`)
Register a new config dataclass:
```python
@EnvConfig.register_subclass("<benchmark_name>")
@dataclass
class MyBenchmarkEnv(EnvConfig):
    task: str = "<default_task>"
    fps: int = <fps>
    obs_type: str = "pixels_agent_pos"
    # ... benchmark-specific fields ...

    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
        # ... observation features ...
    })

    def __post_init__(self):
        # Populate features based on obs_type
        ...

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}
```
Key points:
- The `register_subclass` name is what users pass as `--env.type=<name>` on the CLI.
- `features` declares what the environment produces (used to configure the policy).
- `features_map` maps raw observation keys to LeRobot convention keys.
### 3. The factory dispatch (`src/lerobot/envs/factory.py`)
Add a branch in `make_env()`:
```python
elif "<benchmark_name>" in cfg.type:
    from lerobot.envs.<benchmark> import create_<benchmark>_envs

    if cfg.task is None:
        raise ValueError("<BenchmarkName> requires a task to be specified")
    return create_<benchmark>_envs(
        task=cfg.task,
        n_envs=n_envs,
        gym_kwargs=cfg.gym_kwargs,
        env_cls=env_cls,
    )
```
If your benchmark needs an env processor, add it in `make_env_pre_post_processors()`:
```python
if isinstance(env_cfg, MyBenchmarkEnv) or "<benchmark_name>" in env_cfg.type:
    preprocessor_steps.append(MyBenchmarkProcessorStep())
```
### 4. Env processor (optional) (`src/lerobot/processor/env_processor.py`)
If your benchmark needs observation transforms beyond what `preprocess_observation()` handles (e.g., image flipping, coordinate frame conversion), add a `ProcessorStep`:
```python
@dataclass
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # Your transforms here
        return processed

    def transform_features(self, features):
        # Update feature declarations if shapes change
        return features

    def observation(self, observation):
        return self._process_observation(observation)
```
See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).
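The core of such a transform, reduced to a dependency-free sketch (nested lists stand in for numpy images, and a 180° rotation serves as an example of the kind of image rotation mentioned above):

```python
# Dependency-free sketch of an observation transform: a 180° image rotation.
# Nested lists stand in for numpy image arrays.
def rotate_180(img):
    return [list(reversed(row)) for row in reversed(img)]

def process_observation(observation: dict) -> dict:
    processed = dict(observation)
    processed["observation.images.image"] = rotate_180(processed["observation.images.image"])
    return processed
```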
### 5. Dependencies (`pyproject.toml`)
Add a new optional-dependency group under `[project.optional-dependencies]`:
```toml
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
```
**Dependency pinning rules:**
- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`).
- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`).
- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility).
- **Document version constraints** in the benchmark doc page.
Users install with:
```bash
pip install -e ".[mybenchmark]"
```
### 6. Documentation (`docs/source/<benchmark>.mdx`)
Follow the template below. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
### 7. Table of contents (`docs/source/_toctree.yml`)
Add your benchmark under the "Benchmarks" section:
```yaml
- sections:
    # ... existing benchmark entries ...
    - local: <benchmark>
      title: <Benchmark Name>
  title: "Benchmarks"
```
## Benchmark documentation template
Each benchmark `.mdx` page should follow this structure:
```markdown
# <Benchmark Name>
<1-2 paragraphs: what the benchmark tests and why it matters for robot learning.>
- Paper: [<title>](arxiv_url)
- GitHub: [<repo>](github_url)
- Project website: [<name>](url) (if available)
<Overview image or GIF>
## Available tasks
<Table listing task suites or individual tasks, with counts.
For multi-suite benchmarks, describe each suite briefly.>
| Suite | Tasks | Description |
| ----- | ----- | ----------- |
| ... | ... | ... |
## Installation
After following the LeRobot installation instructions:
pip install -e ".[<benchmark>]"
<Any additional steps: environment variables, system packages, etc.>
## Evaluation
### Default evaluation (recommended)
<Command with recommended n_episodes, batch_size for reproducible results.>
### Single-task evaluation
<Command example with --env.task=<single_task>>
### Multi-task evaluation
<Command example with comma-separated tasks, if applicable.>
### Policy inputs and outputs
**Observations:**
- `observation.state` — <shape, description>
- `observation.images.image` — <shape, description>
- ...
**Actions:**
- Continuous control in Box(<low>, <high>, shape=(<dim>,))
### Recommended evaluation episodes
<State how many episodes per task are standard for this benchmark.
E.g., "50 episodes per task (500 total for LIBERO Spatial).">
## Training
<Example lerobot-train command.>
## Reproducing published results
<If available: link to pretrained model, eval command, results table.>
```
## How evaluation works
All benchmarks are evaluated uniformly by `lerobot-eval` (see `src/lerobot/scripts/lerobot_eval.py`).
The `eval_policy_all()` function:
1. Receives the nested `{suite: {task_id: VectorEnv}}` dict from `make_env()`.
2. Iterates over every `(suite, task_id, vec_env)` tuple.
3. For each task, runs `n_episodes` rollouts via `eval_policy()` → `rollout()`.
4. Aggregates results hierarchically: **episode → task → suite → overall**.
5. Reports `pc_success` (success rate), `avg_sum_reward`, `avg_max_reward` at each level.
6. Saves all results to `eval_info.json` with the full config snapshot for reproducibility.
The key contract: your `gym.Env` must return `info["is_success"]` on every `step()`, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop detects task completion.
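The hierarchical aggregation can be sketched in miniature (the helper names here are ours, not functions from `lerobot_eval.py`):

```python
# Hierarchical aggregation in miniature: episode → task → suite → overall.
# Helper names are invented for the sketch.
def pc_success(episodes):
    return 100.0 * sum(episodes) / len(episodes)

def aggregate(results):
    # results: {suite: {task_id: [bool per episode]}}
    per_task = {(s, t): pc_success(eps)
                for s, tasks in results.items() for t, eps in tasks.items()}
    per_suite = {s: pc_success([e for eps in tasks.values() for e in eps])
                 for s, tasks in results.items()}
    overall = pc_success([e for tasks in results.values()
                          for eps in tasks.values() for e in eps])
    return per_task, per_suite, overall
```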
## Quick reference: existing benchmarks
| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |