docs(benchmarks): clean up adding-benchmarks guide for clarity

Rewrite for simpler language, better structure, and easier navigation.
Move quick-reference table to the top, fold eval explanation into
architecture section, condense the doc template to a bulleted outline.

Made-with: Cursor
Pepijn
2026-04-03 13:36:16 +02:00
parent 5ad4c8f7b6
commit bfa0a0f846
+160 -222
@@ -1,123 +1,141 @@
# Adding a New Benchmark # Adding a New Benchmark
This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents; follow the steps in order and use the referenced files as templates. This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.
A "benchmark" in LeRobot is a set of gymnasium environments used for standardized evaluation. Each benchmark wraps a third-party simulator (e.g., LIBERO, Meta-World) behind a `gym.Env` interface, and the `lerobot-eval` script drives evaluation uniformly across all benchmarks. A benchmark in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard `gym.Env` interface. The `lerobot-eval` CLI then runs evaluation uniformly across all benchmarks.
## Architecture overview ## Existing benchmarks at a glance
### Observation and action data flow Before diving in, here is what is already integrated:
During evaluation, observations and actions flow through a multi-stage pipeline: | Benchmark | Env file | Config class | Tasks | Action dim | Processor |
| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |
Use `src/lerobot/envs/libero.py` and `src/lerobot/envs/metaworld.py` as reference implementations.
## How it all fits together
### Data flow
During evaluation, data moves through four stages:
``` ```
gym.Env.reset() / step() 1. gym.Env ──→ raw observations (numpy dicts)
▼ raw observation (dict[str, Any]) 2. Preprocessing ──→ standard LeRobot keys + task description
preprocess_observation() # envs/utils.py — numpy→tensor, key mapping (preprocess_observation, add_envs_task in envs/utils.py)
▼ LeRobot-format observation 3. Processors ──→ env-specific then policy-specific transforms
add_envs_task() # envs/utils.py — injects task description (env_preprocessor, policy_preprocessor)
4. Policy ──→ select_action() ──→ action tensor
env_preprocessor # processor/env_processor.py — env-specific transforms then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step()
policy_preprocessor # per-policy normalization, device transfer
policy.select_action() # PreTrainedPolicy — returns action tensor
policy_postprocessor # per-policy denormalization
env_postprocessor # env-specific action transforms
▼ numpy action
gym.Env.step(action)
``` ```
### Environment return shape Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed).
`make_env()` returns a nested dict: ### Environment structure
`make_env()` returns a nested dict of vectorized environments:
```python ```python
dict[str, dict[int, gym.vector.VectorEnv]] dict[str, dict[int, gym.vector.VectorEnv]]
# ^suite_name ^task_id ^vectorized env with n_envs parallel copies # ^suite ^task_id
``` ```
For single-task environments (e.g., PushT), this is `{"pusht": {0: vec_env}}`. A single-task env (e.g. PushT) looks like `{"pusht": {0: vec_env}}`.
For multi-task benchmarks (e.g., LIBERO), this is `{"libero_spatial": {0: vec0, 1: vec1, ...}, "libero_object": {0: ..., ...}}`. A multi-task benchmark (e.g. LIBERO) looks like `{"libero_spatial": {0: vec0, 1: vec1, ...}, ...}`.
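To make the nested structure concrete, here is a minimal sketch of walking it in a stable order. The helper name `iter_tasks` is hypothetical (it is not part of LeRobot), and placeholder strings stand in for real `gym.vector.VectorEnv` instances.

```python
from typing import Any, Iterator


def iter_tasks(envs: dict[str, dict[int, Any]]) -> Iterator[tuple[str, int, Any]]:
    """Yield (suite, task_id, vec_env) triples in sorted order."""
    for suite in sorted(envs):
        for task_id in sorted(envs[suite]):
            yield suite, task_id, envs[suite][task_id]


# Placeholder strings stand in for vectorized environments.
envs = {
    "libero_spatial": {0: "vec0", 1: "vec1"},
    "pusht": {0: "vec_env"},
}
flat = list(iter_tasks(envs))
```

Because single-task and multi-task benchmarks share this shape, the same loop covers both cases without special handling.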
The eval loop (`eval_policy_all()`) iterates over all suites and tasks uniformly. ### How evaluation runs
## The policy-environment contract All benchmarks are evaluated the same way by `lerobot-eval`:
There is no enforced schema: `RobotObservation` is typed as `dict[str, Any]`. Instead, LeRobot relies on conventions: 1. `make_env()` builds the nested `{suite: {task_id: VectorEnv}}` dict.
2. `eval_policy_all()` iterates over every suite and task.
3. For each task, it runs `n_episodes` rollouts via `rollout()`.
4. Results are aggregated hierarchically: episode, task, suite, overall.
5. Metrics include `pc_success` (success rate), `avg_sum_reward`, and `avg_max_reward`.
### Required attributes on your `gym.Env` The critical piece: your env must return `info["is_success"]` on every `step()` call. This is how the eval loop knows whether a task was completed.
| Attribute | Type | Used by | ## What your environment must provide
| -------------------- | ----- | -------------------------------------------------------------- |
| `_max_episode_steps` | `int` | `rollout()` — caps episode length |
| `task_description` | `str` | `add_envs_task()` — feeds language instruction to VLA policies |
| `task` | `str` | `add_envs_task()` — fallback if `task_description` is absent |
### Required fields in `info` dict LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.
| Key | Type | Used by | ### Env attributes
| ------------ | ------ | ----------------------------------------------------------- |
| `is_success` | `bool` | `eval_policy()` — detects task success |
| `final_info` | `dict` | Gymnasium `VectorEnv` — carries per-env info on termination |
### Raw observation format Your `gym.Env` must set these attributes:
`preprocess_observation()` expects raw observations to use these keys: | Attribute | Type | Why |
| -------------------- | ----- | ---------------------------------------------------- |
| `_max_episode_steps` | `int` | `rollout()` uses this to cap episode length |
| `task_description` | `str` | Passed to VLA policies as a language instruction |
| `task` | `str` | Fallback identifier if `task_description` is not set |
| Raw key | Mapped to | Description | ### Success reporting
| --------------------------- | ------------------------------- | -------------------------------------- |
| `"pixels"` (single image) | `observation.image` | Single camera, HWC uint8 |
| `"pixels"` (dict of images) | `observation.images.<cam_name>` | Multiple cameras, each HWC uint8 |
| `"agent_pos"` | `observation.state` | Proprioceptive state vector |
| `"environment_state"` | `observation.env_state` | Environment state (e.g., PushT) |
| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g., LIBERO) |
If your benchmark's raw observations don't match these keys, you have two options: Your `step()` and `reset()` must include `"is_success"` in the `info` dict:
1. **Preferred**: Map your observations to these standard keys inside your `gym.Env._format_raw_obs()` method. ```python
2. **Alternative**: Write an env processor that transforms the observations after `preprocess_observation()` runs. info = {"is_success": True} # or False
return observation, reward, terminated, truncated, info
```
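As an illustration of the success-reporting convention, the sketch below uses a toy class that only mimics the `reset()` / `step()` return signatures. It is an assumption-laden stand-in: it does not subclass `gym.Env` or import Gymnasium, and `ToyReachEnv`, its goal, and its reward are invented for this example.

```python
class ToyReachEnv:
    """Toy 1-D reach task mimicking the gym.Env reset()/step() signatures."""

    def __init__(self, goal: float = 3.0, max_steps: int = 10):
        # Attributes the guide lists as required on the env.
        self._max_episode_steps = max_steps
        self.task = "toy_reach"
        self.task_description = "move the point to the goal"
        self.goal = goal
        self.pos = 0.0
        self.t = 0

    def _info(self) -> dict:
        # Always report success, even when it is False.
        return {"is_success": abs(self.pos - self.goal) < 1e-6}

    def reset(self, seed=None, **kwargs):
        self.pos, self.t = 0.0, 0
        return {"agent_pos": [self.pos]}, self._info()

    def step(self, action: float):
        self.pos += action
        self.t += 1
        info = self._info()
        terminated = info["is_success"]
        truncated = self.t >= self._max_episode_steps and not terminated
        reward = 1.0 if terminated else 0.0
        return {"agent_pos": [self.pos]}, reward, terminated, truncated, info


env = ToyReachEnv()
_, reset_info = env.reset()
for _ in range(3):
    obs, reward, terminated, truncated, info = env.step(1.0)
```

Building `info` in one helper keeps `reset()` and `step()` consistent, so the key is never missing on either path.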
### Action space ### Observations
Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies handle the dimension mismatch via their `input_features` / `output_features` config. The simplest approach is to map your simulator's outputs to the standard keys that `preprocess_observation()` already understands. Do this inside your `gym.Env` (e.g. in a `_format_raw_obs()` helper):
| Your env should output | LeRobot maps it to | What it is |
| ------------------------- | -------------------------- | ------------------------------------- |
| `"pixels"` (single array) | `observation.image` | Single camera image, HWC uint8 |
| `"pixels"` (dict) | `observation.images.<cam>` | Multiple cameras, each HWC uint8 |
| `"agent_pos"` | `observation.state` | Proprioceptive state vector |
| `"environment_state"` | `observation.env_state` | Full environment state (e.g. PushT) |
| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g. LIBERO) |
If your simulator uses different key names, you have two options:
1. **Recommended:** Rename them to the standard keys inside your `gym.Env` wrapper.
2. **Alternative:** Write an env processor to transform observations after `preprocess_observation()` runs (see step 4 below).
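The renaming option could look roughly like the sketch below. The simulator key names (`joint_positions`, `object_poses`) and the function `format_raw_obs` are hypothetical; only the target keys (`pixels`, `agent_pos`, `environment_state`) come from the table above.

```python
# Hypothetical simulator key names; replace with your simulator's own.
SIM_TO_STANDARD = {
    "joint_positions": "agent_pos",
    "object_poses": "environment_state",
}


def format_raw_obs(sim_obs: dict, cameras: dict) -> dict:
    """Rename simulator keys to the standard raw keys."""
    obs = {
        SIM_TO_STANDARD[key]: value
        for key, value in sim_obs.items()
        if key in SIM_TO_STANDARD
    }
    if len(cameras) == 1:
        # Single camera: "pixels" holds the image itself.
        obs["pixels"] = next(iter(cameras.values()))
    elif cameras:
        # Multiple cameras: "pixels" holds a dict keyed by camera name.
        obs["pixels"] = dict(cameras)
    return obs


# Placeholder strings stand in for HWC uint8 image arrays.
single = format_raw_obs({"joint_positions": [0.1, 0.2]}, {"front": "img_front"})
multi = format_raw_obs(
    {"joint_positions": [0.1], "object_poses": [1.0], "debug": 0},
    {"front": "img_front", "wrist": "img_wrist"},
)
```

Keys without a mapping (here `debug`) are dropped rather than passed through, which keeps unexpected simulator outputs away from the policy.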
### Actions
Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their `input_features` / `output_features` config.
### Feature declaration ### Feature declaration
Each `EnvConfig` subclass declares: Each `EnvConfig` subclass declares two dicts that tell the policy what to expect:
- `features`: dict mapping feature names to `PolicyFeature(type, shape)` — tells the policy what to expect. - `features` — maps feature names to `PolicyFeature(type, shape)` (e.g. action dim, image shape).
- `features_map`: dict mapping raw env keys to LeRobot convention keys (e.g., `"agent_pos" → "observation.state"`). - `features_map` — maps raw observation keys to LeRobot convention keys (e.g. `"agent_pos"` to `"observation.state"`).
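A rough sketch of how the two dicts relate is shown below. `FeatureType` and `PolicyFeature` here are simplified local stand-ins so the example runs on its own (they are not the LeRobot classes), the shapes are illustrative, and `to_policy_features` is a hypothetical helper.

```python
from dataclasses import dataclass
from enum import Enum


class FeatureType(Enum):  # local stand-in, not the LeRobot enum
    ACTION = "ACTION"
    STATE = "STATE"
    VISUAL = "VISUAL"


@dataclass(frozen=True)
class PolicyFeature:  # local stand-in, not the LeRobot class
    type: FeatureType
    shape: tuple


# Illustrative shapes only.
features = {
    "action": PolicyFeature(FeatureType.ACTION, (7,)),
    "agent_pos": PolicyFeature(FeatureType.STATE, (8,)),
    "pixels": PolicyFeature(FeatureType.VISUAL, (256, 256, 3)),
}
features_map = {
    "action": "action",
    "agent_pos": "observation.state",
    "pixels": "observation.image",
}


def to_policy_features(features: dict, features_map: dict) -> dict:
    """Re-key declared features by their LeRobot convention names."""
    missing = set(features) - set(features_map)
    if missing:
        raise KeyError(f"features without a mapping: {sorted(missing)}")
    return {features_map[key]: feature for key, feature in features.items()}


policy_features = to_policy_features(features, features_map)
```

Failing loudly on an unmapped feature catches a config where a key was declared but never mapped.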
## Files to create or modify ## Step by step
<Tip>
At minimum, you need three files: a **gym.Env wrapper**, an **EnvConfig
subclass**, and a **factory dispatch branch**. Everything else is optional or
documentation.
</Tip>
### Checklist ### Checklist
| File | Required | Description | | File | Required | Why |
| ---------------------------------------- | -------- | -------------------------------------------------------------------------------- | | ---------------------------------------- | -------- | ----------------------------------------- |
| `src/lerobot/envs/<benchmark>.py` | Yes | `gym.Env` subclass + `create_<benchmark>_envs()` factory | | `src/lerobot/envs/<benchmark>.py` | Yes | Wraps the simulator as a standard gym.Env |
| `src/lerobot/envs/configs.py` | Yes | `@EnvConfig.register_subclass("<name>")` dataclass with `create_envs()` override | | `src/lerobot/envs/configs.py` | Yes | Registers your benchmark for the CLI |
| `src/lerobot/processor/env_processor.py` | Optional | `ProcessorStep` subclass for env-specific observation transforms | | `src/lerobot/envs/factory.py` | Yes | Tells `make_env()` how to build your envs |
| `src/lerobot/envs/utils.py` | Optional | Extend `preprocess_observation()` if new raw keys are needed | | `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms |
| `pyproject.toml` | Yes | Add optional dependency group | | `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys |
| `docs/source/<benchmark>.mdx` | Yes | User-facing benchmark documentation | | `pyproject.toml` | Yes | Declares benchmark-specific dependencies |
| `docs/source/_toctree.yml` | Yes | Add entry under the "Benchmarks" section | | `docs/source/<benchmark>.mdx` | Yes | User-facing documentation page |
| `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar |
### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`) ### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)
Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates. Create a `gym.Env` subclass that wraps the third-party simulator:
Your env must implement:
```python ```python
class MyBenchmarkEnv(gym.Env): class MyBenchmarkEnv(gym.Env):
@@ -132,26 +150,19 @@ class MyBenchmarkEnv(gym.Env):
self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32) self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)
def reset(self, seed=None, **kwargs): def reset(self, seed=None, **kwargs):
# Reset simulator, return (observation, info) ... # return (observation, info) — info must contain {"is_success": False}
# info must contain {"is_success": False}
...
def step(self, action: np.ndarray): def step(self, action: np.ndarray):
# Step simulator, return (observation, reward, terminated, truncated, info) ... # return (obs, reward, terminated, truncated, info) — info must contain {"is_success": <bool>}
# info must contain {"is_success": <bool>}
# On termination, info must contain "final_info" with success status
...
def render(self): def render(self):
# Return RGB image as numpy array ... # return RGB image as numpy array
...
def close(self): def close(self):
# Clean up simulator resources
... ...
``` ```
Also provide a factory function that returns the standard nested dict: Also provide a factory function that returns the nested dict structure:
```python ```python
def create_mybenchmark_envs( def create_mybenchmark_envs(
@@ -164,14 +175,11 @@ def create_mybenchmark_envs(
... ...
``` ```
See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference. See `create_libero_envs()` (multi-suite, multi-task) and `create_metaworld_envs()` (difficulty-grouped tasks) for reference.
### 2. The config (`src/lerobot/envs/configs.py`) ### 2. The config (`src/lerobot/envs/configs.py`)
Register a new config dataclass. Each config owns its environment creation and processor logic via two methods: Register a config dataclass so users can select your benchmark with `--env.type=<name>`:
- **`create_envs(n_envs, use_async_envs)`** — Returns `{suite: {task_id: VectorEnv}}`. The base class default uses `gym.make()` for single-task envs. Multi-task benchmarks override this.
- **`get_env_processors()`** — Returns `(preprocessor, postprocessor)`. The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.
```python ```python
@EnvConfig.register_subclass("<benchmark_name>") @EnvConfig.register_subclass("<benchmark_name>")
@@ -180,7 +188,6 @@ class MyBenchmarkEnv(EnvConfig):
task: str = "<default_task>" task: str = "<default_task>"
fps: int = <fps> fps: int = <fps>
obs_type: str = "pixels_agent_pos" obs_type: str = "pixels_agent_pos"
# ... benchmark-specific fields ...
features: dict[str, PolicyFeature] = field(default_factory=lambda: { features: dict[str, PolicyFeature] = field(default_factory=lambda: {
ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)), ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
@@ -192,38 +199,48 @@ class MyBenchmarkEnv(EnvConfig):
}) })
def __post_init__(self): def __post_init__(self):
# Populate features based on obs_type ... # populate features based on obs_type
...
@property @property
def gym_kwargs(self) -> dict: def gym_kwargs(self) -> dict:
return {"obs_type": self.obs_type, "render_mode": self.render_mode} return {"obs_type": self.obs_type, "render_mode": self.render_mode}
def create_envs(self, n_envs: int, use_async_envs: bool = False):
"""Override for multi-task benchmarks or custom env creation."""
from lerobot.envs.<benchmark> import create_<benchmark>_envs
return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)
def get_env_processors(self):
"""Override if your benchmark needs observation/action transforms."""
from lerobot.processor.pipeline import PolicyProcessorPipeline
from lerobot.processor.env_processor import MyBenchmarkProcessorStep
return (
PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
PolicyProcessorPipeline(steps=[]),
)
``` ```
Key points: Key points:
- The `register_subclass` name is what users pass as `--env.type=<name>` on the CLI. - The `register_subclass` name is what users pass on the CLI (`--env.type=<name>`).
- `features` declares what the environment produces (used to configure the policy). - `features` tells the policy what the environment produces.
- `features_map` maps raw observation keys to LeRobot convention keys. - `features_map` maps raw observation keys to LeRobot convention keys.
- **No changes to `factory.py` needed** — the factory delegates to `cfg.create_envs()` and `cfg.get_env_processors()` automatically.
### 3. Env processor (optional) (`src/lerobot/processor/env_processor.py`) ### 3. The factory dispatch (`src/lerobot/envs/factory.py`)
If your benchmark needs observation transforms beyond what `preprocess_observation()` handles (e.g., image flipping, coordinate frame conversion), add a `ProcessorStep` and return it from `get_env_processors()` in your config (see step 2): Add a branch in `make_env()` to call your factory function:
```python
elif "<benchmark_name>" in cfg.type:
from lerobot.envs.<benchmark> import create_<benchmark>_envs
if cfg.task is None:
raise ValueError("<BenchmarkName> requires a task to be specified")
return create_<benchmark>_envs(
task=cfg.task,
n_envs=n_envs,
gym_kwargs=cfg.gym_kwargs,
env_cls=env_cls,
)
```
If your benchmark needs an env processor, add it in `make_env_pre_post_processors()`:
```python
if isinstance(env_cfg, MyBenchmarkEnv) or "<benchmark_name>" in env_cfg.type:
preprocessor_steps.append(MyBenchmarkProcessorStep())
```
### 4. Env processor (optional — `src/lerobot/processor/env_processor.py`)
Only needed if your benchmark requires observation transforms beyond what `preprocess_observation()` handles (e.g. image flipping, coordinate conversion):
```python ```python
@dataclass @dataclass
@@ -231,12 +248,11 @@ If your benchmark needs observation transforms beyond what `preprocess_observati
class MyBenchmarkProcessorStep(ObservationProcessorStep): class MyBenchmarkProcessorStep(ObservationProcessorStep):
def _process_observation(self, observation): def _process_observation(self, observation):
processed = observation.copy() processed = observation.copy()
# Your transforms here # your transforms here
return processed return processed
def transform_features(self, features): def transform_features(self, features):
# Update feature declarations if shapes change return features # update if shapes change
return features
def observation(self, observation): def observation(self, observation):
return self._process_observation(observation) return self._process_observation(observation)
@@ -244,20 +260,20 @@ class MyBenchmarkProcessorStep(ObservationProcessorStep):
See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion). See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).
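One detail worth spelling out: `observation.copy()` is a shallow copy, so nested values are shared with the input. The sketch below shows a vertical image flip that replaces the value instead of mutating it. The function is a hypothetical stand-alone helper, not a `ProcessorStep`, and nested lists stand in for image arrays.

```python
def flip_image_vertically(observation: dict, key: str = "observation.image") -> dict:
    """Return a copy of the observation with the image rows reversed."""
    processed = observation.copy()  # shallow copy: nested values are shared
    if key in processed:
        # Slicing builds a new reversed object, so the input stays untouched.
        processed[key] = processed[key][::-1]
    return processed


# Nested lists stand in for an image; rows are the first axis.
obs = {"observation.image": [[1, 2], [3, 4]], "observation.state": [0.0]}
flipped = flip_image_vertically(obs)
```

With NumPy arrays the same `[::-1]` slice also reverses the first axis, returning a view rather than a copy.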
### 4. Dependencies (`pyproject.toml`) ### 5. Dependencies (`pyproject.toml`)
Add a new optional-dependency group under `[project.optional-dependencies]`: Add a new optional-dependency group:
```toml ```toml
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"] mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
``` ```
**Dependency pinning rules:** Pinning rules:
- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`). - **Always pin** benchmark packages to exact versions for reproducibility (e.g. `metaworld==3.0.0`).
- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`). - **Add platform markers** when needed (e.g. `; sys_platform == 'linux'`).
- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility). - **Pin fragile transitive deps** if known (e.g. `gymnasium==1.1.0` for Meta-World).
- **Document version constraints** in the benchmark doc page. - **Document constraints** in your benchmark doc page.
Users install with: Users install with:
@@ -265,13 +281,13 @@ Users install with:
pip install -e ".[mybenchmark]" pip install -e ".[mybenchmark]"
``` ```
### 5. Documentation (`docs/source/<benchmark>.mdx`) ### 6. Documentation (`docs/source/<benchmark>.mdx`)
Follow the template below. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples. Write a user-facing page following the template in the next section. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
### 6. Table of contents (`docs/source/_toctree.yml`) ### 7. Table of contents (`docs/source/_toctree.yml`)
Add your benchmark under the "Benchmarks" section: Add your benchmark to the "Benchmarks" section:
```yaml ```yaml
- sections: - sections:
@@ -286,97 +302,19 @@ Add your benchmark under the "Benchmarks" section:
title: "Benchmarks" title: "Benchmarks"
``` ```
## Benchmark documentation template ## Writing a benchmark doc page
Each benchmark `.mdx` page should follow this structure: Each benchmark `.mdx` page should include:
```markdown - **Title and description** — 1-2 paragraphs on what the benchmark tests and why it matters.
# <Benchmark Name> - **Links** — paper, GitHub repo, project website (if available).
- **Overview image or GIF.**
- **Available tasks** — table of task suites with counts and brief descriptions.
- **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` and `batch_size` for reproducible results. Include single-task and multi-task examples if applicable.
- **Policy inputs and outputs** — observation keys with shapes, action space description.
- **Recommended evaluation episodes** — how many episodes per task is standard.
- **Training** — example `lerobot-train` command.
- **Reproducing published results** — link to pretrained model, eval command, results table (if available).
<1-2 paragraphs: what the benchmark tests and why it matters for robot learning.> See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for complete examples.
- Paper: [<title>](arxiv_url)
- GitHub: [<repo>](github_url)
- Project website: [<name>](url) (if available)
<Overview image or GIF>
## Available tasks
<Table listing task suites or individual tasks, with counts.
For multi-suite benchmarks, describe each suite briefly.>
| Suite | Tasks | Description |
| ----- | ----- | ----------- |
| ... | ... | ... |
## Installation
After following the LeRobot installation instructions:
pip install -e ".[<benchmark>]"
<Any additional steps: environment variables, system packages, etc.>
## Evaluation
### Default evaluation (recommended)
<Command with recommended n_episodes, batch_size for reproducible results.>
### Single-task evaluation
<Command example with --env.task=<single_task>>
### Multi-task evaluation
<Command example with comma-separated tasks, if applicable.>
### Policy inputs and outputs
**Observations:**
- `observation.state` — <shape, description>
- `observation.images.image` — <shape, description>
- ...
**Actions:**
- Continuous control in Box(<low>, <high>, shape=(<dim>,))
### Recommended evaluation episodes
<State how many episodes per task are standard for this benchmark.
E.g., "50 episodes per task (500 total for LIBERO Spatial).">
## Training
<Example lerobot-train command.>
## Reproducing published results
<If available: link to pretrained model, eval command, results table.>
```
## How evaluation works
All benchmarks are evaluated uniformly by `lerobot-eval` (see `src/lerobot/scripts/lerobot_eval.py`).
The `eval_policy_all()` function:
1. Receives the nested `{suite: {task_id: VectorEnv}}` dict from `make_env()`.
2. Iterates over every `(suite, task_id, vec_env)` tuple.
3. For each task, runs `n_episodes` rollouts via `eval_policy()` → `rollout()`.
4. Aggregates results hierarchically: **episode → task → suite → overall**.
5. Reports `pc_success` (success rate), `avg_sum_reward`, `avg_max_reward` at each level.
6. Saves all results to `eval_info.json` with the full config snapshot for reproducibility.
The key contract: your `gym.Env` must return `info["is_success"]` on every `step()`, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop detects task completion.
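The hierarchical aggregation described above can be sketched as follows. This is a simplified illustration, not the code in `lerobot_eval.py`: the per-episode dict layout and the function names are assumptions, and `pc_success` is assumed to be expressed as a percentage.

```python
from statistics import mean


def aggregate(episodes: list[dict]) -> dict:
    """Summarize a list of per-episode results."""
    return {
        "pc_success": 100.0 * mean(1.0 if e["success"] else 0.0 for e in episodes),
        "avg_sum_reward": mean(e["sum_reward"] for e in episodes),
        "avg_max_reward": mean(e["max_reward"] for e in episodes),
        "n_episodes": len(episodes),
    }


def aggregate_all(results: dict[str, dict[int, list[dict]]]) -> dict:
    """Aggregate {suite: {task_id: [episodes]}} per task, per suite, and overall."""
    per_task = {
        (suite, task_id): aggregate(eps)
        for suite, tasks in results.items()
        for task_id, eps in tasks.items()
    }
    per_suite = {
        suite: aggregate([e for eps in tasks.values() for e in eps])
        for suite, tasks in results.items()
    }
    everything = [e for tasks in results.values() for eps in tasks.values() for e in eps]
    return {"per_task": per_task, "per_suite": per_suite, "overall": aggregate(everything)}


results = {
    "suite_a": {0: [
        {"success": True, "sum_reward": 2.0, "max_reward": 1.0},
        {"success": False, "sum_reward": 0.0, "max_reward": 0.0},
    ]},
    "suite_b": {0: [
        {"success": True, "sum_reward": 4.0, "max_reward": 1.0},
        {"success": True, "sum_reward": 4.0, "max_reward": 1.0},
    ]},
}
summary = aggregate_all(results)
```

Pooling episodes at each level weights every episode equally, so suites with more tasks contribute more to the overall numbers.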
## Quick reference: existing benchmarks
| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |