merge: resolve conflicts with docs/adding-benchmarks-guide

Incorporate cleaner writing from the docs branch while reflecting the
refactored dispatch pattern (no factory.py edits needed for new benchmarks).

Made-with: Cursor
Pepijn
2026-04-03 13:45:12 +02:00
+129 -200
# Adding a New Benchmark
This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents: follow the steps in order and use the existing benchmarks as templates.

A benchmark in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments that wrap a third-party simulator (such as LIBERO or Meta-World) behind a standard `gym.Env` interface. The `lerobot-eval` CLI then runs evaluation uniformly across all benchmarks.
## Existing benchmarks at a glance

Before diving in, here is what is already integrated:
| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |
Use `src/lerobot/envs/libero.py` and `src/lerobot/envs/metaworld.py` as reference implementations.
## How it all fits together
### Data flow
During evaluation, data moves through four stages:
```
1. gym.Env ──→ raw observations (numpy dicts)
2. Preprocessing ──→ standard LeRobot keys + task description
   (preprocess_observation, add_envs_task in envs/utils.py)
3. Processors ──→ env-specific then policy-specific transforms
   (env_preprocessor, policy_preprocessor)
4. Policy ──→ select_action() ──→ action tensor

then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step()
```
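To make stage 2 concrete, here is a toy, dependency-free sketch of the key-mapping idea. The real `preprocess_observation()` and `add_envs_task()` in `envs/utils.py` also handle numpy-to-tensor conversion and batching, which are omitted here:

```python
# Toy version of stage 2: map raw env keys to LeRobot keys and attach the task.
def preprocess(raw_obs, task):
    obs = {}
    if "agent_pos" in raw_obs:
        obs["observation.state"] = [float(x) for x in raw_obs["agent_pos"]]
    if "pixels" in raw_obs:
        obs["observation.image"] = raw_obs["pixels"]
    obs["task"] = task  # stage 2 also injects the language instruction
    return obs

raw = {"agent_pos": [0.1, 0.2], "pixels": [[[0, 0, 0]]]}
obs = preprocess(raw, task="push the block")
```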
Most benchmarks only need to care about stage 1 (producing observations in the right format) and, optionally, stage 3 (if env-specific transforms are needed).

### Environment structure

`make_env()` returns a nested dict of vectorized environments:
```python
dict[str, dict[int, gym.vector.VectorEnv]]
# ^suite_name ^task_id ^vectorized env with n_envs parallel copies
```
A single-task env (e.g. PushT) looks like `{"pusht": {0: vec_env}}`.
A multi-task benchmark (e.g. LIBERO) looks like `{"libero_spatial": {0: vec0, 1: vec1, ...}, ...}`.
The eval loop (`eval_policy_all()`) iterates over all suites and tasks uniformly.
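In pure-Python terms, the traversal and the hierarchical success aggregation can be sketched like this (toy stand-ins with hard-coded episode outcomes instead of real `VectorEnv`s):

```python
# Toy stand-ins: per-task episode outcomes instead of real VectorEnvs.
envs = {
    "libero_spatial": {0: [True, True, False], 1: [True, False, False]},
    "libero_object": {0: [True, True, True]},
}

def pc_success(outcomes):
    return 100.0 * sum(outcomes) / len(outcomes)

# Same traversal order eval_policy_all() uses: suite -> task -> episodes.
suite_rates = {}
for suite, tasks in envs.items():
    all_outcomes = [o for outcomes in tasks.values() for o in outcomes]
    suite_rates[suite] = pc_success(all_outcomes)

overall = pc_success([o for tasks in envs.values() for outs in tasks.values() for o in outs])
```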
### How evaluation runs

All benchmarks are evaluated the same way by `lerobot-eval`:

1. `make_env()` builds the nested `{suite: {task_id: VectorEnv}}` dict.
2. `eval_policy_all()` iterates over every suite and task.
3. For each task, it runs `n_episodes` rollouts via `rollout()`.
4. Results are aggregated hierarchically: episode, task, suite, overall.
5. Metrics include `pc_success` (success rate), `avg_sum_reward`, and `avg_max_reward`, and all results are saved to `eval_info.json` with the full config snapshot for reproducibility.

The critical piece: your env must return `info["is_success"]` on every `step()` call, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop knows whether a task was completed.

## The policy-environment contract

There is no enforced schema: `RobotObservation` is typed as `dict[str, Any]`. Instead, LeRobot relies on conventions that all benchmarks follow.

### Required attributes on your `gym.Env`
| Attribute | Type | Used by |
| -------------------- | ----- | -------------------------------------------------------------- |
| `_max_episode_steps` | `int` | `rollout()` — caps episode length |
| `task_description` | `str` | `add_envs_task()` — feeds language instruction to VLA policies |
| `task` | `str` | `add_envs_task()` — fallback if `task_description` is absent |
### Required fields in `info` dict

| Key          | Type   | Used by                                                     |
| ------------ | ------ | ----------------------------------------------------------- |
| `is_success` | `bool` | `eval_policy()` — detects task success                      |
| `final_info` | `dict` | Gymnasium `VectorEnv` — carries per-env info on termination |

### Raw observation format

`preprocess_observation()` expects raw observations to use these keys:
| Raw key | Mapped to | Description |
| --------------------------- | ------------------------------- | -------------------------------------- |
| `"pixels"` (single image) | `observation.image` | Single camera, HWC uint8 |
| `"pixels"` (dict of images) | `observation.images.<cam_name>` | Multiple cameras, each HWC uint8 |
| `"agent_pos"` | `observation.state` | Proprioceptive state vector |
| `"environment_state"` | `observation.env_state` | Environment state (e.g., PushT) |
| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g., LIBERO) |
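For example, a simulator that emits hypothetical `cam_rgb` and `qpos` keys (illustrative names, not from any real benchmark) could be adapted like this:

```python
# Sketch: adapt a hypothetical simulator's raw keys to the standard LeRobot keys.
def _format_raw_obs(sim_obs):
    return {
        "pixels": sim_obs["cam_rgb"],   # HWC uint8 image
        "agent_pos": sim_obs["qpos"],   # proprioceptive state vector
    }

raw = {"cam_rgb": [[[0, 0, 0]]], "qpos": [0.0, 0.1], "sim_internal": "dropped"}
obs = _format_raw_obs(raw)
```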
If your benchmark's raw observations don't match these keys, you have two options:

1. **Preferred**: Map your observations to these standard keys inside your `gym.Env` (e.g., in a `_format_raw_obs()` helper).
2. **Alternative**: Write an env processor that transforms the observations after `preprocess_observation()` runs (see step 3 below).

### Success reporting

Your `step()` and `reset()` must include `"is_success"` in the `info` dict:
```python
info = {"is_success": True} # or False
return observation, reward, terminated, truncated, info
```
### Action space

Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies adapt to different action dimensions through their `input_features` / `output_features` config.
### Feature declaration
Each `EnvConfig` subclass declares two dicts that tell the policy what to expect:

- `features` — maps feature names to `PolicyFeature(type, shape)` (e.g., action dim, image shape); used to configure the policy.
- `features_map` — maps raw observation keys to LeRobot convention keys (e.g., `"agent_pos"` → `"observation.state"`).
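As a concrete sketch, a hypothetical 7-DoF, single-camera benchmark might declare the two dicts as follows. `PolicyFeature` and `FeatureType` are minimal stand-ins here so the snippet is self-contained; use LeRobot's real types in `configs.py`:

```python
from dataclasses import dataclass
from enum import Enum

# Stand-ins so the example is self-contained (not LeRobot's real classes).
class FeatureType(Enum):
    ACTION = "action"
    STATE = "state"
    VISUAL = "visual"

@dataclass
class PolicyFeature:
    type: FeatureType
    shape: tuple

# Hypothetical 7-DoF, single-camera benchmark:
features = {
    "action": PolicyFeature(type=FeatureType.ACTION, shape=(7,)),
    "agent_pos": PolicyFeature(type=FeatureType.STATE, shape=(8,)),
    "pixels": PolicyFeature(type=FeatureType.VISUAL, shape=(3, 256, 256)),
}
features_map = {
    "agent_pos": "observation.state",
    "pixels": "observation.image",
}
```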
## Step by step
<Tip>
At minimum, you need two files: a **gym.Env wrapper** and an **EnvConfig
subclass** with a `create_envs()` override. Everything else is optional or
documentation. No changes to `factory.py` are needed.
</Tip>
### Checklist
| File | Required | Why |
| ---------------------------------------- | -------- | ------------------------------------------------------------ |
| `src/lerobot/envs/<benchmark>.py` | Yes | Wraps the simulator as a standard gym.Env |
| `src/lerobot/envs/configs.py` | Yes | Registers your benchmark and its `create_envs()` for the CLI |
| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms |
| `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys |
| `pyproject.toml` | Yes | Declares benchmark-specific dependencies |
| `docs/source/<benchmark>.mdx` | Yes | User-facing documentation page |
| `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar |
### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)
Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates. Your env must implement:
```python
class MyBenchmarkEnv(gym.Env):
    def __init__(self, ...):
        ...
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        # Reset simulator, return (observation, info).
        # info must contain {"is_success": False}
        ...

    def step(self, action: np.ndarray):
        # Step simulator, return (observation, reward, terminated, truncated, info).
        # info must contain {"is_success": <bool>}; on termination, "final_info"
        # carries the success status.
        ...

    def render(self):
        # Return an RGB image as a numpy array
        ...

    def close(self):
        # Clean up simulator resources
        ...
```
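To make the contract concrete, here is a tiny simulator-free sketch that honors the attribute and `info`-key conventions above (a stand-in, not a real `gym.Env` subclass):

```python
class ToyEnv:
    """Minimal stand-in honoring LeRobot's env conventions (no gymnasium dependency)."""
    _max_episode_steps = 5
    task_description = "reach the goal"

    def __init__(self):
        self._t = 0

    def reset(self, seed=None, **kwargs):
        self._t = 0
        return {"agent_pos": [0.0, 0.0]}, {"is_success": False}

    def step(self, action):
        self._t += 1
        obs = {"agent_pos": [0.1 * self._t, 0.0]}
        success = self._t >= 3          # pretend the goal is reached at step 3
        terminated = success
        truncated = self._t >= self._max_episode_steps
        return obs, float(success), terminated, truncated, {"is_success": success}

env = ToyEnv()
obs, info = env.reset()
for _ in range(3):
    obs, reward, terminated, truncated, info = env.step([0.0])
```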
Also provide a factory function that returns the standard nested dict:
```python
def create_mybenchmark_envs(...) -> dict[str, dict[int, gym.vector.VectorEnv]]:
    ...
```
See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference.
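The assembly pattern can be sketched without any simulator. The `make_vec_env` argument here is a hypothetical hook standing in for wrapping `MyBenchmarkEnv` instances in `gym.vector.SyncVectorEnv` / `AsyncVectorEnv`:

```python
# Sketch: assemble the nested {suite: {task_id: vec_env}} dict.
def create_mybenchmark_envs(suites, n_envs, make_vec_env):
    return {
        suite: {
            task_id: make_vec_env(task, n_envs)
            for task_id, task in enumerate(tasks)
        }
        for suite, tasks in suites.items()
    }

suites = {"easy": ["reach", "push"], "hard": ["stack"]}
envs = create_mybenchmark_envs(suites, n_envs=2, make_vec_env=lambda t, n: (t, n))
```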
### 2. The config (`src/lerobot/envs/configs.py`)
Register a config dataclass so users can select your benchmark with `--env.type=<name>`. Each config owns its environment creation and processor logic via two methods:
- **`create_envs(n_envs, use_async_envs)`** — Returns `{suite: {task_id: VectorEnv}}`. The base class default uses `gym.make()` for single-task envs. Multi-task benchmarks override this.
- **`get_env_processors()`** — Returns `(preprocessor, postprocessor)`. The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.
```python
@EnvConfig.register_subclass("mybenchmark")
@dataclass
class MyBenchmarkEnv(EnvConfig):
    task: str = "<default_task>"
    fps: int = <fps>
    obs_type: str = "pixels_agent_pos"
    # ... benchmark-specific fields ...

    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
        ...
    })

    def __post_init__(self):
        # Populate features based on obs_type
        ...

    @property
    def gym_kwargs(self) -> dict:
        ...
```
Key points:
- The `register_subclass` name is what users pass on the CLI (`--env.type=<name>`).
- `features` declares what the environment produces and is used to configure the policy.
- `features_map` maps raw observation keys to LeRobot convention keys.
- **No changes to `factory.py` needed** — the factory delegates to `cfg.create_envs()` and `cfg.get_env_processors()` automatically.
### 3. Env processor (optional) (`src/lerobot/processor/env_processor.py`)
Only needed if your benchmark requires observation transforms beyond what `preprocess_observation()` handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from `get_env_processors()` in your config (see step 2):
```python
@dataclass
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # Your transforms here
        return processed

    def transform_features(self, features):
        # Update feature declarations if shapes change
        return features

    def observation(self, observation):
        return self._process_observation(observation)
```

See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).
### 4. Dependencies (`pyproject.toml`)
Add a new optional-dependency group under `[project.optional-dependencies]`:
```toml
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
```
**Dependency pinning rules:**
- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`).
- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`).
- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility).
- **Document version constraints** in the benchmark doc page.
Users install with:
```bash
pip install -e ".[mybenchmark]"
```
### 5. Documentation (`docs/source/<benchmark>.mdx`)
Write a user-facing page following the structure described below. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
### 6. Table of contents (`docs/source/_toctree.yml`)
Add your benchmark under the "Benchmarks" section:
```yaml
- sections:
    - local: <benchmark>
      title: <Benchmark Name>
  title: "Benchmarks"
```
## Writing a benchmark doc page
Each benchmark `.mdx` page should include:
- **Title and description** — 1-2 paragraphs on what the benchmark tests and why it matters.
- **Links** — paper, GitHub repo, project website (if available).
- **Overview image or GIF.**
- **Available tasks** — table of task suites with counts and brief descriptions.
- **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` and `batch_size` for reproducible results. Include single-task and multi-task examples if applicable.
- **Policy inputs and outputs** — observation keys with shapes, action space description.
- **Recommended evaluation episodes** — how many episodes per task is standard.
- **Training** — example `lerobot-train` command.
- **Reproducing published results** — link to pretrained model, eval command, results table (if available).