mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-16 09:09:48 +00:00
5ad4c8f7b6
Replace hardcoded if/elif chains in factory.py with create_envs() and get_env_processors() methods on EnvConfig. New benchmarks now only need to register a config subclass — no factory.py edits required. Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic. Made-with: Cursor
383 lines
15 KiB
Plaintext
# Adding a New Benchmark
This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents: follow the steps in order and use the referenced files as templates.

A "benchmark" in LeRobot is a set of gymnasium environments used for standardized evaluation. Each benchmark wraps a third-party simulator (e.g., LIBERO, Meta-World) behind a `gym.Env` interface, and the `lerobot-eval` script drives evaluation uniformly across all benchmarks.

## Architecture overview

### Observation and action data flow

During evaluation, observations and actions flow through a multi-stage pipeline:

```
gym.Env.reset() / step()
        │
        ▼ raw observation (dict[str, Any])
preprocess_observation()   # envs/utils.py — numpy→tensor, key mapping
        │
        ▼ LeRobot-format observation
add_envs_task()            # envs/utils.py — injects task description
        │
        ▼
env_preprocessor           # processor/env_processor.py — env-specific transforms
        │
        ▼
policy_preprocessor        # per-policy normalization, device transfer
        │
        ▼
policy.select_action()     # PreTrainedPolicy — returns action tensor
        │
        ▼
policy_postprocessor       # per-policy denormalization
        │
        ▼
env_postprocessor          # env-specific action transforms
        │
        ▼ numpy action
gym.Env.step(action)
```

### Environment return shape

`make_env()` returns a nested dict:

```python
dict[str, dict[int, gym.vector.VectorEnv]]
#    ^suite_name  ^task_id  ^vectorized env with n_envs parallel copies
```

For single-task environments (e.g., PushT), this is `{"pusht": {0: vec_env}}`.

For multi-task benchmarks (e.g., LIBERO), this is `{"libero_spatial": {0: vec0, 1: vec1, ...}, "libero_object": {0: ..., ...}}`.

The eval loop (`eval_policy_all()`) iterates over all suites and tasks uniformly.
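
To make the traversal concrete, here is a minimal sketch of how a consumer walks the nested dict. The `DummyVecEnv` stand-in and the suite names are illustrative, not LeRobot API:

```python
class DummyVecEnv:
    """Stand-in for gym.vector.VectorEnv, for illustration only."""

    def __init__(self, name: str):
        self.name = name


envs = {
    "libero_spatial": {0: DummyVecEnv("spatial_0"), 1: DummyVecEnv("spatial_1")},
    "libero_object": {0: DummyVecEnv("object_0")},
}

# eval_policy_all()-style traversal: every (suite, task_id, vec_env) triple.
triples = [
    (suite, task_id, vec_env)
    for suite, tasks in envs.items()
    for task_id, vec_env in tasks.items()
]
```
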
## The policy-environment contract

There is no enforced schema: `RobotObservation` is typed as `dict[str, Any]`. Instead, LeRobot relies on conventions:

### Required attributes on your `gym.Env`

| Attribute            | Type  | Used by                                                        |
| -------------------- | ----- | -------------------------------------------------------------- |
| `_max_episode_steps` | `int` | `rollout()` — caps episode length                              |
| `task_description`   | `str` | `add_envs_task()` — feeds language instruction to VLA policies |
| `task`               | `str` | `add_envs_task()` — fallback if `task_description` is absent   |

### Required fields in `info` dict
| Key          | Type   | Used by                                                     |
| ------------ | ------ | ----------------------------------------------------------- |
| `is_success` | `bool` | `eval_policy()` — detects task success                      |
| `final_info` | `dict` | Gymnasium `VectorEnv` — carries per-env info on termination |

### Raw observation format

`preprocess_observation()` expects raw observations to use these keys:

| Raw key                     | Mapped to                       | Description                            |
| --------------------------- | ------------------------------- | -------------------------------------- |
| `"pixels"` (single image)   | `observation.image`             | Single camera, HWC uint8               |
| `"pixels"` (dict of images) | `observation.images.<cam_name>` | Multiple cameras, each HWC uint8       |
| `"agent_pos"`               | `observation.state`             | Proprioceptive state vector            |
| `"environment_state"`       | `observation.env_state`         | Environment state (e.g., PushT)        |
| `"robot_state"`             | `observation.robot_state`       | Nested robot state dict (e.g., LIBERO) |

If your benchmark's raw observations don't match these keys, you have two options:

1. **Preferred**: Map your observations to these standard keys inside your `gym.Env._format_raw_obs()` method.
2. **Alternative**: Write an env processor that transforms the observations after `preprocess_observation()` runs.

### Action space

Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies handle the dimension mismatch via their `input_features` / `output_features` config.

### Feature declaration

Each `EnvConfig` subclass declares:

- `features`: dict mapping feature names to `PolicyFeature(type, shape)` — tells the policy what to expect.
- `features_map`: dict mapping raw env keys to LeRobot convention keys (e.g., `"agent_pos" → "observation.state"`).

## Files to create or modify

### Checklist

| File                                     | Required | Description                                                                      |
| ---------------------------------------- | -------- | -------------------------------------------------------------------------------- |
| `src/lerobot/envs/<benchmark>.py`        | Yes      | `gym.Env` subclass + `create_<benchmark>_envs()` factory                         |
| `src/lerobot/envs/configs.py`            | Yes      | `@EnvConfig.register_subclass("<name>")` dataclass with `create_envs()` override |
| `src/lerobot/processor/env_processor.py` | Optional | `ProcessorStep` subclass for env-specific observation transforms                 |
| `src/lerobot/envs/utils.py`              | Optional | Extend `preprocess_observation()` if new raw keys are needed                     |
| `pyproject.toml`                         | Yes      | Add optional dependency group                                                    |
| `docs/source/<benchmark>.mdx`            | Yes      | User-facing benchmark documentation                                              |
| `docs/source/_toctree.yml`               | Yes      | Add entry under the "Benchmarks" section                                         |

### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)

Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates.

Your env must implement:

```python
class MyBenchmarkEnv(gym.Env):
    metadata = {"render_modes": ["rgb_array"], "render_fps": <fps>}

    def __init__(self, task_suite, task_id, ...):
        super().__init__()
        self.task = <task_name_string>
        self.task_description = <natural_language_instruction>
        self._max_episode_steps = <max_steps>
        self.observation_space = spaces.Dict({...})
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        # Reset simulator, return (observation, info)
        # info must contain {"is_success": False}
        ...

    def step(self, action: np.ndarray):
        # Step simulator, return (observation, reward, terminated, truncated, info)
        # info must contain {"is_success": <bool>}
        # On termination, info must contain "final_info" with success status
        ...

    def render(self):
        # Return RGB image as numpy array
        ...

    def close(self):
        # Clean up simulator resources
        ...
```

Also provide a factory function that returns the standard nested dict:

```python
def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    gym_kwargs: dict | None = None,
    env_cls: type | None = None,
) -> dict[str, dict[int, Any]]:
    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
    ...
```

See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference.

### 2. The config (`src/lerobot/envs/configs.py`)

Register a new config dataclass. Each config owns its environment creation and processor logic via two methods:

- **`create_envs(n_envs, use_async_envs)`** — Returns `{suite: {task_id: VectorEnv}}`. The base class default uses `gym.make()` for single-task envs. Multi-task benchmarks override this.
- **`get_env_processors()`** — Returns `(preprocessor, postprocessor)`. The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.

```python
@EnvConfig.register_subclass("<benchmark_name>")
@dataclass
class MyBenchmarkEnv(EnvConfig):
    task: str = "<default_task>"
    fps: int = <fps>
    obs_type: str = "pixels_agent_pos"
    # ... benchmark-specific fields ...

    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
    })
    features_map: dict[str, str] = field(default_factory=lambda: {
        ACTION: ACTION,
        "agent_pos": OBS_STATE,
        "pixels": OBS_IMAGE,
    })

    def __post_init__(self):
        # Populate features based on obs_type
        ...

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}

    def create_envs(self, n_envs: int, use_async_envs: bool = False):
        """Override for multi-task benchmarks or custom env creation."""
        from lerobot.envs.<benchmark> import create_<benchmark>_envs

        return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)

    def get_env_processors(self):
        """Override if your benchmark needs observation/action transforms."""
        from lerobot.processor.pipeline import PolicyProcessorPipeline
        from lerobot.processor.env_processor import MyBenchmarkProcessorStep

        return (
            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
            PolicyProcessorPipeline(steps=[]),
        )
```

Key points:

- The `register_subclass` name is what users pass as `--env.type=<name>` on the CLI.
- `features` declares what the environment produces (used to configure the policy).
- `features_map` maps raw observation keys to LeRobot convention keys.
- **No changes to `factory.py` needed** — the factory delegates to `cfg.create_envs()` and `cfg.get_env_processors()` automatically.

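The delegation in the last point can be sketched as follows. These are simplified stand-ins for the real classes in `envs/configs.py` and `envs/factory.py` (the real `create_envs()` calls `gym.make()` and returns actual `VectorEnv` objects):

```python
class EnvConfig:
    """Simplified stand-in for the real base class."""

    def create_envs(self, n_envs: int, use_async_envs: bool = False):
        # Base-class default: single-task creation (gym.make() in the real code).
        return {"default": {0: f"vec_env(n={n_envs})"}}

    def get_env_processors(self):
        # Base-class default: identity (no-op) pipelines.
        return (None, None)


class MyBenchmarkEnv(EnvConfig):
    def create_envs(self, n_envs: int, use_async_envs: bool = False):
        # Multi-task benchmarks override creation; no factory edits needed.
        return {"mybenchmark": {0: f"vec_env(n={n_envs})"}}


def make_env(cfg: EnvConfig, n_envs: int, use_async_envs: bool = False):
    # The factory simply delegates; no if/elif chain over env types.
    return cfg.create_envs(n_envs, use_async_envs)
```
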
### 3. Env processor (optional) (`src/lerobot/processor/env_processor.py`)

If your benchmark needs observation transforms beyond what `preprocess_observation()` handles (e.g., image flipping, coordinate frame conversion), add a `ProcessorStep` and return it from `get_env_processors()` in your config (see step 2):

```python
@dataclass
@ProcessorStepRegistry.register(name="<benchmark>_processor")
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # Your transforms here
        return processed

    def transform_features(self, features):
        # Update feature declarations if shapes change
        return features

    def observation(self, observation):
        return self._process_observation(observation)
```

See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).

### 4. Dependencies (`pyproject.toml`)

Add a new optional-dependency group under `[project.optional-dependencies]`:

```toml
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
```

**Dependency pinning rules:**

- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`).
- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`).
- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility).
- **Document version constraints** in the benchmark doc page.

Users install with:
```bash
pip install -e ".[mybenchmark]"
```

### 5. Documentation (`docs/source/<benchmark>.mdx`)

Follow the template below. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.

### 6. Table of contents (`docs/source/_toctree.yml`)

Add your benchmark under the "Benchmarks" section:

```yaml
- sections:
    - local: libero
      title: LIBERO
    - local: metaworld
      title: Meta-World
    - local: envhub_isaaclab_arena
      title: NVIDIA IsaacLab Arena Environments
    - local: <your_benchmark>
      title: <Your Benchmark Name>
  title: "Benchmarks"
```

## Benchmark documentation template

Each benchmark `.mdx` page should follow this structure:

```markdown
# <Benchmark Name>

<1-2 paragraphs: what the benchmark tests and why it matters for robot learning.>

- Paper: [<title>](arxiv_url)
- GitHub: [<repo>](github_url)
- Project website: [<name>](url) (if available)

<Overview image or GIF>

## Available tasks

<Table listing task suites or individual tasks, with counts.
For multi-suite benchmarks, describe each suite briefly.>

| Suite | Tasks | Description |
| ----- | ----- | ----------- |
| ...   | ...   | ...         |

## Installation

After following the LeRobot installation instructions:

    pip install -e ".[<benchmark>]"

<Any additional steps: environment variables, system packages, etc.>

## Evaluation

### Default evaluation (recommended)

<Command with recommended n_episodes, batch_size for reproducible results.>

### Single-task evaluation

<Command example with --env.task=<single_task>>

### Multi-task evaluation

<Command example with comma-separated tasks, if applicable.>

### Policy inputs and outputs

**Observations:**

- `observation.state` — <shape, description>
- `observation.images.image` — <shape, description>
- ...

**Actions:**

- Continuous control in Box(<low>, <high>, shape=(<dim>,))

### Recommended evaluation episodes

<State how many episodes per task are standard for this benchmark.
E.g., "50 episodes per task (500 total for LIBERO Spatial).">

## Training

<Example lerobot-train command.>

## Reproducing published results

<If available: link to pretrained model, eval command, results table.>
```

## How evaluation works

All benchmarks are evaluated uniformly by `lerobot-eval` (see `src/lerobot/scripts/lerobot_eval.py`).

The `eval_policy_all()` function:

1. Receives the nested `{suite: {task_id: VectorEnv}}` dict from `make_env()`.
2. Iterates over every `(suite, task_id, vec_env)` tuple.
3. For each task, runs `n_episodes` rollouts via `eval_policy()` → `rollout()`.
4. Aggregates results hierarchically: **episode → task → suite → overall**.
5. Reports `pc_success` (success rate), `avg_sum_reward`, `avg_max_reward` at each level.
6. Saves all results to `eval_info.json` with the full config snapshot for reproducibility.

The key contract: your `gym.Env` must return `info["is_success"]` on every `step()`, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop detects task completion.
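
The success-detection side of that contract can be sketched as below. The `count_successes` helper and the shape of `step_infos` are illustrative, not LeRobot's exact eval-loop code; the `final_info` / `is_success` keys are the real convention:

```python
def count_successes(step_infos):
    """Count successful episode terminations.

    step_infos: list of (terminated, info) pairs from one env's steps.
    """
    successes = 0
    for terminated, info in step_infos:
        if terminated:
            # On termination, VectorEnv surfaces the terminal step's
            # info under "final_info"; fall back to info itself.
            final = info.get("final_info", info)
            successes += bool(final.get("is_success", False))
    return successes
```
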
## Quick reference: existing benchmarks
| Benchmark      | Env file            | Config class       | Tasks               | Action dim   | Processor                    |
| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
| LIBERO         | `envs/libero.py`    | `LiberoEnv`        | 130 across 5 suites | 7            | `LiberoProcessorStep`        |
| Meta-World     | `envs/metaworld.py` | `MetaworldEnv`     | 50 (MT50)           | 4            | None                         |
| IsaacLab Arena | Hub-hosted          | `IsaaclabArenaEnv` | Configurable        | Configurable | `IsaaclabArenaProcessorStep` |