# Adding a New Benchmark

This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents: follow the steps in order and use the referenced files as templates.

A "benchmark" in LeRobot is a set of gymnasium environments used for standardized evaluation. Each benchmark wraps a third-party simulator (e.g., LIBERO, Meta-World) behind a `gym.Env` interface, and the `lerobot-eval` script drives evaluation uniformly across all benchmarks.

## Architecture overview

### Observation and action data flow

During evaluation, observations and actions flow through a multi-stage pipeline:

```
gym.Env.reset() / step()
          │
          ▼
raw observation (dict[str, Any])
preprocess_observation()    # envs/utils.py — numpy→tensor, key mapping
          │
          ▼
LeRobot-format observation
add_envs_task()             # envs/utils.py — injects task description
          │
          ▼
env_preprocessor            # processor/env_processor.py — env-specific transforms
          │
          ▼
policy_preprocessor         # per-policy normalization, device transfer
          │
          ▼
policy.select_action()      # PreTrainedPolicy — returns action tensor
          │
          ▼
policy_postprocessor        # per-policy denormalization
          │
          ▼
env_postprocessor           # env-specific action transforms
          │
          ▼
numpy action
gym.Env.step(action)
```

### Environment return shape

`make_env()` returns a nested dict:

```python
dict[str, dict[int, gym.vector.VectorEnv]]
#    ^suite_name    ^task_id  ^vectorized env with n_envs parallel copies
```

For single-task environments (e.g., PushT), this is `{"pusht": {0: vec_env}}`. For multi-task benchmarks (e.g., LIBERO), this is `{"libero_spatial": {0: vec0, 1: vec1, ...}, "libero_object": {0: ..., ...}}`. The eval loop (`eval_policy_all()`) iterates over all suites and tasks uniformly.

## The policy-environment contract

There is no enforced schema: `RobotObservation` is typed as `dict[str, Any]`.
Instead, LeRobot relies on conventions:

### Required attributes on your `gym.Env`

| Attribute            | Type  | Used by                                                        |
| -------------------- | ----- | -------------------------------------------------------------- |
| `_max_episode_steps` | `int` | `rollout()` — caps episode length                              |
| `task_description`   | `str` | `add_envs_task()` — feeds language instruction to VLA policies |
| `task`               | `str` | `add_envs_task()` — fallback if `task_description` is absent   |

### Required fields in `info` dict

| Key          | Type   | Used by                                                     |
| ------------ | ------ | ----------------------------------------------------------- |
| `is_success` | `bool` | `eval_policy()` — detects task success                      |
| `final_info` | `dict` | Gymnasium `VectorEnv` — carries per-env info on termination |

### Raw observation format

`preprocess_observation()` expects raw observations to use these keys:

| Raw key                     | Mapped to                     | Description                            |
| --------------------------- | ----------------------------- | -------------------------------------- |
| `"pixels"` (single image)   | `observation.image`           | Single camera, HWC uint8               |
| `"pixels"` (dict of images) | `observation.images.<camera>` | Multiple cameras, each HWC uint8       |
| `"agent_pos"`               | `observation.state`           | Proprioceptive state vector            |
| `"environment_state"`       | `observation.env_state`       | Environment state (e.g., PushT)        |
| `"robot_state"`             | `observation.robot_state`     | Nested robot state dict (e.g., LIBERO) |

If your benchmark's raw observations don't match these keys, you have two options:

1. **Preferred**: Map your observations to these standard keys inside your `gym.Env._format_raw_obs()` method.
2. **Alternative**: Write an env processor that transforms the observations after `preprocess_observation()` runs.

### Action space

Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies handle the dimension mismatch via their `input_features` / `output_features` config.
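As an illustration of the key-mapping convention above, here is a simplified, hypothetical mapper. It is **not** the real `preprocess_observation()` (which also converts numpy arrays to torch tensors and handles image layout); it only demonstrates the key renaming from the table:

```python
def map_raw_keys(raw: dict) -> dict:
    """Sketch of the raw-key -> LeRobot-key convention (key renaming only)."""
    out = {}
    pixels = raw.get("pixels")
    if isinstance(pixels, dict):  # multiple cameras: one key per camera
        for cam, img in pixels.items():
            out[f"observation.images.{cam}"] = img
    elif pixels is not None:      # single camera
        out["observation.image"] = pixels
    if "agent_pos" in raw:
        out["observation.state"] = raw["agent_pos"]
    if "environment_state" in raw:
        out["observation.env_state"] = raw["environment_state"]
    if "robot_state" in raw:
        out["observation.robot_state"] = raw["robot_state"]
    return out

obs = map_raw_keys({"pixels": {"front": "img0", "wrist": "img1"}, "agent_pos": [0.1, 0.2]})
print(sorted(obs))
# ['observation.images.front', 'observation.images.wrist', 'observation.state']
```

If your simulator's native keys differ, option 1 above amounts to implementing exactly this kind of renaming inside your env before returning observations.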
### Feature declaration

Each `EnvConfig` subclass declares:

- `features`: dict mapping feature names to `PolicyFeature(type, shape)` — tells the policy what to expect.
- `features_map`: dict mapping raw env keys to LeRobot convention keys (e.g., `"agent_pos" → "observation.state"`).

## Files to create or modify

### Checklist

| File                                     | Required | Description                                                                         |
| ---------------------------------------- | -------- | ----------------------------------------------------------------------------------- |
| `src/lerobot/envs/<benchmark>.py`        | Yes      | `gym.Env` subclass + `create_<benchmark>_envs()` factory                            |
| `src/lerobot/envs/configs.py`            | Yes      | `@EnvConfig.register_subclass("<benchmark>")` dataclass                             |
| `src/lerobot/envs/factory.py`            | Yes      | Add dispatch branch in `make_env()` and optionally `make_env_pre_post_processors()` |
| `src/lerobot/processor/env_processor.py` | Optional | `ProcessorStep` subclass for env-specific observation transforms                    |
| `src/lerobot/envs/utils.py`              | Optional | Extend `preprocess_observation()` if new raw keys are needed                        |
| `pyproject.toml`                         | Yes      | Add optional dependency group                                                       |
| `docs/source/<benchmark>.mdx`            | Yes      | User-facing benchmark documentation                                                 |
| `docs/source/_toctree.yml`               | Yes      | Add entry under the "Benchmarks" section                                            |

### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)

Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates. Your env must implement:

```python
class MyBenchmarkEnv(gym.Env):
    metadata = {"render_modes": ["rgb_array"], "render_fps": <fps>}

    def __init__(self, task_suite, task_id, ...):
        super().__init__()
        self.task = <task name>
        self.task_description = <language instruction>
        self._max_episode_steps = <max steps>
        self.observation_space = spaces.Dict({...})
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        # Reset simulator, return (observation, info)
        # info must contain {"is_success": False}
        ...
    def step(self, action: np.ndarray):
        # Step simulator, return (observation, reward, terminated, truncated, info)
        # info must contain {"is_success": <bool>}
        # On termination, info must contain "final_info" with success status
        ...

    def render(self):
        # Return RGB image as numpy array
        ...

    def close(self):
        # Clean up simulator resources
        ...
```

Also provide a factory function that returns the standard nested dict:

```python
def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    gym_kwargs: dict | None = None,
    env_cls: type | None = None,
) -> dict[str, dict[int, Any]]:
    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
    ...
```

See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference.

### 2. The config (`src/lerobot/envs/configs.py`)

Register a new config dataclass:

```python
@EnvConfig.register_subclass("<benchmark>")
@dataclass
class MyBenchmarkEnv(EnvConfig):
    task: str = ""
    fps: int = <fps>
    obs_type: str = "pixels_agent_pos"
    # ... benchmark-specific fields ...
    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
    })
    features_map: dict[str, str] = field(default_factory=lambda: {
        ACTION: ACTION,
        "agent_pos": OBS_STATE,
        "pixels": OBS_IMAGE,
    })

    def __post_init__(self):
        # Populate features based on obs_type
        ...

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}
```

Key points:

- The `register_subclass` name is what users pass as `--env.type=<benchmark>` on the CLI.
- `features` declares what the environment produces (used to configure the policy).
- `features_map` maps raw observation keys to LeRobot convention keys.

### 3. The factory dispatch (`src/lerobot/envs/factory.py`)

Add a branch in `make_env()`:

```python
elif "<benchmark>" in cfg.type:
    from lerobot.envs.<benchmark> import create_<benchmark>_envs

    if cfg.task is None:
        raise ValueError("<benchmark> requires a task to be specified")
    return create_<benchmark>_envs(
        task=cfg.task,
        n_envs=n_envs,
        gym_kwargs=cfg.gym_kwargs,
        env_cls=env_cls,
    )
```

If your benchmark needs an env processor, add it in `make_env_pre_post_processors()`:

```python
if isinstance(env_cfg, MyBenchmarkEnv) or "<benchmark>" in env_cfg.type:
    preprocessor_steps.append(MyBenchmarkProcessorStep())
```

### 4. Env processor (optional) (`src/lerobot/processor/env_processor.py`)

If your benchmark needs observation transforms beyond what `preprocess_observation()` handles (e.g., image flipping, coordinate frame conversion), add a `ProcessorStep`:

```python
@dataclass
@ProcessorStepRegistry.register(name="<benchmark>_processor")
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # Your transforms here
        return processed

    def transform_features(self, features):
        # Update feature declarations if shapes change
        return features

    def observation(self, observation):
        return self._process_observation(observation)
```

See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).

### 5. Dependencies (`pyproject.toml`)

Add a new optional-dependency group under `[project.optional-dependencies]`:

```toml
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
```

**Dependency pinning rules:**

- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`).
- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`).
- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility).
- **Document version constraints** in the benchmark doc page.

Users install with:

```bash
pip install -e ".[mybenchmark]"
```

### 6. Documentation (`docs/source/<benchmark>.mdx`)

Follow the template below.
See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.

### 7. Table of contents (`docs/source/_toctree.yml`)

Add your benchmark under the "Benchmarks" section:

```yaml
- sections:
    - local: libero
      title: LIBERO
    - local: metaworld
      title: Meta-World
    - local: envhub_isaaclab_arena
      title: NVIDIA IsaacLab Arena Environments
    - local: <benchmark>
      title: <Benchmark Name>
  title: "Benchmarks"
```

## Benchmark documentation template

Each benchmark `.mdx` page should follow this structure:

```markdown
# <Benchmark Name>

<1-2 paragraphs: what the benchmark tests and why it matters for robot learning.>

- Paper: [<title>](arxiv_url)
- GitHub: [<repo>](github_url)
- Project website: [<name>](url) (if available)

<Overview image or GIF>

## Available tasks

<Table listing task suites or individual tasks, with counts. For multi-suite
benchmarks, describe each suite briefly.>

| Suite | Tasks | Description |
| ----- | ----- | ----------- |
| ...   | ...   | ...         |

## Installation

After following the LeRobot installation instructions:

    pip install -e ".[<benchmark>]"

<Any additional steps: environment variables, system packages, etc.>

## Evaluation

### Default evaluation (recommended)

<Command with recommended n_episodes, batch_size for reproducible results.>

### Single-task evaluation

<Command example with --env.task=<single_task>>

### Multi-task evaluation

<Command example with comma-separated tasks, if applicable.>

### Policy inputs and outputs

**Observations:**

- `observation.state` — <shape, description>
- `observation.images.image` — <shape, description>
- ...

**Actions:**

- Continuous control in Box(<low>, <high>, shape=(<dim>,))

### Recommended evaluation episodes

<State how many episodes per task are standard for this benchmark.
E.g., "50 episodes per task (500 total for LIBERO Spatial).">

## Training

<Example lerobot-train command.>

## Reproducing published results

<If available: link to pretrained model, eval command, results table.>
```

## How evaluation works

All benchmarks are evaluated uniformly by `lerobot-eval` (see `src/lerobot/scripts/lerobot_eval.py`). The `eval_policy_all()` function:

1. Receives the nested `{suite: {task_id: VectorEnv}}` dict from `make_env()`.
2. Iterates over every `(suite, task_id, vec_env)` tuple.
3. For each task, runs `n_episodes` rollouts via `eval_policy()` → `rollout()`.
4. Aggregates results hierarchically: **episode → task → suite → overall**.
5. Reports `pc_success` (success rate), `avg_sum_reward`, `avg_max_reward` at each level.
6. Saves all results to `eval_info.json` with the full config snapshot for reproducibility.

The key contract: your `gym.Env` must return `info["is_success"]` on every `step()`, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop detects task completion.

## Quick reference: existing benchmarks

| Benchmark      | Env file            | Config class       | Tasks               | Action dim   | Processor                    |
| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
| LIBERO         | `envs/libero.py`    | `LiberoEnv`        | 130 across 5 suites | 7            | `LiberoProcessorStep`        |
| Meta-World     | `envs/metaworld.py` | `MetaworldEnv`     | 50 (MT50)           | 4            | None                         |
| IsaacLab Arena | Hub-hosted          | `IsaaclabArenaEnv` | Configurable        | Configurable | `IsaaclabArenaProcessorStep` |
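The hierarchical aggregation in step 4 of "How evaluation works" can be sketched with plain Python. The success flags and the mean-of-means rollup below are illustrative assumptions, not the exact arithmetic of `eval_policy_all()` (which also aggregates rewards and writes `eval_info.json`):

```python
# Hypothetical per-episode success flags, shaped {suite: {task_id: [bool, ...]}}.
results = {
    "libero_spatial": {0: [True, True, False, True], 1: [True, False, False, True]},
    "libero_object": {0: [True, True, True, True]},
}

def pc_success(flags):
    # Percentage of successful episodes (episode-level -> task-level).
    return 100.0 * sum(flags) / len(flags)

# Task level -> suite level (mean over tasks) -> overall (mean over suites).
task_rates = {s: {t: pc_success(f) for t, f in tasks.items()} for s, tasks in results.items()}
suite_rates = {s: sum(r.values()) / len(r) for s, r in task_rates.items()}
overall = sum(suite_rates.values()) / len(suite_rates)

print(task_rates["libero_spatial"][0])  # 75.0
print(suite_rates["libero_spatial"])    # 62.5
print(overall)                          # 81.25
```

Whether the real implementation averages suite means or pools all episodes changes the overall number when suites have different task counts; consult `eval_policy_all()` for the authoritative behavior.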