From 77415559b868958d748d815e5b978e9fbde15318 Mon Sep 17 00:00:00 2001 From: Pepijn Date: Fri, 3 Apr 2026 13:36:16 +0200 Subject: [PATCH] docs(benchmarks): clean up adding-benchmarks guide for clarity Rewrite for simpler language, better structure, and easier navigation. Move quick-reference table to the top, fold eval explanation into architecture section, condense the doc template to a bulleted outline. Made-with: Cursor --- docs/source/adding_benchmarks.mdx | 335 ++++++++++++------------------ 1 file changed, 132 insertions(+), 203 deletions(-) diff --git a/docs/source/adding_benchmarks.mdx b/docs/source/adding_benchmarks.mdx index 3ba606dd2..db599bb3c 100644 --- a/docs/source/adding_benchmarks.mdx +++ b/docs/source/adding_benchmarks.mdx @@ -1,124 +1,141 @@ # Adding a New Benchmark -This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents follow the steps in order and use the referenced files as templates. +This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates. -A "benchmark" in LeRobot is a set of gymnasium environments used for standardized evaluation. Each benchmark wraps a third-party simulator (e.g., LIBERO, Meta-World) behind a `gym.Env` interface, and the `lerobot-eval` script drives evaluation uniformly across all benchmarks. +A benchmark in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard `gym.Env` interface. The `lerobot-eval` CLI then runs evaluation uniformly across all benchmarks. 
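To make the wrapping idea concrete before the details, here is a stdlib-only toy sketch. All class and key names below are hypothetical; a real benchmark subclasses `gym.Env` from Gymnasium and wraps an actual simulator.

```python
class ToySimulator:
    """Stand-in for a third-party simulator with its own API."""

    def __init__(self):
        self.t = 0

    def restart(self):
        self.t = 0
        return {"joint_positions": [0.0] * 7}

    def advance(self, action):
        self.t += 1
        return {"joint_positions": [0.1 * self.t] * 7}


class ToyBenchmarkEnv:
    """Mimics the gym.Env interface LeRobot expects (reset/step + info dict)."""

    _max_episode_steps = 5
    task_description = "reach the target"

    def __init__(self):
        self.sim = ToySimulator()

    def reset(self, seed=None):
        raw = self.sim.restart()
        return self._format_raw_obs(raw), {"is_success": False}

    def step(self, action):
        raw = self.sim.advance(action)
        done = self.sim.t >= self._max_episode_steps
        info = {"is_success": done}  # the eval loop reads this on every step
        return self._format_raw_obs(raw), 0.0, done, False, info

    def _format_raw_obs(self, raw):
        # Map simulator-specific keys to the conventional "agent_pos" key.
        return {"agent_pos": raw["joint_positions"]}


env = ToyBenchmarkEnv()
obs, info = env.reset()
for _ in range(env._max_episode_steps):
    obs, reward, terminated, truncated, info = env.step([0.0] * 7)
print(info["is_success"])  # True once the episode ends
```

The real wrapper also defines `observation_space` and `action_space` and is vectorized via `gym.vector`; this sketch only shows the reset/step/success contract that the rest of this guide fills in.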
-## Architecture overview +## Existing benchmarks at a glance -### Observation and action data flow +Before diving in, here is what is already integrated: -During evaluation, observations and actions flow through a multi-stage pipeline: +| Benchmark | Env file | Config class | Tasks | Action dim | Processor | +| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- | +| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` | +| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None | +| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` | + +Use `src/lerobot/envs/libero.py` and `src/lerobot/envs/metaworld.py` as reference implementations. + +## How it all fits together + +### Data flow + +During evaluation, data moves through four stages: ``` -gym.Env.reset() / step() - │ - ▼ raw observation (dict[str, Any]) -preprocess_observation() # envs/utils.py — numpy→tensor, key mapping - │ - ▼ LeRobot-format observation -add_envs_task() # envs/utils.py — injects task description - │ - ▼ -env_preprocessor # processor/env_processor.py — env-specific transforms - │ - ▼ -policy_preprocessor # per-policy normalization, device transfer - │ - ▼ -policy.select_action() # PreTrainedPolicy — returns action tensor - │ - ▼ -policy_postprocessor # per-policy denormalization - │ - ▼ -env_postprocessor # env-specific action transforms - │ - ▼ numpy action -gym.Env.step(action) +1. gym.Env ──→ raw observations (numpy dicts) + +2. Preprocessing ──→ standard LeRobot keys + task description + (preprocess_observation, add_envs_task in envs/utils.py) + +3. Processors ──→ env-specific then policy-specific transforms + (env_preprocessor, policy_preprocessor) + +4. 
Policy ──→ select_action() ──→ action tensor + then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step() ``` -### Environment return shape +Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed). -`make_env()` returns a nested dict: +### Environment structure + +`make_env()` returns a nested dict of vectorized environments: ```python dict[str, dict[int, gym.vector.VectorEnv]] -# ^suite_name ^task_id ^vectorized env with n_envs parallel copies +# ^suite ^task_id ``` -For single-task environments (e.g., PushT), this is `{"pusht": {0: vec_env}}`. -For multi-task benchmarks (e.g., LIBERO), this is `{"libero_spatial": {0: vec0, 1: vec1, ...}, "libero_object": {0: ..., ...}}`. +A single-task env (e.g. PushT) looks like `{"pusht": {0: vec_env}}`. +A multi-task benchmark (e.g. LIBERO) looks like `{"libero_spatial": {0: vec0, 1: vec1, ...}, ...}`. -The eval loop (`eval_policy_all()`) iterates over all suites and tasks uniformly. +### How evaluation runs -## The policy-environment contract +All benchmarks are evaluated the same way by `lerobot-eval`: -There is no enforced schema: `RobotObservation` is typed as `dict[str, Any]`. Instead, LeRobot relies on conventions: +1. `make_env()` builds the nested `{suite: {task_id: VectorEnv}}` dict. +2. `eval_policy_all()` iterates over every suite and task. +3. For each task, it runs `n_episodes` rollouts via `rollout()`. +4. Results are aggregated hierarchically: episode, task, suite, overall. +5. Metrics include `pc_success` (success rate), `avg_sum_reward`, and `avg_max_reward`. -### Required attributes on your `gym.Env` +The critical piece: your env must return `info["is_success"]` on every `step()` call. This is how the eval loop knows whether a task was completed. 
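The hierarchical aggregation described above can be sketched in plain Python. The data and helper names here are illustrative, not the actual `eval_policy_all()` implementation:

```python
def pc_success(flags):
    """Percentage of successful episodes."""
    return 100.0 * sum(flags) / len(flags)

# Per-episode success flags, nested as {suite: {task_id: [episode results]}},
# mirroring the {suite: {task_id: VectorEnv}} structure make_env() returns.
results = {
    "libero_spatial": {0: [True, True, False, True], 1: [False, False, True, True]},
    "libero_object": {0: [True, True, True, True]},
}

per_task = {
    (suite, task_id): pc_success(flags)
    for suite, tasks in results.items()
    for task_id, flags in tasks.items()
}
per_suite = {
    suite: pc_success([f for flags in tasks.values() for f in flags])
    for suite, tasks in results.items()
}
overall = pc_success(
    [f for tasks in results.values() for flags in tasks.values() for f in flags]
)

print(per_task[("libero_spatial", 0)])  # 75.0
print(per_suite["libero_spatial"])      # 62.5
print(overall)                          # 75.0
```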
-| Attribute | Type | Used by |
-| -------------------- | ----- | -------------------------------------------------------------- |
-| `_max_episode_steps` | `int` | `rollout()` — caps episode length |
-| `task_description` | `str` | `add_envs_task()` — feeds language instruction to VLA policies |
-| `task` | `str` | `add_envs_task()` — fallback if `task_description` is absent |
+## What your environment must provide

-### Required fields in `info` dict
+LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.

-| Key | Type | Used by |
-| ------------ | ------ | ----------------------------------------------------------- |
-| `is_success` | `bool` | `eval_policy()` — detects task success |
-| `final_info` | `dict` | Gymnasium `VectorEnv` — carries per-env info on termination |
+### Env attributes

-### Raw observation format
+Your `gym.Env` must set these attributes:

-`preprocess_observation()` expects raw observations to use these keys:
+| Attribute | Type | Why |
+| -------------------- | ----- | ---------------------------------------------------- |
+| `_max_episode_steps` | `int` | `rollout()` uses this to cap episode length |
+| `task_description` | `str` | Passed to VLA policies as a language instruction |
+| `task` | `str` | Fallback identifier if `task_description` is not set |

-| Raw key | Mapped to | Description |
-| --------------------------- | ------------------------------- | -------------------------------------- |
-| `"pixels"` (single image) | `observation.image` | Single camera, HWC uint8 |
-| `"pixels"` (dict of images) | `observation.images.<camera>` | Multiple cameras, each HWC uint8 |
-| `"agent_pos"` | `observation.state` | Proprioceptive state vector |
-| `"environment_state"` | `observation.env_state` | Environment state (e.g., PushT) |
-| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g., LIBERO) |
+### Success reporting

-If your benchmark's raw observations 
don't match these keys, you have two options:
+Your `step()` and `reset()` must include `"is_success"` in the `info` dict:

-1. **Preferred**: Map your observations to these standard keys inside your `gym.Env._format_raw_obs()` method.
-2. **Alternative**: Write an env processor that transforms the observations after `preprocess_observation()` runs.
+```python
+info = {"is_success": True} # or False
+return observation, reward, terminated, truncated, info
+```

-### Action space
+### Observations

-Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies handle the dimension mismatch via their `input_features` / `output_features` config.
+The simplest approach is to map your simulator's outputs to the standard keys that `preprocess_observation()` already understands. Do this inside your `gym.Env` (e.g. in a `_format_raw_obs()` helper):
+
+| Your env should output    | LeRobot maps it to            | What it is                            |
+| ------------------------- | ----------------------------- | ------------------------------------- |
+| `"pixels"` (single array) | `observation.image`           | Single camera image, HWC uint8        |
+| `"pixels"` (dict)         | `observation.images.<camera>` | Multiple cameras, each HWC uint8      |
+| `"agent_pos"`             | `observation.state`           | Proprioceptive state vector           |
+| `"environment_state"`     | `observation.env_state`       | Full environment state (e.g. PushT)   |
+| `"robot_state"`           | `observation.robot_state`     | Nested robot state dict (e.g. LIBERO) |
+
+If your simulator uses different key names, you have two options:
+
+1. **Recommended:** Rename them to the standard keys inside your `gym.Env` wrapper.
+2. **Alternative:** Write an env processor to transform observations after `preprocess_observation()` runs (see step 4 below).
+
+### Actions
+
+Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). 
Policies adapt to different action dimensions through their `input_features` / `output_features` config.

### Feature declaration

-Each `EnvConfig` subclass declares:
+Each `EnvConfig` subclass declares two dicts that tell the policy what to expect:

-- `features`: dict mapping feature names to `PolicyFeature(type, shape)` — tells the policy what to expect.
-- `features_map`: dict mapping raw env keys to LeRobot convention keys (e.g., `"agent_pos" → "observation.state"`).
+- `features` — maps feature names to `PolicyFeature(type, shape)` (e.g. action dim, image shape).
+- `features_map` — maps raw observation keys to LeRobot convention keys (e.g. `"agent_pos"` to `"observation.state"`).

-## Files to create or modify
+## Step by step
+
+<Tip>
+  At minimum, you need three files: a **gym.Env wrapper**, an **EnvConfig
+  subclass**, and a **factory dispatch branch**. Everything else is optional or
+  documentation.
+</Tip>

### Checklist

-| File | Required | Description |
-| ---------------------------------------- | -------- | ----------------------------------------------------------------------------------- |
-| `src/lerobot/envs/<benchmark>.py` | Yes | `gym.Env` subclass + `create_<benchmark>_envs()` factory |
-| `src/lerobot/envs/configs.py` | Yes | `@EnvConfig.register_subclass("<benchmark>")` dataclass |
-| `src/lerobot/envs/factory.py` | Yes | Add dispatch branch in `make_env()` and optionally `make_env_pre_post_processors()` |
-| `src/lerobot/processor/env_processor.py` | Optional | `ProcessorStep` subclass for env-specific observation transforms |
-| `src/lerobot/envs/utils.py` | Optional | Extend `preprocess_observation()` if new raw keys are needed |
-| `pyproject.toml` | Yes | Add optional dependency group |
-| `docs/source/<benchmark>.mdx` | Yes | User-facing benchmark documentation |
-| `docs/source/_toctree.yml` | Yes | Add entry under the "Benchmarks" section |
+| File                                     | Required | Why                                       |
+| ---------------------------------------- | -------- | ----------------------------------------- |
+| 
`src/lerobot/envs/<benchmark>.py`         | Yes      | Wraps the simulator as a standard gym.Env |
+| `src/lerobot/envs/configs.py`            | Yes      | Registers your benchmark for the CLI      |
+| `src/lerobot/envs/factory.py`            | Yes      | Tells `make_env()` how to build your envs |
+| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms      |
+| `src/lerobot/envs/utils.py`              | Optional | Only if you need new raw observation keys |
+| `pyproject.toml`                         | Yes      | Declares benchmark-specific dependencies  |
+| `docs/source/<benchmark>.mdx`            | Yes      | User-facing documentation page            |
+| `docs/source/_toctree.yml`               | Yes      | Adds your page to the docs sidebar        |

### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)

-Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates.
-
-Your env must implement:
+Create a `gym.Env` subclass that wraps the third-party simulator:

```python
class MyBenchmarkEnv(gym.Env):
@@ -133,26 +150,19 @@ class MyBenchmarkEnv(gym.Env):
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
-        # Reset simulator, return (observation, info)
-        # info must contain {"is_success": False}
-        ...
+        ...  # return (observation, info) — info must contain {"is_success": False}

    def step(self, action: np.ndarray):
-        # Step simulator, return (observation, reward, terminated, truncated, info)
-        # info must contain {"is_success": <bool>}
-        # On termination, info must contain "final_info" with success status
-        ...
+        ...  # return (obs, reward, terminated, truncated, info) — info must contain {"is_success": <bool>}

    def render(self):
-        # Return RGB image as numpy array
-        ...
+        ...  # return RGB image as numpy array

    def close(self):
-        # Clean up simulator resources
        ...
```

-Also provide a factory function that returns the standard nested dict:
+Also provide a factory function that returns the nested dict structure:

```python
def create_mybenchmark_envs(
@@ -165,11 +175,11 @@ def create_mybenchmark_envs(
    ...
```

-See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference.
+See `create_libero_envs()` (multi-suite, multi-task) and `create_metaworld_envs()` (difficulty-grouped tasks) for reference.

### 2. The config (`src/lerobot/envs/configs.py`)

-Register a new config dataclass:
+Register a config dataclass so users can select your benchmark with `--env.type=<benchmark>`:

```python
@EnvConfig.register_subclass("<benchmark>")
@@ -178,7 +188,6 @@ class MyBenchmarkEnv(EnvConfig):
    task: str = ""
    fps: int = <fps>
    obs_type: str = "pixels_agent_pos"
-    # ... benchmark-specific fields ...

    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
@@ -190,8 +199,7 @@
    })

    def __post_init__(self):
-        # Populate features based on obs_type
-        ...
+        ...  # populate features based on obs_type

    @property
    def gym_kwargs(self) -> dict:
        ...
```

@@ -200,13 +208,13 @@
Key points:

-- The `register_subclass` name is what users pass as `--env.type=<benchmark>` on the CLI.
-- `features` declares what the environment produces (used to configure the policy).
+- The `register_subclass` name is what users pass on the CLI (`--env.type=<benchmark>`).
+- `features` tells the policy what the environment produces.
- `features_map` maps raw observation keys to LeRobot convention keys.

### 3. 
The factory dispatch (`src/lerobot/envs/factory.py`)

-Add a branch in `make_env()`:
+Add a branch in `make_env()` to call your factory function:

```python
elif "<benchmark>" in cfg.type:
@@ -230,9 +238,9 @@
if isinstance(env_cfg, MyBenchmarkEnv) or "<benchmark>" in env_cfg.type:
    preprocessor_steps.append(MyBenchmarkProcessorStep())
```

-### 4. Env processor (optional) (`src/lerobot/processor/env_processor.py`)
+### 4. Env processor (optional — `src/lerobot/processor/env_processor.py`)

-If your benchmark needs observation transforms beyond what `preprocess_observation()` handles (e.g., image flipping, coordinate frame conversion), add a `ProcessorStep`:
+Only needed if your benchmark requires observation transforms beyond what `preprocess_observation()` handles (e.g. image flipping, coordinate conversion):

```python
@dataclass
@@ -240,12 +248,11 @@
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
-        # Your transforms here
+        # your transforms here
        return processed

    def transform_features(self, features):
-        # Update feature declarations if shapes change
-        return features
+        return features  # update if shapes change

    def observation(self, observation):
        return self._process_observation(observation)
@@ -255,18 +262,18 @@
```

See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).

### 5. Dependencies (`pyproject.toml`)

-Add a new optional-dependency group under `[project.optional-dependencies]`:
+Add a new optional-dependency group:

```toml
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
```

-**Dependency pinning rules:**
+Pinning rules:

-- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`).
-- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`). 
-- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility).
-- **Document version constraints** in the benchmark doc page.
+- **Always pin** benchmark packages to exact versions for reproducibility (e.g. `metaworld==3.0.0`).
+- **Add platform markers** when needed (e.g. `; sys_platform == 'linux'`).
+- **Pin fragile transitive deps** if known (e.g. `gymnasium==1.1.0` for Meta-World).
+- **Document constraints** in your benchmark doc page.

Users install with:

```
pip install -e ".[mybenchmark]"
```

### 6. Documentation (`docs/source/<benchmark>.mdx`)

-Follow the template below. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
+Write a user-facing page following the template in the next section. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.

### 7. Table of contents (`docs/source/_toctree.yml`)

-Add your benchmark under the "Benchmarks" section:
+Add your benchmark to the "Benchmarks" section:

```yaml
- sections:
@@ -295,97 +302,19 @@
      title: "Benchmarks"
```

-## Benchmark documentation template
+## Writing a benchmark doc page

-Each benchmark `.mdx` page should follow this structure:
+Each benchmark `.mdx` page should include:

-```markdown
-# <Benchmark name>
+- **Title and description** — 1-2 paragraphs on what the benchmark tests and why it matters.
+- **Links** — paper, GitHub repo, project website (if available).
+- **Overview image or GIF.**
+- **Available tasks** — table of task suites with counts and brief descriptions.
+- **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
+- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` and `batch_size` for reproducible results. Include single-task and multi-task examples if applicable.
+- **Policy inputs and outputs** — observation keys with shapes, action space description. 
+- **Recommended evaluation episodes** — how many episodes per task is standard. +- **Training** — example `lerobot-train` command. +- **Reproducing published results** — link to pretrained model, eval command, results table (if available). -<1-2 paragraphs: what the benchmark tests and why it matters for robot learning.> - -- Paper: [](arxiv_url) -- GitHub: [<repo>](github_url) -- Project website: [<name>](url) (if available) - -<Overview image or GIF> - -## Available tasks - -<Table listing task suites or individual tasks, with counts. -For multi-suite benchmarks, describe each suite briefly.> - -| Suite | Tasks | Description | -| ----- | ----- | ----------- | -| ... | ... | ... | - -## Installation - -After following the LeRobot installation instructions: - -pip install -e ".[<benchmark>]" - -<Any additional steps: environment variables, system packages, etc.> - -## Evaluation - -### Default evaluation (recommended) - -<Command with recommended n_episodes, batch_size for reproducible results.> - -### Single-task evaluation - -<Command example with --env.task=<single_task>> - -### Multi-task evaluation - -<Command example with comma-separated tasks, if applicable.> - -### Policy inputs and outputs - -**Observations:** - -- `observation.state` — <shape, description> -- `observation.images.image` — <shape, description> -- ... - -**Actions:** - -- Continuous control in Box(<low>, <high>, shape=(<dim>,)) - -### Recommended evaluation episodes - -<State how many episodes per task are standard for this benchmark. -E.g., "50 episodes per task (500 total for LIBERO Spatial)."> - -## Training - -<Example lerobot-train command.> - -## Reproducing published results - -<If available: link to pretrained model, eval command, results table.> -``` - -## How evaluation works - -All benchmarks are evaluated uniformly by `lerobot-eval` (see `src/lerobot/scripts/lerobot_eval.py`). - -The `eval_policy_all()` function: - -1. 
Receives the nested `{suite: {task_id: VectorEnv}}` dict from `make_env()`. -2. Iterates over every `(suite, task_id, vec_env)` tuple. -3. For each task, runs `n_episodes` rollouts via `eval_policy()` → `rollout()`. -4. Aggregates results hierarchically: **episode → task → suite → overall**. -5. Reports `pc_success` (success rate), `avg_sum_reward`, `avg_max_reward` at each level. -6. Saves all results to `eval_info.json` with the full config snapshot for reproducibility. - -The key contract: your `gym.Env` must return `info["is_success"]` on every `step()`, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop detects task completion. - -## Quick reference: existing benchmarks - -| Benchmark | Env file | Config class | Tasks | Action dim | Processor | -| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- | -| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` | -| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None | -| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` | +See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for complete examples.
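As a closing recap of the feature-declaration contract from step 2, here is a stdlib-only sketch. The `PolicyFeature` and `FeatureType` definitions below are stand-ins for illustration, not the LeRobot imports, and the shapes are made up:

```python
from dataclasses import dataclass
from enum import Enum


class FeatureType(Enum):
    ACTION = "action"
    STATE = "state"
    VISUAL = "visual"


@dataclass
class PolicyFeature:
    type: FeatureType
    shape: tuple


# What the environment produces (example: 7-dim actions, one camera).
features = {
    "action": PolicyFeature(type=FeatureType.ACTION, shape=(7,)),
    "agent_pos": PolicyFeature(type=FeatureType.STATE, shape=(8,)),
    "pixels": PolicyFeature(type=FeatureType.VISUAL, shape=(256, 256, 3)),
}

# How raw env keys map onto LeRobot convention keys.
features_map = {
    "action": "action",
    "agent_pos": "observation.state",
    "pixels": "observation.image",
}

# Applying the map yields the keys the policy is configured against.
renamed = {features_map[key]: feat for key, feat in features.items()}
print(sorted(renamed))  # ['action', 'observation.image', 'observation.state']
```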