diff --git a/docs/source/adding_benchmarks.mdx b/docs/source/adding_benchmarks.mdx
index 73a951276..3ba606dd2 100644
--- a/docs/source/adding_benchmarks.mdx
+++ b/docs/source/adding_benchmarks.mdx
@@ -1,140 +1,124 @@
 # Adding a New Benchmark
-This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.
+This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents: follow the steps in order and use the referenced files as templates.
-A benchmark in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard `gym.Env` interface. The `lerobot-eval` CLI then runs evaluation uniformly across all benchmarks.
+A "benchmark" in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments used for standardized evaluation. Each benchmark wraps a third-party simulator (e.g., LIBERO, Meta-World) behind a `gym.Env` interface, and the `lerobot-eval` script drives evaluation uniformly across all benchmarks.
-## Existing benchmarks at a glance
+## Architecture overview
-Before diving in, here is what is already integrated:
+### Observation and action data flow
-| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
-| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
-| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
-| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
-| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |
-
-Use `src/lerobot/envs/libero.py` and `src/lerobot/envs/metaworld.py` as reference implementations. 
- -## How it all fits together - -### Data flow - -During evaluation, data moves through four stages: +During evaluation, observations and actions flow through a multi-stage pipeline: ``` -1. gym.Env ──→ raw observations (numpy dicts) - -2. Preprocessing ──→ standard LeRobot keys + task description - (preprocess_observation, add_envs_task in envs/utils.py) - -3. Processors ──→ env-specific then policy-specific transforms - (env_preprocessor, policy_preprocessor) - -4. Policy ──→ select_action() ──→ action tensor - then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step() +gym.Env.reset() / step() + │ + ▼ raw observation (dict[str, Any]) +preprocess_observation() # envs/utils.py — numpy→tensor, key mapping + │ + ▼ LeRobot-format observation +add_envs_task() # envs/utils.py — injects task description + │ + ▼ +env_preprocessor # processor/env_processor.py — env-specific transforms + │ + ▼ +policy_preprocessor # per-policy normalization, device transfer + │ + ▼ +policy.select_action() # PreTrainedPolicy — returns action tensor + │ + ▼ +policy_postprocessor # per-policy denormalization + │ + ▼ +env_postprocessor # env-specific action transforms + │ + ▼ numpy action +gym.Env.step(action) ``` -Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed). +### Environment return shape -### Environment structure - -`make_env()` returns a nested dict of vectorized environments: +`make_env()` returns a nested dict: ```python dict[str, dict[int, gym.vector.VectorEnv]] -# ^suite ^task_id +# ^suite_name ^task_id ^vectorized env with n_envs parallel copies ``` -A single-task env (e.g. PushT) looks like `{"pusht": {0: vec_env}}`. -A multi-task benchmark (e.g. LIBERO) looks like `{"libero_spatial": {0: vec0, 1: vec1, ...}, ...}`. +For single-task environments (e.g., PushT), this is `{"pusht": {0: vec_env}}`. 
+For multi-task benchmarks (e.g., LIBERO), this is `{"libero_spatial": {0: vec0, 1: vec1, ...}, "libero_object": {0: ..., ...}}`. -### How evaluation runs +The eval loop (`eval_policy_all()`) iterates over all suites and tasks uniformly. -All benchmarks are evaluated the same way by `lerobot-eval`: +## The policy-environment contract -1. `make_env()` builds the nested `{suite: {task_id: VectorEnv}}` dict. -2. `eval_policy_all()` iterates over every suite and task. -3. For each task, it runs `n_episodes` rollouts via `rollout()`. -4. Results are aggregated hierarchically: episode, task, suite, overall. -5. Metrics include `pc_success` (success rate), `avg_sum_reward`, and `avg_max_reward`. +There is no enforced schema: `RobotObservation` is typed as `dict[str, Any]`. Instead, LeRobot relies on conventions: -The critical piece: your env must return `info["is_success"]` on every `step()` call. This is how the eval loop knows whether a task was completed. +### Required attributes on your `gym.Env` -## What your environment must provide +| Attribute | Type | Used by | +| -------------------- | ----- | -------------------------------------------------------------- | +| `_max_episode_steps` | `int` | `rollout()` — caps episode length | +| `task_description` | `str` | `add_envs_task()` — feeds language instruction to VLA policies | +| `task` | `str` | `add_envs_task()` — fallback if `task_description` is absent | -LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow. 
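The attributes in the table above can be sketched with a minimal stub. This is purely illustrative — `MySimEnv` and its task strings are hypothetical, and a real implementation would subclass `gym.Env`:

```python
# Illustrative stub (hypothetical names) showing the attributes the eval
# loop reads from every environment.

class MySimEnv:
    def __init__(self, task: str, max_steps: int = 500):
        self._max_episode_steps = max_steps            # read by rollout() to cap episode length
        self.task = task                               # fallback task identifier
        self.task_description = f"pick up the {task}"  # language instruction for VLA policies

env = MySimEnv("red block")
print(env.task_description)  # pick up the red block
```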
+### Required fields in `info` dict
-### Env attributes
+| Key | Type | Used by |
+| ------------ | ------ | ----------------------------------------------------------- |
+| `is_success` | `bool` | `eval_policy()` — detects task success |
+| `final_info` | `dict` | Gymnasium `VectorEnv` — carries per-env info on termination |
-Your `gym.Env` must set these attributes:
+### Raw observation format
-| Attribute | Type | Why |
-| -------------------- | ----- | ---------------------------------------------------- |
-| `_max_episode_steps` | `int` | `rollout()` uses this to cap episode length |
-| `task_description` | `str` | Passed to VLA policies as a language instruction |
-| `task` | `str` | Fallback identifier if `task_description` is not set |
+`preprocess_observation()` expects raw observations to use these keys:
-### Success reporting
+| Raw key | Mapped to | Description |
+| --------------------------- | ------------------------------- | -------------------------------------- |
+| `"pixels"` (single image) | `observation.image` | Single camera, HWC uint8 |
+| `"pixels"` (dict of images) | `observation.images.<camera>` | Multiple cameras, each HWC uint8 |
+| `"agent_pos"` | `observation.state` | Proprioceptive state vector |
+| `"environment_state"` | `observation.env_state` | Environment state (e.g., PushT) |
+| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g., LIBERO) |
-Your `step()` and `reset()` must include `"is_success"` in the `info` dict:
+If your benchmark's raw observations don't match these keys, you have two options:
-```python
-info = {"is_success": True}  # or False
-return observation, reward, terminated, truncated, info
-```
+1. **Preferred**: Map your observations to these standard keys inside your `gym.Env._format_raw_obs()` method.
+2. **Alternative**: Write an env processor that transforms the observations after `preprocess_observation()` runs. 
-### Observations +### Action space -The simplest approach is to map your simulator's outputs to the standard keys that `preprocess_observation()` already understands. Do this inside your `gym.Env` (e.g. in a `_format_raw_obs()` helper): - -| Your env should output | LeRobot maps it to | What it is | -| ------------------------- | -------------------------- | ------------------------------------- | -| `"pixels"` (single array) | `observation.image` | Single camera image, HWC uint8 | -| `"pixels"` (dict) | `observation.images.` | Multiple cameras, each HWC uint8 | -| `"agent_pos"` | `observation.state` | Proprioceptive state vector | -| `"environment_state"` | `observation.env_state` | Full environment state (e.g. PushT) | -| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g. LIBERO) | - -If your simulator uses different key names, you have two options: - -1. **Recommended:** Rename them to the standard keys inside your `gym.Env` wrapper. -2. **Alternative:** Write an env processor to transform observations after `preprocess_observation()` runs (see step 4 below). - -### Actions - -Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their `input_features` / `output_features` config. +Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies handle the dimension mismatch via their `input_features` / `output_features` config. ### Feature declaration -Each `EnvConfig` subclass declares two dicts that tell the policy what to expect: +Each `EnvConfig` subclass declares: -- `features` — maps feature names to `PolicyFeature(type, shape)` (e.g. action dim, image shape). -- `features_map` — maps raw observation keys to LeRobot convention keys (e.g. `"agent_pos"` to `"observation.state"`). 
+- `features`: dict mapping feature names to `PolicyFeature(type, shape)` — tells the policy what to expect. +- `features_map`: dict mapping raw env keys to LeRobot convention keys (e.g., `"agent_pos" → "observation.state"`). -## Step by step - - - At minimum, you need two files: a **gym.Env wrapper** and an **EnvConfig - subclass** with a `create_envs()` override. Everything else is optional or - documentation. No changes to `factory.py` are needed. - +## Files to create or modify ### Checklist -| File | Required | Why | -| ---------------------------------------- | -------- | ------------------------------------------------------------ | -| `src/lerobot/envs/.py` | Yes | Wraps the simulator as a standard gym.Env | -| `src/lerobot/envs/configs.py` | Yes | Registers your benchmark and its `create_envs()` for the CLI | -| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms | -| `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys | -| `pyproject.toml` | Yes | Declares benchmark-specific dependencies | -| `docs/source/.mdx` | Yes | User-facing documentation page | -| `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar | +| File | Required | Description | +| ---------------------------------------- | -------- | ----------------------------------------------------------------------------------- | +| `src/lerobot/envs/.py` | Yes | `gym.Env` subclass + `create__envs()` factory | +| `src/lerobot/envs/configs.py` | Yes | `@EnvConfig.register_subclass("")` dataclass | +| `src/lerobot/envs/factory.py` | Yes | Add dispatch branch in `make_env()` and optionally `make_env_pre_post_processors()` | +| `src/lerobot/processor/env_processor.py` | Optional | `ProcessorStep` subclass for env-specific observation transforms | +| `src/lerobot/envs/utils.py` | Optional | Extend `preprocess_observation()` if new raw keys are needed | +| `pyproject.toml` | Yes | Add optional dependency group | +| 
`docs/source/<benchmark>.mdx` | Yes | User-facing benchmark documentation |
+| `docs/source/_toctree.yml` | Yes | Add entry under the "Benchmarks" section |
 ### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)
-Create a `gym.Env` subclass that wraps the third-party simulator:
+Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates.
+
+Your env must implement:
 ```python
 class MyBenchmarkEnv(gym.Env):
@@ -149,19 +133,26 @@ class MyBenchmarkEnv(gym.Env):
         self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)
     def reset(self, seed=None, **kwargs):
-        ...  # return (observation, info) — info must contain {"is_success": False}
+        # Reset simulator, return (observation, info)
+        # info must contain {"is_success": False}
+        ...
     def step(self, action: np.ndarray):
-        ...  # return (obs, reward, terminated, truncated, info) — info must contain {"is_success": <bool>}
+        # Step simulator, return (observation, reward, terminated, truncated, info)
+        # info must contain {"is_success": <bool>}
+        # On termination, info must contain "final_info" with success status
+        ...
     def render(self):
-        ...  # return RGB image as numpy array
+        # Return RGB image as numpy array
+        ...
     def close(self):
+        # Clean up simulator resources
         ...
 ```
-Also provide a factory function that returns the nested dict structure:
+Also provide a factory function that returns the standard nested dict:
 ```python
 def create_mybenchmark_envs(
@@ -174,22 +165,20 @@ def create_mybenchmark_envs(
     ...
 ```
-See `create_libero_envs()` (multi-suite, multi-task) and `create_metaworld_envs()` (difficulty-grouped tasks) for reference.
+See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference.
 ### 2. 
The config (`src/lerobot/envs/configs.py`)
-Register a config dataclass so users can select your benchmark with `--env.type=<benchmark>`. Each config owns its environment creation and processor logic via two methods:
-
-- **`create_envs(n_envs, use_async_envs)`** — Returns `{suite: {task_id: VectorEnv}}`. The base class default uses `gym.make()` for single-task envs. Multi-task benchmarks override this.
-- **`get_env_processors()`** — Returns `(preprocessor, postprocessor)`. The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.
+Register a new config dataclass:
 ```python
 @EnvConfig.register_subclass("<benchmark>")
 @dataclass
 class MyBenchmarkEnvConfig(EnvConfig):
     task: str = ""
     fps: int = <fps>
     obs_type: str = "pixels_agent_pos"
+    # ... benchmark-specific fields ...
     features: dict[str, PolicyFeature] = field(default_factory=lambda: {
         ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
@@ -201,37 +190,49 @@ class MyBenchmarkEnvConfig(EnvConfig):
     })
     def __post_init__(self):
-        ...  # populate features based on obs_type
+        # Populate features based on obs_type
+        ...
     @property
     def gym_kwargs(self) -> dict:
         return {"obs_type": self.obs_type, "render_mode": self.render_mode}
-
-    def create_envs(self, n_envs: int, use_async_envs: bool = False):
-        """Override for multi-task benchmarks or custom env creation."""
-        from lerobot.envs.<benchmark> import create_<benchmark>_envs
-        return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)
-
-    def get_env_processors(self):
-        """Override if your benchmark needs observation/action transforms."""
-        from lerobot.processor.pipeline import PolicyProcessorPipeline
-        from lerobot.processor.env_processor import MyBenchmarkProcessorStep
-        return (
-            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
-            PolicyProcessorPipeline(steps=[]),
-        )
 ```
 Key points:
-- `features` tells the policy what the environment produces.
+- The `register_subclass` name is what users pass as `--env.type=<benchmark>` on the CLI.
+- `features` declares what the environment produces (used to configure the policy).
 - `features_map` maps raw observation keys to LeRobot convention keys.
-- **No changes to `factory.py` needed** — the factory delegates to `cfg.create_envs()` and `cfg.get_env_processors()` automatically.
-### 3. Env processor (optional — `src/lerobot/processor/env_processor.py`)
+### 3. The factory dispatch (`src/lerobot/envs/factory.py`)
-Only needed if your benchmark requires observation transforms beyond what `preprocess_observation()` handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from `get_env_processors()` in your config (see step 2):
+Add a branch in `make_env()`:
+
+```python
+elif "<benchmark>" in cfg.type:
+    from lerobot.envs.<benchmark> import create_<benchmark>_envs
+
+    if cfg.task is None:
+        raise ValueError("<benchmark> requires a task to be specified")
+
+    return create_<benchmark>_envs(
+        task=cfg.task,
+        n_envs=n_envs,
+        gym_kwargs=cfg.gym_kwargs,
+        env_cls=env_cls,
+    )
+```
+
+If your benchmark needs an env processor, add it in `make_env_pre_post_processors()`:
+
+```python
+if isinstance(env_cfg, MyBenchmarkEnvConfig) or "<benchmark>" in env_cfg.type:
+    preprocessor_steps.append(MyBenchmarkProcessorStep())
+```
+
+### 4. 
Env processor (optional) (`src/lerobot/processor/env_processor.py`) + +If your benchmark needs observation transforms beyond what `preprocess_observation()` handles (e.g., image flipping, coordinate frame conversion), add a `ProcessorStep`: ```python @dataclass @@ -239,11 +240,12 @@ Only needed if your benchmark requires observation transforms beyond what `prepr class MyBenchmarkProcessorStep(ObservationProcessorStep): def _process_observation(self, observation): processed = observation.copy() - # your transforms here + # Your transforms here return processed def transform_features(self, features): - return features # update if shapes change + # Update feature declarations if shapes change + return features def observation(self, observation): return self._process_observation(observation) @@ -251,20 +253,20 @@ class MyBenchmarkProcessorStep(ObservationProcessorStep): See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion). -### 4. Dependencies (`pyproject.toml`) +### 5. Dependencies (`pyproject.toml`) -Add a new optional-dependency group: +Add a new optional-dependency group under `[project.optional-dependencies]`: ```toml mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"] ``` -Pinning rules: +**Dependency pinning rules:** -- **Always pin** benchmark packages to exact versions for reproducibility (e.g. `metaworld==3.0.0`). -- **Add platform markers** when needed (e.g. `; sys_platform == 'linux'`). -- **Pin fragile transitive deps** if known (e.g. `gymnasium==1.1.0` for Meta-World). -- **Document constraints** in your benchmark doc page. +- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`). +- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`). +- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility). 
+- **Document version constraints** in the benchmark doc page.
 
 Users install with:
 
@@ -272,13 +274,13 @@ Users install with:
 pip install -e ".[mybenchmark]"
 ```
-### 5. Documentation (`docs/source/<benchmark>.mdx`)
+### 6. Documentation (`docs/source/<benchmark>.mdx`)
-Write a user-facing page following the template in the next section. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
+Follow the template below. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
-### 6. Table of contents (`docs/source/_toctree.yml`)
+### 7. Table of contents (`docs/source/_toctree.yml`)
-Add your benchmark to the "Benchmarks" section:
+Add your benchmark under the "Benchmarks" section:
 ```yaml
 - sections:
@@ -293,28 +295,97 @@ Add your benchmark to the "Benchmarks" section:
       title: "Benchmarks"
 ```
-## Verifying your integration
+## Benchmark documentation template
-After completing the steps above, confirm that everything works:
+Each benchmark `.mdx` page should follow this structure:
-1. **Install** — `pip install -e ".[mybenchmark]"` and verify the dependency group installs cleanly.
-2. **Smoke test env creation** — call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys.
-3. **Run a full eval** — `lerobot-eval --env.type=<benchmark> --env.task=<task> --eval.n_episodes=1 --eval.batch_size=1 --policy.path=<policy>` to exercise the full pipeline end-to-end.
-4. **Check success detection** — verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates. 
+```markdown
+# <Benchmark name>
-## Writing a benchmark doc page
+<1-2 paragraphs: what the benchmark tests and why it matters for robot learning.>
-Each benchmark `.mdx` page should include:
+- Paper: [<paper title>](arxiv_url)
+- GitHub: [<repo>](github_url)
+- Project website: [<name>](url) (if available)
-- **Title and description** — 1-2 paragraphs on what the benchmark tests and why it matters.
-- **Links** — paper, GitHub repo, project website (if available).
-- **Overview image or GIF.**
+<Overview image or GIF>
-- **Available tasks** — table of task suites with counts and brief descriptions.
-- **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
-- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` and `batch_size` for reproducible results. Include single-task and multi-task examples if applicable.
-- **Policy inputs and outputs** — observation keys with shapes, action space description.
-- **Recommended evaluation episodes** — how many episodes per task is standard.
-- **Training** — example `lerobot-train` command.
-- **Reproducing published results** — link to pretrained model, eval command, results table (if available).
-See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for complete examples.
+## Available tasks
+
+<Table listing task suites or individual tasks, with counts.
+For multi-suite benchmarks, describe each suite briefly.>
+
+| Suite | Tasks | Description |
+| ----- | ----- | ----------- |
+| ... | ... | ... 
| + +## Installation + +After following the LeRobot installation instructions: + +pip install -e ".[<benchmark>]" + +<Any additional steps: environment variables, system packages, etc.> + +## Evaluation + +### Default evaluation (recommended) + +<Command with recommended n_episodes, batch_size for reproducible results.> + +### Single-task evaluation + +<Command example with --env.task=<single_task>> + +### Multi-task evaluation + +<Command example with comma-separated tasks, if applicable.> + +### Policy inputs and outputs + +**Observations:** + +- `observation.state` — <shape, description> +- `observation.images.image` — <shape, description> +- ... + +**Actions:** + +- Continuous control in Box(<low>, <high>, shape=(<dim>,)) + +### Recommended evaluation episodes + +<State how many episodes per task are standard for this benchmark. +E.g., "50 episodes per task (500 total for LIBERO Spatial)."> + +## Training + +<Example lerobot-train command.> + +## Reproducing published results + +<If available: link to pretrained model, eval command, results table.> +``` + +## How evaluation works + +All benchmarks are evaluated uniformly by `lerobot-eval` (see `src/lerobot/scripts/lerobot_eval.py`). + +The `eval_policy_all()` function: + +1. Receives the nested `{suite: {task_id: VectorEnv}}` dict from `make_env()`. +2. Iterates over every `(suite, task_id, vec_env)` tuple. +3. For each task, runs `n_episodes` rollouts via `eval_policy()` → `rollout()`. +4. Aggregates results hierarchically: **episode → task → suite → overall**. +5. Reports `pc_success` (success rate), `avg_sum_reward`, `avg_max_reward` at each level. +6. Saves all results to `eval_info.json` with the full config snapshot for reproducibility. + +The key contract: your `gym.Env` must return `info["is_success"]` on every `step()`, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop detects task completion. 
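That contract can be sketched with a toy environment. `ToyEnv` below is purely illustrative — a real benchmark subclasses `gym.Env` and is wrapped in a `VectorEnv`, which is what actually surfaces `final_info`:

```python
# Sketch of the info-dict success contract the eval loop relies on.
# ToyEnv is a hypothetical stand-in, not a real LeRobot environment.

class ToyEnv:
    def __init__(self, solves_at_step: int = 3):
        self._step = 0
        self._solves_at = solves_at_step

    def reset(self):
        self._step = 0
        return {"agent_pos": [0.0]}, {"is_success": False}

    def step(self, action):
        self._step += 1
        success = self._step >= self._solves_at
        terminated = success
        info = {"is_success": success}
        if terminated:
            # Vectorized wrappers expose this as info["final_info"]
            info["final_info"] = {"is_success": success}
        return {"agent_pos": [float(self._step)]}, float(success), terminated, False, info

env = ToyEnv()
obs, info = env.reset()
done = False
while not done:
    obs, reward, done, truncated, info = env.step(None)
print(info["is_success"])  # True
```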
+
+## Quick reference: existing benchmarks
+
+| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
+| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
+| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
+| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
+| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |
diff --git a/docs/source/metaworld.mdx b/docs/source/metaworld.mdx
index 5c4a780be..103c4b805 100644
--- a/docs/source/metaworld.mdx
+++ b/docs/source/metaworld.mdx
@@ -2,7 +2,7 @@
 Meta-World is an open-source simulation benchmark for **multi-task and meta reinforcement learning** in continuous-control robotic manipulation. It bundles 50 diverse manipulation tasks using everyday objects and a common tabletop Sawyer arm, providing a standardized playground to test whether algorithms can learn many different tasks and generalize quickly to new ones.
 - Paper: [Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning](https://arxiv.org/abs/1910.10897)
 - GitHub: [Farama-Foundation/Metaworld](https://github.com/Farama-Foundation/Metaworld)
 - Project website: [metaworld.farama.org](https://metaworld.farama.org)
@@ -12,13 +12,13 @@
 Meta-World provides 50 tasks organized into difficulty groups. 
In LeRobot, you can evaluate on individual tasks, difficulty groups, or the full MT50 suite:
 
 | Group      | CLI name             | Tasks | Description                                            |
 | ---------- | -------------------- | ----- | ------------------------------------------------------ |
 | Easy       | `easy`               | 28    | Tasks with simple dynamics and single-step goals       |
 | Medium     | `medium`             | 11    | Tasks requiring multi-step reasoning                   |
 | Hard       | `hard`               | 6     | Tasks with complex contacts and precise manipulation   |
 | Very Hard  | `very_hard`          | 5     | The most challenging tasks in the suite                |
 | MT50 (all) | Comma-separated list | 50    | All 50 tasks — the most challenging multi-task setting |
 
 You can also pass individual task names directly (e.g., `assembly-v3`, `dial-turn-v3`).