diff --git a/docs/source/adding_benchmarks.mdx b/docs/source/adding_benchmarks.mdx
index 73a951276..3ba606dd2 100644
--- a/docs/source/adding_benchmarks.mdx
+++ b/docs/source/adding_benchmarks.mdx
@@ -1,140 +1,124 @@
 # Adding a New Benchmark
-This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.
+This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents: follow the steps in order and use the referenced files as templates.
-A benchmark in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard `gym.Env` interface. The `lerobot-eval` CLI then runs evaluation uniformly across all benchmarks.
+A "benchmark" in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments used for standardized evaluation. Each benchmark wraps a third-party simulator (e.g., LIBERO, Meta-World) behind a `gym.Env` interface, and the `lerobot-eval` script drives evaluation uniformly across all benchmarks.
-## Existing benchmarks at a glance
+## Architecture overview
-Before diving in, here is what is already integrated:
+### Observation and action data flow
-| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
-| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
-| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
-| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
-| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |
-
-Use `src/lerobot/envs/libero.py` and `src/lerobot/envs/metaworld.py` as reference implementations. 
- -## How it all fits together - -### Data flow - -During evaluation, data moves through four stages: +During evaluation, observations and actions flow through a multi-stage pipeline: ``` -1. gym.Env ──→ raw observations (numpy dicts) - -2. Preprocessing ──→ standard LeRobot keys + task description - (preprocess_observation, add_envs_task in envs/utils.py) - -3. Processors ──→ env-specific then policy-specific transforms - (env_preprocessor, policy_preprocessor) - -4. Policy ──→ select_action() ──→ action tensor - then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step() +gym.Env.reset() / step() + │ + ▼ raw observation (dict[str, Any]) +preprocess_observation() # envs/utils.py — numpy→tensor, key mapping + │ + ▼ LeRobot-format observation +add_envs_task() # envs/utils.py — injects task description + │ + ▼ +env_preprocessor # processor/env_processor.py — env-specific transforms + │ + ▼ +policy_preprocessor # per-policy normalization, device transfer + │ + ▼ +policy.select_action() # PreTrainedPolicy — returns action tensor + │ + ▼ +policy_postprocessor # per-policy denormalization + │ + ▼ +env_postprocessor # env-specific action transforms + │ + ▼ numpy action +gym.Env.step(action) ``` -Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed). +### Environment return shape -### Environment structure - -`make_env()` returns a nested dict of vectorized environments: +`make_env()` returns a nested dict: ```python dict[str, dict[int, gym.vector.VectorEnv]] -# ^suite ^task_id +# ^suite_name ^task_id ^vectorized env with n_envs parallel copies ``` -A single-task env (e.g. PushT) looks like `{"pusht": {0: vec_env}}`. -A multi-task benchmark (e.g. LIBERO) looks like `{"libero_spatial": {0: vec0, 1: vec1, ...}, ...}`. +For single-task environments (e.g., PushT), this is `{"pusht": {0: vec_env}}`. 
+For multi-task benchmarks (e.g., LIBERO), this is `{"libero_spatial": {0: vec0, 1: vec1, ...}, "libero_object": {0: ..., ...}}`. -### How evaluation runs +The eval loop (`eval_policy_all()`) iterates over all suites and tasks uniformly. -All benchmarks are evaluated the same way by `lerobot-eval`: +## The policy-environment contract -1. `make_env()` builds the nested `{suite: {task_id: VectorEnv}}` dict. -2. `eval_policy_all()` iterates over every suite and task. -3. For each task, it runs `n_episodes` rollouts via `rollout()`. -4. Results are aggregated hierarchically: episode, task, suite, overall. -5. Metrics include `pc_success` (success rate), `avg_sum_reward`, and `avg_max_reward`. +There is no enforced schema: `RobotObservation` is typed as `dict[str, Any]`. Instead, LeRobot relies on conventions: -The critical piece: your env must return `info["is_success"]` on every `step()` call. This is how the eval loop knows whether a task was completed. +### Required attributes on your `gym.Env` -## What your environment must provide +| Attribute | Type | Used by | +| -------------------- | ----- | -------------------------------------------------------------- | +| `_max_episode_steps` | `int` | `rollout()` — caps episode length | +| `task_description` | `str` | `add_envs_task()` — feeds language instruction to VLA policies | +| `task` | `str` | `add_envs_task()` — fallback if `task_description` is absent | -LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow. 
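The attributes in the table above can be sketched with a minimal stub. This is purely illustrative — `MySimEnv` and its task strings are hypothetical, and a real implementation would subclass `gym.Env`:

```python
# Illustrative stub (hypothetical names) showing the attributes the eval
# loop reads from every environment.

class MySimEnv:
    def __init__(self, task: str, max_steps: int = 500):
        self._max_episode_steps = max_steps            # read by rollout() to cap episode length
        self.task = task                               # fallback task identifier
        self.task_description = f"pick up the {task}"  # language instruction for VLA policies

env = MySimEnv("red block")
print(env.task_description)  # pick up the red block
```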
+### Required fields in `info` dict
-### Env attributes
+| Key | Type | Used by |
+| ------------ | ------ | ----------------------------------------------------------- |
+| `is_success` | `bool` | `eval_policy()` — detects task success |
+| `final_info` | `dict` | Gymnasium `VectorEnv` — carries per-env info on termination |
-Your `gym.Env` must set these attributes:
+### Raw observation format
-| Attribute | Type | Why |
-| -------------------- | ----- | ---------------------------------------------------- |
-| `_max_episode_steps` | `int` | `rollout()` uses this to cap episode length |
-| `task_description` | `str` | Passed to VLA policies as a language instruction |
-| `task` | `str` | Fallback identifier if `task_description` is not set |
+`preprocess_observation()` expects raw observations to use these keys:
-### Success reporting
+| Raw key | Mapped to | Description |
+| --------------------------- | ------------------------------- | -------------------------------------- |
+| `"pixels"` (single image) | `observation.image` | Single camera, HWC uint8 |
+| `"pixels"` (dict of images) | `observation.images.<camera>` | Multiple cameras, each HWC uint8 |
+| `"agent_pos"` | `observation.state` | Proprioceptive state vector |
+| `"environment_state"` | `observation.env_state` | Environment state (e.g., PushT) |
+| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g., LIBERO) |
-Your `step()` and `reset()` must include `"is_success"` in the `info` dict:
+If your benchmark's raw observations don't match these keys, you have two options:
-```python
-info = {"is_success": True}  # or False
-return observation, reward, terminated, truncated, info
-```
+1. **Preferred**: Map your observations to these standard keys inside your `gym.Env._format_raw_obs()` method.
+2. **Alternative**: Write an env processor that transforms the observations after `preprocess_observation()` runs. 
-### Observations +### Action space -The simplest approach is to map your simulator's outputs to the standard keys that `preprocess_observation()` already understands. Do this inside your `gym.Env` (e.g. in a `_format_raw_obs()` helper): - -| Your env should output | LeRobot maps it to | What it is | -| ------------------------- | -------------------------- | ------------------------------------- | -| `"pixels"` (single array) | `observation.image` | Single camera image, HWC uint8 | -| `"pixels"` (dict) | `observation.images.` | Multiple cameras, each HWC uint8 | -| `"agent_pos"` | `observation.state` | Proprioceptive state vector | -| `"environment_state"` | `observation.env_state` | Full environment state (e.g. PushT) | -| `"robot_state"` | `observation.robot_state` | Nested robot state dict (e.g. LIBERO) | - -If your simulator uses different key names, you have two options: - -1. **Recommended:** Rename them to the standard keys inside your `gym.Env` wrapper. -2. **Alternative:** Write an env processor to transform observations after `preprocess_observation()` runs (see step 4 below). - -### Actions - -Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their `input_features` / `output_features` config. +Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies handle the dimension mismatch via their `input_features` / `output_features` config. ### Feature declaration -Each `EnvConfig` subclass declares two dicts that tell the policy what to expect: +Each `EnvConfig` subclass declares: -- `features` — maps feature names to `PolicyFeature(type, shape)` (e.g. action dim, image shape). -- `features_map` — maps raw observation keys to LeRobot convention keys (e.g. `"agent_pos"` to `"observation.state"`). 
+- `features`: dict mapping feature names to `PolicyFeature(type, shape)` — tells the policy what to expect. +- `features_map`: dict mapping raw env keys to LeRobot convention keys (e.g., `"agent_pos" → "observation.state"`). -## Step by step - - - At minimum, you need two files: a **gym.Env wrapper** and an **EnvConfig - subclass** with a `create_envs()` override. Everything else is optional or - documentation. No changes to `factory.py` are needed. - +## Files to create or modify ### Checklist -| File | Required | Why | -| ---------------------------------------- | -------- | ------------------------------------------------------------ | -| `src/lerobot/envs/.py` | Yes | Wraps the simulator as a standard gym.Env | -| `src/lerobot/envs/configs.py` | Yes | Registers your benchmark and its `create_envs()` for the CLI | -| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms | -| `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys | -| `pyproject.toml` | Yes | Declares benchmark-specific dependencies | -| `docs/source/.mdx` | Yes | User-facing documentation page | -| `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar | +| File | Required | Description | +| ---------------------------------------- | -------- | ----------------------------------------------------------------------------------- | +| `src/lerobot/envs/.py` | Yes | `gym.Env` subclass + `create__envs()` factory | +| `src/lerobot/envs/configs.py` | Yes | `@EnvConfig.register_subclass("")` dataclass | +| `src/lerobot/envs/factory.py` | Yes | Add dispatch branch in `make_env()` and optionally `make_env_pre_post_processors()` | +| `src/lerobot/processor/env_processor.py` | Optional | `ProcessorStep` subclass for env-specific observation transforms | +| `src/lerobot/envs/utils.py` | Optional | Extend `preprocess_observation()` if new raw keys are needed | +| `pyproject.toml` | Yes | Add optional dependency group | +| 
`docs/source/<benchmark>.mdx` | Yes | User-facing benchmark documentation |
+| `docs/source/_toctree.yml` | Yes | Add entry under the "Benchmarks" section |
 ### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)
-Create a `gym.Env` subclass that wraps the third-party simulator:
+Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates.
+
+Your env must implement:
 ```python
 class MyBenchmarkEnv(gym.Env):
@@ -149,19 +133,26 @@ class MyBenchmarkEnv(gym.Env):
         self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)
     def reset(self, seed=None, **kwargs):
-        ...  # return (observation, info) — info must contain {"is_success": False}
+        # Reset simulator, return (observation, info)
+        # info must contain {"is_success": False}
+        ...
     def step(self, action: np.ndarray):
-        ...  # return (obs, reward, terminated, truncated, info) — info must contain {"is_success": <bool>}
+        # Step simulator, return (observation, reward, terminated, truncated, info)
+        # info must contain {"is_success": <bool>}
+        # On termination, info must contain "final_info" with success status
+        ...
     def render(self):
-        ...  # return RGB image as numpy array
+        # Return RGB image as numpy array
+        ...
     def close(self):
+        # Clean up simulator resources
         ...
 ```
-Also provide a factory function that returns the nested dict structure:
+Also provide a factory function that returns the standard nested dict:
 ```python
 def create_mybenchmark_envs(
@@ -174,22 +165,20 @@ def create_mybenchmark_envs(
     ...
 ```
-See `create_libero_envs()` (multi-suite, multi-task) and `create_metaworld_envs()` (difficulty-grouped tasks) for reference.
+See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference.
 ### 2. 
The config (`src/lerobot/envs/configs.py`)
-Register a config dataclass so users can select your benchmark with `--env.type=<benchmark>`. Each config owns its environment creation and processor logic via two methods:
-
-- **`create_envs(n_envs, use_async_envs)`** — Returns `{suite: {task_id: VectorEnv}}`. The base class default uses `gym.make()` for single-task envs. Multi-task benchmarks override this.
-- **`get_env_processors()`** — Returns `(preprocessor, postprocessor)`. The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.
+Register a new config dataclass:
 ```python
 @EnvConfig.register_subclass("<benchmark>")
 @dataclass
 class MyBenchmarkEnvConfig(EnvConfig):
     task: str = ""
     fps: int = <fps>
     obs_type: str = "pixels_agent_pos"
+    # ... benchmark-specific fields ...
     features: dict[str, PolicyFeature] = field(default_factory=lambda: {
         ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
@@ -201,37 +190,49 @@ class MyBenchmarkEnvConfig(EnvConfig):
     })
     def __post_init__(self):
-        ...  # populate features based on obs_type
+        # Populate features based on obs_type
+        ...
     @property
     def gym_kwargs(self) -> dict:
         return {"obs_type": self.obs_type, "render_mode": self.render_mode}
-
-    def create_envs(self, n_envs: int, use_async_envs: bool = False):
-        """Override for multi-task benchmarks or custom env creation."""
-        from lerobot.envs.<benchmark> import create_<benchmark>_envs
-        return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)
-
-    def get_env_processors(self):
-        """Override if your benchmark needs observation/action transforms."""
-        from lerobot.processor.pipeline import PolicyProcessorPipeline
-        from lerobot.processor.env_processor import MyBenchmarkProcessorStep
-        return (
-            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
-            PolicyProcessorPipeline(steps=[]),
-        )
 ```
 Key points:
-- `features` tells the policy what the environment produces.
+- The `register_subclass` name is what users pass as `--env.type=<benchmark>` on the CLI.
+- `features` declares what the environment produces (used to configure the policy).
 - `features_map` maps raw observation keys to LeRobot convention keys.
-- **No changes to `factory.py` needed** — the factory delegates to `cfg.create_envs()` and `cfg.get_env_processors()` automatically.
-### 3. Env processor (optional — `src/lerobot/processor/env_processor.py`)
+### 3. The factory dispatch (`src/lerobot/envs/factory.py`)
-Only needed if your benchmark requires observation transforms beyond what `preprocess_observation()` handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from `get_env_processors()` in your config (see step 2):
+Add a branch in `make_env()`:
+
+```python
+elif "<benchmark>" in cfg.type:
+    from lerobot.envs.<benchmark> import create_<benchmark>_envs
+
+    if cfg.task is None:
+        raise ValueError("<benchmark> requires a task to be specified")
+
+    return create_<benchmark>_envs(
+        task=cfg.task,
+        n_envs=n_envs,
+        gym_kwargs=cfg.gym_kwargs,
+        env_cls=env_cls,
+    )
+```
+
+If your benchmark needs an env processor, add it in `make_env_pre_post_processors()`:
+
+```python
+if isinstance(env_cfg, MyBenchmarkEnvConfig) or "<benchmark>" in env_cfg.type:
+    preprocessor_steps.append(MyBenchmarkProcessorStep())
+```
+
+### 4. 
Env processor (optional) (`src/lerobot/processor/env_processor.py`) + +If your benchmark needs observation transforms beyond what `preprocess_observation()` handles (e.g., image flipping, coordinate frame conversion), add a `ProcessorStep`: ```python @dataclass @@ -239,11 +240,12 @@ Only needed if your benchmark requires observation transforms beyond what `prepr class MyBenchmarkProcessorStep(ObservationProcessorStep): def _process_observation(self, observation): processed = observation.copy() - # your transforms here + # Your transforms here return processed def transform_features(self, features): - return features # update if shapes change + # Update feature declarations if shapes change + return features def observation(self, observation): return self._process_observation(observation) @@ -251,20 +253,20 @@ class MyBenchmarkProcessorStep(ObservationProcessorStep): See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion). -### 4. Dependencies (`pyproject.toml`) +### 5. Dependencies (`pyproject.toml`) -Add a new optional-dependency group: +Add a new optional-dependency group under `[project.optional-dependencies]`: ```toml mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"] ``` -Pinning rules: +**Dependency pinning rules:** -- **Always pin** benchmark packages to exact versions for reproducibility (e.g. `metaworld==3.0.0`). -- **Add platform markers** when needed (e.g. `; sys_platform == 'linux'`). -- **Pin fragile transitive deps** if known (e.g. `gymnasium==1.1.0` for Meta-World). -- **Document constraints** in your benchmark doc page. +- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`). +- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`). +- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility). 
+- **Document version constraints** in the benchmark doc page.
 
 Users install with:
 
@@ -272,13 +274,13 @@ Users install with:
 pip install -e ".[mybenchmark]"
 ```
-### 5. Documentation (`docs/source/<benchmark>.mdx`)
+### 6. Documentation (`docs/source/<benchmark>.mdx`)
-Write a user-facing page following the template in the next section. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
+Follow the template below. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
-### 6. Table of contents (`docs/source/_toctree.yml`)
+### 7. Table of contents (`docs/source/_toctree.yml`)
-Add your benchmark to the "Benchmarks" section:
+Add your benchmark under the "Benchmarks" section:
 ```yaml
 - sections:
@@ -293,28 +295,97 @@ Add your benchmark to the "Benchmarks" section:
       title: "Benchmarks"
 ```
-## Verifying your integration
+## Benchmark documentation template
-After completing the steps above, confirm that everything works:
+Each benchmark `.mdx` page should follow this structure:
-1. **Install** — `pip install -e ".[mybenchmark]"` and verify the dependency group installs cleanly.
-2. **Smoke test env creation** — call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys.
-3. **Run a full eval** — `lerobot-eval --env.type=<benchmark> --env.task=<task> --eval.n_episodes=1 --eval.batch_size=1 --policy.path=<policy>` to exercise the full pipeline end-to-end.
-4. **Check success detection** — verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates. 
+```markdown
+# <Benchmark name>
-## Writing a benchmark doc page
+<1-2 paragraphs: what the benchmark tests and why it matters for robot learning.>
-Each benchmark `.mdx` page should include:
+- Paper: [<paper title>](arxiv_url)
+- GitHub: [<repo>](github_url)
+- Project website: [<name>](url) (if available)
-- **Title and description** — 1-2 paragraphs on what the benchmark tests and why it matters.
-- **Links** — paper, GitHub repo, project website (if available).
-- **Overview image or GIF.**
+<Overview image or GIF>
-- **Available tasks** — table of task suites with counts and brief descriptions.
-- **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
-- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` and `batch_size` for reproducible results. Include single-task and multi-task examples if applicable.
-- **Policy inputs and outputs** — observation keys with shapes, action space description.
-- **Recommended evaluation episodes** — how many episodes per task is standard.
-- **Training** — example `lerobot-train` command.
-- **Reproducing published results** — link to pretrained model, eval command, results table (if available).
-See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for complete examples.
+## Available tasks
+
+<Table listing task suites or individual tasks, with counts.
+For multi-suite benchmarks, describe each suite briefly.>
+
+| Suite | Tasks | Description |
+| ----- | ----- | ----------- |
+| ... | ... | ... 
| + +## Installation + +After following the LeRobot installation instructions: + +pip install -e ".[<benchmark>]" + +<Any additional steps: environment variables, system packages, etc.> + +## Evaluation + +### Default evaluation (recommended) + +<Command with recommended n_episodes, batch_size for reproducible results.> + +### Single-task evaluation + +<Command example with --env.task=<single_task>> + +### Multi-task evaluation + +<Command example with comma-separated tasks, if applicable.> + +### Policy inputs and outputs + +**Observations:** + +- `observation.state` — <shape, description> +- `observation.images.image` — <shape, description> +- ... + +**Actions:** + +- Continuous control in Box(<low>, <high>, shape=(<dim>,)) + +### Recommended evaluation episodes + +<State how many episodes per task are standard for this benchmark. +E.g., "50 episodes per task (500 total for LIBERO Spatial)."> + +## Training + +<Example lerobot-train command.> + +## Reproducing published results + +<If available: link to pretrained model, eval command, results table.> +``` + +## How evaluation works + +All benchmarks are evaluated uniformly by `lerobot-eval` (see `src/lerobot/scripts/lerobot_eval.py`). + +The `eval_policy_all()` function: + +1. Receives the nested `{suite: {task_id: VectorEnv}}` dict from `make_env()`. +2. Iterates over every `(suite, task_id, vec_env)` tuple. +3. For each task, runs `n_episodes` rollouts via `eval_policy()` → `rollout()`. +4. Aggregates results hierarchically: **episode → task → suite → overall**. +5. Reports `pc_success` (success rate), `avg_sum_reward`, `avg_max_reward` at each level. +6. Saves all results to `eval_info.json` with the full config snapshot for reproducibility. + +The key contract: your `gym.Env` must return `info["is_success"]` on every `step()`, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop detects task completion. 
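That contract can be sketched with a toy environment. `ToyEnv` below is purely illustrative — a real benchmark subclasses `gym.Env` and is wrapped in a `VectorEnv`, which is what actually surfaces `final_info`:

```python
# Sketch of the info-dict success contract the eval loop relies on.
# ToyEnv is a hypothetical stand-in, not a real LeRobot environment.

class ToyEnv:
    def __init__(self, solves_at_step: int = 3):
        self._step = 0
        self._solves_at = solves_at_step

    def reset(self):
        self._step = 0
        return {"agent_pos": [0.0]}, {"is_success": False}

    def step(self, action):
        self._step += 1
        success = self._step >= self._solves_at
        terminated = success
        info = {"is_success": success}
        if terminated:
            # Vectorized wrappers expose this as info["final_info"]
            info["final_info"] = {"is_success": success}
        return {"agent_pos": [float(self._step)]}, float(success), terminated, False, info

env = ToyEnv()
obs, info = env.reset()
done = False
while not done:
    obs, reward, done, truncated, info = env.step(None)
print(info["is_success"])  # True
```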
+
+## Quick reference: existing benchmarks
+
+| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
+| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
+| LIBERO | `envs/libero.py` | `LiberoEnv` | 130 across 5 suites | 7 | `LiberoProcessorStep` |
+| Meta-World | `envs/metaworld.py` | `MetaworldEnv` | 50 (MT50) | 4 | None |
+| IsaacLab Arena | Hub-hosted | `IsaaclabArenaEnv` | Configurable | Configurable | `IsaaclabArenaProcessorStep` |
diff --git a/docs/source/metaworld.mdx b/docs/source/metaworld.mdx
index 5c4a780be..103c4b805 100644
--- a/docs/source/metaworld.mdx
+++ b/docs/source/metaworld.mdx
@@ -2,7 +2,7 @@
 Meta-World is an open-source simulation benchmark for **multi-task and meta reinforcement learning** in continuous-control robotic manipulation. It bundles 50 diverse manipulation tasks using everyday objects and a common tabletop Sawyer arm, providing a standardized playground to test whether algorithms can learn many different tasks and generalize quickly to new ones.
 - Paper: [Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning](https://arxiv.org/abs/1910.10897)
 - GitHub: [Farama-Foundation/Metaworld](https://github.com/Farama-Foundation/Metaworld)
 - Project website: [metaworld.farama.org](https://metaworld.farama.org)
@@ -12,13 +12,13 @@
 Meta-World provides 50 tasks organized into difficulty groups. 
In LeRobot, you can evaluate on individual tasks, difficulty groups, or the full MT50 suite:
 
 | Group      | CLI name             | Tasks | Description                                            |
 | ---------- | -------------------- | ----- | ------------------------------------------------------ |
 | Easy       | `easy`               | 28    | Tasks with simple dynamics and single-step goals       |
 | Medium     | `medium`             | 11    | Tasks requiring multi-step reasoning                   |
 | Hard       | `hard`               | 6     | Tasks with complex contacts and precise manipulation   |
 | Very Hard  | `very_hard`          | 5     | The most challenging tasks in the suite                |
 | MT50 (all) | Comma-separated list | 50    | All 50 tasks — the most challenging multi-task setting |
 
 You can also pass individual task names directly (e.g., `assembly-v3`, `dial-turn-v3`).