mirror of
https://github.com/huggingface/lerobot.git
synced 2026-05-16 09:09:48 +00:00
5ad4c8f7b6
Replace hardcoded if/elif chains in factory.py with create_envs() and get_env_processors() methods on EnvConfig. New benchmarks now only need to register a config subclass — no factory.py edits required. Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic. Made-with: Cursor
383 lines
15 KiB
Plaintext
# Adding a New Benchmark
This guide explains how to integrate a new simulation benchmark into LeRobot. It is intended for both human contributors and coding agents: follow the steps in order and use the referenced files as templates.

A "benchmark" in LeRobot is a set of gymnasium environments used for standardized evaluation. Each benchmark wraps a third-party simulator (e.g., LIBERO, Meta-World) behind a `gym.Env` interface, and the `lerobot-eval` script drives evaluation uniformly across all benchmarks.

## Architecture overview

### Observation and action data flow

During evaluation, observations and actions flow through a multi-stage pipeline:

```
gym.Env.reset() / step()
        │
        ▼ raw observation (dict[str, Any])
preprocess_observation()   # envs/utils.py — numpy→tensor, key mapping
        │
        ▼ LeRobot-format observation
add_envs_task()            # envs/utils.py — injects task description
        │
        ▼
env_preprocessor           # processor/env_processor.py — env-specific transforms
        │
        ▼
policy_preprocessor        # per-policy normalization, device transfer
        │
        ▼
policy.select_action()     # PreTrainedPolicy — returns action tensor
        │
        ▼
policy_postprocessor       # per-policy denormalization
        │
        ▼
env_postprocessor          # env-specific action transforms
        │
        ▼ numpy action
gym.Env.step(action)
```

### Environment return shape

`make_env()` returns a nested dict:

```python
dict[str, dict[int, gym.vector.VectorEnv]]
#    ^suite_name  ^task_id  ^vectorized env with n_envs parallel copies
```

For single-task environments (e.g., PushT), this is `{"pusht": {0: vec_env}}`.

For multi-task benchmarks (e.g., LIBERO), this is `{"libero_spatial": {0: vec0, 1: vec1, ...}, "libero_object": {0: ..., ...}}`.

The eval loop (`eval_policy_all()`) iterates over all suites and tasks uniformly.
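
To make the traversal concrete, here is a minimal sketch of how a consumer walks the nested dict. The `DummyVecEnv` stand-in and the suite names are illustrative, not LeRobot API:

```python
class DummyVecEnv:
    """Stand-in for gym.vector.VectorEnv, for illustration only."""

    def __init__(self, name: str):
        self.name = name


envs = {
    "libero_spatial": {0: DummyVecEnv("spatial_0"), 1: DummyVecEnv("spatial_1")},
    "libero_object": {0: DummyVecEnv("object_0")},
}

# eval_policy_all()-style traversal: every (suite, task_id, vec_env) triple.
triples = [
    (suite, task_id, vec_env)
    for suite, tasks in envs.items()
    for task_id, vec_env in tasks.items()
]
```
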
## The policy-environment contract

There is no enforced schema: `RobotObservation` is typed as `dict[str, Any]`. Instead, LeRobot relies on conventions:

### Required attributes on your `gym.Env`

| Attribute            | Type  | Used by                                                        |
| -------------------- | ----- | -------------------------------------------------------------- |
| `_max_episode_steps` | `int` | `rollout()` — caps episode length                              |
| `task_description`   | `str` | `add_envs_task()` — feeds language instruction to VLA policies |
| `task`               | `str` | `add_envs_task()` — fallback if `task_description` is absent   |

### Required fields in `info` dict
| Key          | Type   | Used by                                                     |
| ------------ | ------ | ----------------------------------------------------------- |
| `is_success` | `bool` | `eval_policy()` — detects task success                      |
| `final_info` | `dict` | Gymnasium `VectorEnv` — carries per-env info on termination |

### Raw observation format

`preprocess_observation()` expects raw observations to use these keys:

| Raw key                     | Mapped to                       | Description                            |
| --------------------------- | ------------------------------- | -------------------------------------- |
| `"pixels"` (single image)   | `observation.image`             | Single camera, HWC uint8               |
| `"pixels"` (dict of images) | `observation.images.<cam_name>` | Multiple cameras, each HWC uint8       |
| `"agent_pos"`               | `observation.state`             | Proprioceptive state vector            |
| `"environment_state"`       | `observation.env_state`         | Environment state (e.g., PushT)        |
| `"robot_state"`             | `observation.robot_state`       | Nested robot state dict (e.g., LIBERO) |

If your benchmark's raw observations don't match these keys, you have two options:

1. **Preferred**: Map your observations to these standard keys inside your `gym.Env._format_raw_obs()` method.
2. **Alternative**: Write an env processor that transforms the observations after `preprocess_observation()` runs.

### Action space

Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality is benchmark-specific (e.g., 7 for LIBERO, 4 for Meta-World). Policies handle the dimension mismatch via their `input_features` / `output_features` config.

### Feature declaration

Each `EnvConfig` subclass declares:

- `features`: dict mapping feature names to `PolicyFeature(type, shape)` — tells the policy what to expect.
- `features_map`: dict mapping raw env keys to LeRobot convention keys (e.g., `"agent_pos" → "observation.state"`).

## Files to create or modify

### Checklist

| File                                     | Required | Description                                                                      |
| ---------------------------------------- | -------- | -------------------------------------------------------------------------------- |
| `src/lerobot/envs/<benchmark>.py`        | Yes      | `gym.Env` subclass + `create_<benchmark>_envs()` factory                         |
| `src/lerobot/envs/configs.py`            | Yes      | `@EnvConfig.register_subclass("<name>")` dataclass with `create_envs()` override |
| `src/lerobot/processor/env_processor.py` | Optional | `ProcessorStep` subclass for env-specific observation transforms                 |
| `src/lerobot/envs/utils.py`              | Optional | Extend `preprocess_observation()` if new raw keys are needed                     |
| `pyproject.toml`                         | Yes      | Add optional dependency group                                                    |
| `docs/source/<benchmark>.mdx`            | Yes      | User-facing benchmark documentation                                              |
| `docs/source/_toctree.yml`               | Yes      | Add entry under the "Benchmarks" section                                         |

### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)

Create a `gym.Env` subclass that wraps the third-party simulator. Use `src/lerobot/envs/libero.py` or `src/lerobot/envs/metaworld.py` as templates.

Your env must implement:

```python
class MyBenchmarkEnv(gym.Env):
    metadata = {"render_modes": ["rgb_array"], "render_fps": <fps>}

    def __init__(self, task_suite, task_id, ...):
        super().__init__()
        self.task = <task_name_string>
        self.task_description = <natural_language_instruction>
        self._max_episode_steps = <max_steps>
        self.observation_space = spaces.Dict({...})
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        # Reset simulator, return (observation, info)
        # info must contain {"is_success": False}
        ...

    def step(self, action: np.ndarray):
        # Step simulator, return (observation, reward, terminated, truncated, info)
        # info must contain {"is_success": <bool>}
        # On termination, info must contain "final_info" with success status
        ...

    def render(self):
        # Return RGB image as numpy array
        ...

    def close(self):
        # Clean up simulator resources
        ...
```

Also provide a factory function that returns the standard nested dict:

```python
def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    gym_kwargs: dict | None = None,
    env_cls: type | None = None,
) -> dict[str, dict[int, Any]]:
    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
    ...
```

See `create_libero_envs()` in `src/lerobot/envs/libero.py` (multi-suite, multi-task) and `create_metaworld_envs()` in `src/lerobot/envs/metaworld.py` (difficulty-grouped tasks) for reference.

### 2. The config (`src/lerobot/envs/configs.py`)

Register a new config dataclass. Each config owns its environment creation and processor logic via two methods:

- **`create_envs(n_envs, use_async_envs)`** — Returns `{suite: {task_id: VectorEnv}}`. The base class default uses `gym.make()` for single-task envs. Multi-task benchmarks override this.
- **`get_env_processors()`** — Returns `(preprocessor, postprocessor)`. The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.

```python
@EnvConfig.register_subclass("<benchmark_name>")
@dataclass
class MyBenchmarkEnv(EnvConfig):
    task: str = "<default_task>"
    fps: int = <fps>
    obs_type: str = "pixels_agent_pos"
    # ... benchmark-specific fields ...

    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(<action_dim>,)),
    })
    features_map: dict[str, str] = field(default_factory=lambda: {
        ACTION: ACTION,
        "agent_pos": OBS_STATE,
        "pixels": OBS_IMAGE,
    })

    def __post_init__(self):
        # Populate features based on obs_type
        ...

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}

    def create_envs(self, n_envs: int, use_async_envs: bool = False):
        """Override for multi-task benchmarks or custom env creation."""
        from lerobot.envs.<benchmark> import create_<benchmark>_envs

        return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)

    def get_env_processors(self):
        """Override if your benchmark needs observation/action transforms."""
        from lerobot.processor.pipeline import PolicyProcessorPipeline
        from lerobot.processor.env_processor import MyBenchmarkProcessorStep

        return (
            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
            PolicyProcessorPipeline(steps=[]),
        )
```

Key points:

- The `register_subclass` name is what users pass as `--env.type=<name>` on the CLI.
- `features` declares what the environment produces (used to configure the policy).
- `features_map` maps raw observation keys to LeRobot convention keys.
- **No changes to `factory.py` needed** — the factory delegates to `cfg.create_envs()` and `cfg.get_env_processors()` automatically.

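The delegation in the last point can be sketched as follows. These are simplified stand-ins for the real classes in `envs/configs.py` and `envs/factory.py` (the real `create_envs()` calls `gym.make()` and returns actual `VectorEnv` objects):

```python
class EnvConfig:
    """Simplified stand-in for the real base class."""

    def create_envs(self, n_envs: int, use_async_envs: bool = False):
        # Base-class default: single-task creation (gym.make() in the real code).
        return {"default": {0: f"vec_env(n={n_envs})"}}

    def get_env_processors(self):
        # Base-class default: identity (no-op) pipelines.
        return (None, None)


class MyBenchmarkEnv(EnvConfig):
    def create_envs(self, n_envs: int, use_async_envs: bool = False):
        # Multi-task benchmarks override creation; no factory edits needed.
        return {"mybenchmark": {0: f"vec_env(n={n_envs})"}}


def make_env(cfg: EnvConfig, n_envs: int, use_async_envs: bool = False):
    # The factory simply delegates; no if/elif chain over env types.
    return cfg.create_envs(n_envs, use_async_envs)
```
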
### 3. Env processor (optional) (`src/lerobot/processor/env_processor.py`)

If your benchmark needs observation transforms beyond what `preprocess_observation()` handles (e.g., image flipping, coordinate frame conversion), add a `ProcessorStep` and return it from `get_env_processors()` in your config (see step 2):

```python
@dataclass
@ProcessorStepRegistry.register(name="<benchmark>_processor")
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # Your transforms here
        return processed

    def transform_features(self, features):
        # Update feature declarations if shapes change
        return features

    def observation(self, observation):
        return self._process_observation(observation)
```

See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).

### 4. Dependencies (`pyproject.toml`)

Add a new optional-dependency group under `[project.optional-dependencies]`:

```toml
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
```

**Dependency pinning rules:**

- **Always pin benchmark-specific packages** to exact versions or tight ranges for reproducibility (e.g., `metaworld==3.0.0`, `hf-libero>=0.1.3,<0.2.0`).
- **Add platform markers** if the dependency is platform-specific (e.g., `; sys_platform == 'linux'`).
- **Pin known-fragile transitive dependencies** (e.g., `gymnasium==1.1.0` for Meta-World compatibility).
- **Document version constraints** in the benchmark doc page.

Users install with:
```bash
pip install -e ".[mybenchmark]"
```

### 5. Documentation (`docs/source/<benchmark>.mdx`)

Follow the template below. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.

### 6. Table of contents (`docs/source/_toctree.yml`)

Add your benchmark under the "Benchmarks" section:

```yaml
- sections:
    - local: libero
      title: LIBERO
    - local: metaworld
      title: Meta-World
    - local: envhub_isaaclab_arena
      title: NVIDIA IsaacLab Arena Environments
    - local: <your_benchmark>
      title: <Your Benchmark Name>
  title: "Benchmarks"
```

## Benchmark documentation template

Each benchmark `.mdx` page should follow this structure:

```markdown
# <Benchmark Name>

<1-2 paragraphs: what the benchmark tests and why it matters for robot learning.>

- Paper: [<title>](arxiv_url)
- GitHub: [<repo>](github_url)
- Project website: [<name>](url) (if available)

<Overview image or GIF>

## Available tasks

<Table listing task suites or individual tasks, with counts.
For multi-suite benchmarks, describe each suite briefly.>

| Suite | Tasks | Description |
| ----- | ----- | ----------- |
| ...   | ...   | ...         |

## Installation

After following the LeRobot installation instructions:

    pip install -e ".[<benchmark>]"

<Any additional steps: environment variables, system packages, etc.>

## Evaluation

### Default evaluation (recommended)

<Command with recommended n_episodes, batch_size for reproducible results.>

### Single-task evaluation

<Command example with --env.task=<single_task>>

### Multi-task evaluation

<Command example with comma-separated tasks, if applicable.>

### Policy inputs and outputs

**Observations:**

- `observation.state` — <shape, description>
- `observation.images.image` — <shape, description>
- ...

**Actions:**

- Continuous control in Box(<low>, <high>, shape=(<dim>,))

### Recommended evaluation episodes

<State how many episodes per task are standard for this benchmark.
E.g., "50 episodes per task (500 total for LIBERO Spatial).">

## Training

<Example lerobot-train command.>

## Reproducing published results

<If available: link to pretrained model, eval command, results table.>
```

## How evaluation works

All benchmarks are evaluated uniformly by `lerobot-eval` (see `src/lerobot/scripts/lerobot_eval.py`).

The `eval_policy_all()` function:

1. Receives the nested `{suite: {task_id: VectorEnv}}` dict from `make_env()`.
2. Iterates over every `(suite, task_id, vec_env)` tuple.
3. For each task, runs `n_episodes` rollouts via `eval_policy()` → `rollout()`.
4. Aggregates results hierarchically: **episode → task → suite → overall**.
5. Reports `pc_success` (success rate), `avg_sum_reward`, `avg_max_reward` at each level.
6. Saves all results to `eval_info.json` with the full config snapshot for reproducibility.

The key contract: your `gym.Env` must return `info["is_success"]` on every `step()`, and the `VectorEnv` must surface it through `final_info["is_success"]` on termination. This is how the eval loop detects task completion.
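
The success-detection side of that contract can be sketched as below. The `count_successes` helper and the shape of `step_infos` are illustrative, not LeRobot's exact eval-loop code; the `final_info` / `is_success` keys are the real convention:

```python
def count_successes(step_infos):
    """Count successful episode terminations.

    step_infos: list of (terminated, info) pairs from one env's steps.
    """
    successes = 0
    for terminated, info in step_infos:
        if terminated:
            # On termination, VectorEnv surfaces the terminal step's
            # info under "final_info"; fall back to info itself.
            final = info.get("final_info", info)
            successes += bool(final.get("is_success", False))
    return successes
```
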
## Quick reference: existing benchmarks
| Benchmark      | Env file            | Config class       | Tasks               | Action dim   | Processor                    |
| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
| LIBERO         | `envs/libero.py`    | `LiberoEnv`        | 130 across 5 suites | 7            | `LiberoProcessorStep`        |
| Meta-World     | `envs/metaworld.py` | `MetaworldEnv`     | 50 (MT50)           | 4            | None                         |
| IsaacLab Arena | Hub-hosted          | `IsaaclabArenaEnv` | Configurable        | Configurable | `IsaaclabArenaProcessorStep` |