feat(envs): add RoboCerebra long-horizon manipulation benchmark (#3314)

* feat(ci): add RoboCerebra benchmark eval job - Docker image with robosuite/libero deps for RoboCerebra eval - CI workflow: 1-episode eval with pepijn223/smolvla_robocerebra - Reuses libero env with rename_map + empty_cameras=3 * docs(robocerebra): add benchmark page and toctree entry Add a dedicated docs page for RoboCerebra that points at the canonical dataset lerobot/robocerebra_unified and shows how to run eval + fine-tune against it. Wire it into the Benchmarks section of the toctree so doc-builder picks it up. * ci: point benchmark eval checkpoints at the lerobot/ org mirrors pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in this branch (libero, metaworld, and the per-branch benchmark). The checkpoints were mirrored into the lerobot/ org and that's the canonical location going forward. * fix(robocerebra): drop alias extra + simplify docker image Two problems rolled up: 1. `uv sync --locked --extra test` was failing because pyproject.toml added a `robocerebra = ["lerobot[libero]"]` alias extra but uv.lock wasn't regenerated. Drop the alias — nothing in CI actually needs the extra name (the Dockerfile just installs what it needs directly), so this restores pyproject.toml and uv.lock to byte-exact origin/main. 2. Rebase docker/Dockerfile.benchmark.robocerebra on huggingface/lerobot-gpu:latest (same pattern as libero/metaworld/…). The nightly image already ships lerobot[all] which includes [libero], so the RoboCerebra image is essentially identical to the LIBERO one: fetch libero-assets, write ~/.libero/config.yaml, overlay source. 92 → 43 lines. Also repoint the CI workflow comment that referenced the removed extra. * ci: use dedicated lerobot/smolvla_robocerebra checkpoint for smoke eval Replace the generic pepijn223/smolvla_libero placeholder with the purpose-trained lerobot/smolvla_robocerebra model in the RoboCerebra CI smoke test. * fix(ci): align RoboCerebra eval with training pipeline Fixes 5 mismatches that caused 0% success rate: - env.type: robocerebra (unregistered) → libero - resolution: 360x360 (default) → 256x256 (matches dataset) - camera_name_mapping: map eye_in_hand → wrist_image (not image2) - empty_cameras: 3 → 1 (matches training) - add HF_USER_TOKEN guard on eval step * fix(ci): set env.fps=20 and explicit obs_type for RoboCerebra eval Match the dataset's 20 FPS (LiberoEnv defaults to 30) and make obs_type=pixels_agent_pos explicit for safety against future changes. * docs(robocerebra): align page with adding_benchmarks template Rework docs/source/robocerebra.mdx to follow the standard benchmark doc structure: intro + links + available tasks + installation + eval + recommended episodes + policy I/O + training + reproducing results. - Point everything at lerobot/smolvla_robocerebra (the released checkpoint), not the personal pepijn223 mirror. - Add the --env.fps=20 and --env.obs_type=pixels_agent_pos flags that CI actually uses, so copy-paste eval reproduces CI. - Split the "Training" block out of the recipe section into its own section with the feature table. - Add an explicit "Reproducing published results" section pointing at the CI smoke eval. * fix: integrate PR #3314 review feedback - ci(robocerebra): drop redundant hf auth login step (auth is already performed inside the eval step's container). - ci(robocerebra): add Docker Hub login before the image build to pick up the authenticated rate limit. - docs(robocerebra): align eval snippet with the CI command (observation size, camera_name_mapping, use_async_envs, device, empty_cameras=1). * fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs Port of #3416 onto this branch. * ci: gate Docker Hub login on secret availability * Update .github/workflows/benchmark_tests.yml Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co> Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com> * Update .github/workflows/benchmark_tests.yml Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com> Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-07-08 10:32:00 +00:00 · 2026-04-20 19:12:15 +02:00
parent 0f1c9b0851
commit a147fa4439
4 changed files with 251 additions and 0 deletions
@@ -83,6 +83,8 @@
    title: RoboTwin 2.0
  - local: robocasa
    title: RoboCasa365
+  - local: robocerebra
+    title: RoboCerebra
  - local: envhub_isaaclab_arena
    title: NVIDIA IsaacLab Arena Environments
  title: "Benchmarks"
@@ -0,0 +1,99 @@
+# RoboCerebra
+
+[RoboCerebra](https://robocerebra-project.github.io/) is a long-horizon manipulation benchmark that evaluates **high-level reasoning, planning, and memory** in VLAs. Episodes chain multiple sub-goals with language-grounded intermediate instructions, built on top of LIBERO's simulator stack (MuJoCo + robosuite, Franka Panda 7-DOF).
+
+- Paper: [RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation](https://arxiv.org/abs/2506.06677)
+- Project website: [robocerebra-project.github.io](https://robocerebra-project.github.io/)
+- Dataset: [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) — LeRobot v3.0, 6,660 episodes / 571,116 frames at 20 fps, 1,728 language-grounded sub-tasks.
+- Pretrained policy: [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra)
+
+## Available tasks
+
+RoboCerebra reuses LIBERO's simulator, so evaluation runs against the LIBERO `libero_10` long-horizon suite:
+
+| Suite     | CLI name    | Tasks | Description                                                   |
+| --------- | ----------- | ----- | ------------------------------------------------------------- |
+| LIBERO-10 | `libero_10` | 10    | Long-horizon kitchen/living room tasks chaining 3–6 sub-goals |
+
+Each RoboCerebra episode in the dataset is segmented into multiple sub-tasks with natural-language instructions, which the unified dataset exposes as independent supervision signals.
+
+## Installation
+
+RoboCerebra piggybacks on LIBERO, so the `libero` extra is all you need:
+
+```bash
+pip install -e ".[libero]"
+```
+
+<Tip>
+RoboCerebra requires Linux (MuJoCo / robosuite). Set the rendering backend before training or evaluation:
+
+```bash
+export MUJOCO_GL=egl  # for headless servers (HPC, cloud)
+```
+
+</Tip>
+
+## Evaluation
+
+RoboCerebra eval runs against LIBERO's `libero_10` suite with RoboCerebra's camera naming (`image` + `wrist_image`) and an extra empty-camera slot so a three-view-trained policy receives the expected input layout:
+
+```bash
+lerobot-eval \
+  --policy.path=lerobot/smolvla_robocerebra \
+  --env.type=libero \
+  --env.task=libero_10 \
+  --env.fps=20 \
+  --env.obs_type=pixels_agent_pos \
+  --env.observation_height=256 \
+  --env.observation_width=256 \
+  '--env.camera_name_mapping={"agentview_image": "image", "robot0_eye_in_hand_image": "wrist_image"}' \
+  --eval.batch_size=1 \
+  --eval.n_episodes=10 \
+  --eval.use_async_envs=false \
+  --policy.device=cuda \
+  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.wrist_image": "observation.images.camera2"}' \
+  --policy.empty_cameras=1
+```
+
+### Recommended evaluation episodes
+
+**10 episodes per task** across the `libero_10` suite (100 total) for reproducible benchmarking. Matches the protocol used in the RoboCerebra paper.
+
+## Policy inputs and outputs
+
+**Observations:**
+
+- `observation.state` — 8-dim proprioceptive state (7 joint positions + gripper)
+- `observation.images.image` — third-person view, 256×256 HWC uint8
+- `observation.images.wrist_image` — wrist-mounted camera view, 256×256 HWC uint8
+
+**Actions:**
+
+- Continuous control in `Box(-1, 1, shape=(7,))` — end-effector delta (6D) + gripper (1D)
+
+## Training
+
+The unified dataset at [`lerobot/robocerebra_unified`](https://huggingface.co/datasets/lerobot/robocerebra_unified) exposes two RGB streams and language-grounded sub-task annotations:
+
+| Feature                          | Shape         | Description          |
+| -------------------------------- | ------------- | -------------------- |
+| `observation.images.image`       | (256, 256, 3) | Third-person view    |
+| `observation.images.wrist_image` | (256, 256, 3) | Wrist-mounted camera |
+| `observation.state`              | (8,)          | Joint pos + gripper  |
+| `action`                         | (7,)          | EEF delta + gripper  |
+
+Fine-tune a SmolVLA base on it:
+
+```bash
+lerobot-train \
+  --policy.path=lerobot/smolvla_base \
+  --dataset.repo_id=lerobot/robocerebra_unified \
+  --env.type=libero \
+  --env.task=libero_10 \
+  --output_dir=outputs/smolvla_robocerebra
+```
+
+## Reproducing published results
+
+The released checkpoint [`lerobot/smolvla_robocerebra`](https://huggingface.co/lerobot/smolvla_robocerebra) was trained on `lerobot/robocerebra_unified` and evaluated with the command in the [Evaluation](#evaluation) section. CI runs the same command with `--eval.n_episodes=1` as a smoke test on every PR touching the benchmark.