feat(sim): VLABench benchmark integration (#3396)

feat(sim): add VLABench benchmark integration

Add VLABench as a new simulation benchmark in LeRobot, following the existing LIBERO and MetaWorld patterns. This PR wires VLABench end-to-end across environment integration, Docker setup, CI smoke evaluation, and documentation. It also fixes a number of upstream packaging and runtime issues required to make VLABench usable and reproducible in CI.

What’s included

Benchmark integration
- Add VLABench as a new simulation benchmark.
- Expose supported VLABench tasks through the LeRobot env interface.
- Follow the established LIBERO / MetaWorld factory patterns.
- Preserve lazy async-env metadata so env.unwrapped.metadata["render_fps"] continues to work.

CI smoke evaluation
- Add a VLABench smoke-eval job using lerobot/smolvla_vlabench.
- Use the correct rename_map for the 3-camera dataset layout.
- Expand smoke coverage from 1 to 10 primitive tasks.
- Extract task descriptions after eval so metrics artifacts include per-task labels.
- Skip Docker Hub login when secrets are unavailable (e.g. fork PRs).

Docker / install fixes
- Install VLABench from GitHub rather than PyPI.
- Use uv pip, not pip, in the base image.
- Fail loudly on install errors instead of masking them.
- Clone VLABench into the non-root user’s home directory.
- Use shallow editable installs for VLABench and rrt-algorithms to work around missing __init__.py issues.
- Pin upstream clones to exact commit SHAs for reproducibility.
- Add undeclared runtime dependencies required by VLABench (open3d, colorlog, scikit-learn, openai).
- Unpin open3d so Python 3.12 wheels resolve.

Assets
- Support downloading VLABench assets from a Hugging Face Hub mirror via VLABENCH_ASSETS_REPO.
- Keep Google Drive download support as fallback.
- Install huggingface_hub[hf_xet] so Xet-backed assets download correctly.
- Validate required mesh/XML asset subtrees at build time.
- Patch VLABench constants to tolerate missing asset directories at import time.

Runtime / env correctness
- Import VLABench robots and tasks explicitly so decorator-based registry population happens.
- Resize and normalize camera observations so they always match the declared (H, W, 3) uint8 observation space.
- Reinstall LeRobot editably inside the image so the new env code is actually used.
- Coerce agent_pos / ee_state to the expected shape.
- Pad actions when needed to match data.ctrl.
- Replace zero-padding fallback with proper dm_control IK for 7D end-effector actions.
- Refetch dm_control physics on each step instead of caching weakrefs.
- Retry unstable resets with reseeding and handle PhysicsError gracefully at step time.

Dataset / policy alignment
Align VLABench observations and actions with the Hugging Face dataset conventions used by lerobot/vlabench_unified:
- convert EE position between world frame and robot-base frame at the env boundary,
- expose / consume Euler XYZ instead of raw quaternion layout,
- align gripper semantics with dataset convention (1 = open, 0 = closed).
This fixes policy/env mismatches that previously caused incorrect IK targets and unstable behavior at evaluation time.

Docs
- Add a full docs/source/vlabench.mdx page aligned with the standard benchmark template.
- Document task selection forms (single task, comma list, suite shortcut).
- Document installation, evaluation, training, and result reproduction.
- Point examples at lerobot/smolvla_vlabench.
- Add a benchmark banner image.
- Remove outdated / misleading references to upstream evaluation tracks.
- Document manual install flow instead of a broken vlabench extra.

Packaging cleanup
- Remove the unresolvable vlabench extra from pyproject.toml.
- Remove the no-op VLABench processor step.
- Remove the obsolete env unit test that only covered the dropped gripper remap helper.
- Apply formatting / logging / style cleanup from review feedback.

Why this is needed

VLABench is not currently consumable as a normal Python dependency and requires several upstream workarounds:
- no PyPI release,
- missing package declarations,
- undeclared runtime deps,
- SSH-only submodule references,
- asset downloads outside normal package install flow,
- registry population that depends on import side effects,
- env outputs that do not always match declared observation shapes,
- task resets that can diverge under some random layouts.

This PR makes the benchmark usable in LeRobot despite those constraints, and ensures CI runs are reproducible and informative. It also incorporates review feedback and improves reproducibility by pinning upstream commits.
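
For illustration, the conversion between world frame and robot-base frame listed under "Dataset / policy alignment" can be sketched as below. This is a minimal sketch rather than the code added in this commit: the function names and the numeric values are hypothetical, and it assumes the base frame differs from the world frame by a pure translation (the wrapper's comments describe subtracting the robot base position from the end-effector position).

```python
from typing import Sequence

Vec3 = tuple[float, float, float]


def world_to_base(ee_world: Sequence[float], base_world: Sequence[float]) -> Vec3:
    """Observation side: express a world-frame EE position in the robot-base frame.

    Assumes a translation-only offset between the two frames (hypothetical helper).
    """
    x, y, z = (e - b for e, b in zip(ee_world, base_world))
    return (x, y, z)


def base_to_world(target_base: Sequence[float], base_world: Sequence[float]) -> Vec3:
    """Action side: turn a base-frame position target back into a world-frame target."""
    x, y, z = (t + b for t, b in zip(target_base, base_world))
    return (x, y, z)


# Made-up numbers, purely for illustration.
robot_base_xyz = (0.0, -0.4, 0.78)
ee_world = (0.3, -0.1, 1.0)

ee_base = world_to_base(ee_world, robot_base_xyz)        # what the policy would observe
ik_target = base_to_world(ee_base, robot_base_xyz)       # what IK would receive
```

The point of doing both conversions at the env boundary is that the policy only ever sees base-frame positions, matching the dataset, while the simulator keeps working in world coordinates.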
2026-07-08 10:32:00 +00:00 · 2026-04-21 17:54:11 +02:00
parent a07f22e22c
commit 5adad11128
8 changed files with 1053 additions and 0 deletions
@@ -843,3 +843,103 @@ jobs:
          name: libero-plus-metrics
          path: /tmp/libero-plus-artifacts/metrics.json
          if-no-files-found: warn
+
+  # ── VLABENCH ─────────────────────────────────────────────────────────────
+  # Isolated image: VLABench installed from GitHub (no lerobot[vlabench] extra), mujoco==3.2.2, dm-control chain
+  vlabench-integration-test:
+    name: VLABench — build image + 1-episode eval
+    runs-on:
+      group: aws-g6-4xlarge-plus
+    env:
+      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
+
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
+        with:
+          persist-credentials: false
+          lfs: true
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
+        with:
+          cache-binary: false
+
+      - name: Login to Docker Hub
+        if: ${{ env.DOCKERHUB_USERNAME != '' }}
+        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
+        with:
+          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}
+        env:
+          DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
+
+      - name: Build VLABench benchmark image
+        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
+        with:
+          context: .
+          file: docker/Dockerfile.benchmark.vlabench
+          push: false
+          load: true
+          tags: lerobot-benchmark-vlabench:ci
+          build-args: |
+            VLABENCH_ASSETS_REPO=lerobot/vlabench-assets
+
+      - name: Run VLABench smoke eval (10 tasks, 1 episode each)
+        if: env.HF_USER_TOKEN != ''
+        run: |
+          docker run --name vlabench-eval --gpus all \
+            --shm-size=4g \
+            -e HF_HOME=/tmp/hf \
+            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
+            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
+            -e MUJOCO_GL=egl \
+            lerobot-benchmark-vlabench:ci \
+            bash -c "
+              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
+              lerobot-eval \
+                --policy.path=lerobot/smolvla_vlabench \
+                --env.type=vlabench \
+                --env.task=select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
+                --eval.batch_size=1 \
+                --eval.n_episodes=1 \
+                --eval.use_async_envs=false \
+                --policy.device=cuda \
+                '--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.second_image\": \"observation.images.camera2\", \"observation.images.wrist_image\": \"observation.images.camera3\"}' \
+                --output_dir=/tmp/eval-artifacts
+              python scripts/ci/extract_task_descriptions.py \
+                --env vlabench \
+                --task select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
+                --output /tmp/eval-artifacts/task_descriptions.json
+            "
+
+      - name: Copy VLABench artifacts from container
+        if: always()
+        run: |
+          mkdir -p /tmp/vlabench-artifacts
+          docker cp vlabench-eval:/tmp/eval-artifacts/. /tmp/vlabench-artifacts/ 2>/dev/null || true
+          docker rm -f vlabench-eval || true
+
+      - name: Parse VLABench eval metrics
+        if: always()
+        run: |
+          python3 scripts/ci/parse_eval_metrics.py \
+            --artifacts-dir /tmp/vlabench-artifacts \
+            --env vlabench \
+            --task select_fruit,select_toy,select_book,select_painting,select_drink,select_ingredient,select_billiards,select_poker,add_condiment,insert_flower \
+            --policy lerobot/smolvla_vlabench
+
+      - name: Upload VLABench rollout video
+        if: always()
+        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
+        with:
+          name: vlabench-rollout-video
+          path: /tmp/vlabench-artifacts/videos/
+          if-no-files-found: warn
+
+      - name: Upload VLABench eval metrics
+        if: always()
+        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
+        with:
+          name: vlabench-metrics
+          path: /tmp/vlabench-artifacts/metrics.json
+          if-no-files-found: warn
@@ -0,0 +1,99 @@
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Benchmark image for VLABench integration tests.
+# Extends the nightly GPU image with the PR's source code and VLABench setup.
+#
+# Build:  docker build -f docker/Dockerfile.benchmark.vlabench -t lerobot-benchmark-vlabench .
+# Run:    docker run --gpus all --rm lerobot-benchmark-vlabench lerobot-eval ...
+
+FROM huggingface/lerobot-gpu:latest
+
+# Install VLABench from GitHub (not on PyPI) and pin MuJoCo/dm-control.
+# Clone without submodule recursion (nested SSH-only submodules fail in CI).
+# Editable install (-e) because VLABench/utils/ has no __init__.py, so
+# find_packages() omits it from wheels; editable mode uses the source tree directly.
+# rrt-algorithms has the same packaging issue (rrt/ dir missing __init__.py).
+# Patch: constant.py calls os.listdir on ~100 asset/obj/meshes/* dirs at import
+# time. Guard the call so missing dirs return [] instead of crashing (in case
+# the asset download is partial).
+#
+# Pinned upstream SHAs for reproducible benchmark runs. Bump when you need
+# an upstream fix; don't rely on `main`/`develop` drift.
+ARG VLABENCH_SHA=cf588fe60c0c7282174fe979f5913170cfe69017
+ARG RRT_ALGORITHMS_SHA=e51d95ee489a225220d6ae2a764c4111f6ba7d85
+RUN git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench && \
+    git -C ~/VLABench checkout ${VLABENCH_SHA} && \
+    git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms && \
+    git -C ~/rrt-algorithms checkout ${RRT_ALGORITHMS_SHA} && \
+    python3 -c "\
+import pathlib; \
+p = pathlib.Path.home() / 'VLABench/VLABench/configs/constant.py'; \
+t = p.read_text(); \
+p.write_text(t.replace( \
+    'subdirs = os.listdir(xml_dir)', \
+    'if not os.path.isdir(xml_dir): return []\n    subdirs = os.listdir(xml_dir)'))" && \
+    uv pip install --no-cache -e ~/VLABench -e ~/rrt-algorithms \
+      mujoco==3.2.2 dm-control==1.0.22 \
+      open3d colorlog scikit-learn openai gdown
+
+# Download VLABench mesh assets. Task configs reference object meshes
+# (obj/meshes/fruit/, containers/basket/, tablewares/plates/, etc.); without
+# them the task builder picks from an empty mesh list and crashes with
+# IndexError at task-build time (random.choice([]) in config_manager.py).
+#
+# Preferred source: an HF Hub mirror. Set VLABENCH_ASSETS_REPO at build time
+# (e.g. --build-arg VLABENCH_ASSETS_REPO=lerobot/vlabench-assets) and we'll
+# snapshot_download the repo into VLABench's assets dir. This is the reliable
+# path for CI — Google Drive frequently returns HTTP 429 ("Too many users have
+# viewed or downloaded this file recently") on shared academic files.
+#
+# After download we *validate* that at least one XML exists under each
+# task-critical subtree and fail the build loudly if not. Silent-empty asset
+# dirs are the #1 cause of VLABench runtime crashes in CI, so we surface them
+# here rather than after a 10-minute eval build.
+#
+# Fallback: VLABench's own gdown-based script. Best-effort only.
+ARG VLABENCH_ASSETS_REPO=""
+RUN ASSETS_DIR="$HOME/VLABench/VLABench/assets" && \
+    if [ -n "${VLABENCH_ASSETS_REPO}" ]; then \
+        echo "Downloading VLABench assets from HF Hub: ${VLABENCH_ASSETS_REPO}" && \
+        uv pip install --no-cache "huggingface_hub[hf_xet]>=0.26" && \
+        python -c "from huggingface_hub import snapshot_download; \
+p = snapshot_download(repo_id='${VLABENCH_ASSETS_REPO}', repo_type='dataset', \
+    local_dir='${ASSETS_DIR}', allow_patterns=['obj/**', 'scenes/**']); \
+print('snapshot_download returned:', p)"; \
+    else \
+        echo "No VLABENCH_ASSETS_REPO set — falling back to gdown" && \
+        python ~/VLABench/scripts/download_assets.py --choice all; \
+    fi && \
+    python -c "\
+from pathlib import Path; \
+import sys; \
+root = Path('${ASSETS_DIR}'); \
+checks = ['obj/meshes/tablewares/plates', 'obj/meshes/containers/basket', 'obj/meshes/fruit', 'obj/meshes/containers/tray']; \
+failed = []; \
+print(f'Validating VLABench assets under {root}'); \
+[print(f'  {c}: {len(list((root/c).rglob(\"*.xml\")))} XMLs') for c in checks]; \
+[failed.append(c) for c in checks if not any((root/c).rglob('*.xml'))]; \
+sys.exit(f'Empty asset dirs (no *.xml): {failed}') if failed else print('All asset dirs populated.')"
+
+# Overlay the PR's source code on top of the nightly image.
+COPY --chown=user_lerobot:user_lerobot . .
+
+# Re-install lerobot editably so the new source (with VLABenchEnv registration
+# and updated obs handling) replaces the stale package baked into the nightly image.
+RUN uv pip install --no-cache --no-deps -e .
+
+CMD ["/bin/bash"]
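
The inline `python -c` validation in the Dockerfile above is dense, so here is the same check written out as a standalone sketch. `find_empty_asset_dirs` and `REQUIRED_SUBTREES` are hypothetical names that do not exist in VLABench or LeRobot; the subtree list is copied from the Dockerfile, and the explicit `is_dir()` guard is an addition for clarity.

```python
from pathlib import Path

# Subtrees checked by the Dockerfile's inline validation step.
REQUIRED_SUBTREES = (
    "obj/meshes/tablewares/plates",
    "obj/meshes/containers/basket",
    "obj/meshes/fruit",
    "obj/meshes/containers/tray",
)


def find_empty_asset_dirs(root: Path, subtrees=REQUIRED_SUBTREES) -> list[str]:
    """Return the subtrees under `root` that contain no *.xml file at any depth.

    A missing directory counts as empty, which is the failure mode the
    build-time check is meant to surface.
    """
    empty = []
    for subtree in subtrees:
        directory = root / subtree
        has_xml = directory.is_dir() and any(directory.rglob("*.xml"))
        if not has_xml:
            empty.append(subtree)
    return empty
```

In a build step, a non-empty return value would be passed to `sys.exit(...)` so the image build fails immediately, as the Dockerfile does.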
@@ -91,6 +91,8 @@
    title: RoboMME
  - local: envhub_isaaclab_arena
    title: NVIDIA IsaacLab Arena Environments
+  - local: vlabench
+    title: VLABench
  title: "Benchmarks"
 - sections:
  - local: introduction_processors
@@ -0,0 +1,176 @@
+# VLABench
+
+[VLABench](https://github.com/OpenMOSS/VLABench) is a large-scale benchmark for **language-conditioned robotic manipulation with long-horizon reasoning**. The upstream suite covers 100 task categories across 2,000+ objects and evaluates six dimensions of robot intelligence: mesh & texture understanding, spatial reasoning, world-knowledge transfer, semantic instruction comprehension, physical-law understanding, and long-horizon planning. It is built on MuJoCo / dm_control and uses a Franka Panda 7-DOF arm. LeRobot exposes **43 of these tasks** through `--env.task` (21 primitives + 22 composites, see [Available tasks](#available-tasks) below).
+
+- Paper: [VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning](https://arxiv.org/abs/2412.18194)
+- GitHub: [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench)
+- Project website: [vlabench.github.io](https://vlabench.github.io)
+- Pretrained policy: [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench)
+
+<img
+  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/vlabench.png"
+  alt="VLABench benchmark overview"
+  width="85%"
+/>
+
+## Available tasks
+
+VLABench ships two task suites covering **43 task categories** in LeRobot's `--env.task` surface:
+
+| Suite     | CLI name    | Tasks | Description                                                      |
+| --------- | ----------- | ----- | ---------------------------------------------------------------- |
+| Primitive | `primitive` | 21    | Single / few-skill combinations (select, insert, physics QA)     |
+| Composite | `composite` | 22    | Multi-step reasoning and long-horizon planning (cook, rearrange) |
+
+**Primitive tasks:** `select_fruit`, `select_toy`, `select_chemistry_tube`, `add_condiment`, `select_book`, `select_painting`, `select_drink`, `insert_flower`, `select_billiards`, `select_ingredient`, `select_mahjong`, `select_poker`, and physical-reasoning tasks (`density_qa`, `friction_qa`, `magnetism_qa`, `reflection_qa`, `simple_cuestick_usage`, `simple_seesaw_usage`, `sound_speed_qa`, `thermal_expansion_qa`, `weight_qa`).
+
+**Composite tasks:** `cluster_billiards`, `cluster_book`, `cluster_drink`, `cluster_toy`, `cook_dishes`, `cool_drink`, `find_unseen_object`, `get_coffee`, `hammer_nail`, `heat_food`, `make_juice`, `play_mahjong`, `play_math_game`, `play_poker`, `play_snooker`, `rearrange_book`, `rearrange_chemistry_tube`, `set_dining_table`, `set_study_table`, `store_food`, `take_chemistry_experiment`, `use_seesaw_complex`.
+
+`--env.task` accepts three forms:
+
+- a single task name (`select_fruit`)
+- a comma-separated list (`select_fruit,heat_food`)
+- a suite shortcut (`primitive`, `composite`, or `primitive,composite`)
+
+## Installation
+
+VLABench is **not on PyPI** — its only distribution is the [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench) GitHub repo — so LeRobot does not expose a `vlabench` extra. Install it manually as an editable clone, alongside the MuJoCo / dm_control pins VLABench needs, then fetch the mesh assets:
+
+```bash
+# After following the standard LeRobot installation instructions.
+
+git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench
+git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms
+pip install -e ~/VLABench -e ~/rrt-algorithms
+pip install "mujoco==3.2.2" "dm-control==1.0.22" \
+            open3d colorlog scikit-learn openai gdown
+
+python ~/VLABench/scripts/download_assets.py
+```
+
+<Tip>
+VLABench requires Linux (`sys_platform == 'linux'`) and Python 3.10+. Set the MuJoCo rendering backend before running:
+
+```bash
+export MUJOCO_GL=egl  # for headless servers (HPC, cloud)
+```
+
+</Tip>
+
+## Evaluation
+
+All eval snippets below mirror the command CI runs (see `.github/workflows/benchmark_tests.yml`). The `--rename_map` argument maps VLABench's `image` / `second_image` / `wrist_image` camera keys onto the three-camera (`camera1` / `camera2` / `camera3`) input layout the released `smolvla_vlabench` policy was trained on.
+
+### Single-task evaluation (recommended for quick iteration)
+
+```bash
+lerobot-eval \
+  --policy.path=lerobot/smolvla_vlabench \
+  --env.type=vlabench \
+  --env.task=select_fruit \
+  --eval.batch_size=1 \
+  --eval.n_episodes=10 \
+  --eval.use_async_envs=false \
+  --policy.device=cuda \
+  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
+```
+
+### Multi-task evaluation
+
+Pass a comma-separated list of tasks:
+
+```bash
+lerobot-eval \
+  --policy.path=lerobot/smolvla_vlabench \
+  --env.type=vlabench \
+  --env.task=select_fruit,select_toy,add_condiment,heat_food \
+  --eval.batch_size=1 \
+  --eval.n_episodes=10 \
+  --eval.use_async_envs=false \
+  --policy.device=cuda \
+  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
+```
+
+### Suite-wide evaluation
+
+Run an entire suite (all 21 primitives or all 22 composites):
+
+```bash
+lerobot-eval \
+  --policy.path=lerobot/smolvla_vlabench \
+  --env.type=vlabench \
+  --env.task=primitive \
+  --eval.batch_size=1 \
+  --eval.n_episodes=10 \
+  --eval.use_async_envs=false \
+  --policy.device=cuda \
+  --env.max_parallel_tasks=1 \
+  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
+```
+
+Or both suites:
+
+```bash
+lerobot-eval \
+  --policy.path=lerobot/smolvla_vlabench \
+  --env.type=vlabench \
+  --env.task=primitive,composite \
+  --eval.batch_size=1 \
+  --eval.n_episodes=10 \
+  --eval.use_async_envs=false \
+  --policy.device=cuda \
+  --env.max_parallel_tasks=1 \
+  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
+```
+
+### Recommended evaluation episodes
+
+Use **10 episodes per task** for reproducible benchmarking (210 episodes in total for the full primitive suite, 220 for the full composite suite). This matches the protocol in the VLABench paper.
+
+## Policy inputs and outputs
+
+**Observations:**
+
+- `observation.state` — 7-dim end-effector state (position xyz + Euler xyz + gripper)
+- `observation.images.image` — front camera, 480×480 HWC uint8
+- `observation.images.second_image` — second camera, 480×480 HWC uint8
+- `observation.images.wrist_image` — wrist camera, 480×480 HWC uint8
+
+**Actions:**
+
+- Continuous 7-dim control: 3D position + 3D Euler orientation, each in `[-1, 1]`, plus a 1D gripper command in `[0, 1]` (1 = open, 0 = closed).
+
+## Training
+
+### Datasets
+
+Pre-collected VLABench datasets in LeRobot format on the Hub:
+
+- [`VLABench/vlabench_primitive_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_primitive_ft_lerobot_video) — 5,000 episodes, 128 tasks, 480×480 images.
+- [`VLABench/vlabench_composite_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_composite_ft_lerobot_video) — 5,977 episodes, 167 tasks, 224×224 images.
+
+### Example training command
+
+Fine-tune a SmolVLA base on the primitive suite:
+
+```bash
+lerobot-train \
+  --policy.type=smolvla \
+  --policy.repo_id=${HF_USER}/smolvla_vlabench_primitive \
+  --policy.load_vlm_weights=true \
+  --policy.push_to_hub=true \
+  --dataset.repo_id=VLABench/vlabench_primitive_ft_lerobot_video \
+  --env.type=vlabench \
+  --env.task=select_fruit \
+  --output_dir=./outputs/smolvla_vlabench_primitive \
+  --steps=100000 \
+  --batch_size=4 \
+  --eval_freq=5000 \
+  --eval.batch_size=1 \
+  --eval.n_episodes=1 \
+  --save_freq=10000
+```
+
+## Reproducing published results
+
+The released checkpoint [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench) was trained on the primitive-suite dataset above and is evaluated with the [Single-task](#single-task-evaluation-recommended-for-quick-iteration) / [Suite-wide](#suite-wide-evaluation) commands. CI runs a 10-primitive-task smoke eval (one episode each) on every PR touching the benchmark.
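
The three `--env.task` forms described under "Available tasks" in the page above could be resolved along these lines. This is a hedged sketch: `expand_task_spec` is a hypothetical helper, the suite lists are truncated for brevity, and the actual parsing in LeRobot's VLABench wrapper may differ (for example in how duplicates or unknown names are handled).

```python
# Truncated suite lists, for illustration only (the real suites have 21 and 22 tasks).
EXAMPLE_SUITES: dict[str, list[str]] = {
    "primitive": ["select_fruit", "select_toy", "add_condiment"],
    "composite": ["heat_food", "cook_dishes"],
}


def expand_task_spec(spec: str, suites: dict[str, list[str]]) -> list[str]:
    """Expand a task spec into an ordered, de-duplicated list of task names.

    Accepts a single task name, a comma-separated list, or suite shortcuts.
    """
    tasks: list[str] = []
    for token in (part.strip() for part in spec.split(",")):
        if not token:
            continue  # tolerate stray commas such as "a,,b"
        # A suite name expands to all of its tasks; anything else is a task name.
        for name in suites.get(token, [token]):
            if name not in tasks:
                tasks.append(name)
    return tasks


single = expand_task_spec("select_fruit", EXAMPLE_SUITES)
listed = expand_task_spec("select_fruit,heat_food", EXAMPLE_SUITES)
both_suites = expand_task_spec("primitive,composite", EXAMPLE_SUITES)
```

Treating a suite shortcut as just another token keeps the three forms on a single code path, so mixed specs such as a suite plus an extra task name also work in this sketch.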
@@ -212,6 +212,11 @@ aloha = ["lerobot[dataset]", "gym-aloha>=0.1.2,<0.2.0", "lerobot[scipy-dep]"]
 pusht = ["lerobot[dataset]", "gym-pusht>=0.1.5,<0.2.0", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk version in gym-pusht instead
 libero = ["lerobot[dataset]", "lerobot[transformers-dep]", "hf-libero>=0.1.3,<0.2.0; sys_platform == 'linux'", "lerobot[scipy-dep]"]
 metaworld = ["lerobot[dataset]", "metaworld==3.0.0", "lerobot[scipy-dep]"]
+# NOTE: vlabench is NOT exposed as a `lerobot` extra. Its only distribution
+# is the OpenMOSS/VLABench GitHub repo (package name `VLABench`, no PyPI
+# release), so any `vlabench>=X` pip spec is unresolvable. Install it
+# manually alongside MuJoCo / dm-control — see docs/source/vlabench.mdx
+# for the recipe.
 # NOTE: robomme is NOT a pyproject extra — mani-skill hard-pins numpy<2
 # which conflicts with lerobot's numpy>=2 base pin, so the two trees can't
 # resolve into a single env. Install it only in the RoboMME Docker image
@@ -142,6 +142,21 @@ def _robomme_descriptions(task_names: str, task_ids: list[int] | None = None) ->
    return out


+def _vlabench_descriptions(task_spec: str) -> dict[str, str]:
+    """For each task in the comma-separated list, emit a cleaned-name label.
+
+    VLABench tasks carry language instructions on their dm_control task
+    object, but pulling them requires loading the full env per task
+    (~seconds each). The CI smoke-eval already captures the instruction
+    inside its episode info; this mapping is just enough to key
+    `metrics.json` by `<task>_0`.
+    """
+    out: dict[str, str] = {}
+    for task in (t.strip() for t in task_spec.split(",") if t.strip()):
+        out[f"{task}_0"] = task.replace("_", " ").strip()
+    return out
+
+
 def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--env", required=True, help="Environment family (libero, metaworld, ...)")
@@ -171,6 +186,8 @@ def main() -> int:
            descriptions = _robocasa_descriptions(args.task)
        elif args.env == "robomme":
            descriptions = _robomme_descriptions(args.task, task_ids=task_ids)
+        elif args.env == "vlabench":
+            descriptions = _vlabench_descriptions(args.task)
        else:
            print(
                f"[extract_task_descriptions] No description extractor for env '{args.env}'.",
@@ -573,6 +573,71 @@ class RoboCasaEnv(EnvConfig):
        )


+@EnvConfig.register_subclass("vlabench")
+@dataclass
+class VLABenchEnv(EnvConfig):
+    task: str = "select_fruit"
+    fps: int = 10
+    episode_length: int = 500
+    obs_type: str = "pixels_agent_pos"
+    render_mode: str = "rgb_array"
+    render_resolution: tuple[int, int] = (480, 480)
+    robot: str = "franka"
+    action_mode: str = "eef"
+    features: dict[str, PolicyFeature] = field(
+        default_factory=lambda: {
+            ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(7,)),
+        }
+    )
+    features_map: dict[str, str] = field(
+        default_factory=lambda: {
+            ACTION: ACTION,
+            "agent_pos": OBS_STATE,
+            "pixels/image": f"{OBS_IMAGES}.image",
+            "pixels/second_image": f"{OBS_IMAGES}.second_image",
+            "pixels/wrist_image": f"{OBS_IMAGES}.wrist_image",
+        }
+    )
+
+    def __post_init__(self):
+        h, w = self.render_resolution
+        if self.obs_type == "pixels":
+            self.features["pixels/image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
+            self.features["pixels/second_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
+            self.features["pixels/wrist_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
+        elif self.obs_type == "pixels_agent_pos":
+            self.features["pixels/image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
+            self.features["pixels/second_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
+            self.features["pixels/wrist_image"] = PolicyFeature(type=FeatureType.VISUAL, shape=(h, w, 3))
+            self.features["agent_pos"] = PolicyFeature(type=FeatureType.STATE, shape=(7,))
+        else:
+            raise ValueError(f"Unsupported obs_type: {self.obs_type}")
+
+    @property
+    def gym_kwargs(self) -> dict:
+        return {
+            "obs_type": self.obs_type,
+            "render_mode": self.render_mode,
+            "render_resolution": self.render_resolution,
+            "robot": self.robot,
+            "max_episode_steps": self.episode_length,
+            "action_mode": self.action_mode,
+        }
+
+    def create_envs(self, n_envs: int, use_async_envs: bool = False):
+        from .vlabench import create_vlabench_envs
+
+        if self.task is None:
+            raise ValueError("VLABenchEnv requires a task to be specified")
+        env_cls = _make_vec_env_cls(use_async_envs, n_envs)
+        return create_vlabench_envs(
+            task=self.task,
+            n_envs=n_envs,
+            gym_kwargs=self.gym_kwargs,
+            env_cls=env_cls,
+        )
+
+
@EnvConfig.register_subclass("isaaclab_arena")
@dataclass
 class IsaaclabArenaEnv(HubEnvConfig):
@@ -0,0 +1,589 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""VLABench environment wrapper for LeRobot.
+
+VLABench is a large-scale benchmark for language-conditioned robotic manipulation
+with long-horizon reasoning, built on MuJoCo/dm_control.
+
+- Paper: https://arxiv.org/abs/2412.18194
+- GitHub: https://github.com/OpenMOSS/VLABench
+- Website: https://vlabench.github.io
+"""
+
+from __future__ import annotations
+
+import contextlib
+import logging
+from collections import defaultdict
+from collections.abc import Callable, Sequence
+from typing import Any
+
+import cv2
+import gymnasium as gym
+import numpy as np
+from gymnasium import spaces
+from scipy.spatial.transform import Rotation
+
+from lerobot.types import RobotObservation
+
+from .utils import _LazyAsyncVectorEnv
+
+logger = logging.getLogger(__name__)
+
+ACTION_DIM = 7  # pos(3) + euler(3) + gripper(1)
+ACTION_LOW = np.array([-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, 0.0], dtype=np.float32)
+ACTION_HIGH = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], dtype=np.float32)
+
+# Default maximum number of steps per episode
+DEFAULT_MAX_EPISODE_STEPS = 500
+
+# VLABench task suites
+PRIMITIVE_TASKS = [
+    "select_fruit",
+    "select_toy",
+    "select_chemistry_tube",
+    "add_condiment",
+    "select_book",
+    "select_painting",
+    "select_drink",
+    "insert_flower",
+    "select_billiards",
+    "select_ingredient",
+    "select_mahjong",
+    "select_poker",
+    # Physical series
+    "density_qa",
+    "friction_qa",
+    "magnetism_qa",
+    "reflection_qa",
+    "simple_cuestick_usage",
+    "simple_seesaw_usage",
+    "sound_speed_qa",
+    "thermal_expansion_qa",
+    "weight_qa",
+]
+
+COMPOSITE_TASKS = [
+    "cluster_billiards",
+    "cluster_book",
+    "cluster_drink",
+    "cluster_toy",
+    "cook_dishes",
+    "cool_drink",
+    "find_unseen_object",
+    "get_coffee",
+    "hammer_nail",
+    "heat_food",
+    "make_juice",
+    "play_mahjong",
+    "play_math_game",
+    "play_poker",
+    "play_snooker",
+    "rearrange_book",
+    "rearrange_chemistry_tube",
+    "set_dining_table",
+    "set_study_table",
+    "store_food",
+    "take_chemistry_experiment",
+    "use_seesaw_complex",
+]
+
+SUITE_TASKS: dict[str, list[str]] = {
+    "primitive": PRIMITIVE_TASKS,
+    "composite": COMPOSITE_TASKS,
+}
+
+
+class VLABenchEnv(gym.Env):
+    """Gymnasium wrapper for VLABench environments.
+
+    Wraps the dm_control-based VLABench simulator behind a standard gym.Env interface.
+    Supports multiple cameras (front, second, wrist) and end-effector control.
+    """
+
+    metadata = {"render_modes": ["rgb_array"], "render_fps": 10}
+
+    def __init__(
+        self,
+        task: str = "select_fruit",
+        obs_type: str = "pixels_agent_pos",
+        render_mode: str = "rgb_array",
+        render_resolution: tuple[int, int] = (480, 480),
+        robot: str = "franka",
+        max_episode_steps: int = DEFAULT_MAX_EPISODE_STEPS,
+        action_mode: str = "eef",
+    ):
+        super().__init__()
+        self.task = task
+        self.obs_type = obs_type
+        self.render_mode = render_mode
+        self.render_resolution = render_resolution
+        self.robot = robot
+        self._max_episode_steps = max_episode_steps
+        self.action_mode = action_mode
+
+        # Deferred — created on first reset() inside worker subprocess to avoid
+        # inheriting stale GPU/EGL contexts when AsyncVectorEnv spawns workers.
+        # We never cache `env.physics`: dm_control exposes it as a weakref
+        # proxy that goes stale across resets (rebuilds the sim), so we always
+        # refetch it via `self._env.physics` at the call site.
+        self._env = None
+        self.task_description = ""  # populated on first reset
+        # Cached world-frame XYZ of the robot base link. The VLABench datasets
+        # log both `observation.state` positions and `actions` positions in
+        # robot-base frame (see VLABench/scripts/convert_to_lerobot.py which
+        # subtracts `robot_frame_pos` from ee_pos). The robot is attached at a
+        # fixed offset per task so this is safe to cache once per env build.
+        self._robot_base_xyz: np.ndarray | None = None
+
+        h, w = self.render_resolution
+
+        if self.obs_type == "state":
+            raise NotImplementedError(
+                "The 'state' observation type is not supported in VLABenchEnv. "
+                "Please use 'pixels' or 'pixels_agent_pos'."
+            )
+        elif self.obs_type == "pixels":
+            self.observation_space = spaces.Dict(
+                {
+                    "pixels": spaces.Dict(
+                        {
+                            "image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
+                            "second_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
+                            "wrist_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
+                        }
+                    ),
+                }
+            )
+        elif self.obs_type == "pixels_agent_pos":
+            self.observation_space = spaces.Dict(
+                {
+                    "pixels": spaces.Dict(
+                        {
+                            "image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
+                            "second_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
+                            "wrist_image": spaces.Box(low=0, high=255, shape=(h, w, 3), dtype=np.uint8),
+                        }
+                    ),
+                    "agent_pos": spaces.Box(low=-np.inf, high=np.inf, shape=(7,), dtype=np.float64),
+                }
+            )
+        else:
+            raise ValueError(f"Unsupported obs_type: {self.obs_type}")
+
+        self.action_space = spaces.Box(low=ACTION_LOW, high=ACTION_HIGH, dtype=np.float32)
+
+    # Max attempts to rebuild the underlying env when MuJoCo throws
+    # `PhysicsError` (e.g. mjWARN_BADQACC) during VLABench's 20-step
+    # reset warm-up. Some random task/layout samples land in unstable
+    # initial configurations; re-sampling the layout almost always
+    # gives a stable one. A handful of upstream tasks (notably
+    # `select_mahjong`) have layout samplers that diverge often enough
+    # to need far more than 5 retries, so we pick a generous ceiling.
+    _ENSURE_ENV_MAX_ATTEMPTS = 20
+
+    def _ensure_env(self) -> None:
+        """Create the underlying VLABench env on first use.
+
+        Called inside the worker subprocess after fork(), so each worker gets
+        its own clean rendering context rather than inheriting a stale one from
+        the parent process (which causes crashes with AsyncVectorEnv).
+
+        Retries on `PhysicsError`: VLABench's `LM4ManipDMEnv.reset()` runs 20
+        warm-up `step()` calls while toggling gravity/fluids to let the scene
+        settle; for some random layouts MuJoCo's integrator diverges and
+        raises `mjWARN_BADQACC`. Re-sampling the layout almost always yields
+        a stable one, so we retry a number of times before giving up. Between
+        attempts we reseed NumPy's global RNG from OS entropy so the upstream
+        task sampler explores fresh initial states — without this, retries
+        can replay the same diverging configuration when the sampler is
+        deterministic given the current RNG state.
+        """
+        if self._env is not None:
+            return
+
+        import VLABench.robots  # noqa: F401  # type: ignore[import-untyped]
+        import VLABench.tasks  # noqa: F401  # type: ignore[import-untyped]
+        from dm_control.rl.control import PhysicsError  # type: ignore[import-untyped]
+        from VLABench.envs import load_env  # type: ignore[import-untyped]
+
+        h, w = self.render_resolution
+        last_exc: PhysicsError | None = None
+        for attempt in range(1, self._ENSURE_ENV_MAX_ATTEMPTS + 1):
+            try:
+                env = load_env(task=self.task, robot=self.robot, render_resolution=(h, w))
+                self._env = env
+                break
+            except PhysicsError as exc:
+                last_exc = exc
+                logger.warning(
+                    "PhysicsError on attempt %d/%d while building task '%s': %s. Retrying with fresh layout…",
+                    attempt,
+                    self._ENSURE_ENV_MAX_ATTEMPTS,
+                    self.task,
+                    exc,
+                )
+                np.random.seed(None)
+        if self._env is None:
+            assert last_exc is not None
+            raise RuntimeError(
+                f"VLABench task '{self.task}' failed to produce a stable "
+                f"initial layout after {self._ENSURE_ENV_MAX_ATTEMPTS} "
+                f"attempts. This task's upstream sampler diverges too "
+                f"often for the configured robot; consider removing it "
+                f"from the eval set. Last physics error: {last_exc}"
+            ) from last_exc
+
+        # Extract task description from the dm_control task
+        task_obj = self._env.task
+        if hasattr(task_obj, "task_description"):
+            self.task_description = task_obj.task_description
+        elif hasattr(task_obj, "language_instruction"):
+            self.task_description = task_obj.language_instruction
+        else:
+            self.task_description = self.task
+
+        # Cache robot base world position so `_build_ctrl_from_action` and
+        # `_get_obs` can translate between robot-frame (dataset) and
+        # world-frame (dm_control) without hitting physics every call.
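+        try:
+            base_xyz = self._env.get_robot_frame_position()
+            self._robot_base_xyz = np.asarray(base_xyz, dtype=np.float64).reshape(3)
+        except Exception as exc:
+            # Fall back to VLABench's default Franka base position. This is
+            # only correct for the default Franka setup, so log the fallback.
+            logger.warning(
+                "Could not read robot base position for task '%s' (%s); "
+                "falling back to the default Franka base offset.",
+                self.task,
+                exc,
+            )
+            self._robot_base_xyz = np.array([0.0, -0.4, 0.78], dtype=np.float64)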
+
+    def _get_obs(self) -> dict:
+        """Get current observation from the environment."""
+        assert self._env is not None
+
+        obs = self._env.get_observation()
+        h, w = self.render_resolution
+
+        def _to_hwc3(arr: np.ndarray) -> np.ndarray:
+            """Coerce any camera array to the declared (h, w, 3) uint8 shape."""
+            a = np.asarray(arr)
+            # Drop a leading singleton batch dim if present.
+            while a.ndim > 3 and a.shape[0] == 1:
+                a = a[0]
+            if a.ndim == 3 and a.shape[0] in (1, 3, 4) and a.shape[-1] not in (1, 3, 4):
+                # CHW → HWC
+                a = np.transpose(a, (1, 2, 0))
+            if a.ndim == 2:
+                a = np.stack([a] * 3, axis=-1)
+            if a.ndim != 3:
+                return np.zeros((h, w, 3), dtype=np.uint8)
+            # Force 3 channels.
+            if a.shape[-1] == 1:
+                a = np.repeat(a, 3, axis=-1)
+            elif a.shape[-1] == 4:
+                a = a[..., :3]
+            elif a.shape[-1] != 3:
+                return np.zeros((h, w, 3), dtype=np.uint8)
+            if a.shape[:2] != (h, w):
+                a = cv2.resize(a, (w, h), interpolation=cv2.INTER_AREA)
+            return a.astype(np.uint8)
+
+        # Extract camera images — VLABench returns (n_cameras, C, H, W) or individual arrays
+        raw_frames: list[np.ndarray] = []
+        if "rgb" in obs:
+            rgb = obs["rgb"]
+            if isinstance(rgb, np.ndarray):
+                if rgb.ndim == 4:
+                    raw_frames = [rgb[i] for i in range(rgb.shape[0])]
+                elif rgb.ndim == 3:
+                    raw_frames = [rgb]
+
+        image_keys = ["image", "second_image", "wrist_image"]
+        images: dict[str, np.ndarray] = {}
+        for i, key in enumerate(image_keys):
+            if i < len(raw_frames):
+                images[key] = _to_hwc3(raw_frames[i])
+            else:
+                images[key] = np.zeros((h, w, 3), dtype=np.uint8)
+
+        # Convert VLABench's raw ee_state `[pos_world(3), quat_wxyz(4), open(1)]`
+        # to the dataset's observation.state layout `[pos_robot(3), euler_xyz(3),
+        # gripper(1)]`. See VLABench/scripts/convert_to_lerobot.py — positions
+        # are stored in robot-base frame and orientations as scipy extrinsic
+        # 'xyz' euler angles.
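+        # Illustrative example, assuming the default Franka base offset
+        # [0.0, -0.4, 0.78]: a world-frame EE position [0.3, -0.1, 1.0] maps
+        # to robot-frame [0.3, 0.3, 0.22].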
+        raw = np.asarray(obs.get("ee_state", np.zeros(8)), dtype=np.float64).ravel()
+        pos_world = raw[:3] if raw.size >= 3 else np.zeros(3, dtype=np.float64)
+        quat_wxyz = raw[3:7] if raw.size >= 7 else np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float64)
+        gripper = float(raw[7]) if raw.size >= 8 else 0.0
+
+        base = self._robot_base_xyz if self._robot_base_xyz is not None else np.zeros(3, dtype=np.float64)
+        pos_robot = pos_world - base
+        euler_xyz = Rotation.from_quat([quat_wxyz[1], quat_wxyz[2], quat_wxyz[3], quat_wxyz[0]]).as_euler(
+            "xyz", degrees=False
+        )
+
+        ee_state = np.concatenate([pos_robot, euler_xyz, [gripper]]).astype(np.float64)
+
+        if self.obs_type == "pixels":
+            return {"pixels": images}
+        elif self.obs_type == "pixels_agent_pos":
+            return {
+                "pixels": images,
+                "agent_pos": ee_state,
+            }
+        else:
+            raise ValueError(f"Unknown obs_type: {self.obs_type}")
+
+    # ---- Action adaptation (EEF → joint ctrl) --------------------------------
+    #
+    # The HF vlabench datasets log 7D actions
+    # `[x, y, z (robot frame), rx, ry, rz (scipy extrinsic xyz), gripper]`,
+    # exactly matching VLABench's own eval pipeline (evaluator.base):
+    #   pos, euler, g = policy(...)
+    #   quat = euler_to_quaternion(*euler)      # extrinsic xyz -> wxyz
+    #   _, qpos = robot.get_qpos_from_ee_pos(physics, pos=pos + base, quat=quat)
+    #   env.step(np.concatenate([qpos, [g, g]]))
+    #
+    # VLABench's dm_control task writes `data.ctrl[:] = action` directly — for
+    # Franka that's 9 entries (7 arm joints + 2 gripper fingers). We mirror the
+    # above conversion so the policy's EEF commands actually drive the robot.
+
+    _FRANKA_FINGER_OPEN = 0.04  # qpos when gripper fully open
+
+    def _build_ctrl_from_action(self, action: np.ndarray, ctrl_dim: int) -> np.ndarray:
+        """Convert a 7D EEF action into the `ctrl_dim`-sized joint command vector.
+
+        For the Franka default (ctrl_dim=9): 7 arm joint qposes (via IK) +
+        2 gripper finger qposes (open/closed based on the gripper scalar).
+        If the action is already joint-space (shape matches ctrl_dim), pass
+        through.
+        """
+        if action.shape[0] == ctrl_dim:
+            return action.astype(np.float64, copy=False)
+
+        if action.shape[0] != 7:
+            # Unknown layout — fall back to zero-pad so the sim doesn't crash.
+            padded = np.zeros(ctrl_dim, dtype=np.float64)
+            padded[: min(action.shape[0], ctrl_dim)] = action[:ctrl_dim]
+            return padded
+
+        from dm_control.utils.inverse_kinematics import qpos_from_site_pose
+
+        # Action position is in robot-base frame (see convert_to_lerobot.py);
+        # dm_control's IK expects a world-frame target.
+        base = self._robot_base_xyz if self._robot_base_xyz is not None else np.zeros(3, dtype=np.float64)
+        pos_world = np.asarray(action[:3], dtype=np.float64) + base
+        rx, ry, rz = float(action[3]), float(action[4]), float(action[5])
+        gripper = float(np.clip(action[6], 0.0, 1.0))
+
+        # Dataset euler is scipy extrinsic 'xyz' (same as VLABench's
+        # `euler_to_quaternion`). scipy emits `[x, y, z, w]`; dm_control's IK
+        # and MuJoCo use `[w, x, y, z]`, so reorder.
+        qxyzw = Rotation.from_euler("xyz", [rx, ry, rz], degrees=False).as_quat()
+        quat = np.array([qxyzw[3], qxyzw[0], qxyzw[1], qxyzw[2]], dtype=np.float64)
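+        # Illustrative check: a 90-degree rotation about z, i.e. euler
+        # [0, 0, pi/2], is approximately [0, 0, 0.7071, 0.7071] in scipy's
+        # [x, y, z, w] order and [0.7071, 0, 0, 0.7071] after reordering.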
+
+        assert self._env is not None
+        robot = self._env.task.robot
+        site_name = robot.end_effector_site.full_identifier
+
+        # inplace=False so IK doesn't mutate physics state mid-step — we only
+        # want the solved qpos. Fetch a fresh physics handle — caching it can
+        # yield a stale weakref after a reset.
+        ik_result = qpos_from_site_pose(
+            self._env.physics,
+            site_name=site_name,
+            target_pos=pos_world,
+            target_quat=quat,
+            inplace=False,
+            max_steps=100,
+        )
+        n_dof = robot.n_dof  # 7 for Franka
+        arm_qpos = ik_result.qpos[:n_dof]
+
+        # Dataset gripper convention: 1 = open (finger qpos = 0.04),
+        # 0 = closed (finger qpos = 0.0). See VLABench/scripts/convert_to_lerobot.py
+        # where `trajectory[i][-1] > 0.03` is encoded as `1`.
+        finger_qpos = gripper * self._FRANKA_FINGER_OPEN
+
+        ctrl = np.zeros(ctrl_dim, dtype=np.float64)
+        ctrl[:n_dof] = arm_qpos
+        # Remaining entries are gripper fingers (usually 2 for Franka).
+        ctrl[n_dof:] = finger_qpos
+        return ctrl
+
+    def reset(self, seed=None, **kwargs) -> tuple[RobotObservation, dict[str, Any]]:
+        self._ensure_env()
+        assert self._env is not None
+        super().reset(seed=seed)
+
+        if seed is not None:
+            self._seed_inner_env(int(self.np_random.integers(0, 2**31 - 1)))
+
+        self._env.reset()
+
+        observation = self._get_obs()
+        info = {"is_success": False}
+        return observation, info
+
+    def _seed_inner_env(self, seed: int) -> None:
+        """Propagate `seed` to the inner dm_control env.
+
+        `Environment.reset()` doesn't accept a seed, so we re-seed the task and
+        environment `RandomState`s directly. Best-effort: silently skipped when
+        the expected attributes are absent on a given VLABench version.
+        """
+        for owner_attr, rng_attr in (("task", "random"), (None, "_random_state")):
+            owner = getattr(self._env, owner_attr) if owner_attr else self._env
+            rng = getattr(owner, rng_attr, None)
+            rng_seed = getattr(rng, "seed", None)
+            if callable(rng_seed):
+                rng_seed(seed)
+
+    def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]:
+        from dm_control.rl.control import PhysicsError  # type: ignore[import-untyped]
+
+        self._ensure_env()
+        assert self._env is not None
+
+        if action.ndim != 1:
+            raise ValueError(
+                f"Expected action to be 1-D (shape (action_dim,)), "
+                f"but got shape {action.shape} with ndim={action.ndim}"
+            )
+
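+        # `action_mode` is only validated here; the actual action layout is
+        # inferred from the action's length in `_build_ctrl_from_action`.
+        if self.action_mode not in ("eef", "joint", "delta_eef"):
+            raise ValueError(f"Unknown action_mode: {self.action_mode}")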
+
+        # Always refetch physics — dm_control returns a weakref proxy that can
+        # go stale across resets.
+        physics = self._env.physics
+        ctrl_dim = int(physics.data.ctrl.shape[0])
+        ctrl = self._build_ctrl_from_action(action, ctrl_dim)
+        try:
+            timestep = self._env.step(ctrl)
+        except PhysicsError as exc:
+            # Physics integrator diverged (e.g. mjWARN_BADQACC). Treat it as
+            # a graceful failed termination rather than a hard crash — the
+            # rest of the multi-task eval should still run.
+            logger.warning(
+                "PhysicsError during step on task '%s': %s. Terminating episode.",
+                self.task,
+                exc,
+            )
+            observation = self._get_obs()
+            info = {"task": self.task, "is_success": False, "physics_error": True}
+            # Drop the stale env so the next reset() rebuilds it cleanly.
+            with contextlib.suppress(Exception):
+                self._env.close()
+            self._env = None
+            return observation, 0.0, True, False, info
+
+        # Extract reward from dm_control timestep
+        reward = float(timestep.reward) if timestep.reward is not None else 0.0
+
+        # Check success via the task's termination condition
+        is_success = False
+        if hasattr(self._env, "task") and hasattr(self._env.task, "should_terminate_episode"):
+            is_success = bool(self._env.task.should_terminate_episode(self._env.physics))
+
+        terminated = is_success
+        truncated = False
+        info = {
+            "task": self.task,
+            "is_success": is_success,
+        }
+
+        observation = self._get_obs()
+
+        if terminated:
+            self.reset()
+
+        return observation, reward, terminated, truncated, info
+
+    def render(self) -> np.ndarray:
+        self._ensure_env()
+        obs = self._get_obs()
+        return obs["pixels"]["image"]
+
+    def close(self) -> None:
+        if self._env is not None:
+            self._env.close()
+            self._env = None
+
+
+# ---- Main API ----------------------------------------------------------------
+
+
+def create_vlabench_envs(
+    task: str,
+    n_envs: int,
+    gym_kwargs: dict[str, Any] | None = None,
+    env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
+) -> dict[str, dict[int, Any]]:
+    """
+    Create vectorized VLABench environments with a consistent return shape.
+
+    Returns:
+        dict[group][task_id] -> vec_env (env_cls([...]) with exactly n_envs factories).
+        `group` is each comma-separated entry of `task`: a suite name or a single task name.
+
+    Notes:
+        - n_envs is the number of rollouts *per task*.
+        - `task` can be a suite name ("primitive", "composite"), a comma-separated list of
+          suite names, or individual task names (e.g. "select_fruit,heat_food").
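+
+    Example (illustrative sketch; `gym.vector.SyncVectorEnv` is one possible `env_cls`):
+        envs = create_vlabench_envs("select_fruit,primitive", n_envs=2, env_cls=gym.vector.SyncVectorEnv)
+        envs["select_fruit"][0]  # vec env for the single named task
+        envs["primitive"][3]     # vec env for PRIMITIVE_TASKS[3] ("add_condiment")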
+    """
+    if env_cls is None or not callable(env_cls):
+        raise ValueError("env_cls must be a callable that wraps a list of environment factory callables.")
+    if not isinstance(n_envs, int) or n_envs <= 0:
+        raise ValueError(f"n_envs must be a positive int; got {n_envs}.")
+
+    gym_kwargs = dict(gym_kwargs or {})
+    task_groups = [t.strip() for t in task.split(",") if t.strip()]
+    if not task_groups:
+        raise ValueError("`task` must contain at least one VLABench task or suite name.")
+
+    logger.info(
+        "Creating VLABench envs | task_groups=%s | n_envs(per task)=%d",
+        task_groups,
+        n_envs,
+    )
+
+    is_async = env_cls is gym.vector.AsyncVectorEnv
+    cached_obs_space = None
+    cached_act_space = None
+    cached_metadata = None
+    out: dict[str, dict[int, Any]] = defaultdict(dict)
+
+    for group in task_groups:
+        # Check if it's a suite name, otherwise treat as individual task
+        tasks = SUITE_TASKS.get(group, [group])
+
+        for tid, task_name in enumerate(tasks):
+            logger.info(
+                "Building vec env | group=%s | task_id=%d | task=%s",
+                group,
+                tid,
+                task_name,
+            )
+
+            fns = [(lambda tn=task_name: VLABenchEnv(task=tn, **gym_kwargs)) for _ in range(n_envs)]
+
+            if is_async:
+                lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space, cached_metadata)
+                if cached_obs_space is None:
+                    cached_obs_space = lazy.observation_space
+                    cached_act_space = lazy.action_space
+                    cached_metadata = lazy.metadata
+                out[group][tid] = lazy
+            else:
+                out[group][tid] = env_cls(fns)
+
+    return {group: dict(task_map) for group, task_map in out.items()}