Mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-11 14:49:43 +00:00, commit 5adad11128)
feat(sim): add VLABench benchmark integration

Add VLABench as a new simulation benchmark in LeRobot, following the existing LIBERO and MetaWorld patterns. This PR wires VLABench end-to-end across environment integration, Docker setup, CI smoke evaluation, and documentation. It also fixes a number of upstream packaging and runtime issues required to make VLABench usable and reproducible in CI.

What's included

Benchmark integration
- Add VLABench as a new simulation benchmark.
- Expose supported VLABench tasks through the LeRobot env interface.
- Follow the established LIBERO / MetaWorld factory patterns.
- Preserve lazy async-env metadata so env.unwrapped.metadata["render_fps"] continues to work.

CI smoke evaluation
- Add a VLABench smoke-eval job using lerobot/smolvla_vlabench.
- Use the correct rename_map for the 3-camera dataset layout.
- Expand smoke coverage from 1 to 10 primitive tasks.
- Extract task descriptions after eval so metrics artifacts include per-task labels.
- Skip Docker Hub login when secrets are unavailable (e.g. fork PRs).

Docker / install fixes
- Install VLABench from GitHub rather than PyPI.
- Use uv pip, not pip, in the base image.
- Fail loudly on install errors instead of masking them.
- Clone VLABench into the non-root user's home directory.
- Use shallow editable installs for VLABench and rrt-algorithms to work around missing __init__.py issues.
- Pin upstream clones to exact commit SHAs for reproducibility.
- Add undeclared runtime dependencies required by VLABench (open3d, colorlog, scikit-learn, openai).
- Unpin open3d so Python 3.12 wheels resolve.

Assets
- Support downloading VLABench assets from a Hugging Face Hub mirror via VLABENCH_ASSETS_REPO.
- Keep Google Drive download support as a fallback.
- Install huggingface_hub[hf_xet] so Xet-backed assets download correctly.
- Validate required mesh/XML asset subtrees at build time.
- Patch VLABench constants to tolerate missing asset directories at import time.

Runtime / env correctness
- Import VLABench robots and tasks explicitly so decorator-based registry population happens.
- Resize and normalize camera observations so they always match the declared (H, W, 3) uint8 observation space.
- Reinstall LeRobot editably inside the image so the new env code is actually used.
- Coerce agent_pos / ee_state to the expected shape.
- Pad actions when needed to match data.ctrl.
- Replace the zero-padding fallback with proper dm_control IK for 7D end-effector actions.
- Refetch dm_control physics on each step instead of caching weakrefs.
- Retry unstable resets with reseeding and handle PhysicsError gracefully at step time.

Dataset / policy alignment
Align VLABench observations and actions with the Hugging Face dataset conventions used by lerobot/vlabench_unified:
- convert EE position between world frame and robot-base frame at the env boundary,
- expose / consume Euler XYZ instead of the raw quaternion layout,
- align gripper semantics with the dataset convention (1 = open, 0 = closed).
This fixes policy/env mismatches that previously caused incorrect IK targets and unstable behavior at evaluation time.

Docs
- Add a full docs/source/vlabench.mdx page aligned with the standard benchmark template.
- Document task selection forms (single task, comma list, suite shortcut).
- Document installation, evaluation, training, and result reproduction.
- Point examples at lerobot/smolvla_vlabench.
- Add a benchmark banner image.
- Remove outdated / misleading references to upstream evaluation tracks.
- Document the manual install flow instead of a broken vlabench extra.
Packaging cleanup
- Remove the unresolvable vlabench extra from pyproject.toml.
- Remove the no-op VLABench processor step.
- Remove the obsolete env unit test that only covered the dropped gripper remap helper.
- Apply formatting / logging / style cleanup from review feedback.

Why this is needed

VLABench is not currently consumable as a normal Python dependency and requires several upstream workarounds:
- no PyPI release,
- missing package declarations,
- undeclared runtime deps,
- SSH-only submodule references,
- asset downloads outside the normal package install flow,
- registry population that depends on import side effects,
- env outputs that do not always match declared observation shapes,
- task resets that can diverge under some random layouts.

This PR makes the benchmark usable in LeRobot despite those constraints, and ensures CI runs are reproducible and informative.

If you want a much shorter squash commit message, I'd use this:

feat(sim): integrate VLABench benchmark with CI, Docker, and docs

Add VLABench as a new LeRobot simulation benchmark, following the existing LIBERO / MetaWorld patterns. This includes:
- LeRobot env integration and task exposure,
- CI smoke eval with lerobot/smolvla_vlabench,
- Docker install and asset-download fixes,
- runtime fixes for registry loading, assets, camera obs, action handling, dm_control IK, and PhysicsError recovery,
- alignment of obs/action semantics with HF VLABench datasets,
- docs and packaging cleanup.

The PR also incorporates review feedback, improves reproducibility by pinning upstream commits, and makes VLABench usable in CI despite upstream packaging and asset-management issues.
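One of the runtime fixes above replaces a zero-padding fallback with proper dm_control IK for 7D end-effector actions. The sketch below shows the general shape of that conversion; it is not the PR's actual code, and the site name, joint names, and Euler convention are placeholders for a Franka Panda model.

```python
# Sketch only: turn a 7D EE action (xyz + Euler XYZ + gripper) into joint
# positions with dm_control's IK. Site/joint names and the Euler convention
# are assumptions for a Franka Panda model, not the PR's actual values.
import numpy as np
from dm_control.utils import inverse_kinematics as ik
from scipy.spatial.transform import Rotation


def ee_action_to_qpos(physics, action: np.ndarray) -> np.ndarray:
    pos, euler_xyz = action[:3], action[3:6]  # action[6] is the gripper
    # scipy quaternions are (x, y, z, w); MuJoCo wants (w, x, y, z).
    quat_wxyz = np.roll(Rotation.from_euler("xyz", euler_xyz).as_quat(), 1)
    result = ik.qpos_from_site_pose(
        physics,
        site_name="attachment_site",  # placeholder EE site name
        target_pos=pos,
        target_quat=quat_wxyz,
        joint_names=[f"panda_joint{i}" for i in range(1, 8)],  # placeholder
        inplace=False,
    )
    if not result.success:
        raise RuntimeError("IK did not converge for the requested EE pose")
    return result.qpos
```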
# VLABench

[VLABench](https://github.com/OpenMOSS/VLABench) is a large-scale benchmark for **language-conditioned robotic manipulation with long-horizon reasoning**. The upstream suite covers 100 task categories across 2,000+ objects and evaluates six dimensions of robot intelligence: mesh & texture understanding, spatial reasoning, world-knowledge transfer, semantic instruction comprehension, physical-law understanding, and long-horizon planning. It is built on MuJoCo / dm_control with a Franka Panda 7-DOF arm. LeRobot exposes **43 of these tasks** through `--env.task` (21 primitives + 22 composites, see [Available tasks](#available-tasks) below).

- Paper: [VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning](https://arxiv.org/abs/2412.18194)
- GitHub: [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench)
- Project website: [vlabench.github.io](https://vlabench.github.io)
- Pretrained policy: [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench)

<img
  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/vlabench.png"
  alt="VLABench benchmark overview"
  width="85%"
/>
## Available tasks

VLABench ships two task suites covering **43 task categories** in LeRobot's `--env.task` surface:

| Suite     | CLI name    | Tasks | Description                                                      |
| --------- | ----------- | ----- | ---------------------------------------------------------------- |
| Primitive | `primitive` | 21    | Single / few-skill combinations (select, insert, physics QA)     |
| Composite | `composite` | 22    | Multi-step reasoning and long-horizon planning (cook, rearrange) |

**Primitive tasks:** `select_fruit`, `select_toy`, `select_chemistry_tube`, `add_condiment`, `select_book`, `select_painting`, `select_drink`, `insert_flower`, `select_billiards`, `select_ingredient`, `select_mahjong`, `select_poker`, and physical-reasoning tasks (`density_qa`, `friction_qa`, `magnetism_qa`, `reflection_qa`, `simple_cuestick_usage`, `simple_seesaw_usage`, `sound_speed_qa`, `thermal_expansion_qa`, `weight_qa`).

**Composite tasks:** `cluster_billiards`, `cluster_book`, `cluster_drink`, `cluster_toy`, `cook_dishes`, `cool_drink`, `find_unseen_object`, `get_coffee`, `hammer_nail`, `heat_food`, `make_juice`, `play_mahjong`, `play_math_game`, `play_poker`, `play_snooker`, `rearrange_book`, `rearrange_chemistry_tube`, `set_dining_table`, `set_study_table`, `store_food`, `take_chemistry_experiment`, `use_seesaw_complex`.

`--env.task` accepts three forms (expanded in the sketch after this list):

- a single task name (`select_fruit`)
- a comma-separated list (`select_fruit,heat_food`)
- a suite shortcut (`primitive`, `composite`, or `primitive,composite`)
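As a rough illustration of how these three forms expand into a concrete task list, here is a hypothetical helper; `resolve_tasks` and the truncated suite tables stand in for LeRobot's actual internals.

```python
# Hypothetical sketch of --env.task resolution; resolve_tasks and the
# truncated suite tables stand in for LeRobot's real internal handling.
PRIMITIVE_TASKS = ["select_fruit", "select_toy", "add_condiment"]  # ... all 21 primitives
COMPOSITE_TASKS = ["heat_food", "cook_dishes", "rearrange_book"]   # ... all 22 composites
SUITES = {"primitive": PRIMITIVE_TASKS, "composite": COMPOSITE_TASKS}


def resolve_tasks(spec: str) -> list[str]:
    """Expand a --env.task string into an explicit task list."""
    tasks: list[str] = []
    for token in spec.split(","):
        token = token.strip()
        # Each token is either a suite shortcut or a single task name.
        tasks.extend(SUITES.get(token, [token]))
    return tasks


assert resolve_tasks("select_fruit,heat_food") == ["select_fruit", "heat_food"]
assert resolve_tasks("primitive,composite") == PRIMITIVE_TASKS + COMPOSITE_TASKS
```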
## Installation

VLABench is **not on PyPI** (its only distribution is the [OpenMOSS/VLABench](https://github.com/OpenMOSS/VLABench) GitHub repo), so LeRobot does not expose a `vlabench` extra. Install it manually as an editable clone, alongside the MuJoCo / dm_control pins VLABench needs, then fetch the mesh assets:

```bash
# After following the standard LeRobot installation instructions.

git clone https://github.com/OpenMOSS/VLABench.git ~/VLABench
git clone https://github.com/motion-planning/rrt-algorithms.git ~/rrt-algorithms
pip install -e ~/VLABench -e ~/rrt-algorithms
pip install "mujoco==3.2.2" "dm-control==1.0.22" \
    open3d colorlog scikit-learn openai gdown

python ~/VLABench/scripts/download_assets.py
```
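After installing, a quick sanity check can confirm that the pins resolved and the packages import. This snippet assumes the editable install exposes the top-level `VLABench` package; adjust if your layout differs.

```python
# Optional post-install sanity check (assumes the editable install exposes
# the top-level "VLABench" package).
import dm_control  # noqa: F401
import mujoco
import VLABench  # noqa: F401

print("mujoco", mujoco.__version__)  # expect 3.2.2 per the pin above
```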
<Tip>
VLABench requires Linux (`sys_platform == 'linux'`) and Python 3.10+. Set the MuJoCo rendering backend before running:

```bash
export MUJOCO_GL=egl  # for headless servers (HPC, cloud)
```

</Tip>
## Evaluation

All eval snippets below mirror the command CI runs (see `.github/workflows/benchmark_tests.yml`). The `--rename_map` argument maps VLABench's `image` / `second_image` / `wrist_image` camera keys onto the three-camera (`camera1` / `camera2` / `camera3`) input layout the released `smolvla_vlabench` policy was trained on; the mapping is spelled out below.
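Because the JSON form is hard to read inline, here is the same mapping written out as a Python dict (illustration only; `lerobot-eval` consumes it as the JSON string shown in the commands below):

```python
# The --rename_map passed to lerobot-eval below, spelled out for readability.
RENAME_MAP = {
    "observation.images.image": "observation.images.camera1",
    "observation.images.second_image": "observation.images.camera2",
    "observation.images.wrist_image": "observation.images.camera3",
}
```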
### Single-task evaluation (recommended for quick iteration)

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=select_fruit \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Multi-task evaluation

Pass a comma-separated list of tasks:

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=select_fruit,select_toy,add_condiment,heat_food \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Suite-wide evaluation

Run an entire suite (all 21 primitives or all 22 composites):

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=primitive \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  --env.max_parallel_tasks=1 \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```

Or both suites:

```bash
lerobot-eval \
  --policy.path=lerobot/smolvla_vlabench \
  --env.type=vlabench \
  --env.task=primitive,composite \
  --eval.batch_size=1 \
  --eval.n_episodes=10 \
  --eval.use_async_envs=false \
  --policy.device=cuda \
  --env.max_parallel_tasks=1 \
  '--rename_map={"observation.images.image": "observation.images.camera1", "observation.images.second_image": "observation.images.camera2", "observation.images.wrist_image": "observation.images.camera3"}'
```
### Recommended evaluation episodes

Use **10 episodes per task** for reproducible benchmarking (210 episodes for the full primitive suite, 220 for the composite suite); this matches the protocol in the VLABench paper.
## Policy inputs and outputs

**Observations:**

- `observation.state` — 7-dim end-effector state (position xyz + Euler xyz + gripper)
- `observation.images.image` — front camera, 480×480 HWC uint8
- `observation.images.second_image` — second camera, 480×480 HWC uint8
- `observation.images.wrist_image` — wrist camera, 480×480 HWC uint8

**Actions:**

- Continuous control in `Box(-1, 1, shape=(7,))` — 3D position + 3D Euler orientation + 1D gripper (see the spaces sketch below).
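As a shape-level illustration of this interface, the spaces above could be written with `gymnasium` as follows. This is a sketch only; the exact dtypes and key handling live in LeRobot's VLABench wrapper and may differ.

```python
# Shape-level sketch of the observation / action interface described above;
# dtypes are assumptions, not copied from LeRobot's VLABench wrapper.
import numpy as np
from gymnasium import spaces

image_space = spaces.Box(low=0, high=255, shape=(480, 480, 3), dtype=np.uint8)

observation_space = spaces.Dict(
    {
        # xyz position + Euler xyz + gripper = 7 dims
        "observation.state": spaces.Box(-np.inf, np.inf, shape=(7,), dtype=np.float64),
        "observation.images.image": image_space,
        "observation.images.second_image": image_space,
        "observation.images.wrist_image": image_space,
    }
)

# 3D position + 3D Euler orientation + 1D gripper, normalized to [-1, 1].
action_space = spaces.Box(low=-1.0, high=1.0, shape=(7,), dtype=np.float32)
```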
## Training

### Datasets

Pre-collected VLABench datasets in LeRobot format on the Hub (see the loading sketch below):

- [`VLABench/vlabench_primitive_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_primitive_ft_lerobot_video) — 5,000 episodes, 128 tasks, 480×480 images.
- [`VLABench/vlabench_composite_ft_lerobot_video`](https://huggingface.co/datasets/VLABench/vlabench_composite_ft_lerobot_video) — 5,977 episodes, 167 tasks, 224×224 images.
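To inspect one of these datasets before training, LeRobot's `LeRobotDataset` can load it straight from the Hub. The import path below matches recent LeRobot releases but may vary by version.

```python
# Load the primitive-suite dataset from the Hub for a quick inspection.
# Import path may differ across LeRobot versions (older releases used
# lerobot.common.datasets.lerobot_dataset).
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("VLABench/vlabench_primitive_ft_lerobot_video")
print(dataset.num_episodes, dataset.num_frames)

sample = dataset[0]  # dict of observation.* and action tensors
print(sample["observation.state"].shape)  # expect the 7-dim EE state
```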
### Example training command

Fine-tune a SmolVLA base on the primitive suite:

```bash
lerobot-train \
  --policy.type=smolvla \
  --policy.repo_id=${HF_USER}/smolvla_vlabench_primitive \
  --policy.load_vlm_weights=true \
  --policy.push_to_hub=true \
  --dataset.repo_id=VLABench/vlabench_primitive_ft_lerobot_video \
  --env.type=vlabench \
  --env.task=select_fruit \
  --output_dir=./outputs/smolvla_vlabench_primitive \
  --steps=100000 \
  --batch_size=4 \
  --eval_freq=5000 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --save_freq=10000
```
## Reproducing published results

The released checkpoint [`lerobot/smolvla_vlabench`](https://huggingface.co/lerobot/smolvla_vlabench) was trained on the primitive-suite dataset above and is evaluated with the [Single-task](#single-task-evaluation-recommended-for-quick-iteration) / [Suite-wide](#suite-wide-evaluation) commands. CI runs a 10-primitive-task smoke eval (one episode each) on every PR touching the benchmark.