* fix(sarm): handle BaseModelOutputWithPooling from transformers 5.x in CLIP encoding
In transformers 5.x, CLIPModel.get_image_features() and get_text_features()
return BaseModelOutputWithPooling instead of a plain torch.FloatTensor.
Added isinstance check to extract pooler_output when the return value is not
a tensor, maintaining backward compatibility with transformers 4.x.
Fixes AttributeError: 'BaseModelOutputWithPooling' object has no attribute 'detach'
* Adding assertion check for pooler_output of CLIP. This change is response to below comment.
https://github.com/huggingface/lerobot/pull/3419#discussion_r3112594387
* Adding assertion check for pooler_output of CLIP. This change is response to below comment. Change to simple check and rise
https://github.com/huggingface/lerobot/pull/3419#discussion_r3126953776
---------
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Apply the same per-scalar-mean fix to SmolVLA that #3377 landed for
ACT / Diffusion / MultiTaskDiT. The pre-patch form applies the
`action_is_pad` mask to zero out padded timesteps, then calls `.mean()`
(or `.mean(dim=(1, 2))`). Because `.mean()` divides by the total number
of elements including the zeroed padding, the loss is diluted by the
padding fraction.
Fixed by normalizing only over valid (non-padded) scalar entries:
num_valid = ((~actions_is_pad).sum(...) * losses.shape[-1]).clamp_min(1)
loss = losses.sum(...) / num_valid
`clamp_min(1)` preserves the all-padded-batch edge case (0/1 = 0). Both
reduction paths are updated. Behavior when `action_is_pad` is missing is
unchanged (`losses.mean()`).
Empirical A/B on aloha_sim_transfer_cube_human (chunk_size=40, batch=2,
30 steps, fixed seed, GB200) shows `loss_A / loss_B = 0.9672 (±0.088)` —
same direction and magnitude as PR #3377's `loss_A / loss_C ≈ 0.96` for
ACT. Heavier-padding recipes will see a larger gap.
Refs: #3353 (original report for ACT), #3377 (fix for the other three
policies).
feat(sim): add VLABench benchmark integration
Add VLABench as a new simulation benchmark in LeRobot, following the existing LIBERO and MetaWorld patterns.
This PR wires VLABench end-to-end across environment integration, Docker setup, CI smoke evaluation, and documentation. It also fixes a number of upstream packaging and runtime issues required to make VLABench usable and reproducible in CI.
What’s included
Benchmark integration
Add VLABench as a new simulation benchmark.
Expose supported VLABench tasks through the LeRobot env interface.
Follow the established LIBERO / MetaWorld factory patterns.
Preserve lazy async-env metadata so env.unwrapped.metadata["render_fps"] continues to work.
CI smoke evaluation
Add a VLABench smoke-eval job using lerobot/smolvla_vlabench.
Use the correct rename_map for the 3-camera dataset layout.
Expand smoke coverage from 1 to 10 primitive tasks.
Extract task descriptions after eval so metrics artifacts include per-task labels.
Skip Docker Hub login when secrets are unavailable (e.g. fork PRs).
Docker / install fixes
Install VLABench from GitHub rather than PyPI.
Use uv pip, not pip, in the base image.
Fail loudly on install errors instead of masking them.
Clone VLABench into the non-root user’s home directory.
Use shallow editable installs for VLABench and rrt-algorithms to work around missing __init__.py issues.
Pin upstream clones to exact commit SHAs for reproducibility.
Add undeclared runtime dependencies required by VLABench (open3d, colorlog, scikit-learn, openai).
Unpin open3d so Python 3.12 wheels resolve.
Assets
Support downloading VLABench assets from a Hugging Face Hub mirror via VLABENCH_ASSETS_REPO.
Keep Google Drive download support as fallback.
Install huggingface_hub[hf_xet] so Xet-backed assets download correctly.
Validate required mesh/XML asset subtrees at build time.
Patch VLABench constants to tolerate missing asset directories at import time.
Runtime / env correctness
Import VLABench robots and tasks explicitly so decorator-based registry population happens.
Resize and normalize camera observations so they always match the declared (H, W, 3) uint8 observation space.
Reinstall LeRobot editably inside the image so the new env code is actually used.
Coerce agent_pos / ee_state to the expected shape.
Pad actions when needed to match data.ctrl.
Replace zero-padding fallback with proper dm_control IK for 7D end-effector actions.
Refetch dm_control physics on each step instead of caching weakrefs.
Retry unstable resets with reseeding and handle PhysicsError gracefully at step time.
Dataset / policy alignment
Align VLABench observations and actions with Hugging Face dataset conventions used by lerobot/vlabench_unified:
convert EE position between world frame and robot-base frame at the env boundary,
expose / consume Euler XYZ instead of raw quaternion layout,
align gripper semantics with dataset convention (1 = open, 0 = closed).
This fixes policy/env mismatches that previously caused incorrect IK targets and unstable behavior at evaluation time.
Docs
Add a full docs/source/vlabench.mdx page aligned with the standard benchmark template.
Document task selection forms (single task, comma list, suite shortcut).
Document installation, evaluation, training, and result reproduction.
Point examples at lerobot/smolvla_vlabench.
Add a benchmark banner image.
Remove outdated / misleading references to upstream evaluation tracks.
Document manual install flow instead of a broken vlabench extra.
Packaging cleanup
Remove the unresolvable vlabench extra from pyproject.toml.
Remove the no-op VLABench processor step.
Remove the obsolete env unit test that only covered the dropped gripper remap helper.
Apply formatting / logging / style cleanup from review feedback.
Why this is needed
VLABench is not currently consumable as a normal Python dependency and requires several upstream workarounds:
no PyPI release,
missing package declarations,
undeclared runtime deps,
SSH-only submodule references,
asset downloads outside normal package install flow,
registry population that depends on import side effects,
env outputs that do not always match declared observation shapes,
task resets that can diverge under some random layouts.
This PR makes the benchmark usable in LeRobot despite those constraints, and ensures CI runs are reproducible and informative.
If you want a much shorter squash commit message, I’d use this:
feat(sim): integrate VLABench benchmark with CI, Docker, and docs
Add VLABench as a new LeRobot simulation benchmark, following the existing LIBERO / MetaWorld patterns.
This includes:
LeRobot env integration and task exposure,
CI smoke eval with lerobot/smolvla_vlabench,
Docker install and asset-download fixes,
runtime fixes for registry loading, assets, camera obs, action handling, dm_control IK, and PhysicsError recovery,
alignment of obs/action semantics with HF VLABench datasets,
docs and packaging cleanup.
The PR also incorporates review feedback, improves reproducibility by pinning upstream commits, and makes VLABench usable in CI despite upstream packaging and asset-management issues.
* feat(envs): add LIBERO-plus robustness benchmark integration
- LiberoPlusEnv config (subclass of LiberoEnv, same gym interface)
- Docker image installing LIBERO-plus fork via PYTHONPATH
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_libero_plus
- pyproject.toml: libero_plus extra
* fix(libero): use suite's perturbation-aware init_states loader
LIBERO-plus's Benchmark class exposes a `get_task_init_states(i)` method that
strips perturbation suffixes (`_table_N`, `_tb_N`, `_view_`, `_language_`,
`_light_`, `_add_`, `_level`) and loads the underlying base `.pruned_init`
file — the on-disk name for a perturbation variant doesn't exist as a file,
only the base does. lerobot's loader was bypassing that logic and trying to
read the suffix-bearing filename directly, which failed for every non-zero
task id and killed the eval before any rollout video could be written.
Delegate to the suite's method when it exists; fall back to the path-based
loader for vanilla LIBERO (which does not provide the method).
Also drop the hf-libero install + init_files copy from the LIBERO-plus
Dockerfile — the LIBERO-plus clone already ships both `bddl_files/` and
`init_files/` for all five suites, so the copy was unnecessary and the
`cp -r` into an existing dir produced a confusing nested layout.
* fix(libero): resolve LIBERO-plus perturbation init_states path ourselves
Delegating to `task_suite.get_task_init_states(i)` works for path resolution
but LIBERO-plus's method calls `torch.load(path)` without `weights_only=False`,
which fails on PyTorch 2.6+ because the pickled init_states contains numpy
objects not in the default allowlist:
_pickle.UnpicklingError: Weights only load failed.
WeightsUnpickler error: Unsupported global:
GLOBAL numpy.core.multiarray._reconstruct was not an allowed global.
Mirror LIBERO-plus's suffix-stripping logic (`_table_N`, `_tb_N`, `_view_`,
`_language_`, `_light_`, `_add_`, `_level`) in our own helper so we can pass
`weights_only=False` ourselves. Vanilla LIBERO task names don't contain any
of these patterns except for `_table_` when followed by the word `center`
(e.g. `pick_up_the_black_bowl_from_table_center_...`), and the regex
requires `_table_\\d+` so semantic uses are preserved.
* fix(libero-plus): download perturbation assets from Sylvest/LIBERO-plus
LIBERO-plus's bddl_base_domain.py resolves scene XMLs with
`os.path.join(DIR_PATH, "../assets")`, so the `assets` key in config.yaml
has no effect on scene lookup — MuJoCo always opens
`<clone>/libero/libero/assets/scenes/...`. With no such directory present,
every perturbation task fails on:
FileNotFoundError: No such file or directory:
.../libero-plus/libero/libero/assets/scenes/tabletop_table_Cobblestone01_GLOSS_6K.xml
These textures, views, and extra objects ship only in the 6.4 GB `assets.zip`
published at `Sylvest/LIBERO-plus` (the LIBERO-plus README explicitly says
to download and unzip it into the package dir). Fetch it via `hf_hub_download`,
unzip into `${LIBERO_PLUS_ROOT}/`, install `unzip`, and point config.yaml at
the extracted dir so everything stays consistent. The download lives in its
own Docker layer so subsequent rebuilds reuse the cached assets.
Drops the lerobot/libero-assets snapshot_download — that mirror only has
vanilla LIBERO textures and is ignored for scene loading anyway.
* fix(libero-plus): flatten deep path prefix from Sylvest/LIBERO-plus assets.zip
The 6.4 GB zip ships with every entry prefixed by
`inspire/hdd/project/embodied-multimodality/public/syfei/libero_new/release/dataset/LIBERO-plus-0/assets/...`
(the author's internal filesystem layout, not the layout the LIBERO-plus
README promises), so the previous `unzip -d ${LIBERO_PLUS_ROOT}/` created
`${LIBERO_PLUS_ROOT}/inspire/.../assets/` — robosuite still opened
`${LIBERO_PLUS_ROOT}/assets/scenes/tabletop_table_Cobblestone01_GLOSS_6K.xml`
and hit the same FileNotFoundError.
Extract to a scratch dir, then `mv` the nested `assets/` subtree to the
expected location. Verified the target file exists in the zip central
directory under that exact prefix.
* refactor(libero): inline init_states resolver behind single regex
Collapse the three-style suffix stripper (split/re.sub/in) into one
compiled regex, drop the (Path, bool) tuple return, and move the
`_add_`/`_level` reshape branch into the caller so each branch loads
its own file and returns directly. Net: -11 lines, one fewer helper.
* refactor(libero-plus): rebase docker image on huggingface/lerobot-gpu
Mirror the libero/metaworld/robomme pattern: start from the nightly GPU
image (apt deps, python, uv, venv, lerobot[all] already there) and only
layer on what LIBERO-plus uniquely needs — its wand/ImageMagick build
deps, the non-extra runtime pips (robosuite==1.4.1, bddl, …), the
PYTHONPATH-shadowed fork, and the 6.4 GB assets.zip.
Drops ~50 lines of duplicated base setup (CUDA FROM, apt python, uv
install, user creation, venv init) the nightly already provides.
123 → 73 lines.
Also:
- Add libero_plus to docs/source/_toctree.yml under Benchmarks so
doc-builder's TOC integrity check stops failing.
- Repoint the docs dataset link from pepijn223/libero_plus_lerobot to
the canonical lerobot/libero_plus.
- Revert the stray uv.lock churn (revision/marker diff that crept in
from an unrelated resolve — unrelated to LIBERO-plus).
* fix(libero-plus): stop touching pyproject + uv.lock
The fast-tests job was rejecting the branch because pyproject.toml had a
[libero_plus] extra whose git dep wasn't represented in uv.lock.
The Docker image no longer needs the extra — it clones LIBERO-plus
directly and PYTHONPATH-shadows hf-libero. Drop [libero_plus] from
pyproject and restore pyproject.toml + uv.lock to exactly what's on
origin/main, so `uv sync --locked --extra test` is a no-op for this PR.
Also repoint the doc/CI/env comments that still mentioned the extra at
the Docker install path.
* fix(libero-plus): strip perturbation metadata from task descriptions
LIBERO-plus builds task.language by space-joining the perturbation-variant
filename, so every non-_language_ variant inherits a trailing blob like
"view 0 0 100 0 0 initstate 0 noise 45" or "add 16". That shows up in the
dashboard video labels and no longer matches the base instruction stored
in the training dataset.
Strip those tokens in extract_task_descriptions.py with an end-anchored
regex over the {view,initstate,noise,add,tb,table,light,level}(+digits)
vocabulary. The anchor preserves mid-sentence literal uses of those words
(e.g. "from table center and place it on the plate") — only the trailing
metadata chain is removed. _language_ variants carry real BDDL-sourced
text and are left untouched.
* ci: point benchmark eval checkpoints at the lerobot/ org mirrors
pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: integrate PR #3313 review feedback
- docs: fix paper link to arxiv, add benchmark image, add suite descriptions,
add LIBERO-plus replacement warning, restructure eval section to match
LIBERO doc style, fix policy I/O section, remove false try/except claim
- docker: fix shell grouping for hf-libero uninstall, replace hardcoded
asset path with dynamic find
- ci: add Docker Hub login step, add HF_USER_TOKEN guard on eval step
- envs: add is_libero_plus param to get_task_init_states so vanilla LIBERO
always takes the simple path
* fix(docs): use correct LIBERO-plus teaser image URL
* ci(libero-plus): drop redundant hf auth login step
The standalone login step ran `hf auth login` in a throwaway
`docker run --rm` container, so no credentials persisted. Auth is
already performed inside the eval step's container. Removing the
redundant step per PR #3313 review feedback.
* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs
Port of #3416 onto this branch. Without these attributes eval crashes
when calling `env.unwrapped.metadata["render_fps"]` with async vector
envs. Adds `metadata` / `unwrapped` to `_LazyAsyncVectorEnv` and
caches the metadata alongside obs/action spaces in the LIBERO and
MetaWorld factories.
* ci: gate Docker Hub login on secret availability
Fork PRs cannot access `secrets.DOCKERHUB_LEROBOT_{USERNAME,PASSWORD}`,
which made every benchmark job fail at the login step before any of
the actual build/eval work could run. Gate the login on the env-var
expansion of the username so the step is skipped (not failed) when
secrets are absent. Mirrors the existing pattern in the VLABench job.
* Update .github/workflows/benchmark_tests.yml
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update scripts/ci/extract_task_descriptions.py
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update .github/workflows/benchmark_tests.yml
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update docker/Dockerfile.benchmark.libero_plus
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update .github/workflows/benchmark_tests.yml
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* fix(libero-plus): address review feedback
* ci(libero-plus): fix YAML indentation in upload-artifact steps
The `uses:` key on two upload-artifact steps was at column 0 instead
of nested under the step, causing `pre-commit run check-yaml` to fail
with "expected <block end>, but found '<block mapping start>'".
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
* feat(envs): add RoboMME benchmark integration
- RoboMME env wrapper with image/wrist_image/state observations
- Docker image with Vulkan, SAPIEN, mani-skill deps
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_robomme
- preprocess_observation: handle image/wrist_image/state keys
- pyproject.toml: robomme extra
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor(docker): rebase RoboMME image on huggingface/lerobot-gpu
Mirror the libero/metaworld pattern: start from the nightly GPU image
(which already has apt deps, uv, venv, and lerobot[all] preinstalled)
and only layer on what RoboMME uniquely needs — the Vulkan libs
ManiSkill/SAPIEN requires, plus the robomme extra with the
gymnasium/numpy overrides.
Drops 48 lines of duplicated base setup (CUDA FROM, python install,
user creation, venv init, base apt deps) that the nightly image already
provides. Net: 102 → 54 lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs(robomme): drop prototype-branch note and move dataset to lerobot/robomme
- Remove the "Related work" block referencing the prototype branch
feat/robomme-integration; the PR stands on its own.
- Point all dataset references at lerobot/robomme (docs, env module
docstring, RoboMMEEnvConfig docstring) — this is the canonical HF
location once the dataset is mirrored.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(robomme): make docs build + fast tests green
1. Docs: add robomme to _toctree.yml under Benchmarks so doc-builder's
TOC integrity check stops rejecting the new page.
2. Fast tests: robomme's mani-skill transitively pins numpy<2 which is
unsatisfiable against the project's numpy>=2 base pin, so `uv sync`
couldn't resolve a universal lockfile.
Drop robomme as a pyproject extra entirely — it truly cannot coexist
with the rest of the dep tree. The Dockerfile installs robomme
directly from its git URL via `uv pip install --override`, which was
already the runtime path. pyproject, docs, env docstrings, and the
CI job comment all now point to the docker-only install.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test(robomme): realign unit tests with current env API
The tests were written against an earlier env layout and never updated when
the wrapper was refactored, so CI's fast-test job was failing with:
- KeyError: 'front_rgb' / 'wrist_rgb' — these were renamed to the
lerobot-canonical 'image' / 'wrist_image' keys (matching the dataset
columns and preprocess_observation's built-in fallbacks).
- AssertionError: 'robomme' not in result — create_robomme_envs now
returns {task_name: {task_id: env}}, not {'robomme': {...}}, so
comma-separated task lists work.
- ModuleNotFoundError: lerobot.envs.lazy_vec_env — LazyVectorEnv was
removed; create_robomme_envs is straightforward synchronous now.
Rewrite the 7 failing cases against the current API, drop the three
LazyVectorEnv tests, and add a multi-task test so the new comma-separated
task parsing is covered. Stub install/teardown is moved into helpers
(`_install_robomme_stub` / `_uninstall_robomme_stub`) so individual tests
stop repeating six boilerplate lines.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* ci: point benchmark eval checkpoints at the lerobot/ org mirrors
pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: integrate PR #3311 review feedback
- envs: rename obs keys to pixels/image, pixels/wrist_image, agent_pos
- envs: add __post_init__ for dynamic action_dim in RoboMMEEnv config
- envs: remove special-case obs conversion in utils.py (no longer needed)
- ci: add Docker Hub login, HF_USER_TOKEN guard, --env.task_ids=[0]
- scripts: extract_task_descriptions supports multiple task_ids
- docs: title to # RoboMME, add image, restructure eval section
- tests: update all key assertions to match new obs naming
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(docs): use correct RoboMME teaser image URL
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* ci(robomme): smoke-eval 10 tasks instead of 5
Broader coverage on the RoboMME benchmark CI job: bump the smoke eval
from 5 tasks to 10 (one episode each), all drawn from ROBOMME_TASKS.
Tasks now run: PickXtimes, BinFill, StopCube, MoveCube, InsertPeg,
SwingXtimes, VideoUnmask, ButtonUnmask, PickHighlight, PatternLock.
Updated the parse_eval_metrics.py `--task` label from the single
`PickXtimes` stub to the full comma list so the metrics artifact
reflects what was actually run. `parse_eval_metrics.py` already reads
`overall` for multi-task runs, so no parser change is needed.
Made-with: Cursor
* fix(robomme): nest `pixels` as a dict so preprocess_observation picks it up
`_convert_obs` was returning flat keys (`pixels/image`,
`pixels/wrist_image`). `preprocess_observation()` in envs/utils.py
keys off the top-level `"pixels"` entry and, not finding it,
silently dropped every image from the batch. The policy then saw
zero image features and raised
ValueError: All image features are missing from the batch.
Match the LIBERO layout: return
`{"pixels": {"image": ..., "wrist_image": ...}, "agent_pos": ...}`
and declare the same shape in `observation_space`.
Made-with: Cursor
* fix(robomme): align docs and tests with nested pixels obs layout
Addresses PR #3311 review feedback:
- Docs: correct observation keys to `pixels/image` / `pixels/wrist_image`
(mapped to `observation.images.image` / `observation.images.wrist_image`)
and drop the now-obsolete column-rename snippet.
- Tests: assert `result["pixels"]["image"]` instead of flat `pixels/image`,
matching the nested layout required by `preprocess_observation()`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs
Port of #3416 onto this branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: gate Docker Hub login on secret availability
Fork PRs cannot access `secrets.DOCKERHUB_LEROBOT_{USERNAME,PASSWORD}`,
which made every benchmark job fail at the login step. Gate the login
on the env-var expansion of the username so the step is skipped (not
failed) when secrets are absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(robomme): address review feedback
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat(ci): add RoboCerebra benchmark eval job
- Docker image with robosuite/libero deps for RoboCerebra eval
- CI workflow: 1-episode eval with pepijn223/smolvla_robocerebra
- Reuses libero env with rename_map + empty_cameras=3
* docs(robocerebra): add benchmark page and toctree entry
Add a dedicated docs page for RoboCerebra that points at the canonical
dataset lerobot/robocerebra_unified and shows how to run eval + fine-tune
against it. Wire it into the Benchmarks section of the toctree so
doc-builder picks it up.
* ci: point benchmark eval checkpoints at the lerobot/ org mirrors
pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.
* fix(robocerebra): drop alias extra + simplify docker image
Two problems rolled up:
1. `uv sync --locked --extra test` was failing because pyproject.toml added
a `robocerebra = ["lerobot[libero]"]` alias extra but uv.lock wasn't
regenerated. Drop the alias — nothing in CI actually needs the extra
name (the Dockerfile just installs what it needs directly), so this
restores pyproject.toml and uv.lock to byte-exact origin/main.
2. Rebase docker/Dockerfile.benchmark.robocerebra on
huggingface/lerobot-gpu:latest (same pattern as libero/metaworld/…).
The nightly image already ships lerobot[all] which includes [libero],
so the RoboCerebra image is essentially identical to the LIBERO one:
fetch libero-assets, write ~/.libero/config.yaml, overlay source.
92 → 43 lines.
Also repoint the CI workflow comment that referenced the removed extra.
* ci: use dedicated lerobot/smolvla_robocerebra checkpoint for smoke eval
Replace the generic pepijn223/smolvla_libero placeholder with the
purpose-trained lerobot/smolvla_robocerebra model in the RoboCerebra
CI smoke test.
* fix(ci): align RoboCerebra eval with training pipeline
Fixes 5 mismatches that caused 0% success rate:
- env.type: robocerebra (unregistered) → libero
- resolution: 360x360 (default) → 256x256 (matches dataset)
- camera_name_mapping: map eye_in_hand → wrist_image (not image2)
- empty_cameras: 3 → 1 (matches training)
- add HF_USER_TOKEN guard on eval step
* fix(ci): set env.fps=20 and explicit obs_type for RoboCerebra eval
Match the dataset's 20 FPS (LiberoEnv defaults to 30) and make
obs_type=pixels_agent_pos explicit for safety against future changes.
* docs(robocerebra): align page with adding_benchmarks template
Rework docs/source/robocerebra.mdx to follow the standard benchmark
doc structure: intro + links + available tasks + installation + eval
+ recommended episodes + policy I/O + training + reproducing results.
- Point everything at lerobot/smolvla_robocerebra (the released
checkpoint), not the personal pepijn223 mirror.
- Add the --env.fps=20 and --env.obs_type=pixels_agent_pos flags
that CI actually uses, so copy-paste eval reproduces CI.
- Split the "Training" block out of the recipe section into its own
section with the feature table.
- Add an explicit "Reproducing published results" section pointing
at the CI smoke eval.
* fix: integrate PR #3314 review feedback
- ci(robocerebra): drop redundant hf auth login step (auth is
already performed inside the eval step's container).
- ci(robocerebra): add Docker Hub login before the image build
to pick up the authenticated rate limit.
- docs(robocerebra): align eval snippet with the CI command
(observation size, camera_name_mapping, use_async_envs, device,
empty_cameras=1).
* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs
Port of #3416 onto this branch.
* ci: gate Docker Hub login on secret availability
* Update .github/workflows/benchmark_tests.yml
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update .github/workflows/benchmark_tests.yml
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
* feat(envs): add RoboTwin 2.0 benchmark integration
- RoboTwinEnvConfig with 4-camera setup (head/front/left_wrist/right_wrist)
- Docker image with SAPIEN, mplib, CuRobo, pytorch3d (Python 3.12)
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_robotwin
- RoboTwinProcessorStep for state float32 casting
- Camera rename_map: head_camera/front_camera/left_wrist -> camera1/2/3
* fix(robotwin): re-enable autograd for CuRobo planner warmup and take_action
lerobot_eval wraps the full rollout in torch.no_grad() (lerobot_eval.py:566),
but RoboTwin's setup_demo → load_robot → CuroboPlanner(...) runs
motion_gen.warmup(), which invokes Newton's-method trajectory optimization.
That optimizer calls cost.backward() internally, which raises
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
when autograd is disabled. take_action() hits the same planner path at every
step. Wrap both setup_demo and take_action in torch.enable_grad() so CuRobo's
optimizer can build its computation graph. Policy inference is unaffected —
rollout()'s inner torch.inference_mode() block around select_action() is
untouched, so we still don't allocate grad buffers during policy forward.
* fix(robotwin): read nested get_obs() output and use aloha-agilex camera names
RoboTwin's base_task.get_obs() returns a nested dict:
{"observation": {cam: {"rgb": ..., "intrinsic_matrix": ...}},
"joint_action": {"left_arm": ..., "left_gripper": ...,
"right_arm": ..., "right_gripper": ...,
"vector": np.ndarray},
"endpose": {...}}
Our _get_obs was reading raw["{cam}_rgb"] / raw["{cam}"] and raw["joint_action"]
as if they were flat, so np.asarray(raw["joint_action"], dtype=float64) tripped
on a dict and raised
TypeError: float() argument must be a string or a real number, not 'dict'
Fix:
- Pull images from raw["observation"][cam]["rgb"]
- Pull joint state from raw["joint_action"]["vector"] (the flat array)
- Update the default camera tuple to (head_camera, left_camera, right_camera)
to match RoboTwin's actual wrist-camera names (envs/camera/camera.py:135-151)
* refactor(robotwin): drop defensive dict guards, cache black fallback frame
_get_obs was guarding every dict access with isinstance(..., dict) in case
RoboTwin's get_obs returned something else — but the API contract
(envs/_base_task.py:437) always returns a dict, so the guards were silently
masking real failures behind plausible-looking zero observations. Drop them.
Also:
- Cache a single black fallback frame in __init__ instead of allocating
a fresh np.zeros((H, W, 3), uint8) for every missing camera on every
step — the "camera not exposed" set is static per env.
- Only allocate the zero joint_state on the fallback path (not unconditionally
before the real value overwrites it).
- Replace .flatten() with .ravel() (no copy when already 1-D).
- Fold the nested-dict schema comment and two identical torch.enable_grad()
rationales into a single Autograd section in the class docstring.
- Fix stale `left_wrist` camera name in the observation docstring.
* fix(robotwin): align observation_space dims with D435 camera output
lerobot_eval crashed in gym.vector's SyncVectorEnv.reset with:
ValueError: Output array is the wrong shape
because RoboTwinEnvConfig declared observation_space = (480, 640, 3) but
task_config/demo_clean.yml specifies head_camera_type=D435, which renders
(240, 320, 3). gym.vector.concatenate pre-allocates a buffer from the
declared space, so the first np.stack raises on shape mismatch.
Changes:
- Config defaults now 240×320 (the D435 dims in _camera_config.yml), with
a comment pointing at the source of truth.
- RoboTwinEnv.__init__ accepts observation_height/width as Optional and
falls back to setup_kwargs["head_camera_h/w"] so the env is self-consistent
even if the config is not in sync.
- Config camera_names / features_map use the actual aloha-agilex camera
names (head_camera, left_camera, right_camera). Drops the stale
"front_camera" and "left_wrist"/"right_wrist" entries that never matched
anything RoboTwin exposes.
- CI workflow's rename_map updated to match the new camera names.
* fix(robotwin): expose _max_episode_steps for lerobot_eval.rollout
rollout() does `env.call("_max_episode_steps")` (lerobot_eval.py:157) to
know when to stop stepping. LiberoEnv and MetaworldEnv set this attribute;
RoboTwinEnv was tracking the limit under `episode_length` only, so the call
raised AttributeError once CuRobo finished warming up.
* fix(robotwin): install av-dep so lerobot_eval can write rollout MP4s
write_video (utils/io_utils.py:53) lazily imports PyAV via require_package
and raises silently inside the video-writing thread when the extra is not
installed — so the eval itself succeeds with pc_success=100 but no MP4
ever lands in videos/, and the artifact upload reports "No files were
found". Add av-dep to the install line (same pattern as the RoboMME image).
* feat(robotwin): eval 5 diverse tasks per CI run with NL descriptions
Widen the smoke eval from a single task (beat_block_hammer) to five:
click_bell, handover_block, open_laptop, stack_blocks_two on top of the
original. Each gets its own rollout video in videos/<task>_0/ so the
dashboard can surface visually distinct behaviours.
extract_task_descriptions.py now has a RoboTwin branch that reads
`description/task_instruction/<task>.json` (already shipped in the clone
at /opt/robotwin) and pulls the `full_description` field. CI cds into
the clone before invoking the script so the relative path resolves.
parse_eval_metrics.py is invoked with the same 5-task list so the
metrics.json embeds one entry per task.
* ci: point benchmark eval checkpoints at the lerobot/ org mirrors
pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.
* refactor(robotwin): rebase docker image on huggingface/lerobot-gpu
Mirror the libero/metaworld/libero_plus/robomme pattern: start from the
nightly GPU image (apt deps, python, uv, venv, lerobot[all] already
there) and layer on only what RoboTwin 2.0 uniquely needs —
cuda-nvcc + cuda-cudart-dev (CuRobo builds from source), Vulkan libs +
NVIDIA ICD (SAPIEN renderer), sapien/mplib/open3d/pytorch3d/curobo
installs, the mplib + sapien upstream patches, and the TianxingChen
asset download.
Drops ~90 lines of duplicated base setup (CUDA FROM, apt python, uv
install, user creation, venv init, base lerobot install). 199 → 110.
Also repoint the docs + env docstring dataset link from
hxma/RoboTwin-LeRobot-v3.0 to the canonical lerobot/robotwin_unified.
* docs(robotwin): add robotwin to _toctree.yml under Benchmarks
doc-builder's TOC integrity check was rejecting the branch because
docs/source/robotwin.mdx existed but wasn't listed in _toctree.yml.
* fix(robotwin): defer YAML lookup and realign tests with current API
__init__ was eagerly calling _load_robotwin_setup_kwargs just to read
head_camera_h/w from the YAML. That import (`from envs import CONFIGS_PATH`)
required a real RoboTwin install, so constructing the env — and thus every
test in tests/envs/test_robotwin.py — blew up with ModuleNotFoundError
on fast-tests where RoboTwin isn't installed.
Replace the eager lookup with DEFAULT_CAMERA_H/W constants (240×320, the
D435 dims baked into task_config/demo_clean.yml). reset() still resolves
the full setup_kwargs lazily — that's fine because reset() is only
called inside the benchmark Docker image where RoboTwin is present.
Also resync the test file with the current env API:
- mock get_obs() as the real nested {"observation": {cam: {"rgb": …}},
"joint_action": {"vector": …}} shape
- patch both _load_robotwin_task and _load_robotwin_setup_kwargs
(_patch_load → _patch_runtime)
- drop `front_camera` / `left_wrist` from assertions — aloha-agilex
exposes head_camera + left_camera + right_camera, not those
- black-frame test now uses left_camera as the missing camera
- setup_demo call check loosened to the caller-provided seed/is_test
bits (full kwargs include the YAML-derived blob)
* fix: integrate PR #3315 review feedback
- ci: add Docker Hub login step, add HF_USER_TOKEN guard on eval step
- docker: tie patches to pinned versions with removal guidance, remove
unnecessary HF_TOKEN for public dataset, fix hadolint warnings
- docs: fix paper link to arxiv, add teaser image, fix camera names
(4→3 cameras), fix observation dims (480x640→240x320)
* fix(docs): correct RoboTwin 2.0 paper arxiv link
* fix(docs): use correct RoboTwin 2.0 teaser image URL
* fix(docs): use plain markdown image to fix MDX build
* ci(robotwin): smoke-eval 10 tasks instead of 5
Broader coverage on the RoboTwin 2.0 benchmark CI job: bump the smoke
eval from 5 tasks to 10 (one episode each). Added tasks are all drawn
from ROBOTWIN_TASKS and mirror the shape/complexity of the existing
set (simple single-object or single-fixture manipulations).
Tasks now run: beat_block_hammer, click_bell, handover_block,
open_laptop, stack_blocks_two, click_alarmclock, close_laptop,
close_microwave, open_microwave, place_block.
`parse_eval_metrics.py` reads `overall` for multi-task runs so no
parser change is needed. Bumped the step name and the metrics label
to reflect the 10-task layout.
* fix(ci): swap 4 broken RoboTwin tasks in smoke eval
The smoke eval hit two upstream issues:
- `open_laptop`: bug in OpenMOSS/RoboTwin main — `check_success()` uses
`self.arm_tag`, but that attribute is only set inside `play_once()`
(the scripted-expert path). During eval `take_action()` calls
`check_success()` directly, hitting `AttributeError: 'open_laptop'
object has no attribute 'arm_tag'`.
- `close_laptop`, `close_microwave`, `place_block`: not present in
upstream RoboTwin `envs/` at all — our ROBOTWIN_TASKS tuple drifted
from upstream and these names leaked into CI.
Replace the four broken tasks with upstream-confirmed equivalents
that exist both in ROBOTWIN_TASKS and in RoboTwin's `envs/`:
`adjust_bottle`, `lift_pot`, `stamp_seal`, `turn_switch`.
New 10-task smoke set: beat_block_hammer, click_bell, handover_block,
stack_blocks_two, click_alarmclock, open_microwave, adjust_bottle,
lift_pot, stamp_seal, turn_switch.
* fix(robotwin): sync ROBOTWIN_TASKS + doc with upstream (50 tasks)
The local ROBOTWIN_TASKS tuple drifted from upstream
RoboTwin-Platform/RoboTwin. Users passing names like `close_laptop`,
`close_microwave`, `dump_bin`, `place_block`, `pour_water`,
`fold_cloth`, etc. got past our validator (the names were in the
tuple) but then crashed inside robosuite with a confusing error,
because those tasks don't exist in upstream `envs/`.
- Replace ROBOTWIN_TASKS with a verbatim mirror of upstream's
`envs/` directory: 50 tasks as of main (was 60 with many
stale entries). Added a `gh api`-based one-liner comment so
future bumps are mechanical.
- Update the `60 tasks` claims in robotwin.mdx and
RoboTwinEnvConfig's docstring to `50`.
- Replace the stale example-task table in robotwin.mdx with ten
upstream-confirmed examples, and flag `open_laptop` as
temporarily broken (its `check_success()` uses `self.arm_tag`
which is only set inside `play_once()`; eval-mode callers hit
AttributeError).
- Rebuild the "Full benchmark" command with the actual 50-task
list, omitting `open_laptop`.
* test(robotwin): lower task-count floor from 60 to 50
ROBOTWIN_TASKS was trimmed to 50 tasks (see comment in
`src/lerobot/envs/robotwin.py:48`), but the assertion still
required ≥60, causing CI failures. Align the test with the
current upstream task count.
* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs
Port of #3416 onto this branch.
* ci: gate Docker Hub login on secret availability
* fix: integrate PR #3315 review feedback
- envs(robotwin): default `observation_height/width` in
`create_robotwin_envs` to `DEFAULT_CAMERA_H/W` (240/320) so they
match the D435 dims baked into `task_config/demo_clean.yml`.
- envs(robotwin): resolve `task_config/demo_clean.yml` via
`CONFIGS_PATH` instead of a cwd-relative path; works regardless
of where `lerobot-eval` is invoked.
- envs(robotwin): replace `print()` calls in `create_robotwin_envs`
with `logger.info(...)` (module-level `logger = logging.getLogger`).
- envs(robotwin): use `_LazyAsyncVectorEnv` for the async path so
async workers start lazily (matches LIBERO / RoboCasa / VLABench).
- envs(robotwin): cast `agent_pos` space + joint-state output to
float32 end-to-end (was mixed float64/float32).
- envs(configs): use the existing `_make_vec_env_cls(use_async,
n_envs)` helper in `RoboTwinEnvConfig.create_envs`; drop the
`get_env_processors` override so RoboTwin uses the identity
processor inherited from `EnvConfig`.
- processor: delete `RoboTwinProcessorStep` — the float32 cast now
happens in the wrapper itself, so the processor is redundant.
- tests: drop the `TestRoboTwinProcessorStep` suite; update the
mock obs fixture to use float32 `joint_action.vector`.
- ci: hoist `ROBOTWIN_POLICY` and `ROBOTWIN_TASKS` to job-level
env vars so the task list and policy aren't duplicated across
eval / extract / parse steps.
- docker: pin RoboTwin + CuRobo upstream clones to commit SHAs
(`RoboTwin@0aeea2d6`, `curobo@ca941586`) for reproducibility.
* feat(envs): add RoboCasa365 benchmark integration
Add RoboCasa365 (arXiv:2603.04356) as a new simulation benchmark with
365 everyday kitchen manipulation tasks across 2,500 diverse environments.
New files:
- src/lerobot/envs/robocasa.py: gym.Env wrapper with deferred env creation,
flat 12D action / 16D state vectors, 3-camera support
- docs/source/robocasa.mdx: user-facing documentation
- docker/Dockerfile.benchmark.robocasa: CI benchmark image
Modified files:
- src/lerobot/envs/configs.py: RoboCasaEnv config (--env.type=robocasa)
- pyproject.toml: robocasa optional dependency group
- docs/source/_toctree.yml: sidebar entry
- .github/workflows/benchmark_tests.yml: integration test job
Refs: https://arxiv.org/abs/2603.04356, https://robocasa.ai
Related: huggingface/lerobot#321
* fix(docker): use uv pip to install robocasa in benchmark image
The huggingface/lerobot-gpu base image uses `uv` with a venv at
/lerobot/.venv — `pip` is not on PATH, so `pip install` fails with
"pip: not found". Switch to `uv pip install` which installs into the
existing venv.
Also drop the @v1.0.0 tag pin from the robocasa git URL since the
upstream repo may not have that tag; use default branch instead.
* fix(robocasa): editable install + switch to lerobot/smolvla_robocasa
- pip install from git omits data files like box_links_assets.json
(not declared in package_data). Clone and install editable so the
source tree is used at runtime.
- Download only tex + fixtures_lw asset types (smoke test doesn't need
objaverse/aigen objects). Pipe 'y' to auto-accept download prompt.
- Switch CI policy from pepijn223/smolvla_robocasa to lerobot/smolvla_robocasa.
* fix(docker): re-install lerobot editably after COPY
The nightly huggingface/lerobot-gpu image predates the RoboCasaEnv
registration — so `lerobot-eval --env.type=robocasa` fails at argparse
with "invalid choice" even after COPY . . overlays the new source.
Force an editable reinstall so the venv picks up the current configs.py.
* fix(ci): add rename_map for robocasa eval (image* -> camera*)
Policy lerobot/smolvla_robocasa expects observation.images.camera1/2/3,
but RoboCasaEnv produces observation.images.image/image2/image3.
* fix(robocasa): override RoboCasaGymEnv default split (test -> all)
RoboCasaGymEnv defaults split="test", but create_env only accepts
{None, "all", "pretrain", "target"}, so the out-of-the-box default
crashes with ValueError. Always pass "all" when split is None.
* fix(docker): also download objs_lw (lightwheel objects) for robocasa
Kitchen tasks (e.g. CloseFridge) reference lightwheel object meshes
like Stool022/model.xml. fixtures_lw alone isn't enough — we also
need objs_lw. Still skipping objaverse/aigen to keep image size down.
Made-with: Cursor
* feat(robocasa): raw camera names + benchmark-group task shortcuts
Align the LeRobot env with RoboCasa's native conventions so policies
trained on the upstream datasets don't need a --rename_map at eval
time, and expose the standard task groups as first-class --env.task
values.
- Preserve raw RoboCasa camera names (e.g. robot0_agentview_left)
as observation.images.<name> end-to-end. Drops camera_name_mapping
and DEFAULT_CAMERA_NAME_MAPPING; features/features_map are now
built dynamically from the parsed camera list.
- Accept benchmark-group names as --env.task: atomic_seen,
composite_seen, composite_unseen, pretrain50/100/200/300. Expanded
lazily via robocasa.utils.dataset_registry and auto-sets the
split ("target" | "pretrain").
- Update CI smoke-eval rename_map to map raw cam names to the
camera1/2/3 keys expected by lerobot/smolvla_robocasa.
* docs(robocasa): single-task smolvla train+eval recipe on pepijn223/robocasa_CloseFridge
- Rewrite observation section to use raw RoboCasa camera keys
(observation.images.robot0_agentview_{left,right},
observation.images.robot0_eye_in_hand).
- Add a "Training on a single task" section with a full smolvla
training command on pepijn223/robocasa_CloseFridge, plus matching
single-task eval command.
- Document benchmark-group task shortcuts (atomic_seen, composite_seen,
composite_unseen, pretrain50/100/200/300) as valid --env.task values.
* fix(robocasa): restrict obj_registries to lightwheel by default
CloseFridge (and most kitchen tasks) crashed at reset with
`ValueError: Probabilities contain NaN` coming out of
`sample_kitchen_object_helper`. RoboCasa's upstream default
`obj_registries=("objaverse", "lightwheel")` normalizes per-registry
candidate counts as probabilities; when a sampled category has zero
mjcf paths in every configured registry (because the objaverse asset
pack isn't on disk — ~30GB, skipped by our Docker build), the 0/0
divide yields NaNs and `rng.choice` raises.
- Add `obj_registries: list[str] = ["lightwheel"]` to `RoboCasaEnv`
config; thread it through `create_robocasa_envs`, `_make_env_fns`,
and the gym.Env wrapper to the underlying `RoboCasaGymEnv` (which
forwards to `create_env` → `robosuite.make` → kitchen env).
- Default matches what `download_kitchen_assets --type objs_lw`
actually ships, so the env works out of the box without a 30GB
objaverse download.
- Document the override (`--env.obj_registries='[objaverse,lightwheel]'`)
for users who have downloaded the full asset set.
* fix(docker): also download tex_generative for robocasa benchmark
RoboCasa's lightwheel kitchen fixtures embed references to
`generative_textures/wall/tex*.png` directly in their MuJoCo XML, so
`MjModel.from_xml_string` errors out at reset time with
"No such file or directory" even when the env is constructed with
`generative_textures=None`. The generative textures live under a
separate asset registry key (`tex_generative`) in
`download_kitchen_assets`, distinct from the base `tex` pack we were
already fetching.
- Add `tex_generative` to the download list so the fixture XMLs
resolve.
- Document the remaining omissions (objaverse/aigen, ~30GB) and how
the runtime side pairs this with obj_registries=["lightwheel"] to
avoid sampling from categories whose assets aren't on disk.
* ci(robocasa): smoke-eval 10 atomic tasks instead of 1
Broader coverage in the benchmark CI job: evaluate SmolVLA on ten
fixture-centric atomic RoboCasa tasks (one episode each) instead of
just CloseFridge. The tasks are all drawn from TARGET_TASKS.atomic_seen
and selected to avoid object-manipulation categories that would require
the objaverse/aigen asset packs (we only ship objs_lw in the Docker
image, paired with obj_registries=["lightwheel"] on the runtime side).
Tasks: CloseFridge, OpenCabinet, OpenDrawer, TurnOnMicrowave,
TurnOffStove, CloseToasterOvenDoor, SlideDishwasherRack,
TurnOnSinkFaucet, NavigateKitchen, TurnOnElectricKettle.
`scripts/ci/parse_eval_metrics.py` already handles multi-task output
via the `overall` key, so no parser changes needed. Bumped the metrics
artifact's task label to `atomic_smoke_10` to reflect the grouping.
* fix(pyproject): drop unresolvable robocasa extra
robocasa's upstream setup.py hardcodes `lerobot==0.3.3` in
install_requires. Exposing it as the `lerobot[robocasa]` extra made
uv's dep resolver cycle: `lerobot[robocasa]` -> robocasa -> lerobot
(a different version) -> unsolvable. This broke every `uv sync` — even
invocations with an unrelated extra like `--extra test` — because uv
validates the whole lockfile graph.
- Remove the `robocasa` extra from pyproject.toml. Installation
instructions in docs/source/robocasa.mdx now walk users through the
manual `git clone` + `pip install --no-deps` flow, which matches
what the Docker image already does and sidesteps the cyclic dep
entirely.
- Dockerfile: `uv pip install -e ~/robocasa --no-deps` so the
shadowed lerobot==0.3.3 never lands in the image; install
robocasa's actual runtime deps (numpy, numba, scipy, mujoco,
tianshou, etc.) explicitly.
* docs(robocasa): align page with adding_benchmarks template
Rework docs/source/robocasa.mdx to follow the standard benchmark doc
structure: intro + links + available tasks (with family breakdown and
first-class benchmark-group shortcuts) + installation + eval +
recommended episodes + policy I/O + training + reproducing results.
- Fix the paper link (was pointing at a non-existent arxiv ID).
- Surface lerobot/smolvla_robocasa and pepijn223/robocasa_CloseFridge
in the top-of-page links so they're findable without reading the
training section.
- Add an explicit "Object registries" subsection explaining the
`--env.obj_registries=[objaverse,lightwheel]` override path.
- Add an explicit "Reproducing published results" section pointing
at the CI smoke eval.
* fix: integrate PR #3375 review feedback
- envs(robocasa): hoist the duplicated `_parse_camera_names` helper
out of `libero.py` and `robocasa.py` into `envs/utils.py` as the
public `parse_camera_names`; call sites updated.
- envs(robocasa): give each factory a distinct `episode_index`
(`0..n_envs-1`) and derive a per-worker seed series in `reset()`
so n_envs workers don't all roll the same scene under a shared
outer seed.
- envs(robocasa): drop the unused `**kwargs` on `_make_env`; declare
`visualization_height` / `visualization_width` on both the wrapper
and the `RoboCasaEnv` config + propagate via `gym_kwargs`.
- envs(robocasa): emit `info["final_info"]` on termination (matching
MetaWorld) so downstream vector-env auto-reset keeps the terminal
task/success flags.
- docs(robocasa): add `--rename_map` (robot0_agentview_left/
eye_in_hand/agentview_right → camera1/2/3) plus CI-parity flags to
all three eval snippets.
- docker(robocasa): pin robocasa + robosuite git SHAs and the pip
dep versions (pygame, Pillow, opencv-python, pyyaml, pynput, tqdm,
termcolor, imageio, h5py, lxml, hidapi, gymnasium) for
reproducible benchmark images.
- ci(robocasa): update the workflow comment — there is no
`lerobot[robocasa]` extra; robocasa/robosuite are installed
manually because upstream's `lerobot==0.3.3` pin shadows ours.
* docs(robocasa): add benchmark banner image
* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs
Port of #3416 onto this branch. Also threads the cached metadata
through the RoboCasa factory so async eval on `--env.type=robocasa`
keeps the same improvement.
* fix: integrate PR #3375 review feedback (round 2)
- envs(robocasa): when the caller passes `seed=None` to `reset()`,
fall back to `self.episode_index` for the inner env seed so each
worker still samples a distinct trajectory instead of all workers
inheriting the same global RNG state.
- envs(robocasa): replace the two module-level `print()` calls in
`create_robocasa_envs` with `logger.info(...)` via a module-level
`logger = logging.getLogger(__name__)`.
- ci(robocasa): run `scripts/ci/extract_task_descriptions.py` after
the eval so `metrics.json` carries per-task natural-language
labels, matching LIBERO / MetaWorld / VLABench jobs. Added a
`_robocasa_descriptions()` extractor that splits CamelCase task
names into word-level labels keyed by `<task>_0`.
On fork PRs, `secrets.DOCKERHUB_LEROBOT_*` expand to empty strings,
which fails `docker/login-action@v3` with `Error: Username and
password required` before any of the actual build/eval work runs.
Gate the login step on the env-var expansion of the username so the
step is skipped (not failed) when secrets are absent. On the main
repo + maintainer-approved fork runs (`pull_request_review` path),
the secrets resolve normally, the step runs, and image pulls get
the authenticated Docker Hub rate limit.
Scope: only `benchmark_tests.yml`, the lone benchmark workflow that
triggers on `pull_request` from forks. `full_tests.yml` and
`latest_deps_tests.yml` run under `pull_request_review` / schedule /
workflow_dispatch, where secrets are already guaranteed.
Context: surfaced on #3416 where an external contributor's PR failed
at the login step before any test could run.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(record): pass rename_map to make_policy in lerobot-record
Fixes#3181. The rename_map from dataset config was used for preprocessor
construction but not passed to make_policy(), causing feature mismatch
errors when camera key names differ between dataset and model config.
make_policy() already accepts a rename_map parameter and uses it to skip
visual feature consistency validation when remapping is active, but
lerobot_record.py was not passing it through.
* style: fix ruff format for ternary expression
---------
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>