The setuptools package-data declaration only listed envs/*.json, so
pip-installed wheels (including HF Jobs runs) were missing the
module_1_subtasks/plan/memory and module_2/3 prompt templates,
causing FileNotFoundError at runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default backend is now a local OpenAI-compatible server (vllm /
transformers) which auto_serve spawns. Removes the
use_hf_inference_providers config flag and the router.huggingface.co
routing branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the pipeline completes, optionally create/locate a dataset repo
and upload the dataset root (excluding .annotate_staging/). Add
push_private and push_commit_message knobs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Saturates parallel_servers + client_concurrency. Previously the
executor processed one episode at a time, so each Module 1 episode's
3-5 dependent VLM calls hit a single server with the others idle. Now
defaults to 16 episodes in flight; configurable via
ExecutorConfig.episode_parallelism.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vllm with --uvicorn-log-level warning suppresses the "Uvicorn running"
banner that the readiness watcher waited for, so the spawn helper hung
forever even after the API was live. Add an HTTP probe in parallel with
the log watcher and broaden the log markers to include vllm's own
"Starting vLLM API server" / "Available routes are" lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 server-streaming threads writing chars unsynchronized cause UTF-8
sequences from different servers to interleave mid-byte, garbling the
terminal output. Switch to line-buffered reads with a single shared
print lock — output stays readable, ready-marker detection still works
on the line containing 'Uvicorn running' / 'Application startup
complete'.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds vlm.client_concurrency (default 16) which uses a ThreadPoolExecutor
to fan out batched chat.completions calls. vllm batches them internally
on the server side, giving big throughput wins on a single TP=1 server
without needing DP/TP and the NCCL setup it requires.
Module 3 now batches all per-episode VQA calls into a single
generate_json invocation so they fire in parallel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds --vlm.parallel_servers=N. Spawns N independent vllm processes
(each pinned to GPU i via CUDA_VISIBLE_DEVICES, listening on
serve_port+i) and round-robins requests across them. Sidesteps DP/TP
NCCL setup failures on nodes with restricted P2P/SHM.
Default serve_command for parallel mode: vllm serve <model_id>
--tensor-parallel-size 1 --max-model-len 32768 --uvicorn-log-level
warning. Override via --vlm.serve_command (use {port} placeholder).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some prompts/models occasionally return pure prose with no JSON object
even on retry. Returning None (and logging a preview) lets the pipeline
skip that one VLM call cleanly instead of aborting the whole episode.
The modules already check for None / non-dict results and degrade
gracefully (no row emitted from that call).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Models often wrap JSON in prose or <think>...</think> blocks. Strip the
think tags first, then try direct json.loads, then fall back to scanning
for the first balanced {...} substring (ignoring braces inside strings).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the fixed max_video_frames count with a rate (default 1 fps).
A 30 s episode now sends 30 frames; a 5 s episode sends 5; capped at
max_video_frames (default 128) to avoid blowing up the payload on long
episodes.
Override with --module_1.frames_per_second=2.0 for denser sampling, or
--module_1.frames_per_second=0.5 for sparser.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fall back to huggingface_hub.get_token() when HF_TOKEN/HUGGINGFACE_API_KEY
env vars aren't set. That picks up the token cached by
'huggingface-cli login' so users don't need to export it on every shell.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the default backend to 'openai' with use_hf_inference_providers=True
and a Qwen3-VL-30B-A3B-Instruct:novita default model_id. The CLI now
runs end-to-end without a local model load — annotations are produced
by sending video_url + prompt to https://router.huggingface.co/v1.
Switch back to local inference with --vlm.backend=vllm or
--vlm.use_hf_inference_providers=false.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting --vlm.use_hf_inference_providers=true routes requests through
https://router.huggingface.co/v1 using HF_TOKEN as the API key, and
disables auto_serve so no local server is spawned. Combine with a
provider-pinned model id like 'Qwen/Qwen3-VL-30B-A3B-Instruct:novita'
or any plain model id to let HF route.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transformers serve returns HTTP 422 'Unexpected fields' when
mm_processor_kwargs is in extra_body — that field is vllm-specific.
Drop it by default; opt in via LEROBOT_OPENAI_SEND_MM_KWARGS=1 when
talking to vllm serve.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes for video_url with transformers serve:
- fps must be in extra_body.mm_processor_kwargs, not in the content
block; otherwise the server discards it as unknown kwargs.
- file:// URLs aren't fetched by transformers serve. Read the local mp4
and inline it as a base64 data:video/mp4 URL so the server sees the
bytes directly.
Both surface as std::bad_alloc on the server side when wrong, which is
unhelpful but explains what we hit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
transformers serve rescans the HF cache on every /v1/models request
which exceeds the 2s urllib timeout, leaving the probe loop spinning
even after Uvicorn is fully up. Watch the streamed server output for
'Uvicorn running' / 'Application startup complete' instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous logger-based output never appeared, leaving users in the
dark when auto_serve silently no-op'd. Switch to print(flush=True) so
the spawn decision is unmistakable, and stream the server's stdout to
the parent terminal in real-time on a background thread so model-load
progress and errors surface immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default auto_serve to True so lerobot-annotate can drive the entire
flow with one command. Probe api_base/models first — if a server is
already reachable (user started one manually, or it's a remote
endpoint), skip the spawn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting --vlm.auto_serve=true with --vlm.backend=openai makes the CLI
launch 'transformers serve <model_id> --port <serve_port>
--continuous-batching' as a child process, poll /v1/models until ready
(up to serve_ready_timeout_s), run the pipeline, then SIGINT the
server on process exit.
Override the spawn command with --vlm.serve_command='vllm serve ...'
or any OpenAI-compatible launcher.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Module 1 can now send the episode's actual mp4 file as a video_url
content block instead of pre-decoded frames. The server (transformers
serve / vllm serve / ktransformers serve) handles frame sampling at
the configured fps. Default fps=1 (one frame per second is enough for
subtask-boundary detection on manipulation episodes).
A per-episode subclip is extracted to <root>/.annotate_staging/.video_clips/
via ffmpeg stream-copy (no re-encode) so the model sees only this
episode's frames, not the whole shard.
Enable with --module_1.use_video_url=true (and --vlm.backend=openai).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third backend that talks to any OpenAI-compatible server. This
unblocks Qwen3.6 (and other models) that work in transformers serve /
ktransformers but not in vllm 0.10.2's fallback path:
- launch the server out-of-process (transformers serve, vllm serve,
ktransformers serve)
- point lerobot-annotate at it via --vlm.backend=openai
--vlm.api_base=http://localhost:8000/v1 --vlm.model_id=...
Image and video blocks are converted to OpenAI image_url/video_url
data URLs automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vllm.generate() expects a string/TextPrompt; passing message dicts
fails. vllm.chat() applies the chat template and extracts image/video
blocks automatically, which is what we need for VL models.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vllm 0.10.2 expects guided_decoding to be a GuidedDecodingParams object,
not a dict. Different vllm versions differ here. The parser already has
a one-retry JSON-recovery path, so drop guided decoding entirely for
portability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pyav (and sometimes torchcodec) decode can return fewer frames than
requested timestamps when some timestamps fall outside the video file's
content range. Drop the strict=True on the zip and rely on the
None-filter to discard missing frames.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
torchcodec's __init__ bad-allocs on the cu128/torch-2.8 stack in some
environments (Lustre/conda combos). The annotation pipeline calls
decode_video_frames many times per episode, so this is a hard blocker.
Default to pyav (always available via the av package) and let users
opt back into torchcodec via LEROBOT_VIDEO_BACKEND=torchcodec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Setting trust_remote_code=True unconditionally pulled custom loader
code that triggers std::bad_alloc post-load on Qwen3-VL — the official
transformers class is sufficient. Flip the default to False; keep the
config field so users can opt in for models that actually need it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loading Qwen3-VL via transformers + accelerate's device_map='auto'
fails with std::bad_alloc on hosts with abundant RAM. The bug is in
accelerate's post-load dispatch path. Bypassing accelerate by loading
to CPU first and then calling .to('cuda') manually avoids that path.
LEROBOT_TRANSFORMERS_DEVICE_MAP=auto switches back to the old behavior
for cases where it works.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cuDNN 9.x + torch 2.8 has a regression where the conv3d kernel used in
Qwen-VL vision tower patch embedders fails with
CUDNN_STATUS_NOT_INITIALIZED. The crash is independent of model size
and reproduces on both Qwen2.5-VL and Qwen3-VL because both use 3D conv
for video patch embedding.
Setting LEROBOT_DISABLE_CUDNN=1 falls back to native PyTorch conv3d
kernels (slower but functional) so the pipeline can run while the
torch/cuDNN stack is still on the broken combo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Large VL models (Qwen3-VL-30B-A3B BF16) take ~58 GB of an 80 GB H100,
leaving only ~22 GB for KV cache + cuDNN workspace. The vision tower's
3D conv then fails with CUDNN_STATUS_NOT_INITIALIZED because cuDNN
can't grab a workspace large enough.
- vlm.gpu_memory_utilization (default 0.9) — drop to 0.7 when the vision
encoder needs more cuDNN workspace.
- vlm.max_model_len — cap context to free KV cache memory; the 262k
default for Qwen3 is wildly more than annotation prompts need.
- vlm.trust_remote_code — already plumbed; now also passed to LLM().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required for many newer VL checkpoints (Qwen3.x FP8 in particular) that
ship custom loader code in their repo. Without it, the FP8
weight_scale_inv parameters never bind to FP8Linear modules and the
post-load dispatch path bad-allocs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The std::bad_alloc we hit on Qwen3-line VL models is not a real OOM —
it triggers in the post-load tensor-placement path even on hosts with
2 TB RAM. low_cpu_mem_usage=True bypasses the offending intermediate
staging buffer and is the standard accelerate workaround.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without device_map, transformers stages the full FP8 checkpoint in CPU
RAM before any GPU placement, OOMing the host on 27B+ models even when
the GPU has enough VRAM. device_map='auto' streams shards directly to
GPU memory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Newer transformers versions renamed/removed AutoModelForVision2Seq in
favour of AutoModelForImageTextToText for VL models. Try the new name
first and fall back gracefully so the transformers backend works on
both transformers 4.45-4.5x and 5.x.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Older draccus versions (e.g. 0.10.x bundled in some envs) lack a decoder
for typing.Literal and raise:
No decoding function for type typing.Literal['vllm', 'transformers', 'stub']
Switching VlmConfig.backend from Literal to str works under every
draccus version. The runtime branch in vlm_client.make_vlm_client
already validates the value and raises ValueError on unknown backends,
so the constraint stays enforced.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces keyframe sampling with a single Qwen-VL video block covering
the whole demonstration. The model pools temporally itself and chooses
where to cut subtasks — no stride, no count, no keyframe count knob to
tune.
- frames.py: ``FrameProvider`` gains ``video_for_episode(record,
max_frames)``; ``VideoFrameProvider`` samples up to ``max_frames``
uniformly across the episode duration; ``_NullProvider`` returns []
for the no-video fallback. New ``to_video_block`` helper.
- Module 1: drops keyframe sampling. The subtask prompt now goes out as
``[{"type":"video", "video":[<frames>]}, {"type":"text", ...}]`` and
the prompt template asks the model to "watch the whole clip, then
segment it" with cut points decided from gripper/contact/regrasp
events the model sees.
- Module1Config: ``keyframes_per_episode`` removed; replaced with
``max_video_frames: int = 32`` (model-capacity bound, not annotation
logic).
- Test: ``test_module1_attaches_video_block_to_subtask_prompt`` locks in
the single-video-block invariant.
- Stub-VLM markers updated: tests now key on "atomic subtasks" instead
of the old "Decompose the demonstration" phrase that no longer
appears in the prompt.
- Docs: updated to describe the whole-episode video-block behavior and
the no-video fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the visual-grounding gap flagged after the initial PR review:
modules now decode actual camera frames at the relevant timestamps and
attach them as `{"type":"image", "image":<PIL>}` content blocks to the
VLM prompts.
- New `frames.py`:
- `FrameProvider` Protocol; `VideoFrameProvider` decodes from the
dataset's first `observation.images.*` stream via
`LeRobotDatasetMetadata.get_video_file_path` and
`decode_video_frames`, with the same `from_timestamp` shift the main
dataset uses.
- Per-process LRU cache so co-timestamped Module 1 plan-update + Module
2 calls share decode work.
- `make_frame_provider` falls back to a null provider when the dataset
has no video tracks → text-only prompts (graceful absence).
- Modules 1/2/3 take an optional `frame_provider` (default null) and
prepend image blocks before the text block.
- Module 1 attaches `keyframes_per_episode` keyframes to the subtask
decomposition prompt.
- Module 2 attaches the frame at the interjection timestamp.
- Module 3 attaches the exact emission frame to each VQA pair.
- VlmConfig: backend now defaults to `vllm`; default model is
`Qwen/Qwen3.6-27B-FP8`. New knobs: `--vlm.tensor_parallel_size`,
`--vlm.camera_key` (override the keyframe stream).
- `_make_vllm_client` honours `tensor_parallel_size` so 27B-FP8 sharded
on 2× GPUs works out of the box.
- `test_module3_attaches_frame_image_block_to_prompt` asserts modules
emit one image block per VQA prompt at the exact emission timestamp.
- Docs: example switched to `imstevenpmwork/super_poulain_draft` +
Qwen3.6-27B-FP8 + tensor_parallel_size=2; documents the keyframe
attachment behaviour and the no-video fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Motion primitives are described in robot-frame (joint / Cartesian) terms,
not pixel space, so they are camera-agnostic. Only `vqa` (event) and
`trace` (event, pixel-trajectory) are view-dependent.
The `camera` field stays on PERSISTENT_ROW_FIELDS for schema symmetry —
the validator, resolver, and HF feature mapping behave identically across
the two columns regardless of which styles populate `camera` today —
but persistent rows now always have `camera=None` in practice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a nullable `camera` field to the language row struct (both persistent
and event variants) so view-dependent styles like `vqa` can carry which
`observation.images.*` view they were grounded against. Without this,
multi-camera datasets ended up with multiple `(vqa, role)` rows at the
same timestamp that the resolver could not disambiguate.
- `language.py`: add `camera` to PERSISTENT_ROW_FIELDS / EVENT_ROW_FIELDS,
to both Arrow struct types and the HF datasets feature mappings;
introduce VIEW_DEPENDENT_STYLES = {vqa, motion, trace} plus
`is_view_dependent_style` and `validate_camera_field` helpers (camera
required iff style is view-dependent).
- `language_render.py`: thread an optional `camera=` kwarg through every
resolver (`active_at`, `emitted_at`, `nth_prev`, `nth_next`) and through
`_matching_rows` / `_select_*`, so recipes can disambiguate per-camera
VQA with `emitted_at(t, style=vqa, role=assistant, camera=...)`.
Without a `camera` filter, multi-row matches keep raising the existing
ambiguity error — which is the desired behaviour on multi-camera data.
- `recipes/pi05_hirobot.yaml`: replace the single `ask_vqa` branch with
`ask_vqa_top` and `ask_vqa_wrist` per-camera sub-recipes (each carrying
the matching image block), keeping the original 0.20 budget and
documenting the customization point for datasets with different cameras.
- Tests: schema test asserts the new field order; new tests cover
`is_view_dependent_style`, `validate_camera_field` (both required and
forbidden directions), per-camera `emitted_at` filtering, and the
ambiguity error when two cameras emit `(vqa, assistant)` at the same
timestamp without a `camera=` filter. RenderMessagesStep + dataset
passthrough fixtures updated to include the new field.
- `docs/source/language_and_recipes.mdx`: document the `camera` field,
the per-camera resolver pattern, and the canonical recipe convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promote the previously-reserved motion/trace styles to first-class core
styles. motion routes to language_persistent (it tracks robot state over
time); trace routes to language_events (single-moment annotations).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers private helpers in recipe.py, language.py, language_render.py,
and render_messages_processor.py. Also reverts uv.lock to main (it was
re-generated by `uv run` during local checks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(sim): add VLABench benchmark integration
Add VLABench as a new simulation benchmark in LeRobot, following the existing LIBERO and MetaWorld patterns.
This PR wires VLABench end-to-end across environment integration, Docker setup, CI smoke evaluation, and documentation. It also fixes a number of upstream packaging and runtime issues required to make VLABench usable and reproducible in CI.
What’s included
Benchmark integration
Add VLABench as a new simulation benchmark.
Expose supported VLABench tasks through the LeRobot env interface.
Follow the established LIBERO / MetaWorld factory patterns.
Preserve lazy async-env metadata so env.unwrapped.metadata["render_fps"] continues to work.
CI smoke evaluation
Add a VLABench smoke-eval job using lerobot/smolvla_vlabench.
Use the correct rename_map for the 3-camera dataset layout.
Expand smoke coverage from 1 to 10 primitive tasks.
Extract task descriptions after eval so metrics artifacts include per-task labels.
Skip Docker Hub login when secrets are unavailable (e.g. fork PRs).
Docker / install fixes
Install VLABench from GitHub rather than PyPI.
Use uv pip, not pip, in the base image.
Fail loudly on install errors instead of masking them.
Clone VLABench into the non-root user’s home directory.
Use shallow editable installs for VLABench and rrt-algorithms to work around missing __init__.py issues.
Pin upstream clones to exact commit SHAs for reproducibility.
Add undeclared runtime dependencies required by VLABench (open3d, colorlog, scikit-learn, openai).
Unpin open3d so Python 3.12 wheels resolve.
Assets
Support downloading VLABench assets from a Hugging Face Hub mirror via VLABENCH_ASSETS_REPO.
Keep Google Drive download support as fallback.
Install huggingface_hub[hf_xet] so Xet-backed assets download correctly.
Validate required mesh/XML asset subtrees at build time.
Patch VLABench constants to tolerate missing asset directories at import time.
Runtime / env correctness
Import VLABench robots and tasks explicitly so decorator-based registry population happens.
Resize and normalize camera observations so they always match the declared (H, W, 3) uint8 observation space.
Reinstall LeRobot editably inside the image so the new env code is actually used.
Coerce agent_pos / ee_state to the expected shape.
Pad actions when needed to match data.ctrl.
Replace zero-padding fallback with proper dm_control IK for 7D end-effector actions.
Refetch dm_control physics on each step instead of caching weakrefs.
Retry unstable resets with reseeding and handle PhysicsError gracefully at step time.
Dataset / policy alignment
Align VLABench observations and actions with Hugging Face dataset conventions used by lerobot/vlabench_unified:
convert EE position between world frame and robot-base frame at the env boundary,
expose / consume Euler XYZ instead of raw quaternion layout,
align gripper semantics with dataset convention (1 = open, 0 = closed).
This fixes policy/env mismatches that previously caused incorrect IK targets and unstable behavior at evaluation time.
Docs
Add a full docs/source/vlabench.mdx page aligned with the standard benchmark template.
Document task selection forms (single task, comma list, suite shortcut).
Document installation, evaluation, training, and result reproduction.
Point examples at lerobot/smolvla_vlabench.
Add a benchmark banner image.
Remove outdated / misleading references to upstream evaluation tracks.
Document manual install flow instead of a broken vlabench extra.
Packaging cleanup
Remove the unresolvable vlabench extra from pyproject.toml.
Remove the no-op VLABench processor step.
Remove the obsolete env unit test that only covered the dropped gripper remap helper.
Apply formatting / logging / style cleanup from review feedback.
Why this is needed
VLABench is not currently consumable as a normal Python dependency and requires several upstream workarounds:
no PyPI release,
missing package declarations,
undeclared runtime deps,
SSH-only submodule references,
asset downloads outside normal package install flow,
registry population that depends on import side effects,
env outputs that do not always match declared observation shapes,
task resets that can diverge under some random layouts.
This PR makes the benchmark usable in LeRobot despite those constraints, and ensures CI runs are reproducible and informative.
If you want a much shorter squash commit message, I’d use this:
feat(sim): integrate VLABench benchmark with CI, Docker, and docs
Add VLABench as a new LeRobot simulation benchmark, following the existing LIBERO / MetaWorld patterns.
This includes:
LeRobot env integration and task exposure,
CI smoke eval with lerobot/smolvla_vlabench,
Docker install and asset-download fixes,
runtime fixes for registry loading, assets, camera obs, action handling, dm_control IK, and PhysicsError recovery,
alignment of obs/action semantics with HF VLABench datasets,
docs and packaging cleanup.
The PR also incorporates review feedback, improves reproducibility by pinning upstream commits, and makes VLABench usable in CI despite upstream packaging and asset-management issues.
* feat(envs): add LIBERO-plus robustness benchmark integration
- LiberoPlusEnv config (subclass of LiberoEnv, same gym interface)
- Docker image installing LIBERO-plus fork via PYTHONPATH
- CI workflow: 1-episode smoke eval with pepijn223/smolvla_libero_plus
- pyproject.toml: libero_plus extra
* fix(libero): use suite's perturbation-aware init_states loader
LIBERO-plus's Benchmark class exposes a `get_task_init_states(i)` method that
strips perturbation suffixes (`_table_N`, `_tb_N`, `_view_`, `_language_`,
`_light_`, `_add_`, `_level`) and loads the underlying base `.pruned_init`
file — the on-disk name for a perturbation variant doesn't exist as a file,
only the base does. lerobot's loader was bypassing that logic and trying to
read the suffix-bearing filename directly, which failed for every non-zero
task id and killed the eval before any rollout video could be written.
Delegate to the suite's method when it exists; fall back to the path-based
loader for vanilla LIBERO (which does not provide the method).
Also drop the hf-libero install + init_files copy from the LIBERO-plus
Dockerfile — the LIBERO-plus clone already ships both `bddl_files/` and
`init_files/` for all five suites, so the copy was unnecessary and the
`cp -r` into an existing dir produced a confusing nested layout.
* fix(libero): resolve LIBERO-plus perturbation init_states path ourselves
Delegating to `task_suite.get_task_init_states(i)` works for path resolution
but LIBERO-plus's method calls `torch.load(path)` without `weights_only=False`,
which fails on PyTorch 2.6+ because the pickled init_states contains numpy
objects not in the default allowlist:
_pickle.UnpicklingError: Weights only load failed.
WeightsUnpickler error: Unsupported global:
GLOBAL numpy.core.multiarray._reconstruct was not an allowed global.
Mirror LIBERO-plus's suffix-stripping logic (`_table_N`, `_tb_N`, `_view_`,
`_language_`, `_light_`, `_add_`, `_level`) in our own helper so we can pass
`weights_only=False` ourselves. Vanilla LIBERO task names don't contain any
of these patterns except for `_table_` when followed by the word `center`
(e.g. `pick_up_the_black_bowl_from_table_center_...`), and the regex
requires `_table_\\d+` so semantic uses are preserved.
* fix(libero-plus): download perturbation assets from Sylvest/LIBERO-plus
LIBERO-plus's bddl_base_domain.py resolves scene XMLs with
`os.path.join(DIR_PATH, "../assets")`, so the `assets` key in config.yaml
has no effect on scene lookup — MuJoCo always opens
`<clone>/libero/libero/assets/scenes/...`. With no such directory present,
every perturbation task fails on:
FileNotFoundError: No such file or directory:
.../libero-plus/libero/libero/assets/scenes/tabletop_table_Cobblestone01_GLOSS_6K.xml
These textures, views, and extra objects ship only in the 6.4 GB `assets.zip`
published at `Sylvest/LIBERO-plus` (the LIBERO-plus README explicitly says
to download and unzip it into the package dir). Fetch it via `hf_hub_download`,
unzip into `${LIBERO_PLUS_ROOT}/`, install `unzip`, and point config.yaml at
the extracted dir so everything stays consistent. The download lives in its
own Docker layer so subsequent rebuilds reuse the cached assets.
Drops the lerobot/libero-assets snapshot_download — that mirror only has
vanilla LIBERO textures and is ignored for scene loading anyway.
* fix(libero-plus): flatten deep path prefix from Sylvest/LIBERO-plus assets.zip
The 6.4 GB zip ships with every entry prefixed by
`inspire/hdd/project/embodied-multimodality/public/syfei/libero_new/release/dataset/LIBERO-plus-0/assets/...`
(the author's internal filesystem layout, not the layout the LIBERO-plus
README promises), so the previous `unzip -d ${LIBERO_PLUS_ROOT}/` created
`${LIBERO_PLUS_ROOT}/inspire/.../assets/` — robosuite still opened
`${LIBERO_PLUS_ROOT}/assets/scenes/tabletop_table_Cobblestone01_GLOSS_6K.xml`
and hit the same FileNotFoundError.
Extract to a scratch dir, then `mv` the nested `assets/` subtree to the
expected location. Verified the target file exists in the zip central
directory under that exact prefix.
* refactor(libero): inline init_states resolver behind single regex
Collapse the three-style suffix stripper (split/re.sub/in) into one
compiled regex, drop the (Path, bool) tuple return, and move the
`_add_`/`_level` reshape branch into the caller so each branch loads
its own file and returns directly. Net: -11 lines, one fewer helper.
* refactor(libero-plus): rebase docker image on huggingface/lerobot-gpu
Mirror the libero/metaworld/robomme pattern: start from the nightly GPU
image (apt deps, python, uv, venv, lerobot[all] already there) and only
layer on what LIBERO-plus uniquely needs — its wand/ImageMagick build
deps, the non-extra runtime pips (robosuite==1.4.1, bddl, …), the
PYTHONPATH-shadowed fork, and the 6.4 GB assets.zip.
Drops ~50 lines of duplicated base setup (CUDA FROM, apt python, uv
install, user creation, venv init) the nightly already provides.
123 → 73 lines.
Also:
- Add libero_plus to docs/source/_toctree.yml under Benchmarks so
doc-builder's TOC integrity check stops failing.
- Repoint the docs dataset link from pepijn223/libero_plus_lerobot to
the canonical lerobot/libero_plus.
- Revert the stray uv.lock churn (revision/marker diff that crept in
from an unrelated resolve — unrelated to LIBERO-plus).
* fix(libero-plus): stop touching pyproject + uv.lock
The fast-tests job was rejecting the branch because pyproject.toml had a
[libero_plus] extra whose git dep wasn't represented in uv.lock.
The Docker image no longer needs the extra — it clones LIBERO-plus
directly and PYTHONPATH-shadows hf-libero. Drop [libero_plus] from
pyproject and restore pyproject.toml + uv.lock to exactly what's on
origin/main, so `uv sync --locked --extra test` is a no-op for this PR.
Also repoint the doc/CI/env comments that still mentioned the extra at
the Docker install path.
* fix(libero-plus): strip perturbation metadata from task descriptions
LIBERO-plus builds task.language by space-joining the perturbation-variant
filename, so every non-_language_ variant inherits a trailing blob like
"view 0 0 100 0 0 initstate 0 noise 45" or "add 16". That shows up in the
dashboard video labels and no longer matches the base instruction stored
in the training dataset.
Strip those tokens in extract_task_descriptions.py with an end-anchored
regex over the {view,initstate,noise,add,tb,table,light,level}(+digits)
vocabulary. The anchor preserves mid-sentence literal uses of those words
(e.g. "from table center and place it on the plate") — only the trailing
metadata chain is removed. _language_ variants carry real BDDL-sourced
text and are left untouched.
* ci: point benchmark eval checkpoints at the lerobot/ org mirrors
pepijn223/smolvla_* → lerobot/smolvla_* across every benchmark job in
this branch (libero, metaworld, and the per-branch benchmark). The
checkpoints were mirrored into the lerobot/ org and that's the canonical
location going forward.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: integrate PR #3313 review feedback
- docs: fix paper link to arxiv, add benchmark image, add suite descriptions,
add LIBERO-plus replacement warning, restructure eval section to match
LIBERO doc style, fix policy I/O section, remove false try/except claim
- docker: fix shell grouping for hf-libero uninstall, replace hardcoded
asset path with dynamic find
- ci: add Docker Hub login step, add HF_USER_TOKEN guard on eval step
- envs: add is_libero_plus param to get_task_init_states so vanilla LIBERO
always takes the simple path
* fix(docs): use correct LIBERO-plus teaser image URL
* ci(libero-plus): drop redundant hf auth login step
The standalone login step ran `hf auth login` in a throwaway
`docker run --rm` container, so no credentials persisted. Auth is
already performed inside the eval step's container. Removing the
redundant step per PR #3313 review feedback.
* fix(envs): preserve AsyncVectorEnv metadata/unwrapped in lazy eval envs
Port of #3416 onto this branch. Without these attributes eval crashes
when calling `env.unwrapped.metadata["render_fps"]` with async vector
envs. Adds `metadata` / `unwrapped` to `_LazyAsyncVectorEnv` and
caches the metadata alongside obs/action spaces in the LIBERO and
MetaWorld factories.
* ci: gate Docker Hub login on secret availability
Fork PRs cannot access `secrets.DOCKERHUB_LEROBOT_{USERNAME,PASSWORD}`,
which made every benchmark job fail at the login step before any of
the actual build/eval work could run. Gate the login on the env-var
expansion of the username so the step is skipped (not failed) when
secrets are absent. Mirrors the existing pattern in the VLABench job.
* Update .github/workflows/benchmark_tests.yml
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update scripts/ci/extract_task_descriptions.py
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update .github/workflows/benchmark_tests.yml
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update docker/Dockerfile.benchmark.libero_plus
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* Update .github/workflows/benchmark_tests.yml
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
* fix(libero-plus): address review feedback
* ci(libero-plus): fix YAML indentation in upload-artifact steps
The `uses:` key on two upload-artifact steps was at column 0 instead
of nested under the step, causing `pre-commit run check-yaml` to fail
with "expected <block end>, but found '<block mapping start>'".
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>