* Add multitask diffusion transformer policy
* expand the observation encoder to support different-size encoders for vision and text
* add RoPE attention module as this is shown to help training dynamics and generation quality for DiTs
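  For orientation, a minimal sketch of rotary position embeddings (RoPE) as applied to attention inputs; shapes and names here are illustrative, not the policy's actual module:

  ```python
  import torch

  def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
      """Rotate channel pairs by position-dependent angles (illustrative sketch only).

      x: (batch, seq_len, dim) with dim even.
      """
      batch, seq_len, dim = x.shape
      half = dim // 2
      # One frequency per channel pair, one angle per (position, pair).
      freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
      angles = torch.arange(seq_len, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
      cos, sin = angles.cos(), angles.sin()  # (seq_len, half)
      x1, x2 = x[..., :half], x[..., half:]
      # Rotating (x1, x2) pairs encodes absolute position as relative phase in attention.
      return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
  ```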
* update readme and citations for multitask dit policy
* remove dino vision encoder and simplify text and vision encoders by removing inheritance structure
* adjust factory comment
* update docstring for multitask dit policy processor file
* simplify config for multitask dit by merging and flattening everything, then adding comments to denote where some parameters are only used for specific objectives
* add references to the modeling file comments
* merge all modules files into the main modeling file
* add torch.no_grad decorators
* split up select action return statement
* remove redundant asserts
* add tutorial to training with multi_task_dit
* fix bugs when testing on hardware
* remove environment state conditioning
* update typo in test instruction comment
* add processor tests to multitask dit tests
* move policy to top of file
* use constants for indexing into batches and remove env state references
* remove the base classes since we don't need to be able to extend
* fix nit formatting in generate actions fcn
* reformat and clean up tutorial for multitask dit policy
* add more descriptions and depth to multitask dit tutorial
* note origins of each training objective
* rename config param for multiple vision encoders
* refactor code to perform task tokenization in the processor instead of in the modeling code for multitask dit
* add multitask dit to toc for docs
* add conditional transformers import to match all other policies that use transformers lib
* add test handling for multitask dit when transformers isn't available
* skip tests without transformers
* remove cropping of images smaller than the crop size
* add kwargs arg to multitask dit constructor
* add wallx dep conflict management for multitask dit policy
* use hyphens for cleanliness in pyproject.toml
* add conflict management to pyproject.toml for the pi dependency conflict with multitask dit as well
* update tests script to drop an unnecessary `uv sync` call that resolved dependencies not needed for the run, drastically reducing CI run time
* revert fast tests edits
* update docs and readme files, fixing some typos and adding multitask dit to readme
* chore(dependencies): upgrade transformers + huggingface-hub + peft + scipy
* chore(dependencies): bump pi0 family to transformers v5
* chore(dependencies): bump wall x to transformers v5
* chore(dependencies): bump gr00t to transformers v5
* chore(style): fix pre-commit
* fix(policy): xvla forced_bos_token missing
* test(rl): skip ci tests for resnet10
* Fix: full pi models support for transformers v5 (#2967)
* fix(pi): remove loss truncation
* fix(pi): remove state padding before tokenization
* fix(pi): fix image padding value
* fix from_pretrained
* add transformers v5 changes
* remove reference
* more fixes
* make it work
* add support for rest of pi family
* add pifast work
* more changes
* more changes
* more cleanup
* fix torch params
* dtype fix
* torch compile
* embed mismatch fix
* revert groot
* more nit fixes
* remove unused classes
* more fixes
* revert
* nit
* torch dtype warning fix
* put back dynamic renaming
* add tie embedding
---------
Co-authored-by: Yufei Sun <skieyfly@gmail.com>
* chore: fix XVLA in transformers v5 (#3006)
* test(policies): enable wall x CI testing
* style(test): pre-commit check
* style(test): pre-commit
---------
Signed-off-by: Bryson Jones <63133702+brysonjones@users.noreply.github.com>
Co-authored-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
Co-authored-by: Jade Choghari <chogharijade@gmail.com>
Co-authored-by: Yufei Sun <skieyfly@gmail.com>
Co-authored-by: Steven Palma <steven.palma@huggingface.co>
* refactor(dataset): enhance dataset root directory handling and introduce hub cache support
- Updated DatasetConfig and LeRobotDatasetMetadata to clarify root directory behavior and introduce a dedicated hub cache for downloads.
- Refactored LeRobotDataset and StreamingLeRobotDataset to utilize the new hub cache and improve directory management.
- Added tests to ensure correct behavior when using the hub cache and handling different revisions without a specified root directory.
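A hedged usage sketch of the intended behavior; the import path and constructor parameters are assumptions based on the existing LeRobotDataset API:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset  # import path is an assumption

# Without an explicit root, downloads are served from the shared hub cache,
# so two revisions of the same repo do not clobber each other on disk.
ds_main = LeRobotDataset("lerobot/pusht", revision="main")

# An explicit root still pins a dedicated working directory, as before.
ds_local = LeRobotDataset("lerobot/pusht", root="./data/pusht")
```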
* refactor(dataset): improve root directory handling in LeRobotDataset
- Updated LeRobotDataset to store the requested root path separately from the actual root path.
- Adjusted metadata loading to use the requested root, enhancing clarity and consistency in directory management.
* refactor(dataset): minor improvements for hub cache support
* chore(datasets): guard in resume + assertion test
---------
Co-authored-by: AdilZouitine <adilzouitinegm@gmail.com>
Co-authored-by: mickaelChen <mickael.chen.levinson@gmail.com>
* chore(docs): add more guidance to bring your own policies tutorial
* removing normalization to avoid confusion with processors
* trailing whitespace
* Update docs/source/bring_your_own_policies.mdx
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
* Update docs/source/bring_your_own_policies.mdx
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
* adding get optim params and predict_action chunk
* removing extra quotes
---------
Signed-off-by: Maxime Ellerbach <maxime@ellerbach.net>
Set httpx logger level to WARNING in init_logging to prevent
HTTP request traces from flooding the terminal during train and
eval scripts.
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
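The silencing itself amounts to a one-liner; a minimal sketch of the idea, with init_logging's surrounding body abbreviated:

```python
import logging

def init_logging() -> None:
    logging.basicConfig(level=logging.INFO)
    # httpx logs every request at INFO; raise its threshold so train/eval
    # terminals are not flooded with HTTP request traces.
    logging.getLogger("httpx").setLevel(logging.WARNING)
```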
* refactor(dataset): split reader and writer
* chore(dataset): remove proxys
* refactor(dataset): better reader & writer encapsulation
* refactor(datasets): clean API + reduce leaky implementations
* refactor(dataset): API cleaning for writer, reader and meta
* refactor(dataset): expose writer & reader + other minor improvements
* refactor(dataset): improve teardown routine
* refactor(dataset): add hf_dataset property at the facade level
* chore(dataset): add init for dataset module
* docs(dataset): add docstrings for public API of the dataset classes
* tests(dataset): add tests for new classes
* fix(dataset): remove circular dependency
* add blog/guide
* add to tree
* chore(docs): rephrase rename_map docs for clarity and simplicity
---------
Co-authored-by: Steven Palma <steven.palma@huggingface.co>
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
* fix(vqbet): use in-place fill_ to avoid overwriting DDP GPU buffers with CPU tensors
When VQ discretization phase completes, the code was overwriting
register_buffer('discretized') and register_buffer('freeze_codebook')
with torch.tensor(True), which is created on CPU. DDP then fails in
_sync_buffers() with: RuntimeError: No backend type associated with
device type cpu. Fix by updating the buffers in-place with .fill_(True)
so device and registration are preserved.
Made-with: Cursor
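The failure mode and fix in miniature (buffer names follow the commit message; the surrounding module is an illustrative stand-in):

```python
import torch
from torch import nn

class Discretizer(nn.Module):  # illustrative stand-in for the VQ-BeT module
    def __init__(self) -> None:
        super().__init__()
        self.register_buffer("discretized", torch.tensor(False))
        self.register_buffer("freeze_codebook", torch.tensor(False))

    def finish_discretization(self) -> None:
        # BROKEN under DDP: rebinding replaces the registered buffer with a
        # fresh CPU tensor, so _sync_buffers() sees a cpu-device buffer.
        # self.discretized = torch.tensor(True)

        # FIX: mutate in place; device placement and buffer registration survive.
        self.discretized.fill_(True)
        self.freeze_codebook.fill_(True)
```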
* test(vqbet): add regression test for in-place buffer update during discretization
Verifies that discretize() updates the 'discretized' and 'freeze_codebook'
registered buffers in-place (via fill_()) rather than replacing them with new
CPU tensors. The test checks data_ptr() identity and that the tensors remain
registered buffers after the call. This prevents regressions of the DDP fix.
Made-with: Cursor
* test(vqbet): add GPU regression test to verify buffers stay on CUDA after discretize()
Directly catches the original DDP failure mode: when buffers are replaced with
torch.tensor(True) they land on CPU, causing NCCL to raise 'No backend type
associated with device type cpu' in _sync_buffers(). The GPU test places the
model on cuda:0 and asserts both buffers remain on CUDA after discretization.
Made-with: Cursor
* test(vqbet): simplify to single device-check test in test_policies.py
Per reviewer feedback: remove the separate test file and replace the two
CPU/GPU tests (with data_ptr checks) with a single focused test in
tests/policies/test_policies.py that only asserts the registered buffers
remain on the model device after discretize(). Uses DEVICE from tests/utils.py
so it runs on whatever device the CI/user selects (cpu, cuda, mps).
Made-with: Cursor
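The final test roughly reduces to a device assertion; a sketch assuming a policy fixture and the DEVICE constant from tests/utils.py:

```python
import torch

from tests.utils import DEVICE  # device selected by the CI/user (cpu, cuda, mps)

def test_vqbet_buffers_stay_on_device(policy):  # `policy` fixture is an assumption
    module = policy.to(DEVICE)
    module.discretize()  # method name taken from the commit messages
    assert module.discretized.device.type == torch.device(DEVICE).type
    assert module.freeze_codebook.device.type == torch.device(DEVICE).type
```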
* style: fix import order in test_policies.py to pass ruff/pre-commit checks
Made-with: Cursor
---------
Co-authored-by: Zhan DiJia <2476100824@example.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
* docs(earthrover): update EarthRover Mini Plus dataset features and descriptions
* refactor(teleop): rename rover action keys to linear_velocity/angular_velocity
* fix(earthrover): align observation and action features with frodobots/berkeley-frodobots-lerobot-7k
* chore: address PR review comments
* ci: retrigger checks
Issue https://github.com/huggingface/lerobot/issues/1707
The action padding mask is set by LeRobotDataset as f"{key}_is_pad".
A wrong key doesn't raise any errors; however, the padding mask is
ignored, resulting in wrong attention around the edges of an episode
when multi-step actions are enabled (i.e. the action horizon is greater
than 1).
Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
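The key convention in miniature; the "action" key and the loss-masking shapes are illustrative assumptions:

```python
import torch

def masked_action_loss(raw_loss: torch.Tensor, batch: dict, key: str = "action") -> torch.Tensor:
    """Illustrative masking; LeRobotDataset stores the pad mask as f"{key}_is_pad".

    A misspelled key raises no error: the lookup must match exactly, otherwise
    frames padded past an episode boundary are treated as real actions.
    """
    is_pad = batch[f"{key}_is_pad"]  # (batch, horizon) bool
    return (raw_loss * (~is_pad).unsqueeze(-1)).mean()
```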
Add a `cudnn_deterministic` flag to `TrainPipelineConfig` (default: False)
that sets `torch.backends.cudnn.deterministic = True` and disables benchmark
mode, eliminating CUDA floating-point non-determinism at the cost of ~10-20%
training speed. When False (default) the existing benchmark=True behaviour
is preserved.
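What the flag toggles, in isolation (a sketch of its effect, not the TrainPipelineConfig code itself):

```python
import torch

def configure_cudnn(cudnn_deterministic: bool = False) -> None:
    if cudnn_deterministic:
        # Deterministic kernels: reproducible CUDA floating point at a
        # ~10-20% training-speed cost.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    else:
        # Default behaviour preserved: benchmark mode picks the fastest kernels.
        torch.backends.cudnn.benchmark = True
```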
* fix(ci): skip HF login (and tests) in forks and community PRs
* chore(test): remove comment about test meant to be only run locally
* fix(tests): no hf login decorator for xvla
* fix(test): no decorator in yield
* Add SLURM SARM progress annotation script.
Provide a standalone two-stage compute/aggregate pipeline for RA-BC progress generation so large datasets can be processed in parallel and optionally uploaded to the Hub.
Made-with: Cursor
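A skeleton of such a two-stage CLI; every flag and function name here is hypothetical, only the compute/aggregate split comes from the commit:

```python
import argparse

def compute_shard(shard_id: int, num_shards: int) -> None:
    """Stage 1 (hypothetical): annotate one shard of episodes with RA-BC progress."""
    ...

def aggregate(push_to_hub: bool) -> None:
    """Stage 2 (hypothetical): merge per-shard outputs, optionally upload to the Hub."""
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--stage", choices=["compute", "aggregate"], required=True)
    parser.add_argument("--shard-id", type=int, default=0)
    parser.add_argument("--num-shards", type=int, default=1)
    parser.add_argument("--push-to-hub", action="store_true")
    args = parser.parse_args()
    if args.stage == "compute":
        compute_shard(args.shard_id, args.num_shards)
    else:
        aggregate(args.push_to_hub)
```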
* fix pr comments
* remove comments
* chore(docstrings): updating v2.1-v3.0 conversion script docstrings to match the new task label
* chore(task): renaming the default index label in the tasks DataFrame to task
* Revert "chore(docstrings): updating v2.1-v3.0 conversion script docstrings to match the new task label"
This reverts commit f55de3255278f23f18b5d955565f6768d094951d.
* chore(docstrings): updating docstrings to match dataset v3.0 architecture
* chore(format): formatting code
* Fixing metadata indexing when writing new Parquet file
Summary:
- addressing this issue: https://github.com/huggingface/lerobot/issues/2401
- vibe-coded bugfix by Claude Sonnet 4.5
* Backing out changes to convert_videos_of_camera
* Addressing Ruff pre-commit complaint
Summary:
- addressing "SIM113 Use `enumerate()` for index variable `ep_idx` in `for` loop"
---------
Co-authored-by: Paul <238953601+pac-robotics@users.noreply.github.com>
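The SIM113 fix pattern in isolation (placeholder names, not the script's real identifiers):

```python
episodes: list = []                       # placeholder data
def write_episode_parquet(ep, idx): ...   # placeholder for the real writer

# Before (flagged by Ruff SIM113): manual index counter beside the loop variable.
ep_idx = 0
for episode in episodes:
    write_episode_parquet(episode, ep_idx)
    ep_idx += 1

# After: enumerate() supplies the index.
for ep_idx, episode in enumerate(episodes):
    write_episode_parquet(episode, ep_idx)
```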
* fix(root): adding proper support for the root and new_root arguments
* feat(roots): adding a roots argument for the merge operation
* chore(clean): cleaning up code
* chore(docstrings): updating docstrings with new features
* fix(repo_id): setting repo_id to None when not needed
* fix(roots/repo_ids): making mypy happy by using repo_ids and roots for merge operation
* fix(path): fixing path related issues
* fix(repo_id): fixing issues related to repo_id
* chore(docstrings): updating docstrings + fix typo
* chore(clean): cleaning code
* fix(split new_repo_id): reverting new_repo_id addition for split operation
* docs(docstrings): completing docstrings
* fix(repo_ids/roots): improving checks for repo_ids/roots lengths
* fix(repo_ids): making repo_ids optional in MergeConfig but raise if not given
* fix(docstrings): fixing docstrings for split operation
* fix(hints): updating get_output_path hints to accept paths as strings too
* fix(y/N prompts): removing y/N prompts in lerobot_edit_dataset
* fix(merge repo_id): fixing merge operation to use new_repo_id instead of repo_id
* fix(typo): fixing typo in docstrings
* fix(frame_index): making rerun's "frame_index" timeline compatible with behaviour1k datasets
* fix(segfault risk): removing segfault risk by calling batch["index"] in the dataloader loop