Each benchmark gets its own Docker image (built with the
lerobot[<benchmark>,smolvla] extras) so incompatible dependency trees can
never collide. A one-episode smoke eval runs per benchmark on GPU runners.
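As a rough illustration, the per-benchmark smoke eval could be driven like the sketch below. The flag names (--policy.path, --env.type, --env.task, --eval.n_episodes) are assumptions about the lerobot-eval interface, not copied from the workflow:

```python
# Hypothetical smoke-eval driver; the CLI flags and the mapping of
# benchmark -> (policy, task) are assumptions for illustration only.
import subprocess

SMOKE_EVALS = {
    "libero": ("pepijn223/smolvla_libero", "libero_spatial"),
    "metaworld": ("pepijn223/smolvla_metaworld", "metaworld-push-v2"),
}

for benchmark, (policy, task) in SMOKE_EVALS.items():
    # One episode per benchmark keeps GPU-runner cost minimal while still
    # exercising env creation, policy loading, and a full rollout.
    subprocess.run(
        [
            "lerobot-eval",
            f"--policy.path={policy}",
            f"--env.type={benchmark}",
            f"--env.task={task}",
            "--eval.n_episodes=1",
        ],
        check=True,
    )
```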
- Libero: policy pepijn223/smolvla_libero on the libero_spatial suite, with camera_name_mapping
- MetaWorld: policy pepijn223/smolvla_metaworld on metaworld-push-v2
- LIBERO config pre-created at build time to bypass the interactive stdin prompt (see the sketch after this list)
- Triggers on changes to envs/**, lerobot_eval.py, the Dockerfiles, and pyproject.toml
- Adds docs/source/evaluation.mdx and restores step 7 in adding_benchmarks.mdx
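The build-time LIBERO config step could look like the following sketch; the ~/.libero/config.yaml location, the key names, and the install prefix are assumptions about LIBERO's first-run prompt rather than the actual Dockerfile contents:

```python
# Hypothetical build-time step: write LIBERO's config before first import so
# `import libero` never blocks on its interactive "specify paths" stdin
# prompt. The config path and key names below are assumptions.
from pathlib import Path

import yaml

config_dir = Path.home() / ".libero"
config_dir.mkdir(parents=True, exist_ok=True)

libero_root = "/opt/libero"  # assumed install prefix inside the image
config = {
    "benchmark_root": f"{libero_root}/libero",
    "bddl_files": f"{libero_root}/libero/bddl_files",
    "init_states": f"{libero_root}/libero/init_files",
    "assets": f"{libero_root}/libero/assets",
    "datasets": f"{libero_root}/datasets",
}

with open(config_dir / "config.yaml", "w") as f:
    yaml.safe_dump(config, f)
```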
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmark CI workflow, Dockerfiles, benchmark docs, evaluation smoke-test
doc, and dispatch tests belong in a separate PR. Scope this PR to the
async env init changes only.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New docs/source/evaluation.mdx covering lerobot-eval usage, batch_size
auto-tuning, AsyncVectorEnv performance, tuning tips, output format,
multi-task evaluation, and programmatic usage (see the sketch after this
list).
- Add evaluation page to _toctree.yml under Benchmarks section.
- Update adding_benchmarks.mdx to reference the batch_size auto default and
link to the evaluation guide.
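For the programmatic-usage section, a minimal sketch might look like the block below; the import paths, the eval_policy signature, and the metaworld-push-v2 env registration are assumptions based on the topics listed above, not contents of the new page:

```python
# Hypothetical programmatic evaluation; import paths, the eval_policy
# signature, and the env id registration are assumptions for illustration.
import gymnasium as gym

from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.scripts.eval import eval_policy

# AsyncVectorEnv steps several env copies in parallel worker processes;
# this is where the page's batch_size auto-tuning and performance notes apply.
env = gym.vector.AsyncVectorEnv(
    # env id assumed to be registered by the lerobot[metaworld] extra
    [lambda: gym.make("metaworld-push-v2") for _ in range(4)]
)

policy = SmolVLAPolicy.from_pretrained("pepijn223/smolvla_metaworld")
info = eval_policy(env, policy, n_episodes=10)
print(info)

env.close()
```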
Made-with: Cursor