Compare commits

...

6 Commits

Author SHA1 Message Date
Pepijn 2ab59a3099 feat(benchmarks): add matrix runner and leaderboard 2026-04-15 21:31:33 +02:00
Pepijn dab511dbb1 Merge branch 'main' into feat/libero-benchmark 2026-04-14 10:43:49 +02:00
Maxime Ellerbach a656a982af fix(feetech): motor position readings overflow (#3373) 2026-04-13 22:39:58 +02:00
Pepijn 187b2167ed feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld) (#3319)
* docs(benchmarks): add benchmark integration guide and standardize benchmark docs

Add a comprehensive guide for adding new benchmarks to LeRobot, and
refactor the existing LIBERO and Meta-World docs to follow the new
standardized template.



* refactor(envs): move dispatch logic from factory into EnvConfig subclasses

Replace hardcoded if/elif chains in factory.py with create_envs() and
get_env_processors() methods on EnvConfig. New benchmarks now only need
to register a config subclass — no factory.py edits required.

Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic.
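The dispatch pattern this commit describes can be sketched roughly as follows. Class names, fields, and return values here are illustrative stand-ins, not the actual LeRobot API; only the `create_envs()` hook follows the commit message:

```python
from dataclasses import dataclass


@dataclass
class EnvConfig:
    """Base config: each benchmark overrides create_envs() itself."""

    def create_envs(self, n_envs: int) -> list:
        raise NotImplementedError


@dataclass
class LiberoEnvConfig(EnvConfig):
    task: str = "libero_spatial"

    def create_envs(self, n_envs: int) -> list:
        # Real code would construct gym environments; strings stand in here.
        return [f"libero/{self.task}/{i}" for i in range(n_envs)]


def make_envs(cfg: EnvConfig, n_envs: int) -> list:
    # The factory only delegates: no if/elif chain on the env type, so
    # registering a new config subclass is enough to add a benchmark.
    return cfg.create_envs(n_envs)
```

The factory shrinks because all per-benchmark branching moves into the subclasses.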



* docs(benchmarks): clean up adding-benchmarks guide for clarity

Rewrite for simpler language, better structure, and easier navigation.
Move quick-reference table to the top, fold eval explanation into
architecture section, condense the doc template to a bulleted outline.



* fix link

* fix task count

* fix: enable SmolVLA eval on LIBERO with custom camera mappings

- Thread camera_name_mapping from LiberoEnv config through to gym envs
- Sync features_map with camera_name_mapping in LiberoEnv.__post_init__
- Fix render() to use first available camera instead of hardcoded "image"
- Handle non-dict final_info in rollout by falling back to info["is_success"]
- Add use_peft legacy field to SmolVLAConfig for checkpoint compat
- Add defaults to GR00TN15Config init=False fields for transformers 5.3



* fix: use direct AutoresetMode import for gymnasium compat



* fix: handle gymnasium < 1.0 without AutoresetMode



* refactor: revert policy changes, keep env-only camera mapping fixes

- Revert GR00T N1.5 default_factory/default changes (transformers compat)
- Revert SmolVLA use_peft legacy field
- Apply ruff formatting fixes
- camera_name_mapping stays entirely in env/eval layer (no policy changes)



* Update docs/source/env_processor.mdx

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* feat(envs): lazy env init + AsyncVectorEnv as default for n_envs > 1

LiberoEnv and MetaworldEnv previously allocated GPU resources (EGL context,
OpenGL framebuffer) in __init__, before AsyncVectorEnv's fork(). Worker
processes inherited stale GPU handles, causing EGL_BAD_CONTEXT crashes on
first render.

Fix: defer OffScreenRenderEnv / MT1 construction to _ensure_env(), called on
first reset() or step() inside the worker subprocess. Each worker creates its
own clean context after fork().

Also fixes lerobot_eval.py:170 (add_envs_task TODO): replace with
env.call("task") which works with both SyncVectorEnv and AsyncVectorEnv.

AsyncVectorEnv is now the default for n_envs > 1; auto-downgraded to
SyncVectorEnv when n_envs=1 (no benefit, less overhead).

Expected speedup: ~15-20x for LIBERO Spatial with batch_size=50.
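A minimal sketch of the deferred-construction pattern described above. The `_ensure_env` name follows the commit message; the GPU-backed renderer is replaced by a plain `object()` stand-in:

```python
class LazyRenderEnv:
    """Defer GPU-heavy construction out of __init__ so it happens inside
    the forked worker process, not in the parent (avoids inheriting a
    stale EGL/OpenGL context across fork())."""

    def __init__(self, task: str):
        self.task = task
        self._env = None  # nothing GPU-related is touched here

    def _ensure_env(self):
        # The first reset()/step() in the worker lands here, creating a
        # clean context after fork(). object() stands in for the real
        # OffScreenRenderEnv construction.
        if self._env is None:
            self._env = object()
        return self._env

    def reset(self):
        return self._ensure_env()

    def step(self, action):
        return self._ensure_env()
```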



* fix: close envs between tasks to prevent worker process accumulation

eval_policy_all never closed environments after each task completed,
causing AsyncVectorEnv worker processes to accumulate (N_tasks × n_envs).
This led to OOM, BrokenPipeError and EOFError on multi-task benchmarks.

Also fixes:
- AsyncVectorEnv compat in envs/utils.py (use get_attr/call instead of .envs)
- Tuple task handling in tokenizer_processor and lerobot_eval
- _LazyAsyncVectorEnv for deferred worker spawning in LIBERO



* fix(eval): use task_description instead of task for language conditioning

env.call("task") returns the LIBERO task name with underscores
(e.g. "pick_up_the_black_bowl_...") instead of the natural language
description ("pick up the black bowl ..."). The VLM tokenizes these
completely differently, causing 0.0 reward across all episodes.
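A toy illustration of the mismatch (the strings are shortened examples, not actual LIBERO task entries): the underscored identifier and the natural-language sentence are very different token streams to the VLM.

```python
task_name = "pick_up_the_black_bowl"         # what env.call("task") returned
task_description = "pick up the black bowl"  # what the policy was trained on

# The identifier is one underscore-glued blob; the description is ordinary
# words. A text tokenizer splits them completely differently, which is why
# conditioning on the wrong one drove rewards to 0.0.
assert task_name != task_description
assert task_name.replace("_", " ") == task_description
```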



* docs: update adding_benchmarks for async env changes

- Replace add_envs_task reference with env.call("task_description")
- Update use_async_envs default to True
- Add note about lazy GPU init for AsyncVectorEnv compatibility



* feat(eval): batch_size=auto + faster env loading

- batch_size=0 (default) auto-tunes based on CPU cores, capped by
  n_episodes and 64. Removes the need for users to guess the right
  value. The old batch_size > n_episodes error is replaced by silently
  clamping to n_episodes.
- _LazyAsyncVectorEnv accepts pre-computed spaces so only one temp env
  is created per suite (not per task). For libero_spatial (10 tasks)
  this avoids 9 redundant LiberoEnv instantiations during env setup.
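The auto-tuning rule described above could be sketched like this. The helper name and exact defaults are assumptions; the clamping behavior follows the commit message:

```python
import os


def resolve_batch_size(requested: int, n_episodes: int, cap: int = 64) -> int:
    """batch_size=0 means auto: start from the CPU core count, then cap by
    n_episodes and a hard ceiling. An explicit value larger than n_episodes
    is silently clamped instead of raising an error."""
    if requested <= 0:
        requested = os.cpu_count() or 1
    return max(1, min(requested, n_episodes, cap))
```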



* docs: add evaluation guide and update benchmarks doc

- New docs/source/evaluation.mdx covering lerobot-eval usage, batch_size
  auto-tuning, AsyncVectorEnv performance, tuning tips, output format,
  multi-task evaluation, and programmatic usage.
- Add evaluation page to _toctree.yml under Benchmarks section.
- Update adding_benchmarks.mdx to reference batch_size auto default and
  link to the evaluation guide.



* docs(evaluation): remove benchmark table, rename section header



* perf(eval): shared memory, observation passthrough, task prefetch

- AsyncVectorEnv now uses shared_memory=True for zero-copy observation transfer
- LiberoEnvConfig.gym_kwargs passes observation_height/width to the env
- eval_policy_all prefetches next task's workers while current task runs



* style: ruff format



* chore: revert env_processor.mdx changes (not part of this PR)



* ci(benchmarks): add isolated integration tests for libero and metaworld

Each benchmark gets its own Docker image (lerobot[libero] / lerobot[metaworld]
only) so incompatible dep trees cannot collide. A 1-episode smoke eval runs
per benchmark on GPU runners.



* ci(benchmarks): pin action hashes and use uv sync --locked



* ci(benchmarks): trigger only on envs/ or lerobot_eval.py changes



* fix(ci): set LIBERO_DATA_FOLDER to bypass interactive stdin prompt

libero/__init__.py calls input() to ask about a custom dataset path,
which raises EOFError when stdin is closed inside Docker. Setting
LIBERO_DATA_FOLDER skips the prompt entirely.



* docs(benchmarks): add CI smoke test step to adding_benchmarks guide



* fix(ci): pre-create libero config in Dockerfile to bypass stdin prompt

libero/__init__.py calls input() when ~/.libero/config.yaml is missing.
We write the config at image build time (without importing libero) so
the prompt never fires at runtime. Also trigger CI on pyproject.toml changes.



* fix(ci): use shell to create libero config instead of multiline python -c

The multiline RUN python -c "..." was being parsed as Dockerfile
instructions. Use printf to write ~/.libero/config.yaml directly.



* fix(ci): point libero config to bundled package init_files

The config was pointing to /tmp/libero_init which doesn't exist.
Use importlib.util.find_spec to locate the hf-libero package directory
and write paths to the actual bundled bddl_files/init_files/assets.



* fix(ci): add smolvla extra to benchmark Dockerfiles

num2words (required by SmolVLM processor) is declared in lerobot[smolvla],
not lerobot[libero/metaworld]. Install both extras together.



* fix(eval): render_frame covers _LazyAsyncVectorEnv

isinstance(env, AsyncVectorEnv) silently skipped _LazyAsyncVectorEnv,
causing video rendering to produce no frames on the default async path.
Switch to hasattr(env, "call") so any async-compatible env (including
_LazyAsyncVectorEnv) hits the call("render") branch.
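The duck-typed check can be sketched as follows (simplified; the real code goes through Gymnasium's vector-env `call` API, and `FakeAsync` here is a test stand-in):

```python
def render_frames(env):
    # Any async-compatible env (AsyncVectorEnv, _LazyAsyncVectorEnv, ...)
    # exposes call(); dispatch on that capability instead of isinstance,
    # so wrapper classes are not silently skipped.
    if hasattr(env, "call"):
        return env.call("render")
    return [sub.render() for sub in env.envs]


class FakeAsync:
    """Stand-in for any env exposing the vector-env call() interface."""

    def call(self, name):
        return [f"{name}-frame"]
```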



* refactor(envs): remove unused _get_sub_env_attr helper

_get_sub_env_attr was defined but never called anywhere in the codebase.
_sub_env_has_attr (its sibling) is kept — it is actively used in utils.py.



* chore: apply prettier formatting to docs



* docs(env_processor): remove deprecated add_envs_task from pipeline example

add_envs_task is replaced by env.call("task_description") in this PR.
Remove it from the pipeline walkthrough and renumber the steps (8→7).



* refactor(envs): remove __del__ from _LazyAsyncVectorEnv

__del__ is unreliable as a cleanup mechanism. close() is already called
explicitly in the eval loop's finally block, so the finalizer is redundant.



* fix(eval): prefetch next task's workers after close to avoid GPU memory overlap

Previously, next task's AsyncVectorEnv workers were spawned while the
current task was still running, causing both tasks' GPU contexts to coexist.
Moving the prefetch start into the finally block (after env.close()) ensures
workers for task N+1 only spin up once task N has released GPU memory.
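The ordering fix amounts to moving the prefetch into the `finally` block; sketched with stand-in functions (all names hypothetical, the `events` list just records ordering):

```python
class DummyEnv:
    def close(self):
        pass  # stands in for releasing the task's GPU contexts


def eval_task(env, events):
    try:
        events.append("rollout")       # task N runs alone on the GPU
    finally:
        env.close()                    # release task N's GPU memory first...
        events.append("closed")
        events.append("prefetch")      # ...only then spawn task N+1 workers
```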



* refactor(envs): move _LazyAsyncVectorEnv to utils and apply to metaworld

_LazyAsyncVectorEnv lived in libero.py but metaworld had the same OOM
problem: all tasks' AsyncVectorEnv workers were spawned eagerly, wasting
GPU memory for tasks not yet running.

Move the class to envs/utils.py so both environments share it, then apply
the same is_async + lazy wrapping pattern in create_metaworld_envs.



* chore: remove out-of-scope benchmark/CI/docs files from PR

Benchmark CI workflow, Dockerfiles, benchmark docs, evaluation smoke-test
doc, and dispatch tests belong in a separate PR. Scope this PR to the
async env init changes only.



* chore: restore adding_benchmarks + test_dispatch, drop env_processor changes

- Restore docs/source/adding_benchmarks.mdx (belongs in this PR)
- Restore tests/envs/test_dispatch.py (belongs in this PR)
- Revert docs/source/env_processor.mdx to main (out of scope for this PR)



* docs(adding_benchmarks): remove CI smoke test step (coming in separate PR)

Step 7 (Dockerfile + benchmark_tests.yml CI job) and its table rows are
out of scope for this PR. The CI infrastructure will be added on top in a
follow-up PR.



* refactor(envs): remove unused add_envs_task

Replaced by env.call("task_description") in lerobot_eval.py. No callers
remain in the codebase.



* style: fix prettier formatting in env_processor.mdx



* fix(ci): use root container chmod to fix PermissionError on artifact dirs

Running chmod on the host doesn't propagate into Docker due to UID/SELinux
mismatch. Instead, spin up the image as root to mkdir+chmod from inside
the container before the eval run mounts the same path.



* fix(ci): re-chmod artifacts after eval to fix unreadable files

Files created by user_lerobot inside the eval container inherit a
restrictive umask, making them unreadable by the runner after the
container exits. Add a post-eval 'docker run --user root' chmod step
so upload-artifact can find the video files.



* feat(ci): add monthly schedule trigger for benchmark tests

Runs on the 1st of every month at 02:00 UTC in addition to the
existing push/PR and manual dispatch triggers.



* fix(ci): change benchmark schedule from monthly to weekly (every Monday)



* fix(ci): use docker cp instead of bind mounts for artifacts

Bind mounts on these runners don't surface container-written files on
the host path (likely DinD/socket-mount setup). Switch to named
containers + docker cp, which copies directly through the daemon and
lands files in the runner's accessible filesystem.



* fix(ci): write eval output to /tmp inside container

user_lerobot cannot create /artifacts at the container root.
Use /tmp/eval-artifacts (always writable) then docker cp it out.



* feat(ci): add parse_eval_metrics step to benchmark workflow

Adds scripts/ci/parse_eval_metrics.py and wires it into both Libero and
MetaWorld jobs so the dashboard can read pc_success, avg_sum_reward and
eval_s from the metrics artifact instead of relying on GitHub step timing.



* feat(ci): add Libero train+eval smoke test (1 step, eval_freq=1)

Runs accelerate launch --num_processes=1 lerobot-train with:
- steps=1, batch_size=1, dataset.episodes=[0] (episode 0 only)
- eval_freq=1 so the training loop triggers eval after step 1
- eval.n_episodes=1, eval.use_async_envs=false

Tests the full train→eval-within-training pipeline in the existing
libero-benchmark-libero:ci image (no extra Docker build cost).
Uploads eval video from /tmp/train-smoke/eval/ as libero-train-smoke-video.



* feat(ci): extract task descriptions and embed in metrics artifact

- Add scripts/ci/extract_task_descriptions.py: runs inside the benchmark
  Docker container (LIBERO/MetaWorld installed) after lerobot-eval and
  writes task_descriptions.json mapping task keys to NL instructions.
  LIBERO: uses libero.libero.benchmark to get suite.get_task(i).language.
  MetaWorld: formats task name as human-readable label.
- Call extraction at the end of each eval bash-c (|| true so never fatal).
- parse_eval_metrics.py reads task_descriptions.json and includes it in
  metrics.json so the health dashboard Space can label videos by task.



* fix(ci): call extract_task_descriptions.py after eval in benchmark jobs

The task descriptions were never populated in metrics.json because
extract_task_descriptions.py was never invoked. The script exists and
parse_eval_metrics.py already looks for its output — the call was
simply missing from the workflow.

Appends the extraction step to the existing bash -c block (runs inside
the container where libero/metaworld is installed) so task_descriptions.json
is written to the eval-artifacts dir before docker cp copies it out.



* fix(test): use SyncVectorEnv in test_base_create_envs

AsyncVectorEnv spawns new subprocesses that do not inherit the
in-process gym registration created by the test. Pass
use_async_envs=False since this test validates dispatch logic,
not async parallelism.



* perf(ci): split Dockerfile dep-install from source-copy for faster rebuilds

The dep-install layer (uv sync) now only depends on pyproject.toml,
uv.lock, and a minimal package stub — not the full src/ tree. Source
code changes only rebuild the final COPY layer (seconds, not minutes).

Also switch from type=local cache (lost on ephemeral runners) to
type=gha (persisted in GitHub Actions cache, shared across all runs).

Before: every src/ change → full uv sync rebuild (~8-10 min)
After:  src/-only change → cached dep layer, ~30s source copy



* fix(ci): add Docker Hub login to avoid pull rate limits

Anonymous pulls from Docker Hub are rate-limited to 100/6h, which
fails when multiple benchmark jobs pull nvidia/cuda in parallel.
Add docker/login-action step (conditional on DOCKERHUB_USERNAME var)
to authenticate and get 200 pulls/6h.

Setup: add DOCKERHUB_USERNAME as a repository variable and
DOCKERHUB_TOKEN as a repository secret in GitHub Settings.



* fix(ci): use existing DOCKERHUB_LEROBOT_USERNAME/PASSWORD secrets



* fix(ci): use env context for secrets check in step if-condition

Step-level 'if' cannot reference 'secrets' directly. Expose the
secret via an env var and check that instead.



* fix(ci): simplify Docker Hub login to match existing workflows

Drop the conditional guard — other workflows (docker_publish,
full_tests) call docker/login-action unconditionally.



* fix(ci): switch Docker cache from type=gha to type=registry

GHA cache is capped at 10GB per repo — a single CUDA + PyTorch +
benchmark image is ~8GB so the cache evicts before it's reused.

Switch to type=registry which pushes cache layers to Docker Hub
(huggingface/lerobot-benchmark-cache:{libero,metaworld}). No size
limit, layers persist until explicitly deleted, and shared across
all runners and branches.



* fix(ci): use GHCR for Docker layer cache (Docker Hub push denied)

Docker Hub CI token can't push to new repos. GHCR works out of the
box — GITHUB_TOKEN has automatic packages:write for the repo owner.

- Add GHCR login step (github.actor + GITHUB_TOKEN)
- Switch cache refs to ghcr.io/huggingface/lerobot/cache-benchmark
- Add packages:write at job level (not workflow, per zizmor)
- Keep Docker Hub login for pulling nvidia/cuda base image



* fix(ci): remove GHCR cache (org blocks GITHUB_TOKEN package writes)

The huggingface org restricts GHCR package creation via GITHUB_TOKEN,
causing 403 on cache export. Remove all registry caching and GHCR
login. The Dockerfile layer split (deps vs source) still helps when
the runner has a warm Docker daemon.

Also fix the metaworld job which had a stale conditional Docker Hub
login and was missing the GHCR login entirely.



* fix(ci): address PR review feedback for benchmark smoke tests

Security:
- Remove "Login to Hugging Face" step — it was a no-op (ephemeral
  --rm container) that exposed the HF token via CLI argument in
  docker inspect / /proc/*/cmdline. The eval step already
  re-authenticates via env var.

Functional:
- Remove feat/benchmark-ci from push trigger branches (won't exist
  post-merge).

Dockerfiles:
- Pin uv to 0.8.0 (was unpinned, fetching whatever latest ships).
- Add comment explaining the chmod +x ptxas workaround (Triton
  packaging bug — ships ptxas without execute bit).

Scripts:
- parse_eval_metrics.py: add note that it runs on bare host and must
  stay stdlib-only.
- parse_eval_metrics.py: add NaN guard for avg_sum_reward and eval_s
  (was only guarding pc_success).



* ci(benchmarks): trigger on PRs targeting feat/benchmark-ci

Benchmark PRs (robomme, libero-plus, robocerebra, robotwin) target
feat/benchmark-ci, not main. Without this, the workflow never runs
on those PRs.



* fix(docker): use uv pip install instead of uv sync (cross-extra conflict)

uv sync --locked validates the entire lockfile across all extras.
Since robomme depends on mani-skill which pins numpy<2.0, and the
base project requires numpy>=2.0, the full lockfile is unsatisfiable.

Switch to uv pip install -e ".[libero,smolvla]" which only resolves
the requested extras for the current Python version and platform,
avoiding the cross-extra numpy conflict entirely.



* chore: revert configs.py, factory.py, test_dispatch.py to main

These use_async_envs default changes belong to the async-vector-env
PR (#3274), not this CI PR. Restore to match origin/main.



* fix: address PR review feedback — broken link, NaN guard, zizmor tags, fork skip

- Remove broken Triton issue link from Dockerfile.benchmark.libero
- Add module-level _safe_int helper to guard n_episodes against NaN
- Move _safe_float to module level alongside _safe_int
- Add # zizmor: ignore[unpinned-uses] to all upload-artifact@v4 steps
- Add if: env.HF_USER_TOKEN != '' to Libero smoke eval for fork PRs



* fix(ci): add fork PR guard to train-smoke and MetaWorld eval steps

Add if: env.HF_USER_TOKEN != '' to the Libero train+eval smoke and
MetaWorld smoke eval steps so fork PRs without the secret skip gracefully.



* fix(ci): remove feat/benchmark-ci from PR trigger branches



* refactor(docker): rebase benchmark images on nightly lerobot-gpu

Use huggingface/lerobot-gpu:latest as base for both libero and metaworld
benchmark Dockerfiles instead of building from nvidia/cuda scratch. The
nightly image already has all extras installed via uv sync --extra all,
so we only need to overlay the PR source code (and libero asset setup).

This eliminates duplicated system dep installation, Python setup, uv
venv creation, and the Triton ptxas workaround from both files.

---------

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-04-13 21:24:01 +02:00
Jash Shah 9bd844a3b9 fix(rl): ensure queue and process cleanup on abnormal exit (#3063)
Wrap the main execution in actor_cli and start_learner_threads with
try/finally so that queues are closed and processes are joined even
when an unhandled exception occurs. Previously, exceptions in
act_with_policy or add_actor_information_and_train would skip all
cleanup code, leaking GPU/CPU resources.

Also sets the shutdown_event on exception so child processes exit
gracefully.

Fixes #3059

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-04-13 16:25:42 +02:00
Pepijn fd00e38851 feat(benchmarks): add LIBERO training benchmark pipeline
Single-script benchmark that trains and evaluates all 9 LeRobot policies
on LIBERO. Each SLURM job self-publishes its result row to a HuggingFace
leaderboard dataset — no separate collection step needed.

Policies: pi0, pi0_fast, pi05, groot, act, diffusion, smolvla, xvla,
multi_task_dit. 5000 steps, BS 256, with per-policy GPU allocation and
default LR/scheduler presets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 17:01:49 +02:00
30 changed files with 3497 additions and 93 deletions
+490
@@ -0,0 +1,490 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Integration tests: build an isolated Docker image per benchmark and run a
# 1-episode smoke eval. Each benchmark gets its own image so incompatible
# dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide.
#
# To add a new benchmark:
# 1. Add docker/Dockerfile.benchmark.<name> (install only lerobot[<name>])
# 2. Copy one of the jobs below and adjust the image name and eval command.
name: Benchmark Integration Tests

on:
  # Run manually from the Actions tab
  workflow_dispatch:
  # Run every Monday at 02:00 UTC.
  schedule:
    - cron: "0 2 * * 1"
  push:
    branches:
      - main
    paths:
      - "src/lerobot/envs/**"
      - "src/lerobot/scripts/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - ".github/workflows/benchmark_tests.yml"
      - "pyproject.toml"
  pull_request:
    branches:
      - main
    paths:
      - "src/lerobot/envs/**"
      - "src/lerobot/scripts/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - ".github/workflows/benchmark_tests.yml"
      - "pyproject.toml"

permissions:
  contents: read

env:
  UV_VERSION: "0.8.0"
  PYTHON_VERSION: "3.12"

# Cancel in-flight runs for the same branch/PR.
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true
jobs:
  # ── LIBERO ────────────────────────────────────────────────────────────────
  # Isolated image: lerobot[libero] only (hf-libero, dm-control, mujoco chain)
  libero-integration-test:
    name: Libero — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      - name: Login to Docker Hub
        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}

      # Build the benchmark-specific image. The Dockerfile separates dep-install
      # from source-copy, so code-only changes skip the slow uv-sync layer
      # when the runner has a warm Docker daemon cache.
      - name: Build Libero benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.libero
          push: false
          load: true
          tags: lerobot-benchmark-libero:ci

      - name: Run Libero smoke eval (1 episode)
        if: env.HF_USER_TOKEN != ''
        run: |
          # Named container (no --rm) so we can docker cp artifacts out.
          # Output to /tmp inside the container — /artifacts doesn't exist
          # and user_lerobot cannot create root-level dirs.
          docker run --name libero-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=pepijn223/smolvla_libero \
                --env.type=libero \
                --env.task=libero_spatial \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env libero --task libero_spatial \
                --output /tmp/eval-artifacts/task_descriptions.json
            "

      - name: Copy Libero artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-artifacts
          docker cp libero-eval:/tmp/eval-artifacts/. /tmp/libero-artifacts/ 2>/dev/null || true
          docker rm -f libero-eval || true

      - name: Parse Libero eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/libero-artifacts \
            --env libero \
            --task libero_spatial \
            --policy pepijn223/smolvla_libero

      - name: Upload Libero rollout video
        if: always()
        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
        with:
          name: libero-rollout-video
          path: /tmp/libero-artifacts/videos/
          if-no-files-found: warn

      - name: Upload Libero eval metrics
        if: always()
        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
        with:
          name: libero-metrics
          path: /tmp/libero-artifacts/metrics.json
          if-no-files-found: warn

      # ── LIBERO TRAIN+EVAL SMOKE ──────────────────────────────────────────
      # Trains SmolVLA for 1 step (batch_size=1, dataset episode 0 only), then
      # immediately runs eval inside the training loop (eval_freq=1, 1 episode).
      # Tests the full train→eval-within-training pipeline end-to-end.
      - name: Run Libero train+eval smoke (1 step, eval_freq=1)
        if: env.HF_USER_TOKEN != ''
        run: |
          docker run --name libero-train-smoke --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              accelerate launch --num_processes=1 \$(which lerobot-train) \
                --policy.path=lerobot/smolvla_base \
                --policy.load_vlm_weights=true \
                --policy.scheduler_decay_steps=25000 \
                --policy.freeze_vision_encoder=false \
                --policy.train_expert_only=false \
                --dataset.repo_id=lerobot/libero \
                --dataset.episodes=[0] \
                --dataset.use_imagenet_stats=false \
                --env.type=libero \
                --env.task=libero_spatial \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/train-smoke \
                --steps=1 \
                --batch_size=1 \
                --eval_freq=1 \
                --eval.n_episodes=1 \
                --eval.batch_size=1 \
                --eval.use_async_envs=false \
                --save_freq=1 \
                --policy.push_to_hub=false \
                '--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.image2\": \"observation.images.camera2\"}'
            "

      - name: Copy Libero train-smoke artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-train-smoke-artifacts
          docker cp libero-train-smoke:/tmp/train-smoke/. /tmp/libero-train-smoke-artifacts/ 2>/dev/null || true
          docker rm -f libero-train-smoke || true

      - name: Upload Libero train-smoke eval video
        if: always()
        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
        with:
          name: libero-train-smoke-video
          path: /tmp/libero-train-smoke-artifacts/eval/
          if-no-files-found: warn
  # ── METAWORLD ─────────────────────────────────────────────────────────────
  # Isolated image: lerobot[metaworld] only (metaworld==3.0.0, mujoco>=3 chain)
  metaworld-integration-test:
    name: MetaWorld — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      - name: Login to Docker Hub
        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}

      - name: Build MetaWorld benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.metaworld
          push: false
          load: true
          tags: lerobot-benchmark-metaworld:ci

      - name: Run MetaWorld smoke eval (1 episode)
        if: env.HF_USER_TOKEN != ''
        run: |
          docker run --name metaworld-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-metaworld:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=pepijn223/smolvla_metaworld \
                --env.type=metaworld \
                --env.task=metaworld-push-v3 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--rename_map={\"observation.image\": \"observation.images.camera1\"}' \
                --policy.empty_cameras=2 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env metaworld --task metaworld-push-v3 \
                --output /tmp/eval-artifacts/task_descriptions.json
            "

      - name: Copy MetaWorld artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/metaworld-artifacts
          docker cp metaworld-eval:/tmp/eval-artifacts/. /tmp/metaworld-artifacts/ 2>/dev/null || true
          docker rm -f metaworld-eval || true

      - name: Parse MetaWorld eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/metaworld-artifacts \
            --env metaworld \
            --task metaworld-push-v3 \
            --policy pepijn223/smolvla_metaworld

      - name: Upload MetaWorld rollout video
        if: always()
        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
        with:
          name: metaworld-rollout-video
          path: /tmp/metaworld-artifacts/videos/
          if-no-files-found: warn

      - name: Upload MetaWorld eval metrics
        if: always()
        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
        with:
          name: metaworld-metrics
          path: /tmp/metaworld-artifacts/metrics.json
          if-no-files-found: warn
  # ── LIBERO-plus ───────────────────────────────────────────────────────────
  libero-plus-integration-test:
    name: LIBERO-plus — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      - name: Build LIBERO-plus benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.libero_plus
          push: false
          load: true
          tags: lerobot-benchmark-libero-plus:ci
          cache-from: type=local,src=/tmp/.buildx-cache-libero-plus
          cache-to: type=local,dest=/tmp/.buildx-cache-libero-plus,mode=max

      - name: Run LIBERO-plus smoke eval (1 episode)
        if: env.HF_USER_TOKEN != ''
        run: |
          docker run --name libero-plus-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero-plus:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=lerobot/smolvla_libero_plus \
                --env.type=libero_plus \
                --env.task=libero_spatial \
                '--env.task_ids=[0,100,260,500,1000,1500,2000,2400]' \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env libero_plus --task libero_spatial \
                --output /tmp/eval-artifacts/task_descriptions.json
            "

      - name: Copy LIBERO-plus artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-plus-artifacts
          docker cp libero-plus-eval:/tmp/eval-artifacts/. /tmp/libero-plus-artifacts/ 2>/dev/null || true
          docker rm -f libero-plus-eval || true

      - name: Parse LIBERO-plus eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/libero-plus-artifacts \
            --env libero_plus \
            --task libero_spatial \
            --policy lerobot/smolvla_libero_plus

      - name: Upload LIBERO-plus rollout video
        if: always()
        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
        with:
          name: libero-plus-rollout-video
          path: /tmp/libero-plus-artifacts/videos/
          if-no-files-found: warn

      - name: Upload LIBERO-plus eval metrics
        if: always()
        uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
        with:
          name: libero-plus-metrics
          path: /tmp/libero-plus-artifacts/metrics.json
          if-no-files-found: warn
# ── ROBOMME ───────────────────────────────────────────────────────────────
robomme-integration-test:
name: RoboMME — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Build RoboMME benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.robomme
push: false
load: true
tags: lerobot-benchmark-robomme:ci
- name: Run RoboMME smoke eval (1 episode)
if: env.HF_USER_TOKEN != ''
run: |
docker run --name robomme-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-robomme:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=lerobot/smolvla_robomme \
--env.type=robomme \
--env.task=PickXtimes,BinFill,StopCube,MoveCube,InsertPeg \
--env.dataset_split=test \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.wrist_image\": \"observation.images.camera2\"}' \
--policy.empty_cameras=3 \
--output_dir=/tmp/eval-artifacts
python scripts/ci/extract_task_descriptions.py \
--env robomme --task PickXtimes,BinFill,StopCube,MoveCube,InsertPeg \
--output /tmp/eval-artifacts/task_descriptions.json
"
- name: Copy RoboMME artifacts from container
if: always()
run: |
mkdir -p /tmp/robomme-artifacts
docker cp robomme-eval:/tmp/eval-artifacts/. /tmp/robomme-artifacts/ 2>/dev/null || true
docker rm -f robomme-eval || true
- name: Parse RoboMME eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/robomme-artifacts \
--env robomme \
--task PickXtimes \
--policy lerobot/smolvla_robomme
- name: Upload RoboMME rollout video
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robomme-rollout-video
path: /tmp/robomme-artifacts/videos/
if-no-files-found: warn
- name: Upload RoboMME eval metrics
if: always()
uses: actions/upload-artifact@v4 # zizmor: ignore[unpinned-uses]
with:
name: robomme-metrics
path: /tmp/robomme-artifacts/metrics.json
if-no-files-found: warn
@@ -0,0 +1 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
@@ -0,0 +1,60 @@
# LeRobot LIBERO Training Benchmark
Train and evaluate all LeRobot policies on [LIBERO](https://libero-project.github.io/) and publish results as a HuggingFace leaderboard dataset.
## Policies
| Policy | Base Model | GPUs | LR | Chunk | Notes |
| -------------- | -------------------- | ---- | ------ | ----- | ------------------------------------- |
| pi0 | lerobot/pi0_base | 8 | 2.5e-5 | 30 | PaliGemma + Gemma flow matching |
| pi0_fast | lerobot/pi0fast-base | 8 | 2.5e-5 | 30 | Requires tokenizer pre-training |
| pi05 | lerobot/pi05_base | 8 | 2.5e-5 | 30 | Quantiles normalization |
| groot | nvidia/GR00T-N1.5-3B | 8 | 1e-4 | 30 | bf16, diffusion head + projector only |
| act | From scratch | 1 | 1e-5 | 30 | ResNet-18, lightweight |
| diffusion | From scratch | 1 | 1e-4 | 32\* | U-Net, horizon must be divisible by 8 |
| smolvla | lerobot/smolvla_base | 8 | 1e-4 | 30 | SmolVLM2-500M |
| xvla | lerobot/xvla-widowx | 4 | 1e-4 | 32\* | Florence2 + CLIP |
| multi_task_dit | From scratch | 1 | 2e-5 | 32\* | CLIP + DiT |
\* These policies use `horizon` rather than `chunk_size`. Set to 32 (nearest valid value to 30).
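The rounding in the footnote can be made concrete. A tiny sketch, assuming the divisible-by-8 constraint noted for the diffusion U-Net in the table; the helper name is ours, not part of LeRobot:

```python
def nearest_valid_horizon(chunk: int, divisor: int = 8) -> int:
    """Round an action chunk length to the nearest multiple of `divisor`.

    The divisor 8 reflects the diffusion U-Net constraint from the table;
    this helper is illustrative only.
    """
    return round(chunk / divisor) * divisor

print(nearest_valid_horizon(30))  # -> 32
```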
## Training spec
- **Steps**: 5,000 per policy
- **Batch size**: 32 per GPU (effective batch size = 32 × number of GPUs, e.g. 256 for the 8-GPU policies)
- **Dataset**: `lerobot/libero` (libero_spatial)
- **Evaluation**: 20 episodes after training
- **LR**: each policy's default optimizer/scheduler preset
- **Results**: each SLURM job publishes its own row to the HF leaderboard dataset automatically
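For concreteness, the arithmetic behind the spec (GPU counts are the per-policy values from the table above; the helper names are ours):

```python
# Effective batch size and total samples seen for the training spec above.
MICROBATCH_PER_GPU = 32
STEPS = 5_000

def effective_batch_size(num_gpus: int) -> int:
    # Data-parallel training multiplies the per-GPU microbatch by the GPU count.
    return MICROBATCH_PER_GPU * num_gpus

def total_samples_seen(num_gpus: int) -> int:
    # Every optimizer step consumes one effective batch.
    return STEPS * effective_batch_size(num_gpus)

print(effective_batch_size(8), total_samples_seen(8))  # 256 1280000
print(effective_batch_size(1), total_samples_seen(1))  # 32 160000
```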
## Quick start
### 1. Generate SLURM scripts
```bash
python benchmarks/libero/run_benchmark.py \
--output_dir /scratch/lerobot-benchmark \
--hub_org lerobot
```
### 2. Submit jobs
```bash
# If using pi0_fast, submit tokenizer first:
sbatch /scratch/lerobot-benchmark/slurm_scripts/00_tokenizer.sh
# Wait, then submit pi0_fast
# All other policies can run in parallel:
for script in /scratch/lerobot-benchmark/slurm_scripts/[0-9][0-9]_*.sh; do
[[ "$script" == *pi0_fast* ]] && continue
sbatch "$script"
done
```
Each job publishes its result to `lerobot/benchmark-libero` on the Hub when it finishes.
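Once a few rows land in the dataset, a leaderboard view can be assembled client-side. A minimal offline sketch (the field names follow the row written by each job; the example rows are made up for illustration):

```python
# Rank published benchmark rows by evaluation success rate.
# "policy_type" / "eval_success_rate" are fields each job writes to its row;
# rows whose eval failed (success rate is None) are excluded from the ranking.
def leaderboard(rows: list[dict]) -> list[dict]:
    scored = [r for r in rows if r.get("eval_success_rate") is not None]
    return sorted(scored, key=lambda r: r["eval_success_rate"], reverse=True)

rows = [
    {"policy_type": "act", "eval_success_rate": 0.62},
    {"policy_type": "smolvla", "eval_success_rate": 0.78},
    {"policy_type": "diffusion", "eval_success_rate": None},  # eval failed
]
for rank, entry in enumerate(leaderboard(rows), start=1):
    print(f"{rank}. {entry['policy_type']}: {entry['eval_success_rate']:.0%}")
```

In practice the rows would come from `load_dataset("lerobot/benchmark-libero", split="train").to_list()` rather than being hand-written.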
## Prerequisites
- SLURM cluster with CUDA GPUs (A100 80GB recommended for VLM policies)
- `pip install "lerobot[pi,smolvla,groot,xvla,multi_task_dit,libero]" datasets` (quote the extras so the brackets survive your shell)
- `hf auth login` (or the legacy `huggingface-cli login`)
@@ -0,0 +1,606 @@
#!/usr/bin/env python
"""Generate SLURM sbatch scripts for training all LeRobot policies on LIBERO.
Each generated script trains one policy, evaluates it, and publishes its
results row to a HuggingFace leaderboard dataset — no separate collection
step needed.
Usage:
# Generate scripts for all policies:
python benchmarks/libero/run_benchmark.py \\
--output_dir /scratch/lerobot-benchmark --hub_org lerobot
# Generate for a subset:
python benchmarks/libero/run_benchmark.py \\
--policies pi0 smolvla act \\
--output_dir /scratch/lerobot-benchmark --hub_org lerobot
"""
from __future__ import annotations
import argparse
import json
import subprocess
import textwrap
import uuid
from dataclasses import dataclass, field
from datetime import UTC, datetime
from pathlib import Path
# ──────────────────────────────────────────────────────────────────────
# Policy benchmark configs
# ──────────────────────────────────────────────────────────────────────
@dataclass
class PolicyBenchmarkConfig:
"""Training configuration for a single policy on a benchmark."""
policy_type: str
policy_path: str | None = None
num_gpus: int = 1
chunk_size: int | None = None # Set on policies that use chunk_size (not horizon)
extra_policy_args: dict[str, str] = field(default_factory=dict)
needs_tokenizer: bool = False
tokenizer_args: dict[str, str] = field(default_factory=dict)
COMMON_TRAINING_ARGS: dict[str, str] = {
"dataset.repo_id": "lerobot/libero",
"dataset.use_imagenet_stats": "false",
"env.type": "libero",
"env.task": "libero_spatial",
"steps": "5000",
"batch_size": "32",
"eval_freq": "0",
"save_freq": "5000",
"save_checkpoint": "true",
"log_freq": "100",
"wandb.enable": "true",
"policy.push_to_hub": "true",
"rename_map": (
'{"observation.images.image":"observation.images.camera1",'
'"observation.images.image2":"observation.images.camera2"}'
),
}
EVAL_ARGS: dict[str, str] = {
"env.type": "libero",
"env.task": "libero_spatial",
"eval.n_episodes": "20",
"eval.batch_size": "10",
}
POLICY_CONFIGS: dict[str, PolicyBenchmarkConfig] = {
"pi0": PolicyBenchmarkConfig(
policy_type="pi0",
policy_path="lerobot/pi0_base",
num_gpus=8,
chunk_size=30,
extra_policy_args={
"policy.n_action_steps": "30",
"policy.scheduler_decay_steps": "5000",
},
),
"pi0_fast": PolicyBenchmarkConfig(
policy_type="pi0_fast",
policy_path="lerobot/pi0fast-base",
num_gpus=8,
chunk_size=30,
extra_policy_args={
"policy.n_action_steps": "30",
"policy.scheduler_decay_steps": "5000",
},
needs_tokenizer=True,
tokenizer_args={
"repo_id": "lerobot/libero",
"action_horizon": "30",
"encoded_dims": "0:7",
"normalization_mode": "QUANTILES",
"vocab_size": "1024",
"scale": "10.0",
"push_to_hub": "true",
},
),
"pi05": PolicyBenchmarkConfig(
policy_type="pi05",
policy_path="lerobot/pi05_base",
num_gpus=8,
chunk_size=30,
extra_policy_args={
"policy.n_action_steps": "30",
"policy.scheduler_decay_steps": "5000",
},
),
"groot": PolicyBenchmarkConfig(
policy_type="groot",
policy_path=None,
num_gpus=8,
chunk_size=30,
extra_policy_args={
"policy.n_action_steps": "30",
"policy.base_model_path": "nvidia/GR00T-N1.5-3B",
"policy.tune_diffusion_model": "true",
"policy.tune_projector": "true",
"policy.tune_llm": "false",
"policy.tune_visual": "false",
"policy.use_bf16": "true",
},
),
"act": PolicyBenchmarkConfig(
policy_type="act",
policy_path=None,
num_gpus=1,
chunk_size=30,
extra_policy_args={"policy.n_action_steps": "30"},
),
"diffusion": PolicyBenchmarkConfig(
policy_type="diffusion",
policy_path=None,
num_gpus=1,
chunk_size=None,
extra_policy_args={
"policy.horizon": "32",
"policy.n_action_steps": "30",
"policy.n_obs_steps": "2",
},
),
"smolvla": PolicyBenchmarkConfig(
policy_type="smolvla",
policy_path="lerobot/smolvla_base",
num_gpus=8,
chunk_size=30,
extra_policy_args={
"policy.n_action_steps": "30",
"policy.load_vlm_weights": "true",
"policy.freeze_vision_encoder": "false",
"policy.train_expert_only": "false",
"policy.scheduler_decay_steps": "5000",
},
),
"xvla": PolicyBenchmarkConfig(
policy_type="xvla",
policy_path="lerobot/xvla-widowx",
num_gpus=4,
chunk_size=32,
extra_policy_args={
"policy.n_action_steps": "32",
"policy.scheduler_decay_steps": "5000",
},
),
"multi_task_dit": PolicyBenchmarkConfig(
policy_type="multi_task_dit",
policy_path=None,
num_gpus=1,
chunk_size=None,
extra_policy_args={
"policy.horizon": "32",
"policy.n_action_steps": "30",
},
),
}
ALL_POLICY_NAMES = list(POLICY_CONFIGS.keys())
# GPU memory estimates (GB) for SLURM --mem allocation
GPU_MEM_ESTIMATES: dict[str, int] = {
"pi0": 320,
"pi0_fast": 320,
"pi05": 280,
"groot": 320,
"act": 64,
"diffusion": 64,
"smolvla": 160,
"xvla": 160,
"multi_task_dit": 64,
}
# ──────────────────────────────────────────────────────────────────────
# SLURM script generation
# ──────────────────────────────────────────────────────────────────────
def _cli_args(args: dict[str, str]) -> str:
"""Build a backslash-continued CLI arg string with proper shell quoting."""
lines = []
for key, value in args.items():
if any(c in str(value) for c in ["{", "}", " ", '"', "'"]):
lines.append(f" --{key}='{value}'")
else:
lines.append(f" --{key}={value}")
return " \\\n".join(lines)
def _training_cli_args(
policy_name: str,
output_dir: Path,
hub_org: str,
benchmark_uuid: str,
) -> str:
cfg = POLICY_CONFIGS[policy_name]
args: dict[str, str] = {}
args.update(COMMON_TRAINING_ARGS)
args["policy.type"] = cfg.policy_type
if cfg.policy_path:
args["policy.path"] = cfg.policy_path
if cfg.chunk_size is not None:
args["policy.chunk_size"] = str(cfg.chunk_size)
args.update(cfg.extra_policy_args)
args["output_dir"] = str(output_dir / "train" / policy_name)
args["policy.repo_id"] = f"{hub_org}/{policy_name}_libero"
args["wandb.project"] = "lerobot-libero-benchmark"
args["wandb.run_name"] = f"{policy_name}_{benchmark_uuid[:8]}"
return _cli_args(args)
def _publish_snippet(
policy_name: str,
output_dir: Path,
hub_org: str,
benchmark_uuid: str,
hub_dataset: str,
) -> str:
"""Inline Python that each SLURM job runs to publish its own result row."""
cfg = POLICY_CONFIGS[policy_name]
steps = int(COMMON_TRAINING_ARGS["steps"])
bs = int(COMMON_TRAINING_ARGS["batch_size"])
eff_bs = bs * cfg.num_gpus
train_dir = output_dir / "train" / policy_name
return textwrap.dedent(f"""\
python3 -c "
import json, os, re, sys
from pathlib import Path
from datetime import datetime, timezone
timing = {{}}
tp = Path('{output_dir}/logs/{policy_name}_timing.txt')
if tp.exists():
for ln in tp.read_text().splitlines():
if '=' in ln:
k, _, v = ln.partition('=')
timing[k.strip()] = v.strip()
# Parse eval results
eval_sr, eval_per_task, eval_n = None, '{{}}', 0
eval_dir = Path('{train_dir}/eval_results')
if eval_dir.exists():
for jf in eval_dir.glob('**/*.json'):
try:
d = json.loads(jf.read_text())
except Exception:
continue
if 'avg_success_rate' in d:
eval_sr = d['avg_success_rate']
elif 'eval_info' in d and 'avg_success_rate' in d.get('eval_info', {{}}):
eval_sr = d['eval_info']['avg_success_rate']
pt = {{k: v for k, v in d.items() if 'success_rate' in k and k != 'avg_success_rate'}}
if pt:
eval_per_task = json.dumps(pt)
if 'n_episodes' in d:
eval_n = d['n_episodes']
# Parse final loss from SLURM stdout
final_loss = None
for lf in sorted(Path('{output_dir}/logs').glob('{policy_name}_*.out'), reverse=True):
losses = re.findall(r'\\\"loss\\\"\\s*:\\s*([\\d.e+-]+)', lf.read_text())
if losses:
final_loss = float(losses[-1])
break
# Parse peak GPU mem
peak_mem = 0.0
csv_p = Path('{output_dir}/logs/{policy_name}_gpu_mem.csv')
if csv_p.exists():
for ln in csv_p.read_text().splitlines():
parts = ln.strip().split(',')
if len(parts) >= 2:
try:
peak_mem = max(peak_mem, float(parts[1].strip()))
except ValueError:
pass
# Parse train config for optimizer details
lr, opt_wd, sched_type, sched_warmup, sched_decay = 0.0, 0.0, '', 0, 0
freeze_ve, train_eo, grad_ckpt = False, False, False
cfg_path = Path('{train_dir}/checkpoints/{steps:06d}/pretrained_model/train_config.json')
if cfg_path.exists():
tc = json.loads(cfg_path.read_text())
o = tc.get('optimizer', {{}})
lr = o.get('lr', 0.0)
opt_wd = o.get('weight_decay', 0.0)
s = tc.get('scheduler', {{}})
sched_type = s.get('type', '')
sched_warmup = s.get('num_warmup_steps', 0)
sched_decay = s.get('num_decay_steps', 0)
p = tc.get('policy', {{}})
freeze_ve = p.get('freeze_vision_encoder', False)
train_eo = p.get('train_expert_only', False)
grad_ckpt = p.get('gradient_checkpointing', False)
row = {{
'benchmark_uuid': '{benchmark_uuid}',
'policy_type': '{policy_name}',
'policy_repo_id': '{hub_org}/{policy_name}_libero',
'base_model_repo_id': '{cfg.policy_path or ""}',
'dataset_repo_id': '{COMMON_TRAINING_ARGS["dataset.repo_id"]}',
'env_type': '{COMMON_TRAINING_ARGS["env.type"]}',
'env_task': '{COMMON_TRAINING_ARGS["env.task"]}',
'steps': {steps},
'batch_size_per_gpu': {bs},
'num_gpus': {cfg.num_gpus},
'effective_batch_size': {eff_bs},
'total_samples_seen': {steps * eff_bs},
'chunk_size': {cfg.chunk_size or 0},
'learning_rate': lr,
'optimizer_type': 'AdamW',
'optimizer_weight_decay': opt_wd,
'scheduler_type': sched_type,
'scheduler_warmup_steps': sched_warmup,
'scheduler_decay_steps': sched_decay,
'freeze_vision_encoder': freeze_ve,
'train_expert_only': train_eo,
'gradient_checkpointing': grad_ckpt,
'eval_success_rate': eval_sr,
'eval_success_rate_per_task': eval_per_task,
'eval_n_episodes': eval_n,
'final_train_loss': final_loss,
'training_time_s': float(timing.get('TRAINING_TIME_S', 0)),
'peak_gpu_memory_mb': peak_mem or float(timing.get('MAX_GPU_MEM_MB', 0)),
'gpu_type': timing.get('GPU_TYPE', 'unknown'),
'lerobot_commit': timing.get('LEROBOT_COMMIT', 'unknown'),
'timestamp': datetime.now(timezone.utc).isoformat(),
}}
# Save locally
Path('{train_dir}/benchmark_result.json').write_text(json.dumps(row, indent=2, default=str))
# Push to HF dataset
try:
from datasets import Dataset, load_dataset
try:
existing = load_dataset('{hub_dataset}', split='train')
rows = existing.to_list() + [row]
except Exception:
rows = [row]
Dataset.from_list(rows).push_to_hub('{hub_dataset}', split='train')
print('Published result to {hub_dataset}')
except ImportError:
print('datasets library not installed — result saved locally only')
except Exception as e:
print(f'Failed to push to hub: {{e}} — result saved locally')
"
""")
def _generate_sbatch_script(
policy_name: str,
output_dir: Path,
hub_org: str,
benchmark_uuid: str,
hub_dataset: str,
lerobot_commit: str,
) -> str:
cfg = POLICY_CONFIGS[policy_name]
steps = int(COMMON_TRAINING_ARGS["steps"])
log_dir = output_dir / "logs"
train_dir = output_dir / "train" / policy_name
checkpoint_path = train_dir / f"checkpoints/{steps:06d}/pretrained_model"
training_args = _training_cli_args(policy_name, output_dir, hub_org, benchmark_uuid)
eval_args = _cli_args(EVAL_ARGS)
publish = _publish_snippet(policy_name, output_dir, hub_org, benchmark_uuid, hub_dataset)
return textwrap.dedent(f"""\
#!/bin/bash
#SBATCH --job-name=bench_{policy_name}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:{cfg.num_gpus}
#SBATCH --cpus-per-task={cfg.num_gpus * 8}
#SBATCH --mem={GPU_MEM_ESTIMATES.get(policy_name, 128)}G
#SBATCH --time=06:00:00
#SBATCH --output={log_dir}/{policy_name}_%j.out
#SBATCH --error={log_dir}/{policy_name}_%j.err
set -euo pipefail
echo "=========================================="
echo "LeRobot LIBERO Benchmark — {policy_name}"
echo "UUID: {benchmark_uuid}"
echo "Start: $(date -Iseconds)"
echo "Host: $(hostname) | GPUs: {cfg.num_gpus}"
echo "=========================================="
START_TIME=$(date +%s)
# GPU memory monitoring (every 30s)
nvidia-smi --query-gpu=index,memory.used,memory.total,gpu_name \\
--format=csv,noheader,nounits -l 30 \\
> "{log_dir}/{policy_name}_gpu_mem.csv" &
GPU_MONITOR_PID=$!
# ── Training ──────────────────────────────────────────────────
echo "[$(date -Iseconds)] Starting training..."
# With set -euo pipefail, a failing command aborts the script before $? can be
# read, so initialise the code and capture failures via ||.
TRAIN_EXIT=0
accelerate launch --num_processes={cfg.num_gpus} \\
$(which lerobot-train) \\
{training_args} || TRAIN_EXIT=$?
TRAIN_END=$(date +%s)
echo "[$(date -Iseconds)] Training exit code: $TRAIN_EXIT"
# ── Evaluation ────────────────────────────────────────────────
EVAL_EXIT=1
if [ $TRAIN_EXIT -eq 0 ]; then
echo "[$(date -Iseconds)] Starting evaluation..."
EVAL_EXIT=0
lerobot-eval \\
--policy.path="{checkpoint_path}" \\
{eval_args} \\
--output_dir="{train_dir}/eval_results" || EVAL_EXIT=$?
echo "[$(date -Iseconds)] Eval exit code: $EVAL_EXIT"
else
echo "[$(date -Iseconds)] Skipping eval — training failed."
fi
# ── Timing ────────────────────────────────────────────────────
END_TIME=$(date +%s)
kill $GPU_MONITOR_PID 2>/dev/null || true
cat > "{log_dir}/{policy_name}_timing.txt" <<TIMING_EOF
BENCHMARK_UUID={benchmark_uuid}
POLICY_TYPE={policy_name}
TRAINING_TIME_S=$((TRAIN_END - START_TIME))
TOTAL_TIME_S=$((END_TIME - START_TIME))
TRAIN_EXIT=$TRAIN_EXIT
EVAL_EXIT=$EVAL_EXIT
MAX_GPU_MEM_MB=$(awk -F',' '{{print $2}}' "{log_dir}/{policy_name}_gpu_mem.csv" 2>/dev/null | sort -n | tail -1)
GPU_TYPE=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader | head -1 | xargs)
LEROBOT_COMMIT={lerobot_commit}
TIMING_EOF
# ── Publish result to HF dataset ──────────────────────────────
echo "[$(date -Iseconds)] Publishing result..."
{publish}
echo "=========================================="
echo "Done: $(date -Iseconds)"
echo "Training: $((TRAIN_END - START_TIME))s | Total: $((END_TIME - START_TIME))s"
echo "=========================================="
""")
def _generate_tokenizer_script(
output_dir: Path,
hub_org: str,
benchmark_uuid: str,
) -> str:
cfg = POLICY_CONFIGS["pi0_fast"]
log_dir = output_dir / "logs"
tokenizer_hub_repo = f"{hub_org}/fast-tokenizer-libero"
tok_args = dict(cfg.tokenizer_args)
tok_args["hub_repo_id"] = tokenizer_hub_repo
return textwrap.dedent(f"""\
#!/bin/bash
#SBATCH --job-name=bench_tokenizer
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=01:00:00
#SBATCH --output={log_dir}/tokenizer_%j.out
#SBATCH --error={log_dir}/tokenizer_%j.err
set -euo pipefail
echo "LeRobot — FAST Tokenizer | UUID: {benchmark_uuid}"
lerobot-train-tokenizer \\
{_cli_args(tok_args)}
echo "Tokenizer pushed to: {tokenizer_hub_repo}"
""")
# ──────────────────────────────────────────────────────────────────────
# Main
# ──────────────────────────────────────────────────────────────────────
def main() -> None:
parser = argparse.ArgumentParser(description="Generate SLURM scripts for LeRobot LIBERO benchmark.")
parser.add_argument(
"--policies",
nargs="+",
default=ALL_POLICY_NAMES,
choices=ALL_POLICY_NAMES,
help="Policies to benchmark (default: all).",
)
parser.add_argument("--output_dir", type=Path, required=True, help="Root output directory.")
parser.add_argument("--hub_org", type=str, default="lerobot", help="HuggingFace org.")
parser.add_argument("--hub_dataset", type=str, default=None, help="HF dataset repo for results.")
parser.add_argument("--uuid", type=str, default=None, help="Override benchmark UUID.")
args = parser.parse_args()
benchmark_uuid = args.uuid or str(uuid.uuid4())
output_dir: Path = args.output_dir.resolve()
policies: list[str] = args.policies
hub_org: str = args.hub_org
hub_dataset: str = args.hub_dataset or f"{hub_org}/benchmark-libero"
try:
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
except (subprocess.CalledProcessError, FileNotFoundError):
commit = "unknown"
scripts_dir = output_dir / "slurm_scripts"
log_dir = output_dir / "logs"
scripts_dir.mkdir(parents=True, exist_ok=True)
log_dir.mkdir(parents=True, exist_ok=True)
for p in policies:
(output_dir / "train" / p).mkdir(parents=True, exist_ok=True)
generated: dict[str, Path] = {}
# Tokenizer job for pi0_fast
tokenizer_path = None
if "pi0_fast" in policies:
script = _generate_tokenizer_script(output_dir, hub_org, benchmark_uuid)
tokenizer_path = scripts_dir / "00_tokenizer.sh"
tokenizer_path.write_text(script)
tokenizer_path.chmod(0o755)
generated["tokenizer"] = tokenizer_path
tokenizer_hub_repo = f"{hub_org}/fast-tokenizer-libero"
POLICY_CONFIGS["pi0_fast"].extra_policy_args["policy.action_tokenizer_name"] = tokenizer_hub_repo
# Per-policy scripts
for i, name in enumerate(sorted(policies), start=1):
script = _generate_sbatch_script(name, output_dir, hub_org, benchmark_uuid, hub_dataset, commit)
path = scripts_dir / f"{i:02d}_{name}.sh"
path.write_text(script)
path.chmod(0o755)
generated[name] = path
# Manifest
manifest = {
"benchmark_uuid": benchmark_uuid,
"timestamp": datetime.now(UTC).isoformat(),
"lerobot_commit": commit,
"hub_org": hub_org,
"hub_dataset": hub_dataset,
"policies": policies,
"output_dir": str(output_dir),
"scripts": {k: str(v) for k, v in generated.items()},
}
manifest_path = output_dir / "benchmark_manifest.json"
manifest_path.write_text(json.dumps(manifest, indent=2))
# Instructions
print("=" * 60)
print("LeRobot LIBERO Benchmark — Scripts Generated")
print(f"UUID: {benchmark_uuid}")
print(f"Output: {output_dir}")
print(f"Results dataset: {hub_dataset}")
print("=" * 60)
print()
for _name, path in sorted(generated.items()):
print(f" {path}")
print()
if tokenizer_path:
print("IMPORTANT: pi0_fast requires tokenizer training FIRST.")
print(f" 1. sbatch {tokenizer_path}")
print(" 2. Wait for completion")
print(f" 3. sbatch {generated.get('pi0_fast', 'N/A')}")
print(" 4. All other policies can run in parallel")
else:
print("All scripts can be submitted in parallel.")
print()
print("Each job publishes its result to the HF dataset automatically.")
if __name__ == "__main__":
main()
@@ -0,0 +1,156 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Publish benchmark rows and lightweight artifacts to a Hub dataset."""
from __future__ import annotations
import argparse
import json
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from lerobot.utils.history_repo import UploadTarget, make_hub_file_url, upload_targets, utc_timestamp_slug
def load_json_if_exists(path: Path) -> dict[str, Any] | None:
if not path.exists():
return None
return json.loads(path.read_text())
def find_latest_train_config_path(run_root: Path) -> Path | None:
checkpoints_dir = run_root / "train" / "checkpoints"
if not checkpoints_dir.exists():
return None
candidates = sorted(
checkpoints_dir.glob("*/pretrained_model/train_config.json"),
key=lambda path: path.parts[-3],
)
return candidates[-1] if candidates else None
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--benchmark", required=True)
parser.add_argument("--policy", required=True)
parser.add_argument("--run_root", required=True, type=Path)
parser.add_argument("--results_repo", required=True)
parser.add_argument("--git_commit", required=True)
parser.add_argument("--num_gpus", required=True, type=int)
parser.add_argument("--microbatch_per_gpu", required=True, type=int)
parser.add_argument("--gradient_accumulation_steps", required=True, type=int)
parser.add_argument("--effective_batch_size", required=True, type=int)
parser.add_argument("--train_wall_time_s", required=True, type=float)
parser.add_argument("--eval_wall_time_s", required=True, type=float)
parser.add_argument("--slurm_job_id", default="")
parser.add_argument("--docker_image", required=True)
return parser.parse_args()
def build_row(args: argparse.Namespace) -> tuple[dict[str, Any], list[UploadTarget]]:
now = datetime.now(UTC)
created_at = now.isoformat()
timestamp = utc_timestamp_slug(now)
run_id = f"{timestamp}__{args.benchmark}__{args.policy}__{args.slurm_job_id or 'manual'}"
eval_info = load_json_if_exists(args.run_root / "eval" / "eval_info.json") or {}
train_config_path = find_latest_train_config_path(args.run_root)
train_config = load_json_if_exists(train_config_path) or {}
artifact_prefix = f"artifacts/{args.benchmark}/{args.policy}/{run_id}"
row_path_in_repo = f"rows/{args.benchmark}/{args.policy}/{run_id}.json"
row = {
"schema_version": 1,
"created_at": created_at,
"run_id": run_id,
"benchmark": args.benchmark,
"policy": args.policy,
"git_commit": args.git_commit,
"slurm_job_id": args.slurm_job_id or None,
"docker_image": args.docker_image,
"resources": {
"num_gpus": args.num_gpus,
"microbatch_per_gpu": args.microbatch_per_gpu,
"gradient_accumulation_steps": args.gradient_accumulation_steps,
"effective_batch_size": args.effective_batch_size,
},
"timings": {
"train_wall_time_s": args.train_wall_time_s,
"eval_wall_time_s": args.eval_wall_time_s,
"total_wall_time_s": args.train_wall_time_s + args.eval_wall_time_s,
},
"eval": {
"overall": eval_info.get("overall", {}),
"per_group": eval_info.get("per_group", {}),
"per_task_count": len(eval_info.get("per_task", [])),
},
"paths": {
"run_root": str(args.run_root),
"train_dir": str(args.run_root / "train"),
"eval_dir": str(args.run_root / "eval"),
},
"train_config": train_config,
"artifact_urls": {
"row": make_hub_file_url(args.results_repo, row_path_in_repo),
},
}
row_path = args.run_root / "benchmark_row.json"
row_path.parent.mkdir(parents=True, exist_ok=True)
upload_list = [UploadTarget(local_path=row_path, path_in_repo=row_path_in_repo)]
eval_info_path = args.run_root / "eval" / "eval_info.json"
if eval_info_path.exists():
row["artifact_urls"]["eval_info"] = make_hub_file_url(
args.results_repo, f"{artifact_prefix}/eval_info.json"
)
upload_list.append(
UploadTarget(local_path=eval_info_path, path_in_repo=f"{artifact_prefix}/eval_info.json")
)
if train_config_path is not None and train_config_path.exists():
row["artifact_urls"]["train_config"] = make_hub_file_url(
args.results_repo, f"{artifact_prefix}/train_config.json"
)
upload_list.append(
UploadTarget(local_path=train_config_path, path_in_repo=f"{artifact_prefix}/train_config.json")
)
row_path.write_text(json.dumps(row, indent=2, sort_keys=True))
return row, upload_list
def main() -> int:
args = parse_args()
row, upload_list = build_row(args)
uploaded = upload_targets(
repo_id=args.results_repo,
targets=upload_list,
repo_type="dataset",
private=False,
commit_message=f"Add benchmark row {row['run_id']}",
)
row["uploaded_paths"] = uploaded
row_path = args.run_root / "benchmark_row.json"
row_path.write_text(json.dumps(row, indent=2, sort_keys=True))
print(json.dumps(row, indent=2, sort_keys=True))
return 0
if __name__ == "__main__":
raise SystemExit(main())
@@ -0,0 +1,647 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Generate lightweight SLURM jobs for the policy × benchmark matrix."""
from __future__ import annotations
import argparse
import json
import math
import subprocess
from dataclasses import asdict, dataclass, field
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from lerobot.utils.history_repo import utc_timestamp_slug
MAX_GPUS = 8
MIN_GPUS = 1
DEFAULT_STEPS = 20_000
DEFAULT_EFFECTIVE_BATCH_SIZE = 256
DEFAULT_MICROBATCH_PER_GPU = 32
DEFAULT_EVAL_BATCH_SIZE = 1
DEFAULT_CPUS_PER_GPU = 8
DEFAULT_MEMORY_PER_GPU_GB = 40
@dataclass(frozen=True)
class BenchmarkSpec:
name: str
dataset_repo_id: str
docker_image: str
eval_env_type: str
eval_task: str
eval_n_episodes: int
train_steps: int = DEFAULT_STEPS
effective_batch_size: int = DEFAULT_EFFECTIVE_BATCH_SIZE
train_extra_args: dict[str, Any] = field(default_factory=dict)
eval_extra_args: dict[str, Any] = field(default_factory=dict)
@dataclass(frozen=True)
class PolicySpec:
name: str
policy_type: str
num_gpus: int
policy_path: str | None = None
microbatch_per_gpu: int = DEFAULT_MICROBATCH_PER_GPU
extra_train_args: dict[str, Any] = field(default_factory=dict)
extra_eval_args: dict[str, Any] = field(default_factory=dict)
needs_tokenizer: bool = False
tokenizer_args: dict[str, Any] = field(default_factory=dict)
@dataclass(frozen=True)
class PlannedJob:
benchmark: str
policy: str
run_rel: str
num_gpus: int
microbatch_per_gpu: int
gradient_accumulation_steps: int
effective_batch_size: int
docker_image: str
train_args: dict[str, Any]
eval_args: dict[str, Any]
tokenizer_args: dict[str, Any] | None
script_path: str
BENCHMARKS: dict[str, BenchmarkSpec] = {
"libero_plus": BenchmarkSpec(
name="libero_plus",
dataset_repo_id="lerobot/libero_plus",
docker_image="lerobot-benchmark-libero-plus:latest",
eval_env_type="libero_plus",
eval_task="libero_spatial,libero_object,libero_goal,libero_10",
eval_n_episodes=10,
train_extra_args={
"rename_map": {
"observation.images.image": "observation.images.camera1",
"observation.images.image2": "observation.images.camera2",
},
},
eval_extra_args={
"env.camera_name_mapping": {
"agentview_image": "camera1",
"robot0_eye_in_hand_image": "camera2",
},
"env.max_parallel_tasks": 1,
"eval.batch_size": DEFAULT_EVAL_BATCH_SIZE,
"eval.use_async_envs": False,
"eval.max_episodes_rendered": 0,
"policy.device": "cuda",
},
),
"robomme": BenchmarkSpec(
name="robomme",
dataset_repo_id="lerobot/robomme",
docker_image="lerobot-benchmark-robomme:latest",
eval_env_type="robomme",
eval_task=(
"BinFill,PickXtimes,SwingXtimes,StopCube,VideoUnmask,VideoUnmaskSwap,"
"ButtonUnmask,ButtonUnmaskSwap,PickHighlight,VideoRepick,VideoPlaceButton,"
"VideoPlaceOrder,MoveCube,InsertPeg,PatternLock,RouteStick"
),
eval_n_episodes=50,
train_extra_args={
"rename_map": {
"observation.images.image": "observation.images.camera1",
"observation.images.wrist_image": "observation.images.camera2",
},
},
eval_extra_args={
"env.dataset_split": "test",
"env.max_parallel_tasks": 1,
"rename_map": {
"observation.images.image": "observation.images.camera1",
"observation.images.wrist_image": "observation.images.camera2",
},
"eval.batch_size": DEFAULT_EVAL_BATCH_SIZE,
"eval.use_async_envs": False,
"eval.max_episodes_rendered": 0,
"policy.device": "cuda",
},
),
}
POLICIES: dict[str, PolicySpec] = {
"pi0": PolicySpec(
name="pi0",
policy_type="pi0",
policy_path="lerobot/pi0_base",
num_gpus=8,
extra_train_args={
"policy.n_action_steps": 30,
"policy.scheduler_decay_steps": DEFAULT_STEPS,
"policy.empty_cameras": 0,
},
),
"pi0_fast": PolicySpec(
name="pi0_fast",
policy_type="pi0_fast",
policy_path="lerobot/pi0fast-base",
num_gpus=8,
extra_train_args={
"policy.n_action_steps": 30,
"policy.scheduler_decay_steps": DEFAULT_STEPS,
"policy.empty_cameras": 0,
},
needs_tokenizer=True,
tokenizer_args={
"action_horizon": 30,
"encoded_dims": "0:7",
"normalization_mode": "QUANTILES",
"vocab_size": 1024,
"scale": 10.0,
"push_to_hub": True,
},
),
"pi05": PolicySpec(
name="pi05",
policy_type="pi05",
policy_path="lerobot/pi05_base",
num_gpus=8,
extra_train_args={
"policy.n_action_steps": 30,
"policy.scheduler_decay_steps": DEFAULT_STEPS,
"policy.empty_cameras": 0,
},
),
"groot": PolicySpec(
name="groot",
policy_type="groot",
num_gpus=8,
extra_train_args={
"policy.n_action_steps": 30,
"policy.base_model_path": "nvidia/GR00T-N1.5-3B",
"policy.tune_diffusion_model": True,
"policy.tune_projector": True,
"policy.tune_llm": False,
"policy.tune_visual": False,
"policy.use_bf16": True,
},
),
"act": PolicySpec(
name="act",
policy_type="act",
num_gpus=1,
extra_train_args={
"policy.n_action_steps": 30,
},
),
"diffusion": PolicySpec(
name="diffusion",
policy_type="diffusion",
num_gpus=1,
extra_train_args={
"policy.horizon": 32,
"policy.n_action_steps": 30,
"policy.n_obs_steps": 2,
},
),
"smolvla": PolicySpec(
name="smolvla",
policy_type="smolvla",
policy_path="lerobot/smolvla_base",
num_gpus=8,
extra_train_args={
"policy.n_action_steps": 30,
"policy.load_vlm_weights": True,
"policy.freeze_vision_encoder": False,
"policy.train_expert_only": False,
"policy.scheduler_decay_steps": DEFAULT_STEPS,
"policy.empty_cameras": 1,
},
),
"xvla": PolicySpec(
name="xvla",
policy_type="xvla",
policy_path="lerobot/xvla-widowx",
num_gpus=4,
extra_train_args={
"policy.n_action_steps": 32,
"policy.scheduler_decay_steps": DEFAULT_STEPS,
"policy.empty_cameras": 1,
},
),
"multi_task_dit": PolicySpec(
name="multi_task_dit",
policy_type="multi_task_dit",
num_gpus=1,
extra_train_args={
"policy.horizon": 32,
"policy.n_action_steps": 30,
},
),
}
def normalize_repo_id(hub_org: str, repo_or_id: str) -> str:
return repo_or_id if "/" in repo_or_id else f"{hub_org}/{repo_or_id}"
def get_requested_names(
requested: list[str] | None,
available: dict[str, Any],
*,
kind: str,
) -> list[str]:
if not requested:
return list(available)
unknown = sorted(set(requested) - set(available))
if unknown:
raise ValueError(f"Unknown {kind}: {', '.join(unknown)}. Available: {', '.join(available)}")
return requested
def compute_gradient_accumulation_steps(
*,
effective_batch_size: int,
num_gpus: int,
microbatch_per_gpu: int,
) -> int:
per_step_batch = num_gpus * microbatch_per_gpu
if effective_batch_size % per_step_batch != 0:
raise ValueError(
f"Cannot reach effective batch {effective_batch_size} with {num_gpus=} and "
f"{microbatch_per_gpu=}."
)
return effective_batch_size // per_step_batch
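As an illustrative standalone sketch (restating the arithmetic above with a hypothetical helper name): one optimizer step consumes `num_gpus * microbatch_per_gpu` samples, so the accumulation count is the ratio to the target effective batch, which must divide evenly.

```python
# Sketch of the accumulation arithmetic: the effective batch must be an
# exact multiple of the per-optimizer-step batch (num_gpus * microbatch).
def grad_accum_steps(effective_batch_size: int, num_gpus: int, microbatch_per_gpu: int) -> int:
    per_step_batch = num_gpus * microbatch_per_gpu
    if effective_batch_size % per_step_batch != 0:
        raise ValueError(f"{effective_batch_size} is not a multiple of {per_step_batch}")
    return effective_batch_size // per_step_batch

# e.g. an effective batch of 256 on 8 GPUs with a microbatch of 4:
print(grad_accum_steps(256, 8, 4))  # → 8
```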
def make_run_slug() -> str:
return utc_timestamp_slug()
def shell_value(value: Any) -> str:
if isinstance(value, bool):
value = "true" if value else "false"
elif isinstance(value, (dict, list)):
value = json.dumps(value, sort_keys=True)
else:
value = str(value)
escaped = (
value.replace("\\", "\\\\")
.replace('"', '\\"')
.replace("$", "\\$")
.replace("`", "\\`")
)
return f'"{escaped}"'
def format_cli_args(args: dict[str, Any]) -> str:
lines = []
for key, value in args.items():
lines.append(f" --{key}={shell_value(value)}")
return " \\\n".join(lines)
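A quick sketch of the quoting behavior (a standalone copy of the rules above, for illustration only): booleans are lower-cased, dicts and lists are serialized as sorted JSON, and shell metacharacters are backslash-escaped inside double quotes so the rendered sbatch scripts stay safe.

```python
import json

# Standalone reimplementation of the quoting rules above, for illustration.
def shell_value(value) -> str:
    if isinstance(value, bool):
        value = "true" if value else "false"
    elif isinstance(value, (dict, list)):
        value = json.dumps(value, sort_keys=True)
    else:
        value = str(value)
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("$", "\\$").replace("`", "\\`")
    )
    return f'"{escaped}"'

print(shell_value(True))              # → "true"
print(shell_value("a$b"))             # dollar sign escaped for the shell
print(shell_value({"b": 2, "a": 1}))  # JSON with sorted keys, quotes escaped
```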
def build_train_args(
*,
benchmark: BenchmarkSpec,
policy: PolicySpec,
train_dir: str,
gradient_accumulation_steps: int,
) -> dict[str, Any]:
args: dict[str, Any] = {
"dataset.repo_id": benchmark.dataset_repo_id,
"output_dir": train_dir,
"steps": benchmark.train_steps,
"batch_size": policy.microbatch_per_gpu,
"gradient_accumulation_steps": gradient_accumulation_steps,
"eval_freq": 0,
"save_freq": benchmark.train_steps,
"save_checkpoint": True,
"log_freq": 100,
"wandb.enable": False,
"policy.push_to_hub": False,
"policy.device": "cuda",
}
if policy.policy_path:
args["policy.path"] = policy.policy_path
else:
args["policy.type"] = policy.policy_type
args.update(benchmark.train_extra_args)
args.update(policy.extra_train_args)
return args
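The two `update` calls above establish a precedence order: benchmark extras override the base defaults, and policy extras override both. A minimal sketch with hypothetical values:

```python
# dict.update applies later sources last, so for a shared key the
# policy-level value wins over the benchmark-level one (values invented).
base = {"steps": 30000, "batch_size": 4}
benchmark_extras = {"rename_map": {"img": "camera1"}, "steps": 20000}
policy_extras = {"policy.n_action_steps": 30, "steps": 10000}

args = dict(base)
args.update(benchmark_extras)  # benchmark overrides base
args.update(policy_extras)     # policy overrides benchmark
print(args["steps"])  # → 10000
```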
def build_eval_args(
*,
benchmark: BenchmarkSpec,
policy: PolicySpec,
checkpoint_path: str,
eval_dir: str,
) -> dict[str, Any]:
args: dict[str, Any] = {
"policy.path": checkpoint_path,
"env.type": benchmark.eval_env_type,
"env.task": benchmark.eval_task,
"eval.n_episodes": benchmark.eval_n_episodes,
"output_dir": eval_dir,
}
args.update(benchmark.eval_extra_args)
args.update(policy.extra_eval_args)
return args
def plan_jobs(
*,
output_dir: Path,
hub_org: str,
results_repo: str,
policies: list[str],
benchmarks: list[str],
) -> list[PlannedJob]:
_ = hub_org
_ = results_repo
scripts_dir = output_dir / "slurm"
jobs: list[PlannedJob] = []
for benchmark_name in benchmarks:
benchmark = BENCHMARKS[benchmark_name]
for policy_name in policies:
policy = POLICIES[policy_name]
num_gpus = max(MIN_GPUS, min(policy.num_gpus, MAX_GPUS))
run_rel = f"runs/{benchmark_name}/{policy_name}/{make_run_slug()}"
run_root = f"/benchmark-output/{run_rel}"
gradient_accumulation_steps = compute_gradient_accumulation_steps(
effective_batch_size=benchmark.effective_batch_size,
num_gpus=num_gpus,
microbatch_per_gpu=policy.microbatch_per_gpu,
)
train_dir = f"{run_root}/train"
checkpoint_path = f"{train_dir}/checkpoints/{benchmark.train_steps:06d}/pretrained_model"
eval_dir = f"{run_root}/eval"
train_args = build_train_args(
benchmark=benchmark,
policy=policy,
train_dir=train_dir,
gradient_accumulation_steps=gradient_accumulation_steps,
)
eval_args = build_eval_args(
benchmark=benchmark,
policy=policy,
checkpoint_path=checkpoint_path,
eval_dir=eval_dir,
)
tokenizer_args = None
if policy.needs_tokenizer:
tokenizer_repo_id = f"{hub_org}/{policy_name}-{benchmark_name}-tokenizer"
tokenizer_args = {
"repo_id": benchmark.dataset_repo_id,
"output_dir": f"{run_root}/tokenizer",
"hub_repo_id": tokenizer_repo_id,
**policy.tokenizer_args,
}
train_args["policy.action_tokenizer_name"] = tokenizer_repo_id
script_path = str(scripts_dir / f"{benchmark_name}__{policy_name}.sbatch")
jobs.append(
PlannedJob(
benchmark=benchmark_name,
policy=policy_name,
run_rel=run_rel,
num_gpus=num_gpus,
microbatch_per_gpu=policy.microbatch_per_gpu,
gradient_accumulation_steps=gradient_accumulation_steps,
effective_batch_size=benchmark.effective_batch_size,
docker_image=benchmark.docker_image,
train_args=train_args,
eval_args=eval_args,
tokenizer_args=tokenizer_args,
script_path=script_path,
)
)
return jobs
def render_sbatch_script(
*,
job: PlannedJob,
output_dir: Path,
results_repo_id: str,
git_commit: str,
) -> str:
host_output_dir = output_dir.resolve()
run_root = f"/benchmark-output/{job.run_rel}"
host_run_root = host_output_dir / job.run_rel
cpus_per_task = max(DEFAULT_CPUS_PER_GPU, DEFAULT_CPUS_PER_GPU * job.num_gpus)
mem_gb = max(DEFAULT_MEMORY_PER_GPU_GB, DEFAULT_MEMORY_PER_GPU_GB * job.num_gpus)
gpu_ids_expr = "${GPU_IDS}"
train_cli = format_cli_args(job.train_args)
eval_cli = format_cli_args(job.eval_args)
tokenizer_command = ""
if job.tokenizer_args:
tokenizer_cli = format_cli_args(job.tokenizer_args)
tokenizer_command = f"""
docker run --rm --gpus all \\
--shm-size=16g \\
-e CUDA_VISIBLE_DEVICES={gpu_ids_expr} \\
-e HF_TOKEN="${{HF_TOKEN:-}}" \\
-e HF_USER_TOKEN="${{HF_TOKEN:-}}" \\
-e HF_HOME=/tmp/hf \\
-v "{host_output_dir}:/benchmark-output" \\
-w /lerobot \\
"{job.docker_image}" \\
bash -lc '
set -euo pipefail
if [[ -n "${{HF_TOKEN:-}}" ]]; then
hf auth login --token "${{HF_TOKEN}}" --add-to-git-credential 2>/dev/null || true
fi
lerobot-train-tokenizer \\
{tokenizer_cli}
'
"""
return f"""#!/bin/bash
#SBATCH --job-name=bench-{job.benchmark}-{job.policy}
#SBATCH --gres=gpu:{job.num_gpus}
#SBATCH --cpus-per-task={cpus_per_task}
#SBATCH --mem={mem_gb}G
#SBATCH --output={output_dir.resolve()}/logs/{job.benchmark}__{job.policy}__%j.out
#SBATCH --error={output_dir.resolve()}/logs/{job.benchmark}__{job.policy}__%j.err
set -euo pipefail
HF_TOKEN="${{HF_TOKEN:-${{HF_USER_TOKEN:-}}}}"
GPU_IDS="$(seq -s, 0 $(({job.num_gpus} - 1)))"
RUN_ROOT="{run_root}"
mkdir -p "{host_output_dir}/logs"
mkdir -p "{host_run_root.parent}"
{tokenizer_command}
TRAIN_START="$(date +%s)"
docker run --rm --gpus all \\
--shm-size=16g \\
-e CUDA_VISIBLE_DEVICES="${{GPU_IDS}}" \\
-e HF_TOKEN="${{HF_TOKEN:-}}" \\
-e HF_USER_TOKEN="${{HF_TOKEN:-}}" \\
-e HF_HOME=/tmp/hf \\
-v "{host_output_dir}:/benchmark-output" \\
-w /lerobot \\
"{job.docker_image}" \\
bash -lc '
set -euo pipefail
if [[ -n "${{HF_TOKEN:-}}" ]]; then
hf auth login --token "${{HF_TOKEN}}" --add-to-git-credential 2>/dev/null || true
fi
accelerate launch --num_processes={job.num_gpus} $(which lerobot-train) \\
{train_cli}
'
TRAIN_END="$(date +%s)"
EVAL_START="$(date +%s)"
docker run --rm --gpus all \\
--shm-size=16g \\
-e CUDA_VISIBLE_DEVICES="${{GPU_IDS}}" \\
-e HF_TOKEN="${{HF_TOKEN:-}}" \\
-e HF_USER_TOKEN="${{HF_TOKEN:-}}" \\
-e HF_HOME=/tmp/hf \\
-v "{host_output_dir}:/benchmark-output" \\
-w /lerobot \\
"{job.docker_image}" \\
bash -lc '
set -euo pipefail
if [[ -n "${{HF_TOKEN:-}}" ]]; then
hf auth login --token "${{HF_TOKEN}}" --add-to-git-credential 2>/dev/null || true
fi
lerobot-eval \\
{eval_cli}
'
EVAL_END="$(date +%s)"
TRAIN_WALL_TIME_S="$((TRAIN_END - TRAIN_START))"
EVAL_WALL_TIME_S="$((EVAL_END - EVAL_START))"
docker run --rm --gpus all \\
--shm-size=16g \\
-e CUDA_VISIBLE_DEVICES="${{GPU_IDS}}" \\
-e HF_TOKEN="${{HF_TOKEN:-}}" \\
-e HF_USER_TOKEN="${{HF_TOKEN:-}}" \\
-e HF_HOME=/tmp/hf \\
-e RUN_ROOT="${{RUN_ROOT}}" \\
-e TRAIN_WALL_TIME_S="${{TRAIN_WALL_TIME_S}}" \\
-e EVAL_WALL_TIME_S="${{EVAL_WALL_TIME_S}}" \\
-v "{host_output_dir}:/benchmark-output" \\
-w /lerobot \\
"{job.docker_image}" \\
bash -lc '
set -euo pipefail
if [[ -n "${{HF_TOKEN:-}}" ]]; then
hf auth login --token "${{HF_TOKEN}}" --add-to-git-credential 2>/dev/null || true
fi
uv run python benchmarks/publish_benchmark_result.py \\
--benchmark={job.benchmark} \\
--policy={job.policy} \\
--run_root="${{RUN_ROOT}}" \\
--results_repo={results_repo_id} \\
--git_commit={git_commit} \\
--num_gpus={job.num_gpus} \\
--microbatch_per_gpu={job.microbatch_per_gpu} \\
--gradient_accumulation_steps={job.gradient_accumulation_steps} \\
--effective_batch_size={job.effective_batch_size} \\
--train_wall_time_s="${{TRAIN_WALL_TIME_S}}" \\
--eval_wall_time_s="${{EVAL_WALL_TIME_S}}" \\
--slurm_job_id="${{SLURM_JOB_ID:-}}" \\
--docker_image={job.docker_image}
'
"""
def write_manifest(
*,
output_dir: Path,
jobs: list[PlannedJob],
git_commit: str,
hub_org: str,
results_repo: str,
) -> Path:
manifest = {
"generated_at": datetime.now(UTC).isoformat(),
"git_commit": git_commit,
"hub_org": hub_org,
"results_repo": results_repo,
"jobs": [asdict(job) for job in jobs],
}
manifest_path = output_dir / "manifest.json"
manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
return manifest_path
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--policies", nargs="*", default=None)
parser.add_argument("--benchmarks", nargs="*", default=None)
parser.add_argument("--output_dir", required=True, type=Path)
parser.add_argument("--hub_org", required=True)
parser.add_argument("--results_repo", required=True)
parser.add_argument("--submit", action="store_true")
return parser.parse_args()
def get_git_commit() -> str:
return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
def main() -> int:
args = parse_args()
args.output_dir.mkdir(parents=True, exist_ok=True)
(args.output_dir / "slurm").mkdir(parents=True, exist_ok=True)
(args.output_dir / "logs").mkdir(parents=True, exist_ok=True)
selected_policies = get_requested_names(args.policies, POLICIES, kind="policies")
selected_benchmarks = get_requested_names(args.benchmarks, BENCHMARKS, kind="benchmarks")
git_commit = get_git_commit()
results_repo_id = normalize_repo_id(args.hub_org, args.results_repo)
jobs = plan_jobs(
output_dir=args.output_dir,
hub_org=args.hub_org,
results_repo=results_repo_id,
policies=selected_policies,
benchmarks=selected_benchmarks,
)
for job in jobs:
script = render_sbatch_script(
job=job,
output_dir=args.output_dir,
results_repo_id=results_repo_id,
git_commit=git_commit,
)
script_path = Path(job.script_path)
script_path.write_text(script)
script_path.chmod(0o755)
if args.submit:
subprocess.run(["sbatch", str(script_path)], check=True)
manifest_path = write_manifest(
output_dir=args.output_dir,
jobs=jobs,
git_commit=git_commit,
hub_org=args.hub_org,
results_repo=results_repo_id,
)
print(f"Wrote {len(jobs)} benchmark jobs to {args.output_dir}")
print(f"Manifest: {manifest_path}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
@@ -0,0 +1,42 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for LIBERO integration tests.
# Extends the nightly GPU image (which already has all extras installed)
# with the PR's source code and LIBERO-specific asset setup.
#
# Build: docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
# Run: docker run --gpus all --rm lerobot-benchmark-libero lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Pre-download lerobot/libero-assets from HF Hub so nothing is fetched at
# runtime (which times out on CI). Point the libero config at the cached path.
# libero/libero/__init__.py calls input() when ~/.libero/config.yaml is missing,
# so we write the config before any libero import can happen.
RUN LIBERO_DIR=$(python -c \
"import importlib.util, os; s=importlib.util.find_spec('libero'); \
print(os.path.join(os.path.dirname(s.origin), 'libero'))") && \
mkdir -p /home/user_lerobot/.libero && \
python -c "\
from huggingface_hub import snapshot_download; \
snapshot_download(repo_id='lerobot/libero-assets', repo_type='dataset', \
local_dir='/home/user_lerobot/.libero/assets')" && \
printf "assets: /home/user_lerobot/.libero/assets\nbddl_files: ${LIBERO_DIR}/bddl_files\ndatasets: ${LIBERO_DIR}/../datasets\ninit_states: ${LIBERO_DIR}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
@@ -0,0 +1,48 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM huggingface/lerobot-gpu:latest
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
unzip libexpat1 libfontconfig1-dev libmagickwand-dev \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
USER user_lerobot
RUN uv pip install --no-cache \
"robosuite==1.4.1" bddl easydict mujoco matplotlib wand scikit-image gym
ENV LIBERO_PLUS_ROOT=/home/user_lerobot/libero-plus/libero/libero
RUN git clone --depth=1 https://github.com/sylvestf/LIBERO-plus.git /home/user_lerobot/libero-plus \
&& cd /home/user_lerobot/libero-plus && uv pip install --no-cache --no-deps -e "." \
&& uv pip uninstall hf-libero 2>/dev/null || true
ENV PYTHONPATH="/home/user_lerobot/libero-plus:${PYTHONPATH}"
RUN python -c "\
from huggingface_hub import hf_hub_download; \
hf_hub_download(repo_id='Sylvest/LIBERO-plus', repo_type='dataset', \
filename='assets.zip', local_dir='/tmp/libero-plus-dl')" \
&& unzip -q /tmp/libero-plus-dl/assets.zip -d /tmp/libero-plus-dl/extract \
&& mv /tmp/libero-plus-dl/extract/inspire/hdd/project/embodied-multimodality/public/syfei/libero_new/release/dataset/LIBERO-plus-0/assets \
${LIBERO_PLUS_ROOT}/assets \
&& rm -rf /tmp/libero-plus-dl
RUN mkdir -p /home/user_lerobot/.libero \
&& printf "assets: ${LIBERO_PLUS_ROOT}/assets\nbddl_files: ${LIBERO_PLUS_ROOT}/bddl_files\ndatasets: ${LIBERO_PLUS_ROOT}/../datasets\ninit_states: ${LIBERO_PLUS_ROOT}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
@@ -0,0 +1,27 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Benchmark image for MetaWorld integration tests.
# Extends the nightly GPU image (which already has all extras installed)
# with the PR's source code.
#
# Build: docker build -f docker/Dockerfile.benchmark.metaworld -t lerobot-benchmark-metaworld .
# Run: docker run --gpus all --rm lerobot-benchmark-metaworld lerobot-eval ...
FROM huggingface/lerobot-gpu:latest
# Overlay the PR's source code on top of the nightly image.
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
@@ -0,0 +1,39 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM huggingface/lerobot-gpu:latest
ENV NVIDIA_DRIVER_CAPABILITIES=all \
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json
USER root
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
libvulkan1 libvulkan-dev mesa-vulkan-drivers \
&& mkdir -p /usr/share/vulkan/icd.d \
&& echo '{"file_format_version":"1.0.0","ICD":{"library_path":"libGLX_nvidia.so.0","api_version":"1.3.0"}}' \
> /usr/share/vulkan/icd.d/nvidia_icd.json \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
USER user_lerobot
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml uv.lock README.md MANIFEST.in ./
RUN printf 'gymnasium==0.29.1\nnumpy==1.26.4\n' > /tmp/robomme_override.txt \
&& uv pip install --no-cache --override /tmp/robomme_override.txt \
-e ".[smolvla,av-dep]" \
"robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main" \
&& python -c "import robomme; print('robomme import OK')"
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
@@ -0,0 +1,114 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Extract natural-language task descriptions for a benchmark suite.
Runs inside the benchmark Docker container (where the env library is installed)
immediately after lerobot-eval, writing a JSON file that parse_eval_metrics.py
picks up and embeds in metrics.json.
Output format: {"<suite>_<task_idx>": "<nl instruction>", ...}
Usage:
python scripts/ci/extract_task_descriptions.py \\
--env libero --task libero_spatial \\
--output /tmp/eval-artifacts/task_descriptions.json
"""
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path
# LIBERO-plus derives task.language by space-joining the perturbation-variant
# filename, so strip the perturbation metadata blob to recover the base prompt.
_LIBERO_PERTURBATION_TAIL_RE = re.compile(
r"(?:\s(?:view|initstate|noise|add|tb|table|light|level)(?:\s\d+)+)+$"
)
def _strip_libero_perturbation_tail(instruction: str) -> str:
return _LIBERO_PERTURBATION_TAIL_RE.sub("", instruction).strip()
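To make the pattern concrete, here is a standalone copy of it applied to a hypothetical LIBERO-plus style instruction: it strips one or more trailing keyword-plus-digit groups, but leaves a bare keyword (no digits) alone.

```python
import re

# Standalone copy of the perturbation-tail pattern, for illustration:
# one or more trailing groups of a known keyword followed by digits.
TAIL_RE = re.compile(r"(?:\s(?:view|initstate|noise|add|tb|table|light|level)(?:\s\d+)+)+$")

def strip_tail(instruction: str) -> str:
    return TAIL_RE.sub("", instruction).strip()

# hypothetical perturbation-variant instruction
print(strip_tail("pick up the black bowl view 3 light 2"))
# → pick up the black bowl

# "table" without trailing digits is part of the prompt, not metadata
print(strip_tail("put the mug on the table"))
# → put the mug on the table
```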
def _libero_descriptions(task_suite: str) -> dict[str, str]:
from libero.libero import benchmark # type: ignore[import-untyped]
suite_dict = benchmark.get_benchmark_dict()
if task_suite not in suite_dict:
print(
f"[extract_task_descriptions] Unknown LIBERO suite '{task_suite}'. "
f"Available: {list(suite_dict.keys())}",
file=sys.stderr,
)
return {}
suite = suite_dict[task_suite]()
return {
f"{task_suite}_{i}": _strip_libero_perturbation_tail(suite.get_task(i).language)
for i in range(suite.n_tasks)
}
def _metaworld_descriptions(task_name: str) -> dict[str, str]:
# MetaWorld tasks don't expose a separate NL description attribute;
# use a cleaned version of the task name as the description.
label = task_name.removeprefix("metaworld-").replace("-", " ").strip()
return {f"{task_name}_0": label}
def _robomme_descriptions(task_names: str) -> dict[str, str]:
return {
f"{task_name}_0": task_name.replace("_", " ").strip()
for task_name in (task.strip() for task in task_names.split(","))
if task_name
}
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--env", required=True, help="Environment family (libero, metaworld, ...)")
parser.add_argument("--task", required=True, help="Task/suite name (e.g. libero_spatial)")
parser.add_argument("--output", required=True, help="Path to write task_descriptions.json")
args = parser.parse_args()
descriptions: dict[str, str] = {}
try:
if args.env in {"libero", "libero_plus"}:
descriptions = _libero_descriptions(args.task)
elif args.env == "metaworld":
descriptions = _metaworld_descriptions(args.task)
elif args.env == "robomme":
descriptions = _robomme_descriptions(args.task)
else:
print(
f"[extract_task_descriptions] No description extractor for env '{args.env}'.",
file=sys.stderr,
)
except Exception as exc:
print(f"[extract_task_descriptions] Warning: {exc}", file=sys.stderr)
out_path = Path(args.output)
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(descriptions, indent=2))
print(f"[extract_task_descriptions] {len(descriptions)} descriptions → {out_path}")
return 0
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,147 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Parse lerobot-eval output into a small metrics.json artifact.
Reads eval_info.json written by lerobot-eval --output_dir and extracts the
key metrics needed by the health dashboard. Handles both single-task and
multi-task eval output formats.
NOTE: This script runs on the bare CI runner (not inside Docker), so it
must use only Python stdlib modules. Do not add third-party imports.
Usage:
python scripts/ci/parse_eval_metrics.py \\
--artifacts-dir /tmp/libero-artifacts \\
--env libero \\
--task libero_spatial \\
--policy pepijn223/smolvla_libero
Writes <artifacts-dir>/metrics.json. The CI workflow then uploads this file
as a GitHub Actions artifact named "<env>-metrics".
"""
from __future__ import annotations
import argparse
import json
import math
import sys
from pathlib import Path
def _safe_float(v: float | int | None) -> float | None:
if v is None:
return None
f = float(v)
return None if math.isnan(f) else f
def _safe_int(v: float | int | None) -> int | None:
if v is None:
return None
f = float(v)
return None if math.isnan(f) else int(f)
def _extract_metrics(info: dict) -> tuple[float | None, int | None, float | None, float | None]:
"""Extract (pc_success, n_episodes, avg_sum_reward, eval_s) from eval_info.json.
Handles two output shapes:
- Single-task: {"aggregated": {"pc_success": 80.0, ...}}
- Multi-task: {"overall": {"pc_success": 80.0, "n_episodes": 5, ...}}
"""
for key in ("aggregated", "overall"):
if key not in info:
continue
agg = info[key]
pc = agg.get("pc_success")
n = agg.get("n_episodes")
reward = agg.get("avg_sum_reward")
eval_s = agg.get("eval_s")
if pc is not None and not math.isnan(pc):
return (
float(pc),
_safe_int(n),
_safe_float(reward),
_safe_float(eval_s),
)
return None, None, None, None
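A simplified sketch of the shape handling above (success-rate field only): the single-task `"aggregated"` key is checked before the multi-task `"overall"` key, and a missing or NaN value falls through to `None`.

```python
import math

# Simplified sketch of the two eval_info.json shapes handled above.
def extract_pc_success(info: dict):
    for key in ("aggregated", "overall"):
        agg = info.get(key)
        if agg is None:
            continue
        pc = agg.get("pc_success")
        if pc is not None and not math.isnan(pc):
            return float(pc)
    return None

print(extract_pc_success({"aggregated": {"pc_success": 80.0}}))  # → 80.0
print(extract_pc_success({"overall": {"pc_success": 65.0}}))     # → 65.0
print(extract_pc_success({}))                                    # → None
```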
def main() -> int:
parser = argparse.ArgumentParser(
description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument("--artifacts-dir", required=True, help="Path to the mounted artifacts volume")
parser.add_argument("--env", required=True, help="Environment name (e.g. libero)")
parser.add_argument("--task", required=True, help="Task name (e.g. libero_spatial)")
parser.add_argument("--policy", required=True, help="Policy hub path (e.g. pepijn223/smolvla_libero)")
args = parser.parse_args()
artifacts_dir = Path(args.artifacts_dir)
eval_info_path = artifacts_dir / "eval_info.json"
pc_success: float | None = None
n_episodes: int | None = None
avg_sum_reward: float | None = None
eval_s: float | None = None
if eval_info_path.exists():
try:
info = json.loads(eval_info_path.read_text())
pc_success, n_episodes, avg_sum_reward, eval_s = _extract_metrics(info)
except (json.JSONDecodeError, KeyError, TypeError) as exc:
print(f"[parse_eval_metrics] Warning: could not parse eval_info.json: {exc}", file=sys.stderr)
else:
print(
f"[parse_eval_metrics] Warning: {eval_info_path} not found — eval may have failed.",
file=sys.stderr,
)
task_descriptions: dict[str, str] = {}
task_desc_path = artifacts_dir / "task_descriptions.json"
if task_desc_path.exists():
try:
task_descriptions = json.loads(task_desc_path.read_text())
except json.JSONDecodeError as exc:
print(
f"[parse_eval_metrics] Warning: could not parse task_descriptions.json: {exc}",
file=sys.stderr,
)
metrics = {
"env": args.env,
"task": args.task,
"policy": args.policy,
"pc_success": pc_success,
"n_episodes": n_episodes,
"avg_sum_reward": avg_sum_reward,
"eval_s": eval_s,
"task_descriptions": task_descriptions,
}
out_path = artifacts_dir / "metrics.json"
out_path.write_text(json.dumps(metrics, indent=2))
print(f"[parse_eval_metrics] Written: {out_path}")
print(json.dumps(metrics, indent=2))
return 0
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,27 @@
---
title: LeRobot Benchmark Leaderboard
emoji: 🤖
colorFrom: yellow
colorTo: orange
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Benchmark history for LeRobot policy × benchmark runs
---
# LeRobot Benchmark Leaderboard
This Space reads immutable benchmark rows from a Hugging Face dataset and shows:
- Latest result per policy and benchmark
- Historical trends over time
- Direct links to uploaded eval and config artifacts
## Configuration
Set `BENCHMARK_RESULTS_REPO` in the Space settings if you want to point the UI
at a different public dataset. The default is:
- `lerobot/benchmark-history`
@@ -0,0 +1,226 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import json
import os
import time
from pathlib import Path
from typing import Any
import gradio as gr
import pandas as pd
import plotly.express as px
from huggingface_hub import HfApi, hf_hub_download
RESULTS_REPO = os.environ.get("BENCHMARK_RESULTS_REPO", "lerobot/benchmark-history")
CACHE_DIR = Path("/tmp/benchmark-leaderboard-cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)
CACHE_TTL_S = 300
_CACHE: dict[str, tuple[float, pd.DataFrame]] = {}
def _row_to_record(row: dict[str, Any]) -> dict[str, Any]:
overall = row.get("eval", {}).get("overall", {})
resources = row.get("resources", {})
timings = row.get("timings", {})
artifact_urls = row.get("artifact_urls", {})
return {
"created_at": row.get("created_at"),
"benchmark": row.get("benchmark"),
"policy": row.get("policy"),
"success_rate": overall.get("pc_success"),
"n_episodes": overall.get("n_episodes"),
"avg_sum_reward": overall.get("avg_sum_reward"),
"train_wall_time_s": timings.get("train_wall_time_s"),
"eval_wall_time_s": timings.get("eval_wall_time_s"),
"total_wall_time_s": timings.get("total_wall_time_s"),
"num_gpus": resources.get("num_gpus"),
"microbatch_per_gpu": resources.get("microbatch_per_gpu"),
"gradient_accumulation_steps": resources.get("gradient_accumulation_steps"),
"effective_batch_size": resources.get("effective_batch_size"),
"git_commit": row.get("git_commit"),
"row_url": artifact_urls.get("row"),
"eval_info_url": artifact_urls.get("eval_info"),
"train_config_url": artifact_urls.get("train_config"),
}
def load_rows(repo_id: str = RESULTS_REPO) -> pd.DataFrame:
cache_key = f"rows::{repo_id}"
cached = _CACHE.get(cache_key)
if cached is not None and (time.monotonic() - cached[0]) < CACHE_TTL_S:
return cached[1]
api = HfApi()
files = [path for path in api.list_repo_files(repo_id=repo_id, repo_type="dataset") if path.startswith("rows/")]
records: list[dict[str, Any]] = []
for path_in_repo in sorted(files, reverse=True):
local_path = hf_hub_download(repo_id=repo_id, repo_type="dataset", filename=path_in_repo, cache_dir=CACHE_DIR)
with open(local_path) as f:
row = json.load(f)
records.append(_row_to_record(row))
df = pd.DataFrame.from_records(records)
if not df.empty:
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)
df = df.sort_values("created_at", ascending=False).reset_index(drop=True)
_CACHE[cache_key] = (time.monotonic(), df)
return df
def make_latest_table(df: pd.DataFrame) -> pd.DataFrame:
if df.empty:
return df
latest = (
df.sort_values("created_at", ascending=False)
.groupby(["benchmark", "policy"], as_index=False)
.first()
.sort_values(["benchmark", "success_rate"], ascending=[True, False], na_position="last")
)
return latest[
[
"benchmark",
"policy",
"success_rate",
"n_episodes",
"train_wall_time_s",
"eval_wall_time_s",
"num_gpus",
"effective_batch_size",
"git_commit",
"row_url",
"eval_info_url",
"train_config_url",
]
]
def make_history_figure(df: pd.DataFrame, benchmark: str, policy: str | None) -> Any:
filtered = df[df["benchmark"] == benchmark]
if policy and policy != "All":
filtered = filtered[filtered["policy"] == policy]
if filtered.empty:
return px.line(title="No benchmark rows found")
fig = px.line(
filtered.sort_values("created_at"),
x="created_at",
y="success_rate",
color="policy",
markers=True,
hover_data=["git_commit", "num_gpus", "train_wall_time_s", "eval_wall_time_s"],
title=f"{benchmark} success rate history",
)
fig.update_layout(yaxis_title="Success rate (%)", xaxis_title="Run time")
return fig
def make_run_markdown(df: pd.DataFrame, benchmark: str, policy: str | None) -> str:
filtered = df[df["benchmark"] == benchmark]
if policy and policy != "All":
filtered = filtered[filtered["policy"] == policy]
if filtered.empty:
return "No matching runs yet."
latest = filtered.sort_values("created_at", ascending=False).iloc[0]
row_link = latest["row_url"] if pd.notna(latest["row_url"]) else None
eval_link = latest["eval_info_url"] if pd.notna(latest["eval_info_url"]) else None
train_link = latest["train_config_url"] if pd.notna(latest["train_config_url"]) else None
lines = [
f"Latest run: `{latest['policy']}` on `{latest['benchmark']}`",
f"Success rate: `{latest['success_rate']}`",
f"GPUs: `{latest['num_gpus']}`",
f"Effective batch size: `{latest['effective_batch_size']}`",
f"Commit: `{latest['git_commit']}`",
]
if row_link:
lines.append(f"Row JSON: [open]({row_link})")
if eval_link:
lines.append(f"Eval Info: [open]({eval_link})")
if train_link:
lines.append(f"Train Config: [open]({train_link})")
return "\n\n".join(lines)
def refresh_view(benchmark: str, policy: str) -> tuple[pd.DataFrame, dict[str, Any], Any, str]:
df = load_rows()
latest_table = make_latest_table(df)
benchmark_names = sorted(df["benchmark"].dropna().unique().tolist()) if not df.empty else []
if benchmark not in benchmark_names and benchmark_names:
benchmark = benchmark_names[0]
policy_choices = ["All"]
if benchmark and not df.empty:
policy_choices.extend(sorted(df[df["benchmark"] == benchmark]["policy"].dropna().unique().tolist()))
if policy not in policy_choices:
policy = "All"
history = make_history_figure(df, benchmark, policy)
summary = make_run_markdown(df, benchmark, policy)
return latest_table, gr.update(choices=policy_choices, value=policy), history, summary
with gr.Blocks(title="LeRobot Benchmark Leaderboard") as demo:
gr.Markdown(
f"""
# LeRobot Benchmark Leaderboard
Results dataset: [`{RESULTS_REPO}`](https://huggingface.co/datasets/{RESULTS_REPO})
"""
)
with gr.Row():
benchmark_dropdown = gr.Dropdown(label="Benchmark", choices=[])
policy_dropdown = gr.Dropdown(label="Policy", choices=["All"], value="All")
refresh_button = gr.Button("Refresh")
latest_table = gr.Dataframe(label="Latest Results", interactive=False)
history_plot = gr.Plot(label="History")
latest_summary = gr.Markdown()
def _initial_state():
df = load_rows()
benchmarks = sorted(df["benchmark"].dropna().unique().tolist()) if not df.empty else []
benchmark = benchmarks[0] if benchmarks else ""
latest, policy_choices, history, summary = refresh_view(benchmark, "All")
return (
gr.update(choices=benchmarks, value=benchmark),
policy_choices,
latest,
history,
summary,
)
demo.load(
_initial_state,
outputs=[benchmark_dropdown, policy_dropdown, latest_table, history_plot, latest_summary],
)
refresh_button.click(
refresh_view,
inputs=[benchmark_dropdown, policy_dropdown],
outputs=[latest_table, policy_dropdown, history_plot, latest_summary],
)
benchmark_dropdown.change(
refresh_view,
inputs=[benchmark_dropdown, policy_dropdown],
outputs=[latest_table, policy_dropdown, history_plot, latest_summary],
)
policy_dropdown.change(
refresh_view,
inputs=[benchmark_dropdown, policy_dropdown],
outputs=[latest_table, policy_dropdown, history_plot, latest_summary],
)
if __name__ == "__main__":
demo.launch()
@@ -0,0 +1,4 @@
gradio>=5.0.0,<6.0.0
plotly>=5.18.0
pandas>=2.0.0
huggingface-hub>=1.0.0,<2.0.0
+6
@@ -67,11 +67,17 @@ class EvalConfig:
# `batch_size` specifies the number of environments to use in a gym.vector.VectorEnv.
# Set to 0 for auto-tuning based on available CPU cores and n_episodes.
batch_size: int = 0
# Number of rollout videos to save per evaluated task. Set to 0 to disable videos.
max_episodes_rendered: int = 10
# `use_async_envs` specifies whether to use asynchronous environments (multiprocessing).
# Defaults to True; automatically downgraded to SyncVectorEnv when batch_size=1.
use_async_envs: bool = True
def __post_init__(self) -> None:
if self.max_episodes_rendered < 0:
raise ValueError(
f"`max_episodes_rendered` must be non-negative, got {self.max_episodes_rendered}."
)
if self.batch_size == 0:
self.batch_size = self._auto_batch_size()
if self.batch_size > self.n_episodes:
+6
@@ -56,6 +56,7 @@ class TrainPipelineConfig(HubMixin):
# Number of workers for the dataloader.
num_workers: int = 4
batch_size: int = 8
gradient_accumulation_steps: int = 1
steps: int = 100_000
eval_freq: int = 20_000
log_freq: int = 200
@@ -132,6 +133,11 @@ class TrainPipelineConfig(HubMixin):
if isinstance(self.dataset.repo_id, list):
raise NotImplementedError("LeRobotMultiDataset is not currently implemented.")
if self.gradient_accumulation_steps <= 0:
raise ValueError(
f"`gradient_accumulation_steps` must be strictly positive, got {self.gradient_accumulation_steps}."
)
if not self.use_policy_training_preset and (self.optimizer is None or self.scheduler is None):
raise ValueError("Optimizer and Scheduler must be set when the policy presets are not used.")
elif self.use_policy_training_preset and not self.resume:
+11 -1
@@ -18,7 +18,15 @@
# from lerobot.utils.import_utils import require_package
# require_package("gymnasium", extra="<update_extra>", import_name="gymnasium")
from .configs import AlohaEnv, EnvConfig, HILSerlRobotEnvConfig, HubEnvConfig, PushtEnv
from .configs import (
AlohaEnv,
EnvConfig,
HILSerlRobotEnvConfig,
HubEnvConfig,
LiberoPlusEnv,
PushtEnv,
RoboMMEEnv,
)
from .factory import make_env, make_env_config, make_env_pre_post_processors
from .utils import check_env_attributes_and_types, close_envs, env_to_policy_features, preprocess_observation
@@ -27,7 +35,9 @@ __all__ = [
"EnvConfig",
"HILSerlRobotEnvConfig",
"HubEnvConfig",
"LiberoPlusEnv",
"PushtEnv",
"RoboMMEEnv",
"check_env_attributes_and_types",
"close_envs",
"env_to_policy_features",
+55
@@ -574,3 +574,58 @@ class IsaaclabArenaEnv(HubEnvConfig):
),
PolicyProcessorPipeline(steps=[]),
)
@EnvConfig.register_subclass("libero_plus")
@dataclass
class LiberoPlusEnv(LiberoEnv):
"""Config for LIBERO-plus robustness benchmark evaluation."""
task: str = "libero_spatial"
@EnvConfig.register_subclass("robomme")
@dataclass
class RoboMMEEnv(EnvConfig):
"""RoboMME memory-augmented manipulation benchmark."""
task: str = "PickXtimes"
fps: int = 10
episode_length: int = 300
action_space: str = "joint_angle"
dataset_split: str = "test"
task_ids: list[int] | None = None
features: dict[str, PolicyFeature] = field(
default_factory=lambda: {
ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(8,)),
"image": PolicyFeature(type=FeatureType.VISUAL, shape=(256, 256, 3)),
"wrist_image": PolicyFeature(type=FeatureType.VISUAL, shape=(256, 256, 3)),
OBS_STATE: PolicyFeature(type=FeatureType.STATE, shape=(8,)),
}
)
features_map: dict[str, str] = field(
default_factory=lambda: {
ACTION: ACTION,
"image": f"{OBS_IMAGES}.image",
"wrist_image": f"{OBS_IMAGES}.wrist_image",
OBS_STATE: OBS_STATE,
}
)
@property
def gym_kwargs(self) -> dict:
return {}
def create_envs(self, n_envs: int, use_async_envs: bool = True):
from .robomme import create_robomme_envs
env_cls = _make_vec_env_cls(use_async_envs, n_envs)
return create_robomme_envs(
task=self.task,
n_envs=n_envs,
action_space_type=self.action_space,
dataset=self.dataset_split,
episode_length=self.episode_length,
task_ids=self.task_ids,
env_cls=env_cls,
)
+22 -7
@@ -16,6 +16,7 @@
from __future__ import annotations
import os
import re
from collections import defaultdict
from collections.abc import Callable, Iterable, Mapping, Sequence
from functools import partial
@@ -69,14 +70,28 @@ def _select_task_ids(total_tasks: int, task_ids: Iterable[int] | None) -> list[i
return ids
# LIBERO-plus perturbation variants encode the perturbation in the filename
# but on disk only the base `.pruned_init` exists — strip the suffix to match
# LIBERO-plus's own suite.get_task_init_states() (we reimplement it here so we
# can pass weights_only=False for PyTorch 2.6+ numpy pickles).
_LIBERO_PERTURBATION_SUFFIX_RE = re.compile(r"_(?:language|view|light)_[^.]*|_(?:table|tb)_\d+")
def get_task_init_states(task_suite: Any, i: int) -> np.ndarray:
init_states_path = (
Path(get_libero_path("init_states"))
/ task_suite.tasks[i].problem_folder
/ task_suite.tasks[i].init_states_file
)
init_states = torch.load(init_states_path, weights_only=False) # nosec B614
return init_states
task = task_suite.tasks[i]
filename = Path(task.init_states_file)
root = Path(get_libero_path("init_states"))
# `_add_` / `_level` variants store extra-object layouts under libero_newobj/
# as a flat array that must be reshaped to (1, -1).
if "_add_" in filename.name or "_level" in filename.name:
init_states_path = root / "libero_newobj" / task.problem_folder / filename.name
init_states = torch.load(init_states_path, weights_only=False) # nosec B614
return init_states.reshape(1, -1)
stripped = _LIBERO_PERTURBATION_SUFFIX_RE.sub("", filename.stem) + filename.suffix
init_states_path = root / task.problem_folder / stripped
return torch.load(init_states_path, weights_only=False) # nosec B614
def get_libero_dummy_action():
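The suffix-stripping behavior of `_LIBERO_PERTURBATION_SUFFIX_RE` can be checked in isolation. The task stems below are made-up illustrations, not real LIBERO-plus filenames:

```python
import re

# Same pattern as _LIBERO_PERTURBATION_SUFFIX_RE above.
pattern = re.compile(r"_(?:language|view|light)_[^.]*|_(?:table|tb)_\d+")

# Hypothetical perturbed stems -> base stems that exist on disk.
print(pattern.sub("", "put_the_bowl_on_the_plate_view_17"))  # put_the_bowl_on_the_plate
print(pattern.sub("", "open_the_drawer_light_dim"))          # open_the_drawer
print(pattern.sub("", "stack_the_cubes_tb_3"))               # stack_the_cubes
```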
+209
@@ -0,0 +1,209 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RoboMME environment wrapper for LeRobot evaluation."""
from __future__ import annotations
from collections.abc import Callable, Sequence
from functools import partial
from typing import Any
import gymnasium as gym
import numpy as np
from gymnasium import spaces
ROBOMME_TASKS = [
"BinFill",
"PickXtimes",
"SwingXtimes",
"StopCube",
"VideoUnmask",
"VideoUnmaskSwap",
"ButtonUnmask",
"ButtonUnmaskSwap",
"PickHighlight",
"VideoRepick",
"VideoPlaceButton",
"VideoPlaceOrder",
"MoveCube",
"InsertPeg",
"PatternLock",
"RouteStick",
]
class RoboMMEGymEnv(gym.Env):
"""Thin Gymnasium wrapper around a single RoboMME episode env."""
metadata = {"render_modes": ["rgb_array"], "render_fps": 10}
def __init__(
self,
task: str = "PickXtimes",
action_space_type: str = "joint_angle",
dataset: str = "test",
episode_idx: int = 0,
max_steps: int = 300,
):
super().__init__()
from robomme.env_record_wrapper import BenchmarkEnvBuilder
self._builder = BenchmarkEnvBuilder(
env_id=task,
dataset=dataset,
action_space=action_space_type,
gui_render=False,
max_steps=max_steps,
)
self._max_episode_steps = max_steps
self._episode_idx = episode_idx
self._max_steps = max_steps
self._env = None
self._last_raw_obs: dict | None = None
action_dim = 8 if action_space_type == "joint_angle" else 7
self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(action_dim,), dtype=np.float32)
self.observation_space = spaces.Dict(
{
"image": spaces.Box(0, 255, shape=(256, 256, 3), dtype=np.uint8),
"wrist_image": spaces.Box(0, 255, shape=(256, 256, 3), dtype=np.uint8),
"state": spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32),
}
)
def reset(self, *, seed=None, options=None):
super().reset(seed=seed)
self._env = self._builder.make_env_for_episode(
episode_idx=self._episode_idx,
max_steps=self._max_steps,
)
obs, info = self._env.reset()
self._last_raw_obs = obs
return self._convert_obs(obs), self._convert_info(info)
def step(self, action):
obs, reward, terminated, truncated, info = self._env.step(action)
self._last_raw_obs = obs
terminated_bool = bool(terminated.item()) if hasattr(terminated, "item") else bool(terminated)
truncated_bool = bool(truncated.item()) if hasattr(truncated, "item") else bool(truncated)
status = info.get("status", "ongoing")
conv_info = self._convert_info(info)
conv_info["is_success"] = status == "success"
return self._convert_obs(obs), float(reward), terminated_bool, truncated_bool, conv_info
def render(self) -> np.ndarray | None:
if self._last_raw_obs is None:
return np.zeros((256, 256, 3), dtype=np.uint8)
front = self._last_raw_obs.get("front_rgb_list")
if front is None:
return np.zeros((256, 256, 3), dtype=np.uint8)
frame = front[-1] if isinstance(front, list) else front
return np.asarray(frame, dtype=np.uint8)
def _convert_obs(self, obs: dict) -> dict:
front_rgb = (
obs["front_rgb_list"][-1] if isinstance(obs["front_rgb_list"], list) else obs["front_rgb_list"]
)
wrist_rgb = (
obs["wrist_rgb_list"][-1] if isinstance(obs["wrist_rgb_list"], list) else obs["wrist_rgb_list"]
)
joint_state = (
obs["joint_state_list"][-1]
if isinstance(obs["joint_state_list"], list)
else obs["joint_state_list"]
)
gripper_state = (
obs["gripper_state_list"][-1]
if isinstance(obs["gripper_state_list"], list)
else obs["gripper_state_list"]
)
joint = np.asarray(joint_state, dtype=np.float32).flatten()[:7]
gripper = np.asarray(gripper_state, dtype=np.float32).flatten()[:1]
state = np.concatenate([joint, gripper])
return {
"image": np.asarray(front_rgb, dtype=np.uint8),
"wrist_image": np.asarray(wrist_rgb, dtype=np.uint8),
"state": state,
}
def _convert_info(self, info: dict) -> dict:
return {
"status": info.get("status", "ongoing"),
"task_goal": info.get("task_goal", ""),
}
def _make_env_fns(
*,
task: str,
n_envs: int,
action_space_type: str,
dataset: str,
episode_length: int,
task_id: int,
) -> list[Callable[[], RoboMMEGymEnv]]:
def _make_one(episode_index: int) -> RoboMMEGymEnv:
return RoboMMEGymEnv(
task=task,
action_space_type=action_space_type,
dataset=dataset,
episode_idx=episode_index,
max_steps=episode_length,
)
return [partial(_make_one, task_id + i) for i in range(n_envs)]
def create_robomme_envs(
task: str,
n_envs: int = 1,
action_space_type: str = "joint_angle",
dataset: str = "test",
episode_length: int = 300,
task_ids: list[int] | None = None,
env_cls: Callable[[Sequence[Callable[[], Any]]], Any] | None = None,
) -> dict[str, dict[int, gym.vector.VectorEnv]]:
"""Create vectorized RoboMME environments for evaluation."""
if env_cls is None or not callable(env_cls):
raise ValueError("env_cls must be a callable that wraps a list of env factory callables.")
if not isinstance(n_envs, int) or n_envs <= 0:
raise ValueError(f"n_envs must be a positive int; got {n_envs}.")
if task_ids is None:
task_ids = [0]
task_names = [t.strip() for t in task.split(",") if t.strip()]
out: dict[str, dict[int, gym.vector.VectorEnv]] = {}
for task_name in task_names:
envs_by_task: dict[int, gym.vector.VectorEnv] = {}
for task_id in task_ids:
fns = _make_env_fns(
task=task_name,
n_envs=n_envs,
action_space_type=action_space_type,
dataset=dataset,
episode_length=episode_length,
task_id=task_id,
)
envs_by_task[task_id] = env_cls(fns)
out[task_name] = envs_by_task
return out
+8
@@ -216,6 +216,14 @@ class FeetechMotorsBus(SerialMotorsBus):
self.write("Maximum_Acceleration", motor, maximum_acceleration)
self.write("Acceleration", motor, acceleration)
# Clear bit 4 (0x10) of the Phase register (0x12) to set angle feedback mode to 0.
# This forces position readings to be in the range [0, resolution - 1] and prevents overflow or negative values.
# Only known to be necessary for the STS3215.
if self.motors[motor].model == "sts3215":
phase = self.read("Phase", motor, normalize=False)
if phase & 0x10:
self.write("Phase", motor, phase & ~0x10)
@property
def is_calibrated(self) -> bool:
motors_calibration = self.read_calibration()
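The bit manipulation in the Phase fix can be sketched on its own; the register values here are illustrative, not dumps from real STS3215 hardware:

```python
ANGLE_FEEDBACK_BIT = 0x10  # bit 4 of the Phase register (address 0x12)

def clear_angle_feedback(phase: int) -> int:
    """Clear bit 4 so position readings stay in [0, resolution - 1]."""
    return phase & ~ANGLE_FEEDBACK_BIT

print(hex(clear_angle_feedback(0x1C)))  # 0xc  (bit 4 was set and is cleared)
print(hex(clear_angle_feedback(0x0C)))  # 0xc  (bit 4 already clear: no-op)
```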
+27 -24
@@ -175,33 +175,36 @@ def actor_cli(cfg: TrainRLServerPipelineConfig):
interactions_process.start()
receive_policy_process.start()
act_with_policy(
cfg=cfg,
shutdown_event=shutdown_event,
parameters_queue=parameters_queue,
transitions_queue=transitions_queue,
interactions_queue=interactions_queue,
)
logging.info("[ACTOR] Policy process joined")
try:
act_with_policy(
cfg=cfg,
shutdown_event=shutdown_event,
parameters_queue=parameters_queue,
transitions_queue=transitions_queue,
interactions_queue=interactions_queue,
)
logging.info("[ACTOR] Policy loop finished")
except Exception:
logging.exception("[ACTOR] Unhandled exception in act_with_policy")
shutdown_event.set()
finally:
logging.info("[ACTOR] Closing queues")
transitions_queue.close()
interactions_queue.close()
parameters_queue.close()
logging.info("[ACTOR] Closing queues")
transitions_queue.close()
interactions_queue.close()
parameters_queue.close()
transitions_process.join()
logging.info("[ACTOR] Transitions process joined")
interactions_process.join()
logging.info("[ACTOR] Interactions process joined")
receive_policy_process.join()
logging.info("[ACTOR] Receive policy process joined")
transitions_process.join()
logging.info("[ACTOR] Transitions process joined")
interactions_process.join()
logging.info("[ACTOR] Interactions process joined")
receive_policy_process.join()
logging.info("[ACTOR] Receive policy process joined")
transitions_queue.cancel_join_thread()
interactions_queue.cancel_join_thread()
parameters_queue.cancel_join_thread()
logging.info("[ACTOR] join queues")
transitions_queue.cancel_join_thread()
interactions_queue.cancel_join_thread()
parameters_queue.cancel_join_thread()
logging.info("[ACTOR] queues closed")
logging.info("[ACTOR] Cleanup complete")
# Core algorithm functions
+24 -21
@@ -218,30 +218,33 @@ def start_learner_threads(
)
communication_process.start()
add_actor_information_and_train(
cfg=cfg,
wandb_logger=wandb_logger,
shutdown_event=shutdown_event,
transition_queue=transition_queue,
interaction_message_queue=interaction_message_queue,
parameters_queue=parameters_queue,
)
logging.info("[LEARNER] Training process stopped")
try:
add_actor_information_and_train(
cfg=cfg,
wandb_logger=wandb_logger,
shutdown_event=shutdown_event,
transition_queue=transition_queue,
interaction_message_queue=interaction_message_queue,
parameters_queue=parameters_queue,
)
logging.info("[LEARNER] Training process stopped")
except Exception:
logging.exception("[LEARNER] Unhandled exception in training loop")
shutdown_event.set()
finally:
logging.info("[LEARNER] Closing queues")
transition_queue.close()
interaction_message_queue.close()
parameters_queue.close()
logging.info("[LEARNER] Closing queues")
transition_queue.close()
interaction_message_queue.close()
parameters_queue.close()
communication_process.join()
logging.info("[LEARNER] Communication process joined")
communication_process.join()
logging.info("[LEARNER] Communication process joined")
transition_queue.cancel_join_thread()
interaction_message_queue.cancel_join_thread()
parameters_queue.cancel_join_thread()
logging.info("[LEARNER] join queues")
transition_queue.cancel_join_thread()
interaction_message_queue.cancel_join_thread()
parameters_queue.cancel_join_thread()
logging.info("[LEARNER] queues closed")
logging.info("[LEARNER] Cleanup complete")
# Core algorithm functions
+1 -1
@@ -572,7 +572,7 @@ def eval_main(cfg: EvalPipelineConfig):
preprocessor=preprocessor,
postprocessor=postprocessor,
n_episodes=cfg.eval.n_episodes,
max_episodes_rendered=10,
max_episodes_rendered=cfg.eval.max_episodes_rendered,
videos_dir=Path(cfg.output_dir) / "videos",
start_seed=cfg.seed,
max_parallel_tasks=cfg.env.max_parallel_tasks,
+98 -39
@@ -71,6 +71,9 @@ def update_policy(
lr_scheduler=None,
lock=None,
rabc_weights_provider=None,
*,
do_optimizer_step: bool = True,
loss_divisor: int = 1,
) -> tuple[MetricsTracker, dict]:
"""
Performs a single training step to update the policy's weights.
@@ -122,34 +125,38 @@ def update_policy(
loss, output_dict = policy.forward(batch)
# TODO(rcadene): policy.unnormalize_outputs(out_dict)
logged_loss = loss.detach()
if loss_divisor > 1:
loss = loss / loss_divisor
# Use accelerator's backward method
accelerator.backward(loss)
# Clip gradients if specified
if grad_clip_norm > 0:
grad_norm = accelerator.clip_grad_norm_(policy.parameters(), grad_clip_norm)
else:
grad_norm = torch.nn.utils.clip_grad_norm_(
policy.parameters(), float("inf"), error_if_nonfinite=False
)
grad_norm_value = 0.0
if do_optimizer_step:
if grad_clip_norm > 0:
grad_norm = accelerator.clip_grad_norm_(policy.parameters(), grad_clip_norm)
else:
grad_norm = torch.nn.utils.clip_grad_norm_(
policy.parameters(), float("inf"), error_if_nonfinite=False
)
grad_norm_value = grad_norm.item()
# Optimizer step
with lock if lock is not None else nullcontext():
optimizer.step()
with lock if lock is not None else nullcontext():
optimizer.step()
optimizer.zero_grad()
optimizer.zero_grad()
# Step through pytorch scheduler at every batch instead of epoch
if lr_scheduler is not None:
lr_scheduler.step()
# Step through pytorch scheduler at every optimizer step instead of epoch
if lr_scheduler is not None:
lr_scheduler.step()
# Update internal buffers if policy has update method
if has_method(accelerator.unwrap_model(policy, keep_fp32_wrapper=True), "update"):
accelerator.unwrap_model(policy, keep_fp32_wrapper=True).update()
# Update internal buffers if policy has update method
if has_method(accelerator.unwrap_model(policy, keep_fp32_wrapper=True), "update"):
accelerator.unwrap_model(policy, keep_fp32_wrapper=True).update()
train_metrics.loss = loss.item()
train_metrics.grad_norm = grad_norm.item()
train_metrics.loss = logged_loss.item()
train_metrics.grad_norm = grad_norm_value
train_metrics.lr = optimizer.param_groups[0]["lr"]
train_metrics.update_s = time.perf_counter() - start_time
return train_metrics, output_dict
@@ -359,8 +366,16 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
logging.info(f"{dataset.num_frames=} ({format_big_number(dataset.num_frames)})")
logging.info(f"{dataset.num_episodes=}")
num_processes = accelerator.num_processes
effective_bs = cfg.batch_size * num_processes
logging.info(f"Effective batch size: {cfg.batch_size} x {num_processes} = {effective_bs}")
micro_batch = cfg.batch_size
logical_batch = cfg.batch_size * cfg.gradient_accumulation_steps
effective_bs = logical_batch * num_processes
logging.info(
"Effective batch size: %s x %s x %s = %s",
micro_batch,
cfg.gradient_accumulation_steps,
num_processes,
effective_bs,
)
logging.info(f"{num_learnable_params=} ({format_big_number(num_learnable_params)})")
logging.info(f"{num_total_params=} ({format_big_number(num_total_params)})")
@@ -407,9 +422,10 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
}
# Keep global batch size for logging; MetricsTracker handles world size internally.
effective_batch_size = cfg.batch_size * accelerator.num_processes
logical_batch_size = cfg.batch_size * cfg.gradient_accumulation_steps
effective_batch_size = logical_batch_size * accelerator.num_processes
train_tracker = MetricsTracker(
cfg.batch_size,
logical_batch_size,
dataset.num_frames,
dataset.num_episodes,
train_metrics,
@@ -431,21 +447,62 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
)
for _ in range(step, cfg.steps):
start_time = time.perf_counter()
batch = next(dl_iter)
batch = preprocessor(batch)
train_tracker.dataloading_s = time.perf_counter() - start_time
step_dataloading_s = 0.0
step_update_s = 0.0
step_losses = []
step_grad_norm = 0.0
step_lr = optimizer.param_groups[0]["lr"]
output_dict = {}
optimizer.zero_grad()
for accumulation_idx in range(cfg.gradient_accumulation_steps):
start_time = time.perf_counter()
batch = next(dl_iter)
batch = preprocessor(batch)
step_dataloading_s += time.perf_counter() - start_time
train_tracker, output_dict = update_policy(
train_tracker,
policy,
batch,
optimizer,
cfg.optimizer.grad_clip_norm,
accelerator=accelerator,
lr_scheduler=lr_scheduler,
rabc_weights_provider=rabc_weights,
)
is_last_microbatch = accumulation_idx == cfg.gradient_accumulation_steps - 1
micro_metrics = MetricsTracker(
cfg.batch_size,
dataset.num_frames,
dataset.num_episodes,
{
"loss": AverageMeter("loss", ":.3f"),
"grad_norm": AverageMeter("grdn", ":.3f"),
"lr": AverageMeter("lr", ":0.1e"),
"update_s": AverageMeter("updt_s", ":.3f"),
},
accelerator=accelerator,
)
sync_context = (
nullcontext()
if is_last_microbatch or accelerator.num_processes == 1
else accelerator.no_sync(policy)
)
with sync_context:
micro_metrics, micro_output_dict = update_policy(
micro_metrics,
policy,
batch,
optimizer,
cfg.optimizer.grad_clip_norm,
accelerator=accelerator,
lr_scheduler=lr_scheduler if is_last_microbatch else None,
rabc_weights_provider=rabc_weights,
do_optimizer_step=is_last_microbatch,
loss_divisor=cfg.gradient_accumulation_steps,
)
step_update_s += micro_metrics.update_s.val
step_losses.append(micro_metrics.loss.val)
if is_last_microbatch:
step_grad_norm = micro_metrics.grad_norm.val
step_lr = micro_metrics.lr.val
output_dict = micro_output_dict
train_tracker.loss = sum(step_losses) / len(step_losses)
train_tracker.grad_norm = step_grad_norm
train_tracker.lr = step_lr
train_tracker.update_s = step_update_s
train_tracker.dataloading_s = step_dataloading_s
# Note: eval and checkpoint happens *after* the `step`th training update has completed, so we
# increment `step` here.
@@ -510,7 +567,7 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
postprocessor=postprocessor,
n_episodes=cfg.eval.n_episodes,
videos_dir=cfg.output_dir / "eval" / f"videos_step_{step_id}",
max_episodes_rendered=4,
max_episodes_rendered=cfg.eval.max_episodes_rendered,
start_seed=cfg.seed,
max_parallel_tasks=cfg.env.max_parallel_tasks,
)
@@ -541,7 +598,9 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
if wandb_logger:
wandb_log_dict = {**eval_tracker.to_dict(), **eval_info}
wandb_logger.log_dict(wandb_log_dict, step, mode="eval")
wandb_logger.log_video(eval_info["overall"]["video_paths"][0], step, mode="eval")
video_paths = eval_info["overall"].get("video_paths", [])
if video_paths:
wandb_logger.log_video(video_paths[0], step, mode="eval")
accelerator.wait_for_everyone()
+70
@@ -0,0 +1,70 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations
import json
from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from huggingface_hub import HfApi
def utc_timestamp_slug(now: datetime | None = None) -> str:
current = now or datetime.now(UTC)
return current.strftime("%Y%m%dT%H%M%SZ")
def make_hub_file_url(repo_id: str, path_in_repo: str, repo_type: str = "dataset") -> str:
prefix = "datasets/" if repo_type == "dataset" else ""
return f"https://huggingface.co/{prefix}{repo_id}/resolve/main/{path_in_repo}"
def write_json(path: Path, payload: dict[str, Any]) -> None:
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(payload, indent=2, sort_keys=True))
@dataclass(frozen=True)
class UploadTarget:
local_path: Path
path_in_repo: str
def upload_targets(
repo_id: str,
targets: list[UploadTarget],
*,
repo_type: str = "dataset",
token: str | None = None,
private: bool | None = None,
commit_message: str | None = None,
) -> dict[str, str]:
api = HfApi(token=token)
api.create_repo(repo_id=repo_id, repo_type=repo_type, private=private, exist_ok=True)
uploaded: dict[str, str] = {}
for target in targets:
api.upload_file(
path_or_fileobj=str(target.local_path),
path_in_repo=target.path_in_repo,
repo_id=repo_id,
repo_type=repo_type,
commit_message=commit_message or f"Upload {target.path_in_repo}",
)
uploaded[target.path_in_repo] = make_hub_file_url(repo_id, target.path_in_repo, repo_type=repo_type)
return uploaded
+142
@@ -0,0 +1,142 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from benchmarks.run_benchmark_matrix import (
PlannedJob,
compute_gradient_accumulation_steps,
plan_jobs,
render_sbatch_script,
write_manifest,
)
def _one_job(job_list: list[PlannedJob]) -> PlannedJob:
assert len(job_list) == 1
return job_list[0]
def test_compute_gradient_accumulation_steps_for_fixed_effective_batch():
assert compute_gradient_accumulation_steps(
effective_batch_size=256,
num_gpus=8,
microbatch_per_gpu=32,
) == 1
assert compute_gradient_accumulation_steps(
effective_batch_size=256,
num_gpus=4,
microbatch_per_gpu=32,
) == 2
assert compute_gradient_accumulation_steps(
effective_batch_size=256,
num_gpus=1,
microbatch_per_gpu=32,
) == 8
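Consistent with the assertions above, the helper presumably reduces to integer division of the target effective batch size by the per-optimizer-step throughput. A minimal sketch, not the actual implementation in `run_benchmark_matrix.py`:

```python
def compute_gradient_accumulation_steps(
    effective_batch_size: int, num_gpus: int, microbatch_per_gpu: int
) -> int:
    # effective = microbatch_per_gpu * num_gpus * accumulation_steps
    per_step = num_gpus * microbatch_per_gpu
    if effective_batch_size % per_step != 0:
        raise ValueError(f"{effective_batch_size} is not divisible by {per_step}")
    return effective_batch_size // per_step

print(compute_gradient_accumulation_steps(256, 8, 32))  # 1
print(compute_gradient_accumulation_steps(256, 1, 32))  # 8
```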
def test_plan_jobs_filters_libero_plus_only(tmp_path):
    jobs = plan_jobs(
        output_dir=tmp_path,
        hub_org="lerobot",
        results_repo="lerobot/benchmark-history",
        policies=["pi0", "act"],
        benchmarks=["libero_plus"],
    )
    assert [job.benchmark for job in jobs] == ["libero_plus", "libero_plus"]
    assert [job.policy for job in jobs] == ["pi0", "act"]


def test_plan_jobs_includes_libero_plus_and_robomme(tmp_path):
    jobs = plan_jobs(
        output_dir=tmp_path,
        hub_org="lerobot",
        results_repo="lerobot/benchmark-history",
        policies=["pi0"],
        benchmarks=["libero_plus", "robomme"],
    )
    assert [job.benchmark for job in jobs] == ["libero_plus", "robomme"]
    assert jobs[0].effective_batch_size == 256
    assert jobs[1].effective_batch_size == 256


def test_plan_jobs_sets_expected_gpu_and_accumulation(tmp_path):
    jobs = plan_jobs(
        output_dir=tmp_path,
        hub_org="lerobot",
        results_repo="lerobot/benchmark-history",
        policies=["pi0", "xvla", "act"],
        benchmarks=["robomme"],
    )
    by_policy = {job.policy: job for job in jobs}
    assert by_policy["pi0"].num_gpus == 8
    assert by_policy["pi0"].gradient_accumulation_steps == 1
    assert by_policy["xvla"].num_gpus == 4
    assert by_policy["xvla"].gradient_accumulation_steps == 2
    assert by_policy["act"].num_gpus == 1
    assert by_policy["act"].gradient_accumulation_steps == 8


def test_render_sbatch_script_contains_train_eval_and_publish(tmp_path):
    job = _one_job(
        plan_jobs(
            output_dir=tmp_path,
            hub_org="lerobot",
            results_repo="lerobot/benchmark-history",
            policies=["pi0_fast"],
            benchmarks=["robomme"],
        )
    )
    script = render_sbatch_script(
        job=job,
        output_dir=tmp_path,
        results_repo_id="lerobot/benchmark-history",
        git_commit="deadbeef",
    )
    assert "docker/Dockerfile" not in script
    assert "lerobot-benchmark-robomme:latest" in script
    assert '--dataset.repo_id="lerobot/robomme"' in script
    assert '--env.type="robomme"' in script
    assert "--gradient_accumulation_steps=1" in script
    assert "lerobot-train-tokenizer" in script
    assert "benchmarks/publish_benchmark_result.py" in script


def test_write_manifest_records_job_metadata(tmp_path):
    jobs = plan_jobs(
        output_dir=tmp_path,
        hub_org="lerobot",
        results_repo="lerobot/benchmark-history",
        policies=["pi0"],
        benchmarks=["libero_plus", "robomme"],
    )
    manifest_path = write_manifest(
        output_dir=tmp_path,
        jobs=jobs,
        git_commit="deadbeef",
        hub_org="lerobot",
        results_repo="lerobot/benchmark-history",
    )
    manifest = json.loads(manifest_path.read_text())
    assert manifest["git_commit"] == "deadbeef"
    assert manifest["results_repo"] == "lerobot/benchmark-history"
    assert [job["benchmark"] for job in manifest["jobs"]] == ["libero_plus", "robomme"]
@@ -0,0 +1,123 @@
#!/usr/bin/env python
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import annotations

import sys
from types import ModuleType
from unittest.mock import MagicMock

import numpy as np


def _install_robomme_stub():
    stub = ModuleType("robomme")
    wrapper_stub = ModuleType("robomme.env_record_wrapper")

    class FakeBuilder:
        def __init__(self, **kwargs):
            pass

        def make_env_for_episode(self, episode_idx: int, max_steps: int):
            env = MagicMock()
            obs = {
                "front_rgb_list": [np.zeros((256, 256, 3), dtype=np.uint8)],
                "wrist_rgb_list": [np.zeros((256, 256, 3), dtype=np.uint8)],
                "joint_state_list": [np.zeros(7, dtype=np.float32)],
                "gripper_state_list": [np.zeros(2, dtype=np.float32)],
            }
            env.reset.return_value = (obs, {"status": "ongoing", "task_goal": "pick the cube"})
            env.step.return_value = (obs, 0.0, False, False, {"status": "ongoing", "task_goal": ""})
            return env

    wrapper_stub.BenchmarkEnvBuilder = FakeBuilder
    stub.env_record_wrapper = wrapper_stub
    sys.modules["robomme"] = stub
    sys.modules["robomme.env_record_wrapper"] = wrapper_stub


def _uninstall_robomme_stub():
    sys.modules.pop("robomme", None)
    sys.modules.pop("robomme.env_record_wrapper", None)


def test_robomme_env_config_defaults():
    from lerobot.envs.configs import RoboMMEEnv

    cfg = RoboMMEEnv()
    assert cfg.task == "PickXtimes"
    assert cfg.fps == 10
    assert cfg.episode_length == 300
    assert cfg.action_space == "joint_angle"
    assert cfg.dataset_split == "test"
    assert cfg.task_ids is None


def test_robomme_features_map():
    from lerobot.envs.configs import RoboMMEEnv
    from lerobot.utils.constants import ACTION, OBS_IMAGES, OBS_STATE

    cfg = RoboMMEEnv()
    assert cfg.features_map[ACTION] == ACTION
    assert cfg.features_map["image"] == f"{OBS_IMAGES}.image"
    assert cfg.features_map["wrist_image"] == f"{OBS_IMAGES}.wrist_image"
    assert cfg.features_map[OBS_STATE] == OBS_STATE


def test_convert_obs_list_format():
    _install_robomme_stub()
    try:
        from lerobot.envs.robomme import RoboMMEGymEnv

        env = RoboMMEGymEnv.__new__(RoboMMEGymEnv)
        front = np.full((256, 256, 3), 42, dtype=np.uint8)
        wrist = np.full((256, 256, 3), 7, dtype=np.uint8)
        joints = np.arange(7, dtype=np.float32)
        gripper = np.array([0.5, 0.5], dtype=np.float32)
        obs_raw = {
            "front_rgb_list": [np.zeros_like(front), front],
            "wrist_rgb_list": [np.zeros_like(wrist), wrist],
            "joint_state_list": [np.zeros(7, dtype=np.float32), joints],
            "gripper_state_list": [np.zeros(2, dtype=np.float32), gripper],
        }
        result = env._convert_obs(obs_raw)
        np.testing.assert_array_equal(result["image"], front)
        np.testing.assert_array_equal(result["wrist_image"], wrist)
        assert result["state"].shape == (8,)
        np.testing.assert_array_almost_equal(result["state"][:7], joints)
        assert result["state"][7] == gripper[0]
    finally:
        _uninstall_robomme_stub()
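The assertions above fix the conversion contract: take the newest frame from each `*_list` and build an 8-dim state from the 7 joint angles plus a single gripper scalar. A standalone sketch of that contract (hypothetical names; the real logic is `RoboMMEGymEnv._convert_obs`):

```python
import numpy as np

# Hypothetical sketch of the observation conversion the test checks:
# newest entry of each *_list, state = 7 joints + first gripper value (8-dim).
def convert_obs_sketch(obs_raw: dict) -> dict:
    return {
        "image": obs_raw["front_rgb_list"][-1],
        "wrist_image": obs_raw["wrist_rgb_list"][-1],
        "state": np.concatenate(
            [obs_raw["joint_state_list"][-1], obs_raw["gripper_state_list"][-1][:1]]
        ).astype(np.float32),
    }
```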
def test_create_robomme_envs_multi_task():
    _install_robomme_stub()
    try:
        from lerobot.envs.robomme import create_robomme_envs

        env_cls = MagicMock(return_value=MagicMock())
        result = create_robomme_envs(
            task="PickXtimes,BinFill,StopCube",
            n_envs=1,
            env_cls=env_cls,
        )
        assert set(result.keys()) == {"PickXtimes", "BinFill", "StopCube"}
    finally:
        _uninstall_robomme_stub()
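The multi-task case expects a comma-separated task string to fan out into one env entry per task name. A minimal sketch of that fan-out, assuming a `{task_name: envs}` return shape (the real signature of `create_robomme_envs` may differ):

```python
from unittest.mock import MagicMock

# Hypothetical sketch: split "A,B,C" into per-task env groups.
def create_envs_sketch(task: str, n_envs: int, env_cls) -> dict:
    return {name.strip(): [env_cls() for _ in range(n_envs)] for name in task.split(",")}

envs = create_envs_sketch("PickXtimes,BinFill,StopCube", n_envs=1, env_cls=MagicMock)
print(sorted(envs))  # ['BinFill', 'PickXtimes', 'StopCube']
```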
@@ -429,6 +429,67 @@ def test_set_half_turn_homings(mock_motors, dummy_motors):
    assert all(mock_motors.stubs[stub].wait_called() for stub in write_homing_stubs)


@pytest.mark.parametrize(
    "initial_phase, expected_phase",
    [
        (0b00010000, 0b00000000),  # bit 4 set - cleared
        (0b11111111, 0b11101111),  # all bits set - bit 4 cleared, others preserved
        (0b00000000, 0b00000000),  # bit 4 already 0 - unchanged
    ],
    ids=["bit4_set", "all_bits_set", "bit4_already_cleared"],
)
def test_configure_motors_clears_sts3215_phase_bit4(initial_phase, expected_phase, mock_motors, dummy_motors):
    """Phase register bit 4 (angle feedback mode) must be cleared for sts3215, other bits preserved."""
    phase_read_stubs = []
    phase_write_stubs = []
    for motor in dummy_motors.values():
        mock_motors.build_write_stub(*STS_SMS_SERIES_CONTROL_TABLE["Return_Delay_Time"], motor.id, 0)
        mock_motors.build_write_stub(*STS_SMS_SERIES_CONTROL_TABLE["Maximum_Acceleration"], motor.id, 254)
        mock_motors.build_write_stub(*STS_SMS_SERIES_CONTROL_TABLE["Acceleration"], motor.id, 254)
        phase_read_stubs.append(
            mock_motors.build_read_stub(*STS_SMS_SERIES_CONTROL_TABLE["Phase"], motor.id, initial_phase)
        )
        if initial_phase != expected_phase:
            phase_write_stubs.append(
                mock_motors.build_write_stub(*STS_SMS_SERIES_CONTROL_TABLE["Phase"], motor.id, expected_phase)
            )
    bus = FeetechMotorsBus(port=mock_motors.port, motors=dummy_motors)
    bus.connect(handshake=False)

    with patch.object(bus, "write", wraps=bus.write) as mock_write:
        bus.configure_motors()

    assert all(mock_motors.stubs[stub].called for stub in phase_read_stubs)
    if initial_phase != expected_phase:  # ensure that Phase is written only if it needs to be changed
        assert all(mock_motors.stubs[stub].wait_called() for stub in phase_write_stubs)
    else:  # if no write should be made, ensure that Phase is not written for any motor
        write_data_names = [call.args[0] for call in mock_write.call_args_list]
        assert "Phase" not in write_data_names
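The parametrized cases all encode one bit operation: clear bit 4 of the Phase register while leaving every other bit untouched. As a standalone sketch (the function name is illustrative, not from the driver):

```python
# Hypothetical helper showing the masking the test cases encode:
# AND with the complement of (1 << 4) clears bit 4 and preserves the rest.
def clear_phase_bit4(phase: int) -> int:
    return phase & ~(1 << 4)

print(bin(clear_phase_bit4(0b00010000)))  # 0b0
print(bin(clear_phase_bit4(0b11111111)))  # 0b11101111
```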
def test_configure_motors_skips_phase_for_non_sts3215(mock_motors):
    """Phase register must not be touched for motors other than sts3215."""
    motors = {
        "dummy_1": Motor(1, "sts3250", MotorNormMode.RANGE_M100_100),
        "dummy_2": Motor(2, "sts3250", MotorNormMode.RANGE_M100_100),
        "dummy_3": Motor(3, "sts3250", MotorNormMode.RANGE_M100_100),
    }
    for motor in motors.values():
        mock_motors.build_write_stub(*STS_SMS_SERIES_CONTROL_TABLE["Return_Delay_Time"], motor.id, 0)
        mock_motors.build_write_stub(*STS_SMS_SERIES_CONTROL_TABLE["Maximum_Acceleration"], motor.id, 254)
        mock_motors.build_write_stub(*STS_SMS_SERIES_CONTROL_TABLE["Acceleration"], motor.id, 254)
    bus = FeetechMotorsBus(port=mock_motors.port, motors=motors)
    bus.connect(handshake=False)

    with patch.object(bus, "read", wraps=bus.read) as mock_read:
        bus.configure_motors()

    read_data_names = [call.args[0] for call in mock_read.call_args_list]
    assert "Phase" not in read_data_names


def test_record_ranges_of_motion(mock_motors, dummy_motors):
    positions = {
        1: [351, 42, 1337],