diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 1055975d7..f218dcd29 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -19,6 +19,8 @@
title: Multi GPU training
- local: peft_training
title: Training with PEFT (e.g., LoRA)
+ - local: benchmark_training
+ title: Benchmark Training & Evaluation
title: "Tutorials"
- sections:
- local: lerobot-dataset-v3
diff --git a/docs/source/benchmark_training.mdx b/docs/source/benchmark_training.mdx
new file mode 100644
index 000000000..ffe20ae51
--- /dev/null
+++ b/docs/source/benchmark_training.mdx
@@ -0,0 +1,260 @@
+# Benchmark Training & Evaluation
+
+This guide explains how to train and evaluate policies on the simulation benchmarks
+integrated in LeRobot: **LIBERO**, **LIBERO-plus**, **MetaWorld**, **RoboCasa**, and **RoboMME**.
+
+The workflow is:
+
+1. Pick one or more benchmarks.
+2. For each benchmark, train a policy on its combined dataset (multi-GPU).
+3. Upload the trained policy to the Hugging Face Hub.
+4. Evaluate the policy on every task suite within that benchmark.
+
+## Prerequisites
+
+Install the benchmark-specific dependencies for the environments you want to evaluate on:
+
+```bash
+# LIBERO (original)
+pip install -e ".[libero]"
+
+# LIBERO-plus
+pip install -e ".[libero_plus]"
+
+# MetaWorld
+pip install -e ".[metaworld]"
+
+# RoboCasa
+pip install -e ".[robocasa]"
+
+# RoboMME
+pip install -e ".[robomme]"
+```
+
+`libero_plus` pulls in the same EGL probe dependencies as `libero`, so headless
+renderer setup is consistent between the two installs.
+
+If your environment has CMake build-isolation issues, use the same fallback as
+standard LIBERO installs:
+
+```bash
+PATH=/usr/bin:/bin:$PATH pip install --no-build-isolation -e ".[libero-plus]"
+```
+
+For multi-GPU training you also need [Accelerate](https://huggingface.co/docs/accelerate):
+
+```bash
+pip install accelerate
+```
+
+## Quick start — single benchmark
+
+Train SmolVLA on LIBERO-plus with 4 GPUs for 50,000 steps:
+
+```bash
+lerobot-benchmark train \
+ --benchmarks libero_plus \
+ --policy-path lerobot/smolvla_base \
+ --hub-user $HF_USER \
+ --num-gpus 4 \
+ --steps 50000 \
+ --batch-size 32 \
+ --wandb
+```
+
+This trains on the combined LIBERO-plus dataset and pushes the checkpoint to
+`$HF_USER/smolvla_libero_plus` on the Hub.
+
+Then evaluate on **all four** LIBERO suites (spatial, object, goal, 10):
+
+```bash
+lerobot-benchmark eval \
+ --benchmarks libero_plus \
+ --hub-user $HF_USER \
+ --n-episodes 50
+```
+
+This automatically runs a separate `lerobot-eval` for each suite.
+
+## Full sweep — multiple benchmarks
+
+Run training **and** evaluation across all benchmarks:
+
+```bash
+lerobot-benchmark all \
+ --benchmarks libero,libero_plus,metaworld,robocasa,robomme \
+ --policy-path lerobot/smolvla_base \
+ --hub-user $HF_USER \
+ --num-gpus 4 \
+ --steps 50000 \
+ --batch-size 32 \
+ --wandb \
+ --push-eval-to-hub
+```
+
+For each benchmark the runner:
+1. Trains a policy on its dataset.
+2. Evaluates on every eval task in the benchmark (e.g. 4 suites for LIBERO).
+3. Uploads eval results + videos to the Hub.
+
+
+<Tip>
+
+Use `--dry-run` to print the exact `lerobot-train` / `lerobot-eval` commands without executing them, so you can inspect or modify them before running.
+
+</Tip>
+
+## Using the CLI directly (without the benchmark runner)
+
+You can also compose the commands yourself. The benchmark runner is a thin wrapper; here is what it does under the hood.
+
+### Training
+
+```bash
+accelerate launch \
+ --multi_gpu \
+ --num_processes=4 \
+ $(which lerobot-train) \
+ --policy.path=lerobot/smolvla_base \
+ --dataset.repo_id=$HF_USER/libero_plus \
+ --policy.repo_id=$HF_USER/smolvla_libero_plus \
+ --env.type=libero_plus \
+ --env.task=libero_spatial \
+ --steps=50000 \
+ --batch_size=32 \
+ --eval_freq=10000 \
+ --save_freq=10000 \
+ --output_dir=outputs/train/smolvla_libero_plus \
+ --job_name=smolvla_libero_plus \
+ --policy.push_to_hub=true \
+ --wandb.enable=true
+```
+
+### Evaluation (run once per suite)
+
+```bash
+for SUITE in libero_spatial libero_object libero_goal libero_10; do
+ lerobot-eval \
+ --policy.path=$HF_USER/smolvla_libero_plus \
+ --env.type=libero_plus \
+ --env.task=$SUITE \
+ --eval.n_episodes=50 \
+ --eval.batch_size=10 \
+ --output_dir=outputs/eval/smolvla_libero_plus/$SUITE \
+ --policy.device=cuda
+done
+```
+
+## Available benchmarks
+
+| Benchmark | Env type | Dataset | Eval tasks | Action dim |
+|---|---|---|---|---|
+| `libero` | `libero` | `{hub_user}/libero` | spatial, object, goal, 10 | 7 |
+| `libero_plus` | `libero_plus` | `{hub_user}/libero_plus` | spatial, object, goal, 10 | 7 |
+| `metaworld` | `metaworld` | `{hub_user}/metaworld` | push-v2 | 4 |
+| `robocasa` | `robocasa` | `{hub_user}/robocasa` | PickPlaceCounterToCabinet | 12 |
+| `robomme` | `robomme` | `{hub_user}/robomme` | PickXtimes | 8 |
+
+Run `lerobot-benchmark list` to see the full registry with all eval tasks.
+
+## Policy naming convention
+
+The benchmark runner stores trained policies under:
+
+```
+{hub_user}/{policy_name}_{benchmark}
+```
+
+The default `--policy-name` is `smolvla`. So training on `libero_plus` as user `alice` produces `alice/smolvla_libero_plus`.
+
+You can override this, e.g. `--policy-name pi05` if training π₀.₅ instead.
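The naming convention can be sketched as a tiny helper (the function name is illustrative, not part of the CLI):

```python
def policy_repo_id(hub_user: str, policy_name: str, benchmark: str) -> str:
    """Build the Hub repo id the benchmark runner pushes trained policies to."""
    return f"{hub_user}/{policy_name}_{benchmark}"


print(policy_repo_id("alice", "smolvla", "libero_plus"))  # alice/smolvla_libero_plus
print(policy_repo_id("alice", "pi05", "libero"))          # alice/pi05_libero
```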
+
+## Multi-GPU considerations
+
+The effective batch size is `batch_size × num_gpus`. With `--batch-size=32` and
+`--num-gpus=4`, you train with an effective batch of 128 per step. LeRobot does **not**
+auto-scale the learning rate; see the [Multi-GPU Training guide](./multi_gpu_training) for
+details on when and how to adjust it.
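As a quick sanity check on these numbers, here is a minimal sketch of the effective batch size and the *linear scaling rule*, a common heuristic (not something LeRobot applies for you):

```python
def effective_batch(batch_size: int, num_gpus: int) -> int:
    # Each GPU processes its own per-device batch, so one optimizer step
    # accumulates gradients over batch_size * num_gpus samples.
    return batch_size * num_gpus


def linear_scaled_lr(base_lr: float, num_gpus: int) -> float:
    # Heuristic only: scale the learning rate with the number of workers.
    # Whether this helps depends on the policy; see the Multi-GPU guide.
    return base_lr * num_gpus


print(effective_batch(32, 4))      # 128
print(linear_scaled_lr(1e-4, 4))   # 0.0004
```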
+
+## Custom benchmarks
+
+To add a new benchmark, edit the `BENCHMARK_REGISTRY` in
+`src/lerobot/scripts/lerobot_benchmark.py`:
+
+```python
+from lerobot.scripts.lerobot_benchmark import BenchmarkEntry, BENCHMARK_REGISTRY
+
+BENCHMARK_REGISTRY["my_benchmark"] = BenchmarkEntry(
+ dataset_repo_id="{hub_user}/my_dataset",
+ env_type="my_env",
+ env_task="MyDefaultTask",
+ eval_tasks=["TaskA", "TaskB", "TaskC"],
+)
+```
+
+Then use `--benchmarks my_benchmark` as usual. The runner will train once and
+evaluate separately on TaskA, TaskB, and TaskC.
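Conceptually, the runner expands each entry's `eval_tasks` into one `lerobot-eval` invocation per task. A simplified sketch of that expansion, using a stand-in dataclass rather than the real `BenchmarkEntry`:

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkEntry:  # simplified stand-in for the real registry entry
    dataset_repo_id: str
    env_type: str
    env_task: str
    eval_tasks: list = field(default_factory=list)


def eval_commands(entry: BenchmarkEntry, hub_user: str, policy: str) -> list:
    # One lerobot-eval command per eval task, all pointing at the same policy.
    return [
        [
            "lerobot-eval",
            f"--policy.path={hub_user}/{policy}",
            f"--env.type={entry.env_type}",
            f"--env.task={task}",
        ]
        for task in entry.eval_tasks
    ]


entry = BenchmarkEntry(
    "{hub_user}/my_dataset", "my_env", "MyDefaultTask", ["TaskA", "TaskB", "TaskC"]
)
for cmd in eval_commands(entry, "alice", "smolvla_my_benchmark"):
    print(" ".join(cmd))
```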
+
+## Outputs
+
+After training and evaluation, your outputs directory looks like:
+
+```
+outputs/
+├── train/
+│ ├── smolvla_libero/
+│ │ ├── checkpoints/
+│ │ └── ...
+│ ├── smolvla_libero_plus/
+│ ├── smolvla_robocasa/
+│ └── smolvla_robomme/
+└── eval/
+ ├── smolvla_libero/
+ │ ├── libero_spatial/
+ │ │ ├── eval_info.json
+ │ │ └── videos/
+ │ ├── libero_object/
+ │ ├── libero_goal/
+ │ └── libero_10/
+ ├── smolvla_libero_plus/
+ │ ├── libero_spatial/
+ │ ├── libero_object/
+ │ ├── libero_goal/
+ │ └── libero_10/
+ ├── smolvla_robocasa/
+ └── smolvla_robomme/
+```
+
+Each `eval_info.json` contains per-episode rewards, success rates, and aggregate metrics.
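A small sketch of aggregating a success rate from such a file. The `"per_episode"` / `"success"` field names here are assumptions for illustration; check your own `eval_info.json` for the exact schema:

```python
import json
from pathlib import Path


def summarize_eval(path: Path) -> float:
    """Return the fraction of successful episodes recorded in an eval_info.json.

    Assumes a {"per_episode": [{"success": bool, ...}, ...]} layout, which is
    a guess at the schema, not a documented contract.
    """
    info = json.loads(path.read_text())
    episodes = info["per_episode"]
    return sum(bool(ep["success"]) for ep in episodes) / len(episodes)
```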
+
+## Uploading eval results to the Hub
+
+Add `--push-eval-to-hub` to upload evaluation metrics and videos to the policy's
+Hub repo after each eval run:
+
+```bash
+lerobot-benchmark eval \
+ --benchmarks libero_plus,robocasa \
+ --hub-user $HF_USER \
+ --push-eval-to-hub
+```
+
+For LIBERO-plus, each suite's results are uploaded to `eval/libero_spatial/`,
+`eval/libero_object/`, etc. inside the `$HF_USER/smolvla_libero_plus` model repo.
+
+This also works with the `all` subcommand — pass `--push-eval-to-hub` and results
+are automatically uploaded after each eval run.
+
+## Passing extra arguments
+
+Any arguments after the recognized flags are forwarded to `lerobot-train` or
+`lerobot-eval`. For example, to use PEFT/LoRA during training:
+
+```bash
+lerobot-benchmark train \
+ --benchmarks libero_plus \
+ --policy-path lerobot/smolvla_base \
+ --hub-user $HF_USER \
+ --num-gpus 4 \
+ --steps 50000 \
+ --peft.method_type=LORA --peft.r=16
+```
diff --git a/pyproject.toml b/pyproject.toml
index b46868cd6..28eb59bf2 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -177,9 +177,12 @@ pusht = ["gym-pusht>=0.1.5,<0.2.0", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk v
libero = ["lerobot[transformers-dep]", "hf-libero>=0.1.3,<0.2.0; sys_platform == 'linux'", "lerobot[scipy-dep]"]
libero_plus = [
"lerobot[transformers-dep]",
+ "hf-egl-probe>=1.0.1; sys_platform == 'linux'",
+ "egl_probe>=1.0.1; sys_platform == 'linux'",
"libero @ git+https://github.com/sylvestf/LIBERO-plus.git@main ; sys_platform == 'linux'",
"lerobot[scipy-dep]",
]
+libero-plus = ["lerobot[libero_plus]"]
robomme = [
"robomme @ git+https://github.com/RoboMME/robomme_benchmark.git@main ; sys_platform == 'linux'",
]
@@ -236,6 +239,7 @@ lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
lerobot-imgtransform-viz="lerobot.scripts.lerobot_imgtransform_viz:main"
lerobot-edit-dataset="lerobot.scripts.lerobot_edit_dataset:main"
lerobot-setup-can="lerobot.scripts.lerobot_setup_can:main"
+lerobot-benchmark="lerobot.scripts.lerobot_benchmark:main"
# ---------------- Tool Configurations ----------------
[tool.setuptools.package-data]
diff --git a/scripts/multimodal_analysis.py b/scripts/multimodal_analysis.py
new file mode 100644
index 000000000..f01dc18e1
--- /dev/null
+++ b/scripts/multimodal_analysis.py
@@ -0,0 +1,689 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""
+Chunk-level multi-modality analysis for comparing full/mixed vs curated datasets.
+
+Treats each action chunk (sliding window of CHUNK_SIZE consecutive frames) as the
+atomic unit, tagged by the SARM progress score at its start frame. For each
+progress band, compares the full vs HQ dataset on:
+
+ 1. Intra-band action variance
+ 2. Progress delta per chunk
+ 3. GMM + BIC optimal K (number of distinct strategies)
+ 4. PCA embedding (visual cluster inspection)
+
+Usage:
+    python scripts/multimodal_analysis.py \\
+ --full-dataset lerobot-data-collection/level12_rac_2_2026-02-08_1 \\
+ --hq-dataset lerobot-data-collection/level2_final_quality3 \\
+ --output-dir ./chunk_analysis
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from collections import defaultdict
+from pathlib import Path
+
+import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+from huggingface_hub import hf_hub_download
+from scipy.stats import gaussian_kde
+from sklearn.decomposition import PCA
+from sklearn.mixture import GaussianMixture
+from sklearn.preprocessing import StandardScaler
+
+from lerobot.datasets.lerobot_dataset import LeRobotDataset
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+logger = logging.getLogger(__name__)
+
+# ── Visual style ──────────────────────────────────────────────────────────
+
+BG = "#0e1117"
+CARD = "#1a1d27"
+BORDER = "#2a2d3a"
+SUB = "#8b8fa8"
+TEXT = "#e8eaf0"
+C_FULL = "#f7934f"
+C_HQ = "#4dc98a"
+
+
+def _style_ax(ax: plt.Axes) -> None:
+ ax.set_facecolor(CARD)
+ ax.tick_params(colors=SUB, labelsize=8)
+ for spine in ax.spines.values():
+ spine.set_color(BORDER)
+
+
+def _save(fig: plt.Figure, path: Path) -> None:
+ fig.savefig(path, dpi=150, bbox_inches="tight", facecolor=BG)
+ plt.close(fig)
+ logger.info("Saved %s", path)
+
+
+# ── Step 0: Load episodes ────────────────────────────────────────────────
+
+def _load_sarm_progress(repo_id: str) -> pd.DataFrame | None:
+ """Try to download sarm_progress.parquet from the Hub."""
+ try:
+ path = hf_hub_download(
+ repo_id=repo_id, filename="sarm_progress.parquet",
+ repo_type="dataset",
+ )
+ df = pd.read_parquet(path)
+ col = "progress_sparse" if "progress_sparse" in df.columns else "progress_dense"
+ if col not in df.columns:
+ logger.warning("sarm_progress.parquet has no progress columns — ignoring")
+ return None
+ logger.info("Loaded SARM progress (%s) for %s (%d rows)", col, repo_id, len(df))
+ return df.rename(columns={col: "progress"})[["episode_index", "frame_index", "progress"]]
+ except Exception as exc:
+ logger.warning("Could not load sarm_progress.parquet for %s: %s", repo_id, exc)
+ return None
+
+
+def load_episodes(
+ repo_id: str,
+ n_joints: int = 16,
+ max_episodes: int | None = None,
+) -> list[dict]:
+ dataset = LeRobotDataset(repo_id, download_videos=False)
+ raw = dataset.hf_dataset
+
+ sarm_df = _load_sarm_progress(repo_id)
+ # Build per-episode progress arrays from SARM parquet (indexed by frame_index)
+ sarm_by_ep: dict[int, dict[int, float]] = {}
+ if sarm_df is not None:
+ if max_episodes is not None:
+ sarm_df = sarm_df[sarm_df["episode_index"] < max_episodes]
+ for ep_id, grp in sarm_df.groupby("episode_index"):
+ sarm_by_ep[int(ep_id)] = dict(
+ zip(grp["frame_index"].astype(int), grp["progress"].astype(float))
+ )
+
+ episodes: dict[int, dict] = defaultdict(lambda: {"actions": [], "progress": []})
+ for row in raw:
+ ep = int(row["episode_index"])
+ if max_episodes is not None and ep >= max_episodes:
+ continue
+ action = np.array(row["action"], dtype=np.float32)[:n_joints]
+ episodes[ep]["actions"].append(action)
+ fi = int(row["frame_index"])
+ ep_prog = sarm_by_ep.get(ep, {})
+ episodes[ep]["progress"].append(ep_prog.get(fi, float("nan")))
+
+    has_sarm = len(sarm_by_ep) > 0
+ result = []
+ for ep_id, d in sorted(episodes.items()):
+ actions = np.stack(d["actions"])
+ T = len(actions)
+ if has_sarm:
+ prog = np.array(d["progress"], dtype=np.float32)
+ prog = np.clip(np.nan_to_num(prog, nan=0.0), 0.0, 1.0)
+ prog = np.maximum.accumulate(prog)
+ else:
+ prog = np.linspace(0.0, 1.0, T, dtype=np.float32)
+ result.append({"episode": ep_id, "actions": actions, "progress": prog})
+
+ src = "SARM" if has_sarm else "time-based"
+ logger.info("Progress source: %s", src)
+ return result
+
+
+# ── Step 1: Filter short episodes ────────────────────────────────────────
+
+def auto_length_threshold(
+ episodes_full: list[dict], episodes_hq: list[dict]
+) -> int:
+ all_lengths = np.array(
+ [e["actions"].shape[0] for e in episodes_full + episodes_hq]
+ )
+ kde = gaussian_kde(all_lengths, bw_method=0.25)
+ xs = np.linspace(all_lengths.min(), np.percentile(all_lengths, 40), 300)
+ return int(xs[np.argmin(kde(xs))])
+
+
+def plot_length_distribution(
+ episodes_full: list[dict],
+ episodes_hq: list[dict],
+ threshold: int,
+ out_path: Path,
+) -> None:
+ lens_full = np.array([e["actions"].shape[0] for e in episodes_full])
+ lens_hq = np.array([e["actions"].shape[0] for e in episodes_hq])
+ all_lens = np.concatenate([lens_full, lens_hq])
+
+ fig, ax = plt.subplots(figsize=(10, 5))
+ fig.patch.set_facecolor(BG)
+ _style_ax(ax)
+
+ bins = np.linspace(all_lens.min(), all_lens.max(), 50)
+ ax.hist(lens_full, bins=bins, alpha=0.5, color=C_FULL, label="Full/Mixed")
+ ax.hist(lens_hq, bins=bins, alpha=0.5, color=C_HQ, label="HQ")
+
+ xs = np.linspace(all_lens.min(), all_lens.max(), 300)
+ kde = gaussian_kde(all_lens, bw_method=0.25)
+ ax.plot(xs, kde(xs) * len(all_lens) * (bins[1] - bins[0]), color=TEXT, lw=1.5, label="KDE (combined)")
+
+ ax.axvline(threshold, color="#ff4b4b", ls="--", lw=1.5, label=f"Threshold = {threshold}")
+ ax.set_xlabel("Episode length (frames)", color=SUB)
+ ax.set_ylabel("Count", color=SUB)
+ ax.set_title("Episode Length Distribution", color=TEXT, fontsize=13)
+ ax.legend(facecolor=CARD, edgecolor=BORDER, labelcolor=TEXT, fontsize=8)
+ _save(fig, out_path)
+
+
+def filter_episodes(episodes: list[dict], min_length: int) -> list[dict]:
+ kept = [e for e in episodes if e["actions"].shape[0] >= min_length]
+ logger.info("Kept %d / %d episodes (min_length=%d)", len(kept), len(episodes), min_length)
+ return kept
+
+
+# ── Step 2: Extract chunks ───────────────────────────────────────────────
+
+def extract_chunks(
+ episodes: list[dict],
+ chunk_size: int = 30,
+ chunk_stride: int = 15,
+) -> list[dict]:
+ chunks = []
+ for ep in episodes:
+ actions = ep["actions"]
+ T = len(actions)
+ prog = ep["progress"]
+
+ for t in range(0, T - chunk_size, chunk_stride):
+ chunk = actions[t : t + chunk_size]
+ p_start = float(prog[t])
+ p_end = float(prog[min(t + chunk_size, T - 1)])
+
+ chunks.append({
+ "action_mean": chunk.mean(axis=0).astype(np.float32),
+ "action_flat": chunk.flatten().astype(np.float32),
+ "progress_start": p_start,
+ "progress_delta": p_end - p_start,
+ "episode": ep["episode"],
+ })
+ return chunks
+
+
+# ── Step 3: Adaptive progress bands ─────────────────────────────────────
+
+def make_bands(n_bands: int = 5) -> list[tuple[float, float]]:
+ edges = np.linspace(0.0, 1.0, n_bands + 1)
+ return [(float(edges[i]), float(edges[i + 1])) for i in range(n_bands)]
+
+
+def assign_bands(
+ chunks: list[dict], band_edges: list[tuple[float, float]]
+) -> list[dict]:
+ n = len(band_edges)
+ for c in chunks:
+ p = c["progress_start"]
+ c["band"] = next(
+ (bi for bi, (lo, hi) in enumerate(band_edges) if p < hi),
+ n - 1,
+ )
+ return chunks
+
+
+def split_by_band(chunks: list[dict], n_bands: int) -> dict[int, list[dict]]:
+ out: dict[int, list[dict]] = {b: [] for b in range(n_bands)}
+ for c in chunks:
+ out[c["band"]].append(c)
+ return out
+
+
+# ── Step 4: Intra-band action variance ──────────────────────────────────
+
+def band_variance_matrix(
+ bands: dict[int, list[dict]], n_bands: int, n_joints: int
+) -> np.ndarray:
+ var_mat = np.full((n_bands, n_joints), np.nan)
+ for b, clist in bands.items():
+ if len(clist) < 3:
+ continue
+ means = np.stack([c["action_mean"] for c in clist])
+ var_mat[b] = np.var(means, axis=0)
+ return var_mat
+
+
+def plot_variance_heatmap(
+ var_full: np.ndarray,
+ var_hq: np.ndarray,
+ band_edges: list[tuple[float, float]],
+ out_path: Path,
+) -> None:
+ n_bands = var_full.shape[0]
+ vmin = 0.0
+ vmax = max(np.nanmax(var_full), np.nanmax(var_hq))
+
+ band_labels = [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges]
+ joint_labels = [f"J{j}" for j in range(var_full.shape[1])]
+
+ fig, axes = plt.subplots(3, 1, figsize=(12, 10), gridspec_kw={"height_ratios": [3, 3, 2]})
+ fig.patch.set_facecolor(BG)
+ fig.suptitle("Intra-Band Action Variance", color=TEXT, fontsize=14, y=0.98)
+
+ for ax_idx, (mat, label) in enumerate([(var_full, "Full/Mixed"), (var_hq, "HQ")]):
+ ax = axes[ax_idx]
+ _style_ax(ax)
+ im = ax.imshow(mat, aspect="auto", cmap="YlOrRd", vmin=vmin, vmax=vmax)
+ ax.set_yticks(range(n_bands))
+ ax.set_yticklabels(band_labels, fontsize=7, color=SUB)
+ ax.set_xticks(range(var_full.shape[1]))
+ ax.set_xticklabels(joint_labels, fontsize=7, color=SUB)
+ ax.set_title(f"Panel {'A' if ax_idx == 0 else 'B'}: {label}", color=TEXT, fontsize=11)
+ fig.colorbar(im, ax=ax, fraction=0.02, pad=0.02)
+
+ with np.errstate(invalid="ignore"):
+ mean_full = np.nanmean(var_full, axis=1)
+ mean_hq = np.nanmean(var_hq, axis=1)
+ ratio = np.where(np.isnan(mean_full) | np.isnan(mean_hq), np.nan,
+ mean_full / (mean_hq + 1e-8))
+ ax_bar = axes[2]
+ _style_ax(ax_bar)
+ colors = [
+ "#ff4b4b" if r > 2.0 else "#ffaa33" if r > 1.2 else C_HQ
+ for r in ratio
+ ]
+ ax_bar.bar(range(n_bands), ratio, color=colors, edgecolor=BORDER)
+ ax_bar.axhline(1.0, color=SUB, ls="--", lw=0.8)
+ ax_bar.set_xticks(range(n_bands))
+ ax_bar.set_xticklabels(band_labels, fontsize=7, color=SUB)
+ ax_bar.set_ylabel("Variance ratio\n(Full / HQ)", color=SUB, fontsize=9)
+ ax_bar.set_title("Panel C: Variance Ratio per Band", color=TEXT, fontsize=11)
+
+ fig.tight_layout(rect=[0, 0, 1, 0.96])
+ _save(fig, out_path)
+
+
+# ── Step 5: Progress delta per band ──────────────────────────────────────
+
+def plot_progress_delta(
+ bands_full: dict[int, list[dict]],
+ bands_hq: dict[int, list[dict]],
+ band_edges: list[tuple[float, float]],
+ out_path: Path,
+) -> None:
+ n_bands = len(band_edges)
+ band_labels = [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges]
+ x = np.arange(n_bands)
+ w = 0.35
+
+ means_full, stds_full = [], []
+ means_hq, stds_hq = [], []
+ all_deltas_full, all_deltas_hq = [], []
+
+ for b in range(n_bands):
+ df = np.array([c["progress_delta"] for c in bands_full.get(b, [])])
+ dh = np.array([c["progress_delta"] for c in bands_hq.get(b, [])])
+ means_full.append(np.mean(df) if len(df) > 0 else 0)
+ stds_full.append(np.std(df) if len(df) > 0 else 0)
+ means_hq.append(np.mean(dh) if len(dh) > 0 else 0)
+ stds_hq.append(np.std(dh) if len(dh) > 0 else 0)
+ all_deltas_full.extend(df.tolist())
+ all_deltas_hq.extend(dh.tolist())
+
+ fig, (ax_bar, ax_viol) = plt.subplots(1, 2, figsize=(14, 5), gridspec_kw={"width_ratios": [3, 1]})
+ fig.patch.set_facecolor(BG)
+ fig.suptitle("Progress Delta per Chunk", color=TEXT, fontsize=14)
+
+ _style_ax(ax_bar)
+ ax_bar.bar(x - w / 2, means_full, w, yerr=stds_full, color=C_FULL, edgecolor=BORDER,
+ capsize=3, label="Full/Mixed", error_kw={"ecolor": SUB})
+ ax_bar.bar(x + w / 2, means_hq, w, yerr=stds_hq, color=C_HQ, edgecolor=BORDER,
+ capsize=3, label="HQ", error_kw={"ecolor": SUB})
+ ax_bar.set_xticks(x)
+ ax_bar.set_xticklabels(band_labels, fontsize=7, color=SUB, rotation=30)
+ ax_bar.set_ylabel("Mean progress Δ", color=SUB)
+ ax_bar.legend(facecolor=CARD, edgecolor=BORDER, labelcolor=TEXT, fontsize=8)
+
+ _style_ax(ax_viol)
+ data_viol = [np.array(all_deltas_full), np.array(all_deltas_hq)]
+ if all(len(d) > 0 for d in data_viol):
+ parts = ax_viol.violinplot(data_viol, positions=[0, 1], showmeans=True, showmedians=True)
+ for pc, c in zip(parts["bodies"], [C_FULL, C_HQ]):
+ pc.set_facecolor(c)
+ pc.set_alpha(0.7)
+ for key in ("cmeans", "cmedians", "cbars", "cmins", "cmaxes"):
+ if key in parts:
+ parts[key].set_color(SUB)
+ ax_viol.set_xticks([0, 1])
+ ax_viol.set_xticklabels(["Full", "HQ"], color=SUB)
+ ax_viol.set_ylabel("Progress Δ", color=SUB)
+ ax_viol.set_title("Overall Distribution", color=TEXT, fontsize=10)
+
+ fig.tight_layout()
+ _save(fig, out_path)
+
+
+# ── Step 6: GMM + BIC per band ──────────────────────────────────────────
+
+def gmm_optimal_k(
+ band_chunks: list[dict],
+ pca_components: int = 15,
+ max_k: int = 12,
+ seed: int = 42,
+) -> int | None:
+ if len(band_chunks) < 20:
+ return None
+ X = np.stack([c["action_flat"] for c in band_chunks])
+ X = StandardScaler().fit_transform(X)
+ n = min(pca_components, X.shape[1], X.shape[0] - 1)
+ X_r = PCA(n_components=n, random_state=seed).fit_transform(X)
+ bics = []
+ for k in range(1, min(max_k + 1, len(X_r) // 6)):
+ gmm = GaussianMixture(
+ n_components=k, covariance_type="full",
+ n_init=5, max_iter=300, random_state=seed,
+ )
+ gmm.fit(X_r)
+ bics.append((k, gmm.bic(X_r)))
+ if not bics:
+ return None
+ return min(bics, key=lambda x: x[1])[0]
+
+
+def plot_gmm_bic(
+ bands_full: dict[int, list[dict]],
+ bands_hq: dict[int, list[dict]],
+ band_edges: list[tuple[float, float]],
+ seed: int,
+ out_path: Path,
+) -> tuple[list[int | None], list[int | None]]:
+ n_bands = len(band_edges)
+ ks_full = [gmm_optimal_k(bands_full.get(b, []), seed=seed) for b in range(n_bands)]
+ ks_hq = [gmm_optimal_k(bands_hq.get(b, []), seed=seed) for b in range(n_bands)]
+
+ band_labels = [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges]
+
+ fig, ax = plt.subplots(figsize=(10, 5))
+ fig.patch.set_facecolor(BG)
+ _style_ax(ax)
+
+ xs = np.arange(n_bands)
+ valid_full = [(i, k) for i, k in enumerate(ks_full) if k is not None]
+ valid_hq = [(i, k) for i, k in enumerate(ks_hq) if k is not None]
+
+ if valid_full:
+ xi, yi = zip(*valid_full)
+ ax.plot(xi, yi, "o-", color=C_FULL, label="Full/Mixed", lw=2, markersize=7)
+ if valid_hq:
+ xi, yi = zip(*valid_hq)
+ ax.plot(xi, yi, "o-", color=C_HQ, label="HQ", lw=2, markersize=7)
+
+ if valid_full and valid_hq:
+ all_x = sorted(set([i for i, _ in valid_full]) & set([i for i, _ in valid_hq]))
+ if len(all_x) >= 2:
+ kf_interp = {i: k for i, k in valid_full}
+ kh_interp = {i: k for i, k in valid_hq}
+ shared_x = [i for i in all_x if i in kf_interp and i in kh_interp]
+ yf = [kf_interp[i] for i in shared_x]
+ yh = [kh_interp[i] for i in shared_x]
+ ax.fill_between(shared_x, yf, yh, alpha=0.15, color=TEXT)
+
+ ax.set_xticks(xs)
+ ax.set_xticklabels(band_labels, fontsize=7, color=SUB, rotation=30)
+ ax.set_ylabel("Optimal K (GMM-BIC)", color=SUB)
+ ax.set_title("Number of Distinct Strategies per Band", color=TEXT, fontsize=13)
+ ax.legend(facecolor=CARD, edgecolor=BORDER, labelcolor=TEXT, fontsize=9)
+ ax.yaxis.set_major_locator(plt.MaxNLocator(integer=True))
+ fig.tight_layout()
+ _save(fig, out_path)
+ return ks_full, ks_hq
+
+
+# ── Step 7: PCA scatter per band ────────────────────────────────────────
+
+def plot_pca_scatter(
+ bands_full: dict[int, list[dict]],
+ bands_hq: dict[int, list[dict]],
+ band_edges: list[tuple[float, float]],
+ out_path: Path,
+) -> None:
+ n_plot = min(4, len(band_edges))
+ fig, axes = plt.subplots(2, n_plot, figsize=(4 * n_plot, 7))
+ fig.patch.set_facecolor(BG)
+ fig.suptitle("PCA of Action Chunks per Band", color=TEXT, fontsize=14)
+
+ if n_plot == 1:
+ axes = axes.reshape(2, 1)
+
+    for col in range(n_plot):
+        b = col  # one progress band per column
+ cf = bands_full.get(b, [])
+ ch = bands_hq.get(b, [])
+ lo, hi = band_edges[b]
+
+ for row, (clist, color, label) in enumerate([
+ (cf, C_FULL, "Full/Mixed"), (ch, C_HQ, "HQ")
+ ]):
+ ax = axes[row, col]
+ _style_ax(ax)
+ if row == 0:
+ ax.set_title(f"{lo:.0%}–{hi:.0%}", color=TEXT, fontsize=10)
+ if col == 0:
+ ax.set_ylabel(label, color=SUB, fontsize=9)
+
+ if len(cf) < 3 or len(ch) < 3:
+ ax.text(0.5, 0.5, "Too few\nchunks", transform=ax.transAxes,
+ ha="center", va="center", color=SUB, fontsize=9)
+ continue
+
+ X_full_b = np.stack([c["action_flat"] for c in cf])
+ X_hq_b = np.stack([c["action_flat"] for c in ch])
+ X_all = np.vstack([X_full_b, X_hq_b])
+ X_all = StandardScaler().fit_transform(X_all)
+ X_2d = PCA(n_components=2, random_state=42).fit_transform(X_all)
+
+ X_2d_full = X_2d[: len(cf)]
+ X_2d_hq = X_2d[len(cf) :]
+
+ pts = X_2d_full if row == 0 else X_2d_hq
+ ax.scatter(pts[:, 0], pts[:, 1], s=8, alpha=0.5, color=color, edgecolors="none")
+
+ fig.tight_layout(rect=[0, 0, 1, 0.95])
+ _save(fig, out_path)
+
+
+# ── Plot 1: Chunk counts per band ───────────────────────────────────────
+
+def plot_chunk_counts(
+ bands_full: dict[int, list[dict]],
+ bands_hq: dict[int, list[dict]],
+ band_edges: list[tuple[float, float]],
+ out_path: Path,
+) -> None:
+ n_bands = len(band_edges)
+ band_labels = [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges]
+ x = np.arange(n_bands)
+ w = 0.35
+
+ counts_full = [len(bands_full.get(b, [])) for b in range(n_bands)]
+ counts_hq = [len(bands_hq.get(b, [])) for b in range(n_bands)]
+
+ fig, ax = plt.subplots(figsize=(10, 5))
+ fig.patch.set_facecolor(BG)
+ _style_ax(ax)
+
+ ax.bar(x - w / 2, counts_full, w, color=C_FULL, edgecolor=BORDER, label="Full/Mixed")
+ ax.bar(x + w / 2, counts_hq, w, color=C_HQ, edgecolor=BORDER, label="HQ")
+ ax.set_xticks(x)
+ ax.set_xticklabels(band_labels, fontsize=7, color=SUB, rotation=30)
+ ax.set_ylabel("Chunk count", color=SUB)
+ ax.set_title("Chunk Counts per Progress Band", color=TEXT, fontsize=13)
+ ax.legend(facecolor=CARD, edgecolor=BORDER, labelcolor=TEXT, fontsize=8)
+ fig.tight_layout()
+ _save(fig, out_path)
+
+
+# ── Summary figure ───────────────────────────────────────────────────────
+
+def plot_summary(
+ var_full: np.ndarray,
+ var_hq: np.ndarray,
+ band_edges: list[tuple[float, float]],
+ ks_full: list[int | None],
+ ks_hq: list[int | None],
+ bands_full: dict[int, list[dict]],
+ bands_hq: dict[int, list[dict]],
+ out_path: Path,
+) -> None:
+ with np.errstate(invalid="ignore"):
+ mean_full = np.nanmean(var_full, axis=1)
+ mean_hq = np.nanmean(var_hq, axis=1)
+ ratio = np.where(np.isnan(mean_full) | np.isnan(mean_hq), np.nan,
+ mean_full / (mean_hq + 1e-8))
+    valid_ratio = ratio[~np.isnan(ratio)]
+    mean_ratio = float(np.mean(valid_ratio)) if len(valid_ratio) > 0 else float("nan")
+    if len(valid_ratio) > 0:
+        # Take the peak over the unfiltered ratio so the index maps back to
+        # band_edges (argmax over the NaN-filtered array mis-indexes when
+        # some bands are missing).
+        peak_idx = int(np.nanargmax(ratio))
+        peak_ratio = float(ratio[peak_idx])
+    else:
+        peak_idx, peak_ratio = 0, float("nan")
+    lo, hi = band_edges[peak_idx]
+    peak_band = f"{lo:.0%}–{hi:.0%}"
+
+ valid_kf = [k for k in ks_full if k is not None]
+ valid_kh = [k for k in ks_hq if k is not None]
+ mean_k_full = np.mean(valid_kf) if valid_kf else float("nan")
+ mean_k_hq = np.mean(valid_kh) if valid_kh else float("nan")
+
+ n_bands = len(band_edges)
+ deltas_full = [c["progress_delta"] for b in range(n_bands) for c in bands_full.get(b, [])]
+ deltas_hq = [c["progress_delta"] for b in range(n_bands) for c in bands_hq.get(b, [])]
+ mean_delta_full = float(np.mean(deltas_full)) if deltas_full else float("nan")
+ mean_delta_hq = float(np.mean(deltas_hq)) if deltas_hq else float("nan")
+
+ rows = [
+ ("Mean variance ratio (Full / HQ)", f"{mean_ratio:.2f}x"),
+ ("Peak variance ratio", f"{peak_ratio:.2f}x at {peak_band}"),
+ ("Mean GMM K — Full", f"{mean_k_full:.1f}"),
+ ("Mean GMM K — HQ", f"{mean_k_hq:.1f}"),
+ ("Mean progress Δ — Full", f"{mean_delta_full:.4f}"),
+ ("Mean progress Δ — HQ", f"{mean_delta_hq:.4f}"),
+ ]
+
+ fig, ax = plt.subplots(figsize=(8, 3))
+ fig.patch.set_facecolor(BG)
+ ax.set_facecolor(CARD)
+ ax.axis("off")
+
+ table = ax.table(
+ cellText=[[m, v] for m, v in rows],
+ colLabels=["Metric", "Value"],
+ loc="center",
+ cellLoc="left",
+ )
+ table.auto_set_font_size(False)
+ table.set_fontsize(10)
+ for key, cell in table.get_celld().items():
+ cell.set_edgecolor(BORDER)
+ cell.set_facecolor(CARD)
+ cell.set_text_props(color=TEXT)
+ if key[0] == 0:
+ cell.set_text_props(color=TEXT, fontweight="bold")
+ table.scale(1, 1.6)
+ ax.set_title("Summary Statistics", color=TEXT, fontsize=13, pad=15)
+ fig.tight_layout()
+ _save(fig, out_path)
+
+ for metric, value in rows:
+ logger.info(" %s: %s", metric, value)
+
+
+# ── Main ─────────────────────────────────────────────────────────────────
+
+def main(args: argparse.Namespace) -> None:
+ out = Path(args.output_dir)
+ out.mkdir(parents=True, exist_ok=True)
+
+ logger.info("Loading FULL dataset: %s", args.full_dataset)
+ episodes_full = load_episodes(args.full_dataset, args.n_joints, args.max_episodes)
+ logger.info("Loading HQ dataset: %s", args.hq_dataset)
+ episodes_hq = load_episodes(args.hq_dataset, args.n_joints, args.max_episodes)
+ logger.info("Loaded %d full episodes, %d HQ episodes", len(episodes_full), len(episodes_hq))
+
+ # Step 1: length threshold + filter
+ if args.min_episode_length is not None:
+ threshold = args.min_episode_length
+ else:
+ threshold = auto_length_threshold(episodes_full, episodes_hq)
+ logger.info("Episode length threshold: %d", threshold)
+
+ plot_length_distribution(episodes_full, episodes_hq, threshold, out / "0_length_distribution.png")
+ episodes_full = filter_episodes(episodes_full, threshold)
+ episodes_hq = filter_episodes(episodes_hq, threshold)
+
+ # Step 2: extract chunks
+ chunks_full = extract_chunks(episodes_full, args.chunk_size, args.chunk_stride)
+ chunks_hq = extract_chunks(episodes_hq, args.chunk_size, args.chunk_stride)
+ logger.info("Extracted %d full chunks, %d HQ chunks", len(chunks_full), len(chunks_hq))
+
+ # Step 3: fixed equal-width bands over episode-relative progress
+ band_edges = make_bands(args.n_bands)
+ n_bands = len(band_edges)
+ logger.info("Progress bands (%d): %s", n_bands,
+ [f"{lo:.0%}–{hi:.0%}" for lo, hi in band_edges])
+
+ chunks_full = assign_bands(chunks_full, band_edges)
+ chunks_hq = assign_bands(chunks_hq, band_edges)
+ bands_full = split_by_band(chunks_full, n_bands)
+ bands_hq = split_by_band(chunks_hq, n_bands)
+
+ # Plot 1: chunk counts
+ plot_chunk_counts(bands_full, bands_hq, band_edges, out / "1_chunk_counts_per_band.png")
+
+ # Step 4: variance heatmap
+ var_full = band_variance_matrix(bands_full, n_bands, args.n_joints)
+ var_hq = band_variance_matrix(bands_hq, n_bands, args.n_joints)
+ plot_variance_heatmap(var_full, var_hq, band_edges, out / "2_variance_heatmap.png")
+
+ # Step 5: progress delta
+ plot_progress_delta(bands_full, bands_hq, band_edges, out / "3_progress_delta_per_band.png")
+
+ # Step 6: GMM BIC
+ ks_full, ks_hq = plot_gmm_bic(bands_full, bands_hq, band_edges, args.seed, out / "4_gmm_bic_per_band.png")
+
+ # Step 7: PCA scatter
+ plot_pca_scatter(bands_full, bands_hq, band_edges, out / "5_pca_per_band.png")
+
+ # Summary
+ plot_summary(var_full, var_hq, band_edges, ks_full, ks_hq,
+ bands_full, bands_hq, out / "6_summary.png")
+
+ logger.info("All figures saved to %s", out)
+
+
+if __name__ == "__main__":
+ p = argparse.ArgumentParser(
+ description="Chunk-level multi-modality analysis: Full/Mixed vs HQ dataset.",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ )
+ p.add_argument("--full-dataset", default="lerobot-data-collection/level12_rac_2_2026-02-08_1")
+ p.add_argument("--hq-dataset", default="lerobot-data-collection/level2_final_quality3_trim_0_hil_data")
+ p.add_argument("--output-dir", default="./chunk_analysis")
+ p.add_argument("--chunk-size", type=int, default=30)
+ p.add_argument("--chunk-stride", type=int, default=15)
+ p.add_argument("--n-bands", type=int, default=5, help="Number of equal-width progress bands")
+ p.add_argument("--max-episodes", type=int, default=500)
+ p.add_argument("--n-joints", type=int, default=16)
+ p.add_argument("--min-episode-length", type=int, default=None,
+ help="Override auto-detected length filter threshold")
+ p.add_argument("--seed", type=int, default=42)
+ args = p.parse_args()
+ main(args)
diff --git a/scripts/train_smolvla_libero_plus.slurm b/scripts/train_smolvla_libero_plus.slurm
new file mode 100644
index 000000000..413838813
--- /dev/null
+++ b/scripts/train_smolvla_libero_plus.slurm
@@ -0,0 +1,29 @@
+#!/bin/bash
+#SBATCH --job-name=smolvla_libero_plus
+#SBATCH --partition=hopper-prod
+#SBATCH --nodes=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --gpus-per-node=4
+#SBATCH --cpus-per-task=48
+#SBATCH --mem=200G
+#SBATCH --time=12:00:00
+#SBATCH --output=logs/smolvla_libero_plus_%j.out
+#SBATCH --error=logs/smolvla_libero_plus_%j.err
+
+set -euo pipefail
+
+eval "$(conda shell.bash hook 2>/dev/null)"
+conda activate lerobot312
+
+cd /admin/home/pepijn/lerobot_wt_robocasa
+
+lerobot-benchmark train \
+ --benchmarks libero_plus \
+ --policy-path lerobot/smolvla_base \
+ --hub-user pepijn223 \
+ --num-gpus 4 \
+ --steps 30000 \
+ --batch-size 32 \
+ --eval-freq 0 \
+ --wandb \
+ --dataset.repo_id=pepijn223/libero_plus_lerobot
diff --git a/src/lerobot/envs/libero.py b/src/lerobot/envs/libero.py
index 968b1e734..f5f639891 100644
--- a/src/lerobot/envs/libero.py
+++ b/src/lerobot/envs/libero.py
@@ -16,6 +16,7 @@
from __future__ import annotations
import os
+import re
from collections import defaultdict
from collections.abc import Callable, Iterable, Mapping, Sequence
from functools import partial
@@ -28,12 +29,51 @@ import torch
from gymnasium import spaces
try:
- from libero.libero import benchmark, get_libero_path
- from libero.libero.envs import OffScreenRenderEnv
+ import libero as _libero_pkg # noqa: F401
except ImportError:
- # LIBERO-plus may be installed from source with an extra nested package level.
- from libero.libero.libero import benchmark, get_libero_path
- from libero.libero.libero.envs import OffScreenRenderEnv
+ raise ImportError(
+ "Could not import libero. Install benchmark dependencies with one of:\n"
+ " pip install -e \".[libero]\"\n"
+ " pip install -e \".[libero_plus]\" (alias: \".[libero-plus]\")"
+ )
+
+# LIBERO's env_wrapper unconditionally imports wand (the ImageMagick Python
+# binding), which requires the system-level libMagickWand library. The wand
+# features are only used for visual noise perturbations and are not needed for
+# standard evaluation. Pre-register a stub in sys.modules so the import
+# succeeds even without ImageMagick.
+import sys
+import types
+
+if "wand" not in sys.modules:
+ try:
+ import wand.api # noqa: F401
+ except (ImportError, OSError):
+
+ class _AttrSink:
+ """Accepts any attribute get/set and any call without error."""
+
+ def __getattr__(self, _name):
+ return self
+
+ def __setattr__(self, _name, _value):
+ pass
+
+ def __call__(self, *a, **kw):
+ pass
+
+ _wand = types.ModuleType("wand")
+ _wand_api = types.ModuleType("wand.api")
+ _wand_api.library = _AttrSink()
+ _wand_image = types.ModuleType("wand.image")
+ _wand_image.Image = type("Image", (), {})
+ _wand.api = _wand_api
+ _wand.image = _wand_image
+ sys.modules["wand"] = _wand
+ sys.modules["wand.api"] = _wand_api
+ sys.modules["wand.image"] = _wand_image
+
+from libero.libero import benchmark, get_libero_path
+from libero.libero.envs import OffScreenRenderEnv
from lerobot.processor import RobotObservation
@@ -74,13 +114,30 @@ def _select_task_ids(total_tasks: int, task_ids: Iterable[int] | None) -> list[i
def get_task_init_states(task_suite: Any, i: int) -> np.ndarray:
- init_states_path = (
- Path(get_libero_path("init_states"))
- / task_suite.tasks[i].problem_folder
- / task_suite.tasks[i].init_states_file
+ init_states_dir = Path(get_libero_path("init_states")) / task_suite.tasks[i].problem_folder
+ init_states_file = task_suite.tasks[i].init_states_file
+
+ candidate_names = [init_states_file]
+ # Some LIBERO-plus task names embed a "_table_<n>" component, while the
+ # shipped init-state files use the base name without it.
+ if "_table_" in init_states_file:
+ candidate_names.append(re.sub(r"_table_\d+(?=\.pruned_init$|\.init$)", "", init_states_file))
+
+ for name in candidate_names:
+ candidate_path = init_states_dir / name
+ if candidate_path.exists():
+ return torch.load(candidate_path, weights_only=False) # nosec B614
+
+ # Last-resort fallback: take the first (sorted) file matching the base prefix + extension.
+ stem, suffix = os.path.splitext(init_states_file)
+ stem = re.sub(r"_table_\d+$", "", stem)
+ fallback_matches = sorted(init_states_dir.glob(f"{stem}*{suffix}"))
+ if fallback_matches:
+ return torch.load(fallback_matches[0], weights_only=False) # nosec B614
+
+ raise FileNotFoundError(
+ f"Could not find init states for task {i}. Tried {candidate_names} in '{init_states_dir}'."
)
- init_states = torch.load(init_states_path, weights_only=False) # nosec B614
- return init_states
def get_libero_dummy_action():
@@ -100,6 +157,29 @@ TASK_SUITE_MAX_STEPS: dict[str, int] = {
}
+def _make_offscreen_env_with_renderer_fallback(env_args: dict[str, Any]) -> Any:
+ """Create OffScreenRenderEnv, falling back to OSMesa if EGL is unavailable."""
+ try:
+ return OffScreenRenderEnv(**env_args)
+ except ImportError as exc:
+ msg = str(exc)
+ if "EGL" not in msg and "PLATFORM_DEVICE" not in msg:
+ raise
+
+ # Headless clusters often miss EGL PLATFORM_DEVICE support. Retry with
+ # software rendering to keep evaluation working.
+ os.environ["MUJOCO_GL"] = "osmesa"
+ os.environ["PYOPENGL_PLATFORM"] = "osmesa"
+ try:
+ return OffScreenRenderEnv(**env_args)
+ except Exception as fallback_exc:
+ raise ImportError(
+ "Failed to initialize robosuite offscreen renderer with both EGL and "
+ "OSMesa backends. Set up EGL-capable drivers or install OSMesa (e.g. "
+ "`conda install -c conda-forge mesalib`) and retry."
+ ) from fallback_exc
+
+
class LiberoEnv(gym.Env):
metadata = {"render_modes": ["rgb_array"], "render_fps": 80}
@@ -244,7 +324,7 @@ class LiberoEnv(gym.Env):
"camera_heights": self.observation_height,
"camera_widths": self.observation_width,
}
- env = OffScreenRenderEnv(**env_args)
+ env = _make_offscreen_env_with_renderer_fallback(env_args)
env.reset()
return env
diff --git a/src/lerobot/scripts/lerobot_benchmark.py b/src/lerobot/scripts/lerobot_benchmark.py
new file mode 100644
index 000000000..b8297162b
--- /dev/null
+++ b/src/lerobot/scripts/lerobot_benchmark.py
@@ -0,0 +1,462 @@
+#!/usr/bin/env python
+
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Benchmark runner: train and evaluate policies across simulation benchmarks.
+
+Orchestrates per-benchmark training and evaluation using the existing
+``lerobot-train`` and ``lerobot-eval`` CLI tools.
+
+Typical usage::
+
+ # Train SmolVLA on LIBERO-plus (4 GPUs, 50k steps):
+ lerobot-benchmark train \\
+ --benchmarks libero_plus \\
+ --policy-path lerobot/smolvla_base \\
+ --hub-user $HF_USER \\
+ --num-gpus 4 --steps 50000
+
+ # Evaluate the trained policies:
+ lerobot-benchmark eval \\
+ --benchmarks libero_plus \\
+ --hub-user $HF_USER
+
+ # Full pipeline (train → upload → eval) for multiple benchmarks:
+ lerobot-benchmark all \\
+ --benchmarks libero_plus,robocasa,robomme \\
+ --policy-path lerobot/smolvla_base \\
+ --hub-user $HF_USER \\
+ --num-gpus 4 --steps 50000
+
+ # List the available benchmarks and their eval suites:
+ lerobot-benchmark list
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+import shutil
+import subprocess
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+log = logging.getLogger(__name__)
+
+
+@dataclass
+class BenchmarkEntry:
+ """Training + evaluation settings for a single benchmark.
+
+ When ``eval_tasks`` is set, evaluation runs once per task in the list
+ (e.g. libero_spatial, libero_object, …). ``env_task`` is still used as
+ the task for mid-training evaluation during ``lerobot-train``.
+ """
+
+ dataset_repo_id: str
+ env_type: str
+ env_task: str
+ eval_tasks: list[str] | None = None
+ train_overrides: dict[str, str] = field(default_factory=dict)
+ eval_overrides: dict[str, str] = field(default_factory=dict)
+
+
+LIBERO_SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]
+
+# Each benchmark maps a human-readable name to its dataset and eval env.
+# ``dataset_repo_id`` may contain a ``{hub_user}`` placeholder, which is
+# interpolated at runtime from ``--hub-user``.
+BENCHMARK_REGISTRY: dict[str, BenchmarkEntry] = {
+ "libero": BenchmarkEntry(
+ dataset_repo_id="{hub_user}/libero",
+ env_type="libero",
+ env_task="libero_spatial",
+ eval_tasks=LIBERO_SUITES,
+ ),
+ "libero_plus": BenchmarkEntry(
+ dataset_repo_id="{hub_user}/libero_plus",
+ env_type="libero_plus",
+ env_task="libero_spatial",
+ eval_tasks=LIBERO_SUITES,
+ ),
+ "metaworld": BenchmarkEntry(
+ dataset_repo_id="{hub_user}/metaworld",
+ env_type="metaworld",
+ env_task="metaworld-push-v2",
+ ),
+ "robocasa": BenchmarkEntry(
+ dataset_repo_id="{hub_user}/robocasa",
+ env_type="robocasa",
+ env_task="PickPlaceCounterToCabinet",
+ ),
+ "robomme": BenchmarkEntry(
+ dataset_repo_id="{hub_user}/robomme",
+ env_type="robomme",
+ env_task="PickXtimes",
+ ),
+}
+
+
+def _policy_repo_id(hub_user: str, policy_name: str, benchmark: str) -> str:
+ return f"{hub_user}/{policy_name}_{benchmark}"
+
+
+def _extra_keys(extra_args: list[str]) -> set[str]:
+ """Extract ``--key`` prefixes from ``--key=value`` style extra args.
+
+ Only the ``--key=value`` form is detected; a space-separated ``--key value``
+ pair will not suppress the corresponding default.
+ """
+ keys: set[str] = set()
+ for arg in extra_args:
+ if arg.startswith("--") and "=" in arg:
+ keys.add(arg.split("=", 1)[0])
+ return keys
+
+
+def _build_train_cmd(
+ benchmark: BenchmarkEntry,
+ *,
+ policy_path: str,
+ hub_user: str,
+ policy_name: str,
+ benchmark_name: str,
+ num_gpus: int,
+ steps: int,
+ batch_size: int,
+ eval_freq: int,
+ save_freq: int,
+ wandb: bool,
+ extra_args: list[str],
+) -> list[str]:
+ """Build the ``accelerate launch lerobot-train`` command list."""
+ lerobot_train = shutil.which("lerobot-train")
+ if lerobot_train is None:
+ raise RuntimeError("lerobot-train not found on PATH. Is lerobot installed?")
+
+ # Strip bare "--" separators that argparse may pass through
+ cleaned_extra = [a for a in extra_args if a != "--"]
+ overridden = _extra_keys(cleaned_extra)
+
+ repo_id = _policy_repo_id(hub_user, policy_name, benchmark_name)
+ dataset_id = benchmark.dataset_repo_id.format(hub_user=hub_user)
+
+ defaults: list[tuple[str, str]] = [
+ ("--policy.path", policy_path),
+ ("--dataset.repo_id", dataset_id),
+ ("--policy.repo_id", repo_id),
+ ("--env.type", benchmark.env_type),
+ ("--env.task", benchmark.env_task),
+ ("--steps", str(steps)),
+ ("--batch_size", str(batch_size)),
+ ("--eval_freq", str(eval_freq)),
+ ("--save_freq", str(save_freq)),
+ ("--output_dir", f"outputs/train/{policy_name}_{benchmark_name}"),
+ ("--job_name", f"{policy_name}_{benchmark_name}"),
+ ("--policy.push_to_hub", "true"),
+ ]
+ if wandb:
+ defaults.append(("--wandb.enable", "true"))
+ for k, v in benchmark.train_overrides.items():
+ defaults.append((f"--{k}", v))
+
+ cmd: list[str] = [
+ "accelerate", "launch",
+ "--multi_gpu",
+ f"--num_processes={num_gpus}",
+ lerobot_train,
+ ]
+ for key, val in defaults:
+ if key not in overridden:
+ cmd.append(f"{key}={val}")
+ cmd.extend(cleaned_extra)
+ return cmd
+
+
+def _build_eval_cmd(
+ benchmark: BenchmarkEntry,
+ *,
+ hub_user: str,
+ policy_name: str,
+ benchmark_name: str,
+ eval_task: str | None = None,
+ n_episodes: int,
+ batch_size_eval: int,
+ extra_args: list[str],
+) -> list[str]:
+ """Build the ``lerobot-eval`` command list.
+
+ ``eval_task`` overrides the benchmark's ``env_task`` so the same
+ benchmark can be evaluated on multiple suites (e.g. LIBERO).
+ """
+ lerobot_eval = shutil.which("lerobot-eval")
+ if lerobot_eval is None:
+ raise RuntimeError("lerobot-eval not found on PATH. Is lerobot installed?")
+
+ task = eval_task or benchmark.env_task
+ repo_id = _policy_repo_id(hub_user, policy_name, benchmark_name)
+ out_dir = _eval_output_dir(policy_name, benchmark_name, eval_task=task)
+
+ cleaned_extra = [a for a in extra_args if a != "--"]
+ overridden = _extra_keys(cleaned_extra)
+
+ defaults: list[tuple[str, str]] = [
+ ("--policy.path", repo_id),
+ ("--env.type", benchmark.env_type),
+ ("--env.task", task),
+ ("--eval.n_episodes", str(n_episodes)),
+ ("--eval.batch_size", str(batch_size_eval)),
+ ("--output_dir", out_dir),
+ ("--policy.device", "cuda"),
+ ]
+ for k, v in benchmark.eval_overrides.items():
+ defaults.append((f"--{k}", v))
+
+ cmd: list[str] = [lerobot_eval]
+ for key, val in defaults:
+ if key not in overridden:
+ cmd.append(f"{key}={val}")
+ cmd.extend(cleaned_extra)
+ return cmd
+
+
+def _eval_output_dir(policy_name: str, benchmark_name: str, eval_task: str | None = None) -> Path:
+ if eval_task:
+ return Path(f"outputs/eval/{policy_name}_{benchmark_name}/{eval_task}")
+ return Path(f"outputs/eval/{policy_name}_{benchmark_name}")
+
+
+def _run(cmd: list[str], *, dry_run: bool) -> None:
+ log.info("Command: %s", " \\\n ".join(cmd))
+ if dry_run:
+ log.info("[dry-run] Skipping execution.")
+ return
+ result = subprocess.run(cmd, check=False)
+ if result.returncode != 0:
+ log.error("Command failed with exit code %d", result.returncode)
+ sys.exit(result.returncode)
+
+
+def _push_eval_to_hub(
+ *,
+ hub_user: str,
+ policy_name: str,
+ benchmark_name: str,
+ eval_task: str | None = None,
+ dry_run: bool,
+) -> None:
+ """Upload eval results (metrics + videos) to the policy repo on the Hub."""
+ from huggingface_hub import HfApi
+
+ repo_id = _policy_repo_id(hub_user, policy_name, benchmark_name)
+ local_dir = _eval_output_dir(policy_name, benchmark_name, eval_task=eval_task)
+ hub_path = f"eval/{eval_task}" if eval_task else f"eval/{benchmark_name}"
+
+ if not local_dir.exists():
+ log.warning("Eval output dir %s does not exist, skipping hub upload.", local_dir)
+ return
+
+ log.info("Uploading eval results from %s to %s (path_in_repo=%s)", local_dir, repo_id, hub_path)
+ if dry_run:
+ log.info("[dry-run] Skipping upload.")
+ return
+
+ api = HfApi()
+ api.upload_folder(
+ folder_path=str(local_dir),
+ repo_id=repo_id,
+ path_in_repo=hub_path,
+ repo_type="model",
+ commit_message=f"Upload eval results for {eval_task or benchmark_name}",
+ )
+
+
+def _resolve_benchmarks(names: str) -> list[tuple[str, BenchmarkEntry]]:
+ out = []
+ for name in names.split(","):
+ name = name.strip()
+ if name not in BENCHMARK_REGISTRY:
+ available = ", ".join(BENCHMARK_REGISTRY)
+ raise ValueError(f"Unknown benchmark '{name}'. Available: {available}")
+ out.append((name, BENCHMARK_REGISTRY[name]))
+ return out
+
+
+def cmd_train(args: argparse.Namespace) -> None:
+ benchmarks = _resolve_benchmarks(args.benchmarks)
+ for bname, bentry in benchmarks:
+ log.info("=== Training on benchmark: %s ===", bname)
+ cmd = _build_train_cmd(
+ bentry,
+ policy_path=args.policy_path,
+ hub_user=args.hub_user,
+ policy_name=args.policy_name,
+ benchmark_name=bname,
+ num_gpus=args.num_gpus,
+ steps=args.steps,
+ batch_size=args.batch_size,
+ eval_freq=args.eval_freq,
+ save_freq=args.save_freq,
+ wandb=args.wandb,
+ extra_args=args.extra,
+ )
+ _run(cmd, dry_run=args.dry_run)
+
+
+def _run_eval_for_benchmark(
+ bname: str,
+ bentry: BenchmarkEntry,
+ args: argparse.Namespace,
+) -> None:
+ """Run evaluation for a single benchmark, iterating over all its eval_tasks."""
+ tasks = bentry.eval_tasks or [bentry.env_task]
+ for task in tasks:
+ log.info("=== Evaluating %s / %s ===", bname, task)
+ cmd = _build_eval_cmd(
+ bentry,
+ hub_user=args.hub_user,
+ policy_name=args.policy_name,
+ benchmark_name=bname,
+ eval_task=task if bentry.eval_tasks else None,
+ n_episodes=args.n_episodes,
+ batch_size_eval=args.batch_size_eval,
+ extra_args=args.extra,
+ )
+ _run(cmd, dry_run=args.dry_run)
+ if args.push_eval_to_hub:
+ _push_eval_to_hub(
+ hub_user=args.hub_user,
+ policy_name=args.policy_name,
+ benchmark_name=bname,
+ eval_task=task if bentry.eval_tasks else None,
+ dry_run=args.dry_run,
+ )
+
+
+def cmd_eval(args: argparse.Namespace) -> None:
+ benchmarks = _resolve_benchmarks(args.benchmarks)
+ for bname, bentry in benchmarks:
+ _run_eval_for_benchmark(bname, bentry, args)
+
+
+def cmd_all(args: argparse.Namespace) -> None:
+ """Train on each benchmark, then evaluate each."""
+ benchmarks = _resolve_benchmarks(args.benchmarks)
+
+ log.info("Phase 1: Training on %d benchmark(s)", len(benchmarks))
+ cmd_train(args)
+
+ log.info("Phase 2: Evaluating %d benchmark(s)", len(benchmarks))
+ for bname, bentry in benchmarks:
+ _run_eval_for_benchmark(bname, bentry, args)
+
+
+def _add_common_args(p: argparse.ArgumentParser) -> None:
+ p.add_argument(
+ "--benchmarks", required=True,
+ help="Comma-separated benchmark names (e.g. libero_plus,robocasa,robomme).",
+ )
+ p.add_argument("--hub-user", required=True, help="Hugging Face Hub username.")
+ p.add_argument(
+ "--policy-name", default="smolvla",
+ help="Short policy name used in repo IDs and output dirs (default: smolvla).",
+ )
+ p.add_argument("--dry-run", action="store_true", help="Print commands without executing.")
+
+
+def _add_train_args(p: argparse.ArgumentParser) -> None:
+ p.add_argument("--policy-path", default="lerobot/smolvla_base", help="Pretrained policy path.")
+ p.add_argument("--num-gpus", type=int, default=4, help="Number of GPUs.")
+ p.add_argument("--steps", type=int, default=50_000, help="Total training steps.")
+ p.add_argument("--batch-size", type=int, default=32, help="Per-GPU batch size.")
+ p.add_argument("--eval-freq", type=int, default=10_000, help="Eval every N steps (0 to disable).")
+ p.add_argument("--save-freq", type=int, default=10_000, help="Save checkpoint every N steps.")
+ p.add_argument("--wandb", action="store_true", help="Enable Weights & Biases logging.")
+
+
+def _add_eval_args(p: argparse.ArgumentParser) -> None:
+ p.add_argument("--n-episodes", type=int, default=50, help="Number of eval episodes.")
+ p.add_argument("--batch-size-eval", type=int, default=10, help="Eval batch size (parallel envs).")
+ p.add_argument(
+ "--push-eval-to-hub", action="store_true",
+ help="Upload eval results (metrics + videos) to the policy repo on the Hub.",
+ )
+
+
+def build_parser() -> argparse.ArgumentParser:
+ parser = argparse.ArgumentParser(
+ prog="lerobot-benchmark",
+ description="Train and evaluate policies across simulation benchmarks.",
+ )
+ sub = parser.add_subparsers(dest="command", required=True)
+
+ # train
+ p_train = sub.add_parser("train", help="Train a policy on each selected benchmark.")
+ _add_common_args(p_train)
+ _add_train_args(p_train)
+ p_train.set_defaults(func=cmd_train)
+
+ # eval
+ p_eval = sub.add_parser("eval", help="Evaluate trained policies on each benchmark.")
+ _add_common_args(p_eval)
+ _add_eval_args(p_eval)
+ p_eval.set_defaults(func=cmd_eval)
+
+ # all (train + eval)
+ p_all = sub.add_parser("all", help="Train then evaluate on each benchmark.")
+ _add_common_args(p_all)
+ _add_train_args(p_all)
+ _add_eval_args(p_all)
+ p_all.set_defaults(func=cmd_all)
+
+ # list
+ p_list = sub.add_parser("list", help="List available benchmarks.")
+ p_list.set_defaults(func=lambda _args: _list_benchmarks())
+
+ return parser
+
+
+def _list_benchmarks() -> None:
+ print("Available benchmarks:\n")
+ for name, entry in BENCHMARK_REGISTRY.items():
+ print(f" {name}")
+ print(f" dataset: {entry.dataset_repo_id}")
+ print(f" env: {entry.env_type}")
+ if entry.eval_tasks:
+ print(f" eval on: {', '.join(entry.eval_tasks)}")
+ else:
+ print(f" eval on: {entry.env_task}")
+ print()
+
+
+def main() -> None:
+ parser = build_parser()
+ args, extra = parser.parse_known_args()
+ args.extra = extra
+ args.func(args)
+
+
+if __name__ == "__main__":
+ main()