Compare commits


13 Commits

Author SHA1 Message Date
Pepijn 63dedac255 fix(ci): downgrade contents permission to read in claude.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 19:19:31 +02:00
Pepijn b0286b10cf chore: remove root CLAUDE.md (moved to .github/CLAUDE.md)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 18:04:48 +02:00
Pepijn 7a8b02cd32 refactor(ci): move CLAUDE.md to .github/ to keep repo root clean
CLAUDE.md is CI-only config — moving it to .github/ ensures it is not
visible at the repo root when contributors clone lerobot. Both workflows
now explicitly reference .github/CLAUDE.md in their prompt/system-prompt.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 18:03:06 +02:00
Pepijn 892e9f13b7 docs(claude): remove LOC minimization guideline from CLAUDE.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:59:55 +02:00
Pepijn 4b8436aefa feat(ci): restrict @claude trigger to repo owners, members, and collaborators
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:57:50 +02:00
Pepijn 9d97426cb8 Merge branch 'main' into fix/claude-code-action-precommit 2026-04-08 17:56:32 +02:00
Pepijn e8f504edaa feat(ci): use claude-opus-4-6 for PR reviews
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:56:01 +02:00
Pepijn db7334a384 docs(claude): add Processor to core abstractions in CLAUDE.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:54:17 +02:00
Pepijn 5de7aa5a4f refactor(envs): move benchmark dispatch into EnvConfig subclasses (#3272)
* docs(benchmarks): add benchmark integration guide and standardize benchmark docs

Add a comprehensive guide for adding new benchmarks to LeRobot, and
refactor the existing LIBERO and Meta-World docs to follow the new
standardized template.

* refactor(envs): move dispatch logic from factory into EnvConfig subclasses

Replace hardcoded if/elif chains in factory.py with create_envs() and
get_env_processors() methods on EnvConfig. New benchmarks now only need
to register a config subclass — no factory.py edits required.

Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic.

* docs(benchmarks): clean up adding-benchmarks guide for clarity

Rewrite for simpler language, better structure, and easier navigation.
Move quick-reference table to the top, fold eval explanation into
architecture section, condense the doc template to a bulleted outline.

* fix link

* fix task count

* fix(tests): fix 3 failing dispatch tests

- test_registry_all_types: skip non-EnvConfig stubs (e.g. TestPluginConfig)
- test_processors_delegation: use None instead of abstract PreTrainedConfig
- test_custom_get_env_processors_override: use DataProcessorPipeline for isinstance check (PolicyProcessorPipeline is a subscripted generic)

* fix: enable SmolVLA eval on LIBERO with custom camera mappings

- Thread camera_name_mapping from LiberoEnv config through to gym envs
- Sync features_map with camera_name_mapping in LiberoEnv.__post_init__
- Fix render() to use first available camera instead of hardcoded "image"
- Handle non-dict final_info in rollout by falling back to info["is_success"]
- Add use_peft legacy field to SmolVLAConfig for checkpoint compat
- Add defaults to GR00TN15Config init=False fields for transformers 5.3

Made-with: Cursor

* fix: use direct AutoresetMode import for gymnasium compat

Made-with: Cursor

* fix: handle gymnasium < 1.0 without AutoresetMode

Made-with: Cursor

* refactor: revert policy changes, keep env-only camera mapping fixes

- Revert GR00T N1.5 default_factory/default changes (transformers compat)
- Revert SmolVLA use_peft legacy field
- Apply ruff formatting fixes
- camera_name_mapping stays entirely in env/eval layer (no policy changes)

Made-with: Cursor

* Update docs/source/env_processor.mdx

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update docs/source/env_processor.mdx

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* Update docs/source/env_processor.mdx

Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>

* fix(eval): raise RuntimeError for unsupported final_info format (Gymnasium < 1.0)

Made-with: Cursor

* style: fix markdown code fences in env_processor.mdx

Made-with: Cursor

* docs: remove duplicate code blocks in env_processor.mdx

Made-with: Cursor

* style: revert quadruple backticks to triple (prettier compat)

* docs(env_processor): add EnvConfig subclass step and policy_cfg examples

- Add missing '### 2. Update Your EnvConfig Subclass' section with
  get_env_processors() snippet
- Update factory usage example to show policy_cfg parameter and
  keyword-argument style for both SmolVLA and ACT cases

* docs(env_processor): rename step 2 and fix policy_cfg examples

- Rename '### 2. Update the Factory' → '### 2. Update Your EnvConfig Subclass'
- Update factory usage examples to use keyword-argument style with
  policy_cfg parameter for both SmolVLA and ACT cases

---------

Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
2026-04-08 17:48:58 +02:00
Pepijn fc8d89b128 feat(ci): add CLAUDE.md and improve claude-code-action workflows
- Add CLAUDE.md with lerobot-specific review instructions (core abstractions,
  engineering principles, ML-specific checks, PR checklist)
- Enable use_sticky_comment: true on both workflows (single updating comment per PR)
- Add structured lerobot-specific review prompt to claude-code-review.yml
- Upgrade permissions: contents/pull-requests/issues write for interactive claude.yml
- Add actions: read to claude-code-review.yml for CI log access
- Set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true to suppress Node.js 20 deprecation warnings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:47:41 +02:00
Pepijn e0bde22193 fix(ci): pin claude-code-action to commit SHA and add persist-credentials: false
Fixes pre-commit zizmor failures from PR #3322:
- Pin anthropics/claude-code-action@v1 to commit hash (26ddc358) to satisfy blanket pinning policy
- Add persist-credentials: false to actions/checkout steps to suppress credential-persistence warning
- Remove trailing blank lines to satisfy end-of-file-fixer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:34:53 +02:00
Pauline Bailly-Masson 055f20f658 "Claude Code Review workflow" 2026-04-08 17:22:05 +02:00
Pauline Bailly-Masson 30d2fe3bb3 "Claude PR Assistant workflow" 2026-04-08 17:22:03 +02:00
22 changed files with 350 additions and 1255 deletions
+86
@@ -0,0 +1,86 @@
# LeRobot — Claude Code Instructions
You are a senior robotics ML engineer reviewing code for **LeRobot**, a PyTorch framework for real-world robot learning.
Apply these principles to every PR review, fix, or task.
---
## Core Abstractions
These are the load-bearing types. Handle them with care — breaking changes here affect every user.
| Type | Location | Role |
| ---------------- | ---------------------------- | ------------------------------------------------------------ |
| `LeRobotDataset` | `src/lerobot/datasets/` | Streaming replay buffer; HF Hub integration |
| `Policy` | `src/lerobot/policies/` | Base class for all learning agents (ACT, Diffusion, SARM, …) |
| `Robot` | `src/lerobot/robots/` | Hardware abstraction; carries `_output_pipeline` |
| `Teleoperator` | `src/lerobot/teleoperators/` | Leader-side hardware abstraction; carries `_output_pipeline` |
| `Env` | `src/lerobot/envs/` | Gym-like robotics environments |
| `Processor` | `src/lerobot/processor/` | Data transformation pipelines attached to robots/teleops |
**Never break their public APIs without a migration note and explicit user approval.**
---
## Engineering Principles
### Code quality
- Explicit over magic — no hidden control flow, no implicit state.
- No deep inheritance trees. Prefer composition.
- No decorative comment separators (`===`, `---`, etc.).
- Add comments only where the logic is non-obvious.
- No over-engineering. YAGNI applies strictly.
### Type safety
- All new and modified Python code must be fully typed (PEP 484).
- `mypy --strict` must pass on changed files.
- Do not widen or weaken existing type signatures.
### Backwards compatibility
- Public API changes require migration notes.
- Additive changes are preferred over modifications.
- `so100_follower` / `so101_follower` are aliases — never bleed changes there unintentionally.
### HF ecosystem
- Use `push_to_hub()`, HF Hub dataset streaming, and `evaluate` scripts.
- Dataset changes must preserve streaming compatibility.
- Prefer reusing HF primitives over rolling custom solutions.
---
## PR Review Checklist
Before approving or marking P1 issues resolved, verify:
- [ ] `pre-commit run -a` would pass (ruff, mypy, typos, zizmor, bandit)
- [ ] All new/modified code is typed and passes `mypy --strict`
- [ ] New features have unit tests; no silent behavioral changes
- [ ] Public APIs of `LeRobotDataset`, `Policy`, `Robot`, `Teleoperator`, `Env` are unchanged (or migration note present)
- [ ] HF Hub streaming still works for dataset changes
- [ ] No unnecessary abstractions introduced
- [ ] No breaking changes to training scripts (`lerobot-train`, `lerobot-eval`, `lerobot-record`)
---
## ML-Specific Checks
Flag these as **P1** if found:
- **Data leakage**: train and val/test splits must be constructed before any normalization or augmentation that uses train statistics.
- **Loss function errors**: verify reduction mode (`mean` vs `sum`), correct masking, correct shape alignment (see the sketch after this list).
- **Gradient flow**: new modules must have gradients flowing (check `requires_grad`, no detached tensors in the loss path by accident).
- **Distributed training**: operations on tensors must be DDP-safe; no in-place ops on parameters; batch norm needs `SyncBatchNorm` if used.
- **Memory leaks**: no accumulation of tensors outside the training loop; `optimizer.zero_grad()` called correctly.
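To make the loss-reduction and masking check concrete, here is a minimal sketch of a correctly masked L1 loss (function and variable names are illustrative, not LeRobot API):
```python
import torch

def masked_l1_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L1 loss averaged only over valid timesteps.

    pred/target: (B, T, D); mask: (B, T) with 1.0 for real steps, 0.0 for padding.
    A bare .mean() would also average over padded steps and silently shrink
    the loss, which is exactly the reduction/masking bug to flag as P1.
    """
    per_step = (pred - target).abs().mean(dim=-1)  # (B, T)
    return (per_step * mask).sum() / mask.sum().clamp(min=1)
```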
---
## What to Skip
- Don't flag style nitpicks on unchanged surrounding code.
- Don't propose refactors outside the PR's scope.
- Don't add docstrings or comments to code the PR didn't touch.
- Don't suggest speculative future features (YAGNI).
-309
@@ -1,309 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Integration tests: build an isolated Docker image per benchmark and run a
# 1-episode smoke eval. Each benchmark gets its own image so incompatible
# dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide.
#
# To add a new benchmark:
# 1. Add docker/Dockerfile.benchmark.<name> (install only lerobot[<name>])
# 2. Copy one of the jobs below and adjust the image name and eval command.
name: Benchmark Integration Tests
on:
# Run manually from the Actions tab
workflow_dispatch:
# Run every Monday at 02:00 UTC.
schedule:
- cron: "0 2 * * 1"
push:
branches:
- feat/benchmark-ci
- main
paths:
- "src/lerobot/envs/**"
- "src/lerobot/scripts/lerobot_eval.py"
- "docker/Dockerfile.benchmark.*"
- ".github/workflows/benchmark_tests.yml"
- "pyproject.toml"
pull_request:
branches:
- main
paths:
- "src/lerobot/envs/**"
- "src/lerobot/scripts/lerobot_eval.py"
- "docker/Dockerfile.benchmark.*"
- ".github/workflows/benchmark_tests.yml"
- "pyproject.toml"
permissions:
contents: read
env:
UV_VERSION: "0.8.0"
PYTHON_VERSION: "3.12"
# Cancel in-flight runs for the same branch/PR.
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true
jobs:
# ── LIBERO ────────────────────────────────────────────────────────────────
# Isolated image: lerobot[libero] only (hf-libero, dm-control, mujoco chain)
libero-integration-test:
name: Libero — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
# Build the benchmark-specific image; layer cache lives in the runner's
# local Docker daemon — reused across re-runs on the same machine.
- name: Build Libero benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.libero
push: false
load: true
tags: lerobot-benchmark-libero:ci
cache-from: type=local,src=/tmp/.buildx-cache-libero
cache-to: type=local,dest=/tmp/.buildx-cache-libero,mode=max
- name: Login to Hugging Face
if: env.HF_USER_TOKEN != ''
run: |
docker run --rm \
-e HF_HOME=/tmp/hf \
lerobot-benchmark-libero:ci \
bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"
- name: Run Libero smoke eval (1 episode)
run: |
# Named container (no --rm) so we can docker cp artifacts out.
# Output to /tmp inside the container — user_lerobot cannot create
# root-level dirs like /artifacts.
docker run --name libero-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-libero:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--env.type=libero \
--env.task=libero_spatial \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/eval-artifacts
python3 /lerobot/scripts/ci/extract_task_descriptions.py \
--env libero --task libero_spatial \
--output /tmp/eval-artifacts/task_descriptions.json 2>/dev/null || true
"
- name: Copy Libero artifacts from container
if: always()
run: |
mkdir -p /tmp/libero-artifacts
docker cp libero-eval:/tmp/eval-artifacts/. /tmp/libero-artifacts/ 2>/dev/null || true
docker rm -f libero-eval || true
- name: Parse Libero eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/libero-artifacts \
--env libero \
--task libero_spatial \
--policy pepijn223/smolvla_libero
- name: Upload Libero rollout video
if: always()
uses: actions/upload-artifact@v4
with:
name: libero-rollout-video
path: /tmp/libero-artifacts/videos/
if-no-files-found: warn
- name: Upload Libero eval metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: libero-metrics
path: /tmp/libero-artifacts/metrics.json
if-no-files-found: warn
# ── LIBERO TRAIN+EVAL SMOKE ──────────────────────────────────────────────
# Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only) then
# immediately runs eval inside the training loop (eval_freq=1, 1 episode).
# Tests the full train→eval-within-training pipeline end-to-end.
- name: Run Libero train+eval smoke (1 step, eval_freq=1)
run: |
docker run --name libero-train-smoke --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-libero:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
accelerate launch --num_processes=1 \$(which lerobot-train) \
--policy.path=lerobot/smolvla_base \
--policy.load_vlm_weights=true \
--policy.scheduler_decay_steps=25000 \
--policy.freeze_vision_encoder=false \
--policy.train_expert_only=false \
--dataset.repo_id=lerobot/libero \
--dataset.episodes=[0] \
--dataset.use_imagenet_stats=false \
--env.type=libero \
--env.task=libero_spatial \
'--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
--policy.empty_cameras=1 \
--output_dir=/tmp/train-smoke \
--steps=1 \
--batch_size=1 \
--eval_freq=1 \
--eval.n_episodes=1 \
--eval.batch_size=1 \
--eval.use_async_envs=false \
--save_freq=1 \
--policy.push_to_hub=false \
'--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.image2\": \"observation.images.camera2\"}'
"
- name: Copy Libero train-smoke artifacts from container
if: always()
run: |
mkdir -p /tmp/libero-train-smoke-artifacts
docker cp libero-train-smoke:/tmp/train-smoke/. /tmp/libero-train-smoke-artifacts/ 2>/dev/null || true
docker rm -f libero-train-smoke || true
- name: Upload Libero train-smoke eval video
if: always()
uses: actions/upload-artifact@v4
with:
name: libero-train-smoke-video
path: /tmp/libero-train-smoke-artifacts/eval/
if-no-files-found: warn
# ── METAWORLD ─────────────────────────────────────────────────────────────
# Isolated image: lerobot[metaworld] only (metaworld==3.0.0, mujoco>=3 chain)
metaworld-integration-test:
name: MetaWorld — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Build MetaWorld benchmark image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.metaworld
push: false
load: true
tags: lerobot-benchmark-metaworld:ci
cache-from: type=local,src=/tmp/.buildx-cache-metaworld
cache-to: type=local,dest=/tmp/.buildx-cache-metaworld,mode=max
- name: Run MetaWorld smoke eval (1 episode)
run: |
docker run --name metaworld-eval --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
-e HF_HUB_DOWNLOAD_TIMEOUT=300 \
lerobot-benchmark-metaworld:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=pepijn223/smolvla_metaworld \
--env.type=metaworld \
--env.task=metaworld-push-v3 \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda \
'--rename_map={\"observation.image\": \"observation.images.camera1\"}' \
--policy.empty_cameras=2 \
--output_dir=/tmp/eval-artifacts
python3 /lerobot/scripts/ci/extract_task_descriptions.py \
--env metaworld --task metaworld-push-v3 \
--output /tmp/eval-artifacts/task_descriptions.json 2>/dev/null || true
"
- name: Copy MetaWorld artifacts from container
if: always()
run: |
mkdir -p /tmp/metaworld-artifacts
docker cp metaworld-eval:/tmp/eval-artifacts/. /tmp/metaworld-artifacts/ 2>/dev/null || true
docker rm -f metaworld-eval || true
- name: Parse MetaWorld eval metrics
if: always()
run: |
python3 scripts/ci/parse_eval_metrics.py \
--artifacts-dir /tmp/metaworld-artifacts \
--env metaworld \
--task metaworld-push-v3 \
--policy pepijn223/smolvla_metaworld
- name: Upload MetaWorld rollout video
if: always()
uses: actions/upload-artifact@v4
with:
name: metaworld-rollout-video
path: /tmp/metaworld-artifacts/videos/
if-no-files-found: warn
- name: Upload MetaWorld eval metrics
if: always()
uses: actions/upload-artifact@v4
with:
name: metaworld-metrics
path: /tmp/metaworld-artifacts/metrics.json
if-no-files-found: warn
+49
@@ -0,0 +1,49 @@
name: Claude Code Review
on:
pull_request:
types: [opened, synchronize, ready_for_review, reopened]
jobs:
claude-review:
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
issues: read
id-token: write
actions: read
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 1
persist-credentials: false
- name: Run Claude Code Review
id: claude-review
uses: anthropics/claude-code-action@26ddc358fe3befff50c5ec2f80304c90c763f6f8 # v1
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
use_sticky_comment: true
prompt: |
Read `.github/CLAUDE.md` for lerobot-specific conventions, then review this PR.
Provide structured, actionable feedback.
Focus areas (in priority order):
1. **Correctness**: Logic errors, off-by-ones, wrong tensor shapes, incorrect loss functions
2. **Type safety**: All new/modified Python code must pass `mypy --strict`; check for missing annotations
3. **Backwards compatibility**: Does this break `LeRobotDataset`, `Policy`, `Robot`, `Teleoperator`, `Env`, or `Processor` public APIs?
4. **Tests**: New features must have tests; no silent behavioral changes
5. **Code style**: Explicit over magic, no unnecessary abstractions, no decorative comments
6. **HF integration**: Dataset streaming, `push_to_hub`, HF Hub compatibility preserved?
7. **pre-commit**: Would `pre-commit run -a` pass? (ruff, mypy, typos, zizmor)
Format findings as P1 (must fix) / P2 (should fix) / P3 (nice to have).
Skip P3 if the PR is already high quality.
claude_args: '--model claude-opus-4-6'
# See https://github.com/anthropics/claude-code-action/blob/main/docs/usage.md
# or https://code.claude.com/docs/en/cli-reference for available options
+58
@@ -0,0 +1,58 @@
name: Claude Code
on:
issue_comment:
types: [created]
pull_request_review_comment:
types: [created]
issues:
types: [opened, assigned]
pull_request_review:
types: [submitted]
jobs:
claude:
if: |
(github.event_name == 'issue_comment' &&
contains(github.event.comment.body, '@claude') &&
(github.event.comment.author_association == 'OWNER' || github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'COLLABORATOR')) ||
(github.event_name == 'pull_request_review_comment' &&
contains(github.event.comment.body, '@claude') &&
(github.event.comment.author_association == 'OWNER' || github.event.comment.author_association == 'MEMBER' || github.event.comment.author_association == 'COLLABORATOR')) ||
(github.event_name == 'pull_request_review' &&
contains(github.event.review.body, '@claude') &&
(github.event.review.author_association == 'OWNER' || github.event.review.author_association == 'MEMBER' || github.event.review.author_association == 'COLLABORATOR')) ||
(github.event_name == 'issues' &&
(contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')) &&
(github.event.issue.author_association == 'OWNER' || github.event.issue.author_association == 'MEMBER' || github.event.issue.author_association == 'COLLABORATOR'))
runs-on: ubuntu-latest
permissions:
contents: read
pull-requests: write
issues: write
id-token: write
actions: read
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
fetch-depth: 1
persist-credentials: false
- name: Run Claude Code
id: claude
uses: anthropics/claude-code-action@26ddc358fe3befff50c5ec2f80304c90c763f6f8 # v1
with:
anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
use_sticky_comment: true
# This is an optional setting that allows Claude to read CI results on PRs
additional_permissions: |
actions: read
claude_args: '--system-prompt "Read .github/CLAUDE.md for lerobot-specific conventions before responding."'
# See https://github.com/anthropics/claude-code-action/blob/main/docs/usage.md
# or https://code.claude.com/docs/en/cli-reference for available options
-89
@@ -1,89 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Isolated benchmark image for LIBERO integration tests.
# Installs only lerobot[libero] so its dep tree (hf-libero, dm-control, mujoco)
# cannot conflict with other benchmarks.
#
# Build: docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
# Run: docker run --gpus all --rm lerobot-benchmark-libero lerobot-eval ...
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
ARG PYTHON_VERSION=3.12
ENV DEBIAN_FRONTEND=noninteractive \
MUJOCO_GL=egl \
PATH=/lerobot/.venv/bin:$PATH \
CUDA_VISIBLE_DEVICES=0 \
DEVICE=cuda
# System deps — same set as Dockerfile.internal
RUN apt-get update && apt-get install -y --no-install-recommends \
software-properties-common build-essential git curl \
libglib2.0-0 libgl1-mesa-glx libegl1-mesa ffmpeg \
libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
cmake pkg-config ninja-build \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-dev \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& mv /root/.local/bin/uv /usr/local/bin/uv \
&& useradd --create-home --shell /bin/bash user_lerobot \
&& usermod -aG sudo user_lerobot \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /lerobot
RUN chown -R user_lerobot:user_lerobot /lerobot
USER user_lerobot
ENV HOME=/home/user_lerobot \
HF_HOME=/home/user_lerobot/.cache/huggingface \
HF_LEROBOT_HOME=/home/user_lerobot/.cache/huggingface/lerobot \
TORCH_HOME=/home/user_lerobot/.cache/torch \
TRITON_CACHE_DIR=/home/user_lerobot/.cache/triton
RUN uv venv --python python${PYTHON_VERSION}
# Install only lerobot[libero] — completely isolated from metaworld's dep tree
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml uv.lock README.md MANIFEST.in ./
COPY --chown=user_lerobot:user_lerobot src/ src/
RUN uv sync --locked --extra libero --extra smolvla --no-cache
# Pre-download lerobot/libero-assets from HF Hub so nothing is fetched at
# runtime (which times out on CI). Point the libero config at the cached path.
# libero/libero/__init__.py calls input() when ~/.libero/config.yaml is missing,
# so we write the config before any libero import can happen.
RUN LIBERO_DIR=$(python${PYTHON_VERSION} -c \
"import importlib.util, os; s=importlib.util.find_spec('libero'); \
print(os.path.join(os.path.dirname(s.origin), 'libero'))") && \
mkdir -p /home/user_lerobot/.libero && \
python${PYTHON_VERSION} -c "\
from huggingface_hub import snapshot_download; \
snapshot_download(repo_id='lerobot/libero-assets', repo_type='dataset', \
local_dir='/home/user_lerobot/.libero/assets')" && \
printf "assets: /home/user_lerobot/.libero/assets\nbddl_files: ${LIBERO_DIR}/bddl_files\ndatasets: ${LIBERO_DIR}/../datasets\ninit_states: ${LIBERO_DIR}/init_files\n" \
> /home/user_lerobot/.libero/config.yaml
RUN chmod +x /lerobot/.venv/lib/python${PYTHON_VERSION}/site-packages/triton/backends/nvidia/bin/ptxas
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
-74
@@ -1,74 +0,0 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Isolated benchmark image for MetaWorld integration tests.
# Installs only lerobot[metaworld] so its dep tree (metaworld==3.0.0, mujoco>=3)
# cannot conflict with other benchmarks.
#
# Build: docker build -f docker/Dockerfile.benchmark.metaworld -t lerobot-benchmark-metaworld .
# Run: docker run --gpus all --rm lerobot-benchmark-metaworld lerobot-eval ...
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
ARG PYTHON_VERSION=3.12
ENV DEBIAN_FRONTEND=noninteractive \
MUJOCO_GL=egl \
PATH=/lerobot/.venv/bin:$PATH \
CUDA_VISIBLE_DEVICES=0 \
DEVICE=cuda
# System deps — same set as Dockerfile.internal
RUN apt-get update && apt-get install -y --no-install-recommends \
software-properties-common build-essential git curl \
libglib2.0-0 libgl1-mesa-glx libegl1-mesa ffmpeg \
libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
cmake pkg-config ninja-build \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-dev \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& mv /root/.local/bin/uv /usr/local/bin/uv \
&& useradd --create-home --shell /bin/bash user_lerobot \
&& usermod -aG sudo user_lerobot \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /lerobot
RUN chown -R user_lerobot:user_lerobot /lerobot
USER user_lerobot
ENV HOME=/home/user_lerobot \
HF_HOME=/home/user_lerobot/.cache/huggingface \
HF_LEROBOT_HOME=/home/user_lerobot/.cache/huggingface/lerobot \
TORCH_HOME=/home/user_lerobot/.cache/torch \
TRITON_CACHE_DIR=/home/user_lerobot/.cache/triton
RUN uv venv --python python${PYTHON_VERSION}
# Install only lerobot[metaworld] — completely isolated from libero's dep tree
COPY --chown=user_lerobot:user_lerobot setup.py pyproject.toml uv.lock README.md MANIFEST.in ./
COPY --chown=user_lerobot:user_lerobot src/ src/
RUN uv sync --locked --extra metaworld --extra smolvla --no-cache
RUN chmod +x /lerobot/.venv/lib/python${PYTHON_VERSION}/site-packages/triton/backends/nvidia/bin/ptxas
COPY --chown=user_lerobot:user_lerobot . .
CMD ["/bin/bash"]
-2
@@ -73,8 +73,6 @@
title: Control & Train Robots in Sim (LeIsaac)
title: "Simulation"
- sections:
- local: evaluation
title: Evaluation (lerobot-eval)
- local: adding_benchmarks
title: Adding a New Benchmark
- local: libero
+13 -90
@@ -26,7 +26,7 @@ During evaluation, data moves through four stages:
1. gym.Env ──→ raw observations (numpy dicts)
2. Preprocessing ──→ standard LeRobot keys + task description
(preprocess_observation in envs/utils.py, env.call("task_description"))
(preprocess_observation, add_envs_task in envs/utils.py)
3. Processors ──→ env-specific then policy-specific transforms
(env_preprocessor, policy_preprocessor)
@@ -122,17 +122,15 @@ Each `EnvConfig` subclass declares two dicts that tell the policy what to expect
### Checklist
| File | Required | Why |
| ----------------------------------------- | -------- | ------------------------------------------------------------ |
| `src/lerobot/envs/<benchmark>.py` | Yes | Wraps the simulator as a standard gym.Env |
| `src/lerobot/envs/configs.py` | Yes | Registers your benchmark and its `create_envs()` for the CLI |
| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms |
| `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys |
| `pyproject.toml` | Yes | Declares benchmark-specific dependencies |
| `docs/source/<benchmark>.mdx` | Yes | User-facing documentation page |
| `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar |
| `docker/Dockerfile.benchmark.<benchmark>` | Yes | Isolated Docker image for CI smoke tests |
| `.github/workflows/benchmark_tests.yml` | Yes | CI job that builds the image and runs a 1-episode smoke eval |
| File | Required | Why |
| ---------------------------------------- | -------- | ------------------------------------------------------------ |
| `src/lerobot/envs/<benchmark>.py` | Yes | Wraps the simulator as a standard gym.Env |
| `src/lerobot/envs/configs.py` | Yes | Registers your benchmark and its `create_envs()` for the CLI |
| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms |
| `src/lerobot/envs/utils.py` | Optional | Only if you need new raw observation keys |
| `pyproject.toml` | Yes | Declares benchmark-specific dependencies |
| `docs/source/<benchmark>.mdx` | Yes | User-facing documentation page |
| `docs/source/_toctree.yml` | Yes | Adds your page to the docs sidebar |
### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)
@@ -163,8 +161,6 @@ class MyBenchmarkEnv(gym.Env):
...
```
**GPU-based simulators (e.g. MuJoCo with EGL rendering):** If your simulator allocates GPU/EGL contexts during `__init__`, defer that allocation to a `_ensure_env()` helper called on first `reset()`/`step()`. This avoids inheriting stale GPU handles when `AsyncVectorEnv` spawns worker processes. See `LiberoEnv._ensure_env()` for the pattern.
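A minimal sketch of that lazy-init pattern (helper and attribute names are illustrative; see `LiberoEnv._ensure_env()` for the real implementation):
```python
import gymnasium as gym

class MyBenchmarkEnv(gym.Env):
    def __init__(self, task: str):
        super().__init__()
        self.task = task
        self._sim = None  # no GPU/EGL allocation in __init__

    def _ensure_env(self) -> None:
        # Created lazily inside the worker process, so AsyncVectorEnv
        # subprocesses never inherit a stale EGL context from the parent.
        if self._sim is None:
            self._sim = build_simulator(self.task)  # illustrative helper

    def reset(self, *, seed=None, options=None):
        self._ensure_env()
        return self._sim.reset(seed=seed)

    def step(self, action):
        self._ensure_env()
        return self._sim.step(action)
```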
Also provide a factory function that returns the nested dict structure:
```python
@@ -211,7 +207,7 @@ class MyBenchmarkEnvConfig(EnvConfig):
def gym_kwargs(self) -> dict:
return {"obs_type": self.obs_type, "render_mode": self.render_mode}
def create_envs(self, n_envs: int, use_async_envs: bool = True):
def create_envs(self, n_envs: int, use_async_envs: bool = False):
"""Override for multi-task benchmarks or custom env creation."""
from lerobot.envs.<benchmark> import create_<benchmark>_envs
return create_<benchmark>_envs(task=self.task, n_envs=n_envs, ...)
@@ -297,87 +293,14 @@ Add your benchmark to the "Benchmarks" section:
title: "Benchmarks"
```
### 7. CI smoke test (`docker/` + `.github/workflows/benchmark_tests.yml`)
Each benchmark must have an isolated Docker image and a CI job that runs a 1-episode eval. This catches install-time regressions (broken transitive deps, import errors, interactive prompts) before they reach users.
**Create `docker/Dockerfile.benchmark.<benchmark>`** — copy an existing one and change only the extra name:
```dockerfile
# Isolated benchmark image — installs lerobot[<benchmark>] only.
# Build: docker build -f docker/Dockerfile.benchmark.<benchmark> -t lerobot-benchmark-<benchmark> .
ARG CUDA_VERSION=12.4.1
ARG OS_VERSION=22.04
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${OS_VERSION}
ARG PYTHON_VERSION=3.12
# ... (same system deps as Dockerfile.benchmark.libero) ...
RUN uv sync --locked --extra <benchmark> --no-cache
```
Each benchmark gets its own image so its dependency tree (pinned simulator packages, specific mujoco/scipy versions) cannot conflict with other benchmarks.
**Add a job to `.github/workflows/benchmark_tests.yml`** — copy an existing job block and adjust:
```yaml
<benchmark>-integration-test:
name: <Benchmark> — build image + 1-episode eval
runs-on:
group: aws-g6-4xlarge-plus
env:
HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
lfs: true
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
with:
cache-binary: false
- name: Build <Benchmark> image
uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
with:
context: .
file: docker/Dockerfile.benchmark.<benchmark>
push: false
load: true
tags: lerobot-benchmark-<benchmark>:ci
cache-from: type=local,src=/tmp/.buildx-cache-<benchmark>
cache-to: type=local,dest=/tmp/.buildx-cache-<benchmark>,mode=max
- name: Run <Benchmark> smoke eval (1 episode)
run: |
docker run --rm --gpus all \
--shm-size=4g \
-e HF_HOME=/tmp/hf \
-e HF_USER_TOKEN="${HF_USER_TOKEN}" \
lerobot-benchmark-<benchmark>:ci \
bash -c "
hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
lerobot-eval \
--policy.path=<hub_policy_path> \
--env.type=<benchmark> \
--env.task=<task> \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--eval.use_async_envs=false \
--policy.device=cuda
"
```
**Tips:**
- If the benchmark library prompts for user input on import (like LIBERO asking for a dataset folder), pass the relevant env var in the `docker run` command (e.g. `-e LIBERO_DATA_FOLDER=/tmp/libero_data`).
- The job is scoped to only trigger on changes to `src/lerobot/envs/**`, `src/lerobot/scripts/lerobot_eval.py`, and the Dockerfiles — it won't run on unrelated PRs.
## Verifying your integration
After completing the steps above, confirm that everything works:
1. **Install** — `pip install -e ".[mybenchmark]"` and verify the dependency group installs cleanly.
2. **Smoke test env creation** — call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys (see the sketch after this list).
3. **Run a full eval** — `lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --policy.path=<any_compatible_policy>` to exercise the full pipeline end-to-end. (`batch_size` defaults to auto-tuning based on CPU cores; pass `--eval.batch_size=1` to force a single environment.)
3. **Run a full eval** — `lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --eval.batch_size=1 --policy.path=<any_compatible_policy>` to exercise the full pipeline end-to-end.
4. **Check success detection** — verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates.
5. **Add CI smoke test** — follow step 7 above to add a Dockerfile and CI job. This ensures the install stays green as dependencies evolve.
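A minimal sketch of the env-creation smoke test from step 2, assuming the example config class from earlier in this guide:
```python
from lerobot.envs.configs import MyBenchmarkEnvConfig  # your registered config subclass
from lerobot.envs.factory import make_env

cfg = MyBenchmarkEnvConfig(task="my_task_suite")
envs = make_env(cfg, n_envs=1)

# Expected shape: {suite: {task_id: VectorEnv}}
for suite, tasks in envs.items():
    for task_id, vec_env in tasks.items():
        obs, info = vec_env.reset(seed=0)
        print(suite, task_id, list(obs))  # keys should match your declared features
        vec_env.close()
```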
## Writing a benchmark doc page
@@ -388,7 +311,7 @@ Each benchmark `.mdx` page should include:
- **Overview image or GIF.**
- **Available tasks** — table of task suites with counts and brief descriptions.
- **Installation** — `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` for reproducible results. `batch_size` defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable. See the [Evaluation guide](evaluation) for details.
- **Evaluation** — recommended `lerobot-eval` command with `n_episodes` and `batch_size` for reproducible results. Include single-task and multi-task examples if applicable.
- **Policy inputs and outputs** — observation keys with shapes, action space description.
- **Recommended evaluation episodes** — how many episodes per task is standard.
- **Training** — example `lerobot-train` command.
+42 -48
@@ -90,11 +90,17 @@ The same policy can work with different environment processors, and the same env
```python
# Use SmolVLA policy with LIBERO environment
libero_preprocessor, libero_postprocessor = make_env_pre_post_processors(libero_cfg)
# Use SmolVLA policy with LIBERO environment
libero_preprocessor, libero_postprocessor = make_env_pre_post_processors(
env_cfg=libero_cfg,
policy_cfg=smolvla_cfg,
)
smolvla_preprocessor, smolvla_postprocessor = make_pre_post_processors(smolvla_cfg)
# Or use ACT policy with the same LIBERO environment
libero_preprocessor, libero_postprocessor = make_env_pre_post_processors(libero_cfg)
libero_preprocessor, libero_postprocessor = make_env_pre_post_processors(
env_cfg=libero_cfg,
policy_cfg=act_cfg,
)
act_preprocessor, act_postprocessor = make_pre_post_processors(act_cfg)
```
@@ -151,7 +157,7 @@ observation = {
### Factory Function
The `make_env_pre_post_processors` function follows the same pattern as `make_pre_post_processors` for policies:
The `make_env_pre_post_processors` function delegates to `env_cfg.get_env_processors()`:
```python
from lerobot.envs.factory import make_env_pre_post_processors
@@ -159,47 +165,31 @@ from lerobot.envs.configs import LiberoEnv, PushtEnv
# For LIBERO: Returns LiberoProcessorStep in preprocessor
libero_cfg = LiberoEnv(task="libero_spatial", camera_name=["agentview"])
env_preprocessor, env_postprocessor = make_env_pre_post_processors(libero_cfg)
env_preprocessor, env_postprocessor = make_env_pre_post_processors(libero_cfg, policy_cfg)
# For other environments: Returns identity processors (no-op)
pusht_cfg = PushtEnv()
env_preprocessor, env_postprocessor = make_env_pre_post_processors(pusht_cfg)
env_preprocessor, env_postprocessor = make_env_pre_post_processors(pusht_cfg, policy_cfg)
```
### Implementation in `envs/factory.py`
### How It Works
Each `EnvConfig` subclass can override `get_env_processors()` to return benchmark-specific
processor pipelines. The base class returns identity (no-op) processors by default.
```python
def make_env_pre_post_processors(
env_cfg: EnvConfig,
) -> tuple[
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
]:
"""
Create preprocessor and postprocessor pipelines for environment observations.
Args:
env_cfg: The configuration of the environment.
Returns:
A tuple containing:
- preprocessor: Pipeline that processes environment observations
- postprocessor: Pipeline that processes environment outputs
"""
# For LIBERO environments, add the LiberoProcessorStep to preprocessor
if isinstance(env_cfg, LiberoEnv) or "libero" in env_cfg.type:
preprocessor = PolicyProcessorPipeline(steps=[LiberoProcessorStep()])
else:
# For all other environments, return an identity preprocessor
preprocessor = PolicyProcessorPipeline(steps=[])
# Postprocessor is currently identity for all environments
# Future: Could add environment-specific action transformations
postprocessor = PolicyProcessorPipeline(steps=[])
return preprocessor, postprocessor
# In your EnvConfig subclass:
def get_env_processors(self):
from lerobot.processor.pipeline import PolicyProcessorPipeline
return (
PolicyProcessorPipeline(steps=[MyProcessorStep()]),
PolicyProcessorPipeline(steps=[]),
)
```
The factory function `make_env_pre_post_processors` simply delegates to this method,
with a special case for `XVLAConfig` policies, which override the env processors entirely.
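A minimal sketch of that delegation (simplified; the real factory also handles the `XVLAConfig` special case and the `policy_cfg` argument shown above):
```python
def make_env_pre_post_processors(env_cfg: EnvConfig, policy_cfg=None):
    # Base EnvConfig.get_env_processors() returns identity pipelines,
    # so benchmarks without custom steps need no factory changes.
    return env_cfg.get_env_processors()
```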
### Integration in Evaluation
In `lerobot_eval.py`, the environment processors are created once and used throughout:
@@ -219,7 +209,10 @@ def eval_main(cfg: EvalPipelineConfig):
)
# Create environment processors (NEW!)
env_preprocessor, env_postprocessor = make_env_pre_post_processors(env_cfg=cfg.env)
env_preprocessor, env_postprocessor = make_env_pre_post_processors(
env_cfg=cfg.env,
policy_cfg=cfg.policy,
)
# Run evaluation with both processor types
eval_policy_all(
@@ -323,21 +316,22 @@ class MyEnvProcessorStep(ObservationProcessorStep):
return processed
```
### 2. Update the Factory
### 2. Update Your `EnvConfig` Subclass
```python
# In src/lerobot/envs/factory.py
# In src/lerobot/envs/configs.py
@EnvConfig.register_subclass("myenv")
@dataclass
class MyEnvConfig(EnvConfig):
# ... task/features/gym kwargs ...
def make_env_pre_post_processors(env_cfg: EnvConfig):
if isinstance(env_cfg, LiberoEnv) or "libero" in env_cfg.type:
preprocessor = PolicyProcessorPipeline(steps=[LiberoProcessorStep()])
elif isinstance(env_cfg, MyEnvConfig) or "myenv" in env_cfg.type:
preprocessor = PolicyProcessorPipeline(steps=[MyEnvProcessorStep()])
else:
preprocessor = PolicyProcessorPipeline(steps=[])
def get_env_processors(self):
from lerobot.processor.pipeline import PolicyProcessorPipeline
postprocessor = PolicyProcessorPipeline(steps=[])
return preprocessor, postprocessor
return (
PolicyProcessorPipeline(steps=[MyEnvProcessorStep()]),
PolicyProcessorPipeline(steps=[]),
)
```
### 3. Use in Evaluation
-162
@@ -1,162 +0,0 @@
# Evaluation
`lerobot-eval` runs a trained policy on a simulation benchmark and reports success rate, reward, and (optionally) episode videos. It handles environment creation, batched rollouts, and metric aggregation automatically.
## Quick start
Evaluate a Hub-hosted policy on LIBERO:
```bash
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--env.type=libero \
--env.task=libero_spatial \
--eval.n_episodes=10 \
--policy.device=cuda
```
Evaluate a local checkpoint:
```bash
lerobot-eval \
--policy.path=outputs/train/act_pusht/checkpoints/005000/pretrained_model \
--env.type=pusht \
--eval.n_episodes=10
```
`batch_size` defaults to **auto** (based on CPU cores). The script picks the right number of parallel environments for your machine.
## Key flags
| Flag | Default | Description |
| ----------------------- | -------------- | ------------------------------------------------------------------------------------- |
| `--policy.path` | required | Hub repo ID or local path to a pretrained model |
| `--env.type` | required | Benchmark name (`pusht`, `libero`, `metaworld`, etc.) |
| `--env.task` | varies | Task or suite name (e.g. `libero_spatial`, `libero_10`) |
| `--eval.n_episodes` | `50` | Total episodes to run (across all tasks) |
| `--eval.batch_size` | `0` (auto) | Number of parallel environments. `0` = auto-tune from CPU cores |
| `--eval.use_async_envs` | `true` | Use `AsyncVectorEnv` (parallel stepping). Auto-downgrades to sync when `batch_size=1` |
| `--policy.device` | `cuda` | Inference device |
| `--policy.use_amp` | `false` | Mixed-precision inference (saves VRAM, faster on Ampere+) |
| `--seed` | `1000` | Random seed for reproducibility |
| `--output_dir` | auto-generated | Where to write results and videos |
### Environment-specific flags
Some benchmarks accept additional flags through `--env.*`:
```bash
# LIBERO: map simulator camera names to policy feature names
--env.camera_name_mapping='{"agentview_image": "camera1", "robot0_eye_in_hand_image": "camera2"}'
# Fill unused camera slots with zeros
--policy.empty_cameras=1
```
See each benchmark's documentation ([LIBERO](libero), [Meta-World](metaworld)) for benchmark-specific flags.
## How batch_size works
`batch_size` controls how many environments run in parallel within a single `VectorEnv`:
| `batch_size` | Behavior |
| ------------- | -------------------------------------------------------------------- |
| `0` (default) | Auto-tune: `floor(cpu_cores × 0.7)`, capped by `n_episodes` and `64` |
| `1` | Single environment, synchronous. Useful for debugging |
| `N` | N environments step in parallel via `AsyncVectorEnv` |
When `batch_size > 1` and `use_async_envs=true`, each environment runs in its own subprocess via Gymnasium's `AsyncVectorEnv`. This parallelizes the simulation stepping (the main bottleneck), while the policy runs a single batched forward pass on GPU.
**Example:** On a 16-core machine with `n_episodes=100`:
- Auto batch_size = `floor(16 × 0.7)` = `11`
- 11 environments step simultaneously → ~11× faster than sequential
## Performance
### AsyncVectorEnv (default)
`AsyncVectorEnv` spawns one subprocess per environment. Each subprocess has its own simulator instance. While the policy computes actions on GPU, all environments step in parallel on CPU:
```
GPU: [inference]....[inference]....[inference]....
CPU: [step × N]....................[step × N]......
↑ parallel ↑ parallel
```
For GPU-based simulators (LIBERO, Meta-World), the environments use **lazy initialization**: the GPU/EGL context is created inside the worker subprocess on first `reset()`, not in the parent process. This avoids `EGL_BAD_CONTEXT` crashes from inheriting stale GPU handles across `fork()`.
### Lazy task loading
For multi-task benchmarks (e.g. LIBERO with 10 tasks), environments are wrapped in `_LazyAsyncVectorEnv` which defers worker creation until the task is actually evaluated. This keeps peak process count = `batch_size` instead of `n_tasks × batch_size`. After each task completes, workers are closed to free resources.
### Tuning for speed
| Situation | Recommendation |
| ------------------------------ | ----------------------------------------------------- |
| Slow eval, low GPU utilization | Increase `batch_size` (or leave at auto) |
| Out of memory (system RAM) | Decrease `batch_size` |
| Out of GPU memory | Decrease `batch_size`, or use `--policy.use_amp=true` |
| Debugging / single-stepping | `--eval.batch_size=1 --eval.use_async_envs=false` |
## Output
Results are written to `output_dir` (default: `outputs/eval/<date>/<time>_<job_name>/`):
- `eval_info.json` — full metrics: per-episode, per-task, per-group, and overall aggregates
- `videos/` — episode recordings (when `--eval.n_episodes_to_render > 0`)
### Metrics
| Metric | Description |
| ---------------- | -------------------------------------------------------------------- |
| `pc_success` | Success rate (%). Based on `info["is_success"]` from the environment |
| `avg_sum_reward` | Mean cumulative reward per episode |
| `avg_max_reward` | Mean peak reward per episode |
| `n_episodes` | Total episodes evaluated |
| `eval_s` | Total wall-clock time |
| `eval_ep_s` | Mean wall-clock time per episode |
## Multi-task evaluation
For benchmarks with multiple tasks (LIBERO suites, Meta-World MT50), `lerobot-eval` automatically:
1. Creates environments for all tasks in the selected suite(s)
2. Evaluates each task sequentially (one task's workers at a time)
3. Aggregates metrics per-task, per-group (suite), and overall
```bash
# Evaluate all 10 tasks in libero_spatial
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--env.type=libero \
--env.task=libero_spatial \
--eval.n_episodes=10
# Evaluate multiple suites
lerobot-eval \
--policy.path=pepijn223/smolvla_libero \
--env.type=libero \
--env.task="libero_spatial,libero_object" \
--eval.n_episodes=10
```
## API usage
You can call the eval functions directly from Python:
```python
from lerobot.envs.factory import make_env
from lerobot.policies.factory import make_policy
from lerobot.scripts.lerobot_eval import eval_policy
envs = make_env(env_cfg, n_envs=10)
policy = make_policy(cfg=policy_cfg, env_cfg=env_cfg)
metrics = eval_policy(
env=envs["libero_spatial"][0],
policy=policy,
n_episodes=10,
)
print(metrics["pc_success"])
```
-89
@@ -1,89 +0,0 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Extract natural-language task descriptions for a benchmark suite.
Runs inside the benchmark Docker container (where the env library is installed)
immediately after lerobot-eval, writing a JSON file that parse_eval_metrics.py
picks up and embeds in metrics.json.
Output format: {"<suite>_<task_idx>": "<nl instruction>", ...}
Usage:
python scripts/ci/extract_task_descriptions.py \\
--env libero --task libero_spatial \\
--output /tmp/eval-artifacts/task_descriptions.json
"""
from __future__ import annotations
import argparse
import json
import sys
from pathlib import Path
def _libero_descriptions(task_suite: str) -> dict[str, str]:
from libero.libero import benchmark # type: ignore[import-untyped]
suite_dict = benchmark.get_benchmark_dict()
if task_suite not in suite_dict:
print(
f"[extract_task_descriptions] Unknown LIBERO suite '{task_suite}'. "
f"Available: {list(suite_dict.keys())}",
file=sys.stderr,
)
return {}
suite = suite_dict[task_suite]()
return {f"{task_suite}_{i}": suite.get_task(i).language for i in range(suite.n_tasks)}
def _metaworld_descriptions(task_name: str) -> dict[str, str]:
# MetaWorld tasks don't expose a separate NL description attribute;
# use a cleaned version of the task name as the description.
label = task_name.removeprefix("metaworld-").replace("-", " ").strip()
return {f"{task_name}_0": label}
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument("--env", required=True, help="Environment family (libero, metaworld, ...)")
parser.add_argument("--task", required=True, help="Task/suite name (e.g. libero_spatial)")
parser.add_argument("--output", required=True, help="Path to write task_descriptions.json")
args = parser.parse_args()
descriptions: dict[str, str] = {}
try:
if args.env == "libero":
descriptions = _libero_descriptions(args.task)
elif args.env == "metaworld":
descriptions = _metaworld_descriptions(args.task)
else:
print(
f"[extract_task_descriptions] No description extractor for env '{args.env}'.",
file=sys.stderr,
)
except Exception as exc:
print(f"[extract_task_descriptions] Warning: {exc}", file=sys.stderr)
out_path = Path(args.output)
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(descriptions, indent=2))
print(f"[extract_task_descriptions] {len(descriptions)} descriptions → {out_path}")
return 0
if __name__ == "__main__":
sys.exit(main())
-129
@@ -1,129 +0,0 @@
#!/usr/bin/env python3
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Parse lerobot-eval output into a small metrics.json artifact.
Reads eval_info.json written by lerobot-eval --output_dir and extracts the
key metrics needed by the health dashboard. Handles both single-task and
multi-task eval output formats.
Usage:
python scripts/ci/parse_eval_metrics.py \\
--artifacts-dir /tmp/libero-artifacts \\
--env libero \\
--task libero_spatial \\
--policy pepijn223/smolvla_libero
Writes <artifacts-dir>/metrics.json. The CI workflow then uploads this file
as a GitHub Actions artifact named "<env>-metrics".
"""
from __future__ import annotations
import argparse
import json
import math
import sys
from pathlib import Path
def _extract_metrics(info: dict) -> tuple[float | None, int | None, float | None, float | None]:
"""Extract (pc_success, n_episodes, avg_sum_reward, eval_s) from eval_info.json.
Handles two output shapes:
- Single-task: {"aggregated": {"pc_success": 80.0, ...}}
- Multi-task: {"overall": {"pc_success": 80.0, "n_episodes": 5, ...}}
"""
for key in ("aggregated", "overall"):
if key not in info:
continue
agg = info[key]
pc = agg.get("pc_success")
n = agg.get("n_episodes")
reward = agg.get("avg_sum_reward")
eval_s = agg.get("eval_s")
if pc is not None and not math.isnan(pc):
return (
float(pc),
int(n) if n is not None else None,
float(reward) if reward is not None else None,
float(eval_s) if eval_s is not None else None,
)
return None, None, None, None
def main() -> int:
parser = argparse.ArgumentParser(
description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
)
parser.add_argument("--artifacts-dir", required=True, help="Path to the mounted artifacts volume")
parser.add_argument("--env", required=True, help="Environment name (e.g. libero)")
parser.add_argument("--task", required=True, help="Task name (e.g. libero_spatial)")
parser.add_argument("--policy", required=True, help="Policy hub path (e.g. pepijn223/smolvla_libero)")
args = parser.parse_args()
artifacts_dir = Path(args.artifacts_dir)
eval_info_path = artifacts_dir / "eval_info.json"
pc_success: float | None = None
n_episodes: int | None = None
avg_sum_reward: float | None = None
eval_s: float | None = None
if eval_info_path.exists():
try:
info = json.loads(eval_info_path.read_text())
pc_success, n_episodes, avg_sum_reward, eval_s = _extract_metrics(info)
except (json.JSONDecodeError, KeyError, TypeError) as exc:
print(f"[parse_eval_metrics] Warning: could not parse eval_info.json: {exc}", file=sys.stderr)
else:
print(
f"[parse_eval_metrics] Warning: {eval_info_path} not found — eval may have failed.",
file=sys.stderr,
)
task_descriptions: dict[str, str] = {}
task_desc_path = artifacts_dir / "task_descriptions.json"
if task_desc_path.exists():
try:
task_descriptions = json.loads(task_desc_path.read_text())
except json.JSONDecodeError as exc:
print(
f"[parse_eval_metrics] Warning: could not parse task_descriptions.json: {exc}",
file=sys.stderr,
)
metrics = {
"env": args.env,
"task": args.task,
"policy": args.policy,
"pc_success": pc_success,
"n_episodes": n_episodes,
"avg_sum_reward": avg_sum_reward,
"eval_s": eval_s,
"task_descriptions": task_descriptions,
}
out_path = artifacts_dir / "metrics.json"
out_path.write_text(json.dumps(metrics, indent=2))
print(f"[parse_eval_metrics] Written: {out_path}")
print(json.dumps(metrics, indent=2))
    return 0


if __name__ == "__main__":
sys.exit(main())
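For reference, a minimal sketch of the metrics.json artifact this script writes and how a later dashboard step might read it back; the field values below are illustrative, not real eval results, and the reading code is an assumption about the consumer rather than part of the repo:

# Illustrative only: reading the metrics.json written above in a downstream step.
# All numeric values here are made up, not real eval results.
import json
from pathlib import Path

metrics = json.loads(Path("/tmp/libero-artifacts/metrics.json").read_text())
# {
#   "env": "libero",
#   "task": "libero_spatial",
#   "policy": "pepijn223/smolvla_libero",
#   "pc_success": 80.0,
#   "n_episodes": 5,
#   "avg_sum_reward": 3.2,
#   "eval_s": 412.7,
#   "task_descriptions": {}
# }
print(f"{metrics['policy']} on {metrics['task']}: {metrics['pc_success']}% success")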
+10 -17
@@ -65,27 +65,20 @@ class WandBConfig:
class EvalConfig:
n_episodes: int = 50
# `batch_size` specifies the number of environments to use in a gym.vector.VectorEnv.
# Set to 0 for auto-tuning based on available CPU cores and n_episodes.
batch_size: int = 0
batch_size: int = 50
# `use_async_envs` specifies whether to use asynchronous environments (multiprocessing).
# Defaults to True; automatically downgraded to SyncVectorEnv when batch_size=1.
use_async_envs: bool = True
use_async_envs: bool = False
def __post_init__(self) -> None:
if self.batch_size == 0:
self.batch_size = self._auto_batch_size()
if self.batch_size > self.n_episodes:
self.batch_size = self.n_episodes
def _auto_batch_size(self) -> int:
"""Pick batch_size based on CPU cores, capped by n_episodes."""
import math
import os
cpu_cores = os.cpu_count() or 4
# Each async env worker needs ~1 core; leave headroom for main process + inference.
by_cpu = max(1, math.floor(cpu_cores * 0.7))
return min(by_cpu, self.n_episodes, 64)
raise ValueError(
"The eval batch size is greater than the number of eval episodes "
f"({self.batch_size} > {self.n_episodes}). As a result, {self.batch_size} "
f"eval environments will be instantiated, but only {self.n_episodes} will be used. "
"This might significantly slow down evaluation. To fix this, you should update your command "
f"to increase the number of episodes to match the batch size (e.g. `eval.n_episodes={self.batch_size}`), "
f"or lower the batch size (e.g. `eval.batch_size={self.n_episodes}`)."
)
@dataclass
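As a quick illustration of the __post_init__ guard in the ValueError branch of this hunk, here is a trimmed, hypothetical stand-in (not the full EvalConfig) showing when the check trips:

from dataclasses import dataclass


@dataclass
class _EvalConfigSketch:  # trimmed stand-in for the EvalConfig shown above
    n_episodes: int = 50
    batch_size: int = 50

    def __post_init__(self) -> None:
        if self.batch_size > self.n_episodes:
            raise ValueError(
                f"The eval batch size is greater than the number of eval episodes "
                f"({self.batch_size} > {self.n_episodes})."
            )


_EvalConfigSketch(n_episodes=50, batch_size=10)  # fine: fewer envs than episodes
try:
    _EvalConfigSketch(n_episodes=5, batch_size=50)
except ValueError as exc:
    print(exc)  # more envs would be instantiated than episodes used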
+7 -20
@@ -44,13 +44,6 @@ from lerobot.utils.constants import (
)
def _make_vec_env_cls(use_async: bool, n_envs: int):
"""Return the right VectorEnv constructor."""
if use_async and n_envs > 1:
return gym.vector.AsyncVectorEnv
return gym.vector.SyncVectorEnv
@dataclass
class EnvConfig(draccus.ChoiceRegistry, abc.ABC):
task: str | None = None
@@ -82,14 +75,13 @@ class EnvConfig(draccus.ChoiceRegistry, abc.ABC):
def create_envs(
self,
n_envs: int,
use_async_envs: bool = True,
use_async_envs: bool = False,
) -> dict[str, dict[int, gym.vector.VectorEnv]]:
"""Create {suite: {task_id: VectorEnv}}.
Default: single-task env via gym.make(). Multi-task benchmarks override.
AsyncVectorEnv is the default for n_envs > 1; auto-downgraded to Sync for n_envs=1.
"""
env_cls = gym.vector.AsyncVectorEnv if (use_async_envs and n_envs > 1) else gym.vector.SyncVectorEnv
env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
if self.gym_id not in gym_registry:
print(f"gym id '{self.gym_id}' not found, attempting to import '{self.package_name}'...")
@@ -402,22 +394,17 @@ class LiberoEnv(EnvConfig):
@property
def gym_kwargs(self) -> dict:
kwargs: dict[str, Any] = {
"obs_type": self.obs_type,
"render_mode": self.render_mode,
"observation_height": self.observation_height,
"observation_width": self.observation_width,
}
kwargs: dict[str, Any] = {"obs_type": self.obs_type, "render_mode": self.render_mode}
if self.task_ids is not None:
kwargs["task_ids"] = self.task_ids
return kwargs
def create_envs(self, n_envs: int, use_async_envs: bool = True):
def create_envs(self, n_envs: int, use_async_envs: bool = False):
from lerobot.envs.libero import create_libero_envs
if self.task is None:
raise ValueError("LiberoEnv requires a task to be specified")
env_cls = _make_vec_env_cls(use_async_envs, n_envs)
env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
return create_libero_envs(
task=self.task,
n_envs=n_envs,
@@ -481,12 +468,12 @@ class MetaworldEnv(EnvConfig):
"render_mode": self.render_mode,
}
def create_envs(self, n_envs: int, use_async_envs: bool = True):
def create_envs(self, n_envs: int, use_async_envs: bool = False):
from lerobot.envs.metaworld import create_metaworld_envs
if self.task is None:
raise ValueError("MetaWorld requires a task to be specified")
env_cls = _make_vec_env_cls(use_async_envs, n_envs)
env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
return create_metaworld_envs(
task=self.task,
n_envs=n_envs,
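The hunks above move benchmark dispatch onto EnvConfig subclasses, so a new benchmark only registers a config and overrides create_envs. A minimal sketch of what such a subclass might look like; the choice name, class name, gym id, and the lerobot.envs.configs import path are assumptions, and the real base class may have further abstract members to fill in:

from dataclasses import dataclass
from typing import Any

import gymnasium as gym

from lerobot.envs.configs import EnvConfig  # assumed import path for the base class above


@EnvConfig.register_subclass("mybench")  # draccus ChoiceRegistry registration, hypothetical name
@dataclass
class MyBenchEnv(EnvConfig):
    task: str | None = "mybench-pick-cube"

    @property
    def gym_kwargs(self) -> dict[str, Any]:
        return {"render_mode": "rgb_array"}

    def create_envs(self, n_envs: int, use_async_envs: bool = False):
        env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
        fns = [lambda: gym.make("MyBench/PickCube-v0") for _ in range(n_envs)]  # hypothetical gym id
        return {"mybench": {0: env_cls(fns)}}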
+1 -1
@@ -58,7 +58,7 @@ def make_env_pre_post_processors(
def make_env(
cfg: EnvConfig | str,
n_envs: int = 1,
use_async_envs: bool = True,
use_async_envs: bool = False,
hub_cache_dir: str | None = None,
trust_remote_code: bool = False,
) -> dict[str, dict[int, gym.vector.VectorEnv]]:
+19 -51
@@ -29,7 +29,6 @@ from gymnasium import spaces
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv
from lerobot.envs.utils import _LazyAsyncVectorEnv
from lerobot.types import RobotObservation
@@ -151,17 +150,7 @@ class LiberoEnv(gym.Env):
self.init_state_id = self.episode_index # tie each sub-env to a fixed init state
# Extract task metadata without allocating GPU resources (safe before fork).
task = task_suite.get_task(task_id)
self.task = task.name
self.task_description = task.language
self._task_bddl_file = os.path.join(
get_libero_path("bddl_files"), task.problem_folder, task.bddl_file
)
self._env: OffScreenRenderEnv | None = (
None # deferred — created on first reset() inside the worker subprocess
)
self._env = self._make_envs_task(task_suite, self.task_id)
default_steps = 500
self._max_episode_steps = (
TASK_SUITE_MAX_STEPS.get(task_suite_name, default_steps)
@@ -232,33 +221,29 @@ class LiberoEnv(gym.Env):
low=ACTION_LOW, high=ACTION_HIGH, shape=(ACTION_DIM,), dtype=np.float32
)
def _ensure_env(self) -> None:
"""Create the underlying OffScreenRenderEnv on first use.
Called inside the worker subprocess after fork(), so each worker gets
its own clean EGL context rather than inheriting a stale one from the
parent process (which causes EGL_BAD_CONTEXT crashes with AsyncVectorEnv).
"""
if self._env is not None:
return
env = OffScreenRenderEnv(
bddl_file_name=self._task_bddl_file,
camera_heights=self.observation_height,
camera_widths=self.observation_width,
)
env.reset()
self._env = env
def render(self):
self._ensure_env()
raw_obs = self._env.env._get_observations()
pixels = self._format_raw_obs(raw_obs)["pixels"]
image = next(iter(pixels.values()))
image = image[::-1, ::-1] # flip both H and W for visualization
return image
def _make_envs_task(self, task_suite: Any, task_id: int = 0):
task = task_suite.get_task(task_id)
self.task = task.name
self.task_description = task.language
task_bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
env_args = {
"bddl_file_name": task_bddl_file,
"camera_heights": self.observation_height,
"camera_widths": self.observation_width,
}
env = OffScreenRenderEnv(**env_args)
env.reset()
return env
def _format_raw_obs(self, raw_obs: RobotObservation) -> RobotObservation:
assert self._env is not None, "_format_raw_obs called before _ensure_env()"
images = {}
for camera_name in self.camera_name:
image = raw_obs[camera_name]
@@ -310,7 +295,6 @@ class LiberoEnv(gym.Env):
)
def reset(self, seed=None, **kwargs):
self._ensure_env()
super().reset(seed=seed)
self._env.seed(seed)
raw_obs = self._env.reset()
@@ -337,8 +321,6 @@ class LiberoEnv(gym.Env):
return observation, info
def step(self, action: np.ndarray) -> tuple[RobotObservation, float, bool, bool, dict[str, Any]]:
self._ensure_env()
assert self._env is not None
if action.ndim != 1:
raise ValueError(
f"Expected action to be 1-D (shape (action_dim,)), "
@@ -363,8 +345,7 @@ class LiberoEnv(gym.Env):
return observation, reward, terminated, truncated, info
def close(self):
if self._env is not None:
self._env.close()
self._env.close()
def _make_env_fns(
@@ -447,8 +428,6 @@ def create_libero_envs(
if task_ids_filter is not None:
print(f"Restricting to task_ids={task_ids_filter}")
is_async = env_cls is gym.vector.AsyncVectorEnv
out: dict[str, dict[int, Any]] = defaultdict(dict)
for suite_name in suite_names:
suite = _get_suite(suite_name)
@@ -457,11 +436,6 @@ def create_libero_envs(
if not selected:
raise ValueError(f"No tasks selected for suite '{suite_name}' (available: {total}).")
# All tasks in a suite share identical observation/action spaces.
# Probe once and reuse to avoid creating a temp env per task.
cached_obs_space: spaces.Space | None = None
cached_act_space: spaces.Space | None = None
for tid in selected:
fns = _make_env_fns(
suite=suite,
@@ -475,14 +449,8 @@ def create_libero_envs(
control_mode=control_mode,
camera_name_mapping=camera_name_mapping,
)
if is_async:
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
out[suite_name][tid] = lazy
else:
out[suite_name][tid] = env_cls(fns)
out[suite_name][tid] = env_cls(fns)
print(f"Built vec env | suite={suite_name} | task_id={tid} | n_envs={n_envs}")
# return plain dicts for predictability
return {suite: dict(task_map) for suite, task_map in out.items()}
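The _ensure_env docstring in the hunk above explains why the underlying renderer is created only on first use inside the worker subprocess. A generic sketch of that deferred-creation pattern, with purely illustrative class names rather than the actual LIBERO types:

class _FakeRenderer:  # stand-in for an OffScreenRenderEnv-like object
    def reset(self):
        return "obs"


class LazyRendererEnv:
    def __init__(self, make_renderer):
        self._make_renderer = make_renderer
        self._renderer = None  # deferred: nothing heavy is allocated in the parent process

    def _ensure(self):
        # First call happens inside the worker after fork(), so the GL/EGL context
        # is created in, and owned by, that worker process.
        if self._renderer is None:
            self._renderer = self._make_renderer()

    def reset(self):
        self._ensure()
        return self._renderer.reset()


env = LazyRendererEnv(make_renderer=_FakeRenderer)
print(env.reset())  # the renderer is only constructed here, on first use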
+18 -38
@@ -25,7 +25,6 @@ import metaworld.policies as policies
import numpy as np
from gymnasium import spaces
from lerobot.envs.utils import _LazyAsyncVectorEnv
from lerobot.types import RobotObservation
# ---- Load configuration data from the external JSON file ----
@@ -98,9 +97,8 @@ class MetaworldEnv(gym.Env):
self.visualization_height = visualization_height
self.camera_name = camera_name
self._env_name = self.task # already stripped of "metaworld-" prefix above
self._env = None # deferred — created on first reset() inside the worker subprocess
self._max_episode_steps = 500 # MT1 environments always have max_path_length=500
self._env = self._make_envs_task(self.task)
self._max_episode_steps = self._env.max_path_length
self.task_description = TASK_DESCRIPTIONS[self.task]
self.expert_policy = TASK_POLICY_MAPPING[self.task]()
@@ -138,24 +136,6 @@ class MetaworldEnv(gym.Env):
self.action_space = spaces.Box(low=-1, high=1, shape=(ACTION_DIM,), dtype=np.float32)
def _ensure_env(self) -> None:
"""Create the underlying MetaWorld env on first use.
Called inside the worker subprocess after fork(), so each worker gets
its own clean rendering context rather than inheriting a stale one from
the parent process (which causes crashes with AsyncVectorEnv).
"""
if self._env is not None:
return
mt1 = metaworld.MT1(self._env_name, seed=42)
env = mt1.train_classes[self._env_name](render_mode="rgb_array", camera_name=self.camera_name)
env.set_task(mt1.train_tasks[0])
if self.camera_name == "corner2":
env.model.cam_pos[2] = [0.75, 0.075, 0.7]
env.reset()
env._freeze_rand_vec = False # otherwise no randomization
self._env = env
def render(self) -> np.ndarray:
"""
Render the current environment frame.
@@ -163,13 +143,26 @@ class MetaworldEnv(gym.Env):
Returns:
np.ndarray: The rendered RGB image from the environment.
"""
self._ensure_env()
image = self._env.render()
if self.camera_name == "corner2":
# Images from this camera are flipped — correct them
image = np.flip(image, (0, 1))
return image
def _make_envs_task(self, env_name: str):
mt1 = metaworld.MT1(env_name, seed=42)
env = mt1.train_classes[env_name](render_mode="rgb_array", camera_name=self.camera_name)
env.set_task(mt1.train_tasks[0])
if self.camera_name == "corner2":
env.model.cam_pos[2] = [
0.75,
0.075,
0.7,
] # corner2 position, similar to https://arxiv.org/pdf/2206.14244
env.reset()
env._freeze_rand_vec = False # otherwise no randomization
return env
def _format_raw_obs(self, raw_obs: np.ndarray) -> RobotObservation:
image = None
if self._env is not None:
@@ -216,7 +209,6 @@ class MetaworldEnv(gym.Env):
observation (RobotObservation): The initial formatted observation.
info (Dict[str, Any]): Additional info about the reset state.
"""
self._ensure_env()
super().reset(seed=seed)
raw_obs, info = self._env.reset(seed=seed)
@@ -240,7 +232,6 @@ class MetaworldEnv(gym.Env):
truncated (bool): Whether the episode was truncated due to a time limit.
info (Dict[str, Any]): Additional environment info.
"""
self._ensure_env()
if action.ndim != 1:
raise ValueError(
f"Expected action to be 1-D (shape (action_dim,)), "
@@ -272,8 +263,7 @@ class MetaworldEnv(gym.Env):
return observation, reward, terminated, truncated, info
def close(self):
if self._env is not None:
self._env.close()
self._env.close()
# ---- Main API ----------------------------------------------------------------
@@ -307,9 +297,6 @@ def create_metaworld_envs(
print(f"Creating Meta-World envs | task_groups={task_groups} | n_envs(per task)={n_envs}")
is_async = env_cls is gym.vector.AsyncVectorEnv
cached_obs_space = None
cached_act_space = None
out: dict[str, dict[int, Any]] = defaultdict(dict)
for group in task_groups:
@@ -322,14 +309,7 @@ def create_metaworld_envs(
# build n_envs factories
fns = [(lambda tn=task_name: MetaworldEnv(task=tn, **gym_kwargs)) for _ in range(n_envs)]
if is_async:
lazy = _LazyAsyncVectorEnv(fns, cached_obs_space, cached_act_space)
if cached_obs_space is None:
cached_obs_space = lazy.observation_space
cached_act_space = lazy.action_space
out[group][tid] = lazy
else:
out[group][tid] = env_cls(fns)
out[group][tid] = env_cls(fns)
# return a plain dict for consistency
return {group: dict(task_map) for group, task_map in out.items()}
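One detail worth noting in the factory list above is the default argument in lambda tn=task_name: it binds the task name when the lambda is defined instead of when it is called. A small standalone illustration of the difference (generic Python, not Meta-World code):

# Late binding: every closure sees the final value of `name`.
names = ["reach", "push", "pick-place"]
late = [lambda: name for name in names]
print([f() for f in late])    # ['pick-place', 'pick-place', 'pick-place']

# Default argument: each lambda captures the value current at definition time.
bound = [lambda name=name: name for name in names]
print([f() for f in bound])   # ['reach', 'push', 'pick-place']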
+27 -70
@@ -16,7 +16,7 @@
import importlib.util
import os
import warnings
from collections.abc import Callable, Mapping, Sequence
from collections.abc import Mapping, Sequence
from functools import singledispatch
from typing import Any
@@ -130,99 +130,56 @@ def env_to_policy_features(env_cfg: EnvConfig) -> dict[str, PolicyFeature]:
return policy_features
def _sub_env_has_attr(env: gym.vector.VectorEnv, attr: str) -> bool:
try:
env.get_attr(attr)
return True
except (AttributeError, Exception):
return False
class _LazyAsyncVectorEnv:
"""Defers AsyncVectorEnv creation until first use.
Creating all tasks' AsyncVectorEnvs upfront spawns N_tasks × n_envs worker
processes, all of which allocate EGL/GPU resources immediately. Since tasks
are evaluated sequentially, only one task's workers need to be alive at a
time. This wrapper stores the factory functions and creates the real
AsyncVectorEnv on first reset()/step()/call(), keeping peak process count = n_envs.
"""
def __init__(
self,
env_fns: list[Callable],
observation_space=None,
action_space=None,
):
self._env_fns = env_fns
self._env: gym.vector.AsyncVectorEnv | None = None
self.num_envs = len(env_fns)
if observation_space is not None and action_space is not None:
self.observation_space = observation_space
self.action_space = action_space
else:
tmp = env_fns[0]()
self.observation_space = tmp.observation_space
self.action_space = tmp.action_space
tmp.close()
self.single_observation_space = self.observation_space
self.single_action_space = self.action_space
def _ensure(self) -> None:
if self._env is None:
self._env = gym.vector.AsyncVectorEnv(self._env_fns, context="forkserver", shared_memory=True)
def reset(self, **kwargs):
self._ensure()
return self._env.reset(**kwargs)
def step(self, actions):
self._ensure()
return self._env.step(actions)
def call(self, name, *args, **kwargs):
self._ensure()
return self._env.call(name, *args, **kwargs)
def get_attr(self, name):
self._ensure()
return self._env.get_attr(name)
def close(self) -> None:
if self._env is not None:
self._env.close()
self._env = None
def are_all_envs_same_type(env: gym.vector.VectorEnv) -> bool:
first_type = type(env.envs[0]) # Get type of first env
return all(type(e) is first_type for e in env.envs) # Fast type check
def check_env_attributes_and_types(env: gym.vector.VectorEnv) -> None:
with warnings.catch_warnings():
warnings.simplefilter("once", UserWarning)
warnings.simplefilter("once", UserWarning) # Apply filter only in this function
if not (_sub_env_has_attr(env, "task_description") and _sub_env_has_attr(env, "task")):
if not (hasattr(env.envs[0], "task_description") and hasattr(env.envs[0], "task")):
warnings.warn(
"The environment does not have 'task_description' and 'task'. Some policies require these features.",
UserWarning,
stacklevel=2,
)
if not are_all_envs_same_type(env):
warnings.warn(
"The environments have different types. Make sure you infer the right task from each environment. Empty task will be passed instead.",
UserWarning,
stacklevel=2,
)
def add_envs_task(env: gym.vector.VectorEnv, observation: RobotObservation) -> RobotObservation:
"""Adds task feature to the observation dict with respect to the first environment attribute."""
if _sub_env_has_attr(env, "task_description"):
task_result = list(env.call("task_description"))
if hasattr(env.envs[0], "task_description"):
task_result = env.call("task_description")
if isinstance(task_result, tuple):
task_result = list(task_result)
if not isinstance(task_result, list):
raise TypeError(f"Expected task_description to return a list, got {type(task_result)}")
if not all(isinstance(item, str) for item in task_result):
raise TypeError("All items in task_description result must be strings")
observation["task"] = task_result
elif _sub_env_has_attr(env, "task"):
task_result = list(env.call("task"))
elif hasattr(env.envs[0], "task"):
task_result = env.call("task")
if isinstance(task_result, tuple):
task_result = list(task_result)
if not isinstance(task_result, list):
raise TypeError(f"Expected task to return a list, got {type(task_result)}")
if not all(isinstance(item, str) for item in task_result):
raise TypeError("All items in task result must be strings")
observation["task"] = task_result
else:
else: # For envs without language instructions, e.g. aloha transfer cube and etc.
num_envs = observation[list(observation.keys())[0]].shape[0]
observation["task"] = ["" for _ in range(num_envs)]
return observation
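A minimal sketch of what add_envs_task (above) produces for a batch of two sub-environments, assuming gymnasium's SyncVectorEnv and that the function is importable from lerobot.envs.utils as the eval hunk below does; the dummy environment is purely illustrative:

import gymnasium as gym
import numpy as np

from lerobot.envs.utils import add_envs_task


class _DummyEnv(gym.Env):
    observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)
    action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
    task = "dummy-task"
    task_description = "pick up the cube"

    def reset(self, seed=None, options=None):
        return np.zeros(3, dtype=np.float32), {}

    def step(self, action):
        return np.zeros(3, dtype=np.float32), 0.0, False, False, {}


env = gym.vector.SyncVectorEnv([_DummyEnv for _ in range(2)])
observation = {"observation.state": np.zeros((2, 3), dtype=np.float32)}
observation = add_envs_task(env, observation)
print(observation["task"])  # expected: ['pick up the cube', 'pick up the cube']
env.close()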
+2 -2
@@ -136,8 +136,8 @@ class TokenizerProcessorStep(ObservationProcessorStep):
# Standardize to a list of strings for the tokenizer
if isinstance(task, str):
return [task]
elif isinstance(task, (list, tuple)) and all(isinstance(t, str) for t in task):
return list(task)
elif isinstance(task, list) and all(isinstance(t, str) for t in task):
return task
return None
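For clarity, a standalone re-implementation of the normalization above, mirroring the tuple-accepting variant shown in the hunk (not the actual processor class):

def _standardize_task(task):
    # Single string -> one-element list; sequence of strings -> plain list; anything else -> None.
    if isinstance(task, str):
        return [task]
    if isinstance(task, (list, tuple)) and all(isinstance(t, str) for t in task):
        return list(task)
    return None


print(_standardize_task("pick up cube"))                       # ['pick up cube']
print(_standardize_task(("pick up cube", "place on table")))   # ['pick up cube', 'place on table']
print(_standardize_task(123))                                  # None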
+17 -39
@@ -73,6 +73,7 @@ from lerobot.configs import parser
from lerobot.configs.eval import EvalPipelineConfig
from lerobot.envs.factory import make_env, make_env_pre_post_processors
from lerobot.envs.utils import (
add_envs_task,
check_env_attributes_and_types,
close_envs,
preprocess_observation,
@@ -165,15 +166,9 @@ def rollout(
if return_observations:
all_observations.append(deepcopy(observation))
# Infer "task" from sub-environments (prefer natural language description).
# env.call() works with both SyncVectorEnv and AsyncVectorEnv.
try:
observation["task"] = list(env.call("task_description"))
except Exception:
try:
observation["task"] = list(env.call("task"))
except Exception:
observation["task"] = [""] * env.num_envs
# Infer "task" from attributes of environments.
# TODO: works with SyncVectorEnv but not AsyncVectorEnv
observation = add_envs_task(env, observation)
# Apply environment-specific preprocessing (e.g., LiberoProcessorStep for LIBERO)
observation = env_preprocessor(observation)
@@ -323,9 +318,8 @@ def eval_policy(
n_to_render_now = min(max_episodes_rendered - n_episodes_rendered, env.num_envs)
if isinstance(env, gym.vector.SyncVectorEnv):
ep_frames.append(np.stack([env.envs[i].render() for i in range(n_to_render_now)])) # noqa: B023
elif hasattr(env, "call"):
elif isinstance(env, gym.vector.AsyncVectorEnv):
# Here we must render all frames and discard any we don't need.
# Covers AsyncVectorEnv and _LazyAsyncVectorEnv (which wraps one).
ep_frames.append(np.stack(env.call("render")[:n_to_render_now]))
if max_episodes_rendered > 0:
@@ -527,7 +521,7 @@ def eval_main(cfg: EvalPipelineConfig):
logging.info(colored("Output dir:", "yellow", attrs=["bold"]) + f" {cfg.output_dir}")
logging.info(f"Making environment (batch_size={cfg.eval.batch_size}, async={cfg.eval.use_async_envs}).")
logging.info("Making environment.")
envs = make_env(
cfg.env,
n_envs=cfg.eval.batch_size,
@@ -761,39 +755,23 @@ def eval_policy_all(
)
if max_parallel_tasks <= 1:
prefetch_thread: threading.Thread | None = None
for i, (task_group, task_id, env) in enumerate(tasks):
if prefetch_thread is not None:
prefetch_thread.join()
prefetch_thread = None
try:
tg, tid, metrics = task_runner(task_group, task_id, env)
_accumulate_to(tg, metrics)
per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
finally:
env.close()
# Prefetch next task's workers *after* closing current env to prevent
# GPU memory overlap between consecutive tasks.
if i + 1 < len(tasks):
next_env = tasks[i + 1][2]
if hasattr(next_env, "_ensure"):
prefetch_thread = threading.Thread(target=next_env._ensure, daemon=True)
prefetch_thread.start()
# sequential path (single accumulator path on the main thread)
# NOTE: keeping a single-threaded accumulator avoids concurrent list appends or locks
for task_group, task_id, env in tasks:
tg, tid, metrics = task_runner(task_group, task_id, env)
_accumulate_to(tg, metrics)
per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
else:
# threaded path: submit all tasks, consume completions on main thread and accumulate there
with cf.ThreadPoolExecutor(max_workers=max_parallel_tasks) as executor:
fut2meta = {}
for task_group, task_id, env in tasks:
fut = executor.submit(task_runner, task_group, task_id, env)
fut2meta[fut] = (task_group, task_id, env)
fut2meta[fut] = (task_group, task_id)
for fut in cf.as_completed(fut2meta):
tg, tid, env = fut2meta[fut]
try:
tg, tid, metrics = fut.result()
_accumulate_to(tg, metrics)
per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
finally:
env.close()
tg, tid, metrics = fut.result()
_accumulate_to(tg, metrics)
per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})
# compute aggregated metrics helper (robust to lists/scalars)
def _agg_from_list(xs):
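A generic sketch of the threaded path above: tasks are submitted to a ThreadPoolExecutor, but results are consumed with as_completed() on the main thread so only one thread touches the accumulators. The task_runner and task list below are stand-ins, not the real eval objects:

import concurrent.futures as cf


def task_runner(task_group: str, task_id: int):
    # Stand-in for the real per-task eval; returns fabricated metrics.
    return task_group, task_id, {"pc_success": 100.0 * task_id / 10}


tasks = [("libero_spatial", tid) for tid in range(4)]
per_task_infos = []

with cf.ThreadPoolExecutor(max_workers=2) as executor:
    fut2meta = {executor.submit(task_runner, tg, tid): (tg, tid) for tg, tid in tasks}
    for fut in cf.as_completed(fut2meta):
        tg, tid, metrics = fut.result()
        # Only the main thread appends, so no locks are needed around the list.
        per_task_infos.append({"task_group": tg, "task_id": tid, "metrics": metrics})

print(per_task_infos)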
+1 -1
@@ -90,7 +90,7 @@ def test_base_create_envs():
envs = _Env().create_envs(n_envs=2)
assert "_dispatch_base_test" in envs
env = envs["_dispatch_base_test"][0]
assert isinstance(env, gym.vector.VectorEnv)
assert isinstance(env, gym.vector.SyncVectorEnv)
assert env.num_envs == 2
env.close()
finally:
@@ -189,30 +189,6 @@ def test_list_of_strings_tokenization(mock_auto_tokenizer):
assert attention_mask.shape == (2, 8)
@require_package("transformers")
@patch("lerobot.processor.tokenizer_processor.AutoTokenizer")
def test_tuple_of_strings_tokenization(mock_auto_tokenizer):
"""Test tokenization of a tuple of strings (returned by VectorEnv.call())."""
mock_tokenizer = MockTokenizer(vocab_size=100)
mock_auto_tokenizer.from_pretrained.return_value = mock_tokenizer
processor = TokenizerProcessorStep(tokenizer_name="test-tokenizer", max_length=8)
transition = create_transition(
observation={"state": torch.tensor([1.0, 2.0])},
action=torch.tensor([0.1, 0.2]),
complementary_data={"task": ("pick up cube", "place on table")},
)
result = processor(transition)
observation = result[TransitionKey.OBSERVATION]
tokens = observation[f"{OBS_LANGUAGE}.tokens"]
attention_mask = observation[f"{OBS_LANGUAGE}.attention_mask"]
assert tokens.shape == (2, 8)
assert attention_mask.shape == (2, 8)
@require_package("transformers")
@patch("lerobot.processor.tokenizer_processor.AutoTokenizer")
def test_custom_keys(mock_auto_tokenizer):