diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 1a4558f93..575c4c367 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -19,6 +19,8 @@
     title: Train RL in Simulation
   - local: async
     title: Use Async Inference
+  - local: libero
+    title: Using LIBERO
   title: "Tutorials"
 - sections:
   - local: smolvla
diff --git a/docs/source/libero.mdx b/docs/source/libero.mdx
new file mode 100644
index 000000000..8c4ba532e
--- /dev/null
+++ b/docs/source/libero.mdx
@@ -0,0 +1,230 @@
+# LIBERO
+
+**LIBERO** is a benchmark designed to study **lifelong robot learning**. The idea is that robots won’t just be pretrained once in a factory, they’ll need to keep learning and adapting with their human users over time. This ongoing adaptation is called **lifelong learning in decision making (LLDM)**, and it’s a key step toward building robots that become truly personalized helpers. The benchmark was first introduced in the [LIBERO paper](https://arxiv.org/abs/2306.03310) and the [original repository](https://github.com/Lifelong-Robot-Learning/LIBERO).
+
+To make progress on this challenge, LIBERO provides a set of standardized tasks that focus on **knowledge transfer**: how well a robot can apply what it has already learned to new situations. By evaluating on LIBERO, different algorithms can be compared fairly and researchers can build on each other’s work.
+
+LIBERO includes **five task suites**:
+
+- **LIBERO-Spatial (`libero_spatial`)** – tasks that require reasoning about spatial relations.
+- **LIBERO-Object (`libero_object`)** – tasks centered on manipulating different objects.
+- **LIBERO-Goal (`libero_goal`)** – goal-conditioned tasks where the robot must adapt to changing targets.
+- **LIBERO-90 (`libero_90`)** – 90 short-horizon tasks from the LIBERO-100 collection.
+- **LIBERO-Long (`libero_10`)** – 10 long-horizon tasks from the LIBERO-100 collection.
+
+Together, these suites cover **130 tasks**, ranging from simple object manipulations to complex multi-step scenarios. LIBERO is meant to grow over time, and to serve as a shared benchmark where the community can test and improve lifelong learning algorithms.
+
+![An overview of the LIBERO benchmark](https://libero-project.github.io/assets/img/libero/fig1.png)
+_Figure 1: An overview of the LIBERO benchmark._
+
+## Evaluating with LIBERO
+
+At **LeRobot**, we ported [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO) into our framework and used it primarily to **benchmark [SmolVLA](https://huggingface.co/docs/lerobot/en/smolvla)**, our lightweight Vision-Language-Action model, comparing it against state-of-the-art VLA models such as Pi0, OpenVLA, Octo, and Diffusion Policy.
+
+LIBERO is now part of our **multi-eval supported simulation**, allowing you to benchmark your policies either on a **single suite of tasks** or across **multiple suites at once** with just a single flag.
+
+To install LIBERO, first follow the [LeRobot Installation Guide](https://huggingface.co/docs/lerobot/installation).
+Once LeRobot is installed, there are two options:
+
+1. **Install via pip** (recommended):
+
+   ```bash
+   pip install "lerobot[libero,smolvla]"
+   ```
+
+2. **Install from source**:
+   ```bash
+   git clone https://github.com/huggingface/lerobot.git
+   cd lerobot
+   pip install -e ".[libero,smolvla]"
+   ```
+
+### Single-suite evaluation
+
+Evaluate a policy on one LIBERO suite:
+
+```bash
+python src/lerobot/scripts/eval.py \
+  --policy.path="your-policy-id" \
+  --env.type=libero \
+  --env.task=libero_object \
+  --env.multitask_eval=False \
+  --eval.batch_size=2 \
+  --eval.n_episodes=3
+```
+
+- `--env.task` picks the suite (`libero_object`, `libero_spatial`, etc.).
+- `--eval.batch_size` controls how many environments run in parallel.
+- `--eval.n_episodes` sets how many episodes to run in total.
+
+---
+
+### Multi-suite evaluation
+
+Benchmark a policy across multiple suites at once:
+
+```bash
+python src/lerobot/scripts/eval.py \
+  --policy.path="your-policy-id" \
+  --env.type=libero \
+  --env.task=libero_object \
+  --env.multitask_eval=True \
+  --eval.batch_size=1 \
+  --eval.n_episodes=2
+```
+
+- Pass a comma-separated list to `--env.task` for multi-suite evaluation.
+- Set `-env.multitask_eval=True` to enable evaluation across all tasks in those suites.
+
+### Policy inputs and outputs
+
+When using LIBERO through LeRobot, policies interact with the environment via **observations** and **actions**:
+
+- **Observations**
+  - `observation.state` – proprioceptive features (agent state).
+  - `observation.images.image` – main camera view (`agentview_image`).
+  - `observation.images.image2` – wrist camera view (`robot0_eye_in_hand_image`).
+
+  ⚠️ **Note:** LeRobot enforces the `.images.*` prefix for any visual features. Make sure your dataset metadata keys match this convention when evaluating.
+
+  ## Input Features and Metadata Alignment
+
+  To train or evaluate a policy, you use `make_policy`, which builds a feature-naming dictionary for the observations the policy expects.
+  This mapping can come from:
+  - Dataset metadata
+  - The evaluation environment
+  - The policy path (if a pretrained repo ID is provided)
+
+  ### Common Issues
+
+  A common problem is when the keys in the dataset, environment, and policy config do not match. For example:
+  - `wrist_image` vs `observation.images.image2`
+  - `observation.image2` (as in SmolVLA) vs the `.images.*` prefix convention
+
+  Such mismatches will cause `KeyError`s. This may be due to assumptions in `make_policy` or missing error handling.
+
+  ***
+
+  ### How to Check Expected Features
+  - Open your policy config (`config.json`), e.g. [example here](https://huggingface.co/jadechoghari/smolvla-libero/blob/main/config.json).
+  - Or add a breakpoint in `train.py` and inspect:
+
+    ````python
+    print(policy.config.input_features)
+    To ensure you can just check what your policy expects as `input_features`:
+
+    - Open your policy config (`config.json`), e.g. [example here](https://huggingface.co/jadechoghari/smolvla-libero/blob/main/config.json).
+    - Or add a breakpoint in `train.py` and inspect:
+      ```python
+      print(policy.config.input_features)
+    Fixing KeyErrors (Preprocessing Example)
+    ````
+
+## Fixing KeyErrors (Preprocessing Example)
+
+If your dataset columns do not follow the expected naming, you can rename them in-place before training:
+
+````python
+import pyarrow.parquet as pq
+import shutil
+
+def rename_columns(parquet_path, rename_map):
+    table = pq.read_table(parquet_path)
+    schema = table.schema
+    new_names = [rename_map.get(name, name) for name in schema.names]
+    renamed_table = table.rename_columns(new_names)
+    backup_path = parquet_path + ".bak"
+    shutil.copy(parquet_path, backup_path)
+    pq.write_table(renamed_table, parquet_path)
+    print(f"patched {parquet_path}, backup at {backup_path}")
+
+# example mapping: align dataset keys to LeRobot convention
+rename_map = {
+    "image": "observation.images.image",
+    "wrist_image": "observation.images.image2",
+}
+
+rename_columns("episode_000001.parquet", rename_map)
+
+
+
+- **Actions**
+  - Continuous control values in a `Box(-1, 1, shape=(7,))` space.
+
+We also provide a notebook for quick testing:
+Training with LIBERO
+
+## Training with LIBERO
+
+When training on LIBERO tasks, make sure your dataset parquet and metadata keys follow the LeRobot convention.
+
+The environment expects:
+
+- `observation.state` → 8-dim agent state
+- `observation.images.image` → main camera (`agentview_image`)
+- `observation.images.image2` → wrist camera (`robot0_eye_in_hand_image`)
+
+⚠️ Cleaning the dataset upfront is **cleaner and more efficient** than remapping keys inside the code. We plan to provide a script to easily preprocess such data.
+To avoid potential mismatches and `KeyError`s, we provide a **preprocessed LIBERO dataset** that is fully compatible with the current LeRobot codebase and requires no additional manipulations.
+
+- 🔗 [Preprocessed LIBERO dataset (Hugging Face LeRobot org)](https://huggingface.co/datasets/HuggingFaceVLA/libero)
+- 🔗 [Original LIBERO dataset (physical-intelligence)](https://huggingface.co/datasets/physical-intelligence/libero)
+
+The preprocessed dataset follows LeRobot naming conventions (e.g., `.images.*` prefix for visual features) and aligns with policy configs out-of-the-box.
+The original dataset is acknowledged here as the primary source.
+---
+
+### Example training command
+
+```bash
+python src/lerobot/scripts/train.py \
+  --policy.type=smolvla \
+  --policy.repo_id=${HF_USER}/libero-test \
+  --dataset.repo_id=jadechoghari/smol-libero3 \
+  --env.type=libero \
+  --env.task=libero_10 \
+  --output_dir=./outputs/ \
+  --steps=100000 \
+  --batch_size=4 \
+  --env.multitask_eval=True \
+  --eval.batch_size=1 \
+  --eval.n_episodes=1 \
+  --eval_freq=1000 \
+````
+
+---
+
+### Note on rendering
+
+LeRobot uses MuJoCo for simulation. You need to set the rendering backend before training or evaluation:
+
+- `export MUJOCO_GL=egl` → for headless servers (e.g. HPC, cloud)
+
+---
+
+## Colab Note on Parallel Evaluation
+
+When running evaluation on Colab, you may encounter warnings such as:
+
+```
+UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
+```
+
+This happens because Colab’s rendering contexts are **not thread-safe**, and `ThreadPoolExecutor(max_workers=num_workers)` can trigger segfaults or leaked semaphore warnings.
+
+**Colab Note:**
+Parallel evaluation is not supported in Colab. To avoid these issues, run sequentially or disable multitask evaluation:
+
+Run sequentially:
+
+```bash
+--env.max_parallel_tasks=1
+```
+
+Or disable multitask evaluation:
+
+```bash
+--env.multitask_eval=False
+```
+
+If you want to take advantage of **parallel evaluation**, we recommend **not using Colab**. Instead, run locally or on a proper compute environment where multi-threaded rendering is easily supported.
diff --git a/pyproject.toml b/pyproject.toml
index d16d00776..bb5aadc8e 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -135,7 +135,26 @@ video_benchmark = ["scikit-image>=0.23.2", "pandas>=2.2.2"]
 aloha = ["gym-aloha>=0.1.1"]
 pusht = ["gym-pusht>=0.1.5", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk version in gym-pusht instead
 xarm = ["gym-xarm>=0.1.1"]
-
+libero = [
+    "hydra-core>=1.2,<1.4",
+    "numpy",
+    "wandb",
+    "easydict",
+    "transformers",
+    "opencv-python",
+    "robomimic==0.2.0",
+    "einops",
+    "thop",
+    "robosuite==1.4.0",
+    "mujoco>=2.3.7,<3.0.0",
+    "bddl==1.0.1",
+    "matplotlib",
+    "cloudpickle",
+    "future",
+    "gym",
+    "egl_probe @ git+https://github.com/jadechoghari/egl_probe.git#egg=egl_probe",
+    "libero @ git+https://github.com/jadechoghari/LIBERO.git@main#egg=libero",
+]
 # All
 all = [
     "lerobot[dynamixel]",
@@ -154,7 +173,8 @@ all = [
     "lerobot[video_benchmark]",
     "lerobot[aloha]",
     "lerobot[pusht]",
-    "lerobot[xarm]"
+    "lerobot[xarm]",
+    "lerobot[libero]"
 ]
 
 [project.scripts]
diff --git a/src/lerobot/envs/configs.py b/src/lerobot/envs/configs.py
index 35797c6ed..5c648de70 100644
--- a/src/lerobot/envs/configs.py
+++ b/src/lerobot/envs/configs.py
@@ -30,6 +30,8 @@ class EnvConfig(draccus.ChoiceRegistry, abc.ABC):
     fps: int = 30
     features: dict[str, PolicyFeature] = field(default_factory=dict)
     features_map: dict[str, str] = field(default_factory=dict)
+    multitask_eval: bool = False
+    max_parallel_tasks: int = 5
 
     @property
     def type(self) -> str:
@@ -271,3 +273,55 @@ class HILEnvConfig(EnvConfig):
             "use_gamepad": self.use_gamepad,
             "gripper_penalty": self.gripper_penalty,
         }
+
+
+@EnvConfig.register_subclass("libero")
+@dataclass
+class LiberoEnv(EnvConfig):
+    task: str = "libero_10"  # can also choose libero_spatial, libero_object, etc.
+    fps: int = 30
+    episode_length: int = 520
+    obs_type: str = "pixels_agent_pos"
+    render_mode: str = "rgb_array"
+    camera_name: str = "agentview_image,robot0_eye_in_hand_image"
+    init_states: bool = True
+    multitask_eval: bool = True
+    features: dict[str, PolicyFeature] = field(
+        default_factory=lambda: {
+            "action": PolicyFeature(type=FeatureType.ACTION, shape=(7,)),
+        }
+    )
+    features_map: dict[str, str] = field(
+        default_factory=lambda: {
+            "action": ACTION,
+            "agent_pos": OBS_STATE,
+            "pixels/agentview_image": f"{OBS_IMAGES}.image",
+            "pixels/robot0_eye_in_hand_image": f"{OBS_IMAGES}.image2",
+        }
+    )
+
+    def __post_init__(self):
+        if self.obs_type == "pixels":
+            self.features["pixels/agentview_image"] = PolicyFeature(
+                type=FeatureType.VISUAL, shape=(360, 360, 3)
+            )
+            self.features["pixels/robot0_eye_in_hand_image"] = PolicyFeature(
+                type=FeatureType.VISUAL, shape=(360, 360, 3)
+            )
+        elif self.obs_type == "pixels_agent_pos":
+            self.features["agent_pos"] = PolicyFeature(type=FeatureType.STATE, shape=(8,))
+            self.features["pixels/agentview_image"] = PolicyFeature(
+                type=FeatureType.VISUAL, shape=(360, 360, 3)
+            )
+            self.features["pixels/robot0_eye_in_hand_image"] = PolicyFeature(
+                type=FeatureType.VISUAL, shape=(360, 360, 3)
+            )
+
+    @property
+    def gym_kwargs(self) -> dict:
+        return {
+            # "task": self.task,
+            "obs_type": self.obs_type,
+            "render_mode": self.render_mode,
+            # "max_episode_steps": self.episode_length,
+        }
diff --git a/src/lerobot/envs/factory.py b/src/lerobot/envs/factory.py
index dc6d96d61..be49a9990 100644
--- a/src/lerobot/envs/factory.py
+++ b/src/lerobot/envs/factory.py
@@ -17,7 +17,7 @@ import importlib
 
 import gymnasium as gym
 
-from lerobot.envs.configs import AlohaEnv, EnvConfig, HILEnvConfig, PushtEnv, XarmEnv
+from lerobot.envs.configs import AlohaEnv, EnvConfig, HILEnvConfig, LiberoEnv, PushtEnv, XarmEnv
 
 
 def make_env_config(env_type: str, **kwargs) -> EnvConfig:
@@ -29,11 +29,15 @@ def make_env_config(env_type: str, **kwargs) -> EnvConfig:
         return XarmEnv(**kwargs)
     elif env_type == "hil":
         return HILEnvConfig(**kwargs)
+    elif env_type == "libero":
+        return LiberoEnv(**kwargs)
     else:
         raise ValueError(f"Policy type '{env_type}' is not available.")
 
 
-def make_env(cfg: EnvConfig, n_envs: int = 1, use_async_envs: bool = False) -> gym.vector.VectorEnv | None:
+def make_env(
+    cfg: EnvConfig, n_envs: int = 1, use_async_envs: bool = False
+) -> gym.vector.VectorEnv | dict[str, dict[int, gym.vector.VectorEnv]]:
     """Makes a gym vector environment according to the config.
 
     Args:
@@ -48,24 +52,41 @@ def make_env(cfg: EnvConfig, n_envs: int = 1, use_async_envs: bool = False) -> g
 
     Returns:
         gym.vector.VectorEnv: The parallelized gym.env instance.
+        dict[str, dict[int, gym.vector.VectorEnv]]: A mapping from task suite
+            names to indexed vectorized environments (when multitask eval is used).
+
     """
     if n_envs < 1:
         raise ValueError("`n_envs must be at least 1")
 
-    package_name = f"gym_{cfg.type}"
-
-    try:
-        importlib.import_module(package_name)
-    except ModuleNotFoundError as e:
-        print(f"{package_name} is not installed. Please install it with `pip install 'lerobot[{cfg.type}]'`")
-        raise e
-
-    gym_handle = f"{package_name}/{cfg.task}"
-
     # batched version of the env that returns an observation of shape (b, c)
     env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
-    env = env_cls(
-        [lambda: gym.make(gym_handle, disable_env_checker=True, **cfg.gym_kwargs) for _ in range(n_envs)]
-    )
+
+    if "libero" in cfg.type:
+        from lerobot.envs.libero import create_libero_envs
+
+        env = create_libero_envs(
+            task=cfg.task,
+            n_envs=n_envs,
+            camera_name=cfg.camera_name,
+            init_states=cfg.init_states,
+            gym_kwargs=cfg.gym_kwargs,
+            env_cls=env_cls,
+            multitask_eval=cfg.multitask_eval,
+        )
+    else:
+        package_name = f"gym_{cfg.type}"
+        try:
+            importlib.import_module(package_name)
+        except ModuleNotFoundError as e:
+            print(
+                f"{package_name} is not installed. Please install it with `pip install 'lerobot[{cfg.type}]'`"
+            )
+            raise e
+
+        gym_handle = f"{package_name}/{cfg.task}"
+        env = env_cls(
+            [lambda: gym.make(gym_handle, disable_env_checker=True, **cfg.gym_kwargs) for _ in range(n_envs)]
+        )
 
     return env
diff --git a/src/lerobot/envs/libero.py b/src/lerobot/envs/libero.py
new file mode 100644
index 000000000..f815228e7
--- /dev/null
+++ b/src/lerobot/envs/libero.py
@@ -0,0 +1,325 @@
+import math
+import os
+from collections import defaultdict
+from collections.abc import Callable
+from itertools import chain
+from typing import Any
+
+import gymnasium as gym
+import numpy as np
+import torch
+from gymnasium import spaces
+from libero.libero import benchmark, get_libero_path
+from libero.libero.envs import OffScreenRenderEnv
+
+
+def create_libero_envs(
+    task: str,
+    n_envs: int,
+    gym_kwargs: dict[str, Any] = None,
+    camera_name: str = "agentview_image,robot0_eye_in_hand_image",
+    init_states: bool = True,
+    env_cls: Callable = None,
+    multitask_eval: bool = True,
+) -> dict[str, dict[str, Any]]:
+    """
+    Here n_envs is per task and equal to the number of rollouts.
+    Returns:
+        dict[str, dict[str, list[LiberoEnv]]]: keys are task_suite and values are list of LiberoEnv envs.
+    """
+    print("num envs", n_envs)
+    print("multitask_eval", multitask_eval)
+    print("gym_kwargs", gym_kwargs)
+    if gym_kwargs is None:
+        gym_kwargs = {}
+
+    if not multitask_eval:
+        benchmark_dict = benchmark.get_benchmark_dict()
+        task_suite = benchmark_dict[task]()  # can also choose libero_spatial, libero_object, libero_10 etc.
+        tasks_id = list(range(len(task_suite.tasks)))
+        episode_indices = [0 for i in range(len(tasks_id))]
+        if len(tasks_id) == 1:
+            tasks_id = [tasks_id[0] for _ in range(n_envs)]
+            episode_indices = list(range(n_envs))
+        elif len(tasks_id) < n_envs and n_envs % len(tasks_id) == 0:
+            n_repeat = n_envs // len(tasks_id)
+            print("n_repeat", n_repeat)
+            episode_indices = []
+            for _ in range(len(tasks_id)):
+                episode_indices.extend(list(range(n_repeat)))
+            tasks_id = list(chain.from_iterable([[item] * n_repeat for item in tasks_id]))
+        elif n_envs < len(tasks_id):
+            tasks_id = tasks_id[:n_envs]
+            episode_indices = list(range(n_envs))[:n_envs]
+            print(f"WARNING: n_envs < len(tasks_id), evaluating only on {tasks_id}")
+        print(f"Creating Libero envs with task ids {tasks_id} from suite {task}")
+        assert n_envs == len(tasks_id), (
+            f"len(n_envs) and tasks_id should be the same, got {n_envs} and {len(tasks_id)}"
+        )
+        return env_cls(
+            [
+                lambda i=i: LiberoEnv(
+                    task_suite=task_suite,
+                    task_id=tasks_id[i],
+                    task_suite_name=task,
+                    camera_name=camera_name,
+                    init_states=init_states,
+                    episode_index=episode_indices[i],
+                    **gym_kwargs,
+                )
+                for i in range(n_envs)
+            ]
+        )
+    else:
+        envs = defaultdict(dict)
+        benchmark_dict = benchmark.get_benchmark_dict()
+        task = task.split(",")
+        for _task in task:
+            task_suite = benchmark_dict[
+                _task
+            ]()  # can also choose libero_spatial, libero_object, libero_10 etc.
+            tasks_ids = list(range(len(task_suite.tasks)))
+            for tasks_id in tasks_ids:
+                episode_indices = list(range(n_envs))
+                print(
+                    f"Creating Libero envs with task ids {tasks_id} from suite {_task}, episode_indices: {episode_indices}"
+                )
+                envs_list = [
+                    (
+                        lambda i=i,
+                        task_suite=task_suite,
+                        tasks_id=tasks_id,
+                        _task=_task,
+                        episode_indices=episode_indices: LiberoEnv(
+                            task_suite=task_suite,
+                            task_id=tasks_id,
+                            task_suite_name=_task,
+                            camera_name=camera_name,
+                            init_states=init_states,
+                            episode_index=episode_indices[i],
+                            **gym_kwargs,
+                        )
+                    )
+                    for i in range(n_envs)
+                ]
+                envs[_task][tasks_id] = env_cls(envs_list)
+        return envs
+
+
+def quat2axisangle(quat):
+    """
+    Copied from robosuite: https://github.com/ARISE-Initiative/robosuite/blob/eafb81f54ffc104f905ee48a16bb15f059176ad3/robosuite/utils/transform_utils.py#L490C1-L512C55
+
+    Converts quaternion to axis-angle format.
+    Returns a unit vector direction scaled by its angle in radians.
+
+    Args:
+        quat (np.array): (x,y,z,w) vec4 float angles
+
+    Returns:
+        np.array: (ax,ay,az) axis-angle exponential coordinates
+    """
+    # clip quaternion
+    if quat[3] > 1.0:
+        quat[3] = 1.0
+    elif quat[3] < -1.0:
+        quat[3] = -1.0
+
+    den = np.sqrt(1.0 - quat[3] * quat[3])
+    if math.isclose(den, 0.0):
+        # This is (close to) a zero degree rotation, immediately return
+        return np.zeros(3)
+
+    return (quat[:3] * 2.0 * math.acos(quat[3])) / den
+
+
+def get_task_init_states(task_suite, i):
+    init_states_path = os.path.join(
+        get_libero_path("init_states"),
+        task_suite.tasks[i].problem_folder,
+        task_suite.tasks[i].init_states_file,
+    )
+    init_states = torch.load(init_states_path, weights_only=False)  # nosec B614
+    return init_states
+
+
+def get_libero_dummy_action():
+    """Get dummy/no-op action, used to roll out the simulation while the robot does nothing."""
+    return [0, 0, 0, 0, 0, 0, -1]
+
+
+OBS_STATE_DIM = 8
+ACTION_DIM = 7
+
+
+class LiberoEnv(gym.Env):
+    metadata = {"render_modes": ["rgb_array"], "render_fps": 80}
+
+    def __init__(
+        self,
+        task_suite,
+        task_id,
+        task_suite_name,
+        camera_name="agentview_image,robot0_eye_in_hand_image",
+        obs_type="pixels",
+        render_mode="rgb_array",
+        observation_width=256,
+        observation_height=256,
+        visualization_width=640,
+        visualization_height=480,
+        init_states=True,
+        episode_index=0,
+    ):
+        super().__init__()
+        self.task_id = task_id
+        self.obs_type = obs_type
+        self.render_mode = render_mode
+        self.observation_width = observation_width
+        self.observation_height = observation_height
+        self.visualization_width = visualization_width
+        self.visualization_height = visualization_height
+        self.init_states = init_states
+        self.camera_name = camera_name.split(
+            ","
+        )  # agentview_image (main) or robot0_eye_in_hand_image (wrist)
+
+        # Map raw camera names to "image1" and "image2".
+        # The preprocessing step `preprocess_observation` will then prefix these with `.images.*`,
+        # following the LeRobot convention (e.g., `observation.images.image`, `observation.images.image2`).
+        # This ensures the policy consistently receives observations in the
+        # expected format regardless of the original camera naming.
+        self.camera_name_mapping = {
+            "agentview_image": "image",
+            "robot0_eye_in_hand_image": "image2",
+        }
+
+        self.num_steps_wait = (
+            10  # Do nothing for the first few timesteps to wait for the simulator drops objects
+        )
+        self.episode_index = episode_index
+
+        self._env = self._make_envs_task(task_suite, self.task_id)
+        if task_suite_name == "libero_spatial":
+            max_steps = 220  # longest training demo has 193 steps
+        elif task_suite_name == "libero_object":
+            max_steps = 280  # longest training demo has 254 steps
+        elif task_suite_name == "libero_goal":
+            max_steps = 300  # longest training demo has 270 steps
+        elif task_suite_name == "libero_10":
+            max_steps = 520  # longest training demo has 505 steps
+        elif task_suite_name == "libero_90":
+            max_steps = 400  # longest training demo has 373 steps
+        self._max_episode_steps = max_steps
+
+        images = {}
+        for cam in self.camera_name:
+            images[self.camera_name_mapping[cam]] = spaces.Box(
+                low=0,
+                high=255,
+                shape=(self.observation_height, self.observation_width, 3),
+                dtype=np.uint8,
+            )
+
+        if self.obs_type == "state":
+            raise NotImplementedError()
+        elif self.obs_type == "pixels":
+            self.observation_space = spaces.Dict(
+                {
+                    "pixels": spaces.Dict(images),
+                }
+            )
+        elif self.obs_type == "pixels_agent_pos":
+            self.observation_space = spaces.Dict(
+                {
+                    "pixels": spaces.Dict(images),
+                    "agent_pos": spaces.Box(
+                        low=-1000.0,
+                        high=1000.0,
+                        shape=(OBS_STATE_DIM,),
+                        dtype=np.float64,
+                    ),
+                }
+            )
+
+        self.action_space = spaces.Box(low=-1, high=1, shape=(ACTION_DIM,), dtype=np.float32)
+
+    def render(self):
+        raw_obs = self._env.env._get_observations()
+        formatted = self._format_raw_obs(raw_obs)
+        # grab the "main" camera
+        return formatted["pixels"]["image"]
+
+    def _make_envs_task(self, task_suite, task_id: int = 0):
+        task = task_suite.get_task(task_id)
+        self.task = task.name
+        self.task_description = task.language
+        task_bddl_file = os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file)
+
+        env_args = {
+            "bddl_file_name": task_bddl_file,
+            "camera_heights": self.observation_height,
+            "camera_widths": self.observation_width,
+        }
+        env = OffScreenRenderEnv(**env_args)
+        env.reset()
+        if self.init_states:
+            init_states = get_task_init_states(
+                task_suite, task_id
+            )  # for benchmarking purpose, we fix the a set of initial states FIXME(mshukor): should be in the reset()?
+            init_state_id = self.episode_index  # episode index
+            env.set_init_state(init_states[init_state_id])
+
+        return env
+
+    def _format_raw_obs(self, raw_obs):
+        images = {}
+        for camera_name in self.camera_name:
+            image = raw_obs[camera_name]
+            image = image[::-1, ::-1]  # rotate 180 degrees
+            images[self.camera_name_mapping[camera_name]] = image
+        # images = image if len(images) == 1 else images
+        state = np.concatenate(
+            (
+                raw_obs["robot0_eef_pos"],
+                quat2axisangle(raw_obs["robot0_eef_quat"]),
+                raw_obs["robot0_gripper_qpos"],
+            )
+        )
+        agent_pos = state
+        if self.obs_type == "state":
+            raise NotImplementedError()
+        elif self.obs_type == "pixels":
+            obs = {"pixels": images.copy()}
+        elif self.obs_type == "pixels_agent_pos":
+            obs = {
+                "pixels": images.copy(),
+                "agent_pos": agent_pos,
+            }
+        return obs
+
+    def reset(self, seed=None, **kwargs):
+        super().reset(seed=seed)
+
+        self._env.seed(seed)
+        raw_obs = self._env.reset()
+        # Do nothing for the first few timesteps to wait for the simulator drops objects
+        for _ in range(self.num_steps_wait):
+            raw_obs, _, _, _ = self._env.step(get_libero_dummy_action())
+        observation = self._format_raw_obs(raw_obs)
+        info = {"is_success": False}
+        return observation, info
+
+    def step(self, action):
+        assert action.ndim == 1
+        action[-1] = 1.0 - action[-1]
+        raw_obs, reward, done, info = self._env.step(action)
+        is_success = self._env.check_success()
+        terminated = done or is_success
+        info["is_success"] = is_success
+        observation = self._format_raw_obs(raw_obs)
+        truncated = False
+        # note if it is unable to complete get libero error after many steps
+        return observation, reward, terminated, truncated, info
+
+    def close(self):
+        self._env.close()
diff --git a/src/lerobot/scripts/eval.py b/src/lerobot/scripts/eval.py
index 13d30c686..3145bed35 100644
--- a/src/lerobot/scripts/eval.py
+++ b/src/lerobot/scripts/eval.py
@@ -46,6 +46,7 @@ Note that in both examples, the repo/folder should contain at least `config.json
 You can learn about the CLI options for this script in the `EvalPipelineConfig` in lerobot/configs/eval.py
 """
 
+import concurrent
 import json
 import logging
 import threading
@@ -145,7 +146,7 @@ def rollout(
         leave=False,
     )
     check_env_attributes_and_types(env)
-    while not np.all(done):
+    while not np.all(done) and step < max_steps:
         # Numpy array to tensor and changing dictionary keys to LeRobot policy format.
         observation = preprocess_observation(observation)
         if return_observations:
@@ -158,10 +159,8 @@ def rollout(
         # Infer "task" from attributes of environments.
         # TODO: works with SyncVectorEnv but not AsyncVectorEnv
         observation = add_envs_task(env, observation)
-
         with torch.inference_mode():
             action = policy.select_action(observation)
-
         # Convert to CPU / numpy.
         action = action.to("cpu").numpy()
         assert action.ndim == 2, "Action dimensions should be (batch, action_dim)"
@@ -179,7 +178,12 @@ def rollout(
             successes = [False] * env.num_envs
 
         # Keep track of which environments are done so far.
+        # Mark the episode as done if we reach the maximum step limit.
+        # This ensures that the rollout always terminates cleanly at `max_steps`,
+        # and allows logging/saving (e.g., videos) to be triggered consistently.
         done = terminated | truncated | done
+        if step + 1 == max_steps:
+            done = np.ones_like(done, dtype=bool)
 
         all_actions.append(torch.from_numpy(action))
         all_rewards.append(torch.from_numpy(reward))
@@ -402,7 +406,6 @@ def eval_policy(
             "eval_ep_s": (time.time() - start) / n_episodes,
         },
     }
-
     if return_episode_data:
         info["episodes"] = episode_data
 
@@ -474,37 +477,188 @@ def eval_main(cfg: EvalPipelineConfig):
     env = make_env(cfg.env, n_envs=cfg.eval.batch_size, use_async_envs=cfg.eval.use_async_envs)
 
     logging.info("Making policy.")
-
     policy = make_policy(
         cfg=cfg.policy,
         env_cfg=cfg.env,
     )
     policy.eval()
-
     with torch.no_grad(), torch.autocast(device_type=device.type) if cfg.policy.use_amp else nullcontext():
-        info = eval_policy(
-            env,
-            policy,
-            cfg.eval.n_episodes,
-            max_episodes_rendered=10,
-            videos_dir=Path(cfg.output_dir) / "videos",
-            start_seed=cfg.seed,
-        )
-    print(info["aggregated"])
+        if cfg.env.multitask_eval:
+            info = eval_policy_multitask(
+                env,
+                policy,
+                cfg.eval.n_episodes,
+                max_episodes_rendered=10,
+                videos_dir=Path(cfg.output_dir) / "videos",
+                start_seed=cfg.seed,
+                max_parallel_tasks=cfg.env.max_parallel_tasks,
+                verbose=False,
+            )
+            print("Overall Aggregated Metrics:")
+            print(info["overall"]["aggregated"])
+
+            # Print per-suite stats
+            for task_group, task_group_info in info.items():
+                if task_group == "overall":
+                    continue  # Skip the overall stats since we already printed it
+                print(f"\nAggregated Metrics for {task_group}:")
+                print(task_group_info["aggregated"])
+            for _task_group, v in env.items():
+                for _env in v.values():
+                    _env.close()
+        else:
+            info = eval_policy(
+                env,
+                policy,
+                cfg.eval.n_episodes,
+                max_episodes_rendered=10,
+                videos_dir=Path(cfg.output_dir) / "videos",
+                start_seed=cfg.seed,
+            )
+            print(info["aggregated"])
+            env.close()
 
     # Save info
     with open(Path(cfg.output_dir) / "eval_info.json", "w") as f:
         json.dump(info, f, indent=2)
 
-    env.close()
-
     logging.info("End of eval")
 
 
-def main():
-    init_logging()
-    eval_main()
+def eval_policy_multitask(
+    envs: dict[str, dict[str, gym.vector.VectorEnv]],
+    policy,
+    n_episodes: int,
+    max_episodes_rendered: int = 0,
+    videos_dir: Path | None = None,
+    return_episode_data: bool = False,
+    start_seed: int | None = None,
+    max_parallel_tasks: int = 5,
+    verbose: bool = True,
+) -> dict:
+    global_start = time.time()
+    results = {}
+
+    overall_rewards, overall_max_rewards, overall_successes = [], [], []
+    overall_video_paths = []
+    overall_episode_data = None
+
+    def eval_task(task_group, task_id, env):
+        """Evaluates a single task in parallel."""
+        print(f"Evaluating: task_group: {task_group}, task_id: {task_id} ...")
+        if videos_dir is not None:
+            task_videos_dir = videos_dir / f"{task_group}_{task_id}"
+            task_videos_dir.mkdir(parents=True, exist_ok=True)
+        task_result = eval_policy(
+            env,
+            policy,
+            n_episodes,
+            max_episodes_rendered,
+            task_videos_dir,
+            return_episode_data,
+            start_seed,
+        )
+
+        per_episode = task_result["per_episode"]
+        return {
+            "task_group": task_group,
+            "task_id": task_id,
+            "sum_rewards": [ep["sum_reward"] for ep in per_episode],
+            "max_rewards": [ep["max_reward"] for ep in per_episode],
+            "successes": [ep["success"] for ep in per_episode],
+            "video_paths": task_result.get("video_paths", []),
+        }
+
+    task_group_results = {}
+    if max_parallel_tasks == 1:
+        # sequential mode (safe for colab / EGL)
+        for task_group, tasks in envs.items():
+            for task_id, env in tasks.items():
+                task_result = eval_task(task_group, task_id, env)
+                if task_group not in task_group_results:
+                    task_group_results[task_group] = {
+                        "sum_rewards": [],
+                        "max_rewards": [],
+                        "successes": [],
+                        "video_paths": [],
+                    }
+                task_group_results[task_group]["sum_rewards"].extend(task_result["sum_rewards"])
+                task_group_results[task_group]["max_rewards"].extend(task_result["max_rewards"])
+                task_group_results[task_group]["successes"].extend(task_result["successes"])
+                task_group_results[task_group]["video_paths"].extend(task_result["video_paths"])
+    else:
+        with concurrent.futures.ThreadPoolExecutor(max_workers=max_parallel_tasks) as executor:
+            future_to_task = {
+                executor.submit(eval_task, task_group, task_id, env): (task_group, task_id)
+                for task_group, tasks in envs.items()
+                for task_id, env in tasks.items()
+            }
+
+            task_group_results = {}
+
+            for future in concurrent.futures.as_completed(future_to_task):
+                task_result = future.result()
+                task_group = task_result["task_group"]
+
+                if task_group not in task_group_results:
+                    task_group_results[task_group] = {
+                        "sum_rewards": [],
+                        "max_rewards": [],
+                        "successes": [],
+                        "video_paths": [],
+                    }
+
+                task_group_results[task_group]["sum_rewards"].extend(task_result["sum_rewards"])
+                task_group_results[task_group]["max_rewards"].extend(task_result["max_rewards"])
+                task_group_results[task_group]["successes"].extend(task_result["successes"])
+                task_group_results[task_group]["video_paths"].extend(task_result["video_paths"])
+
+    # Process results per task group
+    for task_group, data in task_group_results.items():
+        suite_rewards = data["sum_rewards"]
+        suite_max_rewards = data["max_rewards"]
+        suite_successes = data["successes"]
+        suite_video_paths = data["video_paths"]
+
+        suite_eval_s = time.time() - global_start
+        suite_eval_ep_s = suite_eval_s / max(1, len(suite_rewards))
+
+        results[task_group] = {
+            "aggregated": {
+                "avg_sum_reward": float(np.nanmean(suite_rewards)),
+                "avg_max_reward": float(np.nanmean(suite_max_rewards)),
+                "pc_success": float(np.nanmean(suite_successes) * 100),
+                "eval_s": suite_eval_s,
+                "eval_ep_s": suite_eval_ep_s,
+            },
+            "video_paths": suite_video_paths,
+            "episodes": None,  # Modify if episode data is needed
+        }
+
+        overall_rewards.extend(suite_rewards)
+        overall_max_rewards.extend(suite_max_rewards)
+        overall_successes.extend(suite_successes)
+        overall_video_paths.extend(suite_video_paths)
+
+    # Global metrics
+    global_eval_s = time.time() - global_start
+    global_eval_ep_s = global_eval_s / max(1, len(overall_rewards))
+
+    results["overall"] = {
+        "aggregated": {
+            "avg_sum_reward": float(np.nanmean(overall_rewards)),
+            "avg_max_reward": float(np.nanmean(overall_max_rewards)),
+            "pc_success": float(np.nanmean(overall_successes) * 100),
+            "eval_s": global_eval_s,
+            "eval_ep_s": global_eval_ep_s,
+        },
+        "video_paths": overall_video_paths,
+        "episodes": overall_episode_data,
+    }
+
+    return results
 
 
 if __name__ == "__main__":
-    main()
+    init_logging()
+    eval_main()
diff --git a/src/lerobot/scripts/train.py b/src/lerobot/scripts/train.py
index 235352cd8..d5d562518 100644
--- a/src/lerobot/scripts/train.py
+++ b/src/lerobot/scripts/train.py
@@ -34,7 +34,7 @@ from lerobot.optim.factory import make_optimizer_and_scheduler
 from lerobot.policies.factory import make_policy
 from lerobot.policies.pretrained import PreTrainedPolicy
 from lerobot.policies.utils import get_device_from_parameters
-from lerobot.scripts.eval import eval_policy
+from lerobot.scripts.eval import eval_policy, eval_policy_multitask
 from lerobot.utils.logging_utils import AverageMeter, MetricsTracker
 from lerobot.utils.random_utils import set_seed
 from lerobot.utils.train_utils import (
@@ -186,7 +186,6 @@ def train(cfg: TrainPipelineConfig):
     dl_iter = cycle(dataloader)
 
     policy.train()
-
     train_metrics = {
         "loss": AverageMeter("loss", ":.3f"),
         "grad_norm": AverageMeter("grdn", ":.3f"),
@@ -252,14 +251,33 @@ def train(cfg: TrainPipelineConfig):
                 torch.no_grad(),
                 torch.autocast(device_type=device.type) if cfg.policy.use_amp else nullcontext(),
             ):
-                eval_info = eval_policy(
-                    eval_env,
-                    policy,
-                    cfg.eval.n_episodes,
-                    videos_dir=cfg.output_dir / "eval" / f"videos_step_{step_id}",
-                    max_episodes_rendered=4,
-                    start_seed=cfg.seed,
-                )
+                if cfg.env.multitask_eval:
+                    eval_info = eval_policy_multitask(
+                        eval_env,
+                        policy,
+                        cfg.eval.n_episodes,
+                        videos_dir=cfg.output_dir / "eval" / f"videos_step_{step_id}",
+                        max_episodes_rendered=4,
+                        start_seed=cfg.seed,
+                        max_parallel_tasks=cfg.env.max_parallel_tasks,
+                    )
+                    aggregated = eval_info["overall"]["aggregated"]
+                    # Print per-suite stats, log?
+                    for task_group, task_group_info in eval_info.items():
+                        if task_group == "overall":
+                            continue  # Skip the overall stats since we already printed it
+                        print(f"\nAggregated Metrics for {task_group}:")
+                        print(task_group_info["aggregated"])
+                else:
+                    eval_info = eval_policy(
+                        eval_env,
+                        policy,
+                        cfg.eval.n_episodes,
+                        videos_dir=cfg.output_dir / "eval" / f"videos_step_{step_id}",
+                        max_episodes_rendered=4,
+                        start_seed=cfg.seed,
+                    )
+                    aggregated = eval_info["aggregated"]
 
             eval_metrics = {
                 "avg_sum_reward": AverageMeter("∑rwrd", ":.3f"),
@@ -269,9 +287,9 @@ def train(cfg: TrainPipelineConfig):
             eval_tracker = MetricsTracker(
                 cfg.batch_size, dataset.num_frames, dataset.num_episodes, eval_metrics, initial_step=step
             )
-            eval_tracker.eval_s = eval_info["aggregated"].pop("eval_s")
-            eval_tracker.avg_sum_reward = eval_info["aggregated"].pop("avg_sum_reward")
-            eval_tracker.pc_success = eval_info["aggregated"].pop("pc_success")
+            eval_tracker.eval_s = aggregated.pop("eval_s")
+            eval_tracker.avg_sum_reward = aggregated.pop("avg_sum_reward")
+            eval_tracker.pc_success = aggregated.pop("pc_success")
             logging.info(eval_tracker)
             if wandb_logger:
                 wandb_log_dict = {**eval_tracker.to_dict(), **eval_info}
@@ -279,7 +297,12 @@ def train(cfg: TrainPipelineConfig):
                 wandb_logger.log_video(eval_info["video_paths"][0], step, mode="eval")
 
     if eval_env:
-        eval_env.close()
+        if cfg.env.multitask_eval:
+            for _task_group, envs_dict in eval_env.items():
+                for _idx, env in envs_dict.items():
+                    env.close()
+        else:
+            eval_env.close()
     logging.info("End of training")
 
     if cfg.policy.push_to_hub: