fix(eval): use FeatureType enum comparison instead of string value

fix(eval): infer recording features from actual env observations
fix(eval): align raw frame keys with dataset schema and fix numpy types
2026-06-16 07:49:48 +00:00 · 2026-06-15 18:50:24 +02:00 · 2026-06-15 18:47:16 +02:00 · 2026-06-15 18:38:12 +02:00 · 2026-06-15 18:26:44 +02:00 · 2026-06-15 17:03:36 +02:00
30 changed files with 404 additions and 89 deletions
@@ -167,9 +167,9 @@ jobs:

      # ── LIBERO TRAIN+EVAL SMOKE ──────────────────────────────────────────────
      # Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only) then
-      # immediately runs eval inside the training loop (eval_freq=1, 1 episode).
+      # immediately runs eval inside the training loop (env_eval_freq=1, 1 episode).
      # Tests the full train→eval-within-training pipeline end-to-end.
-      - name: Run Libero train+eval smoke (1 step, eval_freq=1)
+      - name: Run Libero train+eval smoke (1 step, env_eval_freq=1)
        if: env.HF_USER_TOKEN != ''
        run: |
          docker run --name libero-train-smoke --gpus all \
@@ -196,7 +196,7 @@ jobs:
                --output_dir=/tmp/train-smoke \
                --steps=1 \
                --batch_size=1 \
-                --eval_freq=1 \
+                --env_eval_freq=1 \
                --eval.n_episodes=1 \
                --eval.batch_size=1 \
                --eval.use_async_envs=false \
@@ -58,7 +58,7 @@ test-act-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=4 \
-		--eval_freq=2 \
+		--env_eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_freq=2 \
@@ -96,7 +96,7 @@ test-diffusion-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=2 \
-		--eval_freq=2 \
+		--env_eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_checkpoint=true \
@@ -126,7 +126,7 @@ test-tdmpc-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=2 \
-		--eval_freq=2 \
+		--env_eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_checkpoint=true \
@@ -161,7 +161,7 @@ test-smolvla-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=4 \
-		--eval_freq=2 \
+		--env_eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_freq=2 \
@@ -719,7 +719,7 @@ Example configuration for training the [reward classifier](https://huggingface.c
  "num_workers": 4,
  "steps": 5000,
  "log_freq": 10,
-  "eval_freq": 1000,
+  "env_eval_freq": 1000,
  "save_freq": 1000,
  "save_checkpoint": true,
  "seed": 2,
@@ -143,7 +143,7 @@ lerobot-train \
  --batch_size=4 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
-  --eval_freq=1000
+  --env_eval_freq=1000
 ```

 ## Reproducing published results
@@ -173,7 +173,7 @@ lerobot-train \
    --batch_size=4 \
    --eval.batch_size=1 \
    --eval.n_episodes=1 \
-    --eval_freq=1000
+    --env_eval_freq=1000
 ```

 ## Relationship to LIBERO
@@ -120,11 +120,11 @@ lerobot-train \
  --batch_size=4 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
-  --eval_freq=1000
+  --env_eval_freq=1000
 ```

 ## Practical tips

 - Use the one-hot task conditioning for multi-task training (MT10/MT50 conventions) so policies have explicit task context.
 - Inspect the dataset task descriptions and the `info["is_success"]` keys when writing post-processing or logging so your success metrics line up with the benchmark.
- Adjust `batch_size`, `steps`, and `eval_freq` to match your compute budget.
+- Adjust `batch_size`, `steps`, and `env_eval_freq` to match your compute budget.
@@ -103,7 +103,7 @@ accelerate launch \
  --batch_size=32 \
  --num_workers=4 \
  --log_freq=20 \
-  --eval_freq=-1 \
+  --env_eval_freq=-1 \
  --save_checkpoint=true \
  --save_freq=2000
 ```
@@ -142,7 +142,7 @@ accelerate launch \
  --batch_size=32 \
  --num_workers=4 \
  --log_freq=20 \
-  --eval_freq=-1 \
+  --env_eval_freq=-1 \
  --save_checkpoint=true \
  --save_freq=2000
 ```
@@ -314,7 +314,7 @@ lerobot-train \
  --steps=30000 \
  --save_freq=1000 \
  --log_freq=100 \
-  --eval_freq=1000 \
+  --env_eval_freq=1000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
@@ -166,7 +166,7 @@ lerobot-train \
  --output_dir=./outputs/smolvla_robocasa_CloseFridge \
  --steps=100000 \
  --batch_size=4 \
-  --eval_freq=5000 \
+  --env_eval_freq=5000 \
  --eval.batch_size=1 \
  --eval.n_episodes=5 \
  --save_freq=10000
@@ -165,7 +165,7 @@ lerobot-train \
  --output_dir=./outputs/smolvla_vlabench_primitive \
  --steps=100000 \
  --batch_size=4 \
-  --eval_freq=5000 \
+  --env_eval_freq=5000 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --save_freq=10000
@@ -180,24 +180,32 @@ class WandBLogger:
                self._wandb_custom_step_key.add(new_custom_key)
                self._wandb.define_metric(new_custom_key, hidden=True)

+        batch_data = {}
        for k, v in d.items():
+            # Skip the custom step key here, it's added to the batch below.
+            if custom_step_key is not None and k == custom_step_key:
+                continue
+
+            if isinstance(v, list):
+                for i, elem in enumerate(v):
+                    if isinstance(elem, (int | float)):
+                        batch_data[f"{mode}/{k}_{i}"] = elem
+                continue
+
            if not isinstance(v, (int | float | str)):
                logging.warning(
                    f'WandB logging of key "{k}" was ignored as its type "{type(v)}" is not handled by this wrapper.'
                )
                continue

-            # Do not log the custom step key itself.
-            if self._wandb_custom_step_key is not None and k in self._wandb_custom_step_key:
-                continue
+            batch_data[f"{mode}/{k}"] = v

+        if batch_data:
            if custom_step_key is not None:
-                value_custom_step = d[custom_step_key]
-                data = {f"{mode}/{k}": v, f"{mode}/{custom_step_key}": value_custom_step}
-                self._wandb.log(data)
-                continue
-
-            self._wandb.log(data={f"{mode}/{k}": v}, step=step)
+                batch_data[f"{mode}/{custom_step_key}"] = d[custom_step_key]
+                self._wandb.log(batch_data)
+            else:
+                self._wandb.log(data=batch_data, step=step)

    def log_video(self, video_path: str, step: int, mode: str = "train"):
        if mode not in {"train", "eval"}:
@@ -39,6 +39,8 @@ class DatasetConfig:
    # This reduces memory and speeds up DataLoader IPC. The training pipeline handles the conversion.
    return_uint8: bool = False
    streaming: bool = False
+    # Fraction of episodes held out per task for offline evaluation (0.0 = disabled).
+    eval_split: float = 0.0

    def __post_init__(self) -> None:
        if self.episodes is not None:
@@ -73,6 +75,8 @@ class EvalConfig:
    # `use_async_envs` specifies whether to use asynchronous environments (multiprocessing).
    # Defaults to True; automatically downgraded to SyncVectorEnv when batch_size=1.
    use_async_envs: bool = True
+    # Whether to record eval rollouts as a LeRobot v3.0 dataset on disk.
+    recording: bool = False

    def __post_init__(self) -> None:
        if self.batch_size == 0:
@@ -79,6 +79,8 @@ class PreTrainedConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC):  # type: igno
    # Either the repo ID of a model hosted on the Hub or a path to a directory containing weights
    # saved using `Policy.save_pretrained`. If not provided, the policy is initialized from scratch.
    pretrained_path: Path | None = None
+    # Optional Hub revision (commit hash, branch, or tag) to pin the pretrained model version.
+    pretrained_revision: str | None = None

    def __post_init__(self) -> None:
        if not self.device or not is_torch_device_available(self.device):
@@ -56,6 +56,8 @@ class RewardModelConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC):
    device: str | None = None

    pretrained_path: str | None = None
+    # Optional Hub revision (commit hash, branch, or tag) to pin the pretrained reward model version.
+    pretrained_revision: str | None = None

    push_to_hub: bool = False
    repo_id: str | None = None
@@ -100,8 +100,13 @@ class TrainPipelineConfig(HubMixin):
    prefetch_factor: int = 4
    persistent_workers: bool = True
    steps: int = 100_000
-    eval_freq: int = 20_000
+    # Run policy in the simulation environment every N steps to measure reward/success (0 = disabled).
+    env_eval_freq: int = 20_000
    log_freq: int = 200
+    # Compute eval loss on held-out episodes every N steps (0 = disabled). Requires eval_split > 0.
+    eval_steps: int = 0
+    # Cap on total eval samples, split uniformly across tasks (0 = use all held-out data).
+    max_eval_samples: int = 0
    tolerance_s: float = 1e-4
    save_checkpoint: bool = True
    # Checkpoint is saved every `save_freq` training iterations and after the last training step.
@@ -35,7 +35,7 @@ from .dataset_tools import (
    remove_feature,
    split_dataset,
 )
-from .factory import make_dataset, resolve_delta_timestamps
+from .factory import make_dataset, make_train_eval_datasets, resolve_delta_timestamps
 from .image_writer import safe_stop_image_writer
 from .io_utils import load_episodes, write_stats
 from .language import (
@@ -89,6 +89,7 @@ __all__ = [
    "get_feature_stats",
    "load_episodes",
    "make_dataset",
+    "make_train_eval_datasets",
    "merge_datasets",
    "modify_features",
    "modify_tasks",
@@ -14,6 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import logging
+import math
 from pprint import pformat

 import torch
@@ -130,3 +131,81 @@ def make_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset | MultiLeRobotDatas
                dataset.meta.stats[key][stats_type] = torch.tensor(stats, dtype=torch.float32)

    return dataset
+
+
+def make_train_eval_datasets(
+    cfg: TrainPipelineConfig,
+) -> tuple[LeRobotDataset | MultiLeRobotDataset, LeRobotDataset | None]:
+    """Create train and optional eval datasets by splitting episodes based on eval_split.
+
+    The last ceil(n_episodes * eval_split) episodes per task are held out for evaluation.
+    If eval_split == 0.0, returns (full_dataset, None).
+    """
+    full_dataset = make_dataset(cfg)
+
+    if cfg.dataset.eval_split == 0.0:
+        return full_dataset, None
+
+    base_episodes = (
+        full_dataset.episodes if full_dataset.episodes is not None else list(range(full_dataset.num_episodes))
+    )
+
+    episode_tasks = full_dataset.meta.episodes["tasks"]
+    task_to_episodes: dict[str, list[int]] = {}
+    for ep_idx in base_episodes:
+        task_key = episode_tasks[ep_idx][0] if episode_tasks[ep_idx] else ""
+        task_to_episodes.setdefault(task_key, []).append(ep_idx)
+
+    train_episodes, eval_episodes = [], []
+    for eps in task_to_episodes.values():
+        n_eval = math.ceil(len(eps) * cfg.dataset.eval_split)
+        train_episodes.extend(eps[: len(eps) - n_eval])
+        eval_episodes.extend(eps[len(eps) - n_eval :])
+
+    if not train_episodes:
+        raise ValueError(
+            f"eval_split={cfg.dataset.eval_split} leaves 0 training episodes from {len(base_episodes)} total."
+        )
+
+    logging.info(
+        f"Train/eval split: {len(train_episodes)} train, {len(eval_episodes)} eval "
+        f"(eval_split={cfg.dataset.eval_split}, {len(task_to_episodes)} tasks)"
+    )
+
+    delta_timestamps = resolve_delta_timestamps(cfg.trainable_config, full_dataset.meta)
+
+    train_image_transforms = (
+        ImageTransforms(cfg.dataset.image_transforms) if cfg.dataset.image_transforms.enable else None
+    )
+
+    train_dataset = LeRobotDataset(
+        cfg.dataset.repo_id,
+        root=cfg.dataset.root,
+        episodes=train_episodes,
+        delta_timestamps=delta_timestamps,
+        image_transforms=train_image_transforms,
+        revision=cfg.dataset.revision,
+        video_backend=cfg.dataset.video_backend,
+        return_uint8=True,
+        tolerance_s=cfg.tolerance_s,
+    )
+
+    eval_dataset = LeRobotDataset(
+        cfg.dataset.repo_id,
+        root=cfg.dataset.root,
+        episodes=eval_episodes,
+        delta_timestamps=delta_timestamps,
+        image_transforms=None,
+        revision=cfg.dataset.revision,
+        video_backend=cfg.dataset.video_backend,
+        return_uint8=True,
+        tolerance_s=cfg.tolerance_s,
+    )
+
+    if cfg.dataset.use_imagenet_stats:
+        for ds in (train_dataset, eval_dataset):
+            for key in ds.meta.camera_keys:
+                for stats_type, stats in IMAGENET_STATS.items():
+                    ds.meta.stats[key][stats_type] = torch.tensor(stats, dtype=torch.float32)
+
+    return train_dataset, eval_dataset
@@ -153,7 +153,7 @@ def cast_stats_to_numpy(stats: dict) -> dict[str, dict[str, np.ndarray]]:
    Returns:
        dict: The statistics dictionary with values cast to numpy arrays.
    """
-    stats = {key: np.array(value) for key, value in flatten_dict(stats).items()}
+    stats = {key: np.atleast_1d(np.array(value)) for key, value in flatten_dict(stats).items()}
    return unflatten_dict(stats)


@@ -474,6 +474,8 @@ class LeRobotDataset(torch.utils.data.Dataset):
        if reader.hf_dataset is None:
            # One-shot load after finalize()
            reader.load_and_activate()
+        if reader._absolute_to_relative_idx is not None and idx in reader._absolute_to_relative_idx:
+            idx = reader._absolute_to_relative_idx[idx]
        return reader.get_item(idx)

    def select_columns(self, column_names: str | list[str]):
@@ -70,21 +70,19 @@ def aggregate_pipeline_dataset_features(
    initial_features: dict[PipelineFeatureType, dict[str, Any]],
    *,
    use_videos: bool = True,
-    exclude_images: bool = False,
    patterns: Sequence[str] | None = None,
 ) -> dict[str, dict]:
    """
    Aggregates and filters pipeline features to create a dataset-ready features dictionary.

    This function transforms initial features using the pipeline, categorizes them as action or observations
-    (image or state), filters them based on `exclude_images` and `patterns`, and finally
+    (image or state), filters them based on `use_videos` and `patterns`, and finally
    formats them for use with a Hugging Face LeRobot Dataset.

    Args:
        pipeline: The DataProcessorPipeline to apply.
        initial_features: A dictionary of raw feature specs for actions and observations.
-        use_videos: Controls the storage dtype for image features. If True, images are stored as "video"; if False, they are stored as "image".
-        exclude_images: If True, image features are dropped entirely from the output.
+        use_videos: If False, image features are excluded.
        patterns: A sequence of regex patterns to filter action and state features.
                  Image features are not affected by this filter.

@@ -122,7 +120,7 @@ def aggregate_pipeline_dataset_features(
            )

            # 2. Apply filtering rules.
-            if is_image and exclude_images:
+            if is_image and not use_videos:
                continue
            if not is_image and not should_keep(key, compiled_patterns):
                continue
@@ -126,6 +126,26 @@ def preprocess_observation(observations: dict[str, np.ndarray]) -> dict[str, Ten
    if "camera_obs" in observations:
        return_observations[f"{OBS_STR}.camera_obs"] = observations["camera_obs"]

+    # Pass through any remaining ndarray/tensor keys not already handled above,
+    # so env plugins can expose extra observation keys via get_env_processors().
+    _handled = {"pixels", "environment_state", "agent_pos", "robot_state", "policy", "camera_obs"}
+    for key, value in observations.items():
+        if key in _handled:
+            continue
+        target = f"{OBS_STR}.{key}"
+        if target in return_observations:
+            continue
+        if isinstance(value, np.ndarray):
+            val = torch.from_numpy(value).float()
+            if val.dim() == 1:
+                val = val.unsqueeze(0)
+            return_observations[target] = val
+        elif isinstance(value, Tensor):
+            val = value.float()
+            if val.dim() == 1:
+                val = val.unsqueeze(0)
+            return_observations[target] = val
+
    return return_observations


@@ -148,7 +148,7 @@ class ACTPolicy(PreTrainedPolicy):
        l1_loss = (abs_err * valid_mask).sum() / num_valid.clamp_min(1)

        loss_dict = {"l1_loss": l1_loss.item()}
-        if self.config.use_vae:
+        if self.config.use_vae and log_sigma_x2_hat is not None:
            # Calculate Dₖₗ(latent_pdf || standard_normal). Note: After computing the KL-divergence for
            # each dimension independently, we sum over the latent dimension to get the total
            # KL-divergence per batch element, then take the mean over the batch.
@@ -101,11 +101,23 @@ class DiffusionPolicy(PreTrainedPolicy):

    @torch.no_grad()
    def predict_action_chunk(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
-        """Predict a chunk of actions given environment observations."""
-        # stack n latest observations from the queue
-        batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
-        actions = self.diffusion.generate_actions(batch, noise=noise)
+        """Predict a chunk of actions given environment observations.

+        Supports two modes:
+        - Online (queues populated via select_action): stacks observations from internal queues.
+        - Offline (empty queues, e.g. dataloader batch): uses the batch directly.
+        """
+        queues_populated = any(len(q) > 0 for q in self._queues.values())
+        if queues_populated:
+            batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
+        else:
+            batch = dict(batch)
+            if self.config.image_features:
+                for key in self.config.image_features:
+                    if batch[key].ndim == 4:
+                        batch[key] = batch[key].unsqueeze(1)
+                batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
+        actions = self.diffusion.generate_actions(batch, noise=noise)
        return actions

    @torch.no_grad()
@@ -252,6 +252,7 @@ class ProcessorConfigKwargs(TypedDict, total=False):
 def make_pre_post_processors(
    policy_cfg: PreTrainedConfig,
    pretrained_path: str | None = None,
+    pretrained_revision: str | None = None,
    **kwargs: Unpack[ProcessorConfigKwargs],
 ) -> tuple[
    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
@@ -309,6 +310,7 @@ def make_pre_post_processors(
            overrides=kwargs.get("preprocessor_overrides", {}),
            to_transition=batch_to_transition,
            to_output=transition_to_batch,
+            revision=pretrained_revision,
        )
        postprocessor = PolicyProcessorPipeline.from_pretrained(
            pretrained_model_name_or_path=pretrained_path,
@@ -318,6 +320,7 @@ def make_pre_post_processors(
            overrides=kwargs.get("postprocessor_overrides", {}),
            to_transition=policy_action_to_transition,
            to_output=transition_to_policy_action,
+            revision=pretrained_revision,
        )
        _reconnect_relative_absolute_steps(preprocessor, postprocessor)
        return preprocessor, postprocessor
@@ -557,6 +560,7 @@ def make_policy(
        # Load a pretrained policy and override the config if needed (for example, if there are inference-time
        # hyperparameters that we want to vary).
        kwargs["pretrained_name_or_path"] = cfg.pretrained_path
+        kwargs["revision"] = cfg.pretrained_revision
        policy = policy_cls.from_pretrained(**kwargs)
    elif cfg.pretrained_path and cfg.use_peft:
        # Load a pretrained PEFT model on top of the policy. The pretrained path points to the folder/repo
@@ -124,6 +124,7 @@ def make_reward_model(cfg: RewardModelConfig, **kwargs) -> PreTrainedRewardModel

    if cfg.pretrained_path:
        kwargs["pretrained_name_or_path"] = cfg.pretrained_path
+        kwargs["revision"] = cfg.pretrained_revision
        reward_model = reward_cls.from_pretrained(**kwargs)
    else:
        reward_model = reward_cls(**kwargs)
@@ -72,8 +72,9 @@ from termcolor import colored
 from torch import Tensor, nn
 from tqdm import trange

-from lerobot.configs import parser
+from lerobot.configs import FeatureType, parser
 from lerobot.configs.eval import EvalPipelineConfig
+from lerobot.datasets.lerobot_dataset import LeRobotDataset
 from lerobot.envs import (
    check_env_attributes_and_types,
    close_envs,
@@ -84,7 +85,7 @@ from lerobot.envs import (
 from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
 from lerobot.processor import PolicyProcessorPipeline
 from lerobot.types import PolicyAction
-from lerobot.utils.constants import ACTION, DONE, OBS_STR, REWARD
+from lerobot.utils.constants import ACTION, DONE, OBS_IMAGE, OBS_IMAGES, OBS_STR, REWARD
 from lerobot.utils.device_utils import get_safe_torch_device
 from lerobot.utils.import_utils import register_third_party_plugins
 from lerobot.utils.io_utils import write_video
@@ -95,6 +96,81 @@ from lerobot.utils.utils import (
 )


+def _env_features_to_dataset_features(env_features: dict, raw_obs: dict | None = None) -> dict:
+    """Convert EnvConfig.features (PolicyFeature objects) to the plain dict format for LeRobotDataset.create().
+
+    If raw_obs is provided, visual feature shapes are inferred from the actual observation
+    to avoid mismatches between the env config and the real observation resolution.
+    """
+    features = {}
+    for key, ft in env_features.items():
+        if ft.type is FeatureType.VISUAL:
+            shape = tuple(ft.shape)
+            if raw_obs is not None and key in raw_obs and isinstance(raw_obs[key], np.ndarray):
+                shape = raw_obs[key].shape[1:]  # strip batch dim
+            elif raw_obs is not None and "pixels" in raw_obs:
+                pixels = raw_obs["pixels"]
+                if isinstance(pixels, dict):
+                    for cam_name, img in pixels.items():
+                        if key == f"{OBS_IMAGES}.{cam_name}" or key == cam_name:
+                            shape = img.shape[1:]  # strip batch dim
+                elif key in ("pixels", OBS_IMAGE):
+                    shape = pixels.shape[1:]  # strip batch dim
+            features[key] = {"dtype": "video", "shape": shape, "names": ["height", "width", "channel"]}
+        else:
+            shape = tuple(ft.shape)
+            if raw_obs is not None and key in raw_obs and isinstance(raw_obs[key], np.ndarray):
+                shape = raw_obs[key].shape[1:]  # strip batch dim
+            features[key] = {"dtype": "float32", "shape": shape, "names": None}
+    features["next.reward"] = {"dtype": "float32", "shape": (1,), "names": None}
+    features["next.success"] = {"dtype": "bool", "shape": (1,), "names": None}
+    features["next.done"] = {"dtype": "bool", "shape": (1,), "names": None}
+    return features
+
+
+def _build_raw_frame(
+    raw_obs: dict,
+    env_idx: int,
+    action: np.ndarray,
+    reward: float,
+    success: bool,
+    done: bool,
+    task: str,
+    env_features: dict,
+) -> dict:
+    """Build a dataset frame from raw env observations for one env index.
+
+    Keys in the frame match the keys in env_features so they align with the
+    dataset schema created by _env_features_to_dataset_features().
+    """
+    frame: dict[str, Any] = {}
+    for key in env_features:
+        if key == ACTION:
+            continue
+        if "pixels" in raw_obs and isinstance(raw_obs["pixels"], dict):
+            for cam_name, img in raw_obs["pixels"].items():
+                candidate = f"{OBS_IMAGES}.{cam_name}"
+                if candidate == key:
+                    frame[key] = img[env_idx]
+            if key in frame:
+                continue
+        if "pixels" in raw_obs and not isinstance(raw_obs["pixels"], dict) and key in ("pixels", OBS_IMAGE):
+            frame[key] = raw_obs["pixels"][env_idx]
+            continue
+        raw_key = key
+        if raw_key in raw_obs and isinstance(raw_obs[raw_key], np.ndarray):
+            val = raw_obs[raw_key][env_idx]
+            if val.dtype == np.float64:
+                val = val.astype(np.float32)
+            frame[key] = val
+    frame[ACTION] = action
+    frame["next.reward"] = np.atleast_1d(np.float32(reward))
+    frame["next.success"] = np.atleast_1d(np.bool_(success))
+    frame["next.done"] = np.atleast_1d(np.bool_(done))
+    frame["task"] = task
+    return frame
+
+
 def rollout(
    env: gym.vector.VectorEnv,
    policy: PreTrainedPolicy,
@@ -105,6 +181,7 @@ def rollout(
    seeds: list[int] | None = None,
    return_observations: bool = False,
    render_callback: Callable[[gym.vector.VectorEnv], None] | None = None,
+    recording_dataset: Any | None = None,
 ) -> dict:
    """Run a batched policy rollout once through a batch of environments.

@@ -145,6 +222,14 @@ def rollout(
    if render_callback is not None:
        render_callback(env)

+    raw_observation = deepcopy(observation) if recording_dataset is not None else None
+    task_desc = ""
+    if recording_dataset is not None:
+        try:
+            task_desc = list(env.call("task_description"))[0]
+        except (AttributeError, NotImplementedError):
+            task_desc = ""
+
    all_observations = []
    all_actions = []
    all_rewards = []
@@ -217,6 +302,26 @@ def rollout(
        else:
            successes = [False] * env.num_envs

+        if recording_dataset is not None and raw_observation is not None:
+            prev_done = done.copy()
+            for env_idx in range(env.num_envs):
+                if prev_done[env_idx]:
+                    continue
+                frame = _build_raw_frame(
+                    raw_observation,
+                    env_idx,
+                    action_numpy[env_idx],
+                    reward[env_idx],
+                    successes[env_idx],
+                    bool(terminated[env_idx] | truncated[env_idx]),
+                    task_desc,
+                    recording_dataset.features,
+                )
+                recording_dataset.add_frame(frame)
+                if terminated[env_idx] or truncated[env_idx]:
+                    recording_dataset.save_episode()
+            raw_observation = deepcopy(observation)
+
        # Keep track of which environments are done so far.
        # Mark the episode as done if we reach the maximum step limit.
        # This ensures that the rollout always terminates cleanly at `max_steps`,
@@ -273,6 +378,7 @@ def eval_policy(
    videos_dir: Path | None = None,
    return_episode_data: bool = False,
    start_seed: int | None = None,
+    recording_dataset: Any | None = None,
 ) -> dict:
    """
    Args:
@@ -361,6 +467,7 @@ def eval_policy(
            seeds=list(seeds) if seeds else None,
            return_observations=return_episode_data,
            render_callback=render_frame if max_episodes_rendered > 0 else None,
+            recording_dataset=recording_dataset,
        )

        # Figure out where in each rollout sequence the first done condition was encountered (results after
@@ -563,6 +670,10 @@ def eval_main(cfg: EvalPipelineConfig):
    # Create environment-specific preprocessor and postprocessor (e.g., for LIBERO environments)
    env_preprocessor, env_postprocessor = make_env_pre_post_processors(env_cfg=cfg.env, policy_cfg=cfg.policy)

+    recording_dir = Path(cfg.output_dir) / "recordings" if cfg.eval.recording else None
+    max_episodes_rendered = 0 if cfg.eval.recording else 10
+    videos_dir = None if cfg.eval.recording else Path(cfg.output_dir) / "videos"
+
    with torch.no_grad(), torch.autocast(device_type=device.type) if cfg.policy.use_amp else nullcontext():
        info = eval_policy_all(
            envs=envs,
@@ -572,10 +683,13 @@ def eval_main(cfg: EvalPipelineConfig):
            preprocessor=preprocessor,
            postprocessor=postprocessor,
            n_episodes=cfg.eval.n_episodes,
-            max_episodes_rendered=10,
-            videos_dir=Path(cfg.output_dir) / "videos",
+            max_episodes_rendered=max_episodes_rendered,
+            videos_dir=videos_dir,
+            return_episode_data=False,
            start_seed=cfg.seed,
            max_parallel_tasks=cfg.env.max_parallel_tasks,
+            recording_dir=recording_dir,
+            env_features=cfg.env.features if cfg.eval.recording else None,
        )
        print("Overall Aggregated Metrics:")
        print(info["overall"])
@@ -618,6 +732,7 @@ def eval_one(
    videos_dir: Path | None,
    return_episode_data: bool,
    start_seed: int | None,
+    recording_dataset: Any | None = None,
 ) -> TaskMetrics:
    """Evaluates one task_id of one suite using the provided vec env."""

@@ -635,6 +750,7 @@ def eval_one(
        videos_dir=task_videos_dir,
        return_episode_data=return_episode_data,
        start_seed=start_seed,
+        recording_dataset=recording_dataset,
    )

    per_episode = task_result["per_episode"]
@@ -661,6 +777,8 @@ def run_one(
    videos_dir: Path | None,
    return_episode_data: bool,
    start_seed: int | None,
+    recording_dir: Path | None = None,
+    env_features: dict | None = None,
 ):
    """
    Run eval_one for a single (task_group, task_id, env).
@@ -672,21 +790,39 @@ def run_one(
        task_videos_dir = videos_dir / f"{task_group}_{task_id}"
        task_videos_dir.mkdir(parents=True, exist_ok=True)

-    # Call the existing eval_one (assumed to return TaskMetrics-like dict)
-    metrics = eval_one(
-        env,
-        policy=policy,
-        env_preprocessor=env_preprocessor,
-        env_postprocessor=env_postprocessor,
-        preprocessor=preprocessor,
-        postprocessor=postprocessor,
-        n_episodes=n_episodes,
-        max_episodes_rendered=max_episodes_rendered,
-        videos_dir=task_videos_dir,
-        return_episode_data=return_episode_data,
-        start_seed=start_seed,
-    )
-    # ensure we always provide video_paths key to simplify accumulation
+    recording_dataset = None
+    if recording_dir is not None and env_features is not None:
+        task_recording_dir = recording_dir / f"{task_group}_{task_id}"
+        fps = env.unwrapped.metadata.get("render_fps", 30)
+        sample_obs, _ = env.reset()
+        features = _env_features_to_dataset_features(env_features, raw_obs=sample_obs)
+        recording_dataset = LeRobotDataset.create(
+            repo_id=f"eval_{task_group}_{task_id}",
+            fps=fps,
+            features=features,
+            root=str(task_recording_dir),
+            use_videos=True,
+        )
+
+    try:
+        metrics = eval_one(
+            env,
+            policy=policy,
+            env_preprocessor=env_preprocessor,
+            env_postprocessor=env_postprocessor,
+            preprocessor=preprocessor,
+            postprocessor=postprocessor,
+            n_episodes=n_episodes,
+            max_episodes_rendered=max_episodes_rendered,
+            videos_dir=task_videos_dir,
+            return_episode_data=return_episode_data,
+            start_seed=start_seed,
+            recording_dataset=recording_dataset,
+        )
+    finally:
+        if recording_dataset is not None:
+            recording_dataset.finalize()
+
    if max_episodes_rendered > 0:
        metrics.setdefault("video_paths", [])
    return task_group, task_id, metrics
@@ -702,6 +838,8 @@ def eval_policy_all(
    n_episodes: int,
    *,
    max_episodes_rendered: int = 0,
+    recording_dir: Path | None = None,
+    env_features: dict | None = None,
    videos_dir: Path | None = None,
    return_episode_data: bool = False,
    start_seed: int | None = None,
@@ -761,6 +899,8 @@ def eval_policy_all(
        videos_dir=videos_dir,
        return_episode_data=return_episode_data,
        start_seed=start_seed,
+        recording_dir=recording_dir,
+        env_features=env_features,
    )

    if max_parallel_tasks <= 1:
@@ -45,7 +45,8 @@ from lerobot.common.train_utils import (
 from lerobot.common.wandb_utils import WandBLogger
 from lerobot.configs import parser
 from lerobot.configs.train import TrainPipelineConfig
-from lerobot.datasets import EpisodeAwareSampler, compute_sampler_state, make_dataset
+from lerobot.datasets import EpisodeAwareSampler, compute_sampler_state
+from lerobot.datasets.factory import make_train_eval_datasets
 from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
 from lerobot.optim.factory import make_optimizer_and_scheduler
 from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
@@ -244,19 +245,19 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    # LeRobotDataset skips its snapshot_download when try_load() succeeds, so no rank re-downloads.
    if is_main_process:
        logging.info("Creating dataset")
-        dataset = make_dataset(cfg)
+        dataset, eval_dataset = make_train_eval_datasets(cfg)

    accelerator.wait_for_everyone()

    # Other ranks read from the shared copy populated by the main process.
    if not is_main_process:
-        dataset = make_dataset(cfg)
+        dataset, eval_dataset = make_train_eval_datasets(cfg)

    # Create environment used for evaluating checkpoints during training on simulation data.
    # On real-world data, no need to create an environment as evaluations are done outside train.py,
    # using the eval.py instead, with gym_dora environment and dora-rs.
    eval_env = None
-    if cfg.eval_freq > 0 and cfg.env is not None and is_main_process:
+    if cfg.env_eval_freq > 0 and cfg.env is not None and is_main_process:
        logging.info("Creating env")
        eval_env = make_env(cfg.env, n_envs=cfg.eval.batch_size, use_async_envs=cfg.eval.use_async_envs)

@@ -345,6 +346,7 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        preprocessor, postprocessor = make_pre_post_processors(
            policy_cfg=cfg.policy,
            pretrained_path=processor_pretrained_path,
+            pretrained_revision=getattr(cfg.policy, "pretrained_revision", None),
            **processor_kwargs,
        )

@@ -455,6 +457,31 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        persistent_workers=cfg.persistent_workers and cfg.num_workers > 0,
    )

+    # Build eval dataloader if a held-out split exists
+    eval_dataloader = None
+    if eval_dataset is not None:
+        eval_ds = eval_dataset
+        if cfg.max_eval_samples > 0 and hasattr(eval_dataset, "hf_dataset"):
+            task_indices = eval_dataset.hf_dataset["task_index"]
+            unique_tasks = sorted(set(task_indices))
+            per_task = max(1, cfg.max_eval_samples // len(unique_tasks))
+            selected: list[int] = []
+            for t in unique_tasks:
+                frames = [i for i, ti in enumerate(task_indices) if ti == t][:per_task]
+                selected.extend(frames)
+            eval_ds = torch.utils.data.Subset(eval_dataset, selected)
+
+        eval_collate_fn = lerobot_collate_fn if dataset.meta.has_language_columns else None
+        eval_dataloader = torch.utils.data.DataLoader(
+            eval_ds,
+            batch_size=cfg.batch_size,
+            shuffle=False,
+            num_workers=cfg.num_workers,
+            pin_memory=device.type == "cuda",
+            drop_last=False,
+            collate_fn=eval_collate_fn,
+        )
+
    # Prepare everything with accelerator
    accelerator.wait_for_everyone()
    policy, optimizer, dataloader, lr_scheduler = accelerator.prepare(
@@ -534,7 +561,8 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        train_tracker.step()
        is_log_step = cfg.log_freq > 0 and step % cfg.log_freq == 0
        is_saving_step = step % cfg.save_freq == 0 or step == cfg.steps
-        is_eval_step = cfg.eval_freq > 0 and step % cfg.eval_freq == 0
+        is_env_eval_step = cfg.env_eval_freq > 0 and step % cfg.env_eval_freq == 0
+        is_eval_step = cfg.eval_steps > 0 and eval_dataloader is not None and step % cfg.eval_steps == 0

        if is_log_step:
            # Collective reduce must run on every rank, before the main-process gate below.
@@ -557,6 +585,27 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
                    wandb_logger.log_dict(wandb_log_dict, step)
            train_tracker.reset_averages()

+        if is_eval_step:
+            policy.eval()
+            eval_loss_sum = 0.0
+            n_eval_batches = 0
+            with torch.no_grad(), accelerator.autocast():
+                for eval_batch in eval_dataloader:
+                    for cam_key in dataset.meta.camera_keys:
+                        if cam_key in eval_batch and eval_batch[cam_key].dtype == torch.uint8:
+                            eval_batch[cam_key] = eval_batch[cam_key].to(dtype=torch.float32) / 255.0
+                    eval_batch = preprocessor(eval_batch)
+                    loss, _ = policy.forward(eval_batch)
+                    eval_loss_sum += loss.item()
+                    n_eval_batches += 1
+            eval_loss = eval_loss_sum / max(n_eval_batches, 1)
+            policy.train()
+
+            if is_main_process:
+                logging.info(f"step {step}: eval_loss={eval_loss:.4f}")
+                if wandb_logger:
+                    wandb_logger.log_dict({"eval_loss": eval_loss}, step=step, mode="eval")
+
        if cfg.save_checkpoint and is_saving_step:
            if is_main_process:
                logging.info(f"Checkpoint policy after step {step}")
@@ -579,7 +628,7 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):

            accelerator.wait_for_everyone()

-        if cfg.env and is_eval_step:
+        if cfg.env and is_env_eval_step:
            if is_main_process:
                step_id = get_step_identifier(step, cfg.steps)
                logging.info(f"Eval policy at step {step}")
@@ -216,9 +216,15 @@ def register_third_party_plugins() -> None:

    This function uses `importlib.metadata` to find packages installed in the environment
    (including editable installs) starting with 'lerobot_robot_', 'lerobot_camera_',
-    'lerobot_teleoperator_', or 'lerobot_policy_' and imports them.
+    'lerobot_teleoperator_', 'lerobot_policy_', or 'lerobot_env_' and imports them.
    """
-    prefixes = ("lerobot_robot_", "lerobot_camera_", "lerobot_teleoperator_", "lerobot_policy_")
+    prefixes = (
+        "lerobot_robot_",
+        "lerobot_camera_",
+        "lerobot_teleoperator_",
+        "lerobot_policy_",
+        "lerobot_env_",
+    )
    imported: list[str] = []
    failed: list[str] = []

@@ -2370,32 +2370,14 @@ def test_aggregate_images_when_use_videos_false():
    out = aggregate_pipeline_dataset_features(
        pipeline=rp,
        initial_features={PipelineFeatureType.ACTION: {}, PipelineFeatureType.OBSERVATION: initial},
-        use_videos=False,  # images kept, stored as "image" dtype
+        use_videos=False,  # expect "image" dtype
        patterns=None,
    )

    key = f"{OBS_IMAGES}.back"
    key_front = f"{OBS_IMAGES}.front"
-    assert key in out
-    assert key_front in out
-    assert out[key]["dtype"] == "image"
-    assert out[key_front]["dtype"] == "image"
-    assert out[key]["shape"] == initial["back"]
-
-
-def test_aggregate_images_excluded():
-    rp = DataProcessorPipeline([AddObservationStateFeatures(add_front_image=True)])
-    initial = {"back": (480, 640, 3)}
-
-    out = aggregate_pipeline_dataset_features(
-        pipeline=rp,
-        initial_features={PipelineFeatureType.ACTION: {}, PipelineFeatureType.OBSERVATION: initial},
-        exclude_images=True,
-        patterns=None,
-    )
-
-    assert f"{OBS_IMAGES}.back" not in out
-    assert f"{OBS_IMAGES}.front" not in out
+    assert key not in out
+    assert key_front not in out


 def test_aggregate_images_when_use_videos_true():
@@ -134,7 +134,7 @@ class TestMultiGPUTraining:
                f"--output_dir={output_dir}",
                "--batch_size=4",
                "--steps=10",
-                "--eval_freq=-1",
+                "--env_eval_freq=-1",
                "--log_freq=5",
                "--save_freq=10",
                "--seed=42",
@@ -177,7 +177,7 @@ class TestMultiGPUTraining:
                f"--output_dir={output_dir}",
                "--batch_size=4",
                "--steps=20",
-                "--eval_freq=-1",
+                "--env_eval_freq=-1",
                "--log_freq=5",
                "--save_freq=10",
                "--seed=42",
Author	SHA1	Message	Date
Khalil Meftah	e069557228	fix(eval): use FeatureType enum comparison instead of string value	2026-06-15 18:50:24 +02:00
Khalil Meftah	58cf6c8710	fix(eval): infer recording features from actual env observations	2026-06-15 18:47:16 +02:00
Khalil Meftah	36470d059e	fix(eval): align raw frame keys with dataset schema and fix numpy types	2026-06-15 18:38:12 +02:00
Khalil Meftah	040a1df9d6	fix(datasets): remap absolute indices in __getitem__ for filtered datasets	2026-06-15 18:26:44 +02:00
Khalil Meftah	87ae050b28	Merge branch 'feat/eval-dataset-recording' into test/gs-gym-integration	2026-06-15 17:03:36 +02:00
Khalil Meftah	3bec437d83	Merge branch 'feat/env-plugin-discovery' into test/gs-gym-integration	2026-06-15 17:03:22 +02:00
Khalil Meftah	97f53732bf	Merge branch 'fix/logging-stats-robustness' into test/gs-gym-integration	2026-06-15 17:03:03 +02:00
Khalil Meftah	b31837ffeb	Merge branch 'feat/pretrained-revision' into test/gs-gym-integration	2026-06-15 17:02:51 +02:00
Khalil Meftah	fd822287e4	Merge branch 'fix/offline-policy-inference' into test/gs-gym-integration	2026-06-15 17:02:37 +02:00
Khalil Meftah	7e2d7024c4	Merge branch 'feat/offline-validation' into test/gs-gym-integration	2026-06-15 17:02:03 +02:00
Khalil Meftah	240393d238	feat(eval): record eval rollouts as raw LeRobot datasets - Record raw env observations inline during rollout(), before preprocess_observation() transforms them. Uses LeRobotDataset.create() with add_frame()/save_episode(). - Supports vectorized envs: each env in the batch records independently, with save_episode() called per env on termination. Each task gets its own dataset under output_dir/recordings/{task_group}_{task_id}/. Enabled via --eval.recording=true; disabled by default.	2026-06-15 16:12:25 +02:00
Khalil Meftah	6407a244c0	feat(envs): add generic observation passthrough - Add generic observation passthrough in preprocess_observation() for unhandled ndarray/tensor keys, replacing the pattern of adding per-env hardcoded key handlers. Extra keys are forwarded as observation.<key> and can be shaped by env-specific ProcessorSteps via get_env_processors().	2026-06-15 14:17:59 +02:00
Khalil Meftah	0511c12b8f	feat(envs): add env plugin discovery - Add 'lerobot_env_' to third-party plugin discovery prefixes, completing the plugin system for all component types (robots, cameras, teleoperators, policies, and now environments). External packages named lerobot_env_* can self-register EnvConfig subclasses on import, enabling --env.type= resolution without lerobot code changes.	2026-06-15 14:13:12 +02:00
Khalil Meftah	0efa3dc874	fix(stats): handle scalar stats robustly - Wrap cast_stats_to_numpy with np.atleast_1d to prevent 0-d arrays from scalar stats causing shape mismatches downstream.	2026-06-15 12:28:18 +02:00
Khalil Meftah	949f4fcbe9	fix(logging): batch wandb metrics - Batch all metrics into a single wandb.log() call instead of one per key, reducing API overhead. - Add support for list-valued metrics by expanding them to indexed keys (e.g. metric_0, metric_1).	2026-06-15 12:25:06 +02:00
Khalil Meftah	0d1d5e0a86	feat(hub): add pretrained_revision to pin Hub model versions - Add pretrained_revision field to PreTrainedConfig (policies) and RewardModelConfig (reward models), and thread it through make_policy(), make_pre_post_processors(), and make_reward_model() so that weights and processor configs can be loaded from a specific Hub commit, branch, or tag. Defaults to None (latest version, preserving current behavior). Dataset and env hub loading already supported revision pinning.	2026-06-15 11:58:57 +02:00
Khalil Meftah	84abfe5c60	fix(policies): support offline batch inference for ACT and Diffusion - Guard ACT's KL divergence computation against None latent params to prevent crashes during eval when use_vae is set but the forward path returns no VAE outputs. - Add offline batch fallback to Diffusion's predict_action_chunk() so it works with dataloader batches (empty queues) in addition to the existing online rollout path (populated queues). This enables batched action prediction for offline evaluation.	2026-06-15 11:35:06 +02:00
Khalil Meftah	2201401c99	feat(training): add inline offline validation with train/eval split - Add eval_split config for balanced per-task holdout - Add eval_steps for periodic inline eval loss computation - Add max_eval_samples to cap eval cost	2026-06-14 21:29:54 +02:00
Khalil Meftah	64773e7b22	refactor(training): rename eval_freq to env_eval_freq - Rename eval_freq to env_eval_freq to distinguish sim environment evaluation from offline loss evaluation.	2026-06-14 14:19:25 +02:00