refactor(eval): remove shape inference and shallow copy helpers

Merge branch 'main' into feat/eval-dataset-recording
refactor(eval): per-env datasets recording, no double reset
2026-06-17 00:07:03 +00:00 · 2026-06-16 22:13:23 +02:00 · 2026-06-16 21:45:06 +02:00 · 2026-06-16 21:35:05 +02:00 · 2026-06-16 17:58:59 +02:00 · 2026-06-16 15:22:50 +02:00
37 changed files with 331 additions and 323 deletions
@@ -167,9 +167,9 @@ jobs:

      # ── LIBERO TRAIN+EVAL SMOKE ──────────────────────────────────────────────
      # Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only) then
-      # immediately runs eval inside the training loop (env_eval_freq=1, 1 episode).
+      # immediately runs eval inside the training loop (eval_freq=1, 1 episode).
      # Tests the full train→eval-within-training pipeline end-to-end.
-      - name: Run Libero train+eval smoke (1 step, env_eval_freq=1)
+      - name: Run Libero train+eval smoke (1 step, eval_freq=1)
        if: env.HF_USER_TOKEN != ''
        run: |
          docker run --name libero-train-smoke --gpus all \
@@ -196,7 +196,7 @@ jobs:
                --output_dir=/tmp/train-smoke \
                --steps=1 \
                --batch_size=1 \
-                --env_eval_freq=1 \
+                --eval_freq=1 \
                --eval.n_episodes=1 \
                --eval.batch_size=1 \
                --eval.use_async_envs=false \
@@ -58,7 +58,7 @@ test-act-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=4 \
-		--env_eval_freq=2 \
+		--eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_freq=2 \
@@ -96,7 +96,7 @@ test-diffusion-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=2 \
-		--env_eval_freq=2 \
+		--eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_checkpoint=true \
@@ -126,7 +126,7 @@ test-tdmpc-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=2 \
-		--env_eval_freq=2 \
+		--eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_checkpoint=true \
@@ -161,7 +161,7 @@ test-smolvla-ete-train:
 		--dataset.episodes="[0]" \
 		--batch_size=2 \
 		--steps=4 \
-		--env_eval_freq=2 \
+		--eval_freq=2 \
 		--eval.n_episodes=1 \
 		--eval.batch_size=1 \
 		--save_freq=2 \
@@ -719,7 +719,7 @@ Example configuration for training the [reward classifier](https://huggingface.c
  "num_workers": 4,
  "steps": 5000,
  "log_freq": 10,
-  "env_eval_freq": 1000,
+  "eval_freq": 1000,
  "save_freq": 1000,
  "save_checkpoint": true,
  "seed": 2,
@@ -143,7 +143,7 @@ lerobot-train \
  --batch_size=4 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
-  --env_eval_freq=1000
+  --eval_freq=1000
 ```

 ## Reproducing published results
@@ -173,7 +173,7 @@ lerobot-train \
    --batch_size=4 \
    --eval.batch_size=1 \
    --eval.n_episodes=1 \
-    --env_eval_freq=1000
+    --eval_freq=1000
 ```

 ## Relationship to LIBERO
@@ -120,11 +120,11 @@ lerobot-train \
  --batch_size=4 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
-  --env_eval_freq=1000
+  --eval_freq=1000
 ```

 ## Practical tips

 - Use the one-hot task conditioning for multi-task training (MT10/MT50 conventions) so policies have explicit task context.
 - Inspect the dataset task descriptions and the `info["is_success"]` keys when writing post-processing or logging so your success metrics line up with the benchmark.
- Adjust `batch_size`, `steps`, and `env_eval_freq` to match your compute budget.
+- Adjust `batch_size`, `steps`, and `eval_freq` to match your compute budget.
@@ -103,7 +103,7 @@ accelerate launch \
  --batch_size=32 \
  --num_workers=4 \
  --log_freq=20 \
-  --env_eval_freq=-1 \
+  --eval_freq=-1 \
  --save_checkpoint=true \
  --save_freq=2000
 ```
@@ -142,7 +142,7 @@ accelerate launch \
  --batch_size=32 \
  --num_workers=4 \
  --log_freq=20 \
-  --env_eval_freq=-1 \
+  --eval_freq=-1 \
  --save_checkpoint=true \
  --save_freq=2000
 ```
@@ -314,7 +314,7 @@ lerobot-train \
  --steps=30000 \
  --save_freq=1000 \
  --log_freq=100 \
-  --env_eval_freq=1000 \
+  --eval_freq=1000 \
  --policy.type=multi_task_dit \
  --policy.device=cuda \
  --policy.horizon=32 \
@@ -166,7 +166,7 @@ lerobot-train \
  --output_dir=./outputs/smolvla_robocasa_CloseFridge \
  --steps=100000 \
  --batch_size=4 \
-  --env_eval_freq=5000 \
+  --eval_freq=5000 \
  --eval.batch_size=1 \
  --eval.n_episodes=5 \
  --save_freq=10000
@@ -165,7 +165,7 @@ lerobot-train \
  --output_dir=./outputs/smolvla_vlabench_primitive \
  --steps=100000 \
  --batch_size=4 \
-  --env_eval_freq=5000 \
+  --eval_freq=5000 \
  --eval.batch_size=1 \
  --eval.n_episodes=1 \
  --save_freq=10000
@@ -54,6 +54,7 @@ from typing import Any
 import pyarrow as pa
 import pyarrow.parquet as pq

+from lerobot.datasets.io_utils import write_table_one_row_group_per_episode
 from lerobot.datasets.language import (
    EVENT_ONLY_STYLES,
    LANGUAGE_EVENTS,
@@ -274,12 +275,11 @@ class LanguageColumnsWriter:
        new_table = self._materialize_table(
            table, per_row_persistent, per_row_events, drop_old=self.drop_existing_subtask_index
        )
-        # Atomic replace: write to a sibling tmp path and rename so a crash
-        # mid-write can't leave a half-written shard that ``pq.read_table``
-        # would then fail to open. ``Path.replace`` is atomic on POSIX +
-        # Windows when source and target sit on the same filesystem.
+        # Re-emit one row group per episode (a bulk pq.write_table would collapse
+        # them into one). Write to a sibling tmp path and atomically rename so a
+        # crash mid-write can't leave a half-written shard.
        tmp_path = path.with_suffix(path.suffix + ".tmp")
-        pq.write_table(new_table, tmp_path)
+        write_table_one_row_group_per_episode(new_table, tmp_path)
        tmp_path.replace(path)

    def _materialize_table(
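
The write-then-rename step kept by the hunk above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `atomic_write_bytes` is a hypothetical helper, and plain bytes stand in for the parquet writer.

```python
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Write payload to a sibling tmp file, then rename it over path (hypothetical helper)."""
    # Same naming scheme as the diff: "shard.parquet" -> "shard.parquet.tmp".
    tmp_path = path.with_suffix(path.suffix + ".tmp")
    tmp_path.write_bytes(payload)
    # Path.replace overwrites the target; readers see the old or the new file, never a partial one.
    tmp_path.replace(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    shard = Path(tmp_dir) / "file-000.parquet"
    shard.write_bytes(b"old")
    atomic_write_bytes(shard, b"new")
    final_bytes = shard.read_bytes()
    leftover_tmp = (Path(tmp_dir) / "file-000.parquet.tmp").exists()
```

After the rename only the final shard remains; the temporary sibling is gone because it was moved, not copied.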
@@ -180,32 +180,24 @@ class WandBLogger:
                self._wandb_custom_step_key.add(new_custom_key)
                self._wandb.define_metric(new_custom_key, hidden=True)

-        batch_data = {}
        for k, v in d.items():
-            # Skip the custom step key here, it's added to the batch below.
-            if custom_step_key is not None and k == custom_step_key:
-                continue
-
-            if isinstance(v, list):
-                for i, elem in enumerate(v):
-                    if isinstance(elem, (int | float)):
-                        batch_data[f"{mode}/{k}_{i}"] = elem
-                continue
-
            if not isinstance(v, (int | float | str)):
                logging.warning(
                    f'WandB logging of key "{k}" was ignored as its type "{type(v)}" is not handled by this wrapper.'
                )
                continue

-            batch_data[f"{mode}/{k}"] = v
+            # Do not log the custom step key itself.
+            if self._wandb_custom_step_key is not None and k in self._wandb_custom_step_key:
+                continue

-        if batch_data:
            if custom_step_key is not None:
-                batch_data[f"{mode}/{custom_step_key}"] = d[custom_step_key]
-                self._wandb.log(batch_data)
-            else:
-                self._wandb.log(data=batch_data, step=step)
+                value_custom_step = d[custom_step_key]
+                data = {f"{mode}/{k}": v, f"{mode}/{custom_step_key}": value_custom_step}
+                self._wandb.log(data)
+                continue
+
+            self._wandb.log(data={f"{mode}/{k}": v}, step=step)

    def log_video(self, video_path: str, step: int, mode: str = "train"):
        if mode not in {"train", "eval"}:
@@ -39,8 +39,6 @@ class DatasetConfig:
    # This reduces memory and speeds up DataLoader IPC. The training pipeline handles the conversion.
    return_uint8: bool = False
    streaming: bool = False
-    # Fraction of episodes held out per task for offline evaluation (0.0 = disabled).
-    eval_split: float = 0.0

    def __post_init__(self) -> None:
        if self.episodes is not None:
@@ -79,8 +79,6 @@ class PreTrainedConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC):  # type: igno
    # Either the repo ID of a model hosted on the Hub or a path to a directory containing weights
    # saved using `Policy.save_pretrained`. If not provided, the policy is initialized from scratch.
    pretrained_path: Path | None = None
-    # Optional Hub revision (commit hash, branch, or tag) to pin the pretrained model version.
-    pretrained_revision: str | None = None

    def __post_init__(self) -> None:
        if not self.device or not is_torch_device_available(self.device):
@@ -56,8 +56,6 @@ class RewardModelConfig(draccus.ChoiceRegistry, HubMixin, abc.ABC):
    device: str | None = None

    pretrained_path: str | None = None
-    # Optional Hub revision (commit hash, branch, or tag) to pin the pretrained reward model version.
-    pretrained_revision: str | None = None

    push_to_hub: bool = False
    repo_id: str | None = None
@@ -100,13 +100,8 @@ class TrainPipelineConfig(HubMixin):
    prefetch_factor: int = 4
    persistent_workers: bool = True
    steps: int = 100_000
-    # Run policy in the simulation environment every N steps to measure reward/success (0 = disabled).
-    env_eval_freq: int = 20_000
+    eval_freq: int = 20_000
    log_freq: int = 200
-    # Compute eval loss on held-out episodes every N steps (0 = disabled). Requires eval_split > 0.
-    eval_steps: int = 0
-    # Cap on total eval samples, split uniformly across tasks (0 = use all held-out data).
-    max_eval_samples: int = 0
    tolerance_s: float = 1e-4
    save_checkpoint: bool = True
    # Checkpoint is saved every `save_freq` training iterations and after the last training step.
@@ -35,7 +35,7 @@ from .dataset_tools import (
    remove_feature,
    split_dataset,
 )
-from .factory import make_dataset, make_train_eval_datasets, resolve_delta_timestamps
+from .factory import make_dataset, resolve_delta_timestamps
 from .image_writer import safe_stop_image_writer
 from .io_utils import load_episodes, write_stats
 from .language import (
@@ -89,7 +89,6 @@ __all__ = [
    "get_feature_stats",
    "load_episodes",
    "make_dataset",
-    "make_train_eval_datasets",
    "merge_datasets",
    "modify_features",
    "modify_tasks",
@@ -32,6 +32,7 @@ from .feature_utils import features_equal_for_merge, get_hf_features_from_featur
 from .io_utils import (
    get_file_size_in_mb,
    get_parquet_file_size_in_mb,
+    to_parquet_one_row_group_per_episode,
    to_parquet_with_hf_images,
    write_info,
    write_stats,
@@ -551,6 +552,7 @@ def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_si
            aggr_root=dst_meta.root,
            hf_features=hf_features,
            concatenate=concatenate_data,
+            one_row_group_per_episode=True,
        )

        # Record the mapping from source to actual destination
@@ -628,6 +630,7 @@ def append_or_create_parquet_file(
    aggr_root: Path = None,
    hf_features: datasets.Features | None = None,
    concatenate: bool = True,
+    one_row_group_per_episode: bool = False,
 ) -> tuple[dict[str, int], tuple[int, int]]:
    """Appends data to an existing parquet file or creates a new one based on size constraints.

@@ -645,6 +648,8 @@ def append_or_create_parquet_file(
        aggr_root: Root path for the aggregated dataset.
        hf_features: Optional HuggingFace Features schema for proper image typing.
        concatenate: When False, always rotate to a new file instead of appending to the current one.
+        one_row_group_per_episode: True for DATA parquet (emit one row group per episode); False for
+            the episodes-metadata parquet (already one row per episode).

    Returns:
        tuple: (updated_idx, (dst_chunk, dst_file)) where updated_idx is the index dict
@@ -657,6 +662,8 @@ def append_or_create_parquet_file(
        dst_path.parent.mkdir(parents=True, exist_ok=True)
        if contains_images:
            to_parquet_with_hf_images(df, dst_path, features=hf_features)
+        elif one_row_group_per_episode:
+            to_parquet_one_row_group_per_episode(df, dst_path)
        else:
            df.to_parquet(dst_path)
        return idx, (dst_chunk, dst_file)
@@ -683,6 +690,8 @@ def append_or_create_parquet_file(

    if contains_images:
        to_parquet_with_hf_images(final_df, target_path, features=hf_features)
+    elif one_row_group_per_episode:
+        to_parquet_one_row_group_per_episode(final_df, target_path)
    else:
        final_df.to_parquet(target_path)

@@ -15,6 +15,7 @@
 # limitations under the License.
 import contextlib
 from collections.abc import Callable
+from copy import deepcopy
 from pathlib import Path

 import numpy as np
@@ -709,7 +710,7 @@ class LeRobotDatasetMetadata:

        obj.root.mkdir(parents=True, exist_ok=False)

-        features = {**features, **DEFAULT_FEATURES}
+        features = {**deepcopy(features), **DEFAULT_FEATURES}
        _validate_feature_names(features)

        obj.tasks = None
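
One plausible reason for the `deepcopy` above: `{**features, **DEFAULT_FEATURES}` builds a new outer dict, but the nested per-feature dicts are still shared with the caller. The sketch below shows that aliasing with made-up feature specs; the dict contents are assumptions for illustration, not the library's real defaults.

```python
from copy import deepcopy

# Hypothetical stand-ins for the library defaults and the caller's features.
DEFAULT_FEATURES = {"episode_index": {"dtype": "int64", "shape": (1,)}}


def make_caller_features() -> dict:
    return {"observation.image": {"dtype": "video", "shape": (96, 96, 3)}}


# Shallow merge: the inner dict is the same object as the caller's.
caller_features = make_caller_features()
shallow = {**caller_features, **DEFAULT_FEATURES}
shallow["observation.image"]["info"] = {"codec": "av1"}
shallow_leaked = "info" in caller_features["observation.image"]

# Deep copy first: later mutations stay local to the merged dict.
caller_features = make_caller_features()
deep = {**deepcopy(caller_features), **DEFAULT_FEATURES}
deep["observation.image"]["info"] = {"codec": "av1"}
deep_leaked = "info" in caller_features["observation.image"]
```

With the shallow merge the caller's dict gains an `info` entry it never asked for; with the deep copy it does not.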
@@ -27,6 +27,7 @@ import logging
 import shutil
 from collections.abc import Callable
 from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
+from copy import deepcopy
 from pathlib import Path

 import datasets
@@ -1101,7 +1102,9 @@ def _copy_episodes_metadata_and_stats(
    if dst_meta.video_keys and src_dataset.meta.video_keys:
        for key in dst_meta.video_keys:
            if key in src_dataset.meta.features:
-                dst_meta.info.features[key]["info"] = src_dataset.meta.info.features[key].get("info", {})
+                dst_meta.info.features[key]["info"] = deepcopy(
+                    src_dataset.meta.info.features[key].get("info", {})
+                )

    write_info(dst_meta.info, dst_meta.root)

@@ -14,7 +14,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import logging
-import math
 from pprint import pformat

 import torch
@@ -131,81 +130,3 @@ def make_dataset(cfg: TrainPipelineConfig) -> LeRobotDataset | MultiLeRobotDatas
                dataset.meta.stats[key][stats_type] = torch.tensor(stats, dtype=torch.float32)

    return dataset
-
-
-def make_train_eval_datasets(
-    cfg: TrainPipelineConfig,
-) -> tuple[LeRobotDataset | MultiLeRobotDataset, LeRobotDataset | None]:
-    """Create train and optional eval datasets by splitting episodes based on eval_split.
-
-    The last ceil(n_episodes * eval_split) episodes per task are held out for evaluation.
-    If eval_split == 0.0, returns (full_dataset, None).
-    """
-    full_dataset = make_dataset(cfg)
-
-    if cfg.dataset.eval_split == 0.0:
-        return full_dataset, None
-
-    base_episodes = (
-        full_dataset.episodes if full_dataset.episodes is not None else list(range(full_dataset.num_episodes))
-    )
-
-    episode_tasks = full_dataset.meta.episodes["tasks"]
-    task_to_episodes: dict[str, list[int]] = {}
-    for ep_idx in base_episodes:
-        task_key = episode_tasks[ep_idx][0] if episode_tasks[ep_idx] else ""
-        task_to_episodes.setdefault(task_key, []).append(ep_idx)
-
-    train_episodes, eval_episodes = [], []
-    for eps in task_to_episodes.values():
-        n_eval = math.ceil(len(eps) * cfg.dataset.eval_split)
-        train_episodes.extend(eps[: len(eps) - n_eval])
-        eval_episodes.extend(eps[len(eps) - n_eval :])
-
-    if not train_episodes:
-        raise ValueError(
-            f"eval_split={cfg.dataset.eval_split} leaves 0 training episodes from {len(base_episodes)} total."
-        )
-
-    logging.info(
-        f"Train/eval split: {len(train_episodes)} train, {len(eval_episodes)} eval "
-        f"(eval_split={cfg.dataset.eval_split}, {len(task_to_episodes)} tasks)"
-    )
-
-    delta_timestamps = resolve_delta_timestamps(cfg.trainable_config, full_dataset.meta)
-
-    train_image_transforms = (
-        ImageTransforms(cfg.dataset.image_transforms) if cfg.dataset.image_transforms.enable else None
-    )
-
-    train_dataset = LeRobotDataset(
-        cfg.dataset.repo_id,
-        root=cfg.dataset.root,
-        episodes=train_episodes,
-        delta_timestamps=delta_timestamps,
-        image_transforms=train_image_transforms,
-        revision=cfg.dataset.revision,
-        video_backend=cfg.dataset.video_backend,
-        return_uint8=True,
-        tolerance_s=cfg.tolerance_s,
-    )
-
-    eval_dataset = LeRobotDataset(
-        cfg.dataset.repo_id,
-        root=cfg.dataset.root,
-        episodes=eval_episodes,
-        delta_timestamps=delta_timestamps,
-        image_transforms=None,
-        revision=cfg.dataset.revision,
-        video_backend=cfg.dataset.video_backend,
-        return_uint8=True,
-        tolerance_s=cfg.tolerance_s,
-    )
-
-    if cfg.dataset.use_imagenet_stats:
-        for ds in (train_dataset, eval_dataset):
-            for key in ds.meta.camera_keys:
-                for stats_type, stats in IMAGENET_STATS.items():
-                    ds.meta.stats[key][stats_type] = torch.tensor(stats, dtype=torch.float32)
-
-    return train_dataset, eval_dataset
@@ -20,6 +20,7 @@ import datasets
 import numpy as np
 import pandas
 import pandas as pd
+import pyarrow as pa
 import pyarrow.dataset as pa_ds
 import pyarrow.parquet as pq
 import torch
@@ -153,7 +154,7 @@ def cast_stats_to_numpy(stats: dict) -> dict[str, dict[str, np.ndarray]]:
    Returns:
        dict: The statistics dictionary with values cast to numpy arrays.
    """
-    stats = {key: np.atleast_1d(np.array(value)) for key, value in flatten_dict(stats).items()}
+    stats = {key: np.array(value) for key, value in flatten_dict(stats).items()}
    return unflatten_dict(stats)


@@ -270,21 +271,49 @@ def hf_transform_to_torch(items_dict: dict[str, list[Any]]) -> dict[str, list[to
    return items_dict


+def write_table_one_row_group_per_episode(table: pa.Table, path: Path) -> None:
+    """Write ``table`` with one parquet row group per episode (in episode order).
+
+    Keeps shards random-access friendly (``read_row_group(i)`` fetches episode i),
+    mirroring the recording writer. ``table`` must carry a contiguous
+    ``episode_index`` column.
+    """
+    episode_index = table.column("episode_index").to_numpy(zero_copy_only=False)
+    starts = np.concatenate(([0], np.nonzero(np.diff(episode_index))[0] + 1))
+    writer = pq.ParquetWriter(str(path), table.schema, compression="snappy", use_dictionary=True)
+    try:
+        for start, stop in zip(starts, np.append(starts[1:], len(episode_index)), strict=True):
+            writer.write_table(table.slice(start, stop - start))  # one episode -> one row group
+    finally:
+        writer.close()
+
+
 def to_parquet_with_hf_images(
    df: pandas.DataFrame, path: Path, features: datasets.Features | None = None
 ) -> None:
-    """This function correctly writes to parquet a panda DataFrame that contains images encoded by HF dataset.
-    This way, it can be loaded by HF dataset and correctly formatted images are returned.
+    """Write a DataFrame with HF-encoded images to parquet, one row group per episode.

-    Args:
-        df: DataFrame to write to parquet.
-        path: Path to write the parquet file.
-        features: Optional HuggingFace Features schema. If provided, ensures image columns
-                  are properly typed as Image() in the parquet schema.
+    Images are embedded into the arrow table first (``ParquetWriter.write_table``
+    does not embed external image files like ``Dataset.to_parquet`` does).
+    ``features`` types image columns as ``Image()`` in the parquet schema.
    """
-    # TODO(qlhoest): replace this weird synthax by `df.to_parquet(path)` only
    ds = datasets.Dataset.from_dict(df.to_dict(orient="list"), features=features)
-    ds.to_parquet(path)
+    ds = embed_images(ds)
+    table = ds.with_format("arrow")[:]
+    if "episode_index" in table.column_names:
+        write_table_one_row_group_per_episode(table, path)
+    else:
+        # No episode boundaries to align row groups to — keep a single write.
+        pq.write_table(table, str(path))
+
+
+def to_parquet_one_row_group_per_episode(df: pandas.DataFrame, path: Path) -> None:
+    """Write a (non-image) DataFrame to parquet with one row group per episode."""
+    table = pa.Table.from_pandas(df, preserve_index=False)
+    if "episode_index" in table.column_names:
+        write_table_one_row_group_per_episode(table, path)
+    else:
+        pq.write_table(table, str(path))


 def item_to_torch(item: dict) -> dict:
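
The boundary computation inside `write_table_one_row_group_per_episode` can be followed without NumPy or PyArrow. The pure-Python sketch below is a hypothetical re-implementation of only the slicing step: `episode_slices` is an assumed name, and unlike the NumPy version it returns no slices for empty input.

```python
def episode_slices(episode_index: list[int]) -> list[tuple[int, int]]:
    """Return (start, length) for each run of equal consecutive episode indices."""
    if not episode_index:
        return []  # assumption of this sketch: nothing to write for an empty table
    # A new run starts at row 0 and wherever the value differs from the previous row.
    starts = [0] + [
        i for i in range(1, len(episode_index)) if episode_index[i] != episode_index[i - 1]
    ]
    stops = starts[1:] + [len(episode_index)]
    return [(start, stop - start) for start, stop in zip(starts, stops)]


contiguous = episode_slices([0, 0, 0, 1, 1, 2])
interleaved = episode_slices([0, 0, 1, 0])
```

The second call shows why the docstring asks for a contiguous `episode_index`: an episode whose rows are interleaved with another is split into several runs, so row group `i` would no longer correspond to episode `i`.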
@@ -474,8 +474,6 @@ class LeRobotDataset(torch.utils.data.Dataset):
        if reader.hf_dataset is None:
            # One-shot load after finalize()
            reader.load_and_activate()
-        if reader._absolute_to_relative_idx is not None and idx in reader._absolute_to_relative_idx:
-            idx = reader._absolute_to_relative_idx[idx]
        return reader.get_item(idx)

    def select_columns(self, column_names: str | list[str]):
@@ -70,19 +70,21 @@ def aggregate_pipeline_dataset_features(
    initial_features: dict[PipelineFeatureType, dict[str, Any]],
    *,
    use_videos: bool = True,
+    exclude_images: bool = False,
    patterns: Sequence[str] | None = None,
 ) -> dict[str, dict]:
    """
    Aggregates and filters pipeline features to create a dataset-ready features dictionary.

    This function transforms initial features using the pipeline, categorizes them as action or observations
-    (image or state), filters them based on `use_videos` and `patterns`, and finally
+    (image or state), filters them based on `exclude_images` and `patterns`, and finally
    formats them for use with a Hugging Face LeRobot Dataset.

    Args:
        pipeline: The DataProcessorPipeline to apply.
        initial_features: A dictionary of raw feature specs for actions and observations.
-        use_videos: If False, image features are excluded.
+        use_videos: Controls the storage dtype for image features. If True, images are stored as "video"; if False, they are stored as "image".
+        exclude_images: If True, image features are dropped entirely from the output.
        patterns: A sequence of regex patterns to filter action and state features.
                  Image features are not affected by this filter.

@@ -120,7 +122,7 @@ def aggregate_pipeline_dataset_features(
            )

            # 2. Apply filtering rules.
-            if is_image and not use_videos:
+            if is_image and exclude_images:
                continue
            if not is_image and not should_keep(key, compiled_patterns):
                continue
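
The split between `use_videos` and `exclude_images` described in the docstring can be illustrated with a simplified filter. This is a sketch under stated assumptions: `select_features` and the `is_image` flag in the specs are hypothetical, and the regex `patterns` filtering is left out.

```python
def select_features(features: dict, *, use_videos: bool = True, exclude_images: bool = False) -> dict:
    """Simplified stand-in: drop images if asked, otherwise pick their storage dtype."""
    selected = {}
    for key, spec in features.items():
        if spec["is_image"]:
            if exclude_images:
                continue  # images are dropped entirely
            dtype = "video" if use_videos else "image"  # use_videos only picks the storage dtype
        else:
            dtype = spec["dtype"]  # non-image features keep their own dtype
        selected[key] = {"dtype": dtype, "shape": spec["shape"]}
    return selected


# Hypothetical feature specs for illustration only.
specs = {
    "observation.images.front": {"is_image": True, "dtype": None, "shape": (96, 96, 3)},
    "observation.state": {"is_image": False, "dtype": "float32", "shape": (6,)},
}
as_video = select_features(specs, use_videos=True)
as_image = select_features(specs, use_videos=False)
no_images = select_features(specs, exclude_images=True)
```

Before this change `use_videos=False` dropped image features; in the sketch, as in the new docstring, it keeps them and stores them as `"image"`.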
@@ -126,26 +126,6 @@ def preprocess_observation(observations: dict[str, np.ndarray]) -> dict[str, Ten
    if "camera_obs" in observations:
        return_observations[f"{OBS_STR}.camera_obs"] = observations["camera_obs"]

-    # Pass through any remaining ndarray/tensor keys not already handled above,
-    # so env plugins can expose extra observation keys via get_env_processors().
-    _handled = {"pixels", "environment_state", "agent_pos", "robot_state", "policy", "camera_obs"}
-    for key, value in observations.items():
-        if key in _handled:
-            continue
-        target = f"{OBS_STR}.{key}"
-        if target in return_observations:
-            continue
-        if isinstance(value, np.ndarray):
-            val = torch.from_numpy(value).float()
-            if val.dim() == 1:
-                val = val.unsqueeze(0)
-            return_observations[target] = val
-        elif isinstance(value, Tensor):
-            val = value.float()
-            if val.dim() == 1:
-                val = val.unsqueeze(0)
-            return_observations[target] = val
-
    return return_observations


@@ -148,7 +148,7 @@ class ACTPolicy(PreTrainedPolicy):
        l1_loss = (abs_err * valid_mask).sum() / num_valid.clamp_min(1)

        loss_dict = {"l1_loss": l1_loss.item()}
-        if self.config.use_vae and log_sigma_x2_hat is not None:
+        if self.config.use_vae:
            # Calculate Dₖₗ(latent_pdf || standard_normal). Note: After computing the KL-divergence for
            # each dimension independently, we sum over the latent dimension to get the total
            # KL-divergence per batch element, then take the mean over the batch.
@@ -101,23 +101,11 @@ class DiffusionPolicy(PreTrainedPolicy):

    @torch.no_grad()
    def predict_action_chunk(self, batch: dict[str, Tensor], noise: Tensor | None = None) -> Tensor:
-        """Predict a chunk of actions given environment observations.
-
-        Supports two modes:
-        - Online (queues populated via select_action): stacks observations from internal queues.
-        - Offline (empty queues, e.g. dataloader batch): uses the batch directly.
-        """
-        queues_populated = any(len(q) > 0 for q in self._queues.values())
-        if queues_populated:
-            batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
-        else:
-            batch = dict(batch)
-            if self.config.image_features:
-                for key in self.config.image_features:
-                    if batch[key].ndim == 4:
-                        batch[key] = batch[key].unsqueeze(1)
-                batch[OBS_IMAGES] = torch.stack([batch[key] for key in self.config.image_features], dim=-4)
+        """Predict a chunk of actions given environment observations."""
+        # stack n latest observations from the queue
+        batch = {k: torch.stack(list(self._queues[k]), dim=1) for k in batch if k in self._queues}
        actions = self.diffusion.generate_actions(batch, noise=noise)
+
        return actions

    @torch.no_grad()
@@ -252,7 +252,6 @@ class ProcessorConfigKwargs(TypedDict, total=False):
 def make_pre_post_processors(
    policy_cfg: PreTrainedConfig,
    pretrained_path: str | None = None,
-    pretrained_revision: str | None = None,
    **kwargs: Unpack[ProcessorConfigKwargs],
 ) -> tuple[
    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
@@ -310,7 +309,6 @@ def make_pre_post_processors(
            overrides=kwargs.get("preprocessor_overrides", {}),
            to_transition=batch_to_transition,
            to_output=transition_to_batch,
-            revision=pretrained_revision,
        )
        postprocessor = PolicyProcessorPipeline.from_pretrained(
            pretrained_model_name_or_path=pretrained_path,
@@ -320,7 +318,6 @@ def make_pre_post_processors(
            overrides=kwargs.get("postprocessor_overrides", {}),
            to_transition=policy_action_to_transition,
            to_output=transition_to_policy_action,
-            revision=pretrained_revision,
        )
        _reconnect_relative_absolute_steps(preprocessor, postprocessor)
        return preprocessor, postprocessor
@@ -560,7 +557,6 @@ def make_policy(
        # Load a pretrained policy and override the config if needed (for example, if there are inference-time
        # hyperparameters that we want to vary).
        kwargs["pretrained_name_or_path"] = cfg.pretrained_path
-        kwargs["revision"] = cfg.pretrained_revision
        policy = policy_cls.from_pretrained(**kwargs)
    elif cfg.pretrained_path and cfg.use_peft:
        # Load a pretrained PEFT model on top of the policy. The pretrained path points to the folder/repo
@@ -124,7 +124,6 @@ def make_reward_model(cfg: RewardModelConfig, **kwargs) -> PreTrainedRewardModel

    if cfg.pretrained_path:
        kwargs["pretrained_name_or_path"] = cfg.pretrained_path
-        kwargs["revision"] = cfg.pretrained_revision
        reward_model = reward_cls.from_pretrained(**kwargs)
    else:
        reward_model = reward_cls(**kwargs)
@@ -96,31 +96,14 @@ from lerobot.utils.utils import (
 )


-def _env_features_to_dataset_features(env_features: dict, raw_obs: dict | None = None) -> dict:
-    """Convert EnvConfig.features (PolicyFeature objects) to the plain dict format for LeRobotDataset.create().
-
-    If raw_obs is provided, visual feature shapes are inferred from the actual observation
-    to avoid mismatches between the env config and the real observation resolution.
-    """
+def _env_features_to_dataset_features(env_features: dict) -> dict:
+    """Convert EnvConfig.features to the dict format expected by LeRobotDataset.create()."""
    features = {}
    for key, ft in env_features.items():
+        shape = tuple(ft.shape)
        if ft.type is FeatureType.VISUAL:
-            shape = tuple(ft.shape)
-            if raw_obs is not None and key in raw_obs and isinstance(raw_obs[key], np.ndarray):
-                shape = raw_obs[key].shape[1:]  # strip batch dim
-            elif raw_obs is not None and "pixels" in raw_obs:
-                pixels = raw_obs["pixels"]
-                if isinstance(pixels, dict):
-                    for cam_name, img in pixels.items():
-                        if key == f"{OBS_IMAGES}.{cam_name}" or key == cam_name:
-                            shape = img.shape[1:]  # strip batch dim
-                elif key in ("pixels", OBS_IMAGE):
-                    shape = pixels.shape[1:]  # strip batch dim
            features[key] = {"dtype": "video", "shape": shape, "names": ["height", "width", "channel"]}
        else:
-            shape = tuple(ft.shape)
-            if raw_obs is not None and key in raw_obs and isinstance(raw_obs[key], np.ndarray):
-                shape = raw_obs[key].shape[1:]  # strip batch dim
            features[key] = {"dtype": "float32", "shape": shape, "names": None}
    features["next.reward"] = {"dtype": "float32", "shape": (1,), "names": None}
    features["next.success"] = {"dtype": "bool", "shape": (1,), "names": None}
@@ -147,6 +130,8 @@ def _build_raw_frame(
    for key in env_features:
        if key == ACTION:
            continue
+        if key.startswith("next."):
+            continue
        if "pixels" in raw_obs and isinstance(raw_obs["pixels"], dict):
            for cam_name, img in raw_obs["pixels"].items():
                candidate = f"{OBS_IMAGES}.{cam_name}"
@@ -157,9 +142,8 @@ def _build_raw_frame(
        if "pixels" in raw_obs and not isinstance(raw_obs["pixels"], dict) and key in ("pixels", OBS_IMAGE):
            frame[key] = raw_obs["pixels"][env_idx]
            continue
-        raw_key = key
-        if raw_key in raw_obs and isinstance(raw_obs[raw_key], np.ndarray):
-            val = raw_obs[raw_key][env_idx]
+        if key in raw_obs and isinstance(raw_obs[key], np.ndarray):
+            val = raw_obs[key][env_idx]
            if val.dtype == np.float64:
                val = val.astype(np.float32)
            frame[key] = val
@@ -181,7 +165,8 @@ def rollout(
    seeds: list[int] | None = None,
    return_observations: bool = False,
    render_callback: Callable[[gym.vector.VectorEnv], None] | None = None,
-    recording_dataset: Any | None = None,
+    recording_dir: Path | None = None,
+    env_features: dict | None = None,
 ) -> dict:
    """Run a batched policy rollout once through a batch of environments.

@@ -222,9 +207,25 @@ def rollout(
    if render_callback is not None:
        render_callback(env)

-    raw_observation = deepcopy(observation) if recording_dataset is not None else None
+    recording_datasets: list[LeRobotDataset] | None = None
+    raw_observation = None
    task_desc = ""
-    if recording_dataset is not None:
+    if recording_dir is not None and env_features is not None:
+        features = _env_features_to_dataset_features(env_features)
+        fps = env.unwrapped.metadata.get("render_fps", 30)
+        recording_datasets = []
+        for i in range(env.num_envs):
+            root = str(recording_dir / f"env_{i}") if env.num_envs > 1 else str(recording_dir)
+            recording_datasets.append(
+                LeRobotDataset.create(
+                    repo_id="eval_recording",
+                    fps=fps,
+                    features=features,
+                    root=root,
+                    use_videos=True,
+                )
+            )
+        raw_observation = deepcopy(observation)
        try:
            task_desc = list(env.call("task_description"))[0]
        except (AttributeError, NotImplementedError):
@@ -302,7 +303,7 @@ def rollout(
        else:
            successes = [False] * env.num_envs

-        if recording_dataset is not None and raw_observation is not None:
+        if recording_datasets is not None and raw_observation is not None:
            prev_done = done.copy()
            for env_idx in range(env.num_envs):
                if prev_done[env_idx]:
@@ -315,11 +316,11 @@ def rollout(
                    successes[env_idx],
                    bool(terminated[env_idx] | truncated[env_idx]),
                    task_desc,
-                    recording_dataset.features,
+                    recording_datasets[env_idx].features,
                )
-                recording_dataset.add_frame(frame)
+                recording_datasets[env_idx].add_frame(frame)
                if terminated[env_idx] or truncated[env_idx]:
-                    recording_dataset.save_episode()
+                    recording_datasets[env_idx].save_episode()
            raw_observation = deepcopy(observation)

        # Keep track of which environments are done so far.
@@ -360,6 +361,10 @@ def rollout(
            stacked_observations[key] = torch.stack([obs[key] for obs in all_observations], dim=1)
        ret[OBS_STR] = stacked_observations

+    if recording_datasets is not None:
+        for ds in recording_datasets:
+            ds.finalize()
+
    if hasattr(policy, "use_original_modules"):
        policy.use_original_modules()

@@ -378,7 +383,8 @@ def eval_policy(
    videos_dir: Path | None = None,
    return_episode_data: bool = False,
    start_seed: int | None = None,
-    recording_dataset: Any | None = None,
+    recording_dir: Path | None = None,
+    env_features: dict | None = None,
 ) -> dict:
    """
    Args:
@@ -467,7 +473,8 @@ def eval_policy(
            seeds=list(seeds) if seeds else None,
            return_observations=return_episode_data,
            render_callback=render_frame if max_episodes_rendered > 0 else None,
-            recording_dataset=recording_dataset,
+            recording_dir=recording_dir,
+            env_features=env_features,
        )

        # Figure out where in each rollout sequence the first done condition was encountered (results after
@@ -732,7 +739,8 @@ def eval_one(
    videos_dir: Path | None,
    return_episode_data: bool,
    start_seed: int | None,
-    recording_dataset: Any | None = None,
+    recording_dir: Path | None = None,
+    env_features: dict | None = None,
 ) -> TaskMetrics:
    """Evaluates one task_id of one suite using the provided vec env."""

@@ -750,7 +758,8 @@ def eval_one(
        videos_dir=task_videos_dir,
        return_episode_data=return_episode_data,
        start_seed=start_seed,
-        recording_dataset=recording_dataset,
+        recording_dir=recording_dir,
+        env_features=env_features,
    )

    per_episode = task_result["per_episode"]
@@ -790,38 +799,25 @@ def run_one(
        task_videos_dir = videos_dir / f"{task_group}_{task_id}"
        task_videos_dir.mkdir(parents=True, exist_ok=True)

-    recording_dataset = None
+    task_recording_dir = None
    if recording_dir is not None and env_features is not None:
        task_recording_dir = recording_dir / f"{task_group}_{task_id}"
-        fps = env.unwrapped.metadata.get("render_fps", 30)
-        sample_obs, _ = env.reset()
-        features = _env_features_to_dataset_features(env_features, raw_obs=sample_obs)
-        recording_dataset = LeRobotDataset.create(
-            repo_id=f"eval_{task_group}_{task_id}",
-            fps=fps,
-            features=features,
-            root=str(task_recording_dir),
-            use_videos=True,
-        )

-    try:
-        metrics = eval_one(
-            env,
-            policy=policy,
-            env_preprocessor=env_preprocessor,
-            env_postprocessor=env_postprocessor,
-            preprocessor=preprocessor,
-            postprocessor=postprocessor,
-            n_episodes=n_episodes,
-            max_episodes_rendered=max_episodes_rendered,
-            videos_dir=task_videos_dir,
-            return_episode_data=return_episode_data,
-            start_seed=start_seed,
-            recording_dataset=recording_dataset,
-        )
-    finally:
-        if recording_dataset is not None:
-            recording_dataset.finalize()
+    metrics = eval_one(
+        env,
+        policy=policy,
+        env_preprocessor=env_preprocessor,
+        env_postprocessor=env_postprocessor,
+        preprocessor=preprocessor,
+        postprocessor=postprocessor,
+        n_episodes=n_episodes,
+        max_episodes_rendered=max_episodes_rendered,
+        videos_dir=task_videos_dir,
+        return_episode_data=return_episode_data,
+        start_seed=start_seed,
+        recording_dir=task_recording_dir,
+        env_features=env_features,
+    )

    if max_episodes_rendered > 0:
        metrics.setdefault("video_paths", [])
@@ -45,8 +45,7 @@ from lerobot.common.train_utils import (
 from lerobot.common.wandb_utils import WandBLogger
 from lerobot.configs import parser
 from lerobot.configs.train import TrainPipelineConfig
-from lerobot.datasets import EpisodeAwareSampler, compute_sampler_state
-from lerobot.datasets.factory import make_train_eval_datasets
+from lerobot.datasets import EpisodeAwareSampler, compute_sampler_state, make_dataset
 from lerobot.envs import close_envs, make_env, make_env_pre_post_processors
 from lerobot.optim.factory import make_optimizer_and_scheduler
 from lerobot.policies import PreTrainedPolicy, make_policy, make_pre_post_processors
@@ -245,19 +244,19 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
    # LeRobotDataset skips its snapshot_download when try_load() succeeds, so no rank re-downloads.
    if is_main_process:
        logging.info("Creating dataset")
-        dataset, eval_dataset = make_train_eval_datasets(cfg)
+        dataset = make_dataset(cfg)

    accelerator.wait_for_everyone()

    # Other ranks read from the shared copy populated by the main process.
    if not is_main_process:
-        dataset, eval_dataset = make_train_eval_datasets(cfg)
+        dataset = make_dataset(cfg)

    # Create environment used for evaluating checkpoints during training on simulation data.
    # On real-world data, no need to create an environment as evaluations are done outside train.py,
    # using the eval.py instead, with gym_dora environment and dora-rs.
    eval_env = None
-    if cfg.env_eval_freq > 0 and cfg.env is not None and is_main_process:
+    if cfg.eval_freq > 0 and cfg.env is not None and is_main_process:
        logging.info("Creating env")
        eval_env = make_env(cfg.env, n_envs=cfg.eval.batch_size, use_async_envs=cfg.eval.use_async_envs)

@@ -346,7 +345,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        preprocessor, postprocessor = make_pre_post_processors(
            policy_cfg=cfg.policy,
            pretrained_path=processor_pretrained_path,
-            pretrained_revision=getattr(cfg.policy, "pretrained_revision", None),
            **processor_kwargs,
        )

@@ -457,31 +455,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        persistent_workers=cfg.persistent_workers and cfg.num_workers > 0,
    )

-    # Build eval dataloader if a held-out split exists
-    eval_dataloader = None
-    if eval_dataset is not None:
-        eval_ds = eval_dataset
-        if cfg.max_eval_samples > 0 and hasattr(eval_dataset, "hf_dataset"):
-            task_indices = eval_dataset.hf_dataset["task_index"]
-            unique_tasks = sorted(set(task_indices))
-            per_task = max(1, cfg.max_eval_samples // len(unique_tasks))
-            selected: list[int] = []
-            for t in unique_tasks:
-                frames = [i for i, ti in enumerate(task_indices) if ti == t][:per_task]
-                selected.extend(frames)
-            eval_ds = torch.utils.data.Subset(eval_dataset, selected)
-
-        eval_collate_fn = lerobot_collate_fn if dataset.meta.has_language_columns else None
-        eval_dataloader = torch.utils.data.DataLoader(
-            eval_ds,
-            batch_size=cfg.batch_size,
-            shuffle=False,
-            num_workers=cfg.num_workers,
-            pin_memory=device.type == "cuda",
-            drop_last=False,
-            collate_fn=eval_collate_fn,
-        )
-
    # Prepare everything with accelerator
    accelerator.wait_for_everyone()
    policy, optimizer, dataloader, lr_scheduler = accelerator.prepare(
@@ -561,8 +534,7 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
        train_tracker.step()
        is_log_step = cfg.log_freq > 0 and step % cfg.log_freq == 0
        is_saving_step = step % cfg.save_freq == 0 or step == cfg.steps
-        is_env_eval_step = cfg.env_eval_freq > 0 and step % cfg.env_eval_freq == 0
-        is_eval_step = cfg.eval_steps > 0 and eval_dataloader is not None and step % cfg.eval_steps == 0
+        is_eval_step = cfg.eval_freq > 0 and step % cfg.eval_freq == 0

        if is_log_step:
            # Collective reduce must run on every rank, before the main-process gate below.
@@ -585,27 +557,6 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):
                    wandb_logger.log_dict(wandb_log_dict, step)
            train_tracker.reset_averages()

-        if is_eval_step:
-            policy.eval()
-            eval_loss_sum = 0.0
-            n_eval_batches = 0
-            with torch.no_grad(), accelerator.autocast():
-                for eval_batch in eval_dataloader:
-                    for cam_key in dataset.meta.camera_keys:
-                        if cam_key in eval_batch and eval_batch[cam_key].dtype == torch.uint8:
-                            eval_batch[cam_key] = eval_batch[cam_key].to(dtype=torch.float32) / 255.0
-                    eval_batch = preprocessor(eval_batch)
-                    loss, _ = policy.forward(eval_batch)
-                    eval_loss_sum += loss.item()
-                    n_eval_batches += 1
-            eval_loss = eval_loss_sum / max(n_eval_batches, 1)
-            policy.train()
-
-            if is_main_process:
-                logging.info(f"step {step}: eval_loss={eval_loss:.4f}")
-                if wandb_logger:
-                    wandb_logger.log_dict({"eval_loss": eval_loss}, step=step, mode="eval")
-
        if cfg.save_checkpoint and is_saving_step:
            if is_main_process:
                logging.info(f"Checkpoint policy after step {step}")
@@ -628,7 +579,7 @@ def train(cfg: TrainPipelineConfig, accelerator: "Accelerator | None" = None):

            accelerator.wait_for_everyone()

-        if cfg.env and is_env_eval_step:
+        if cfg.env and is_eval_step:
            if is_main_process:
                step_id = get_step_identifier(step, cfg.steps)
                logging.info(f"Eval policy at step {step}")
@@ -216,15 +216,9 @@ def register_third_party_plugins() -> None:

    This function uses `importlib.metadata` to find packages installed in the environment
    (including editable installs) starting with 'lerobot_robot_', 'lerobot_camera_',
-    'lerobot_teleoperator_', 'lerobot_policy_', or 'lerobot_env_' and imports them.
+    'lerobot_teleoperator_', or 'lerobot_policy_' and imports them.
    """
-    prefixes = (
-        "lerobot_robot_",
-        "lerobot_camera_",
-        "lerobot_teleoperator_",
-        "lerobot_policy_",
-        "lerobot_env_",
-    )
+    prefixes = ("lerobot_robot_", "lerobot_camera_", "lerobot_teleoperator_", "lerobot_policy_")
    imported: list[str] = []
    failed: list[str] = []

@@ -28,6 +28,7 @@ import pytest
 pytest.importorskip("datasets", reason="datasets is required (install lerobot[dataset])")
 pytest.importorskip("pandas", reason="pandas is required (install lerobot[dataset])")

+import pandas as pd  # noqa: E402
 import pyarrow.parquet as pq  # noqa: E402

 from lerobot.annotations.steerable_pipeline.reader import iter_episodes  # noqa: E402
@@ -344,6 +345,78 @@ def test_annotation_metadata_sync_allows_non_streaming_load(
    assert len(dataset) == 24


+def _build_packed_dataset(root: Path, episode_lengths: list[int], *, fps: int = 10) -> Path:
+    """Pack several episodes into a single shard (vs build_annotation_dataset's one-per-file),
+    so the writer's rewrite must re-emit one row group per episode instead of collapsing them."""
+    from lerobot.datasets.io_utils import write_tasks
+    from lerobot.utils.io_utils import write_json
+
+    data_dir = root / "data" / "chunk-000"
+    data_dir.mkdir(parents=True, exist_ok=True)
+
+    episode_index, frame_index, timestamp, task_index, subtask_index = [], [], [], [], []
+    for ep, length in enumerate(episode_lengths):
+        episode_index += [ep] * length
+        frame_index += list(range(length))
+        timestamp += [round(i / fps, 6) for i in range(length)]
+        task_index += [0] * length
+        subtask_index += [0] * length  # legacy column the writer must drop
+    pd.DataFrame(
+        {
+            "episode_index": episode_index,
+            "frame_index": frame_index,
+            "timestamp": timestamp,
+            "task_index": task_index,
+            "subtask_index": subtask_index,
+        }
+    ).to_parquet(data_dir / "file-000.parquet", index=False)
+
+    tasks_df = pd.DataFrame({"task_index": [0]}, index=pd.Index(["do the thing"], name="task"))
+    write_tasks(tasks_df, root)
+    write_json(
+        {"codebase_version": "v3.1", "fps": fps, "features": {}, "total_episodes": len(episode_lengths)},
+        root / "meta" / "info.json",
+    )
+    return root
+
+
+def test_writer_one_row_group_per_episode(tmp_path: Path) -> None:
+    """Rewriting a packed shard must keep one row group per episode, not collapse
+    every episode into a single giant row group."""
+    episode_lengths = [4, 6, 5]  # unequal lengths, all in one shard
+    root = _build_packed_dataset(tmp_path / "ds", episode_lengths)
+    shard = root / "data" / "chunk-000" / "file-000.parquet"
+    assert pq.ParquetFile(shard).metadata.num_row_groups == 1, "fixture should start collapsed"
+
+    staging_dir = tmp_path / "stage"
+    for ep in range(len(episode_lengths)):
+        _stage_episode(
+            staging_dir,
+            ep,
+            plan=[
+                {
+                    "role": "assistant",
+                    "content": f"subtask for ep {ep}",
+                    "style": "subtask",
+                    "timestamp": 0.0,
+                    "tool_calls": None,
+                }
+            ],
+        )
+
+    records = list(iter_episodes(root))
+    LanguageColumnsWriter().write_all(records, staging_dir, root)
+
+    # One row group per episode, with row counts matching the episode lengths.
+    md = pq.ParquetFile(shard).metadata
+    assert md.num_row_groups == len(episode_lengths)
+    assert [md.row_group(i).num_rows for i in range(md.num_row_groups)] == episode_lengths
+    # Language columns are still present after the per-episode rewrite.
+    table = pq.read_table(shard)
+    assert "language_persistent" in table.column_names
+    assert "language_events" in table.column_names
+
+
 def test_speech_atom_shape_matches_plan_spec() -> None:
    atom = speech_atom(2.5, "I'm cleaning up!")
    assert atom["role"] == "assistant"
@@ -32,6 +32,26 @@ from lerobot.datasets.lerobot_dataset import LeRobotDataset
 from tests.fixtures.constants import DUMMY_REPO_ID


+def assert_data_shards_one_row_group_per_episode(root):
+    """Every aggregated DATA shard must have exactly one parquet row group per episode."""
+    import pyarrow.parquet as pq
+
+    shards = sorted((root / "data").rglob("*.parquet"))
+    assert shards, f"no data shards found under {root}/data"
+    n_episodes = 0
+    for shard in shards:
+        pf = pq.ParquetFile(shard)
+        episodes = pf.read(columns=["episode_index"]).column("episode_index").to_pylist()
+        assert pf.metadata.num_row_groups == len(set(episodes)), shard
+        for i in range(pf.metadata.num_row_groups):
+            rg_episodes = set(
+                pf.read_row_group(i, columns=["episode_index"]).column("episode_index").to_pylist()
+            )
+            assert len(rg_episodes) == 1, f"{shard} row group {i} spans episodes {rg_episodes}"
+        n_episodes += len(set(episodes))
+    return n_episodes
+
+
 def assert_episode_and_frame_counts(aggr_ds, expected_episodes, expected_frames):
    """Test that total number of episodes and frames are correctly aggregated."""
    assert aggr_ds.num_episodes == expected_episodes, (
@@ -566,6 +586,41 @@ def assert_image_frames_integrity(aggr_ds, ds_0, ds_1):
            )


+@pytest.mark.parametrize("use_videos", [True, False], ids=["video", "image"])
+def test_aggregate_one_row_group_per_episode(tmp_path, lerobot_dataset_factory, use_videos):
+    """Aggregated DATA shards keep one row group per episode (not one collapsed group).
+
+    Covers both the non-image (``df.to_parquet``) and image
+    (``to_parquet_with_hf_images``) write branches, including the merge-into-
+    existing-file branch via a low file-size threshold that forces packing.
+    """
+    ds_0 = lerobot_dataset_factory(
+        root=tmp_path / "rg_0",
+        repo_id=f"{DUMMY_REPO_ID}_rg_0",
+        total_episodes=3,
+        total_frames=60,
+        use_videos=use_videos,
+    )
+    ds_1 = lerobot_dataset_factory(
+        root=tmp_path / "rg_1",
+        repo_id=f"{DUMMY_REPO_ID}_rg_1",
+        total_episodes=4,
+        total_frames=80,
+        use_videos=use_videos,
+    )
+
+    aggr_root = tmp_path / "rg_aggr"
+    aggregate_datasets(
+        repo_ids=[ds_0.repo_id, ds_1.repo_id],
+        roots=[ds_0.root, ds_1.root],
+        aggr_repo_id=f"{DUMMY_REPO_ID}_rg_aggr",
+        aggr_root=aggr_root,
+    )
+
+    n_episodes = assert_data_shards_one_row_group_per_episode(aggr_root)
+    assert n_episodes == ds_0.num_episodes + ds_1.num_episodes
+
+
 def test_aggregate_image_datasets(tmp_path, lerobot_dataset_factory):
    """Test aggregation of image-based datasets preserves HuggingFace Image schema.

@@ -51,7 +51,7 @@ from lerobot.robots import make_robot_from_config
 from lerobot.transforms import ImageTransforms, ImageTransformsConfig
 from lerobot.utils.constants import ACTION, DONE, OBS_IMAGES, OBS_STATE, OBS_STR, REWARD
 from lerobot.utils.feature_utils import hw_to_dataset_features
-from tests.fixtures.constants import DUMMY_CHW, DUMMY_HWC, DUMMY_REPO_ID
+from tests.fixtures.constants import DUMMY_CHW, DUMMY_HWC, DUMMY_MOTOR_FEATURES, DUMMY_REPO_ID
 from tests.mocks.mock_robot import MockRobotConfig
 from tests.utils import require_x86_64_kernel

@@ -133,6 +133,21 @@ def test_dataset_feature_with_forward_slash_raises_error():
        )


+def test_create_does_not_mutate_input_features(tmp_path, empty_lerobot_dataset_factory):
+    # ``create`` must deep-copy features so a dataset built from another's features stays independent.
+    dataset = empty_lerobot_dataset_factory(
+        root=tmp_path / "ds1", features=DUMMY_MOTOR_FEATURES, use_videos=False
+    )
+    dataset_copy = empty_lerobot_dataset_factory(
+        root=tmp_path / "ds2", features=dataset.meta.features, use_videos=False
+    )
+
+    original_shape = dataset.meta.info.features["state"]["shape"]
+    dataset_copy.meta.info.features["state"]["shape"] = (999,)
+
+    assert dataset.meta.info.features["state"]["shape"] == original_shape
+
+
 def test_add_frame_missing_task(tmp_path, empty_lerobot_dataset_factory):
    features = {"state": {"dtype": "float32", "shape": (1,), "names": None}}
    dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
@@ -2370,14 +2370,32 @@ def test_aggregate_images_when_use_videos_false():
    out = aggregate_pipeline_dataset_features(
        pipeline=rp,
        initial_features={PipelineFeatureType.ACTION: {}, PipelineFeatureType.OBSERVATION: initial},
-        use_videos=False,  # expect "image" dtype
+        use_videos=False,  # images kept, stored as "image" dtype
        patterns=None,
    )

    key = f"{OBS_IMAGES}.back"
    key_front = f"{OBS_IMAGES}.front"
-    assert key not in out
-    assert key_front not in out
+    assert key in out
+    assert key_front in out
+    assert out[key]["dtype"] == "image"
+    assert out[key_front]["dtype"] == "image"
+    assert out[key]["shape"] == initial["back"]
+
+
+def test_aggregate_images_excluded():
+    rp = DataProcessorPipeline([AddObservationStateFeatures(add_front_image=True)])
+    initial = {"back": (480, 640, 3)}
+
+    out = aggregate_pipeline_dataset_features(
+        pipeline=rp,
+        initial_features={PipelineFeatureType.ACTION: {}, PipelineFeatureType.OBSERVATION: initial},
+        exclude_images=True,
+        patterns=None,
+    )
+
+    assert f"{OBS_IMAGES}.back" not in out
+    assert f"{OBS_IMAGES}.front" not in out


 def test_aggregate_images_when_use_videos_true():
@@ -134,7 +134,7 @@ class TestMultiGPUTraining:
                f"--output_dir={output_dir}",
                "--batch_size=4",
                "--steps=10",
-                "--env_eval_freq=-1",
+                "--eval_freq=-1",
                "--log_freq=5",
                "--save_freq=10",
                "--seed=42",
@@ -177,7 +177,7 @@ class TestMultiGPUTraining:
                f"--output_dir={output_dir}",
                "--batch_size=4",
                "--steps=20",
-                "--env_eval_freq=-1",
+                "--eval_freq=-1",
                "--log_freq=5",
                "--save_freq=10",
                "--seed=42",
Author	SHA1	Message	Date
Khalil Meftah	4f5e6596be	refactor(eval): remove shape inference and shallow copy helpers	2026-06-16 22:13:23 +02:00
Khalil Meftah	afeeeb8982	Merge branch 'main' into feat/eval-dataset-recording	2026-06-16 21:45:06 +02:00
Khalil Meftah	040c6b3d66	refactor(eval): per-env datasets recording, no double reset - Extract _infer_shape_from_obs() to reduce nesting in feature conversion - Move dataset creation into rollout() using its own env.reset() observation, eliminating the extra reset in run_one() - Replace deepcopy with _shallow_copy_obs() for raw observation stashing - Support batch_size > 1: each parallel env records to its own dataset (single env skips the env_0/ nesting for simplicity) - One-time warning for env_features keys missing from observations - Pass recording_dir + env_features through the call chain instead of a pre-built recording_dataset object	2026-06-16 21:35:05 +02:00
Caroline Pascal	287c823f13	fix(features copy): adding deepcopy on LeRobot dataset features to avoid shallow copy leaks (#3826) * fix(features copy): adding deepcopy on LeRobot dataset features to avoid shallow copy leaks * tests(test): adding new test	2026-06-16 17:58:59 +02:00
Khalil Meftah	acd31c7de2	fix(eval): use FeatureType enum comparison instead of string value	2026-06-16 15:22:50 +02:00
Pepijn	58ccc01508	fix(datasets): enforce one parquet row group per episode in v3 data writes (#3807) * fix(datasets): enforce one parquet row group per episode in v3 data writes LeRobot v3 data shards must hold exactly one row group per episode so a reader can fetch episode i with pq.ParquetFile(path).read_row_group(i) (a byte-range read) instead of loading the whole shard. The recording writer already does this (one write_table per episode); the aggregate and lerobot-annotate re-write paths instead concatenated many episodes and wrote them in one shot, collapsing the file to a single row group. - io_utils: add write_table_one_row_group_per_episode (one ParquetWriter, one write_table per episode — same pattern as the recording writer); to_parquet_with_hf_images embeds images then writes per-episode row groups; to_parquet_one_row_group_per_episode wraps it for plain frames - aggregate: route non-image data writes through the per-episode writer; leave the episodes-metadata parquet untouched (already one row/episode) - annotate: rewrite shards via the per-episode writer instead of a single bulk pq.write_table - tests: invariant coverage through the aggregate (image + video) and annotate paths No change to on-disk schema, paths, naming, rollover thresholds, or compression. Readers stay backward-compatible (old collapsed files load). * Update src/lerobot/datasets/io_utils.py Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com> Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com> * Update src/lerobot/datasets/io_utils.py Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com> Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com> * fix(datasets): correct indentation and add strict= in row-group helper The web-edited numpy version of write_table_one_row_group_per_episode had an over-indented line (IndentationError, breaking pre-commit + test collection) and a zip() without strict=. Fix both; behaviour unchanged. 
--------- Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com> Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>	2026-06-16 12:15:48 +02:00
Caroline Pascal	38327fdc84	fix(images/videos): fixing aggregate_pipeline_dataset_features to avoid unwanted images features deletion (#3783) * fix(images/videos): fixing aggregate_pipeline_dataset_features to avoid unwanted images features deletion when videos are not used * fix(docstrings): improving docstrings Signed-off-by: Caroline Pascal <caroline8.pascal@gmail.com> --------- Signed-off-by: Caroline Pascal <caroline8.pascal@gmail.com>	2026-06-15 17:55:52 +02:00
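
Commit 287c823f13 in the table above adds a deepcopy on dataset features to avoid shallow copy leaks, and the diff covers it with `test_create_does_not_mutate_input_features`. The failure mode can be sketched with a plain nested dict shaped like the `features` entries used in the tests. This is only an illustration under that assumption: `shallow_clone` and `deep_clone` are hypothetical helper names, and the code is not taken from `LeRobotDataset.create`.

```python
from copy import deepcopy


def shallow_clone(features: dict) -> dict:
    """Hypothetical helper: copies only the outer dict, so the nested
    per-feature dicts are still shared with the input."""
    return dict(features)


def deep_clone(features: dict) -> dict:
    """Hypothetical helper: copies the nested per-feature dicts as well."""
    return deepcopy(features)


def make_features() -> dict:
    # Same layout as the features dict in test_add_frame_missing_task.
    return {"state": {"dtype": "float32", "shape": (1,), "names": None}}


# Shallow copy: editing the copy leaks into the original.
original = make_features()
leaky = shallow_clone(original)
leaky["state"]["shape"] = (999,)
shape_after_shallow_edit = original["state"]["shape"]

# Deep copy: the original stays independent of the copy.
original = make_features()
independent = deep_clone(original)
independent["state"]["shape"] = (999,)
shape_after_deep_edit = original["state"]["shape"]

print("after shallow edit:", shape_after_shallow_edit)
print("after deep edit:", shape_after_deep_edit)
```

The second half mirrors what the new test asserts: mutating the features of a dataset built from another dataset's features must leave the first dataset's `shape` unchanged.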