Compare commits


7 Commits

Author SHA1 Message Date
AdilZouitine 8b4dcb1496 Enable mypy static type checking in pre-commit configuration and update mypy settings in pyproject.toml 2025-09-01 17:21:30 +02:00
Pepijn 882c80d446 Lower limits by 50% for current and torque for gripper motor (#1809)
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
2025-08-29 16:06:55 +02:00
Pepijn 61b0eeae4b Add feetech firmware update docs (#1793)
* Add feetech firmware update docs

* add bonus

* formatting

* adapt text

* feedback pr
2025-08-28 11:18:54 +02:00
mgiac-hexagon 577cd10974 Removed dupicate lines of code (#1709) 2025-08-25 12:39:32 +02:00
lxk b0923ab74b fix(dataset): Use provided episode_data in save_episode (#1740)
The 'episode_data' parameter was previously ignored, causing an error if provided. This change ensures it is correctly used, which allows for asynchronous episode saving by passing a copy of the episode buffer, preventing conflicts with the main data collection loop.
2025-08-22 15:24:02 +02:00
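An illustrative sketch of the pattern this commit describes (the executor setup and helper name are assumptions, not part of the commit; `episode_buffer` follows `LeRobotDataset`'s internal buffer attribute):

```python
import copy
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def save_episode_async(dataset):
    # snapshot the buffer so the main recording loop can keep collecting frames
    buffer_copy = copy.deepcopy(dataset.episode_buffer)
    # save in the background by passing the copy as episode_data
    executor.submit(dataset.save_episode, episode_data=buffer_copy)
```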
Jack Vial 7f70b78f32 Add missing encoding table entries for Koch arm (#1534) 2025-08-20 17:24:05 +02:00
Steven Palma 55198de096 fix(ci): rename libegl1-mesa in deb13 trixie (#1735) 2025-08-14 11:12:06 +02:00
55 changed files with 1303 additions and 6335 deletions
+5 -5
@@ -86,11 +86,11 @@ repos:
# TODO(Steven): Uncomment when ready to use
##### Static Analysis & Typing #####
# - repo: https://github.com/pre-commit/mirrors-mypy
# rev: v1.16.0
# hooks:
# - id: mypy
# args: [--python-version=3.10]
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.16.0
hooks:
- id: mypy
args: [--python-version=3.10]
##### Docstring Checks #####
# - repo: https://github.com/akaihola/darglint2
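With the hook uncommented, every commit is type-checked; running `pre-commit run mypy --all-files` checks the whole tree once. A hypothetical snippet of the kind of error the hook now catches before the code ever runs:

```python
def frames_per_episode(lengths: list[int]) -> int:
    return max(lengths)

# mypy (--python-version=3.10) flags this call at commit time:
# Argument 1 has incompatible type "str"; expected "list[int]"
frames_per_episode("not a list")
```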
+16 -26
@@ -233,7 +233,7 @@ Under the hood, the `LeRobotDataset` format makes use of several ways to seriali
Here are the important details and internal structure organization of a typical `LeRobotDataset` instantiated with `dataset = LeRobotDataset("lerobot/aloha_static_coffee")`. The exact features will change from dataset to dataset but not the main aspects:
````
```
dataset attributes:
├ hf_dataset: a Hugging Face dataset (backed by Arrow/parquet). Typical features example:
│ ├ observation.images.cam_high (VideoFrame):
@@ -246,30 +246,20 @@ dataset attributes:
│ ├ timestamp (float32): timestamp in the episode
│ ├ next.done (bool): indicates the end of an episode ; True for the last frame in each episode
│ └ index (int64): general index in the whole dataset
├ meta: a LeRobotDatasetMetadata object containing:
│ ├ info: a dictionary of metadata on the dataset
│ │ ├ codebase_version (str): this is to keep track of the codebase version the dataset was created with
│ │ ├ fps (int): frame per second the dataset is recorded/synchronized to
│ │ ├ features (dict): all features contained in the dataset with their shapes and types
│ │ ├ total_episodes (int): total number of episodes in the dataset
│ │ ├ total_frames (int): total number of frames in the dataset
│ │ ├ robot_type (str): robot type used for recording
│ │ ├ data_path (str): formattable string for the parquet files
│ │ └ video_path (str): formattable string for the video files (if using videos)
│ ├ episodes: a DataFrame containing episode metadata with columns:
│ │ ├ episode_index (int): index of the episode
│ │ ├ tasks (list): list of tasks for this episode
│ │ ├ length (int): number of frames in this episode
│ │ ├ dataset_from_index (int): start index of this episode in the dataset
│ │ └ dataset_to_index (int): end index of this episode in the dataset
│ ├ stats: a dictionary of statistics (max, mean, min, std) for each feature in the dataset, for instance
│ │ ├ observation.images.front_cam: {'max': tensor with same number of dimensions (e.g. `(c, 1, 1)` for images, `(c,)` for states), etc.}
│ │ └ ...
│ └ tasks: a DataFrame containing task information with task names as index and task_index as values
├ root (Path): local directory where the dataset is stored
├ image_transforms (Callable): optional image transformations to apply to visual modalities
├ delta_timestamps (dict): optional delta timestamps for temporal queries
└ video_backend (str): video backend used for decoding videos (e.g., 'pyav', 'torchcodec')
├ episode_data_index: contains 2 tensors with the start and end indices of each episode
│ ├ from (1D int64 tensor): first frame index for each episode — shape (num episodes,) starts with 0
│ └ to (1D int64 tensor): last frame index for each episode — shape (num episodes,)
├ stats: a dictionary of statistics (max, mean, min, std) for each feature in the dataset, for instance
│ ├ observation.images.cam_high: {'max': tensor with same number of dimensions (e.g. `(c, 1, 1)` for images, `(c,)` for states), etc.}
│ └ ...
├ info: a dictionary of metadata on the dataset
│ ├ codebase_version (str): this is to keep track of the codebase version the dataset was created with
│ ├ fps (float): frame per second the dataset is recorded/synchronized to
│ ├ video (bool): indicates if frames are encoded in mp4 video files to save space or stored as png files
│ └ encoding (dict): if video, this documents the main options that were used with ffmpeg to encode the videos
├ videos_dir (Path): where the mp4 videos or png images are stored/accessed
└ camera_keys (list of string): the keys to access camera features in the item returned by the dataset (e.g. `["observation.images.cam_high", ...]`)
```
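A minimal sketch of the `delta_timestamps` attribute listed above, assuming the public `lerobot/aloha_static_coffee` dataset is reachable:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# ask for the frame one second in the past plus the current frame for one camera
delta_timestamps = {"observation.images.cam_high": [-1.0, 0.0]}
dataset = LeRobotDataset("lerobot/aloha_static_coffee", delta_timestamps=delta_timestamps)
item = dataset[0]
print(item["observation.images.cam_high"].shape)  # one stacked entry per delta
```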
A `LeRobotDataset` is serialised using several widespread file formats for each of its parts, namely:
@@ -293,7 +283,7 @@ lerobot-eval \
--eval.n_episodes=10 \
--policy.use_amp=false \
--policy.device=cuda
````
```
Note: After training your own policy, you can re-evaluate the checkpoints with:
+2 -4
@@ -108,8 +108,7 @@ def save_decoded_frames(
def save_first_episode(imgs_dir: Path, dataset: LeRobotDataset) -> None:
episode_index = 0
ep_num_images = dataset.meta.episodes["length"][episode_index]
ep_num_images = dataset.episode_data_index["to"][0].item()
if imgs_dir.exists() and len(list(imgs_dir.glob("frame_*.png"))) == ep_num_images:
return
@@ -266,8 +265,7 @@ def benchmark_encoding_decoding(
overwrite=True,
)
episode_index = 0
ep_num_images = dataset.meta.episodes["length"][episode_index]
ep_num_images = dataset.episode_data_index["to"][0].item()
width, height = tuple(dataset[0][dataset.meta.camera_keys[0]].shape[-2:])
num_pixels = width * height
video_size_bytes = video_path.stat().st_size
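A hedged note on the accessor swapped in above: for episode 0, `episode_data_index["from"][0]` is 0, so `"to"` alone equals the frame count; the general form for any episode is:

```python
episode_index = 0
ep_num_images = (
    dataset.episode_data_index["to"][episode_index]
    - dataset.episode_data_index["from"][episode_index]
).item()
```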
+1 -1
@@ -29,7 +29,7 @@ ENV DEBIAN_FRONTEND=noninteractive \
# Install system dependencies and uv (as root)
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential git curl libglib2.0-0 libegl1-mesa ffmpeg \
build-essential git curl libglib2.0-0 libegl1-mesa-dev ffmpeg \
libusb-1.0-0-dev speech-dispatcher libgeos-dev portaudio19-dev \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& mv /root/.local/bin/uv /usr/local/bin/uv \
+2 -2
@@ -19,8 +19,6 @@
title: Train RL in Simulation
- local: async
title: Use Async Inference
- local: porting_datasets_v3
title: Porting Large Datasets
title: "Tutorials"
- sections:
- local: smolvla
@@ -41,6 +39,8 @@
- sections:
- local: notebooks
title: Notebooks
- local: feetech
title: Updating Feetech Firmware
title: "Resources"
- sections:
- local: contributing
+71
@@ -0,0 +1,71 @@
# Feetech Motor Firmware Update
This tutorial guides you through updating the firmware of Feetech motors using the official Feetech software.
## Prerequisites
- Windows computer (Feetech software is only available for Windows)
- Feetech motor control board
- USB cable to connect the control board to your computer
- Feetech motors connected to the control board
## Step 1: Download Feetech Software
1. Visit the official Feetech software download page: [https://www.feetechrc.com/software.html](https://www.feetechrc.com/software.html)
2. Download the latest version of the Feetech debugging software (FD)
3. Install the software on your Windows computer
## Step 2: Hardware Setup
1. Connect your Feetech motors to the motor control board
2. Connect the motor control board to your Windows computer via USB cable
3. Ensure power is supplied to the motors
## Step 3: Configure Connection
1. Launch the Feetech debugging software
2. Select the correct COM port from the port dropdown menu
- If unsure which port to use, check Windows Device Manager under "Ports (COM & LPT)"
3. Set the appropriate baud rate (typically 1000000 for most Feetech motors)
4. Click "Open" to establish communication with the control board
## Step 4: Scan for Motors
1. Once connected, click the "Search" button to detect all connected motors
2. The software will automatically discover and list all motors on the bus
3. Each motor will appear with its ID number
## Step 5: Update Firmware
For each motor you want to update:
1. **Select the motor** from the list by clicking on it
2. **Click on the Upgrade tab**
3. **Click on the Online button**:
- If a potential firmware update is found, it will be displayed in the box
4. **Click on the Upgrade button**:
- The update progress will be displayed
## Step 6: Verify Update
1. After the update completes, the software should automatically refresh the motor information
2. Verify that the firmware version has been updated to the expected version
## Important Notes
⚠️ **Warning**: Do not disconnect power or USB during a firmware update; doing so can brick the motor.
## Bonus: Motor Debugging on Linux/macOS
For debugging purposes only, you can use the open-source Feetech Debug Tool:
- **Repository**: [FT_SCServo_Debug_Qt](https://github.com/CarolinePascal/FT_SCServo_Debug_Qt/tree/fix/port-search-timer)
### Installation Instructions
Follow the instructions in the repository to install the tool. On Ubuntu you can install it directly; on macOS you need to build it from source.
**Limitations:**
- This tool is for debugging and parameter adjustment only
- Firmware updates must still be done on Windows with official Feetech software
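Before launching the debug tool, it can help to confirm which serial device the control board enumerates as. A hedged helper using `pyserial` (an assumption; not part of the official Feetech tooling):

```python
from serial.tools import list_ports

# list candidate serial ports with their descriptions
for port in list_ports.comports():
    print(port.device, "-", port.description)
```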
-321
@@ -1,321 +0,0 @@
# Porting Large Datasets to LeRobot Dataset v3.0
This tutorial explains how to port large-scale robotic datasets to the LeRobot Dataset v3.0 format. We'll use the **DROID 1.0.1** dataset as our primary example, which demonstrates handling multi-terabyte datasets with thousands of shards across SLURM clusters.
## File Organization: v2.1 vs v3.0
Dataset v3.0 fundamentally changes how data is organized and stored:
**v2.1 Structure (Episode-based)**:
```
dataset/
├── data/chunk-000/episode_000000.parquet
├── data/chunk-000/episode_000001.parquet
├── videos/chunk-000/camera/episode_000000.mp4
└── meta/episodes.jsonl
```
**v3.0 Structure (File-based)**:
```
dataset/
├── data/chunk-000/file-000.parquet # Multiple episodes per file
├── videos/camera/chunk-000/file-000.mp4 # Consolidated video chunks
└── meta/episodes/chunk-000/file-000.parquet # Structured metadata
```
This transition from individual episode files to file-based chunks dramatically improves performance and reduces storage overhead.
## What's New in Dataset v3.0
Dataset v3.0 introduces significant improvements for handling large datasets:
### 🏗️ **Enhanced File Organization**
- **File-based structure**: Episodes are now grouped into chunked files rather than individual episode files
- **Configurable file sizes**: tunable size limits for data and video files
- **Improved storage efficiency**: Better compression and reduced overhead
### 📊 **Modern Metadata Management**
- **Parquet-based metadata**: Replaced JSON Lines with efficient parquet format
- **Structured episode access**: Direct pandas DataFrame access via `dataset.meta.episodes` (see the sketch after this list)
- **Per-episode statistics**: Enhanced statistics tracking at episode level
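A minimal sketch of that episode metadata access; the repo id is a placeholder for any dataset already in v3.0 format, and the columns are the ones documented for `meta.episodes`:

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your_id/droid_1.0.1")
episodes = dataset.meta.episodes  # a pandas DataFrame, one row per episode
print(episodes[["episode_index", "length", "dataset_from_index", "dataset_to_index"]].head())
```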
### 🚀 **Performance Enhancements**
- **Memory-mapped access**: Improved RAM usage through PyArrow memory mapping
- **Faster loading**: Significantly reduced dataset initialization time
- **Better scalability**: Designed for datasets with millions of episodes
## Prerequisites
Before porting large datasets, ensure you have:
- **LeRobot installed** with v3.0 support. Follow our [Installation Guide](./installation).
- **Sufficient storage**: Raw datasets can be very large (e.g., DROID requires 2TB)
- **Cluster access** (recommended for large datasets): SLURM or similar job scheduler
- **Dataset-specific dependencies**: For DROID, you'll need TensorFlow Dataset utilities
## Understanding the DROID Dataset
[DROID 1.0.1](https://droid-dataset.github.io/droid/the-droid-dataset) is an excellent example of a large-scale robotic dataset:
- **Size**: 1.7TB (RLDS format), 8.7TB (raw data)
- **Structure**: 2048 pre-defined TensorFlow dataset shards
- **Content**: 76,000+ robot manipulation trajectories from Franka Emika Panda robots
- **Scope**: Real-world manipulation tasks across multiple environments and objects
- **Format**: Originally in TensorFlow Records/RLDS format, requiring conversion to LeRobot format
- **Hosting**: Google Cloud Storage with public access via `gsutil`
The dataset contains diverse manipulation demonstrations with:
- Multiple camera views (wrist camera, exterior cameras)
- Natural language task descriptions
- Robot proprioceptive state and actions
- Success/failure annotations
### DROID Features Schema
```python
DROID_FEATURES = {
# Episode markers
"is_first": {"dtype": "bool", "shape": (1,)},
"is_last": {"dtype": "bool", "shape": (1,)},
"is_terminal": {"dtype": "bool", "shape": (1,)},
# Language instructions
"language_instruction": {"dtype": "string", "shape": (1,)},
"language_instruction_2": {"dtype": "string", "shape": (1,)},
"language_instruction_3": {"dtype": "string", "shape": (1,)},
# Robot state
"observation.state.gripper_position": {"dtype": "float32", "shape": (1,)},
"observation.state.cartesian_position": {"dtype": "float32", "shape": (6,)},
"observation.state.joint_position": {"dtype": "float32", "shape": (7,)},
# Camera observations
"observation.images.wrist_left": {"dtype": "image"},
"observation.images.exterior_1_left": {"dtype": "image"},
"observation.images.exterior_2_left": {"dtype": "image"},
# Actions
"action.gripper_position": {"dtype": "float32", "shape": (1,)},
"action.cartesian_position": {"dtype": "float32", "shape": (6,)},
"action.joint_position": {"dtype": "float32", "shape": (7,)},
# Standard LeRobot format
"observation.state": {"dtype": "float32", "shape": (8,)}, # joints + gripper
"action": {"dtype": "float32", "shape": (8,)}, # joints + gripper
}
```
## Approach 1: Single Computer Porting
### Step 1: Install Dependencies
For DROID specifically:
```bash
pip install tensorflow
pip install tensorflow_datasets
```
For other datasets, install the appropriate readers for your source format.
### Step 2: Download Raw Data
Download DROID from Google Cloud Storage using `gsutil`:
```bash
# Install Google Cloud SDK if not already installed
# https://cloud.google.com/sdk/docs/install
# Download the full RLDS dataset (1.7TB)
gsutil -m cp -r gs://gresearch/robotics/droid/1.0.1 /your/data/
# Or download just the 100-episode sample (2GB) for testing
gsutil -m cp -r gs://gresearch/robotics/droid_100 /your/data/
```
> [!WARNING]
> Large datasets require substantial time and storage:
>
> - **Full DROID (1.7TB)**: Several days to download depending on bandwidth
> - **Processing time**: 7+ days for local porting of full dataset
> - **Upload time**: 3+ days to push to Hugging Face Hub
> - **Local storage**: ~400GB for processed LeRobot format
### Step 3: Port the Dataset
```bash
python examples/port_datasets/droid_rlds/port_droid.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1 \
--push-to-hub
```
### Development and Testing
For development, you can port a single shard:
```bash
python examples/port_datasets/droid_rlds/port_droid.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1_test \
--num-shards 2048 \
--shard-index 0
```
This approach works for smaller datasets or testing, but large datasets require cluster computing.
## Approach 2: SLURM Cluster Porting (Recommended)
For large datasets like DROID, parallel processing across multiple nodes dramatically reduces processing time.
### Step 1: Install Cluster Dependencies
```bash
pip install datatrove # Hugging Face's distributed processing library
```
### Step 2: Configure Your SLURM Environment
Find your partition information:
```bash
sinfo --format="%R" # List available partitions
sinfo -N -p your_partition -h -o "%N cpus=%c mem=%m" # Check resources
```
Choose a **CPU partition** - no GPU needed for dataset porting.
### Step 3: Launch Parallel Porting Jobs
```bash
python examples/port_datasets/droid_rlds/slurm_port_shards.py \
--raw-dir /your/data/droid/1.0.1 \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name port_droid \
--partition your_partition \
--workers 2048 \
--cpus-per-task 8 \
--mem-per-cpu 1950M
```
#### Parameter Guidelines
- **`--workers`**: Number of parallel jobs (max 2048 for DROID's shard count)
- **`--cpus-per-task`**: 8 CPUs recommended for frame encoding parallelization
- **`--mem-per-cpu`**: ~16GB total RAM (8×1950M) for loading raw frames
> [!TIP]
> Start with fewer workers (e.g., 100) to test your cluster configuration before launching thousands of jobs.
### Step 4: Monitor Progress
Check running jobs:
```bash
squeue -u $USER
```
Monitor overall progress:
```bash
jobs_status /your/logs
```
Inspect individual job logs:
```bash
less /your/logs/port_droid/slurm_jobs/JOB_ID_WORKER_ID.out
```
Debug failed jobs:
```bash
failed_logs /your/logs/port_droid
```
### Step 5: Aggregate Shards
Once all porting jobs complete:
```bash
python examples/port_datasets/droid_rlds/slurm_aggregate_shards.py \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name aggr_droid \
--partition your_partition \
--workers 2048 \
--cpus-per-task 8 \
--mem-per-cpu 1950M
```
### Step 6: Upload to Hub
```bash
python examples/port_datasets/droid_rlds/slurm_upload.py \
--repo-id your_id/droid_1.0.1 \
--logs-dir /your/logs \
--job-name upload_droid \
--partition your_partition \
--workers 50 \
--cpus-per-task 4 \
--mem-per-cpu 1950M
```
> [!NOTE]
> Upload uses fewer workers (50) since it's network-bound rather than compute-bound.
## Dataset v3.0 File Structure
Your completed dataset will have this modern structure:
```
dataset/
├── meta/
│ ├── episodes/
│ │ └── chunk-000/
│ │ └── file-000.parquet # Episode metadata
│ ├── tasks.parquet # Task definitions
│ ├── stats.json # Aggregated statistics
│ └── info.json # Dataset information
├── data/
│ └── chunk-000/
│ └── file-000.parquet # Consolidated episode data
└── videos/
└── camera_key/
└── chunk-000/
└── file-000.mp4 # Consolidated video files
```
This replaces the old episode-per-file structure with efficient, optimally-sized chunks.
## Migrating from Dataset v2.1
If you have existing datasets in v2.1 format, use the migration tool:
```bash
python src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py \
--repo-id your_id/existing_dataset
```
This automatically:
- Converts file structure to v3.0 format
- Migrates metadata from JSON Lines to parquet
- Aggregates statistics and creates per-episode stats
- Updates version information
## Performance Benefits
Dataset v3.0 provides significant improvements for large datasets:
- **Faster loading**: 3-5x reduction in initialization time
- **Memory efficiency**: Better RAM usage through memory mapping
- **Scalable processing**: Handles millions of episodes efficiently
- **Storage optimization**: Reduced file count and improved compression
+3 -3
@@ -92,11 +92,11 @@ print(dataset.hf_dataset)
# LeRobot datasets also subclass PyTorch datasets so you can do everything you know and love from working
# with the latter, like iterating through the dataset.
# The __getitem__ iterates over the frames of the dataset. Since our datasets are also structured by
# episodes, you can access the frame indices of any episode using dataset.meta.episodes. Here, we access
# episodes, you can access the frame indices of any episode using the episode_data_index. Here, we access
# frame indices associated to the first episode:
episode_index = 0
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]
from_idx = dataset.episode_data_index["from"][episode_index].item()
to_idx = dataset.episode_data_index["to"][episode_index].item()
# Then we grab all the image frames from the first camera:
camera_key = dataset.meta.camera_keys[0]
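A hedged continuation of the truncated example above, gathering that episode's frames for the chosen camera with the indices computed via `episode_data_index`:

```python
frames = [dataset[idx][camera_key] for idx in range(from_idx, to_idx)]
```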
@@ -1,503 +0,0 @@
import json
import logging
import shutil
import time
from pathlib import Path
import h5py
import numpy as np
import pandas as pd
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
EPISODES_DIR,
get_video_duration_in_s,
get_video_size_in_mb,
update_chunk_file_indices,
write_info,
)
from lerobot.datasets.video_utils import concat_video_files
from lerobot.utils.utils import get_elapsed_time_in_days_hours_minutes_seconds
AGIBOT_FPS = 30
AGIBOT_ROBOT_TYPE = "AgiBot_A2D"
AGIBOT_FEATURES = {
# gripper open range in mm (0 for pull open, 1 for full close)
"observation.state.effector.position": {
"dtype": "float32",
"shape": (2,),
"names": {
"axes": ["left_gripper", "right_gripper"],
},
},
# flange xyz in meters
"observation.state.end.position": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["left_x", "left_y", "left_z", "right_x", "right_y", "right_z"],
},
},
# flange quaternion with xyzw
"observation.state.end.orientation": {
"dtype": "float32",
"shape": (8,),
"names": {
"axes": ["left_x", "left_y", "left_z", "left_w", "right_x", "right_y", "right_z", "right_w"],
},
},
# in radians
"observation.state.head.position": {
"dtype": "float32",
"shape": (2,),
"names": {
"axes": ["yaw", "pitch"],
},
},
# in motor steps
"observation.state.joint.current_value": {
"dtype": "float32",
"shape": (14,),
"names": {
"axes": [f"left_joint_{i}" for i in range(7)] + [f"right_joint_{i}" for i in range(7)],
},
},
# same as current_value but in radians
"observation.state.joint.position": {
"dtype": "float32",
"shape": (14,),
"names": {
"axes": [f"left_joint_{i}" for i in range(7)] + [f"right_joint_{i}" for i in range(7)],
},
},
# pitch in radians, lift in meters
"observation.state.waist.position": {
"dtype": "float32",
"shape": (2,),
"names": {
"axes": ["pitch", "lift"],
},
},
# concatenation of head.position, joint.position, effector.position, waist.position
"observation.state": {
"dtype": "float32",
"shape": (20,),
"names": {
"axes": ["head_yaw", "head_pitch"]
+ [f"left_joint_{i}" for i in range(7)]
+ ["left_gripper"]
+ [f"right_joint_{i}" for i in range(7)]
+ ["right_gripper"]
+ ["waist_pitch", "waist_lift"],
},
},
# gripper open range in mm (0 for pull open, 1 for full close)
"action.effector.position": {
"dtype": "float32",
"shape": (2,),
"names": {
"axes": ["left_gripper", "right_gripper"],
},
},
# flange xyz in meters
"action.end.position": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["left_x", "left_y", "left_z", "right_x", "right_y", "right_z"],
},
},
# flange quaternion with xyzw
"action.end.orientation": {
"dtype": "float32",
"shape": (8,),
"names": {
"axes": ["left_x", "left_y", "left_z", "left_w", "right_x", "right_y", "right_z", "right_w"],
},
},
# in radians
"action.head.position": {
"dtype": "float32",
"shape": (2,),
"names": {
"axes": ["yaw", "pitch"],
},
},
# goal joint position in radians
"action.joint.position": {
"dtype": "float32",
"shape": (14,),
"names": {
"axes": [f"left_joint_{i}" for i in range(7)] + [f"right_joint_{i}" for i in range(7)],
},
},
"action.robot.velocity": {
"dtype": "float32",
"shape": (2,),
"names": {
"axes": ["velocity_x", "yaw_rate"],
},
},
# pitch in radians, lift in meters
"action.waist.position": {
"dtype": "float32",
"shape": (2,),
"names": {
"axes": ["pitch", "lift"],
},
},
# concatenation of head.position, joint.position, effector.position, waist.position, robot.velocity
"action": {
"dtype": "float32",
"shape": (22,),
"names": {
"axes": ["head_yaw", "head_pitch"]
+ [f"left_joint_{i}" for i in range(7)]
+ ["left_gripper"]
+ [f"right_joint_{i}" for i in range(7)]
+ ["right_gripper"]
+ ["waist_pitch", "waist_lift"]
+ ["velocity_x", "yaw_rate"],
},
},
# episode level annotation
"init_scene_text": {
"dtype": "string",
"shape": (1,),
"names": None,
},
# frame level annotation
"action_text": {
"dtype": "string",
"shape": (1,),
"names": None,
},
# frame level annotation
"skill": {
"dtype": "string",
"shape": (1,),
"names": None,
},
}
AGIBOT_IMAGES_FEATURES = {
"observation.images.top_head": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "channel"],
},
"observation.images.hand_left": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "channel"],
},
"observation.images.hand_right": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "channel"],
},
"observation.images.head_center_fisheye": {
"dtype": "video",
"shape": (748, 960, 3),
"names": ["height", "width", "channel"],
},
"observation.images.head_left_fisheye": {
"dtype": "video",
"shape": (748, 960, 3),
"names": ["height", "width", "channel"],
},
"observation.images.head_right_fisheye": {
"dtype": "video",
"shape": (748, 960, 3),
"names": ["height", "width", "channel"],
},
"observation.images.back_left_fisheye": {
"dtype": "video",
"shape": (748, 960, 3),
"names": ["height", "width", "channel"],
},
"observation.images.back_right_fisheye": {
"dtype": "video",
"shape": (748, 960, 3),
"names": ["height", "width", "channel"],
},
}
def load_info_per_task(raw_dir):
info_per_task = {}
task_info_dir = raw_dir / "task_info"
for path in task_info_dir.glob("task_*.json"):
task_index = int(path.name.replace("task_", "").replace(".json", ""))
with open(path) as f:
task_info = json.load(f)
task_info = {ep["episode_id"]: ep for ep in task_info}
info_per_task[task_index] = task_info
return info_per_task
def create_frame_idx_to_frames_label_idx(ep_info):
frame_idx_to_frames_label_idx = {}
for label_idx, frames_label in enumerate(ep_info["label_info"]["action_config"]):
for frame_idx in range(frames_label["start_frame"], frames_label["end_frame"]):
frame_idx_to_frames_label_idx[frame_idx] = label_idx
return frame_idx_to_frames_label_idx
def generate_lerobot_frames(raw_dir: Path, task_index: int, episode_index: int):
r"""/!\ The frames don't contain observation.cameras.*"""
info_per_task = load_info_per_task(raw_dir)
ep_info = info_per_task[task_index][episode_index]
frame_idx_to_frames_label_idx = create_frame_idx_to_frames_label_idx(ep_info)
# Empty features are commented out.
keys_mapping = {
# STATE
# "observation.state.effector.force": "state/effector/force",
"observation.state.effector.position": "state/effector/position",
# "observation.state.end.angular": "state/end/angular",
"observation.state.end.position": "state/end/position",
"observation.state.end.orientation": "state/end/orientation",
# "observation.state.end.velocity": "state/end/velocity",
# "observation.state.end.wrench": "state/end/wrench",
# "observation.state.head.effort": "state/head/effort",
"observation.state.head.position": "state/head/position",
# "observation.state.head.velocity": "state/head/velocity",
"observation.state.joint.current_value": "state/joint/current_value",
# "observation.state.joint.effort": "state/joint/effort",
"observation.state.joint.position": "state/joint/position",
# "observation.state.joint.velocity": "state/joint/velocity",
# "observation.state.robot.orientation": "state/robot/orientation",
# "observation.state.robot.orientation_drift": "state/robot/orientation_drift",
# "observation.state.robot.position": "state/robot/position",
# "observation.state.robot.position_drift": "state/robot/position_drift",
# "observation.state.waist.effort": "state/waist/effort",
"observation.state.waist.position": "state/waist/position",
# "observation.state.waist.velocity": "state/waist/velocity",
# ----- ACTION (index are also commented out) -----
# "action.effector.index": "action/effector/index",
"action.effector.position": "action/effector/position",
# "action.effector.force": "action/effector/force",
# "action.end.index": "action/end/index",
"action.end.position": "action/end/position",
"action.end.orientation": "action/end/orientation",
# "action.head.index": "action/head/index",
"action.head.position": "action/head/position",
# "action.joint.index": "action/joint/index",
"action.joint.position": "action/joint/position",
# "action.joint.effort": "action/joint/effort",
# "action.joint.velocity": "action/joint/velocity",
# "action.robot.index": "action/robot/index",
# "action.robot.position": "action/robot/position",
# "action.robot.orientation": "action/robot/orientation",
# "action.robot.angular": "action/robot/angular",
"action.robot.velocity": "action/robot/velocity",
# "action.waist.index": "action/waist/index",
"action.waist.position": "action/waist/position",
}
h5_path = raw_dir / f"proprio_stats/{task_index}/{episode_index}/proprio_stats.h5"
with h5py.File(h5_path) as h5:
num_frames = len(h5["state/joint/position"])
for h5_key in keys_mapping.values():
col_num_frames = h5[h5_key].shape[0]
if col_num_frames != num_frames:
raise ValueError(
f"HDF5 column '{h5_key}' is expected to have {num_frames} frames but has {col_num_frames} instead."
)
for i in range(num_frames):
# Create frame
f = {new_key: h5[h5_key][i] for new_key, h5_key in keys_mapping.items()}
for key in f:
f[key] = np.array(f[key]).astype(np.float32)
f["observation.state.end.position"] = f["observation.state.end.position"].reshape(6)
f["observation.state.end.orientation"] = f["observation.state.end.orientation"].reshape(8)
f["observation.state"] = np.concatenate(
[
f["observation.state.head.position"],
f["observation.state.joint.position"][:7], # left
f["observation.state.effector.position"][[0]], # left
f["observation.state.joint.position"][7:], # right
f["observation.state.effector.position"][[1]], # right
f["observation.state.waist.position"],
]
)
f["action.end.position"] = f["action.end.position"].reshape(6)
f["action.end.orientation"] = f["action.end.orientation"].reshape(8)
f["action"] = np.concatenate(
[
f["action.head.position"],
f["action.joint.position"][:7], # left
f["action.effector.position"][[0]], # left
f["action.joint.position"][7:], # right
f["action.effector.position"][[1]], # right
f["action.waist.position"],
f["action.robot.velocity"],
]
)
# episode level annotation
f["task"] = ep_info["task_name"]
f["init_scene_text"] = ep_info["init_scene_text"]
# frame level annotation
if i in frame_idx_to_frames_label_idx:
frames_label_idx = frame_idx_to_frames_label_idx[i]
frames_label = ep_info["label_info"]["action_config"][frames_label_idx]
f["action_text"] = frames_label["action_text"]
f["skill"] = frames_label["skill"]
else:
f["action_text"] = ""
f["skill"] = ""
yield f
def update_meta_data(
df,
ep_to_meta,
):
def _update(row):
ep_idx = row["episode_index"]
for key, meta in ep_to_meta[ep_idx].items():
row[f"videos/{key}/chunk_index"] = meta["chunk_index"]
row[f"videos/{key}/file_index"] = meta["file_index"]
row[f"videos/{key}/from_timestamp"] = meta["from_timestamp"]
row[f"videos/{key}/to_timestamp"] = meta["to_timestamp"]
return row
return df.apply(_update, axis=1)
def move_videos_to_lerobot_directory(lerobot_dataset, raw_dir, task_index, episode_names):
keys_mapping = {
"observation.images.top_head": "head_color",
"observation.images.hand_left": "hand_left_color",
"observation.images.hand_right": "hand_right_color",
"observation.images.head_center_fisheye": "head_center_fisheye_color",
"observation.images.head_left_fisheye": "head_left_fisheye_color",
"observation.images.head_right_fisheye": "head_right_fisheye_color",
"observation.images.back_left_fisheye": "back_left_fisheye_color",
"observation.images.back_right_fisheye": "back_right_fisheye_color",
}
# sanity check
for key in keys_mapping:
if key not in lerobot_dataset.meta.info["features"]:
raise ValueError(f"Key '{key}' not found in features.")
video_keys = keys_mapping.keys()
chunk_idx = dict.fromkeys(video_keys, 0)
file_idx = dict.fromkeys(video_keys, 0)
latest_duration_in_s = dict.fromkeys(video_keys, 0)
ep_to_meta = {}
for ep_idx, ep_name in enumerate(episode_names):
for key in video_keys:
raw_videos_dir = raw_dir / f"observations/{task_index}/{ep_name}/videos"
old_key = keys_mapping[key]
ep_path = raw_videos_dir / f"{old_key}.mp4"
ep_duration_in_s = get_video_duration_in_s(ep_path)
aggr_path = lerobot_dataset.root / DEFAULT_VIDEO_PATH.format(
video_key=key,
chunk_index=chunk_idx[key],
file_index=file_idx[key],
)
if not aggr_path.exists():
# First video
aggr_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(str(ep_path), str(aggr_path))
else:
size_in_mb = get_video_size_in_mb(ep_path)
aggr_size_in_mb = get_video_size_in_mb(aggr_path)
if aggr_size_in_mb + size_in_mb >= DEFAULT_VIDEO_FILE_SIZE_IN_MB:
# Size limit is reached, prepare a new video file
chunk_idx[key], file_idx[key] = update_chunk_file_indices(
chunk_idx[key], file_idx[key], DEFAULT_CHUNK_SIZE
)
aggr_path = lerobot_dataset.root / DEFAULT_VIDEO_PATH.format(
video_key=key,
chunk_index=chunk_idx[key],
file_index=file_idx[key],
)
aggr_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(str(ep_path), str(aggr_path))
latest_duration_in_s[key] = 0
else:
# Append the episode's video to the existing file
concat_video_files(
[aggr_path, ep_path],
lerobot_dataset.root,
key,
chunk_idx[key],
file_idx[key],
)
if ep_idx not in ep_to_meta:
ep_to_meta[ep_idx] = {}
ep_to_meta[ep_idx][key] = {
"chunk_index": chunk_idx[key],
"file_index": file_idx[key],
"from_timestamp": latest_duration_in_s[key],
"to_timestamp": latest_duration_in_s[key] + ep_duration_in_s,
}
latest_duration_in_s[key] += ep_duration_in_s
# Update episodes meta data
for meta_path in (lerobot_dataset.root / EPISODES_DIR).glob("chunk-*/file-*.parquet"):
df = pd.read_parquet(meta_path)
df = update_meta_data(df, ep_to_meta)
df.to_parquet(meta_path)
def port_agibot(
raw_dir: Path, repo_id: str, task_index: int, episode_indices: list[int], push_to_hub: bool = False
):
lerobot_dataset = LeRobotDataset.create(
repo_id=repo_id,
robot_type=AGIBOT_ROBOT_TYPE,
fps=AGIBOT_FPS,
features=AGIBOT_FEATURES,
)
start_time = time.time()
num_episodes = len(episode_indices)
logging.info(f"Number of episodes {num_episodes}")
for i, episode_index in enumerate(episode_indices):
elapsed_time = time.time() - start_time
d, h, m, s = get_elapsed_time_in_days_hours_minutes_seconds(elapsed_time)
logging.info(
f"{i} / {num_episodes} episodes processed (after {d} days, {h} hours, {m} minutes, {s:.3f} seconds)"
)
for frame in generate_lerobot_frames(raw_dir, task_index, episode_index):
lerobot_dataset.add_frame(frame)
lerobot_dataset.save_episode()
logging.info("Save_episode")
# Videos have already been encoded with the proper format, so we rely on hacks
# HACK: Add extra images features
lerobot_dataset.meta.info["features"].update(AGIBOT_IMAGES_FEATURES)
write_info(lerobot_dataset.meta.info, lerobot_dataset.meta.root)
move_videos_to_lerobot_directory(lerobot_dataset, raw_dir, task_index, episode_indices)
if push_to_hub:
lerobot_dataset.push_to_hub(
# Add agibot tag, since it belongs to the agibot collection of datasets
tags=["agibot"],
private=False,
)
@@ -1,198 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import tarfile
from pathlib import Path
from datatrove.executor import LocalPipelineExecutor
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.base import PipelineStep
from port_datasets.agibot_hdf5.download import (
RAW_REPO_ID,
download_meta_data,
get_observations_files,
)
class PortAgiBotShards(PipelineStep):
def __init__(
self,
raw_dir: Path | str,
repo_id: str = None,
):
super().__init__()
self.raw_dir = Path(raw_dir)
self.repo_id = repo_id
def run(self, data=None, rank: int = 0, world_size: int = 1):
import shutil
from datasets.utils.tqdm import disable_progress_bars
from port_datasets.agibot_hdf5.download import (
RAW_REPO_ID,
download,
get_observations_files,
no_depth,
)
from port_datasets.agibot_hdf5.port_agibot import port_agibot
from port_datasets.droid_rlds.port_droid import validate_dataset
from lerobot.constants import HF_LEROBOT_HOME
from lerobot.utils.utils import init_logging
init_logging()
disable_progress_bars()
shard_repo_id = f"{self.repo_id}_world_{world_size}_rank_{rank}"
dataset_dir = HF_LEROBOT_HOME / shard_repo_id
if dataset_dir.exists():
shutil.rmtree(dataset_dir)
obs_files, _ = get_observations_files(self.raw_dir, RAW_REPO_ID)
obs_file = obs_files[rank]
# Download subset
download(self.raw_dir, allow_patterns=obs_file)
tar_path = self.raw_dir / obs_file
with tarfile.open(tar_path, "r") as tar:
extracted_files = tar.getnames()
task_index = int(tar_path.parent.name)
episode_names = [int(p) for p in extracted_files if "/" not in p]
# Untar if needed
if not all((tar_path.parent / f"{ep_name}").exists() for ep_name in episode_names):
logging.info(f"Untar-ing {tar_path}...")
with tarfile.open(tar_path, "r") as tar:
tar.extractall(path=tar_path.parent, filter=no_depth) # nosec B202
port_agibot(self.raw_dir, shard_repo_id, task_index, episode_names, push_to_hub=False)
for ep_name in episode_names:
shutil.rmtree(str(tar_path.parent / f"{ep_name}"))
tar_path.unlink()
validate_dataset(shard_repo_id)
def make_port_executor(
raw_dir, repo_id, job_name, logs_dir, workers, partition, cpus_per_task, mem_per_cpu, slurm=True
):
download_meta_data(raw_dir)
obs_files, _ = get_observations_files(raw_dir, RAW_REPO_ID)
num_shards = len(obs_files)
kwargs = {
"pipeline": [
PortAgiBotShards(raw_dir, repo_id),
],
"logging_dir": str(logs_dir / job_name),
}
if slurm:
kwargs.update(
{
"job_name": job_name,
"tasks": num_shards,
"workers": workers,
"time": "08:00:00",
"partition": partition,
"cpus_per_task": cpus_per_task,
"sbatch_args": {"mem-per-cpu": mem_per_cpu},
}
)
executor = SlurmPipelineExecutor(**kwargs)
else:
kwargs.update(
{
"tasks": num_shards,
"workers": 1,
}
)
executor = LocalPipelineExecutor(**kwargs)
return executor
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--raw-dir",
type=Path,
required=True,
help="Directory containing input raw datasets (e.g. `path/to/dataset` or `path/to/dataset/version`).",
)
parser.add_argument(
"--repo-id",
type=str,
help="Repository identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True.",
)
parser.add_argument(
"--logs-dir",
type=Path,
help="Path to logs directory for `datatrove`.",
)
parser.add_argument(
"--job-name",
type=str,
default="port_droid",
help="Job name used in slurm, and name of the directory created inside the provided logs directory.",
)
parser.add_argument(
"--slurm",
type=int,
default=1,
help="Launch over slurm. Use `--slurm 0` to launch sequentially (useful to debug).",
)
parser.add_argument(
"--workers",
type=int,
default=2048,
help="Number of slurm workers. It should be less than the maximum number of shards.",
)
parser.add_argument(
"--partition",
type=str,
help="Slurm partition. Ideally a CPU partition. No need for GPU partition.",
)
parser.add_argument(
"--cpus-per-task",
type=int,
default=8,
help="Number of cpus that each slurm worker will use.",
)
parser.add_argument(
"--mem-per-cpu",
type=str,
default="1950M",
help="Memory per cpu that each worker will use.",
)
args = parser.parse_args()
kwargs = vars(args)
kwargs["slurm"] = kwargs.pop("slurm") == 1
port_executor = make_port_executor(**kwargs)
port_executor.run()
if __name__ == "__main__":
main()
@@ -1,85 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
from pathlib import Path
def find_missing_workers(completions_dir, world_size):
"""Find workers that are not completed and returns their indices."""
full = list(range(world_size))
completed = []
for path in completions_dir.glob("*"):
if path.name in [".", ".."]:
continue
index = path.name.lstrip("0")
index = 0 if index == "" else int(index)
completed.append(index)
missing_workers = set(full) - set(completed)
return missing_workers
def find_output_files(slurm_dir, worker_indices):
"""Find output files associated to worker indices, and return tuples
of (worker index, output file path)
"""
out_files = []
for path in slurm_dir.glob("*.out"):
_, worker_id = path.name.replace(".out", "").split("_")
worker_id = int(worker_id)
if worker_id in worker_indices:
out_files.append((worker_id, path))
return out_files
def display_error_files(logs_dir, job_name):
executor_path = Path(logs_dir) / job_name / "executor.json"
completions_dir = Path(logs_dir) / job_name / "completions"
with open(executor_path) as f:
executor = json.load(f)
missing_workers = find_missing_workers(completions_dir, executor["world_size"])
for missing in sorted(missing_workers)[::-1]:
print(missing)
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--logs-dir",
type=str,
help="Path to logs directory for `datatrove`.",
)
parser.add_argument(
"--job-name",
type=str,
default="port_droid",
help="Job name used in slurm, and name of the directory created inside the provided logs directory.",
)
args = parser.parse_args()
display_error_files(**vars(args))
if __name__ == "__main__":
main()
@@ -1,430 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import time
from pathlib import Path
import numpy as np
import tensorflow_datasets as tfds
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.utils.utils import get_elapsed_time_in_days_hours_minutes_seconds
DROID_SHARDS = 2048
DROID_FPS = 15
DROID_ROBOT_TYPE = "Franka"
# Dataset schema slightly adapted from: https://droid-dataset.github.io/droid/the-droid-dataset.html#-dataset-schema
DROID_FEATURES = {
# true on first step of the episode
"is_first": {
"dtype": "bool",
"shape": (1,),
"names": None,
},
# true on last step of the episode
"is_last": {
"dtype": "bool",
"shape": (1,),
"names": None,
},
# true on last step of the episode if it is a terminal step, True for demos
"is_terminal": {
"dtype": "bool",
"shape": (1,),
"names": None,
},
# language_instruction is also stored as "task" to follow LeRobot standard
"language_instruction": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"language_instruction_2": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"language_instruction_3": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"observation.state.gripper_position": {
"dtype": "float32",
"shape": (1,),
"names": {
"axes": ["gripper"],
},
},
"observation.state.cartesian_position": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"observation.state.joint_position": {
"dtype": "float32",
"shape": (7,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6"],
},
},
# Add this new feature to follow LeRobot standard of using joint position + gripper
"observation.state": {
"dtype": "float32",
"shape": (8,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6", "gripper"],
},
},
# Initially called wrist_image_left
"observation.images.wrist_left": {
"dtype": "video",
"shape": (180, 320, 3),
"names": [
"height",
"width",
"channels",
],
},
# Initially called exterior_image_1_left
"observation.images.exterior_1_left": {
"dtype": "video",
"shape": (180, 320, 3),
"names": [
"height",
"width",
"channels",
],
},
# Initially called exterior_image_2_left
"observation.images.exterior_2_left": {
"dtype": "video",
"shape": (180, 320, 3),
"names": [
"height",
"width",
"channels",
],
},
"action.gripper_position": {
"dtype": "float32",
"shape": (1,),
"names": {
"axes": ["gripper"],
},
},
"action.gripper_velocity": {
"dtype": "float32",
"shape": (1,),
"names": {
"axes": ["gripper"],
},
},
"action.cartesian_position": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"action.cartesian_velocity": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"action.joint_position": {
"dtype": "float32",
"shape": (7,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6"],
},
},
"action.joint_velocity": {
"dtype": "float32",
"shape": (7,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6"],
},
},
# This feature was called "action" in RLDS dataset and consists of [6x joint velocities, 1x gripper position]
"action.original": {
"dtype": "float32",
"shape": (7,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw", "gripper"],
},
},
# Add this new feature to follow LeRobot standard of using joint position + gripper
"action": {
"dtype": "float32",
"shape": (8,),
"names": {
"axes": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6", "gripper"],
},
},
"discount": {
"dtype": "float32",
"shape": (1,),
"names": None,
},
"reward": {
"dtype": "float32",
"shape": (1,),
"names": None,
},
# Meta data that are the same for all frames in the episode
"task_category": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"building": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"collector_id": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"date": {
"dtype": "string",
"shape": (1,),
"names": None,
},
"camera_extrinsics.wrist_left": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"camera_extrinsics.exterior_1_left": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"camera_extrinsics.exterior_2_left": {
"dtype": "float32",
"shape": (6,),
"names": {
"axes": ["x", "y", "z", "roll", "pitch", "yaw"],
},
},
"is_episode_successful": {
"dtype": "bool",
"shape": (1,),
"names": None,
},
}
def is_episode_successful(tf_episode_metadata):
# Adapted from: https://github.com/droid-dataset/droid_policy_learning/blob/dd1020eb20d981f90b5ff07dc80d80d5c0cb108b/robomimic/utils/rlds_utils.py#L8
return "/success/" in tf_episode_metadata["file_path"].numpy().decode()
def generate_lerobot_frames(tf_episode):
m = tf_episode["episode_metadata"]
frame_meta = {
"task_category": m["building"].numpy().decode(),
"building": m["building"].numpy().decode(),
"collector_id": m["collector_id"].numpy().decode(),
"date": m["date"].numpy().decode(),
"camera_extrinsics.wrist_left": m["extrinsics_wrist_cam"].numpy(),
"camera_extrinsics.exterior_1_left": m["extrinsics_exterior_cam_1"].numpy(),
"camera_extrinsics.exterior_2_left": m["extrinsics_exterior_cam_2"].numpy(),
"is_episode_successful": np.array([is_episode_successful(m)]),
}
for f in tf_episode["steps"]:
# Dataset schema slightly adapted from: https://droid-dataset.github.io/droid/the-droid-dataset.html#-dataset-schema
frame = {
"is_first": np.array([f["is_first"].numpy()]),
"is_last": np.array([f["is_last"].numpy()]),
"is_terminal": np.array([f["is_terminal"].numpy()]),
"language_instruction": f["language_instruction"].numpy().decode(),
"language_instruction_2": f["language_instruction_2"].numpy().decode(),
"language_instruction_3": f["language_instruction_3"].numpy().decode(),
"observation.state.gripper_position": f["observation"]["gripper_position"].numpy(),
"observation.state.cartesian_position": f["observation"]["cartesian_position"].numpy(),
"observation.state.joint_position": f["observation"]["joint_position"].numpy(),
"observation.images.wrist_left": f["observation"]["wrist_image_left"].numpy(),
"observation.images.exterior_1_left": f["observation"]["exterior_image_1_left"].numpy(),
"observation.images.exterior_2_left": f["observation"]["exterior_image_2_left"].numpy(),
"action.gripper_position": f["action_dict"]["gripper_position"].numpy(),
"action.gripper_velocity": f["action_dict"]["gripper_velocity"].numpy(),
"action.cartesian_position": f["action_dict"]["cartesian_position"].numpy(),
"action.cartesian_velocity": f["action_dict"]["cartesian_velocity"].numpy(),
"action.joint_position": f["action_dict"]["joint_position"].numpy(),
"action.joint_velocity": f["action_dict"]["joint_velocity"].numpy(),
"discount": np.array([f["discount"].numpy()]),
"reward": np.array([f["reward"].numpy()]),
"action.original": f["action"].numpy(),
}
# language_instruction is also stored as "task" to follow LeRobot standard
frame["task"] = frame["language_instruction"]
# Add this new feature to follow LeRobot standard of using joint position + gripper
frame["observation.state"] = np.concatenate(
[frame["observation.state.joint_position"], frame["observation.state.gripper_position"]]
)
frame["action"] = np.concatenate([frame["action.joint_position"], frame["action.gripper_position"]])
# Meta data that are the same for all frames in the episode
frame.update(frame_meta)
# Cast fp64 to fp32
for key in frame:
if isinstance(frame[key], np.ndarray) and frame[key].dtype == np.float64:
frame[key] = frame[key].astype(np.float32)
yield frame
def port_droid(
raw_dir: Path,
repo_id: str,
push_to_hub: bool = False,
num_shards: int | None = None,
shard_index: int | None = None,
):
dataset_name = raw_dir.parent.name
version = raw_dir.name
data_dir = raw_dir.parent.parent
builder = tfds.builder(f"{dataset_name}/{version}", data_dir=data_dir, version="")
if num_shards is not None:
tfds_num_shards = builder.info.splits["train"].num_shards
if tfds_num_shards != DROID_SHARDS:
raise ValueError(
f"Number of shards of Droid dataset is expected to be {DROID_SHARDS} but is {tfds_num_shards}."
)
if num_shards != tfds_num_shards:
raise ValueError(
f"We only shard over the fixed number of shards provided by tensorflow dataset ({tfds_num_shards}), but {num_shards} shards provided instead."
)
if shard_index >= tfds_num_shards:
raise ValueError(
f"Shard index is greater than the num of shards ({shard_index} >= {num_shards})."
)
raw_dataset = builder.as_dataset(split=f"train[{shard_index}shard]")
else:
raw_dataset = builder.as_dataset(split="train")
lerobot_dataset = LeRobotDataset.create(
repo_id=repo_id,
robot_type=DROID_ROBOT_TYPE,
fps=DROID_FPS,
features=DROID_FEATURES,
)
start_time = time.time()
num_episodes = raw_dataset.cardinality().numpy().item()
logging.info(f"Number of episodes {num_episodes}")
for episode_index, episode in enumerate(raw_dataset):
elapsed_time = time.time() - start_time
d, h, m, s = get_elapsed_time_in_days_hours_minutes_seconds(elapsed_time)
logging.info(
f"{episode_index} / {num_episodes} episodes processed (after {d} days, {h} hours, {m} minutes, {s:.3f} seconds)"
)
for frame in generate_lerobot_frames(episode):
lerobot_dataset.add_frame(frame)
lerobot_dataset.save_episode()
logging.info("Save_episode")
if push_to_hub:
lerobot_dataset.push_to_hub(
# Add openx tag, since it belongs to the openx collection of datasets
tags=["openx"],
private=False,
)
def validate_dataset(repo_id):
"""Sanity check that ensure meta data can be loaded and all files are present."""
meta = LeRobotDatasetMetadata(repo_id)
if meta.total_episodes == 0:
raise ValueError("Number of episodes is 0.")
for ep_idx in range(meta.total_episodes):
data_path = meta.root / meta.get_data_file_path(ep_idx)
if not data_path.exists():
raise ValueError(f"Parquet file is missing in: {data_path}")
for vid_key in meta.video_keys:
vid_path = meta.root / meta.get_video_file_path(ep_idx, vid_key)
if not vid_path.exists():
raise ValueError(f"Video file is missing in: {vid_path}")
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--raw-dir",
type=Path,
required=True,
help="Directory containing input raw datasets (e.g. `path/to/dataset` or `path/to/dataset/version`).",
)
parser.add_argument(
"--repo-id",
type=str,
help="Repository identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True.",
)
parser.add_argument(
"--push-to-hub",
action="store_true",
help="Upload to hub.",
)
parser.add_argument(
"--num-shards",
type=int,
default=None,
help="Number of shards. Can be either None to load the full dataset, or 2048 to load one of the 2048 tensorflow dataset files.",
)
parser.add_argument(
"--shard-index",
type=int,
default=None,
help="Index of the shard. Can be either None to load the full dataset, or in [0,2047] to load one of the 2048 tensorflow dataset files.",
)
args = parser.parse_args()
port_droid(**vars(args))
if __name__ == "__main__":
main()
@@ -1,148 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path
from datatrove.executor import LocalPipelineExecutor
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.base import PipelineStep
from port_datasets.droid_rlds.port_droid import DROID_SHARDS
from lerobot.datasets.aggregate import aggregate_datasets
from lerobot.utils.utils import init_logging
class AggregateDatasets(PipelineStep):
def __init__(
self,
repo_ids: list[str],
aggregated_repo_id: str,
):
super().__init__()
self.repo_ids = repo_ids
self.aggr_repo_id = aggregated_repo_id
def run(self, data=None, rank: int = 0, world_size: int = 1):
init_logging()
# Since aggregate_datasets already handles parallel processing internally,
# we only need one worker to run the entire aggregation
if rank == 0:
logging.info(f"Starting aggregation of {len(self.repo_ids)} datasets into {self.aggr_repo_id}")
aggregate_datasets(self.repo_ids, self.aggr_repo_id)
logging.info("Aggregation complete!")
else:
logging.info(f"Worker {rank} skipping - only worker 0 performs aggregation")
def make_aggregate_executor(
repo_ids, repo_id, job_name, logs_dir, workers, partition, cpus_per_task, mem_per_cpu, slurm=True
):
kwargs = {
"pipeline": [
AggregateDatasets(repo_ids, repo_id),
],
"logging_dir": str(logs_dir / job_name),
}
if slurm:
# For aggregation, we only need 1 task since aggregate_datasets handles everything
kwargs.update(
{
"job_name": job_name,
"tasks": 1, # Only need 1 task for aggregation
"workers": 1, # Only need 1 worker
"time": "08:00:00",
"partition": partition,
"cpus_per_task": cpus_per_task,
"sbatch_args": {"mem-per-cpu": mem_per_cpu},
}
)
executor = SlurmPipelineExecutor(**kwargs)
else:
kwargs.update(
{
"tasks": 1,
"workers": 1,
}
)
executor = LocalPipelineExecutor(**kwargs)
return executor
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--repo-id",
type=str,
help="Repository identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True.",
)
parser.add_argument(
"--logs-dir",
type=Path,
help="Path to logs directory for `datatrove`.",
)
parser.add_argument(
"--job-name",
type=str,
default="aggr_droid",
help="Job name used in slurm, and name of the directory created inside the provided logs directory.",
)
parser.add_argument(
"--slurm",
type=int,
default=1,
help="Launch over slurm. Use `--slurm 0` to launch sequentially (useful to debug).",
)
parser.add_argument(
"--workers",
type=int,
default=1, # Changed default to 1 since aggregation doesn't need multiple workers
help="Number of slurm workers. For aggregation, this should be 1.",
)
parser.add_argument(
"--partition",
type=str,
help="Slurm partition. Ideally a CPU partition. No need for GPU partition.",
)
parser.add_argument(
"--cpus-per-task",
type=int,
default=8,
help="Number of cpus that each slurm worker will use.",
)
parser.add_argument(
"--mem-per-cpu",
type=str,
default="1950M",
help="Memory per cpu that each worker will use.",
)
args = parser.parse_args()
kwargs = vars(args)
kwargs["slurm"] = kwargs.pop("slurm") == 1
repo_ids = [f"{args.repo_id}_world_{DROID_SHARDS}_rank_{rank}" for rank in range(DROID_SHARDS)]
aggregate_executor = make_aggregate_executor(repo_ids, **kwargs)
aggregate_executor.run()
if __name__ == "__main__":
main()
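A minimal sketch of running the aggregation step locally (the equivalent of `--slurm 0`), using `make_aggregate_executor` as defined above; repo IDs and paths are placeholders:
```
from pathlib import Path

repo_ids = [f"my-user/droid_world_2048_rank_{rank}" for rank in range(2048)]
executor = make_aggregate_executor(
    repo_ids,
    repo_id="my-user/droid",  # placeholder aggregated repo id
    job_name="aggr_droid",
    logs_dir=Path("./logs"),
    workers=1,
    partition=None,  # unused when slurm=False
    cpus_per_task=8,
    mem_per_cpu="1950M",
    slurm=False,  # run with LocalPipelineExecutor for debugging
)
executor.run()
```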
@@ -1,162 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
from datatrove.executor import LocalPipelineExecutor
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.base import PipelineStep
from port_datasets.droid_rlds.port_droid import DROID_SHARDS
class PortDroidShards(PipelineStep):
def __init__(
self,
raw_dir: Path | str,
repo_id: str = None,
):
super().__init__()
self.raw_dir = Path(raw_dir)
self.repo_id = repo_id
def run(self, data=None, rank: int = 0, world_size: int = 1):
from datasets.utils.tqdm import disable_progress_bars
from port_datasets.droid_rlds.port_droid import port_droid, validate_dataset
from lerobot.utils.utils import init_logging
init_logging()
disable_progress_bars()
shard_repo_id = f"{self.repo_id}_world_{world_size}_rank_{rank}"
try:
validate_dataset(shard_repo_id)
return
except Exception:
pass # nosec B110 - Dataset doesn't exist yet, continue with porting
port_droid(
self.raw_dir,
shard_repo_id,
push_to_hub=False,
num_shards=world_size,
shard_index=rank,
)
validate_dataset(shard_repo_id)
def make_port_executor(
raw_dir, repo_id, job_name, logs_dir, workers, partition, cpus_per_task, mem_per_cpu, slurm=True
):
kwargs = {
"pipeline": [
PortDroidShards(raw_dir, repo_id),
],
"logging_dir": str(logs_dir / job_name),
}
if slurm:
kwargs.update(
{
"job_name": job_name,
"tasks": DROID_SHARDS,
"workers": workers,
"time": "08:00:00",
"partition": partition,
"cpus_per_task": cpus_per_task,
"sbatch_args": {"mem-per-cpu": mem_per_cpu},
}
)
executor = SlurmPipelineExecutor(**kwargs)
else:
kwargs.update(
{
"tasks": 1,
"workers": 1,
}
)
executor = LocalPipelineExecutor(**kwargs)
return executor
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--raw-dir",
type=Path,
required=True,
help="Directory containing input raw datasets (e.g. `path/to/dataset` or `path/to/dataset/version).",
)
parser.add_argument(
"--repo-id",
type=str,
help="Repositery identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True.",
)
parser.add_argument(
"--logs-dir",
type=Path,
help="Path to logs directory for `datatrove`.",
)
parser.add_argument(
"--job-name",
type=str,
default="port_droid",
help="Job name used in slurm, and name of the directory created inside the provided logs directory.",
)
parser.add_argument(
"--slurm",
type=int,
default=1,
help="Launch over slurm. Use `--slurm 0` to launch sequentially (useful to debug).",
)
parser.add_argument(
"--workers",
type=int,
default=2048,
help="Number of slurm workers. It should be less than the maximum number of shards.",
)
parser.add_argument(
"--partition",
type=str,
help="Slurm partition. Ideally a CPU partition. No need for GPU partition.",
)
parser.add_argument(
"--cpus-per-task",
type=int,
default=8,
help="Number of cpus that each slurm worker will use.",
)
parser.add_argument(
"--mem-per-cpu",
type=str,
default="1950M",
help="Memory per cpu that each worker will use.",
)
args = parser.parse_args()
kwargs = vars(args)
kwargs["slurm"] = kwargs.pop("slurm") == 1
port_executor = make_port_executor(**kwargs)
port_executor.run()
if __name__ == "__main__":
main()
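Similarly, a hedged sketch of a local debug run of the sharded port step, assuming the `make_port_executor` signature above and placeholder paths:
```
from pathlib import Path

executor = make_port_executor(
    raw_dir=Path("/data/droid/1.0.0"),  # placeholder path
    repo_id="my-user/droid",            # placeholder repo id
    job_name="port_droid",
    logs_dir=Path("./logs"),
    workers=1,
    partition=None,  # unused when slurm=False
    cpus_per_task=8,
    mem_per_cpu="1950M",
    slurm=False,  # LocalPipelineExecutor, single task
)
executor.run()
```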
@@ -1,281 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path
from datatrove.executor import LocalPipelineExecutor
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.base import PipelineStep
from huggingface_hub import HfApi
from huggingface_hub.constants import REPOCARD_NAME
from port_datasets.droid_rlds.port_droid import DROID_SHARDS
from lerobot.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDatasetMetadata
from lerobot.datasets.utils import create_lerobot_dataset_card
from lerobot.utils.utils import init_logging
class UploadDataset(PipelineStep):
def __init__(
self,
repo_id: str,
branch: str | None = None,
revision: str | None = None,
tags: list | None = None,
license: str | None = "apache-2.0",
private: bool = False,
distant_repo_id: str | None = None,
**card_kwargs,
):
super().__init__()
self.repo_id = repo_id
self.distant_repo_id = self.repo_id if distant_repo_id is None else distant_repo_id
self.branch = branch
self.tags = tags
self.license = license
self.private = private
self.card_kwargs = card_kwargs
self.revision = revision if revision else CODEBASE_VERSION
if os.environ.get("HF_HUB_ENABLE_HF_TRANSFER", "0") != "1":
logging.warning(
'HF_HUB_ENABLE_HF_TRANSFER is not set to "1". Install hf_transfer and set the env '
"variable for faster uploads:\npip install hf-transfer\nexport HF_HUB_ENABLE_HF_TRANSFER=1"
)
self.create_repo()
def create_repo(self):
logging.info(f"Loading meta data from {self.repo_id}...")
meta = LeRobotDatasetMetadata(self.repo_id)
logging.info(f"Creating repo {self.distant_repo_id}...")
hub_api = HfApi()
hub_api.create_repo(
repo_id=self.distant_repo_id,
private=self.private,
repo_type="dataset",
exist_ok=True,
)
if self.branch:
hub_api.create_branch(
repo_id=self.distant_repo_id,
branch=self.branch,
revision=self.revision,
repo_type="dataset",
exist_ok=True,
)
if not hub_api.file_exists(
self.distant_repo_id, REPOCARD_NAME, repo_type="dataset", revision=self.branch
):
card = create_lerobot_dataset_card(
tags=self.tags, dataset_info=meta.info, license=self.license, **self.card_kwargs
)
card.push_to_hub(repo_id=self.distant_repo_id, repo_type="dataset", revision=self.branch)
hub_api.create_tag(self.distant_repo_id, tag=CODEBASE_VERSION, repo_type="dataset")
def list_files_recursively(directory):
base_path = Path(directory)
return [str(file.relative_to(base_path)) for file in base_path.rglob("*") if file.is_file()]
logging.info(f"Listing all local files from {self.repo_id}...")
self.file_paths = list_files_recursively(meta.root)
self.file_paths = sorted(self.file_paths)
def create_chunks(self, lst, n):
from itertools import islice
it = iter(lst)
return [list(islice(it, size)) for size in [len(lst) // n + (i < len(lst) % n) for i in range(n)]]
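# Illustrative note (not part of the original file): the size list spreads the
# remainder over the first chunks, e.g. create_chunks(list(range(10)), 3)
# -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]].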
def create_commits(self, additions):
import logging
import math
import random
import time
from huggingface_hub import create_commit
from huggingface_hub.utils import HfHubHTTPError
FILES_BETWEEN_COMMITS = 10 # noqa: N806
BASE_DELAY = 0.1 # noqa: N806
MAX_RETRIES = 12 # noqa: N806
# Split the files into smaller chunks for faster commit
# and avoiding "A commit has happened since" error
num_chunks = math.ceil(len(additions) / FILES_BETWEEN_COMMITS)
chunks = self.create_chunks(additions, num_chunks)
for chunk in chunks:
retries = 0
while True:
try:
create_commit(
self.distant_repo_id,
repo_type="dataset",
operations=chunk,
commit_message=f"DataTrove upload ({len(chunk)} files)",
revision=self.branch,
)
# TODO: every 100 chunks super_squach_commits()
logging.info("create_commit completed!")
break
except HfHubHTTPError as e:
if "A commit has happened since" in e.server_message:
if retries >= MAX_RETRIES:
logging.error(f"Failed to create commit after {MAX_RETRIES=}. Giving up.")
raise e
logging.info("Commit creation race condition issue. Waiting...")
time.sleep(BASE_DELAY * 2**retries + random.uniform(0, 2))
retries += 1
else:
raise e
def run(self, data=None, rank: int = 0, world_size: int = 1):
import logging
from datasets.utils.tqdm import disable_progress_bars
from huggingface_hub import CommitOperationAdd, preupload_lfs_files
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.utils.utils import init_logging
init_logging()
disable_progress_bars()
chunks = self.create_chunks(self.file_paths, world_size)
file_paths = chunks[rank]
if len(file_paths) == 0:
raise ValueError(f"No files assigned to rank {rank} (world_size={world_size}).")
logging.info("Pre-uploading LFS files...")
for i, path in enumerate(file_paths):
logging.info(f"{i}: {path}")
meta = LeRobotDatasetMetadata(self.repo_id)
additions = [
CommitOperationAdd(path_in_repo=path, path_or_fileobj=meta.root / path) for path in file_paths
]
preupload_lfs_files(
repo_id=self.distant_repo_id, repo_type="dataset", additions=additions, revision=self.branch
)
logging.info("Creating commits...")
self.create_commits(additions)
logging.info("Done!")
def make_upload_executor(
repo_id, job_name, logs_dir, workers, partition, cpus_per_task, mem_per_cpu, slurm=True
):
kwargs = {
"pipeline": [
UploadDataset(repo_id),
],
"logging_dir": str(logs_dir / job_name),
}
if slurm:
kwargs.update(
{
"job_name": job_name,
"tasks": DROID_SHARDS,
"workers": workers,
"time": "08:00:00",
"partition": partition,
"cpus_per_task": cpus_per_task,
"sbatch_args": {"mem-per-cpu": mem_per_cpu},
}
)
executor = SlurmPipelineExecutor(**kwargs)
else:
kwargs.update(
{
"tasks": DROID_SHARDS,
"workers": 1,
}
)
executor = LocalPipelineExecutor(**kwargs)
return executor
def main():
parser = argparse.ArgumentParser()
parser.add_argument(
"--repo-id",
type=str,
help="Repositery identifier on Hugging Face: a community or a user name `/` the name of the dataset, required when push-to-hub is True.",
)
parser.add_argument(
"--logs-dir",
type=Path,
help="Path to logs directory for `datatrove`.",
)
parser.add_argument(
"--job-name",
type=str,
default="upload_droid",
help="Job name used in slurm, and name of the directory created inside the provided logs directory.",
)
parser.add_argument(
"--slurm",
type=int,
default=1,
help="Launch over slurm. Use `--slurm 0` to launch sequentially (useful to debug).",
)
parser.add_argument(
"--workers",
type=int,
default=50,
help="Number of slurm workers. It should be less than the maximum number of shards.",
)
parser.add_argument(
"--partition",
type=str,
help="Slurm partition. Ideally a CPU partition. No need for GPU partition.",
)
parser.add_argument(
"--cpus-per-task",
type=int,
default=8,
help="Number of cpus that each slurm worker will use.",
)
parser.add_argument(
"--mem-per-cpu",
type=str,
default="1950M",
help="Memory per cpu that each worker will use.",
)
init_logging()
args = parser.parse_args()
kwargs = vars(args)
kwargs["slurm"] = kwargs.pop("slurm") == 1
upload_executor = make_upload_executor(**kwargs)
upload_executor.run()
if __name__ == "__main__":
main()
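As with the other steps, a minimal local sketch, assuming the `make_upload_executor` signature above and a placeholder repo id:
```
from pathlib import Path

executor = make_upload_executor(
    repo_id="my-user/droid",  # placeholder repo id
    job_name="upload_droid",
    logs_dir=Path("./logs"),
    workers=1,
    partition=None,  # unused when slurm=False
    cpus_per_task=8,
    mem_per_cpu="1950M",
    slurm=False,
)
executor.run()
```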
-111
@@ -1,111 +0,0 @@
#!/usr/bin/env python
"""
Example script demonstrating dataset tools utilities.
This script shows how to:
1. Delete episodes from a dataset
2. Split a dataset into train/val sets
3. Add/remove features
4. Merge datasets
Usage:
python examples/use_dataset_tools.py
"""
import numpy as np
from lerobot.datasets.dataset_tools import (
add_feature,
delete_episodes,
merge_datasets,
remove_feature,
split_dataset,
)
from lerobot.datasets.lerobot_dataset import LeRobotDataset
def main():
# Load an existing dataset (replace with your dataset)
dataset = LeRobotDataset("lerobot/pusht")
print(f"Original dataset: {dataset.meta.total_episodes} episodes, {dataset.meta.total_frames} frames")
print(f"Features: {list(dataset.meta.features.keys())}")
# Example 1: Delete episodes
print("\n1. Deleting episodes 0 and 2...")
filtered_dataset = delete_episodes(dataset, episode_indices=[0, 2], repo_id="pusht_filtered")
print(f"Filtered dataset: {filtered_dataset.meta.total_episodes} episodes")
# Example 2: Split dataset
print("\n2. Splitting dataset into train/val...")
splits = split_dataset(
dataset,
splits={"train": 0.8, "val": 0.2},
)
print(f"Train split: {splits['train'].meta.total_episodes} episodes")
print(f"Val split: {splits['val'].meta.total_episodes} episodes")
# Example 3: Add a feature
print("\n3. Adding a reward feature...")
# Method 1: Pre-computed values
reward_values = np.random.randn(dataset.meta.total_frames).astype(np.float32)
dataset_with_reward = add_feature(
dataset,
feature_name="reward",
feature_values=reward_values,
feature_info={
"dtype": "float32",
"shape": (1,),
"names": None,
},
repo_id="pusht_with_reward",
)
# Method 2: Using a callable
def compute_success(frame_dict, episode_idx, frame_idx):
# Example: mark last 10 frames of each episode as successful
episode_length = 100  # Placeholder; in practice you'd get this from episode metadata
return float(frame_idx >= episode_length - 10)
dataset_with_success = add_feature(
dataset_with_reward,
feature_name="success",
feature_values=compute_success,
feature_info={
"dtype": "float32",
"shape": (1,),
"names": None,
},
repo_id="pusht_with_reward_and_success",
)
print(f"New features: {list(dataset_with_success.meta.features.keys())}")
# Example 4: Remove features
print("\n4. Removing the success feature...")
dataset_cleaned = remove_feature(dataset_with_success, feature_names="success", repo_id="pusht_cleaned")
print(f"Features after removal: {list(dataset_cleaned.meta.features.keys())}")
# Example 5: Merge datasets
print("\n5. Merging train and val splits back together...")
merged = merge_datasets([splits["train"], splits["val"]], output_repo_id="pusht_merged")
print(f"Merged dataset: {merged.meta.total_episodes} episodes")
# Example 6: Complex workflow
print("\n6. Complex workflow example...")
# Remove a camera if dataset has multiple
if len(dataset.meta.camera_keys) > 1:
camera_to_remove = dataset.meta.camera_keys[0]
print(f"Removing camera: {camera_to_remove}")
dataset_no_cam = remove_feature(
dataset, feature_names=camera_to_remove, repo_id="pusht_no_first_camera"
)
print(f"Remaining cameras: {dataset_no_cam.meta.camera_keys}")
print("\nDone! Check ~/.cache/huggingface/lerobot/ for the created datasets.")
if __name__ == "__main__":
main()
+5 -5
@@ -257,8 +257,8 @@ default.extend-ignore-identifiers-re = [
# color = true
# paths = ["src/lerobot"]
# [tool.mypy]
# python_version = "3.10"
# warn_return_any = true
# warn_unused_configs = true
# ignore_missing_imports = false
[tool.mypy]
python_version = "3.10"
warn_return_any = true
warn_unused_configs = true
ignore_missing_imports = false
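For context, a small sketch of the kind of issue `warn_return_any = true` is meant to surface (hypothetical function, not from this repo): returning a value typed `Any` from a function annotated with a concrete return type triggers a warning.
```
from typing import Any

def parse_count(raw: dict[str, Any]) -> int:
    # mypy with warn_return_any flags this line: the looked-up value
    # is typed `Any`, not the declared `int`.
    return raw["count"]
```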
-505
@@ -1,505 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import shutil
from pathlib import Path
import pandas as pd
import tqdm
from lerobot.datasets.compute_stats import aggregate_stats
from lerobot.datasets.lerobot_dataset import LeRobotDatasetMetadata
from lerobot.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_EPISODES_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
get_parquet_file_size_in_mb,
get_video_size_in_mb,
to_parquet_with_hf_images,
update_chunk_file_indices,
write_info,
write_stats,
write_tasks,
)
from lerobot.datasets.video_utils import concat_video_files
def validate_all_metadata(all_metadata: list[LeRobotDatasetMetadata]):
"""Validates that all dataset metadata have consistent properties.
Ensures all datasets have the same fps, robot_type, and features to guarantee
compatibility when aggregating them into a single dataset.
Args:
all_metadata: List of LeRobotDatasetMetadata objects to validate.
Returns:
tuple: A tuple containing (fps, robot_type, features) from the first metadata.
Raises:
ValueError: If any metadata has different fps, robot_type, or features
than the first metadata in the list.
"""
fps = all_metadata[0].fps
robot_type = all_metadata[0].robot_type
features = all_metadata[0].features
for meta in tqdm.tqdm(all_metadata, desc="Validate all meta data"):
if fps != meta.fps:
raise ValueError(f"Same fps is expected, but got fps={meta.fps} instead of {fps}.")
if robot_type != meta.robot_type:
raise ValueError(
f"Same robot_type is expected, but got robot_type={meta.robot_type} instead of {robot_type}."
)
if features != meta.features:
raise ValueError(
f"Same features is expected, but got features={meta.features} instead of {features}."
)
return fps, robot_type, features
def update_data_df(df, src_meta, dst_meta):
"""Updates a data DataFrame with new indices and task mappings for aggregation.
Adjusts episode indices, frame indices, and task indices to account for
previously aggregated data in the destination dataset.
Args:
df: DataFrame containing the data to be updated.
src_meta: Source dataset metadata.
dst_meta: Destination dataset metadata.
Returns:
pd.DataFrame: Updated DataFrame with adjusted indices.
"""
def _update(row):
row["episode_index"] = row["episode_index"] + dst_meta.info["total_episodes"]
row["index"] = row["index"] + dst_meta.info["total_frames"]
task = src_meta.tasks.iloc[row["task_index"]].name
row["task_index"] = dst_meta.tasks.loc[task].task_index.item()
return row
return df.apply(_update, axis=1)
def update_meta_data(
df,
dst_meta,
meta_idx,
data_idx,
videos_idx,
):
"""Updates metadata DataFrame with new chunk, file, and timestamp indices.
Adjusts all indices and timestamps to account for previously aggregated
data and videos in the destination dataset.
Args:
df: DataFrame containing the metadata to be updated.
dst_meta: Destination dataset metadata.
meta_idx: Dictionary containing current metadata chunk and file indices.
data_idx: Dictionary containing current data chunk and file indices.
videos_idx: Dictionary containing current video indices and timestamps.
Returns:
pd.DataFrame: Updated DataFrame with adjusted indices and timestamps.
"""
def _update(row):
row["meta/episodes/chunk_index"] = row["meta/episodes/chunk_index"] + meta_idx["chunk"]
row["meta/episodes/file_index"] = row["meta/episodes/file_index"] + meta_idx["file"]
row["data/chunk_index"] = row["data/chunk_index"] + data_idx["chunk"]
row["data/file_index"] = row["data/file_index"] + data_idx["file"]
for key, video_idx in videos_idx.items():
row[f"videos/{key}/chunk_index"] = row[f"videos/{key}/chunk_index"] + video_idx["chunk"]
row[f"videos/{key}/file_index"] = row[f"videos/{key}/file_index"] + video_idx["file"]
row[f"videos/{key}/from_timestamp"] = (
row[f"videos/{key}/from_timestamp"] + video_idx["latest_duration"]
)
row[f"videos/{key}/to_timestamp"] = (
row[f"videos/{key}/to_timestamp"] + video_idx["latest_duration"]
)
row["dataset_from_index"] = row["dataset_from_index"] + dst_meta.info["total_frames"]
row["dataset_to_index"] = row["dataset_to_index"] + dst_meta.info["total_frames"]
row["episode_index"] = row["episode_index"] + dst_meta.info["total_episodes"]
return row
return df.apply(_update, axis=1)
def aggregate_datasets(
repo_ids: list[str],
aggr_repo_id: str,
roots: list[Path] = None,
aggr_root: Path = None,
data_files_size_in_mb: float = None,
video_files_size_in_mb: float = None,
chunk_size: int = None,
):
"""Aggregates multiple LeRobot datasets into a single unified dataset.
This is the main function that orchestrates the aggregation process by:
1. Loading and validating all source dataset metadata
2. Creating a new destination dataset with unified tasks
3. Aggregating videos, data, and metadata from all source datasets
4. Finalizing the aggregated dataset with proper statistics
Args:
repo_ids: List of repository IDs for the datasets to aggregate.
aggr_repo_id: Repository ID for the aggregated output dataset.
roots: Optional list of root paths for the source datasets.
aggr_root: Optional root path for the aggregated dataset.
data_files_size_in_mb: Maximum size for data files in MB (defaults to DEFAULT_DATA_FILE_SIZE_IN_MB)
video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
"""
logging.info("Start aggregate_datasets")
if data_files_size_in_mb is None:
data_files_size_in_mb = DEFAULT_DATA_FILE_SIZE_IN_MB
if video_files_size_in_mb is None:
video_files_size_in_mb = DEFAULT_VIDEO_FILE_SIZE_IN_MB
if chunk_size is None:
chunk_size = DEFAULT_CHUNK_SIZE
all_metadata = (
[LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
if roots is None
else [
LeRobotDatasetMetadata(repo_id, root=root) for repo_id, root in zip(repo_ids, roots, strict=False)
]
)
fps, robot_type, features = validate_all_metadata(all_metadata)
video_keys = [key for key in features if features[key]["dtype"] == "video"]
dst_meta = LeRobotDatasetMetadata.create(
repo_id=aggr_repo_id,
fps=fps,
robot_type=robot_type,
features=features,
root=aggr_root,
)
logging.info("Find all tasks")
unique_tasks = pd.concat([m.tasks for m in all_metadata]).index.unique()
dst_meta.tasks = pd.DataFrame({"task_index": range(len(unique_tasks))}, index=unique_tasks)
meta_idx = {"chunk": 0, "file": 0}
data_idx = {"chunk": 0, "file": 0}
videos_idx = {
key: {"chunk": 0, "file": 0, "latest_duration": 0, "episode_duration": 0} for key in video_keys
}
dst_meta.episodes = {}
for src_meta in tqdm.tqdm(all_metadata, desc="Copy data and videos"):
videos_idx = aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size)
data_idx = aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size)
meta_idx = aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx)
dst_meta.info["total_episodes"] += src_meta.total_episodes
dst_meta.info["total_frames"] += src_meta.total_frames
finalize_aggregation(dst_meta, all_metadata)
logging.info("Aggregation complete.")
def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size):
"""Aggregates video chunks from a source dataset into the destination dataset.
Handles video file concatenation and rotation based on file size limits.
Creates new video files when size limits are exceeded.
Args:
src_meta: Source dataset metadata.
dst_meta: Destination dataset metadata.
videos_idx: Dictionary tracking video chunk and file indices.
video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
Returns:
dict: Updated videos_idx with current chunk and file indices.
"""
for key, video_idx in videos_idx.items():
unique_chunk_file_pairs = {
(chunk, file)
for chunk, file in zip(
src_meta.episodes[f"videos/{key}/chunk_index"],
src_meta.episodes[f"videos/{key}/file_index"],
strict=False,
)
}
unique_chunk_file_pairs = sorted(unique_chunk_file_pairs)
chunk_idx = video_idx["chunk"]
file_idx = video_idx["file"]
for src_chunk_idx, src_file_idx in unique_chunk_file_pairs:
src_path = src_meta.root / DEFAULT_VIDEO_PATH.format(
video_key=key,
chunk_index=src_chunk_idx,
file_index=src_file_idx,
)
dst_path = dst_meta.root / DEFAULT_VIDEO_PATH.format(
video_key=key,
chunk_index=chunk_idx,
file_index=file_idx,
)
# If a new file is created, we don't want to increment the latest_duration
update_latest_duration = False
if not dst_path.exists():
# First write to this destination file
dst_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(str(src_path), str(dst_path))
continue # not accumulating further, already copied the file in place
# Check file sizes before appending
src_size = get_video_size_in_mb(src_path)
dst_size = get_video_size_in_mb(dst_path)
if dst_size + src_size >= video_files_size_in_mb:
# Rotate to a new chunk/file
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, chunk_size)
dst_path = dst_meta.root / DEFAULT_VIDEO_PATH.format(
video_key=key,
chunk_index=chunk_idx,
file_index=file_idx,
)
dst_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(str(src_path), str(dst_path))
else:
# Get the timestamps shift for this video
timestamps_shift_s = dst_meta.info["total_frames"] / dst_meta.info["fps"]
# Append to existing video file
concat_video_files(
[dst_path, src_path],
dst_meta.root,
key,
chunk_idx,
file_idx,
)
# Update the latest_duration when appending (shifts timestamps!)
update_latest_duration = True
# Update the videos_idx with the final chunk and file indices for this key
videos_idx[key]["chunk"] = chunk_idx
videos_idx[key]["file"] = file_idx
if update_latest_duration:
videos_idx[key]["latest_duration"] += timestamps_shift_s
return videos_idx
def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size):
"""Aggregates data chunks from a source dataset into the destination dataset.
Reads source data files, updates indices to match the aggregated dataset,
and writes them to the destination with proper file rotation.
Args:
src_meta: Source dataset metadata.
dst_meta: Destination dataset metadata.
data_idx: Dictionary tracking data chunk and file indices.
Returns:
dict: Updated data_idx with current chunk and file indices.
"""
unique_chunk_file_ids = {
(c, f)
for c, f in zip(
src_meta.episodes["data/chunk_index"], src_meta.episodes["data/file_index"], strict=False
)
}
unique_chunk_file_ids = sorted(unique_chunk_file_ids)
for src_chunk_idx, src_file_idx in unique_chunk_file_ids:
src_path = src_meta.root / DEFAULT_DATA_PATH.format(
chunk_index=src_chunk_idx, file_index=src_file_idx
)
df = pd.read_parquet(src_path)
df = update_data_df(df, src_meta, dst_meta)
data_idx = append_or_create_parquet_file(
df,
src_path,
data_idx,
data_files_size_in_mb,
chunk_size,
DEFAULT_DATA_PATH,
contains_images=len(dst_meta.image_keys) > 0,
aggr_root=dst_meta.root,
)
return data_idx
def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx):
"""Aggregates metadata from a source dataset into the destination dataset.
Reads source metadata files, updates all indices and timestamps,
and writes them to the destination with proper file rotation.
Args:
src_meta: Source dataset metadata.
dst_meta: Destination dataset metadata.
meta_idx: Dictionary tracking metadata chunk and file indices.
data_idx: Dictionary tracking data chunk and file indices.
videos_idx: Dictionary tracking video indices and timestamps.
Returns:
dict: Updated meta_idx with current chunk and file indices.
"""
chunk_file_ids = {
(c, f)
for c, f in zip(
src_meta.episodes["meta/episodes/chunk_index"],
src_meta.episodes["meta/episodes/file_index"],
strict=False,
)
}
chunk_file_ids = sorted(chunk_file_ids)
for chunk_idx, file_idx in chunk_file_ids:
src_path = src_meta.root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
df = pd.read_parquet(src_path)
df = update_meta_data(
df,
dst_meta,
meta_idx,
data_idx,
videos_idx,
)
for k in videos_idx:
videos_idx[k]["latest_duration"] += videos_idx[k]["episode_duration"]
meta_idx = append_or_create_parquet_file(
df,
src_path,
meta_idx,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_CHUNK_SIZE,
DEFAULT_EPISODES_PATH,
contains_images=False,
aggr_root=dst_meta.root,
)
return meta_idx
def append_or_create_parquet_file(
df: pd.DataFrame,
src_path: Path,
idx: dict[str, int],
max_mb: float,
chunk_size: int,
default_path: str,
contains_images: bool = False,
aggr_root: Path = None,
):
"""Appends data to an existing parquet file or creates a new one based on size constraints.
Manages file rotation when size limits are exceeded to prevent individual files
from becoming too large. Handles both regular parquet files and those containing images.
Args:
df: DataFrame to write to the parquet file.
src_path: Path to the source file (used for size estimation).
idx: Dictionary containing current 'chunk' and 'file' indices.
max_mb: Maximum allowed file size in MB before rotation.
chunk_size: Maximum number of files per chunk before incrementing chunk index.
default_path: Format string for generating file paths.
contains_images: Whether the data contains images requiring special handling.
aggr_root: Root path for the aggregated dataset.
Returns:
dict: Updated index dictionary with current chunk and file indices.
"""
dst_path = aggr_root / default_path.format(chunk_index=idx["chunk"], file_index=idx["file"])
if not dst_path.exists():
dst_path.parent.mkdir(parents=True, exist_ok=True)
if contains_images:
to_parquet_with_hf_images(df, dst_path)
else:
df.to_parquet(dst_path)
return idx
src_size = get_parquet_file_size_in_mb(src_path)
dst_size = get_parquet_file_size_in_mb(dst_path)
if dst_size + src_size >= max_mb:
idx["chunk"], idx["file"] = update_chunk_file_indices(idx["chunk"], idx["file"], chunk_size)
new_path = aggr_root / default_path.format(chunk_index=idx["chunk"], file_index=idx["file"])
new_path.parent.mkdir(parents=True, exist_ok=True)
final_df = df
target_path = new_path
else:
existing_df = pd.read_parquet(dst_path)
final_df = pd.concat([existing_df, df], ignore_index=True)
target_path = dst_path
if contains_images:
to_parquet_with_hf_images(final_df, target_path)
else:
final_df.to_parquet(target_path)
return idx
def finalize_aggregation(aggr_meta, all_metadata):
"""Finalizes the dataset aggregation by writing summary files and statistics.
Writes the tasks file, info file with total counts and splits, and
aggregated statistics from all source datasets.
Args:
aggr_meta: Aggregated dataset metadata.
all_metadata: List of all source dataset metadata objects.
"""
logging.info("write tasks")
write_tasks(aggr_meta.tasks, aggr_meta.root)
logging.info("write info")
aggr_meta.info.update(
{
"total_tasks": len(aggr_meta.tasks),
"total_episodes": sum(m.total_episodes for m in all_metadata),
"total_frames": sum(m.total_frames for m in all_metadata),
"splits": {"train": f"0:{sum(m.total_episodes for m in all_metadata)}"},
}
)
write_info(aggr_meta.info, aggr_meta.root)
logging.info("write stats")
aggr_meta.stats = aggregate_stats([m.stats for m in all_metadata])
write_stats(aggr_meta.stats, aggr_meta.root)
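A minimal usage sketch for the aggregation entry point above, with placeholder repo IDs; all source datasets must share the same fps, robot_type, and features, as enforced by `validate_all_metadata`:
```
from lerobot.datasets.aggregate import aggregate_datasets

# Merge two datasets into one; sizes/chunking fall back to the defaults.
aggregate_datasets(
    repo_ids=["my-user/task_a", "my-user/task_b"],  # placeholder repo ids
    aggr_repo_id="my-user/task_a_and_b",            # placeholder output id
)
```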
+1 -20
@@ -47,18 +47,6 @@ If you encounter a problem, contact LeRobot maintainers on [Discord](https://dis
or open an [issue on GitHub](https://github.com/huggingface/lerobot/issues/new/choose).
"""
V30_MESSAGE = """
The dataset you requested ({repo_id}) is in {version} format.
While the current version of LeRobot is backward-compatible with it, your dataset still uses global
stats instead of per-episode stats. Update your dataset stats to the new format using this command:
```
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id={repo_id}
```
If you encounter a problem, contact LeRobot maintainers on [Discord](https://discord.com/invite/s3KuuzsPFb)
or open an [issue on GitHub](https://github.com/huggingface/lerobot/issues/new/choose).
"""
FUTURE_MESSAGE = """
The dataset you requested ({repo_id}) is only available in {version} format.
As we cannot ensure forward compatibility with it, please update your current version of lerobot.
@@ -70,14 +58,7 @@ class CompatibilityError(Exception): ...
class BackwardCompatibilityError(CompatibilityError):
def __init__(self, repo_id: str, version: packaging.version.Version):
if version.major == 3:
message = V30_MESSAGE.format(repo_id=repo_id, version=version)
elif version.major == 2:
message = V2_MESSAGE.format(repo_id=repo_id, version=version)
else:
raise NotImplementedError(
"Contact the maintainer on [Discord](https://discord.com/invite/s3KuuzsPFb)."
)
message = V2_MESSAGE.format(repo_id=repo_id, version=version)
super().__init__(message)
-761
@@ -1,761 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Dataset tools utilities for LeRobotDataset.
This module provides utilities for:
- Deleting episodes from datasets
- Splitting datasets into multiple smaller datasets
- Adding/removing features from datasets
- Merging datasets (wrapper around aggregate functionality)
"""
import logging
import shutil
from collections.abc import Callable
from pathlib import Path
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from lerobot.constants import HF_LEROBOT_HOME
from lerobot.datasets.aggregate import aggregate_datasets
from lerobot.datasets.compute_stats import aggregate_stats
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
get_parquet_file_size_in_mb,
get_video_size_in_mb,
to_parquet_with_hf_images,
update_chunk_file_indices,
write_info,
write_stats,
write_tasks,
)
def delete_episodes(
dataset: LeRobotDataset,
episode_indices: list[int],
output_dir: str | Path | None = None,
repo_id: str | None = None,
) -> LeRobotDataset:
"""Delete episodes from a LeRobotDataset and create a new dataset.
Args:
dataset: The source LeRobotDataset.
episode_indices: List of episode indices to delete.
output_dir: Directory to save the new dataset. If None, uses default location.
repo_id: Repository ID for the new dataset. If None, appends "_filtered" to original.
Returns:
LeRobotDataset: New dataset with episodes removed.
"""
if not episode_indices:
raise ValueError("No episodes to delete")
# Validate episode indices
valid_indices = set(range(dataset.meta.total_episodes))
invalid = set(episode_indices) - valid_indices
if invalid:
raise ValueError(f"Invalid episode indices: {invalid}")
logging.info(f"Deleting {len(episode_indices)} episodes from dataset")
# Create new dataset metadata
if repo_id is None:
repo_id = f"{dataset.repo_id}_filtered"
if output_dir is None:
output_dir = HF_LEROBOT_HOME / repo_id
else:
output_dir = Path(output_dir)
# Get episodes to keep
episodes_to_keep = [i for i in range(dataset.meta.total_episodes) if i not in episode_indices]
if not episodes_to_keep:
raise ValueError("Cannot delete all episodes from dataset")
# Create new dataset
new_meta = LeRobotDatasetMetadata.create(
repo_id=repo_id,
fps=dataset.meta.fps,
features=dataset.meta.features,
robot_type=dataset.meta.robot_type,
root=output_dir,
use_videos=len(dataset.meta.video_keys) > 0,
)
# Process episodes
episode_mapping = {} # old_idx -> new_idx
new_episode_idx = 0
for old_idx in tqdm(episodes_to_keep, desc="Processing episodes"):
episode_mapping[old_idx] = new_episode_idx
new_episode_idx += 1
# Copy data files and update indices
_copy_and_reindex_data(dataset, new_meta, episode_mapping)
# Copy video files if present
if dataset.meta.video_keys:
_copy_and_reindex_videos(dataset, new_meta, episode_mapping)
# Create new dataset instance
new_dataset = LeRobotDataset(
repo_id=repo_id,
root=output_dir,
image_transforms=dataset.image_transforms,
delta_timestamps=dataset.delta_timestamps,
tolerance_s=dataset.tolerance_s,
)
logging.info(f"Created new dataset with {len(episodes_to_keep)} episodes")
return new_dataset
def split_dataset(
dataset: LeRobotDataset,
splits: dict[str, list[int]] | dict[str, float],
output_dir: str | Path | None = None,
) -> dict[str, LeRobotDataset]:
"""Split a LeRobotDataset into multiple smaller datasets.
Args:
dataset: The source LeRobotDataset to split.
splits: Either a dict mapping split names to episode indices, or a dict mapping
split names to fractions (must sum to <= 1.0).
output_dir: Base directory for output datasets. If None, uses default location.
Returns:
dict[str, LeRobotDataset]: Dictionary mapping split names to new datasets.
Examples:
# Split by specific episodes
splits = {"train": [0, 1, 2], "val": [3, 4]}
datasets = split_dataset(dataset, splits)
# Split by fractions
splits = {"train": 0.8, "val": 0.2}
datasets = split_dataset(dataset, splits)
"""
if not splits:
raise ValueError("No splits provided")
# Convert fractions to episode indices if needed
if all(isinstance(v, float) for v in splits.values()):
splits = _fractions_to_episode_indices(dataset.meta.total_episodes, splits)
# Validate episodes
all_episodes = set()
for split_name, episodes in splits.items():
if not episodes:
raise ValueError(f"Split '{split_name}' has no episodes")
episode_set = set(episodes)
if episode_set & all_episodes:
raise ValueError("Episodes cannot appear in multiple splits")
all_episodes.update(episode_set)
# Validate all episodes are valid
valid_indices = set(range(dataset.meta.total_episodes))
invalid = all_episodes - valid_indices
if invalid:
raise ValueError(f"Invalid episode indices: {invalid}")
if output_dir is None:
output_dir = HF_LEROBOT_HOME
else:
output_dir = Path(output_dir)
result_datasets = {}
for split_name, episodes in splits.items():
logging.info(f"Creating split '{split_name}' with {len(episodes)} episodes")
# Create repo_id for split
split_repo_id = f"{dataset.repo_id}_{split_name}"
split_output_dir = output_dir / split_repo_id
# Create episode mapping
episode_mapping = {old_idx: new_idx for new_idx, old_idx in enumerate(sorted(episodes))}
# Create new dataset metadata
new_meta = LeRobotDatasetMetadata.create(
repo_id=split_repo_id,
fps=dataset.meta.fps,
features=dataset.meta.features,
robot_type=dataset.meta.robot_type,
root=split_output_dir,
use_videos=len(dataset.meta.video_keys) > 0,
)
# Copy data and videos
_copy_and_reindex_data(dataset, new_meta, episode_mapping)
if dataset.meta.video_keys:
_copy_and_reindex_videos(dataset, new_meta, episode_mapping)
# Create new dataset instance
new_dataset = LeRobotDataset(
repo_id=split_repo_id,
root=split_output_dir,
image_transforms=dataset.image_transforms,
delta_timestamps=dataset.delta_timestamps,
tolerance_s=dataset.tolerance_s,
)
result_datasets[split_name] = new_dataset
return result_datasets
def merge_datasets(
datasets: list[LeRobotDataset],
output_repo_id: str,
output_dir: str | Path | None = None,
) -> LeRobotDataset:
"""Merge multiple LeRobotDatasets into a single dataset.
This is a wrapper around the aggregate_datasets functionality with a cleaner API.
Args:
datasets: List of LeRobotDatasets to merge.
output_repo_id: Repository ID for the merged dataset.
output_dir: Directory to save the merged dataset. If None, uses default location.
Returns:
LeRobotDataset: The merged dataset.
"""
if not datasets:
raise ValueError("No datasets to merge")
if output_dir is None:
output_dir = HF_LEROBOT_HOME / output_repo_id
else:
output_dir = Path(output_dir)
# Extract repo_ids and roots
repo_ids = [ds.repo_id for ds in datasets]
roots = [ds.root for ds in datasets]
# Call aggregate_datasets
aggregate_datasets(
repo_ids=repo_ids,
aggr_repo_id=output_repo_id,
roots=roots,
aggr_root=output_dir,
)
# Create and return the merged dataset
merged_dataset = LeRobotDataset(
repo_id=output_repo_id,
root=output_dir,
image_transforms=datasets[0].image_transforms,
delta_timestamps=datasets[0].delta_timestamps,
tolerance_s=datasets[0].tolerance_s,
)
return merged_dataset
def add_feature(
dataset: LeRobotDataset,
feature_name: str,
feature_values: np.ndarray | torch.Tensor | Callable,
feature_info: dict,
output_dir: str | Path | None = None,
repo_id: str | None = None,
) -> LeRobotDataset:
"""Add a new feature to a LeRobotDataset.
Args:
dataset: The source LeRobotDataset.
feature_name: Name of the new feature.
feature_values: Either:
- Array/tensor of shape (num_frames, ...) with values for each frame
- Callable that takes (frame_dict, episode_index, frame_index) and returns feature value
feature_info: Dictionary with feature metadata (dtype, shape, names).
output_dir: Directory to save the new dataset. If None, uses default location.
repo_id: Repository ID for the new dataset. If None, appends "_modified" to original.
Returns:
LeRobotDataset: New dataset with the added feature.
"""
if feature_name in dataset.meta.features:
raise ValueError(f"Feature '{feature_name}' already exists in dataset")
if repo_id is None:
repo_id = f"{dataset.repo_id}_modified"
if output_dir is None:
output_dir = HF_LEROBOT_HOME / repo_id
else:
output_dir = Path(output_dir)
# Validate feature_info
required_keys = {"dtype", "shape"}
if not required_keys.issubset(feature_info.keys()):
raise ValueError(f"feature_info must contain keys: {required_keys}")
# Create new features dict
new_features = dataset.meta.features.copy()
new_features[feature_name] = feature_info
# Create new dataset metadata
new_meta = LeRobotDatasetMetadata.create(
repo_id=repo_id,
fps=dataset.meta.fps,
features=new_features,
robot_type=dataset.meta.robot_type,
root=output_dir,
use_videos=len(dataset.meta.video_keys) > 0,
)
# Process data with new feature
_copy_data_with_feature_changes(
dataset=dataset,
new_meta=new_meta,
add_features={feature_name: (feature_values, feature_info)},
)
# Copy videos if present
if dataset.meta.video_keys:
_copy_videos(dataset, new_meta)
# Create new dataset instance
new_dataset = LeRobotDataset(
repo_id=repo_id,
root=output_dir,
image_transforms=dataset.image_transforms,
delta_timestamps=dataset.delta_timestamps,
tolerance_s=dataset.tolerance_s,
)
return new_dataset
def remove_feature(
dataset: LeRobotDataset,
feature_names: str | list[str],
output_dir: str | Path | None = None,
repo_id: str | None = None,
) -> LeRobotDataset:
"""Remove features from a LeRobotDataset.
Args:
dataset: The source LeRobotDataset.
feature_names: Name(s) of features to remove. Can be a single string or list.
output_dir: Directory to save the new dataset. If None, uses default location.
repo_id: Repository ID for the new dataset. If None, appends "_modified" to original.
Returns:
LeRobotDataset: New dataset with features removed.
"""
if isinstance(feature_names, str):
feature_names = [feature_names]
# Validate features exist
for name in feature_names:
if name not in dataset.meta.features:
raise ValueError(f"Feature '{name}' not found in dataset")
# Check if trying to remove required features
required_features = {"timestamp", "frame_index", "episode_index", "index", "task_index"}
if any(name in required_features for name in feature_names):
raise ValueError(f"Cannot remove required features: {required_features}")
if repo_id is None:
repo_id = f"{dataset.repo_id}_modified"
if output_dir is None:
output_dir = HF_LEROBOT_HOME / repo_id
else:
output_dir = Path(output_dir)
# Create new features dict
new_features = {k: v for k, v in dataset.meta.features.items() if k not in feature_names}
# Check if removing video features
video_keys_to_remove = [name for name in feature_names if name in dataset.meta.video_keys]
# Check if videos will remain after removal
remaining_video_keys = [k for k in dataset.meta.video_keys if k not in video_keys_to_remove]
# Create new dataset metadata
new_meta = LeRobotDatasetMetadata.create(
repo_id=repo_id,
fps=dataset.meta.fps,
features=new_features,
robot_type=dataset.meta.robot_type,
root=output_dir,
use_videos=len(remaining_video_keys) > 0,
)
# Process data with removed features
_copy_data_with_feature_changes(
dataset=dataset,
new_meta=new_meta,
remove_features=feature_names,
)
# Copy videos (excluding removed ones)
if new_meta.video_keys:
_copy_videos(dataset, new_meta, exclude_keys=video_keys_to_remove)
# Create new dataset instance
new_dataset = LeRobotDataset(
repo_id=repo_id,
root=output_dir,
image_transforms=dataset.image_transforms,
delta_timestamps=dataset.delta_timestamps,
tolerance_s=dataset.tolerance_s,
)
return new_dataset
# Helper functions
def _fractions_to_episode_indices(
total_episodes: int,
splits: dict[str, float],
) -> dict[str, list[int]]:
"""Convert split fractions to episode indices."""
if sum(splits.values()) > 1.0:
raise ValueError("Split fractions must sum to <= 1.0")
indices = list(range(total_episodes))
result = {}
start_idx = 0
for split_name, fraction in splits.items():
num_episodes = int(total_episodes * fraction)
end_idx = start_idx + num_episodes
if split_name == list(splits.keys())[-1]: # Last split gets remaining episodes
end_idx = total_episodes
result[split_name] = indices[start_idx:end_idx]
start_idx = end_idx
return result
def _copy_and_reindex_data(
src_dataset: LeRobotDataset,
dst_meta: LeRobotDatasetMetadata,
episode_mapping: dict[int, int],
) -> None:
"""Copy data files and reindex episodes."""
# Get unique data files from episodes to keep
file_paths = set()
for old_idx in episode_mapping:
file_paths.add(src_dataset.meta.get_data_file_path(old_idx))
# Track global index
global_index = 0
chunk_idx, file_idx = 0, 0
# Process each data file
for src_path in tqdm(sorted(file_paths), desc="Processing data files"):
df = pd.read_parquet(src_dataset.root / src_path)
# Filter to keep only mapped episodes
mask = df["episode_index"].isin(episode_mapping.keys())
df = df[mask].copy()
if len(df) == 0:
continue
# Update episode indices
df["episode_index"] = df["episode_index"].map(episode_mapping)
# Update global index to be continuous
df["index"] = range(global_index, global_index + len(df))
global_index += len(df)
# Update task indices if needed
if dst_meta.tasks is None:
# Get unique tasks from filtered data
task_indices = df["task_index"].unique()
tasks = [src_dataset.meta.tasks.iloc[idx].name for idx in task_indices]
dst_meta.save_episode_tasks(list(set(tasks)))
# Remap task indices
task_mapping = {}
for old_task_idx in df["task_index"].unique():
task_name = src_dataset.meta.tasks.iloc[old_task_idx].name
new_task_idx = dst_meta.get_task_index(task_name)
task_mapping[old_task_idx] = new_task_idx
df["task_index"] = df["task_index"].map(task_mapping)
# Save processed data
chunk_idx, file_idx = _save_data_chunk(df, dst_meta, chunk_idx, file_idx)
# Process episodes metadata
_copy_and_reindex_episodes_metadata(src_dataset, dst_meta, episode_mapping)
def _copy_and_reindex_videos(
src_dataset: LeRobotDataset,
dst_meta: LeRobotDatasetMetadata,
episode_mapping: dict[int, int],
) -> None:
"""Copy video files and update metadata."""
for video_key in src_dataset.meta.video_keys:
video_files = set()
for old_idx in episode_mapping:
video_files.add(src_dataset.meta.get_video_file_path(old_idx, video_key))
chunk_idx, file_idx = 0, 0
for src_path in tqdm(sorted(video_files), desc=f"Processing {video_key} videos"):
dst_path = dst_meta.root / DEFAULT_VIDEO_PATH.format(
video_key=video_key,
chunk_index=chunk_idx,
file_index=file_idx,
)
dst_path.parent.mkdir(parents=True, exist_ok=True)
# For simplicity, copy entire video files
# In production, you might want to extract only relevant segments
shutil.copy(src_dataset.root / src_path, dst_path)
# Update indices for next file
file_size = get_video_size_in_mb(dst_path)
if file_size >= DEFAULT_VIDEO_FILE_SIZE_IN_MB * 0.9: # 90% threshold
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, DEFAULT_CHUNK_SIZE)
def _copy_and_reindex_episodes_metadata(
src_dataset: LeRobotDataset,
dst_meta: LeRobotDatasetMetadata,
episode_mapping: dict[int, int],
) -> None:
"""Copy and reindex episodes metadata."""
all_stats = []
frame_offset = 0
for old_idx, new_idx in tqdm(
sorted(episode_mapping.items(), key=lambda x: x[1]), desc="Processing episodes metadata"
):
# Get episode from source
src_episode = src_dataset.meta.episodes[old_idx]
# Create episode dict
episode_dict = {
"episode_index": new_idx,
"tasks": src_episode["tasks"], # Already a list of task names
"length": src_episode["length"],
}
# Copy other metadata
episode_metadata = {
"data/chunk_index": 0, # Will be recalculated when saving
"data/file_index": 0, # Will be recalculated when saving
"dataset_from_index": frame_offset,
"dataset_to_index": frame_offset + src_episode["length"],
}
# Update frame offset for next episode
frame_offset += src_episode["length"]
# Copy stats metadata
for key in src_episode.keys():
if key.startswith("stats/"):
episode_dict[key] = src_episode[key]
# Add episode metadata
stats_dict = {
key.replace("stats/", ""): value
for key, value in episode_dict.items()
if key.startswith("stats/")
}
all_stats.append(stats_dict)
# Calculate stats from dict
episode_stats = {}
for key in dst_meta.features:
if key in stats_dict:
episode_stats[key] = stats_dict[key]
dst_meta.save_episode(
new_idx, episode_dict["length"], episode_dict["tasks"], episode_stats, episode_metadata
)
# Aggregate all stats
if all_stats:
aggregated_stats = aggregate_stats(all_stats)
write_stats(aggregated_stats, dst_meta.root)
def _save_data_chunk(
df: pd.DataFrame,
meta: LeRobotDatasetMetadata,
chunk_idx: int = 0,
file_idx: int = 0,
) -> tuple[int, int]:
"""Save a data chunk and return updated indices."""
path = meta.root / DEFAULT_DATA_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
if len(meta.image_keys) > 0:
to_parquet_with_hf_images(df, path)
else:
df.to_parquet(path)
# Check if we need to rotate files
file_size = get_parquet_file_size_in_mb(path)
if file_size >= DEFAULT_DATA_FILE_SIZE_IN_MB * 0.9: # 90% threshold
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, DEFAULT_CHUNK_SIZE)
return chunk_idx, file_idx
def _copy_data_with_feature_changes(
dataset: LeRobotDataset,
new_meta: LeRobotDatasetMetadata,
add_features: dict[str, tuple] | None = None,
remove_features: list[str] | None = None,
) -> None:
"""Copy data while adding or removing features."""
# Get all unique data files
file_paths = set()
for ep_idx in range(dataset.meta.total_episodes):
file_paths.add(dataset.meta.get_data_file_path(ep_idx))
frame_idx = 0
# Process each data file
for src_path in tqdm(sorted(file_paths), desc="Processing data files"):
df = pd.read_parquet(dataset.root / src_path)
# Remove features
if remove_features:
df = df.drop(columns=remove_features, errors="ignore")
# Add features
if add_features:
for feature_name, (values, _) in add_features.items():
if callable(values):
# Compute values for each frame
feature_values = []
for _, row in df.iterrows():
ep_idx = row["episode_index"]
frame_in_ep = row["frame_index"]
value = values(row.to_dict(), ep_idx, frame_in_ep)
# Convert numpy arrays to scalars for single-element arrays
if isinstance(value, np.ndarray) and value.size == 1:
value = value.item()
feature_values.append(value)
df[feature_name] = feature_values
else:
# Use provided values
end_idx = frame_idx + len(df)
# Convert to list to ensure proper shape handling
feature_slice = values[frame_idx:end_idx]
if len(feature_slice.shape) > 1 and feature_slice.shape[1] == 1:
# Flatten single-element arrays to scalars for pandas
df[feature_name] = feature_slice.flatten()
else:
df[feature_name] = feature_slice
frame_idx = end_idx
# Save chunk
_save_data_chunk(df, new_meta)
# Copy episodes metadata and update stats
_copy_episodes_metadata_and_stats(dataset, new_meta)
def _copy_videos(
src_dataset: LeRobotDataset,
dst_meta: LeRobotDatasetMetadata,
exclude_keys: list[str] | None = None,
) -> None:
"""Copy video files, optionally excluding certain keys."""
if exclude_keys is None:
exclude_keys = []
for video_key in src_dataset.meta.video_keys:
if video_key in exclude_keys:
continue
# Get all video files for this key
video_files = set()
for ep_idx in range(src_dataset.meta.total_episodes):
video_files.add(src_dataset.meta.get_video_file_path(ep_idx, video_key))
# Copy video files
for src_path in tqdm(sorted(video_files), desc=f"Copying {video_key} videos"):
# Maintain same structure
rel_path = src_path.relative_to(src_dataset.root)
dst_path = dst_meta.root / rel_path
dst_path.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(src_dataset.root / src_path, dst_path)
def _copy_episodes_metadata_and_stats(
src_dataset: LeRobotDataset,
dst_meta: LeRobotDatasetMetadata,
) -> None:
"""Copy episodes metadata and recalculate stats."""
# Copy tasks
if src_dataset.meta.tasks is not None:
write_tasks(src_dataset.meta.tasks, dst_meta.root)
dst_meta.tasks = src_dataset.meta.tasks.copy()
# Copy episodes metadata files
episodes_dir = src_dataset.root / "meta/episodes"
dst_episodes_dir = dst_meta.root / "meta/episodes"
if episodes_dir.exists():
shutil.copytree(episodes_dir, dst_episodes_dir, dirs_exist_ok=True)
# Update info
dst_meta.info.update(
{
"total_episodes": src_dataset.meta.total_episodes,
"total_frames": src_dataset.meta.total_frames,
"total_tasks": src_dataset.meta.total_tasks,
"splits": src_dataset.meta.info.get("splits", {"train": f"0:{src_dataset.meta.total_episodes}"}),
}
)
# Update video info if needed
if dst_meta.video_keys and src_dataset.meta.video_keys:
for key in dst_meta.video_keys:
if key in src_dataset.meta.features:
dst_meta.info["features"][key]["info"] = src_dataset.meta.info["features"][key].get(
"info", {}
)
write_info(dst_meta.info, dst_meta.root)
# Recalculate stats if features changed
if set(dst_meta.features.keys()) != set(src_dataset.meta.features.keys()):
# Need to recalculate stats
logging.info("Recalculating dataset statistics...")
# This is a simplified version - in production you'd want to properly recalculate
if src_dataset.meta.stats:
new_stats = {}
for key in dst_meta.features:
if key in src_dataset.meta.stats:
new_stats[key] = src_dataset.meta.stats[key]
write_stats(new_stats, dst_meta.root)
else:
# Copy existing stats
if src_dataset.meta.stats:
write_stats(src_dataset.meta.stats, dst_meta.root)
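To make the fraction-based splitting above concrete, a small worked sketch of `_fractions_to_episode_indices` with 10 episodes: each split takes `int(total * fraction)` episodes in order, and the last split absorbs any remainder.
```
# _fractions_to_episode_indices(10, {"train": 0.8, "val": 0.2}) yields:
# "train": int(10 * 0.8) = 8 episodes -> [0, 1, 2, 3, 4, 5, 6, 7]
# "val": the last split takes everything remaining -> [8, 9]
expected = {"train": list(range(8)), "val": [8, 9]}
```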
+248 -353
@@ -16,18 +16,16 @@
import contextlib
import logging
import shutil
import tempfile
from collections.abc import Callable
from pathlib import Path
import datasets
import numpy as np
import packaging.version
import pandas as pd
import PIL.Image
import torch
import torch.utils
from datasets import Dataset
from datasets import concatenate_datasets, load_dataset
from huggingface_hub import HfApi, snapshot_download
from huggingface_hub.constants import REPOCARD_NAME
from huggingface_hub.errors import RevisionNotFoundError
@@ -36,51 +34,46 @@ from lerobot.constants import HF_LEROBOT_HOME
from lerobot.datasets.compute_stats import aggregate_stats, compute_episode_stats
from lerobot.datasets.image_writer import AsyncImageWriter, write_image
from lerobot.datasets.utils import (
DEFAULT_EPISODES_PATH,
DEFAULT_FEATURES,
DEFAULT_IMAGE_PATH,
INFO_PATH,
TASKS_PATH,
_validate_feature_names,
append_jsonlines,
backward_compatible_episodes_stats,
check_delta_timestamps,
check_timestamps_sync,
check_version_compatibility,
create_empty_dataset_info,
create_lerobot_dataset_card,
embed_images,
flatten_dict,
get_delta_indices,
get_hf_dataset_size_in_mb,
get_episode_data_index,
get_hf_features_from_features,
get_parquet_file_size_in_mb,
get_parquet_num_frames,
get_safe_version,
get_video_duration_in_s,
get_video_size_in_mb,
hf_transform_to_torch,
is_valid_version,
load_episodes,
load_episodes_stats,
load_info,
load_nested_dataset,
load_stats,
load_tasks,
to_parquet_with_hf_images,
update_chunk_file_indices,
validate_episode_buffer,
validate_frame,
write_episode,
write_episode_stats,
write_info,
write_json,
write_stats,
write_tasks,
)
from lerobot.datasets.video_utils import (
VideoFrame,
concat_video_files,
decode_video_frames,
encode_video_frames,
get_safe_default_codec,
get_video_info,
)
CODEBASE_VERSION = "v3.0"
CODEBASE_VERSION = "v2.1"
class LeRobotDatasetMetadata:
@@ -104,18 +97,20 @@ class LeRobotDatasetMetadata:
self.revision = get_safe_version(self.repo_id, self.revision)
(self.root / "meta").mkdir(exist_ok=True, parents=True)
# TODO(rcadene): instead of downloading all episodes metadata files,
# download only the ones associated with the requested episodes. This would
# require adding `episodes: list[int]` as argument.
self.pull_from_repo(allow_patterns="meta/")
self.load_metadata()
def load_metadata(self):
self.info = load_info(self.root)
check_version_compatibility(self.repo_id, self._version, CODEBASE_VERSION)
self.tasks = load_tasks(self.root)
self.tasks, self.task_to_task_index = load_tasks(self.root)
self.episodes = load_episodes(self.root)
self.stats = load_stats(self.root)
if self._version < packaging.version.parse("v2.1"):
self.stats = load_stats(self.root)
self.episodes_stats = backward_compatible_episodes_stats(self.stats, self.episodes)
else:
self.episodes_stats = load_episodes_stats(self.root)
self.stats = aggregate_stats(list(self.episodes_stats.values()))
def pull_from_repo(
self,
@@ -137,19 +132,18 @@ class LeRobotDatasetMetadata:
return packaging.version.parse(self.info["codebase_version"])
def get_data_file_path(self, ep_index: int) -> Path:
ep = self.episodes[ep_index]
chunk_idx = ep["data/chunk_index"]
file_idx = ep["data/file_index"]
fpath = self.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
ep_chunk = self.get_episode_chunk(ep_index)
fpath = self.data_path.format(episode_chunk=ep_chunk, episode_index=ep_index)
return Path(fpath)
def get_video_file_path(self, ep_index: int, vid_key: str) -> Path:
ep = self.episodes[ep_index]
chunk_idx = ep[f"videos/{vid_key}/chunk_index"]
file_idx = ep[f"videos/{vid_key}/file_index"]
fpath = self.video_path.format(video_key=vid_key, chunk_index=chunk_idx, file_index=file_idx)
ep_chunk = self.get_episode_chunk(ep_index)
fpath = self.video_path.format(episode_chunk=ep_chunk, video_key=vid_key, episode_index=ep_index)
return Path(fpath)
def get_episode_chunk(self, ep_index: int) -> int:
return ep_index // self.chunks_size
@property
def data_path(self) -> str:
"""Formattable string for the parquet files."""
@@ -215,109 +209,40 @@ class LeRobotDatasetMetadata:
"""Total number of different tasks performed in this dataset."""
return self.info["total_tasks"]
@property
def total_chunks(self) -> int:
"""Total number of chunks (groups of episodes)."""
return self.info["total_chunks"]
@property
def chunks_size(self) -> int:
"""Max number of files per chunk."""
"""Max number of episodes per chunk."""
return self.info["chunks_size"]
@property
def data_files_size_in_mb(self) -> int:
"""Max size of data file in mega bytes."""
return self.info["data_files_size_in_mb"]
@property
def video_files_size_in_mb(self) -> int:
"""Max size of video file in mega bytes."""
return self.info["video_files_size_in_mb"]
def get_task_index(self, task: str) -> int | None:
"""
Given a task in natural language, returns its task_index if the task already exists in the dataset,
otherwise returns None.
"""
if task in self.tasks.index:
return int(self.tasks.loc[task].task_index)
else:
return None
return self.task_to_task_index.get(task, None)
def save_episode_tasks(self, tasks: list[str]):
if len(set(tasks)) != len(tasks):
raise ValueError(f"Tasks are not unique: {tasks}")
if self.tasks is None:
new_tasks = tasks
task_indices = range(len(tasks))
self.tasks = pd.DataFrame({"task_index": task_indices}, index=tasks)
else:
new_tasks = [task for task in tasks if task not in self.tasks.index]
new_task_indices = range(len(self.tasks), len(self.tasks) + len(new_tasks))
for task_idx, task in zip(new_task_indices, new_tasks, strict=False):
self.tasks.loc[task] = task_idx
if len(new_tasks) > 0:
# Update on disk
write_tasks(self.tasks, self.root)
def _save_episode_metadata(self, episode_dict: dict) -> None:
"""Save episode metadata to a parquet file and update the Hugging Face dataset of episodes metadata.
This function processes episodes metadata from a dictionary, converts it into a Hugging Face dataset,
and saves it as a parquet file. It handles both the creation of new parquet files and the
updating of existing ones based on size constraints. After saving the metadata, it reloads
the Hugging Face dataset to ensure it is up-to-date.
Notes: We need to update both the parquet files and the HF dataset:
- `pandas` loads the parquet file in RAM
- `datasets` relies on a memory mapping from pyarrow (no RAM). It either converts parquet files to a pyarrow cache on disk,
or loads directly from the pyarrow cache.
def add_task(self, task: str):
"""
# Convert buffer into HF Dataset
episode_dict = {key: [value] for key, value in episode_dict.items()}
ep_dataset = Dataset.from_dict(episode_dict)
ep_size_in_mb = get_hf_dataset_size_in_mb(ep_dataset)
df = pd.DataFrame(ep_dataset)
num_frames = episode_dict["length"][0]
Given a task in natural language, add it to the dictionary of tasks.
"""
if task in self.task_to_task_index:
raise ValueError(f"The task '{task}' already exists and can't be added twice.")
if self.episodes is None:
# Initialize indices and frame count for a new dataset made of the first episode data
chunk_idx, file_idx = 0, 0
df["meta/episodes/chunk_index"] = [chunk_idx]
df["meta/episodes/file_index"] = [file_idx]
df["dataset_from_index"] = [0]
df["dataset_to_index"] = [num_frames]
else:
# Retrieve information from the latest parquet file
latest_ep = self.episodes[-1]
chunk_idx = latest_ep["meta/episodes/chunk_index"]
file_idx = latest_ep["meta/episodes/file_index"]
task_index = self.info["total_tasks"]
self.task_to_task_index[task] = task_index
self.tasks[task_index] = task
self.info["total_tasks"] += 1
latest_path = self.root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
latest_size_in_mb = get_parquet_file_size_in_mb(latest_path)
if latest_size_in_mb + ep_size_in_mb >= self.data_files_size_in_mb:
# Size limit is reached, prepare new parquet file
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self.chunks_size)
# Update the existing pandas dataframe with new row
df["meta/episodes/chunk_index"] = [chunk_idx]
df["meta/episodes/file_index"] = [file_idx]
df["dataset_from_index"] = [latest_ep["dataset_to_index"]]
df["dataset_to_index"] = [latest_ep["dataset_to_index"] + num_frames]
if latest_size_in_mb + ep_size_in_mb < self.data_files_size_in_mb:
# Size limit wasn't reached, concatenate the latest dataframe with the new one
latest_df = pd.read_parquet(latest_path)
df = pd.concat([latest_df, df], ignore_index=True)
# Write the resulting dataframe from RAM to disk
path = self.root / DEFAULT_EPISODES_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
df.to_parquet(path, index=False)
# Update the Hugging Face dataset by reloading it.
# This process should be fast because only the latest Parquet file has been modified.
# Therefore, only this file needs to be converted to PyArrow; the rest is loaded from the PyArrow memory-mapped cache.
self.episodes = load_episodes(self.root)
task_dict = {
"task_index": task_index,
"task": task,
}
append_jsonlines(task_dict, self.root / TASKS_PATH)
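To make the size-based rotation above concrete, here is a self-contained sketch, assuming a fixed per-file size budget and the `update_chunk_file_indices` rollover defined further down in this diff (names and budget are illustrative):
```python
# Illustrative rollover: a parquet file is "full" once the incoming episode
# would push it past the size budget; file indices wrap into the next chunk.
def update_chunk_file_indices(chunk_idx: int, file_idx: int, chunks_size: int) -> tuple[int, int]:
    if file_idx == chunks_size - 1:
        return chunk_idx + 1, 0
    return chunk_idx, file_idx + 1

def next_location(latest_size_mb: float, ep_size_mb: float, chunk_idx: int, file_idx: int,
                  budget_mb: float = 100.0, chunks_size: int = 1000) -> tuple[int, int, bool]:
    """Return (chunk_idx, file_idx, append_to_existing) for the incoming episode."""
    if latest_size_mb + ep_size_mb >= budget_mb:
        chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, chunks_size)
        return chunk_idx, file_idx, False  # start a fresh parquet file
    return chunk_idx, file_idx, True  # concatenate into the latest file

print(next_location(98.0, 5.0, 0, 7))  # (0, 8, False): rolls over to a new file
print(next_location(40.0, 5.0, 0, 7))  # (0, 7, True): appends to file-007
```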
def save_episode(
self,
@@ -325,28 +250,30 @@ class LeRobotDatasetMetadata:
episode_length: int,
episode_tasks: list[str],
episode_stats: dict[str, dict],
episode_metadata: dict,
) -> None:
self.info["total_episodes"] += 1
self.info["total_frames"] += episode_length
chunk = self.get_episode_chunk(episode_index)
if chunk >= self.total_chunks:
self.info["total_chunks"] += 1
self.info["splits"] = {"train": f"0:{self.info['total_episodes']}"}
self.info["total_videos"] += len(self.video_keys)
write_info(self.info, self.root)
episode_dict = {
"episode_index": episode_index,
"tasks": episode_tasks,
"length": episode_length,
}
episode_dict.update(episode_metadata)
episode_dict.update(flatten_dict({"stats": episode_stats}))
self._save_episode_metadata(episode_dict)
self.episodes[episode_index] = episode_dict
write_episode(episode_dict, self.root)
# Update info
self.info["total_episodes"] += 1
self.info["total_frames"] += episode_length
self.info["total_tasks"] = len(self.tasks)
self.info["splits"] = {"train": f"0:{self.info['total_episodes']}"}
if len(self.video_keys) > 0:
self.update_video_info()
write_info(self.info, self.root)
self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats is not None else episode_stats
write_stats(self.stats, self.root)
self.episodes_stats[episode_index] = episode_stats
self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats else episode_stats
write_episode_stats(episode_index, episode_stats, self.root)
def update_video_info(self) -> None:
"""
@@ -386,12 +313,12 @@ class LeRobotDatasetMetadata:
obj.root.mkdir(parents=True, exist_ok=False)
# TODO(aliberts, rcadene): implement sanity check for features
features = {**features, **DEFAULT_FEATURES}
_validate_feature_names(features)
obj.tasks = None
obj.episodes = None
obj.stats = None
obj.tasks, obj.task_to_task_index = {}, {}
obj.episodes_stats, obj.stats, obj.episodes = {}, {}, {}
obj.info = create_empty_dataset_info(CODEBASE_VERSION, fps, features, use_videos, robot_type)
if len(obj.video_keys) > 0 and not use_videos:
raise ValueError()
@@ -413,6 +340,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
force_cache_sync: bool = False,
download_videos: bool = True,
video_backend: str | None = None,
batch_encoding_size: int = 1,
):
"""
2 modes are available for instantiating this class, depending on 2 different use cases:
@@ -426,9 +354,9 @@ class LeRobotDataset(torch.utils.data.Dataset):
- On the Hugging Face Hub at the address https://huggingface.co/datasets/{repo_id} and not on
your local disk in the 'root' folder. Instantiating this class with this 'repo_id' will download
the dataset from that address and load it, provided your dataset is compliant with
codebase_version v3.0. If your dataset has been created before this new format, you will be
prompted to convert it using our conversion script from v2.1 to v3.0, which you can find at
lerobot/datasets/v30/convert_dataset_v21_to_v30.py.
codebase_version v2.0. If your dataset has been created before this new format, you will be
prompted to convert it using our conversion script from v1.6 to v2.0, which you can find at
lerobot/datasets/v2/convert_dataset_v1_to_v2.py.
2. Your dataset doesn't already exist (either on local disk or on the Hub): you can create an empty
@@ -449,47 +377,38 @@ class LeRobotDataset(torch.utils.data.Dataset):
.
├── data
│ ├── chunk-000
│ │ ├── file-000.parquet
│ │ ├── file-001.parquet
│ │ ├── episode_000000.parquet
│ │ ├── episode_000001.parquet
│ │ ├── episode_000002.parquet
│ │ └── ...
│ ├── chunk-001
│ │ ├── file-000.parquet
│ │ ├── file-001.parquet
│ │ ├── episode_001000.parquet
│ │ ├── episode_001001.parquet
│ │ ├── episode_001002.parquet
│ │ └── ...
│ └── ...
├── meta
│ ├── episodes
│ │ ├── chunk-000
│ │ │ ├── file-000.parquet
│ │ │ ├── file-001.parquet
│ │ │ └── ...
│ │ ├── chunk-001
│ │ │ └── ...
│ │ └── ...
│ ├── episodes.jsonl
│ ├── info.json
│ ├── stats.json
│ └── tasks.parquet
│ └── tasks.jsonl
└── videos
├── observation.images.laptop
│ ├── chunk-000
│ │ ├── file-000.mp4
│ │ ├── file-001.mp4
├── chunk-000
│ ├── observation.images.laptop
│ │ ├── episode_000000.mp4
│ │ ├── episode_000001.mp4
│ │ ├── episode_000002.mp4
│ │ └── ...
│ ├── chunk-001
│ ├── observation.images.phone
│ │ ├── episode_000000.mp4
│ │ ├── episode_000001.mp4
│ │ ├── episode_000002.mp4
│ │ └── ...
│ └── ...
├── observation.images.phone
│ ├── chunk-000
│ │ ├── file-000.mp4
│ │ ├── file-001.mp4
│ │ └── ...
│ ├── chunk-001
│ │ └── ...
│ └── ...
├── chunk-001
└── ...
Note that this file-based structure is designed to be as versatile as possible. Multiple episodes are
consolidated into chunked files, which improves storage efficiency and loading performance. The
Note that this file-based structure is designed to be as versatile as possible. The files are split by
episodes, which allows more granular control over which episodes one wants to use and download. The
structure of the dataset is entirely described in the info.json file, which can be easily downloaded
or viewed directly on the hub before downloading any actual data. The file types used are very
simple and do not need complex tools to be read: only .parquet, .json and .mp4 files (and .md
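Since info.json alone describes the whole layout, it can be fetched and inspected before committing to a full download. A sketch using `huggingface_hub` (the repo_id is illustrative):
```python
import json
from huggingface_hub import hf_hub_download

# Fetch only meta/info.json from the dataset repo, then inspect the layout.
info_path = hf_hub_download(
    repo_id="lerobot/aloha_static_coffee", filename="meta/info.json", repo_type="dataset"
)
with open(info_path) as f:
    info = json.load(f)
print(info["codebase_version"], info["total_episodes"], list(info["features"]))
```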
@@ -523,6 +442,8 @@ class LeRobotDataset(torch.utils.data.Dataset):
True.
video_backend (str | None, optional): Video backend to use for decoding videos. Defaults to torchcodec when available on the platform; otherwise, defaults to 'pyav'.
You can also use the 'pyav' decoder used by Torchvision, which used to be the default option, or 'video_reader', another decoder from Torchvision.
batch_encoding_size (int, optional): Number of episodes to accumulate before batch encoding videos.
Set to 1 for immediate encoding (default), or higher for batched encoding. Defaults to 1.
"""
super().__init__()
self.repo_id = repo_id
@@ -534,6 +455,8 @@ class LeRobotDataset(torch.utils.data.Dataset):
self.revision = revision if revision else CODEBASE_VERSION
self.video_backend = video_backend if video_backend else get_safe_default_codec()
self.delta_indices = None
self.batch_encoding_size = batch_encoding_size
self.episodes_since_last_encoding = 0
# Unused attributes
self.image_writer = None
@@ -545,20 +468,29 @@ class LeRobotDataset(torch.utils.data.Dataset):
self.meta = LeRobotDatasetMetadata(
self.repo_id, self.root, self.revision, force_cache_sync=force_cache_sync
)
if self.episodes is not None and self.meta._version >= packaging.version.parse("v2.1"):
episodes_stats = [self.meta.episodes_stats[ep_idx] for ep_idx in self.episodes]
self.stats = aggregate_stats(episodes_stats)
# Load actual data
try:
if force_cache_sync:
raise FileNotFoundError
assert all((self.root / fpath).is_file() for fpath in self.get_episodes_file_paths())
self.hf_dataset = self.load_hf_dataset()
# Check if cached dataset contains all requested episodes
if not self._check_cached_episodes_sufficient():
raise FileNotFoundError("Cached dataset doesn't contain all requested episodes")
except (AssertionError, FileNotFoundError, NotADirectoryError):
self.revision = get_safe_version(self.repo_id, self.revision)
self.download(download_videos)
self.download_episodes(download_videos)
self.hf_dataset = self.load_hf_dataset()
self.episode_data_index = get_episode_data_index(self.meta.episodes, self.episodes)
# Check timestamps
timestamps = torch.stack(self.hf_dataset["timestamp"]).numpy()
episode_indices = torch.stack(self.hf_dataset["episode_index"]).numpy()
ep_data_index_np = {k: t.numpy() for k, t in self.episode_data_index.items()}
check_timestamps_sync(timestamps, episode_indices, ep_data_index_np, self.fps, self.tolerance_s)
# Setup delta_indices
if self.delta_timestamps is not None:
check_delta_timestamps(self.delta_timestamps, self.fps, self.tolerance_s)
@@ -634,7 +566,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
ignore_patterns=ignore_patterns,
)
def download(self, download_videos: bool = True) -> None:
def download_episodes(self, download_videos: bool = True) -> None:
"""Downloads the dataset from the given 'repo_id' at the provided version. If 'episodes' is given, this
will only download those episodes (selected by their episode_index). If 'episodes' is None, the whole
dataset will be downloaded. Thanks to the behavior of snapshot_download, if the files are already present
@@ -642,10 +574,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
"""
# TODO(rcadene, aliberts): implement faster transfer
# https://huggingface.co/docs/huggingface_hub/en/guides/download#faster-downloads
ignore_patterns = None if download_videos else "videos/"
files = None
ignore_patterns = None if download_videos else "videos/"
if self.episodes is not None:
files = self.get_episodes_file_paths()
self.pull_from_repo(allow_patterns=files, ignore_patterns=ignore_patterns)
def get_episodes_file_paths(self) -> list[Path]:
@@ -658,40 +591,28 @@ class LeRobotDataset(torch.utils.data.Dataset):
for ep_idx in episodes
]
fpaths += video_files
# episodes are stored in the same files, so we return unique paths only
fpaths = list(set(fpaths))
return fpaths
def load_hf_dataset(self) -> datasets.Dataset:
"""hf_dataset contains all the observations, states, actions, rewards, etc."""
features = get_hf_features_from_features(self.features)
hf_dataset = load_nested_dataset(self.root / "data", features=features)
if self.episodes is None:
path = str(self.root / "data")
hf_dataset = load_dataset("parquet", data_dir=path, split="train")
else:
files = [str(self.root / self.meta.get_data_file_path(ep_idx)) for ep_idx in self.episodes]
hf_dataset = load_dataset("parquet", data_files=files, split="train")
# TODO(aliberts): hf_dataset.set_format("torch")
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
def _check_cached_episodes_sufficient(self) -> bool:
"""Check if the cached dataset contains all requested episodes."""
if self.hf_dataset is None or len(self.hf_dataset) == 0:
return False
# Get available episode indices from cached dataset
available_episodes = set(self.hf_dataset["episode_index"])
# Determine requested episodes
if self.episodes is None:
# Requesting all episodes - check if we have all episodes from metadata
requested_episodes = set(range(self.meta.total_episodes))
else:
# Requesting specific episodes
requested_episodes = set(self.episodes)
# Check if all requested episodes are available in cached data
return requested_episodes.issubset(available_episodes)
def create_hf_dataset(self) -> datasets.Dataset:
features = get_hf_features_from_features(self.features)
ft_dict = {col: [] for col in features}
hf_dataset = datasets.Dataset.from_dict(ft_dict, features=features, split="train")
# TODO(aliberts): hf_dataset.set_format("torch")
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
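Both loading paths above lean on the parquet builder of `datasets`. A sketch of the two call shapes, with illustrative local paths:
```python
from datasets import load_dataset

# All episodes: point the parquet builder at the whole data directory.
ds_all = load_dataset("parquet", data_dir="/path/to/dataset/data", split="train")

# A subset: pass the explicit per-episode parquet files instead.
files = [
    "/path/to/dataset/data/chunk-000/episode_000000.parquet",
    "/path/to/dataset/data/chunk-000/episode_000002.parquet",
]
ds_subset = load_dataset("parquet", data_files=files, split="train")
```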
@@ -723,16 +644,15 @@ class LeRobotDataset(torch.utils.data.Dataset):
return get_hf_features_from_features(self.features)
def _get_query_indices(self, idx: int, ep_idx: int) -> tuple[dict[str, list[int | bool]]]:
ep = self.meta.episodes[ep_idx]
ep_start = ep["dataset_from_index"]
ep_end = ep["dataset_to_index"]
ep_start = self.episode_data_index["from"][ep_idx]
ep_end = self.episode_data_index["to"][ep_idx]
query_indices = {
key: [max(ep_start, min(ep_end - 1, idx + delta)) for delta in delta_idx]
key: [max(ep_start.item(), min(ep_end.item() - 1, idx + delta)) for delta in delta_idx]
for key, delta_idx in self.delta_indices.items()
}
padding = { # Pad values outside of current episode range
f"{key}_is_pad": torch.BoolTensor(
[(idx + delta < ep_start) | (idx + delta >= ep_end) for delta in delta_idx]
[(idx + delta < ep_start.item()) | (idx + delta >= ep_end.item()) for delta in delta_idx]
)
for key, delta_idx in self.delta_indices.items()
}
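The clamping and padding above can be isolated into a few lines. A self-contained sketch (episode boundaries are illustrative):
```python
import torch

def query_with_padding(idx: int, ep_start: int, ep_end: int, delta_idx: list[int]):
    # Clamp out-of-range indices to the episode edges and flag them as padding.
    indices = [max(ep_start, min(ep_end - 1, idx + d)) for d in delta_idx]
    is_pad = torch.BoolTensor([(idx + d < ep_start) or (idx + d >= ep_end) for d in delta_idx])
    return indices, is_pad

# Episode spans frames [10, 20); query 3 and 1 frames back plus 1 forward from idx=11.
print(query_with_padding(11, 10, 20, [-3, -1, 0, 1]))
# ([10, 10, 11, 12], tensor([ True, False, False, False]))
```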
@@ -746,7 +666,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
query_timestamps = {}
for key in self.meta.video_keys:
if query_indices is not None and key in query_indices:
timestamps = self.hf_dataset[query_indices[key]]["timestamp"]
timestamps = self.hf_dataset.select(query_indices[key])["timestamp"]
query_timestamps[key] = torch.stack(timestamps).tolist()
else:
query_timestamps[key] = [current_ts]
@@ -755,7 +675,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
def _query_hf_dataset(self, query_indices: dict[str, list[int]]) -> dict:
return {
key: torch.stack(self.hf_dataset[q_idx][key])
key: torch.stack(self.hf_dataset.select(q_idx)[key])
for key, q_idx in query_indices.items()
if key not in self.meta.video_keys
}
@@ -766,17 +686,10 @@ class LeRobotDataset(torch.utils.data.Dataset):
Segmentation Fault. This probably happens because a memory reference to the video loader is created in
the main process and a subprocess fails to access it.
"""
ep = self.meta.episodes[ep_idx]
item = {}
for vid_key, query_ts in query_timestamps.items():
# Episodes are stored sequentially in a single mp4 to reduce the number of files.
# Thus we load the start timestamp of the episode in this mp4 and
# shift the query timestamps accordingly.
from_timestamp = ep[f"videos/{vid_key}/from_timestamp"]
shifted_query_ts = [from_timestamp + ts for ts in query_ts]
video_path = self.root / self.meta.get_video_file_path(ep_idx, vid_key)
frames = decode_video_frames(video_path, shifted_query_ts, self.tolerance_s, self.video_backend)
frames = decode_video_frames(video_path, query_ts, self.tolerance_s, self.video_backend)
item[vid_key] = frames.squeeze(0)
return item
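The shift above is a plain offset. A tiny worked example, with illustrative numbers:
```python
# v3.0 layout: several episodes share one mp4, so queries relative to the
# episode are offset by the episode's start time within that file.
from_timestamp = 12.5           # episode starts 12.5 s into the shared mp4
query_ts = [0.0, 0.033, 0.066]  # timestamps relative to the episode start
shifted_query_ts = [from_timestamp + ts for ts in query_ts]
print(shifted_query_ts)  # [12.5, 12.533, 12.566]
```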
@@ -814,7 +727,8 @@ class LeRobotDataset(torch.utils.data.Dataset):
# Add task as a string
task_idx = item["task_index"].item()
item["task"] = self.meta.tasks.iloc[task_idx].name
item["task"] = self.meta.tasks[task_idx]
return item
def __repr__(self):
@@ -844,9 +758,6 @@ class LeRobotDataset(torch.utils.data.Dataset):
)
return self.root / fpath
def _get_image_file_dir(self, episode_index: int, image_key: str) -> Path:
return self._get_image_file_path(episode_index, image_key, frame_index=0).parent
def _save_image(self, image: torch.Tensor | np.ndarray | PIL.Image.Image, fpath: Path) -> None:
if self.image_writer is None:
if isinstance(image, torch.Tensor):
@@ -855,7 +766,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
else:
self.image_writer.save_image(image=image, fpath=fpath)
def add_frame(self, frame: dict) -> None:
def add_frame(self, frame: dict, task: str, timestamp: float | None = None) -> None:
"""
This function only adds the frame to the episode_buffer. Apart from images — which are written in a
temporary directory — nothing is written to disk. To save those frames, the 'save_episode()' method
@@ -873,10 +784,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
# Automatically add frame_index and timestamp to episode buffer
frame_index = self.episode_buffer["size"]
timestamp = frame.pop("timestamp") if "timestamp" in frame else frame_index / self.fps
if timestamp is None:
timestamp = frame_index / self.fps
self.episode_buffer["frame_index"].append(frame_index)
self.episode_buffer["timestamp"].append(timestamp)
self.episode_buffer["task"].append(frame.pop("task")) # Remove task from frame after processing
self.episode_buffer["task"].append(task)
# Add frame features to episode_buffer
for key in frame:
@@ -902,12 +814,19 @@ class LeRobotDataset(torch.utils.data.Dataset):
"""
This will save to disk the current episode in self.episode_buffer.
Video encoding is handled automatically based on batch_encoding_size:
- If batch_encoding_size == 1: Videos are encoded immediately after each episode
- If batch_encoding_size > 1: Videos are encoded in batches.
Args:
episode_data (dict | None, optional): Dict containing the episode data to save. If None, this will
save the current episode in self.episode_buffer, which is filled with 'add_frame'. Defaults to
None.
"""
episode_buffer = episode_data if episode_data is not None else self.episode_buffer
if not episode_data:
episode_buffer = self.episode_buffer
else:
episode_buffer = episode_data
validate_episode_buffer(episode_buffer, self.meta.total_episodes, self.features)
@@ -920,8 +839,11 @@ class LeRobotDataset(torch.utils.data.Dataset):
episode_buffer["index"] = np.arange(self.meta.total_frames, self.meta.total_frames + episode_length)
episode_buffer["episode_index"] = np.full((episode_length,), episode_index)
# Update tasks and task indices with new tasks if any
self.meta.save_episode_tasks(episode_tasks)
# Add new tasks to the tasks dictionary
for task in episode_tasks:
task_index = self.meta.get_task_index(task)
if task_index is None:
self.meta.add_task(task)
# Given tasks in natural language, find their corresponding task indices
episode_buffer["task_index"] = np.array([self.meta.get_task_index(task) for task in tasks])
@@ -933,142 +855,72 @@ class LeRobotDataset(torch.utils.data.Dataset):
continue
episode_buffer[key] = np.stack(episode_buffer[key])
# Wait for image writer to end, so that episode stats over images can be computed
self._wait_image_writer()
self._save_episode_table(episode_buffer, episode_index)
ep_stats = compute_episode_stats(episode_buffer, self.features)
ep_metadata = self._save_episode_data(episode_buffer)
for video_key in self.meta.video_keys:
ep_metadata.update(self._save_episode_video(video_key, episode_index))
has_video_keys = len(self.meta.video_keys) > 0
use_batched_encoding = self.batch_encoding_size > 1
# `meta.save_episode` needs to be executed after encoding the videos
self.meta.save_episode(episode_index, episode_length, episode_tasks, ep_stats, ep_metadata)
if has_video_keys and not use_batched_encoding:
self.encode_episode_videos(episode_index)
if not episode_data:
# Reset episode buffer and clean up temporary images
self.clear_episode_buffer()
# `meta.save_episode` should be executed after encoding the videos
self.meta.save_episode(episode_index, episode_length, episode_tasks, ep_stats)
def _save_episode_data(self, episode_buffer: dict) -> dict:
"""Save episode data to a parquet file and update the Hugging Face dataset of frames data.
This function processes episode data from a buffer, converts it into a Hugging Face dataset,
and saves it as a parquet file. It handles both the creation of new parquet files and the
updating of existing ones based on size constraints. After saving the data, it reloads
the Hugging Face dataset to ensure it is up-to-date.
Notes: We need to update both the parquet files and the HF dataset:
- `pandas` loads the parquet file in RAM
- `datasets` relies on a memory mapping from pyarrow (no RAM). It either converts parquet files to a pyarrow cache on disk,
or loads directly from the pyarrow cache.
"""
# Convert buffer into HF Dataset
ep_dict = {key: episode_buffer[key] for key in self.hf_features}
ep_dataset = datasets.Dataset.from_dict(ep_dict, features=self.hf_features, split="train")
ep_dataset = embed_images(ep_dataset)
ep_size_in_mb = get_hf_dataset_size_in_mb(ep_dataset)
ep_num_frames = len(ep_dataset)
df = pd.DataFrame(ep_dataset)
if self.meta.episodes is None:
# Initialize indices and frame count for a new dataset made of the first episode data
chunk_idx, file_idx = 0, 0
latest_num_frames = 0
else:
# Retrieve information from the latest parquet file
latest_ep = self.meta.episodes[-1]
chunk_idx = latest_ep["data/chunk_index"]
file_idx = latest_ep["data/file_index"]
latest_path = self.root / self.meta.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
latest_size_in_mb = get_parquet_file_size_in_mb(latest_path)
latest_num_frames = get_parquet_num_frames(latest_path)
# Determine if a new parquet file is needed
if latest_size_in_mb + ep_size_in_mb >= self.meta.data_files_size_in_mb:
# Size limit is reached, prepare new parquet file
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self.meta.chunks_size)
latest_num_frames = 0
else:
# Update the existing parquet file with new rows
latest_df = pd.read_parquet(latest_path)
df = pd.concat([latest_df, df], ignore_index=True)
# Write the resulting dataframe from RAM to disk
path = self.root / self.meta.data_path.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
if len(self.meta.image_keys) > 0:
to_parquet_with_hf_images(df, path)
else:
df.to_parquet(path)
# Update the Hugging Face dataset by reloading it.
# This process should be fast because only the latest Parquet file has been modified.
# Therefore, only this file needs to be converted to PyArrow; the rest is loaded from the PyArrow memory-mapped cache.
self.hf_dataset = self.load_hf_dataset()
metadata = {
"data/chunk_index": chunk_idx,
"data/file_index": file_idx,
"dataset_from_index": latest_num_frames,
"dataset_to_index": latest_num_frames + ep_num_frames,
}
return metadata
def _save_episode_video(self, video_key: str, episode_index: int):
# Encode episode frames into a temporary video
ep_path = self._encode_temporary_episode_video(video_key, episode_index)
ep_size_in_mb = get_video_size_in_mb(ep_path)
ep_duration_in_s = get_video_duration_in_s(ep_path)
if self.meta.episodes is None:
# Initialize indices for a new dataset made of the first episode data
chunk_idx, file_idx = 0, 0
latest_duration_in_s = 0
new_path = self.root / self.meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
new_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(ep_path), str(new_path))
else:
# Retrieve information from the latest video file
latest_ep = self.meta.episodes[-1]
chunk_idx = latest_ep[f"videos/{video_key}/chunk_index"]
file_idx = latest_ep[f"videos/{video_key}/file_index"]
latest_path = self.root / self.meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
latest_size_in_mb = get_video_size_in_mb(latest_path)
latest_duration_in_s = get_video_duration_in_s(latest_path)
if latest_size_in_mb + ep_size_in_mb >= self.meta.video_files_size_in_mb:
# Move temporary episode video to a new video file in the dataset
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, self.meta.chunks_size)
new_path = self.root / self.meta.video_path.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
# Check if we should trigger batch encoding
if has_video_keys and use_batched_encoding:
self.episodes_since_last_encoding += 1
if self.episodes_since_last_encoding == self.batch_encoding_size:
start_ep = self.num_episodes - self.batch_encoding_size
end_ep = self.num_episodes
logging.info(
f"Batch encoding {self.batch_encoding_size} videos for episodes {start_ep} to {end_ep - 1}"
)
new_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(ep_path), str(new_path))
else:
# Update latest video file
concat_video_files([latest_path, ep_path], self.root, video_key, chunk_idx, file_idx)
self.batch_encode_videos(start_ep, end_ep)
self.episodes_since_last_encoding = 0
# Remove temporary directory
shutil.rmtree(str(ep_path.parent))
# Episode data index and timestamp checking
ep_data_index = get_episode_data_index(self.meta.episodes, [episode_index])
ep_data_index_np = {k: t.numpy() for k, t in ep_data_index.items()}
check_timestamps_sync(
episode_buffer["timestamp"],
episode_buffer["episode_index"],
ep_data_index_np,
self.fps,
self.tolerance_s,
)
metadata = {
"episode_index": episode_index,
f"videos/{video_key}/chunk_index": chunk_idx,
f"videos/{video_key}/file_index": file_idx,
f"videos/{video_key}/from_timestamp": latest_duration_in_s,
f"videos/{video_key}/to_timestamp": latest_duration_in_s + ep_duration_in_s,
}
return metadata
# Verify that we have one parquet file per episode and that the number of video files matches the number of encoded episodes
parquet_files = list(self.root.rglob("*.parquet"))
assert len(parquet_files) == self.num_episodes
video_files = list(self.root.rglob("*.mp4"))
assert len(video_files) == (self.num_episodes - self.episodes_since_last_encoding) * len(
self.meta.video_keys
)
if not episode_data: # Reset the buffer
self.episode_buffer = self.create_episode_buffer()
def _save_episode_table(self, episode_buffer: dict, episode_index: int) -> None:
episode_dict = {key: episode_buffer[key] for key in self.hf_features}
ep_dataset = datasets.Dataset.from_dict(episode_dict, features=self.hf_features, split="train")
ep_dataset = embed_images(ep_dataset)
self.hf_dataset = concatenate_datasets([self.hf_dataset, ep_dataset])
self.hf_dataset.set_transform(hf_transform_to_torch)
ep_data_path = self.root / self.meta.get_data_file_path(ep_index=episode_index)
ep_data_path.parent.mkdir(parents=True, exist_ok=True)
ep_dataset.to_parquet(ep_data_path)
def clear_episode_buffer(self) -> None:
episode_index = self.episode_buffer["episode_index"]
# Clean up image files for the current episode buffer
if self.image_writer is not None:
for cam_key in self.meta.camera_keys:
img_dir = self.root / "images" / cam_key
img_dir = self._get_image_file_path(
episode_index=episode_index, image_key=cam_key, frame_index=0
).parent
if img_dir.is_dir():
shutil.rmtree(img_dir)
@@ -1089,7 +941,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
def stop_image_writer(self) -> None:
"""
Whenever wrapping this dataset inside a parallelized DataLoader, this needs to be called first to
remove the image_writer in order for the LeRobotDataset object to be pickleable and parallelized.
remove the image_writer in order for the LeRobotDataset object to be picklable and parallelized.
"""
if self.image_writer is not None:
self.image_writer.stop()
@@ -1100,16 +952,55 @@ class LeRobotDataset(torch.utils.data.Dataset):
if self.image_writer is not None:
self.image_writer.wait_until_done()
def _encode_temporary_episode_video(self, video_key: str, episode_index: int) -> Path:
def encode_episode_videos(self, episode_index: int) -> None:
"""
Use ffmpeg to convert frames stored as png into mp4 videos.
Note: `encode_video_frames` is a blocking call. Making it asynchronous shouldn't speed up encoding,
since video encoding with ffmpeg is already using multithreading.
This method handles video encoding steps:
- Video encoding via ffmpeg
- Video info updating in metadata
- Raw image cleanup
Args:
episode_index (int): Index of the episode to encode.
"""
temp_path = Path(tempfile.mkdtemp(dir=self.root)) / f"{video_key}_{episode_index:03d}.mp4"
img_dir = self._get_image_file_dir(episode_index, video_key)
encode_video_frames(img_dir, temp_path, self.fps, overwrite=True)
return temp_path
for key in self.meta.video_keys:
video_path = self.root / self.meta.get_video_file_path(episode_index, key)
if video_path.is_file():
# Skip if video is already encoded. Could be the case when resuming data recording.
continue
img_dir = self._get_image_file_path(
episode_index=episode_index, image_key=key, frame_index=0
).parent
encode_video_frames(img_dir, video_path, self.fps, overwrite=True)
shutil.rmtree(img_dir)
# Update video info (only needed when first episode is encoded since it reads from episode 0)
if len(self.meta.video_keys) > 0 and episode_index == 0:
self.meta.update_video_info()
write_info(self.meta.info, self.meta.root)  # ensure video info is always written properly
def batch_encode_videos(self, start_episode: int = 0, end_episode: int | None = None) -> None:
"""
Batch encode videos for multiple episodes.
Args:
start_episode: Starting episode index (inclusive)
end_episode: Ending episode index (exclusive). If None, encodes all episodes from start_episode
"""
if end_episode is None:
end_episode = self.meta.total_episodes
logging.info(f"Starting batch video encoding for episodes {start_episode} to {end_episode - 1}")
# Encode all episodes with cleanup enabled for individual episodes
for ep_idx in range(start_episode, end_episode):
logging.info(f"Encoding videos for episode {ep_idx}")
self.encode_episode_videos(ep_idx)
logging.info("Batch video encoding completed")
@classmethod
def create(
@@ -1124,6 +1015,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
image_writer_processes: int = 0,
image_writer_threads: int = 0,
video_backend: str | None = None,
batch_encoding_size: int = 1,
) -> "LeRobotDataset":
"""Create a LeRobot Dataset from scratch in order to record data."""
obj = cls.__new__(cls)
@@ -1140,6 +1032,8 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.revision = None
obj.tolerance_s = tolerance_s
obj.image_writer = None
obj.batch_encoding_size = batch_encoding_size
obj.episodes_since_last_encoding = 0
if image_writer_processes or image_writer_threads:
obj.start_image_writer(image_writer_processes, image_writer_threads)
@@ -1152,6 +1046,7 @@ class LeRobotDataset(torch.utils.data.Dataset):
obj.image_transforms = None
obj.delta_timestamps = None
obj.delta_indices = None
obj.episode_data_index = None
obj.video_backend = video_backend if video_backend is not None else get_safe_default_codec()
return obj
+5 -3
@@ -337,11 +337,13 @@ def compute_sampler_weights(
if len(offline_dataset) > 0:
offline_data_mask_indices = []
for start_index, end_index in zip(
offline_dataset.meta.episodes["dataset_from_index"],
offline_dataset.meta.episodes["dataset_to_index"],
offline_dataset.episode_data_index["from"],
offline_dataset.episode_data_index["to"],
strict=True,
):
offline_data_mask_indices.extend(range(start_index, end_index - offline_drop_n_last_frames))
offline_data_mask_indices.extend(
range(start_index.item(), end_index.item() - offline_drop_n_last_frames)
)
offline_data_mask = torch.zeros(len(offline_dataset), dtype=torch.bool)
offline_data_mask[torch.tensor(offline_data_mask_indices)] = True
weights.append(
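The masking above can be reproduced in isolation. A self-contained sketch with illustrative episode boundaries:
```python
import torch

episode_from = [0, 5]   # start index of each episode in the dataset
episode_to = [5, 9]     # end index (exclusive) of each episode
drop_n_last_frames = 2  # exclude frames whose delta lookups would cross the boundary

mask_indices = []
for start, end in zip(episode_from, episode_to, strict=True):
    mask_indices.extend(range(start, end - drop_n_last_frames))

mask = torch.zeros(9, dtype=torch.bool)
mask[torch.tensor(mask_indices)] = True
print(mask)  # tensor([ True,  True,  True, False, False,  True,  True, False, False])
```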
+6 -6
@@ -21,8 +21,7 @@ import torch
class EpisodeAwareSampler:
def __init__(
self,
dataset_from_indices: list[int],
dataset_to_indices: list[int],
episode_data_index: dict,
episode_indices_to_use: list | None = None,
drop_n_first_frames: int = 0,
drop_n_last_frames: int = 0,
@@ -31,8 +30,7 @@ class EpisodeAwareSampler:
"""Sampler that optionally incorporates episode boundary information.
Args:
dataset_from_indices: List of indices containing the start of each episode in the dataset.
dataset_to_indices: List of indices containing the end of each episode in the dataset.
episode_data_index: Dictionary with keys 'from' and 'to' containing the start and end indices of each episode.
episode_indices_to_use: List of episode indices to use. If None, all episodes are used.
Assumes that episodes are indexed from 0 to N-1.
drop_n_first_frames: Number of frames to drop from the start of each episode.
@@ -41,10 +39,12 @@ class EpisodeAwareSampler:
"""
indices = []
for episode_idx, (start_index, end_index) in enumerate(
zip(dataset_from_indices, dataset_to_indices, strict=True)
zip(episode_data_index["from"], episode_data_index["to"], strict=True)
):
if episode_indices_to_use is None or episode_idx in episode_indices_to_use:
indices.extend(range(start_index + drop_n_first_frames, end_index - drop_n_last_frames))
indices.extend(
range(start_index.item() + drop_n_first_frames, end_index.item() - drop_n_last_frames)
)
self.indices = indices
self.shuffle = shuffle
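A hypothetical usage sketch with a PyTorch DataLoader, assuming `episode_data_index` holds 'from'/'to' LongTensors as above (`dataset` is a placeholder for your LeRobotDataset instance):
```python
import torch
from torch.utils.data import DataLoader

episode_data_index = {
    "from": torch.LongTensor([0, 100]),
    "to": torch.LongTensor([100, 250]),
}
sampler = EpisodeAwareSampler(
    episode_data_index,
    drop_n_last_frames=7,  # e.g. to keep action-chunk lookups inside the episode
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)  # dataset: placeholder
```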
+179 -175
@@ -17,59 +17,43 @@ import contextlib
import importlib.resources
import json
import logging
import subprocess
from collections.abc import Iterator
from itertools import accumulate
from pathlib import Path
from pprint import pformat
from types import SimpleNamespace
from typing import Any
import datasets
import jsonlines
import numpy as np
import packaging.version
import pandas
import pandas as pd
import pyarrow.parquet as pq
import torch
from datasets import Dataset, concatenate_datasets
from datasets.table import embed_table_storage
from huggingface_hub import DatasetCard, DatasetCardData, HfApi
from huggingface_hub.errors import RevisionNotFoundError
from PIL import Image as PILImage
from torchvision import transforms
from lerobot.configs.types import FeatureType, PolicyFeature
from lerobot.configs.types import DictLike, FeatureType, PolicyFeature
from lerobot.datasets.backward_compatibility import (
V21_MESSAGE,
BackwardCompatibilityError,
ForwardCompatibilityError,
)
from lerobot.robots import Robot
from lerobot.utils.utils import is_valid_numpy_dtype_string
DEFAULT_CHUNK_SIZE = 1000 # Max number of files per chunk
DEFAULT_DATA_FILE_SIZE_IN_MB = 100 # Max size per file
DEFAULT_VIDEO_FILE_SIZE_IN_MB = 500 # Max size per file
DEFAULT_CHUNK_SIZE = 1000 # Max number of episodes per chunk
INFO_PATH = "meta/info.json"
EPISODES_PATH = "meta/episodes.jsonl"
STATS_PATH = "meta/stats.json"
EPISODES_STATS_PATH = "meta/episodes_stats.jsonl"
TASKS_PATH = "meta/tasks.jsonl"
EPISODES_DIR = "meta/episodes"
DATA_DIR = "data"
VIDEO_DIR = "videos"
CHUNK_FILE_PATTERN = "chunk-{chunk_index:03d}/file-{file_index:03d}"
DEFAULT_TASKS_PATH = "meta/tasks.parquet"
DEFAULT_EPISODES_PATH = EPISODES_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_DATA_PATH = DATA_DIR + "/" + CHUNK_FILE_PATTERN + ".parquet"
DEFAULT_VIDEO_PATH = VIDEO_DIR + "/{video_key}/" + CHUNK_FILE_PATTERN + ".mp4"
DEFAULT_IMAGE_PATH = "images/{image_key}/episode-{episode_index:06d}/frame-{frame_index:06d}.png"
LEGACY_EPISODES_PATH = "meta/episodes.jsonl"
LEGACY_EPISODES_STATS_PATH = "meta/episodes_stats.jsonl"
LEGACY_TASKS_PATH = "meta/tasks.jsonl"
LEGACY_DEFAULT_VIDEO_PATH = "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"
LEGACY_DEFAULT_PARQUET_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
DEFAULT_VIDEO_PATH = "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"
DEFAULT_PARQUET_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
DEFAULT_IMAGE_PATH = "images/{image_key}/episode_{episode_index:06d}/frame_{frame_index:06d}.png"
DATASET_CARD_TEMPLATE = """
---
@@ -90,79 +74,6 @@ DEFAULT_FEATURES = {
}
def get_parquet_file_size_in_mb(parquet_path: str | Path) -> float:
metadata = pq.read_metadata(parquet_path)
total_uncompressed_size = 0
for row_group in range(metadata.num_row_groups):
rg_metadata = metadata.row_group(row_group)
for column in range(rg_metadata.num_columns):
col_metadata = rg_metadata.column(column)
total_uncompressed_size += col_metadata.total_uncompressed_size
return total_uncompressed_size / (1024**2)
def get_hf_dataset_size_in_mb(hf_ds: Dataset) -> int:
return hf_ds.data.nbytes // (1024**2)
def update_chunk_file_indices(chunk_idx: int, file_idx: int, chunks_size: int) -> tuple[int, int]:
if file_idx == chunks_size - 1:
file_idx = 0
chunk_idx += 1
else:
file_idx += 1
return chunk_idx, file_idx
def load_nested_dataset(pq_dir: Path, features: datasets.Features | None = None) -> Dataset:
"""Find parquet files in provided directory {pq_dir}/chunk-xxx/file-xxx.parquet
Convert parquet files to pyarrow memory mapped in a cache folder for efficient RAM usage
Concatenate all pyarrow references to return HF Dataset format
Args:
pq_dir: Directory containing parquet files
features: Optional features schema to ensure consistent loading of complex types like images
"""
paths = sorted(pq_dir.glob("*/*.parquet"))
if len(paths) == 0:
raise FileNotFoundError(f"Provided directory does not contain any parquet file: {pq_dir}")
# TODO(rcadene): set num_proc to accelerate conversion to pyarrow
datasets = [Dataset.from_parquet(str(path), features=features) for path in paths]
return concatenate_datasets(datasets)
def get_parquet_num_frames(parquet_path: str | Path) -> int:
metadata = pq.read_metadata(parquet_path)
return metadata.num_rows
def get_video_size_in_mb(mp4_path: Path) -> float:
file_size_bytes = mp4_path.stat().st_size
file_size_mb = file_size_bytes / (1024**2)
return file_size_mb
def get_video_duration_in_s(mp4_file: Path) -> float:
# TODO(rcadene): move to video_utils.py
command = [
"ffprobe",
"-v",
"error",
"-show_entries",
"format=duration",
"-of",
"default=noprint_wrappers=1:nokey=1",
str(mp4_file),
]
result = subprocess.run(
command,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
)
return float(result.stdout)
def flatten_dict(d: dict, parent_key: str = "", sep: str = "/") -> dict:
"""Flatten a nested dictionary structure by collapsing nested keys into one key with a separator.
@@ -171,7 +82,6 @@ def flatten_dict(d: dict, parent_key: str = "", sep: str = "/") -> dict:
>>> dct = {"a": {"b": 1, "c": {"d": 2}}, "e": 3}
>>> print(flatten_dict(dct))
{"a/b": 1, "a/c/d": 2, "e": 3}
```
"""
items = []
for k, v in d.items():
@@ -196,13 +106,23 @@ def unflatten_dict(d: dict, sep: str = "/") -> dict:
return outdict
def get_nested_item(obj: DictLike, flattened_key: str, sep: str = "/") -> Any:
split_keys = flattened_key.split(sep)
getter = obj[split_keys[0]]
if len(split_keys) == 1:
return getter
for key in split_keys[1:]:
getter = getter[key]
return getter
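A quick usage example of `get_nested_item` above, assuming the default "/" separator:
```python
nested = {"stats": {"action": {"mean": [0.1, 0.2]}}}
print(get_nested_item(nested, "stats/action/mean"))  # [0.1, 0.2]
```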
def serialize_dict(stats: dict[str, torch.Tensor | np.ndarray | dict]) -> dict:
serialized_dict = {}
for key, value in flatten_dict(stats).items():
if isinstance(value, (torch.Tensor, np.ndarray)):
serialized_dict[key] = value.tolist()
elif isinstance(value, list) and isinstance(value[0], (int, float, list)):
serialized_dict[key] = value
elif isinstance(value, np.generic):
serialized_dict[key] = value.item()
elif isinstance(value, (int, float)):
@@ -232,7 +152,24 @@ def write_json(data: dict, fpath: Path) -> None:
json.dump(data, f, indent=4, ensure_ascii=False)
def write_info(info: dict, local_dir: Path) -> None:
def load_jsonlines(fpath: Path) -> list[Any]:
with jsonlines.open(fpath, "r") as reader:
return list(reader)
def write_jsonlines(data: dict, fpath: Path) -> None:
fpath.parent.mkdir(exist_ok=True, parents=True)
with jsonlines.open(fpath, "w") as writer:
writer.write_all(data)
def append_jsonlines(data: dict, fpath: Path) -> None:
fpath.parent.mkdir(exist_ok=True, parents=True)
with jsonlines.open(fpath, "a") as writer:
writer.write(data)
def write_info(info: dict, local_dir: Path):
write_json(info, local_dir / INFO_PATH)
@@ -243,55 +180,65 @@ def load_info(local_dir: Path) -> dict:
return info
def write_stats(stats: dict, local_dir: Path) -> None:
def write_stats(stats: dict, local_dir: Path):
serialized_stats = serialize_dict(stats)
write_json(serialized_stats, local_dir / STATS_PATH)
def cast_stats_to_numpy(stats: dict) -> dict[str, dict[str, np.ndarray]]:
def cast_stats_to_numpy(stats) -> dict[str, dict[str, np.ndarray]]:
stats = {key: np.array(value) for key, value in flatten_dict(stats).items()}
return unflatten_dict(stats)
def load_stats(local_dir: Path) -> dict[str, dict[str, np.ndarray]] | None:
def load_stats(local_dir: Path) -> dict[str, dict[str, np.ndarray]]:
if not (local_dir / STATS_PATH).exists():
return None
stats = load_json(local_dir / STATS_PATH)
return cast_stats_to_numpy(stats)
def write_tasks(tasks: pandas.DataFrame, local_dir: Path) -> None:
path = local_dir / DEFAULT_TASKS_PATH
path.parent.mkdir(parents=True, exist_ok=True)
tasks.to_parquet(path)
def write_task(task_index: int, task: dict, local_dir: Path):
task_dict = {
"task_index": task_index,
"task": task,
}
append_jsonlines(task_dict, local_dir / TASKS_PATH)
def load_tasks(local_dir: Path) -> pandas.DataFrame:
tasks = pd.read_parquet(local_dir / DEFAULT_TASKS_PATH)
return tasks
def load_tasks(local_dir: Path) -> tuple[dict, dict]:
tasks = load_jsonlines(local_dir / TASKS_PATH)
tasks = {item["task_index"]: item["task"] for item in sorted(tasks, key=lambda x: x["task_index"])}
task_to_task_index = {task: task_index for task_index, task in tasks.items()}
return tasks, task_to_task_index
def write_episodes(episodes: Dataset, local_dir: Path) -> None:
if get_hf_dataset_size_in_mb(episodes) > DEFAULT_DATA_FILE_SIZE_IN_MB:
raise NotImplementedError("Contact a maintainer.")
fpath = local_dir / DEFAULT_EPISODES_PATH.format(chunk_index=0, file_index=0)
fpath.parent.mkdir(parents=True, exist_ok=True)
episodes.to_parquet(fpath)
def write_episode(episode: dict, local_dir: Path):
append_jsonlines(episode, local_dir / EPISODES_PATH)
def load_episodes(local_dir: Path) -> datasets.Dataset:
episodes = load_nested_dataset(local_dir / EPISODES_DIR)
# Select episode features/columns containing references to episode data and videos
# (e.g. tasks, dataset_from_index, dataset_to_index, data/chunk_index, data/file_index, etc.)
# This is to speedup access to these data, instead of having to load episode stats.
episodes = episodes.select_columns([key for key in episodes.features if not key.startswith("stats/")])
return episodes
def load_episodes(local_dir: Path) -> dict:
episodes = load_jsonlines(local_dir / EPISODES_PATH)
return {item["episode_index"]: item for item in sorted(episodes, key=lambda x: x["episode_index"])}
def write_episode_stats(episode_index: int, episode_stats: dict, local_dir: Path):
# We wrap episode_stats in a dictionary since `episode_stats["episode_index"]`
# is a dictionary of stats and not an integer.
episode_stats = {"episode_index": episode_index, "stats": serialize_dict(episode_stats)}
append_jsonlines(episode_stats, local_dir / EPISODES_STATS_PATH)
def load_episodes_stats(local_dir: Path) -> dict:
episodes_stats = load_jsonlines(local_dir / EPISODES_STATS_PATH)
return {
item["episode_index"]: cast_stats_to_numpy(item["stats"])
for item in sorted(episodes_stats, key=lambda x: x["episode_index"])
}
def backward_compatible_episodes_stats(
stats: dict[str, dict[str, np.ndarray]], episodes: list[int]
) -> dict[int, dict[str, dict[str, np.ndarray]]]:
) -> dict[str, dict[str, np.ndarray]]:
return dict.fromkeys(episodes, stats)
@@ -307,7 +254,7 @@ def load_image_as_numpy(
return img_array
def hf_transform_to_torch(items_dict: dict[str, list[Any]]) -> dict[str, list[torch.Tensor | str]]:
def hf_transform_to_torch(items_dict: dict[torch.Tensor | None]):
"""Get a transform function that convert items from Hugging Face dataset (pyarrow)
to torch tensors. Importantly, images are converted from PIL, which corresponds to
a channel last representation (h w c) of uint8 type, to a torch image representation
@@ -492,17 +439,6 @@ def build_dataset_frame(
return frame
def get_features_from_robot(robot: Robot, use_videos: bool = True) -> dict:
# TODO(rcadene): add fps for each feature
camera_ft = {}
if robot.cameras:
camera_ft = {
key: {"dtype": "video" if use_videos else "image", **ft}
for key, ft in robot.camera_features.items()
}
return {**robot.motor_features, **camera_ft, **DEFAULT_FEATURES}
def dataset_to_policy_features(features: dict[str, dict]) -> dict[str, PolicyFeature]:
# TODO(aliberts): Implement "type" in dataset features and simplify this
policy_features = {}
@@ -547,17 +483,104 @@ def create_empty_dataset_info(
"total_episodes": 0,
"total_frames": 0,
"total_tasks": 0,
"total_videos": 0,
"total_chunks": 0,
"chunks_size": DEFAULT_CHUNK_SIZE,
"data_files_size_in_mb": DEFAULT_DATA_FILE_SIZE_IN_MB,
"video_files_size_in_mb": DEFAULT_VIDEO_FILE_SIZE_IN_MB,
"fps": fps,
"splits": {},
"data_path": DEFAULT_DATA_PATH,
"data_path": DEFAULT_PARQUET_PATH,
"video_path": DEFAULT_VIDEO_PATH if use_videos else None,
"features": features,
}
def get_episode_data_index(
episode_dicts: dict[dict], episodes: list[int] | None = None
) -> dict[str, torch.Tensor]:
episode_lengths = {ep_idx: ep_dict["length"] for ep_idx, ep_dict in episode_dicts.items()}
if episodes is not None:
episode_lengths = {ep_idx: episode_lengths[ep_idx] for ep_idx in episodes}
cumulative_lengths = list(accumulate(episode_lengths.values()))
return {
"from": torch.LongTensor([0] + cumulative_lengths[:-1]),
"to": torch.LongTensor(cumulative_lengths),
}
def check_timestamps_sync(
timestamps: np.ndarray,
episode_indices: np.ndarray,
episode_data_index: dict[str, np.ndarray],
fps: int,
tolerance_s: float,
raise_value_error: bool = True,
) -> bool:
"""
This check is to make sure that each timestamp is separated from the next by (1/fps) +/- tolerance
to account for possible numerical error.
Args:
timestamps (np.ndarray): Array of timestamps in seconds.
episode_indices (np.ndarray): Array indicating the episode index for each timestamp.
episode_data_index (dict[str, np.ndarray]): A dictionary that includes 'to',
which identifies indices for the end of each episode.
fps (int): Frames per second. Used to check the expected difference between consecutive timestamps.
tolerance_s (float): Allowed deviation from the expected (1/fps) difference.
raise_value_error (bool): Whether to raise a ValueError if the check fails.
Returns:
bool: True if all checked timestamp differences lie within tolerance, False otherwise.
Raises:
ValueError: If the check fails and `raise_value_error` is True.
"""
if timestamps.shape != episode_indices.shape:
raise ValueError(
"timestamps and episode_indices should have the same shape. "
f"Found {timestamps.shape=} and {episode_indices.shape=}."
)
# Consecutive differences
diffs = np.diff(timestamps)
within_tolerance = np.abs(diffs - (1.0 / fps)) <= tolerance_s
# Mask to ignore differences at the boundaries between episodes
mask = np.ones(len(diffs), dtype=bool)
ignored_diffs = episode_data_index["to"][:-1] - 1 # indices at the end of each episode
mask[ignored_diffs] = False
filtered_within_tolerance = within_tolerance[mask]
# Check if all remaining diffs are within tolerance
if not np.all(filtered_within_tolerance):
# Track original indices before masking
original_indices = np.arange(len(diffs))
filtered_indices = original_indices[mask]
outside_tolerance_filtered_indices = np.nonzero(~filtered_within_tolerance)[0]
outside_tolerance_indices = filtered_indices[outside_tolerance_filtered_indices]
outside_tolerances = []
for idx in outside_tolerance_indices:
entry = {
"timestamps": [timestamps[idx], timestamps[idx + 1]],
"diff": diffs[idx],
"episode_index": episode_indices[idx].item()
if hasattr(episode_indices[idx], "item")
else episode_indices[idx],
}
outside_tolerances.append(entry)
if raise_value_error:
raise ValueError(
f"""One or several timestamps unexpectedly violate the tolerance inside episode range.
This might be due to synchronization issues during data collection.
\n{pformat(outside_tolerances)}"""
)
return False
return True
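A worked example of the check above: at fps=30 the expected gap is 1/30 s, and the single out-of-tolerance diff at the episode boundary is masked out (numbers are illustrative):
```python
import numpy as np

fps, tolerance_s = 30, 1e-4
timestamps = np.array([0.0, 1 / 30, 2 / 30, 0.0, 1 / 30])  # two episodes, reset at index 3
episode_data_index = {"from": np.array([0, 3]), "to": np.array([3, 5])}

diffs = np.diff(timestamps)
within_tolerance = np.abs(diffs - 1.0 / fps) <= tolerance_s
mask = np.ones(len(diffs), dtype=bool)
mask[episode_data_index["to"][:-1] - 1] = False  # ignore the boundary diff
print(np.all(within_tolerance[mask]))  # True: only the masked boundary diff is off
```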
def check_delta_timestamps(
delta_timestamps: dict[str, list[float]], fps: int, tolerance_s: float, raise_value_error: bool = True
) -> bool:
@@ -596,7 +619,7 @@ def get_delta_indices(delta_timestamps: dict[str, list[float]], fps: int) -> dic
return delta_indices
def cycle(iterable: Any) -> Iterator[Any]:
def cycle(iterable):
"""The equivalent of itertools.cycle, but safe for Pytorch dataloaders.
See https://github.com/pytorch/pytorch/issues/23900 for information on why itertools.cycle is not safe.
@@ -609,7 +632,7 @@ def cycle(iterable: Any) -> Iterator[Any]:
iterator = iter(iterable)
def create_branch(repo_id: str, *, branch: str, repo_type: str | None = None) -> None:
def create_branch(repo_id, *, branch: str, repo_type: str | None = None) -> None:
"""Create a branch on a existing Hugging Face repo. Delete the branch if it already
exists before creating it.
"""
@@ -630,7 +653,7 @@ def create_lerobot_dataset_card(
**kwargs,
) -> DatasetCard:
"""
Keyword arguments will be used to replace values in ./lerobot/datasets/card_template.md.
Keyword arguments will be used to replace values in src/lerobot/datasets/card_template.md.
Note: If specified, license must be one of https://huggingface.co/docs/hub/repositories-licenses.
"""
card_tags = ["LeRobot"]
@@ -717,28 +740,21 @@ class IterableNamespace(SimpleNamespace):
return vars(self).keys()
def validate_frame(frame: dict, features: dict) -> None:
def validate_frame(frame: dict, features: dict):
expected_features = set(features) - set(DEFAULT_FEATURES)
actual_features = set(frame)
# task is a special required field that's not part of regular features
if "task" not in actual_features:
raise ValueError("Feature mismatch in `frame` dictionary:\nMissing features: {'task'}\n")
error_message = validate_features_presence(actual_features, expected_features)
# Remove task from actual_features for regular feature validation
actual_features_for_validation = actual_features - {"task"}
error_message = validate_features_presence(actual_features_for_validation, expected_features)
common_features = actual_features_for_validation & expected_features
for name in common_features:
common_features = actual_features & expected_features
for name in common_features - {"task"}:
error_message += validate_feature_dtype_and_shape(name, features[name], frame[name])
if error_message:
raise ValueError(error_message)
def validate_features_presence(actual_features: set[str], expected_features: set[str]) -> str:
def validate_features_presence(actual_features: set[str], expected_features: set[str]):
error_message = ""
missing_features = expected_features - actual_features
extra_features = actual_features - expected_features
@@ -753,9 +769,7 @@ def validate_features_presence(actual_features: set[str], expected_features: set
return error_message
def validate_feature_dtype_and_shape(
name: str, feature: dict, value: np.ndarray | PILImage.Image | str
) -> str:
def validate_feature_dtype_and_shape(name: str, feature: dict, value: np.ndarray | PILImage.Image | str):
expected_dtype = feature["dtype"]
expected_shape = feature["shape"]
if is_valid_numpy_dtype_string(expected_dtype):
@@ -770,7 +784,7 @@ def validate_feature_dtype_and_shape(
def validate_feature_numpy_array(
name: str, expected_dtype: str, expected_shape: list[int], value: np.ndarray
) -> str:
):
error_message = ""
if isinstance(value, np.ndarray):
actual_dtype = value.dtype
@@ -787,9 +801,7 @@ def validate_feature_numpy_array(
return error_message
def validate_feature_image_or_video(
name: str, expected_shape: list[str], value: np.ndarray | PILImage.Image
) -> str:
def validate_feature_image_or_video(name: str, expected_shape: list[str], value: np.ndarray | PILImage.Image):
# Note: The check of pixels range ([0,1] for float and [0,255] for uint8) is done by the image writer threads.
error_message = ""
if isinstance(value, np.ndarray):
@@ -805,13 +817,13 @@ def validate_feature_image_or_video(
return error_message
def validate_feature_string(name: str, value: str) -> str:
def validate_feature_string(name: str, value: str):
if not isinstance(value, str):
return f"The feature '{name}' is expected to be of type 'str', but type '{type(value)}' provided instead.\n"
return ""
def validate_episode_buffer(episode_buffer: dict, total_episodes: int, features: dict) -> None:
def validate_episode_buffer(episode_buffer: dict, total_episodes: int, features: dict):
if "size" not in episode_buffer:
raise ValueError("size key not found in episode_buffer")
@@ -835,11 +847,3 @@ def validate_episode_buffer(episode_buffer: dict, total_episodes: int, features:
f"In episode_buffer not in features: {buffer_keys - set(features)}"
f"In features not in episode_buffer: {set(features) - buffer_keys}"
)
def to_parquet_with_hf_images(df: pandas.DataFrame, path: Path) -> None:
"""This function correctly writes to parquet a panda DataFrame that contains images encoded by HF dataset.
This way, it can be loaded by HF dataset and correctly formatted images are returned.
"""
# TODO(qlhoest): replace this weird syntax with `df.to_parquet(path)` only
datasets.Dataset.from_dict(df.to_dict(orient="list")).to_parquet(path)
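A hedged usage sketch (the DataFrame source is left abstract; image columns are assumed to be HF-encoded `{"bytes": ..., "path": ...}` dicts):

```python
from pathlib import Path

import pandas as pd

df: pd.DataFrame = ...  # rows with HF-encoded image columns (illustrative)
# A plain df.to_parquet(path) would not preserve the HF image encoding.
to_parquet_with_hf_images(df, Path("data/chunk-000/file_000.parquet"))
```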
@@ -121,12 +121,12 @@ from safetensors.torch import load_file
from lerobot.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_PARQUET_PATH,
DEFAULT_VIDEO_PATH,
EPISODES_PATH,
INFO_PATH,
LEGACY_DEFAULT_PARQUET_PATH,
LEGACY_DEFAULT_VIDEO_PATH,
LEGACY_EPISODES_PATH,
LEGACY_TASKS_PATH,
STATS_PATH,
TASKS_PATH,
create_branch,
create_lerobot_dataset_card,
flatten_dict,
@@ -290,12 +290,12 @@ def split_parquet_by_episodes(
for ep_chunk in range(total_chunks):
ep_chunk_start = DEFAULT_CHUNK_SIZE * ep_chunk
ep_chunk_end = min(DEFAULT_CHUNK_SIZE * (ep_chunk + 1), total_episodes)
chunk_dir = "/".join(LEGACY_DEFAULT_PARQUET_PATH.split("/")[:-1]).format(episode_chunk=ep_chunk)
chunk_dir = "/".join(DEFAULT_PARQUET_PATH.split("/")[:-1]).format(episode_chunk=ep_chunk)
(output_dir / chunk_dir).mkdir(parents=True, exist_ok=True)
for ep_idx in range(ep_chunk_start, ep_chunk_end):
ep_table = table.filter(pc.equal(table["episode_index"], ep_idx))
episode_lengths.insert(ep_idx, len(ep_table))
output_file = output_dir / LEGACY_DEFAULT_PARQUET_PATH.format(
output_file = output_dir / DEFAULT_PARQUET_PATH.format(
episode_chunk=ep_chunk, episode_index=ep_idx
)
pq.write_table(ep_table, output_file)
@@ -344,13 +344,13 @@ def move_videos(
ep_chunk_start = DEFAULT_CHUNK_SIZE * ep_chunk
ep_chunk_end = min(DEFAULT_CHUNK_SIZE * (ep_chunk + 1), total_episodes)
for vid_key in video_keys:
chunk_dir = "/".join(LEGACY_DEFAULT_VIDEO_PATH.split("/")[:-1]).format(
chunk_dir = "/".join(DEFAULT_VIDEO_PATH.split("/")[:-1]).format(
episode_chunk=ep_chunk, video_key=vid_key
)
(work_dir / chunk_dir).mkdir(parents=True, exist_ok=True)
for ep_idx in range(ep_chunk_start, ep_chunk_end):
target_path = LEGACY_DEFAULT_VIDEO_PATH.format(
target_path = DEFAULT_VIDEO_PATH.format(
episode_chunk=ep_chunk, video_key=vid_key, episode_index=ep_idx
)
video_file = V1_VIDEO_FILE.format(video_key=vid_key, episode_index=ep_idx)
@@ -418,7 +418,7 @@ def _get_lfs_untracked_videos(work_dir: Path, video_files: list[str]) -> list[st
def get_videos_info(repo_id: str, local_dir: Path, video_keys: list[str], branch: str) -> dict:
# Assumes first episode
video_files = [
LEGACY_DEFAULT_VIDEO_PATH.format(episode_chunk=0, video_key=vid_key, episode_index=0)
DEFAULT_VIDEO_PATH.format(episode_chunk=0, video_key=vid_key, episode_index=0)
for vid_key in video_keys
]
hub_api = HfApi()
@@ -495,7 +495,7 @@ def convert_dataset(
assert set(tasks) == {task for ep_tasks in tasks_by_episodes.values() for task in ep_tasks}
tasks = [{"task_index": task_idx, "task": task} for task_idx, task in enumerate(tasks)]
write_jsonlines(tasks, v20_dir / LEGACY_TASKS_PATH)
write_jsonlines(tasks, v20_dir / TASKS_PATH)
features["task_index"] = {
"dtype": "int64",
"shape": (1,),
@@ -545,7 +545,7 @@ def convert_dataset(
{"episode_index": ep_idx, "tasks": tasks_by_episodes[ep_idx], "length": episode_lengths[ep_idx]}
for ep_idx in episode_indices
]
write_jsonlines(episodes, v20_dir / LEGACY_EPISODES_PATH)
write_jsonlines(episodes, v20_dir / EPISODES_PATH)
# Assemble metadata v2.0
metadata_v2_0 = {
@@ -559,8 +559,8 @@ def convert_dataset(
"chunks_size": DEFAULT_CHUNK_SIZE,
"fps": metadata_v1["fps"],
"splits": {"train": f"0:{total_episodes}"},
"data_path": LEGACY_DEFAULT_PARQUET_PATH,
"video_path": LEGACY_DEFAULT_VIDEO_PATH if video_keys else None,
"data_path": DEFAULT_PARQUET_PATH,
"video_path": DEFAULT_VIDEO_PATH if video_keys else None,
"features": features,
}
write_json(metadata_v2_0, v20_dir / INFO_PATH)
@@ -37,7 +37,7 @@ import logging
from huggingface_hub import HfApi
from lerobot.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset
from lerobot.datasets.utils import STATS_PATH, load_stats, write_info
from lerobot.datasets.utils import EPISODES_STATS_PATH, STATS_PATH, load_stats, write_info
from lerobot.datasets.v21.convert_stats import check_aggregate_stats, convert_stats
V20 = "v2.0"
@@ -61,6 +61,9 @@ def convert_dataset(
with SuppressWarnings():
dataset = LeRobotDataset(repo_id, revision=V20, force_cache_sync=True)
if (dataset.root / EPISODES_STATS_PATH).is_file():
(dataset.root / EPISODES_STATS_PATH).unlink()
convert_stats(dataset, num_workers=num_workers)
ref_stats = load_stats(dataset.root)
check_aggregate_stats(dataset, ref_stats)
+2 -17
@@ -13,28 +13,13 @@
# limitations under the License.
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import jsonlines
import numpy as np
from tqdm import tqdm
from lerobot.datasets.compute_stats import aggregate_stats, get_feature_stats, sample_indices
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.utils import LEGACY_EPISODES_STATS_PATH, serialize_dict
def append_jsonlines(data: dict, fpath: Path) -> None:
fpath.parent.mkdir(exist_ok=True, parents=True)
with jsonlines.open(fpath, "a") as writer:
writer.write(data)
def legacy_write_episode_stats(episode_index: int, episode_stats: dict, local_dir: Path):
# We wrap episode_stats in a dictionary since `episode_stats["episode_index"]`
# is a dictionary of stats and not an integer.
episode_stats = {"episode_index": episode_index, "stats": serialize_dict(episode_stats)}
append_jsonlines(episode_stats, local_dir / LEGACY_EPISODES_STATS_PATH)
from lerobot.datasets.utils import write_episode_stats
def sample_episode_video_frames(dataset: LeRobotDataset, episode_index: int, ft_key: str) -> np.ndarray:
@@ -87,7 +72,7 @@ def convert_stats(dataset: LeRobotDataset, num_workers: int = 0):
convert_episode_stats(dataset, ep_idx)
for ep_idx in tqdm(range(total_episodes)):
legacy_write_episode_stats(ep_idx, dataset.meta.episodes_stats[ep_idx], dataset.root)
write_episode_stats(ep_idx, dataset.meta.episodes_stats[ep_idx], dataset.root)
def check_aggregate_stats(
@@ -1,480 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This script will help you convert any LeRobot dataset already pushed to the hub from codebase version 2.1 to
3.0. It will:
- Generate per-episode stats and write them to `episodes_stats.jsonl`
- Check consistency between these new stats and the old ones.
- Remove the deprecated `stats.json`.
- Update codebase_version in `info.json`.
- Push this new version to the hub on the 'main' branch and tag it with "v3.0".
Usage:
```bash
python src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py \
--repo-id=lerobot/pusht
```
"""
import argparse
import shutil
from pathlib import Path
from typing import Any
import jsonlines
import pandas as pd
import pyarrow as pa
import tqdm
from datasets import Dataset, Features, Image
from huggingface_hub import HfApi, snapshot_download
from requests import HTTPError
from lerobot.constants import HF_LEROBOT_HOME
from lerobot.datasets.compute_stats import aggregate_stats
from lerobot.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset
from lerobot.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_VIDEO_PATH,
LEGACY_EPISODES_PATH,
LEGACY_EPISODES_STATS_PATH,
LEGACY_TASKS_PATH,
cast_stats_to_numpy,
flatten_dict,
get_parquet_file_size_in_mb,
get_parquet_num_frames,
get_video_duration_in_s,
get_video_size_in_mb,
load_info,
update_chunk_file_indices,
write_episodes,
write_info,
write_stats,
write_tasks,
)
from lerobot.datasets.video_utils import concat_video_files
V21 = "v2.1"
"""
-------------------------
OLD
data/chunk-000/episode_000000.parquet
NEW
data/chunk-000/file_000.parquet
-------------------------
OLD
videos/chunk-000/CAMERA/episode_000000.mp4
NEW
videos/chunk-000/file_000.mp4
-------------------------
OLD
episodes.jsonl
{"episode_index": 1, "tasks": ["Put the blue block in the green bowl"], "length": 266}
NEW
meta/episodes/chunk-000/episodes_000.parquet
episode_index | video_chunk_index | video_file_index | data_chunk_index | data_file_index | tasks | length
-------------------------
OLD
tasks.jsonl
{"task_index": 1, "task": "Put the blue block in the green bowl"}
NEW
meta/tasks/chunk-000/file_000.parquet
task_index | task
-------------------------
OLD
episodes_stats.jsonl
NEW
meta/episodes_stats/chunk-000/file_000.parquet
episode_index | mean | std | min | max
-------------------------
UPDATE
meta/info.json
-------------------------
"""
def load_jsonlines(fpath: Path) -> list[Any]:
with jsonlines.open(fpath, "r") as reader:
return list(reader)
def legacy_load_episodes(local_dir: Path) -> dict:
episodes = load_jsonlines(local_dir / LEGACY_EPISODES_PATH)
return {item["episode_index"]: item for item in sorted(episodes, key=lambda x: x["episode_index"])}
def legacy_load_episodes_stats(local_dir: Path) -> dict:
episodes_stats = load_jsonlines(local_dir / LEGACY_EPISODES_STATS_PATH)
return {
item["episode_index"]: cast_stats_to_numpy(item["stats"])
for item in sorted(episodes_stats, key=lambda x: x["episode_index"])
}
def legacy_load_tasks(local_dir: Path) -> tuple[dict, dict]:
tasks = load_jsonlines(local_dir / LEGACY_TASKS_PATH)
tasks = {item["task_index"]: item["task"] for item in sorted(tasks, key=lambda x: x["task_index"])}
task_to_task_index = {task: task_index for task_index, task in tasks.items()}
return tasks, task_to_task_index
def convert_tasks(root, new_root):
tasks, _ = legacy_load_tasks(root)
task_indices = tasks.keys()
task_strings = tasks.values()
df_tasks = pd.DataFrame({"task_index": task_indices}, index=task_strings)
write_tasks(df_tasks, new_root)
def concat_data_files(paths_to_cat, new_root, chunk_idx, file_idx, image_keys):
# TODO(rcadene): to save RAM use Dataset.from_parquet(file) and concatenate_datasets
dataframes = [pd.read_parquet(file) for file in paths_to_cat]
# Concatenate all DataFrames along rows
concatenated_df = pd.concat(dataframes, ignore_index=True)
path = new_root / DEFAULT_DATA_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
if len(image_keys) > 0:
schema = pa.Schema.from_pandas(concatenated_df)
features = Features.from_arrow_schema(schema)
for key in image_keys:
features[key] = Image()
schema = features.arrow_schema
else:
schema = None
concatenated_df.to_parquet(path, index=False, schema=schema)
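The point of carrying `Image()` through the Arrow schema is that the concatenated parquet stays decodable by `datasets`; a hedged round-trip sketch (column name illustrative, `path` as written above):

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files=str(path), split="train")
img = ds[0]["observation.images.cam_high"]  # decodes back to a PIL image
```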
def convert_data(root, new_root):
data_dir = root / "data"
ep_paths = sorted(data_dir.glob("*/*.parquet"))
image_keys = get_image_keys(root)
ep_idx = 0
chunk_idx = 0
file_idx = 0
size_in_mb = 0
num_frames = 0
paths_to_cat = []
episodes_metadata = []
for ep_path in ep_paths:
ep_size_in_mb = get_parquet_file_size_in_mb(ep_path)
ep_num_frames = get_parquet_num_frames(ep_path)
ep_metadata = {
"episode_index": ep_idx,
"data/chunk_index": chunk_idx,
"data/file_index": file_idx,
"dataset_from_index": num_frames,
"dataset_to_index": num_frames + ep_num_frames,
}
size_in_mb += ep_size_in_mb
num_frames += ep_num_frames
episodes_metadata.append(ep_metadata)
ep_idx += 1
if size_in_mb < DEFAULT_DATA_FILE_SIZE_IN_MB:
paths_to_cat.append(ep_path)
continue
concat_data_files(paths_to_cat, new_root, chunk_idx, file_idx, image_keys)
# Reset for the next file
size_in_mb = ep_size_in_mb
num_frames = ep_num_frames
paths_to_cat = [ep_path]
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, DEFAULT_CHUNK_SIZE)
# Write remaining data if any
if paths_to_cat:
concat_data_files(paths_to_cat, new_root, chunk_idx, file_idx, image_keys)
return episodes_metadata
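The chunk/file rollover used above can be sketched as follows, assuming `update_chunk_file_indices` simply advances the file index and opens a new chunk once the current one holds `chunks_size` files (an assumption, not the library's verbatim code):

```python
def update_chunk_file_indices(chunk_idx: int, file_idx: int, chunks_size: int) -> tuple[int, int]:
    # Assumed behavior: next file, rolling over to a new chunk when full.
    file_idx += 1
    if file_idx >= chunks_size:
        chunk_idx += 1
        file_idx = 0
    return chunk_idx, file_idx
```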
def get_video_keys(root):
info = load_info(root)
features = info["features"]
video_keys = [key for key, ft in features.items() if ft["dtype"] == "video"]
return video_keys
def get_image_keys(root):
info = load_info(root)
features = info["features"]
image_keys = [key for key, ft in features.items() if ft["dtype"] == "image"]
return image_keys
def convert_videos(root: Path, new_root: Path):
video_keys = get_video_keys(root)
if len(video_keys) == 0:
return None
video_keys = sorted(video_keys)
eps_metadata_per_cam = []
for camera in video_keys:
eps_metadata = convert_videos_of_camera(root, new_root, camera)
eps_metadata_per_cam.append(eps_metadata)
num_eps_per_cam = [len(eps_cam_map) for eps_cam_map in eps_metadata_per_cam]
if len(set(num_eps_per_cam)) != 1:
raise ValueError(f"All cams dont have same number of episodes ({num_eps_per_cam}).")
episodes_metadata = []
num_cameras = len(video_keys)
num_episodes = num_eps_per_cam[0]
for ep_idx in range(num_episodes):
# Sanity check
ep_ids = [eps_metadata_per_cam[cam_idx][ep_idx]["episode_index"] for cam_idx in range(num_cameras)]
ep_ids += [ep_idx]
if len(set(ep_ids)) != 1:
raise ValueError(f"All episode indices need to match ({ep_ids}).")
ep_dict = {}
for cam_idx in range(num_cameras):
ep_dict.update(eps_metadata_per_cam[cam_idx][ep_idx])
episodes_metadata.append(ep_dict)
return episodes_metadata
def convert_videos_of_camera(root: Path, new_root: Path, video_key):
# Access old paths to mp4
videos_dir = root / "videos"
ep_paths = sorted(videos_dir.glob(f"*/{video_key}/*.mp4"))
ep_idx = 0
chunk_idx = 0
file_idx = 0
size_in_mb = 0
duration_in_s = 0.0
paths_to_cat = []
episodes_metadata = []
for ep_path in tqdm.tqdm(ep_paths, desc=f"convert videos of {video_key}"):
ep_size_in_mb = get_video_size_in_mb(ep_path)
ep_duration_in_s = get_video_duration_in_s(ep_path)
# Check if adding this episode would exceed the limit
if size_in_mb + ep_size_in_mb >= DEFAULT_VIDEO_FILE_SIZE_IN_MB and len(paths_to_cat) > 0:
# Size limit would be exceeded, save current accumulation WITHOUT this episode
concat_video_files(paths_to_cat, new_root, video_key, chunk_idx, file_idx)
# Update episodes metadata for the file we just saved
for i, _ in enumerate(paths_to_cat):
past_ep_idx = ep_idx - len(paths_to_cat) + i
episodes_metadata[past_ep_idx][f"videos/{video_key}/chunk_index"] = chunk_idx
episodes_metadata[past_ep_idx][f"videos/{video_key}/file_index"] = file_idx
# Move to next file and start fresh with current episode
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, DEFAULT_CHUNK_SIZE)
size_in_mb = 0
duration_in_s = 0.0
paths_to_cat = []
# Add current episode metadata
ep_metadata = {
"episode_index": ep_idx,
f"videos/{video_key}/chunk_index": chunk_idx, # Will be updated when file is saved
f"videos/{video_key}/file_index": file_idx, # Will be updated when file is saved
f"videos/{video_key}/from_timestamp": duration_in_s,
f"videos/{video_key}/to_timestamp": duration_in_s + ep_duration_in_s,
}
episodes_metadata.append(ep_metadata)
# Add current episode to accumulation
paths_to_cat.append(ep_path)
size_in_mb += ep_size_in_mb
duration_in_s += ep_duration_in_s
ep_idx += 1
# Write remaining videos if any
if paths_to_cat:
concat_video_files(paths_to_cat, new_root, video_key, chunk_idx, file_idx)
# Update episodes metadata for the final file
for i, _ in enumerate(paths_to_cat):
past_ep_idx = ep_idx - len(paths_to_cat) + i
episodes_metadata[past_ep_idx][f"videos/{video_key}/chunk_index"] = chunk_idx
episodes_metadata[past_ep_idx][f"videos/{video_key}/file_index"] = file_idx
return episodes_metadata
def generate_episode_metadata_dict(
episodes_legacy_metadata, episodes_metadata, episodes_stats, episodes_videos=None
):
num_episodes = len(episodes_metadata)
episodes_legacy_metadata_vals = list(episodes_legacy_metadata.values())
episodes_stats_vals = list(episodes_stats.values())
episodes_stats_keys = list(episodes_stats.keys())
for i in range(num_episodes):
ep_legacy_metadata = episodes_legacy_metadata_vals[i]
ep_metadata = episodes_metadata[i]
ep_stats = episodes_stats_vals[i]
ep_ids_set = {
ep_legacy_metadata["episode_index"],
ep_metadata["episode_index"],
episodes_stats_keys[i],
}
if episodes_videos is None:
ep_video = {}
else:
ep_video = episodes_videos[i]
ep_ids_set.add(ep_video["episode_index"])
if len(ep_ids_set) != 1:
raise ValueError(f"Number of episodes is not the same ({ep_ids_set}).")
ep_dict = {**ep_metadata, **ep_video, **ep_legacy_metadata, **flatten_dict({"stats": ep_stats})}
ep_dict["meta/episodes/chunk_index"] = 0
ep_dict["meta/episodes/file_index"] = 0
yield ep_dict
def convert_episodes_metadata(root, new_root, episodes_metadata, episodes_video_metadata=None):
episodes_legacy_metadata = legacy_load_episodes(root)
episodes_stats = legacy_load_episodes_stats(root)
num_eps_set = {len(episodes_legacy_metadata), len(episodes_metadata)}
if episodes_video_metadata is not None:
num_eps_set.add(len(episodes_video_metadata))
if len(num_eps_set) != 1:
raise ValueError(f"Number of episodes is not the same ({num_eps_set}).")
ds_episodes = Dataset.from_generator(
lambda: generate_episode_metadata_dict(
episodes_legacy_metadata, episodes_metadata, episodes_stats, episodes_video_metadata
)
)
write_episodes(ds_episodes, new_root)
stats = aggregate_stats(list(episodes_stats.values()))
write_stats(stats, new_root)
def convert_info(root, new_root):
info = load_info(root)
info["codebase_version"] = "v3.0"
del info["total_chunks"]
del info["total_videos"]
info["data_files_size_in_mb"] = DEFAULT_DATA_FILE_SIZE_IN_MB
info["video_files_size_in_mb"] = DEFAULT_VIDEO_FILE_SIZE_IN_MB
info["data_path"] = DEFAULT_DATA_PATH
info["video_path"] = DEFAULT_VIDEO_PATH
info["fps"] = float(info["fps"])
for key in info["features"]:
if info["features"][key]["dtype"] == "video":
# already has fps in video_info
continue
info["features"][key]["fps"] = info["fps"]
write_info(info, new_root)
def convert_dataset(
repo_id: str,
branch: str | None = None,
num_workers: int = 4,
):
root = HF_LEROBOT_HOME / repo_id
old_root = HF_LEROBOT_HOME / f"{repo_id}_old"
new_root = HF_LEROBOT_HOME / f"{repo_id}_v30"
if old_root.is_dir() and root.is_dir():
shutil.rmtree(str(root))
shutil.move(str(old_root), str(root))
if new_root.is_dir():
shutil.rmtree(new_root)
snapshot_download(
repo_id,
repo_type="dataset",
revision=V21,
local_dir=root,
)
convert_info(root, new_root)
convert_tasks(root, new_root)
episodes_metadata = convert_data(root, new_root)
episodes_videos_metadata = convert_videos(root, new_root)
convert_episodes_metadata(root, new_root, episodes_metadata, episodes_videos_metadata)
shutil.move(str(root), str(old_root))
shutil.move(str(new_root), str(root))
hub_api = HfApi()
try:
hub_api.delete_tag(repo_id, tag=CODEBASE_VERSION, repo_type="dataset")
except HTTPError as e:
print(f"tag={CODEBASE_VERSION} probably doesn't exist. Skipping exception ({e})")
pass
hub_api.delete_files(
delete_patterns=["data/chunk*/episode_*", "meta/*.jsonl", "videos/chunk*"],
repo_id=repo_id,
revision=branch,
repo_type="dataset",
)
hub_api.create_tag(repo_id, tag=CODEBASE_VERSION, revision=branch, repo_type="dataset")
LeRobotDataset(repo_id).push_to_hub()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--repo-id",
type=str,
required=True,
help="Repository identifier on Hugging Face: a community or a user name `/` the name of the dataset "
"(e.g. `lerobot/pusht`, `cadene/aloha_sim_insertion_human`).",
)
parser.add_argument(
"--branch",
type=str,
default=None,
help="Repo branch to push your dataset. Defaults to the main branch.",
)
parser.add_argument(
"--num-workers",
type=int,
default=4,
help="Number of workers for parallelizing stats compute. Defaults to 4.",
)
args = parser.parse_args()
convert_dataset(**vars(args))
+176 -149
@@ -13,26 +13,22 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import glob
import importlib
import json
import logging
import shutil
import subprocess
import tempfile
import warnings
from collections import OrderedDict
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, ClassVar
import av
import pyarrow as pa
import torch
import torchvision
from datasets.features.features import register_feature
from PIL import Image
from lerobot.datasets.utils import DEFAULT_VIDEO_PATH
def get_safe_default_codec():
if importlib.util.find_spec("torchcodec"):
@@ -106,7 +102,7 @@ def decode_video_frames_torchvision(
keyframes_only = False
torchvision.set_video_backend(backend)
if backend == "pyav":
keyframes_only = True # pyav doesnt support accuracte seek
keyframes_only = True # pyav doesn't support accurate seek
# set a video stream reader
# TODO(rcadene): also load audio stream at the same time
@@ -159,7 +155,6 @@ def decode_video_frames_torchvision(
)
# get closest frames to the query timestamps
# TODO(rcadene): remove torch.stack
closest_frames = torch.stack([loaded_frames[idx] for idx in argmin_])
closest_ts = loaded_ts[argmin_]
@@ -257,104 +252,83 @@ def encode_video_frames(
g: int | None = 2,
crf: int | None = 30,
fast_decode: int = 0,
log_level: str | None = "quiet",
log_level: int | None = av.logging.ERROR,
overwrite: bool = False,
) -> None:
"""More info on ffmpeg arguments tuning on `benchmark/video/README.md`"""
# Check encoder availability
if vcodec not in ["h264", "hevc", "libsvtav1"]:
raise ValueError(f"Unsupported video codec: {vcodec}. Supported codecs are: h264, hevc, libsvtav1.")
video_path = Path(video_path)
imgs_dir = Path(imgs_dir)
video_path.parent.mkdir(parents=True, exist_ok=True)
ffmpeg_args = OrderedDict(
[
("-f", "image2"),
("-r", str(fps)),
("-i", str(imgs_dir / "frame-%06d.png")),
("-vcodec", vcodec),
("-pix_fmt", pix_fmt),
]
video_path.parent.mkdir(parents=True, exist_ok=overwrite)
# Encoders/pixel formats incompatibility check
if (vcodec == "libsvtav1" or vcodec == "hevc") and pix_fmt == "yuv444p":
logging.warning(
f"Incompatible pixel format 'yuv444p' for codec {vcodec}, auto-selecting format 'yuv420p'"
)
pix_fmt = "yuv420p"
# Get input frames
template = "frame_" + ("[0-9]" * 6) + ".png"
input_list = sorted(
glob.glob(str(imgs_dir / template)), key=lambda x: int(x.split("_")[-1].split(".")[0])
)
# Define video output frame size (assuming all input frames are the same size)
if len(input_list) == 0:
raise FileNotFoundError(f"No images found in {imgs_dir}.")
dummy_image = Image.open(input_list[0])
width, height = dummy_image.size
# Define video codec options
video_options = {}
if g is not None:
ffmpeg_args["-g"] = str(g)
video_options["g"] = str(g)
if crf is not None:
ffmpeg_args["-crf"] = str(crf)
video_options["crf"] = str(crf)
if fast_decode:
key = "-svtav1-params" if vcodec == "libsvtav1" else "-tune"
key = "svtav1-params" if vcodec == "libsvtav1" else "tune"
value = f"fast-decode={fast_decode}" if vcodec == "libsvtav1" else "fastdecode"
ffmpeg_args[key] = value
video_options[key] = value
# Set logging level
if log_level is not None:
ffmpeg_args["-loglevel"] = str(log_level)
# "While less efficient, it is generally preferable to modify logging with Pythons logging"
logging.getLogger("libav").setLevel(log_level)
ffmpeg_args = [item for pair in ffmpeg_args.items() for item in pair]
if overwrite:
ffmpeg_args.append("-y")
# Create and open output file (overwrite by default)
with av.open(str(video_path), "w") as output:
output_stream = output.add_stream(vcodec, fps, options=video_options)
output_stream.pix_fmt = pix_fmt
output_stream.width = width
output_stream.height = height
ffmpeg_cmd = ["ffmpeg"] + ffmpeg_args + [str(video_path)]
# redirect stdin to subprocess.DEVNULL to prevent reading random keyboard inputs from terminal
subprocess.run(ffmpeg_cmd, check=True, stdin=subprocess.DEVNULL)
# Loop through input frames and encode them
for input_data in input_list:
input_image = Image.open(input_data).convert("RGB")
input_frame = av.VideoFrame.from_image(input_image)
packet = output_stream.encode(input_frame)
if packet:
output.mux(packet)
# Flush the encoder
packet = output_stream.encode()
if packet:
output.mux(packet)
# Reset logging level
if log_level is not None:
av.logging.restore_default_callback()
if not video_path.exists():
raise OSError(
f"Video encoding did not work. File not found: {video_path}. "
f"Try running the command manually to debug: `{''.join(ffmpeg_cmd)}`"
)
def concat_video_files(paths_to_cat: list[Path], root: Path, video_key: str, chunk_idx: int, file_idx: int):
"""
Concatenate multiple video files into a single video file using ffmpeg.
This function takes a list of video file paths and concatenates them into a single
output video file. It uses ffmpeg's concat demuxer with stream copy mode for fast
concatenation without re-encoding.
Args:
paths_to_cat: List of video file paths to concatenate, in order.
root: Root directory where temporary files and output will be created.
video_key: Video key identifier (e.g., camera name) used in output path.
chunk_idx: Chunk index for organizing output files.
file_idx: File index within the chunk.
Note:
- Creates a temporary directory for intermediate files that is cleaned up after use.
- Uses ffmpeg's concat demuxer which requires all input videos to have the same
codec, resolution, and frame rate for proper concatenation.
- Output path follows the DEFAULT_VIDEO_PATH pattern with video_key, chunk_idx,
and file_idx parameters.
"""
tmp_dir = Path(tempfile.mkdtemp(dir=root))
path_concat_video_files = tmp_dir / "concat_video_files.txt"
with open(path_concat_video_files, "w") as f:
for ep_path in paths_to_cat:
f.write(f"file '{str(ep_path)}'\n")
path_tmp_output = tmp_dir / "tmp_output.mp4"
command = [
"ffmpeg",
"-y",
"-f",
"concat",
"-safe",
"0",
"-i",
str(path_concat_video_files),
"-c",
"copy",
str(path_tmp_output),
]
subprocess.run(command, check=True)
output_path = root / DEFAULT_VIDEO_PATH.format(
video_key=video_key, chunk_index=chunk_idx, file_index=file_idx
)
output_path.parent.mkdir(parents=True, exist_ok=True)
shutil.move(str(path_tmp_output), str(output_path))
shutil.rmtree(str(tmp_dir))
raise OSError(f"Video encoding did not work. File not found: {video_path}.")
@dataclass
@@ -390,78 +364,68 @@ with warnings.catch_warnings():
def get_audio_info(video_path: Path | str) -> dict:
ffprobe_audio_cmd = [
"ffprobe",
"-v",
"error",
"-select_streams",
"a:0",
"-show_entries",
"stream=channels,codec_name,bit_rate,sample_rate,bit_depth,channel_layout,duration",
"-of",
"json",
str(video_path),
]
result = subprocess.run(ffprobe_audio_cmd, capture_output=True, text=True)
if result.returncode != 0:
raise RuntimeError(f"Error running ffprobe: {result.stderr}")
# Set logging level
logging.getLogger("libav").setLevel(av.logging.ERROR)
info = json.loads(result.stdout)
audio_stream_info = info["streams"][0] if info.get("streams") else None
if audio_stream_info is None:
return {"has_audio": False}
# Getting audio stream information
audio_info = {}
with av.open(str(video_path), "r") as audio_file:
try:
audio_stream = audio_file.streams.audio[0]
except IndexError:
# Reset logging level
av.logging.restore_default_callback()
return {"has_audio": False}
# Return the information, defaulting to None if no audio stream is present
return {
"has_audio": True,
"audio.channels": audio_stream_info.get("channels", None),
"audio.codec": audio_stream_info.get("codec_name", None),
"audio.bit_rate": int(audio_stream_info["bit_rate"]) if audio_stream_info.get("bit_rate") else None,
"audio.sample_rate": int(audio_stream_info["sample_rate"])
if audio_stream_info.get("sample_rate")
else None,
"audio.bit_depth": audio_stream_info.get("bit_depth", None),
"audio.channel_layout": audio_stream_info.get("channel_layout", None),
}
audio_info["audio.channels"] = audio_stream.channels
audio_info["audio.codec"] = audio_stream.codec.canonical_name
# In an ideal lossless case: bit depth x sample rate x channels = bit rate.
# In an actual compressed case, the bit rate is set according to the compression level: the lower the bit rate, the more compression is applied.
audio_info["audio.bit_rate"] = audio_stream.bit_rate
audio_info["audio.sample_rate"] = audio_stream.sample_rate # Number of samples per second
# In an ideal lossless case: fixed number of bits per sample.
# In an actual compressed case: variable number of bits per sample (often reduced to match a given depth rate).
audio_info["audio.bit_depth"] = audio_stream.format.bits
audio_info["audio.channel_layout"] = audio_stream.layout.name
audio_info["has_audio"] = True
# Reset logging level
av.logging.restore_default_callback()
return audio_info
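A usage sketch (path illustrative):

```python
from lerobot.datasets.video_utils import get_audio_info

audio = get_audio_info("videos/chunk-000/file_000.mp4")
if audio["has_audio"]:
    print(audio["audio.codec"], audio["audio.sample_rate"], audio["audio.channels"])
```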
def get_video_info(video_path: Path | str) -> dict:
ffprobe_video_cmd = [
"ffprobe",
"-v",
"error",
"-select_streams",
"v:0",
"-show_entries",
"stream=r_frame_rate,width,height,codec_name,nb_frames,duration,pix_fmt",
"-of",
"json",
str(video_path),
]
result = subprocess.run(ffprobe_video_cmd, capture_output=True, text=True)
if result.returncode != 0:
raise RuntimeError(f"Error running ffprobe: {result.stderr}")
# Set logging level
logging.getLogger("libav").setLevel(av.logging.ERROR)
info = json.loads(result.stdout)
video_stream_info = info["streams"][0]
# Getting video stream information
video_info = {}
with av.open(str(video_path), "r") as video_file:
try:
video_stream = video_file.streams.video[0]
except IndexError:
# Reset logging level
av.logging.restore_default_callback()
return {}
# Calculate fps from r_frame_rate
r_frame_rate = video_stream_info["r_frame_rate"]
num, denom = map(int, r_frame_rate.split("/"))
fps = num / denom
video_info["video.height"] = video_stream.height
video_info["video.width"] = video_stream.width
video_info["video.codec"] = video_stream.codec.canonical_name
video_info["video.pix_fmt"] = video_stream.pix_fmt
video_info["video.is_depth_map"] = False
pixel_channels = get_video_pixel_channels(video_stream_info["pix_fmt"])
# Calculate fps from r_frame_rate
video_info["video.fps"] = int(video_stream.base_rate)
video_info = {
"video.fps": fps,
"video.height": video_stream_info["height"],
"video.width": video_stream_info["width"],
"video.channels": pixel_channels,
"video.codec": video_stream_info["codec_name"],
"video.pix_fmt": video_stream_info["pix_fmt"],
"video.is_depth_map": False,
**get_audio_info(video_path),
}
pixel_channels = get_video_pixel_channels(video_stream.pix_fmt)
video_info["video.channels"] = pixel_channels
# Reset logging level
av.logging.restore_default_callback()
# Adding audio stream information
video_info.update(**get_audio_info(video_path))
return video_info
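And for the video side (path illustrative; keys as assembled above):

```python
from lerobot.datasets.video_utils import get_video_info

info = get_video_info("videos/chunk-000/file_000.mp4")
# "video.fps", "video.height", "video.width", "video.codec", "video.pix_fmt",
# "video.is_depth_map", "video.channels", plus the merged audio keys.
print(info["video.fps"], info["video.width"], info["video.height"])
```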
@@ -488,3 +452,66 @@ def get_image_pixel_channels(image: Image):
return 4 # RGBA
else:
raise ValueError("Unknown format")
class VideoEncodingManager:
"""
Context manager that ensures proper video encoding and data cleanup even if exceptions occur.
This manager handles:
- Batch encoding any remaining episodes when recording is interrupted
- Cleaning up temporary image files from interrupted episodes
- Removing empty image directories
Args:
dataset: The LeRobotDataset instance
"""
def __init__(self, dataset):
self.dataset = dataset
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
# Handle any remaining episodes that haven't been batch encoded
if self.dataset.episodes_since_last_encoding > 0:
if exc_type is not None:
logging.info("Exception occurred. Encoding remaining episodes before exit...")
else:
logging.info("Recording stopped. Encoding remaining episodes...")
start_ep = self.dataset.num_episodes - self.dataset.episodes_since_last_encoding
end_ep = self.dataset.num_episodes
logging.info(
f"Encoding remaining {self.dataset.episodes_since_last_encoding} episodes, "
f"from episode {start_ep} to {end_ep - 1}"
)
self.dataset.batch_encode_videos(start_ep, end_ep)
# Clean up episode images if recording was interrupted
if exc_type is not None:
interrupted_episode_index = self.dataset.num_episodes
for key in self.dataset.meta.video_keys:
img_dir = self.dataset._get_image_file_path(
episode_index=interrupted_episode_index, image_key=key, frame_index=0
).parent
if img_dir.exists():
logging.debug(
f"Cleaning up interrupted episode images for episode {interrupted_episode_index}, camera {key}"
)
shutil.rmtree(img_dir)
# Clean up any remaining images directory if it's empty
img_dir = self.dataset.root / "images"
# Check for any remaining PNG files
png_files = list(img_dir.rglob("*.png"))
if len(png_files) == 0:
# Only remove the images directory if no PNG files remain
if img_dir.exists():
shutil.rmtree(img_dir)
logging.debug("Cleaned up empty images directory")
else:
logging.debug(f"Images directory is not empty, containing {len(png_files)} PNG files")
return False # Don't suppress the original exception
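Typical usage wraps a recording loop, mirroring how the record script uses it further down (`dataset` and `num_episodes` are illustrative):

```python
from lerobot.datasets.video_utils import VideoEncodingManager

with VideoEncodingManager(dataset):
    for _ in range(num_episodes):
        ...  # dataset.add_frame(...) per step, then dataset.save_episode()
# On exit, clean or via exception, any episodes not yet batch-encoded are
# encoded and leftover per-episode PNG frames are cleaned up.
```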
+2
@@ -107,6 +107,8 @@ X_SERIES_ENCODINGS_TABLE = {
"Goal_PWM": X_SERIES_CONTROL_TABLE["Goal_PWM"][1],
"Goal_Current": X_SERIES_CONTROL_TABLE["Goal_Current"][1],
"Goal_Velocity": X_SERIES_CONTROL_TABLE["Goal_Velocity"][1],
"Goal_Position": X_SERIES_CONTROL_TABLE["Goal_Position"][1],
"Present_Position": X_SERIES_CONTROL_TABLE["Present_Position"][1],
"Present_PWM": X_SERIES_CONTROL_TABLE["Present_PWM"][1],
"Present_Current": X_SERIES_CONTROL_TABLE["Present_Current"][1],
"Present_Velocity": X_SERIES_CONTROL_TABLE["Present_Velocity"][1],
+36 -32
@@ -73,6 +73,7 @@ from lerobot.configs.policies import PreTrainedConfig
from lerobot.datasets.image_writer import safe_stop_image_writer
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.utils import build_dataset_frame, hw_to_dataset_features
from lerobot.datasets.video_utils import VideoEncodingManager
from lerobot.policies.factory import make_policy
from lerobot.policies.pretrained import PreTrainedPolicy
from lerobot.robots import ( # noqa: F401
@@ -271,8 +272,8 @@ def record_loop(
if dataset is not None:
action_frame = build_dataset_frame(dataset.features, sent_action, prefix="action")
frame = {**observation_frame, **action_frame, "task": single_task}
dataset.add_frame(frame)
frame = {**observation_frame, **action_frame}
dataset.add_frame(frame, task=single_task)
if display_data:
log_rerun_data(observation, action)
@@ -301,6 +302,7 @@ def record(cfg: RecordConfig) -> LeRobotDataset:
dataset = LeRobotDataset(
cfg.dataset.repo_id,
root=cfg.dataset.root,
batch_encoding_size=cfg.dataset.video_encoding_batch_size,
)
if hasattr(robot, "cameras") and len(robot.cameras) > 0:
@@ -321,6 +323,7 @@ def record(cfg: RecordConfig) -> LeRobotDataset:
use_videos=cfg.dataset.video,
image_writer_processes=cfg.dataset.num_image_writer_processes,
image_writer_threads=cfg.dataset.num_image_writer_threads_per_camera * len(robot.cameras),
batch_encoding_size=cfg.dataset.video_encoding_batch_size,
)
# Load pretrained policy
@@ -332,46 +335,47 @@ def record(cfg: RecordConfig) -> LeRobotDataset:
listener, events = init_keyboard_listener()
recorded_episodes = 0
while recorded_episodes < cfg.dataset.num_episodes and not events["stop_recording"]:
log_say(f"Recording episode {dataset.num_episodes}", cfg.play_sounds)
record_loop(
robot=robot,
events=events,
fps=cfg.dataset.fps,
teleop=teleop,
policy=policy,
dataset=dataset,
control_time_s=cfg.dataset.episode_time_s,
single_task=cfg.dataset.single_task,
display_data=cfg.display_data,
)
# Execute a few seconds without recording to give time to manually reset the environment
# Skip reset for the last episode to be recorded
if not events["stop_recording"] and (
(recorded_episodes < cfg.dataset.num_episodes - 1) or events["rerecord_episode"]
):
log_say("Reset the environment", cfg.play_sounds)
with VideoEncodingManager(dataset):
recorded_episodes = 0
while recorded_episodes < cfg.dataset.num_episodes and not events["stop_recording"]:
log_say(f"Recording episode {dataset.num_episodes}", cfg.play_sounds)
record_loop(
robot=robot,
events=events,
fps=cfg.dataset.fps,
teleop=teleop,
control_time_s=cfg.dataset.reset_time_s,
policy=policy,
dataset=dataset,
control_time_s=cfg.dataset.episode_time_s,
single_task=cfg.dataset.single_task,
display_data=cfg.display_data,
)
if events["rerecord_episode"]:
log_say("Re-record episode", cfg.play_sounds)
events["rerecord_episode"] = False
events["exit_early"] = False
dataset.clear_episode_buffer()
continue
# Execute a few seconds without recording to give time to manually reset the environment
# Skip reset for the last episode to be recorded
if not events["stop_recording"] and (
(recorded_episodes < cfg.dataset.num_episodes - 1) or events["rerecord_episode"]
):
log_say("Reset the environment", cfg.play_sounds)
record_loop(
robot=robot,
events=events,
fps=cfg.dataset.fps,
teleop=teleop,
control_time_s=cfg.dataset.reset_time_s,
single_task=cfg.dataset.single_task,
display_data=cfg.display_data,
)
dataset.save_episode()
recorded_episodes += 1
if events["rerecord_episode"]:
log_say("Re-record episode", cfg.play_sounds)
events["rerecord_episode"] = False
events["exit_early"] = False
dataset.clear_episode_buffer()
continue
dataset.save_episode()
recorded_episodes += 1
log_say("Stop recording", cfg.play_sounds, blocking=True)
@@ -161,6 +161,11 @@ class SO100Follower(Robot):
self.bus.write("I_Coefficient", motor, 0)
self.bus.write("D_Coefficient", motor, 32)
if motor == "gripper":
self.bus.write("Max_Torque_Limit", motor, 500) # 50% of max torque to avoid burnout
self.bus.write("Protection_Current", motor, 250) # 50% of max current to avoid burnout
self.bus.write("Overload_Torque", motor, 25) # 25% torque when overloaded
def setup_motors(self) -> None:
for motor in reversed(self.bus.motors):
input(f"Connect the controller board to the '{motor}' motor only and press enter.")
@@ -157,6 +157,13 @@ class SO101Follower(Robot):
self.bus.write("I_Coefficient", motor, 0)
self.bus.write("D_Coefficient", motor, 32)
if motor == "gripper":
self.bus.write(
"Max_Torque_Limit", motor, 500
) # 50% of the max torque limit to avoid burnout
self.bus.write("Protection_Current", motor, 250) # 50% of max current to avoid burnout
self.bus.write("Overload_Torque", motor, 25) # 25% torque when overloaded
def setup_motors(self) -> None:
for motor in reversed(self.bus.motors):
input(f"Connect the controller board to the '{motor}' motor only and press enter.")
+1 -2
@@ -226,8 +226,7 @@ def convert_lerobot_dataset_to_cropper_lerobot_dataset(
value = value.unsqueeze(0)
new_frame[key] = value
new_frame["task"] = task
new_dataset.add_frame(new_frame)
new_dataset.add_frame(new_frame, task=task)
if frame["episode_index"].item() != prev_episode_index:
# Save the episode
+1 -2
@@ -2129,8 +2129,7 @@ def record_dataset(env, policy, cfg):
frame["complementary_info.discrete_penalty"] = torch.tensor(
[info.get("discrete_penalty", 0.0)], dtype=torch.float32
)
frame["task"] = cfg.task
dataset.add_frame(frame)
dataset.add_frame(frame, task=cfg.task)
# Maintain consistent timing
if cfg.fps:
@@ -302,11 +302,6 @@ class RobotClient:
self.logger.debug(f"Current latest action: {latest_action}")
# Get queue state before changes
old_size, old_timesteps = self._inspect_action_queue()
if not old_timesteps:
old_timesteps = [latest_action] # queue was empty
# Get queue state before changes
old_size, old_timesteps = self._inspect_action_queue()
if not old_timesteps:
+1 -2
@@ -166,8 +166,7 @@ def train(cfg: TrainPipelineConfig):
if hasattr(cfg.policy, "drop_n_last_frames"):
shuffle = False
sampler = EpisodeAwareSampler(
dataset.meta.episodes["dataset_from_index"],
dataset.meta.episodes["dataset_to_index"],
dataset.episode_data_index,
drop_n_last_frames=cfg.policy.drop_n_last_frames,
shuffle=True,
)
+3 -3
@@ -79,8 +79,8 @@ from lerobot.datasets.lerobot_dataset import LeRobotDataset
class EpisodeSampler(torch.utils.data.Sampler):
def __init__(self, dataset: LeRobotDataset, episode_index: int):
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]
from_idx = dataset.episode_data_index["from"][episode_index].item()
to_idx = dataset.episode_data_index["to"][episode_index].item()
self.frame_ids = range(from_idx, to_idx)
def __iter__(self) -> Iterator:
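A hedged sketch of plugging this sampler into a `DataLoader` to iterate a single episode in order:

```python
import torch

episode_sampler = EpisodeSampler(dataset, episode_index=0)  # dataset: LeRobotDataset
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=32,
    sampler=episode_sampler,  # yields only the frame ids of episode 0
)
for batch in dataloader:
    ...  # batches cover episode 0's frames, in order
```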
@@ -283,7 +283,7 @@ def main():
tolerance_s = kwargs.pop("tolerance_s")
logging.info("Loading dataset")
dataset = LeRobotDataset(repo_id, episodes=[args.episode_index], root=root, tolerance_s=tolerance_s)
dataset = LeRobotDataset(repo_id, root=root, tolerance_s=tolerance_s)
visualize_dataset(dataset, **vars(args))
+14 -49
@@ -152,17 +152,13 @@ def run_server(
dataset_version = (
str(dataset.meta._version) if isinstance(dataset, LeRobotDataset) else dataset.codebase_version
)
# Check minimum version requirement
match = re.search(r"v(\d+)\.", dataset_version)
if match:
major_version = int(match.group(1))
if major_version < 2:
return "Make sure to convert your LeRobotDataset to v2 & above."
# Get episode data once
episode_data_csv_str, columns, ignored_columns = get_episode_data(dataset, episode_id)
dataset_info = {
"repo_id": f"{dataset_namespace}/{dataset_name}",
"num_samples": dataset.num_frames
@@ -173,47 +169,19 @@ def run_server(
else dataset.total_episodes,
"fps": dataset.fps,
}
if isinstance(dataset, LeRobotDataset):
# Handle local datasets
# Determine if this is a chunked video dataset (v3.0+)
is_v3_or_later = False
match = re.search(r"v(\d+)\.(\d+)", dataset_version)
if match:
major_version = int(match.group(1))
is_v3_or_later = major_version >= 3
# Create videos_info with unified structure
videos_info = []
for key in dataset.meta.video_keys:
video_path = dataset.meta.get_video_file_path(episode_id, key)
if is_v3_or_later:
# For v3.0+ datasets, get episode timestamps from chunked videos
episode = dataset.meta.episodes[episode_id]
from_timestamp = episode.get(f"videos/{key}/from_timestamp", 0)
to_timestamp = episode.get(f"videos/{key}/to_timestamp", None)
filename = key
else:
# For v2.1 and earlier, videos are already per-episode
from_timestamp = None
to_timestamp = None
filename = video_path.parent.name
videos_info.append(
{
"url": url_for("static", filename=str(video_path).replace("\\", "/")),
"filename": filename,
"start_time": from_timestamp,
"end_time": to_timestamp,
"is_chunked": is_v3_or_later,
}
)
video_paths = [
dataset.meta.get_video_file_path(episode_id, key) for key in dataset.meta.video_keys
]
videos_info = [
{
"url": url_for("static", filename=str(video_path).replace("\\", "/")),
"filename": video_path.parent.name,
}
for video_path in video_paths
]
tasks = dataset.meta.episodes[episode_id]["tasks"]
else:
# Handle remote datasets from HF Hub
video_keys = [key for key, ft in dataset.features.items() if ft["dtype"] == "video"]
videos_info = [
{
@@ -224,9 +192,6 @@ def run_server(
episode_index=episode_id,
),
"filename": video_key,
"start_time": None,
"end_time": None,
"is_chunked": False,
}
for video_key in video_keys
]
@@ -306,8 +271,8 @@ def get_episode_data(dataset: LeRobotDataset | IterableNamespace, episode_index)
selected_columns.insert(0, "timestamp")
if isinstance(dataset, LeRobotDataset):
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]
from_idx = dataset.episode_data_index["from"][episode_index]
to_idx = dataset.episode_data_index["to"][episode_index]
data = (
dataset.hf_dataset.select(range(from_idx, to_idx))
.select_columns(selected_columns)
@@ -343,7 +308,7 @@ def get_episode_data(dataset: LeRobotDataset | IterableNamespace, episode_index)
def get_episode_video_paths(dataset: LeRobotDataset, ep_index: int) -> list[str]:
# get first frame of episode (hack to get video_path of the episode)
first_frame_idx = dataset.meta.episodes["dataset_from_index"][ep_index]
first_frame_idx = dataset.episode_data_index["from"][ep_index].item()
return [
dataset.hf_dataset.select_columns(key)[first_frame_idx][key]["path"]
for key in dataset.meta.video_keys
@@ -356,7 +321,7 @@ def get_episode_language_instruction(dataset: LeRobotDataset, ep_index: int) ->
return None
# get first frame index
first_frame_idx = dataset.meta.episodes["dataset_from_index"][ep_index]
first_frame_idx = dataset.episode_data_index["from"][ep_index].item()
language_instruction = dataset.hf_dataset[first_frame_idx]["language_instruction"]
# TODO (michel-aractingi) hack to get the sentence, some strings in openx are badly stored
+12 -2
@@ -565,7 +565,10 @@ class ReplayBuffer:
lerobot_dataset.start_image_writer(num_processes=0, num_threads=3)
# Convert transitions into episodes and frames
episode_index = 0
lerobot_dataset.episode_buffer = lerobot_dataset.create_episode_buffer(episode_index=episode_index)
frame_idx_in_episode = 0
for idx in range(self.size):
actual_idx = (self.position - self.size + idx) % self.capacity
@@ -579,7 +582,6 @@ class ReplayBuffer:
frame_dict["action"] = self.actions[actual_idx].cpu()
frame_dict["next.reward"] = torch.tensor([self.rewards[actual_idx]], dtype=torch.float32).cpu()
frame_dict["next.done"] = torch.tensor([self.dones[actual_idx]], dtype=torch.bool).cpu()
frame_dict["task"] = task_name
# Add complementary_info if available
if self.has_complementary_info:
@@ -595,11 +597,19 @@ class ReplayBuffer:
frame_dict[f"complementary_info.{key}"] = val
# Add to the dataset's buffer
lerobot_dataset.add_frame(frame_dict)
lerobot_dataset.add_frame(frame_dict, task=task_name)
# Move to next frame
frame_idx_in_episode += 1
# If we reached an episode boundary, call save_episode, reset counters
if self.dones[actual_idx] or self.truncateds[actual_idx]:
lerobot_dataset.save_episode()
episode_index += 1
frame_idx_in_episode = 0
lerobot_dataset.episode_buffer = lerobot_dataset.create_episode_buffer(
episode_index=episode_index
)
# Save any remaining frames in the buffer
if lerobot_dataset.episode_buffer["size"] > 0:
-10
@@ -274,16 +274,6 @@ def move_cursor_up(lines):
print(f"\033[{lines}A", end="")
def get_elapsed_time_in_days_hours_minutes_seconds(elapsed_time_s: float):
days = int(elapsed_time_s // (24 * 3600))
elapsed_time_s %= 24 * 3600
hours = int(elapsed_time_s // 3600)
elapsed_time_s %= 3600
minutes = int(elapsed_time_s // 60)
seconds = elapsed_time_s % 60
return days, hours, minutes, seconds
class TimerManager:
"""
Lightweight utility to measure elapsed time.
@@ -47,26 +47,38 @@ def save_dataset_to_safetensors(output_dir, repo_id="lerobot/pusht"):
)
# save 2 first frames of first episode
i = dataset.meta.episodes["dataset_from_index"][0].item()
i = dataset.episode_data_index["from"][0].item()
save_file(dataset[i], repo_dir / f"frame_{i}.safetensors")
save_file(dataset[i + 1], repo_dir / f"frame_{i + 1}.safetensors")
# save 2 frames at the middle of first episode
i = int(
(
dataset.meta.episodes["dataset_to_index"][0].item()
- dataset.meta.episodes["dataset_from_index"][0].item()
)
/ 2
)
i = int((dataset.episode_data_index["to"][0].item() - dataset.episode_data_index["from"][0].item()) / 2)
save_file(dataset[i], repo_dir / f"frame_{i}.safetensors")
save_file(dataset[i + 1], repo_dir / f"frame_{i + 1}.safetensors")
# save 2 last frames of first episode
i = dataset.meta.episodes["dataset_to_index"][0].item()
i = dataset.episode_data_index["to"][0].item()
save_file(dataset[i - 2], repo_dir / f"frame_{i - 2}.safetensors")
save_file(dataset[i - 1], repo_dir / f"frame_{i - 1}.safetensors")
# TODO(rcadene): Enable testing on second and last episode
# We currently can't because our test dataset only contains the first episode
# # save 2 first frames of second episode
# i = dataset.episode_data_index["from"][1].item()
# save_file(dataset[i], repo_dir / f"frame_{i}.safetensors")
# save_file(dataset[i + 1], repo_dir / f"frame_{i+1}.safetensors")
# # save 2 last frames of second episode
# i = dataset.episode_data_index["to"][1].item()
# save_file(dataset[i - 2], repo_dir / f"frame_{i-2}.safetensors")
# save_file(dataset[i - 1], repo_dir / f"frame_{i-1}.safetensors")
# # save 2 last frames of last episode
# i = dataset.episode_data_index["to"][-1].item()
# save_file(dataset[i - 2], repo_dir / f"frame_{i-2}.safetensors")
# save_file(dataset[i - 1], repo_dir / f"frame_{i-1}.safetensors")
if __name__ == "__main__":
for dataset in [
-292
@@ -1,292 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from unittest.mock import patch
import torch
from lerobot.datasets.aggregate import aggregate_datasets
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from tests.fixtures.constants import DUMMY_REPO_ID
def assert_episode_and_frame_counts(aggr_ds, expected_episodes, expected_frames):
"""Test that total number of episodes and frames are correctly aggregated."""
assert aggr_ds.num_episodes == expected_episodes, (
f"Expected {expected_episodes} episodes, got {aggr_ds.num_episodes}"
)
assert aggr_ds.num_frames == expected_frames, (
f"Expected {expected_frames} frames, got {aggr_ds.num_frames}"
)
def assert_dataset_content_integrity(aggr_ds, ds_0, ds_1):
"""Test that the content of both datasets is preserved correctly in the aggregated dataset."""
keys_to_ignore = ["episode_index", "index", "timestamp"]
# Test first part of dataset corresponds to ds_0, check first item (index 0) matches ds_0[0]
aggr_first_item = aggr_ds[0]
ds_0_first_item = ds_0[0]
# Compare all keys except episode_index and index which should be updated
for key in ds_0_first_item:
if key not in keys_to_ignore:
# Handle both tensor and non-tensor data
if torch.is_tensor(aggr_first_item[key]) and torch.is_tensor(ds_0_first_item[key]):
assert torch.allclose(aggr_first_item[key], ds_0_first_item[key], atol=1e-6), (
f"First item key '{key}' doesn't match between aggregated and ds_0"
)
else:
assert aggr_first_item[key] == ds_0_first_item[key], (
f"First item key '{key}' doesn't match between aggregated and ds_0"
)
# Check last item of ds_0 part (index len(ds_0)-1) matches ds_0[-1]
aggr_ds_0_last_item = aggr_ds[len(ds_0) - 1]
ds_0_last_item = ds_0[-1]
for key in ds_0_last_item:
if key not in keys_to_ignore:
# Handle both tensor and non-tensor data
if torch.is_tensor(aggr_ds_0_last_item[key]) and torch.is_tensor(ds_0_last_item[key]):
assert torch.allclose(aggr_ds_0_last_item[key], ds_0_last_item[key], atol=1e-6), (
f"Last ds_0 item key '{key}' doesn't match between aggregated and ds_0"
)
else:
assert aggr_ds_0_last_item[key] == ds_0_last_item[key], (
f"Last ds_0 item key '{key}' doesn't match between aggregated and ds_0"
)
# Test second part of dataset corresponds to ds_1
# Check first item of ds_1 part (index len(ds_0)) matches ds_1[0]
aggr_ds_1_first_item = aggr_ds[len(ds_0)]
ds_1_first_item = ds_1[0]
for key in ds_1_first_item:
if key not in keys_to_ignore:
# Handle both tensor and non-tensor data
if torch.is_tensor(aggr_ds_1_first_item[key]) and torch.is_tensor(ds_1_first_item[key]):
assert torch.allclose(aggr_ds_1_first_item[key], ds_1_first_item[key], atol=1e-6), (
f"First ds_1 item key '{key}' doesn't match between aggregated and ds_1"
)
else:
assert aggr_ds_1_first_item[key] == ds_1_first_item[key], (
f"First ds_1 item key '{key}' doesn't match between aggregated and ds_1"
)
# Check last item matches ds_1[-1]
aggr_last_item = aggr_ds[-1]
ds_1_last_item = ds_1[-1]
for key in ds_1_last_item:
if key not in keys_to_ignore:
# Handle both tensor and non-tensor data
if torch.is_tensor(aggr_last_item[key]) and torch.is_tensor(ds_1_last_item[key]):
assert torch.allclose(aggr_last_item[key], ds_1_last_item[key], atol=1e-6), (
f"Last item key '{key}' doesn't match between aggregated and ds_1"
)
else:
assert aggr_last_item[key] == ds_1_last_item[key], (
f"Last item key '{key}' doesn't match between aggregated and ds_1"
)
def assert_metadata_consistency(aggr_ds, ds_0, ds_1):
"""Test that metadata is correctly aggregated."""
# Test basic info
assert aggr_ds.fps == ds_0.fps == ds_1.fps, "FPS should be the same across all datasets"
assert aggr_ds.meta.info["robot_type"] == ds_0.meta.info["robot_type"] == ds_1.meta.info["robot_type"], (
"Robot type should be the same"
)
# Test features are the same
assert aggr_ds.features == ds_0.features == ds_1.features, "Features should be the same"
# Test tasks aggregation
expected_tasks = set(ds_0.meta.tasks.index) | set(ds_1.meta.tasks.index)
actual_tasks = set(aggr_ds.meta.tasks.index)
assert actual_tasks == expected_tasks, f"Expected tasks {expected_tasks}, got {actual_tasks}"
def assert_episode_indices_updated_correctly(aggr_ds, ds_0, ds_1):
"""Test that episode indices are correctly updated after aggregation."""
# ds_0 episodes should have episode_index 0 to ds_0.num_episodes-1
for i in range(len(ds_0)):
assert aggr_ds[i]["episode_index"] < ds_0.num_episodes, (
f"Episode index {aggr_ds[i]['episode_index']} at position {i} should be < {ds_0.num_episodes}"
)
def ds1_episodes_condition(ep_idx):
return (ep_idx >= ds_0.num_episodes) and (ep_idx < ds_0.num_episodes + ds_1.num_episodes)
# ds_1 episodes should have episode_index ds_0.num_episodes to total_episodes-1
for i in range(len(ds_0), len(ds_0) + len(ds_1)):
expected_min_episode_idx = ds_0.num_episodes
assert ds1_episodes_condition(aggr_ds[i]["episode_index"]), (
f"Episode index {aggr_ds[i]['episode_index']} at position {i} should be >= {expected_min_episode_idx}"
)
def assert_video_frames_integrity(aggr_ds, ds_0, ds_1):
"""Test that video frames are correctly preserved and frame indices are updated."""
def visual_frames_equal(frame1, frame2):
return torch.allclose(frame1, frame2)
video_keys = list(
filter(
lambda key: aggr_ds.meta.info["features"][key]["dtype"] == "video",
aggr_ds.meta.info["features"].keys(),
)
)
# Test the section corresponding to the first dataset (ds_0)
for i in range(len(ds_0)):
assert aggr_ds[i]["index"] == i, (
f"Frame index at position {i} should be {i}, but got {aggr_ds[i]['index']}"
)
for key in video_keys:
assert visual_frames_equal(aggr_ds[i][key], ds_0[i][key]), (
f"Visual frames at position {i} should be equal between aggregated and ds_0"
)
# Test the section corresponding to the second dataset (ds_1)
for i in range(len(ds_0), len(ds_0) + len(ds_1)):
# The frame index in the aggregated dataset should also match its position.
assert aggr_ds[i]["index"] == i, (
f"Frame index at position {i} should be {i}, but got {aggr_ds[i]['index']}"
)
for key in video_keys:
assert visual_frames_equal(aggr_ds[i][key], ds_1[i - len(ds_0)][key]), (
f"Visual frames at position {i} should be equal between aggregated and ds_1"
)
def assert_dataset_iteration_works(aggr_ds):
"""Test that we can iterate through the entire dataset without errors."""
for _ in aggr_ds:
pass
def test_aggregate_datasets(tmp_path, lerobot_dataset_factory):
"""Test basic aggregation functionality with standard parameters."""
ds_0_num_frames = 400
ds_1_num_frames = 800
ds_0_num_episodes = 10
ds_1_num_episodes = 25
# Create two datasets with different number of frames and episodes
ds_0 = lerobot_dataset_factory(
root=tmp_path / "test_0",
repo_id=f"{DUMMY_REPO_ID}_0",
total_episodes=ds_0_num_episodes,
total_frames=ds_0_num_frames,
)
ds_1 = lerobot_dataset_factory(
root=tmp_path / "test_1",
repo_id=f"{DUMMY_REPO_ID}_1",
total_episodes=ds_1_num_episodes,
total_frames=ds_1_num_frames,
)
aggregate_datasets(
repo_ids=[ds_0.repo_id, ds_1.repo_id],
roots=[ds_0.root, ds_1.root],
aggr_repo_id=f"{DUMMY_REPO_ID}_aggr",
aggr_root=tmp_path / "test_aggr",
)
# Mock the revision to prevent Hub calls during dataset loading
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(tmp_path / "test_aggr")
aggr_ds = LeRobotDataset(f"{DUMMY_REPO_ID}_aggr", root=tmp_path / "test_aggr")
# Run all assertion functions
expected_total_episodes = ds_0.num_episodes + ds_1.num_episodes
expected_total_frames = ds_0.num_frames + ds_1.num_frames
assert_episode_and_frame_counts(aggr_ds, expected_total_episodes, expected_total_frames)
assert_dataset_content_integrity(aggr_ds, ds_0, ds_1)
assert_metadata_consistency(aggr_ds, ds_0, ds_1)
assert_episode_indices_updated_correctly(aggr_ds, ds_0, ds_1)
assert_video_frames_integrity(aggr_ds, ds_0, ds_1)
assert_dataset_iteration_works(aggr_ds)
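# A minimal usage sketch of the same aggregation outside the test harness,
# assuming two datasets already exist on disk; the repo ids and folder names
# below are illustrative, not taken from the repository.
def example_aggregate_two_local_datasets(root):
    aggregate_datasets(
        repo_ids=["user/ds_a", "user/ds_b"],
        roots=[root / "ds_a", root / "ds_b"],
        aggr_repo_id="user/ds_merged",
        aggr_root=root / "ds_merged",
    )
    # The result can then be loaded like any other local dataset. Note the tests
    # above mock the Hub calls; offline use may need similar handling.
    return LeRobotDataset("user/ds_merged", root=root / "ds_merged")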
def test_aggregate_with_low_threshold(tmp_path, lerobot_dataset_factory):
"""Test aggregation with small file size limits to force file rotation/sharding."""
ds_0_num_episodes = ds_1_num_episodes = 10
ds_0_num_frames = ds_1_num_frames = 400
ds_0 = lerobot_dataset_factory(
root=tmp_path / "small_0",
repo_id=f"{DUMMY_REPO_ID}_small_0",
total_episodes=ds_0_num_episodes,
total_frames=ds_0_num_frames,
)
ds_1 = lerobot_dataset_factory(
root=tmp_path / "small_1",
repo_id=f"{DUMMY_REPO_ID}_small_1",
total_episodes=ds_1_num_episodes,
total_frames=ds_1_num_frames,
)
# Use the new configurable parameters to force file rotation
aggregate_datasets(
repo_ids=[ds_0.repo_id, ds_1.repo_id],
roots=[ds_0.root, ds_1.root],
aggr_repo_id=f"{DUMMY_REPO_ID}_small_aggr",
aggr_root=tmp_path / "small_aggr",
# Tiny file size to trigger new file instantiation
data_files_size_in_mb=0.01,
video_files_size_in_mb=0.1,
)
# Mock the revision to prevent Hub calls during dataset loading
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(tmp_path / "small_aggr")
aggr_ds = LeRobotDataset(f"{DUMMY_REPO_ID}_small_aggr", root=tmp_path / "small_aggr")
# Verify aggregation worked correctly despite file size constraints
expected_total_episodes = ds_0_num_episodes + ds_1_num_episodes
expected_total_frames = ds_0_num_frames + ds_1_num_frames
assert_episode_and_frame_counts(aggr_ds, expected_total_episodes, expected_total_frames)
assert_dataset_content_integrity(aggr_ds, ds_0, ds_1)
assert_metadata_consistency(aggr_ds, ds_0, ds_1)
assert_episode_indices_updated_correctly(aggr_ds, ds_0, ds_1)
assert_video_frames_integrity(aggr_ds, ds_0, ds_1)
assert_dataset_iteration_works(aggr_ds)
# Check that multiple files were actually created due to small size limits
data_dir = tmp_path / "small_aggr" / "data"
video_dir = tmp_path / "small_aggr" / "videos"
if data_dir.exists():
parquet_files = list(data_dir.rglob("*.parquet"))
assert len(parquet_files) > 1, "Small file size limits should create multiple parquet files"
if video_dir.exists():
video_files = list(video_dir.rglob("*.mp4"))
assert len(video_files) > 1, "Small file size limits should create multiple video files"
-584
View File
@@ -1,584 +0,0 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tests for dataset tools utilities."""
from unittest.mock import patch
import numpy as np
import pytest
import torch
from lerobot.datasets.dataset_tools import (
add_feature,
delete_episodes,
merge_datasets,
remove_feature,
split_dataset,
)
@pytest.fixture
def sample_dataset(tmp_path, empty_lerobot_dataset_factory):
"""Create a sample dataset for testing."""
# Create an empty dataset and add data manually
features = {
"action": {"dtype": "float32", "shape": (6,), "names": None},
"observation.state": {"dtype": "float32", "shape": (4,), "names": None},
"observation.images.top": {"dtype": "image", "shape": (224, 224, 3), "names": None},
}
dataset = empty_lerobot_dataset_factory(
root=tmp_path / "test_dataset",
features=features,
)
# Add episodes manually
for ep_idx in range(5):
for _ in range(10):
frame = {
"action": np.random.randn(6).astype(np.float32),
"observation.state": np.random.randn(4).astype(np.float32),
"observation.images.top": np.random.randint(0, 255, size=(224, 224, 3), dtype=np.uint8),
"task": f"task_{ep_idx % 2}",
}
dataset.add_frame(frame)
dataset.save_episode()
return dataset
class TestDeleteEpisodes:
def test_delete_single_episode(self, sample_dataset, tmp_path):
"""Test deleting a single episode."""
output_dir = tmp_path / "filtered"
# Delete episode 2
# Mock the revision check and snapshot_download to prevent Hub calls
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(output_dir)
new_dataset = delete_episodes(
sample_dataset,
episode_indices=[2],
output_dir=output_dir,
)
# Check results
assert new_dataset.meta.total_episodes == 4
assert new_dataset.meta.total_frames == 40
# Check episode indices are renumbered
episode_indices = {int(idx.item()) for idx in new_dataset.hf_dataset["episode_index"]}
assert episode_indices == {0, 1, 2, 3}
# Check data integrity
assert len(new_dataset) == 40
def test_delete_multiple_episodes(self, sample_dataset, tmp_path):
"""Test deleting multiple episodes."""
output_dir = tmp_path / "filtered"
# Delete episodes 1 and 3
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(output_dir)
new_dataset = delete_episodes(
sample_dataset,
episode_indices=[1, 3],
output_dir=output_dir,
)
# Check results
assert new_dataset.meta.total_episodes == 3
assert new_dataset.meta.total_frames == 30
# Check episode indices
episode_indices = {int(idx.item()) for idx in new_dataset.hf_dataset["episode_index"]}
assert episode_indices == {0, 1, 2}
def test_delete_invalid_episodes(self, sample_dataset, tmp_path):
"""Test error handling for invalid episode indices."""
with pytest.raises(ValueError, match="Invalid episode indices"):
delete_episodes(
sample_dataset,
episode_indices=[10, 20], # Out of range
output_dir=tmp_path / "filtered",
)
def test_delete_all_episodes(self, sample_dataset, tmp_path):
"""Test error when trying to delete all episodes."""
with pytest.raises(ValueError, match="Cannot delete all episodes"):
delete_episodes(
sample_dataset,
episode_indices=list(range(5)), # All episodes
output_dir=tmp_path / "filtered",
)
def test_delete_empty_list(self, sample_dataset, tmp_path):
"""Test error when no episodes specified."""
with pytest.raises(ValueError, match="No episodes to delete"):
delete_episodes(
sample_dataset,
episode_indices=[],
output_dir=tmp_path / "filtered",
)
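# Sketch of the renumbering invariant the tests above rely on, assuming episodes
# are reindexed contiguously after deletion: deleting episode 2 from {0..4}
# keeps {0, 1, 3, 4}, which are renumbered to {0, 1, 2, 3}.
def renumber_after_deletion(total_episodes, deleted):
    kept = [ep for ep in range(total_episodes) if ep not in set(deleted)]
    return {old: new for new, old in enumerate(kept)}

assert renumber_after_deletion(5, [2]) == {0: 0, 1: 1, 3: 2, 4: 3}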
class TestSplitDataset:
def test_split_by_episodes(self, sample_dataset, tmp_path):
"""Test splitting dataset by specific episode indices."""
splits = {
"train": [0, 1, 2],
"val": [3, 4],
}
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
# Mock snapshot_download to return the appropriate directory for each split
def mock_snapshot(repo_id, **kwargs):
if "train" in repo_id:
return str(tmp_path / f"{sample_dataset.repo_id}_train")
elif "val" in repo_id:
return str(tmp_path / f"{sample_dataset.repo_id}_val")
return str(kwargs.get("local_dir", tmp_path))
mock_snapshot_download.side_effect = mock_snapshot
result = split_dataset(
sample_dataset,
splits=splits,
output_dir=tmp_path,
)
# Check we got both splits
assert set(result.keys()) == {"train", "val"}
# Check train split
assert result["train"].meta.total_episodes == 3
assert result["train"].meta.total_frames == 30
# Check val split
assert result["val"].meta.total_episodes == 2
assert result["val"].meta.total_frames == 20
# Check episode renumbering
train_episodes = {int(idx.item()) for idx in result["train"].hf_dataset["episode_index"]}
assert train_episodes == {0, 1, 2}
val_episodes = {int(idx.item()) for idx in result["val"].hf_dataset["episode_index"]}
assert val_episodes == {0, 1}
def test_split_by_fractions(self, sample_dataset, tmp_path):
"""Test splitting dataset by fractions."""
splits = {
"train": 0.6, # 3 episodes
"val": 0.4, # 2 episodes
}
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
def mock_snapshot(repo_id, **kwargs):
for split_name in splits:
if split_name in repo_id:
return str(tmp_path / f"{sample_dataset.repo_id}_{split_name}")
return str(kwargs.get("local_dir", tmp_path))
mock_snapshot_download.side_effect = mock_snapshot
result = split_dataset(
sample_dataset,
splits=splits,
output_dir=tmp_path,
)
# Check splits
assert result["train"].meta.total_episodes == 3
assert result["val"].meta.total_episodes == 2
def test_split_overlapping_episodes(self, sample_dataset, tmp_path):
"""Test error when episodes appear in multiple splits."""
splits = {
"train": [0, 1, 2],
"val": [2, 3, 4], # Episode 2 appears in both
}
with pytest.raises(ValueError, match="Episodes cannot appear in multiple splits"):
split_dataset(sample_dataset, splits=splits, output_dir=tmp_path)
def test_split_invalid_fractions(self, sample_dataset, tmp_path):
"""Test error when fractions sum to more than 1."""
splits = {
"train": 0.7,
"val": 0.5, # Sum = 1.2
}
with pytest.raises(ValueError, match="Split fractions must sum to <= 1.0"):
split_dataset(sample_dataset, splits=splits, output_dir=tmp_path)
def test_split_empty(self, sample_dataset, tmp_path):
"""Test error with empty splits."""
with pytest.raises(ValueError, match="No splits provided"):
split_dataset(sample_dataset, splits={}, output_dir=tmp_path)
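# Sketch of how fractional splits plausibly map to episode counts, consistent
# with the 0.6/0.4 over 5 episodes -> 3/2 expectation above; the exact rounding
# rule used by split_dataset is an assumption of this sketch.
def episodes_per_split(total_episodes, fractions):
    return {name: int(frac * total_episodes) for name, frac in fractions.items()}

assert episodes_per_split(5, {"train": 0.6, "val": 0.4}) == {"train": 3, "val": 2}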
class TestMergeDatasets:
def test_merge_two_datasets(self, sample_dataset, tmp_path, empty_lerobot_dataset_factory):
"""Test merging two datasets."""
# Create a second dataset manually
features = {
"action": {"dtype": "float32", "shape": (6,), "names": None},
"observation.state": {"dtype": "float32", "shape": (4,), "names": None},
"observation.images.top": {"dtype": "image", "shape": (224, 224, 3), "names": None},
}
dataset2 = empty_lerobot_dataset_factory(
root=tmp_path / "test_dataset2",
features=features,
)
# Add 3 episodes
for ep_idx in range(3):
for _ in range(10):
frame = {
"action": np.random.randn(6).astype(np.float32),
"observation.state": np.random.randn(4).astype(np.float32),
"observation.images.top": np.random.randint(0, 255, size=(224, 224, 3), dtype=np.uint8),
"task": f"task_{ep_idx % 2}",
}
dataset2.add_frame(frame)
dataset2.save_episode()
# Merge datasets
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(tmp_path / "merged_dataset")
merged = merge_datasets(
[sample_dataset, dataset2],
output_repo_id="merged_dataset",
output_dir=tmp_path / "merged_dataset",
)
# Check results
assert merged.meta.total_episodes == 8 # 5 + 3
assert merged.meta.total_frames == 80 # 50 + 30
# Check episode indices are sequential
episode_indices = sorted({int(idx.item()) for idx in merged.hf_dataset["episode_index"]})
assert episode_indices == list(range(8))
def test_merge_empty_list(self, tmp_path):
"""Test error when merging empty list."""
with pytest.raises(ValueError, match="No datasets to merge"):
merge_datasets([], output_repo_id="merged", output_dir=tmp_path)
class TestAddFeature:
def test_add_feature_with_values(self, sample_dataset, tmp_path):
"""Test adding a feature with pre-computed values."""
# Create reward values for all frames
num_frames = sample_dataset.meta.total_frames
reward_values = np.random.randn(num_frames, 1).astype(np.float32)
feature_info = {
"dtype": "float32",
"shape": (1,),
"names": None,
}
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(tmp_path / "with_reward")
new_dataset = add_feature(
sample_dataset,
feature_name="reward",
feature_values=reward_values,
feature_info=feature_info,
output_dir=tmp_path / "with_reward",
)
# Check feature was added
assert "reward" in new_dataset.meta.features
assert new_dataset.meta.features["reward"] == feature_info
# Check values
assert len(new_dataset) == num_frames
sample_item = new_dataset[0]
assert "reward" in sample_item
# Scalar features don't have shape, just check it's a tensor
assert isinstance(sample_item["reward"], torch.Tensor)
def test_add_feature_with_callable(self, sample_dataset, tmp_path):
"""Test adding a feature with a callable."""
def compute_reward(frame_dict, episode_idx, frame_idx):
# Simple reward based on episode and frame indices
return float(episode_idx * 10 + frame_idx)
feature_info = {
"dtype": "float32",
"shape": (1,),
"names": None,
}
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(tmp_path / "with_reward")
new_dataset = add_feature(
sample_dataset,
feature_name="reward",
feature_values=compute_reward,
feature_info=feature_info,
output_dir=tmp_path / "with_reward",
)
# Check feature was added
assert "reward" in new_dataset.meta.features
# Check computed values
# Episode 0, frame 0 should have reward 0
items = [new_dataset[i] for i in range(10)]
first_episode_items = [item for item in items if item["episode_index"] == 0]
assert len(first_episode_items) == 10
# Check first frame of first episode
first_frame = first_episode_items[0]
assert first_frame["frame_index"] == 0
assert float(first_frame["reward"]) == 0.0
def test_add_existing_feature(self, sample_dataset, tmp_path):
"""Test error when adding an existing feature."""
feature_info = {"dtype": "float32", "shape": (1,)}
with pytest.raises(ValueError, match="Feature 'action' already exists"):
add_feature(
sample_dataset,
feature_name="action", # Already exists
feature_values=np.zeros(50),
feature_info=feature_info,
output_dir=tmp_path / "modified",
)
def test_add_feature_invalid_info(self, sample_dataset, tmp_path):
"""Test error with invalid feature info."""
with pytest.raises(ValueError, match="feature_info must contain keys"):
add_feature(
sample_dataset,
feature_name="reward",
feature_values=np.zeros(50),
feature_info={"dtype": "float32"}, # Missing 'shape'
output_dir=tmp_path / "modified",
)
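# Another plausible callable for add_feature, following the
# (frame_dict, episode_idx, frame_idx) signature exercised above: a sparse
# success flag that fires on the last frame of each 10-frame episode (the
# episode length matches the sample_dataset fixture; it is otherwise an
# assumption of this sketch).
def compute_success(frame_dict, episode_idx, frame_idx):
    return 1.0 if frame_idx == 9 else 0.0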
class TestRemoveFeature:
def test_remove_single_feature(self, sample_dataset, tmp_path):
"""Test removing a single feature."""
# First add a feature to remove
feature_info = {"dtype": "float32", "shape": (1,), "names": None}
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.side_effect = lambda repo_id, **kwargs: str(
kwargs.get("local_dir", tmp_path)
)
dataset_with_reward = add_feature(
sample_dataset,
feature_name="reward",
feature_values=np.random.randn(50, 1).astype(np.float32),
feature_info=feature_info,
output_dir=tmp_path / "with_reward",
)
# Now remove it
dataset_without_reward = remove_feature(
dataset_with_reward,
feature_names="reward",
output_dir=tmp_path / "without_reward",
)
# Check feature was removed
assert "reward" not in dataset_without_reward.meta.features
# Check data
sample_item = dataset_without_reward[0]
assert "reward" not in sample_item
def test_remove_multiple_features(self, sample_dataset, tmp_path):
"""Test removing multiple features at once."""
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.side_effect = lambda repo_id, **kwargs: str(
kwargs.get("local_dir", tmp_path)
)
# Add two features
dataset = sample_dataset
for feature_name in ["reward", "success"]:
feature_info = {"dtype": "float32", "shape": (1,), "names": None}
dataset = add_feature(
dataset,
feature_name=feature_name,
feature_values=np.random.randn(dataset.meta.total_frames, 1).astype(np.float32),
feature_info=feature_info,
output_dir=tmp_path / f"with_{feature_name}",
)
# Remove both
dataset_clean = remove_feature(
dataset,
feature_names=["reward", "success"],
output_dir=tmp_path / "clean",
)
# Check both were removed
assert "reward" not in dataset_clean.meta.features
assert "success" not in dataset_clean.meta.features
def test_remove_nonexistent_feature(self, sample_dataset, tmp_path):
"""Test error when removing non-existent feature."""
with pytest.raises(ValueError, match="Feature 'nonexistent' not found"):
remove_feature(
sample_dataset,
feature_names="nonexistent",
output_dir=tmp_path / "modified",
)
def test_remove_required_feature(self, sample_dataset, tmp_path):
"""Test error when trying to remove required features."""
with pytest.raises(ValueError, match="Cannot remove required features"):
remove_feature(
sample_dataset,
feature_names="timestamp", # Required feature
output_dir=tmp_path / "modified",
)
def test_remove_camera_feature(self, sample_dataset, tmp_path):
"""Test removing a camera feature."""
camera_keys = sample_dataset.meta.camera_keys
if not camera_keys:
pytest.skip("No camera keys in dataset")
# Remove first camera
camera_to_remove = camera_keys[0]
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(tmp_path / "without_camera")
dataset_without_camera = remove_feature(
sample_dataset,
feature_names=camera_to_remove,
output_dir=tmp_path / "without_camera",
)
# Check camera was removed
assert camera_to_remove not in dataset_without_camera.meta.features
assert camera_to_remove not in dataset_without_camera.meta.camera_keys
# Check data
sample_item = dataset_without_camera[0]
assert camera_to_remove not in sample_item
class TestIntegration:
def test_complex_workflow(self, sample_dataset, tmp_path):
"""Test a complex workflow combining multiple operations."""
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.side_effect = lambda repo_id, **kwargs: str(
kwargs.get("local_dir", tmp_path)
)
# 1. Add a reward feature
dataset = add_feature(
sample_dataset,
feature_name="reward",
feature_values=np.random.randn(50, 1).astype(np.float32),
feature_info={"dtype": "float32", "shape": (1,), "names": None},
output_dir=tmp_path / "step1",
)
# 2. Delete an episode
dataset = delete_episodes(
dataset,
episode_indices=[2],
output_dir=tmp_path / "step2",
)
# 3. Split into train/val
splits = split_dataset(
dataset,
splits={"train": 0.75, "val": 0.25},
output_dir=tmp_path / "step3",
)
# 4. Merge them back
merged = merge_datasets(
list(splits.values()),
output_repo_id="final_dataset",
output_dir=tmp_path / "step4",
)
# Check final dataset
assert merged.meta.total_episodes == 4 # Started with 5, deleted 1
assert merged.meta.total_frames == 40
assert "reward" in merged.meta.features # Feature preserved
# Check data integrity
assert len(merged) == 40
sample_item = merged[0]
assert "reward" in sample_item
+92 -73
View File
@@ -13,8 +13,10 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
import logging
import re
from copy import deepcopy
from itertools import chain
from pathlib import Path
@@ -36,13 +38,12 @@ from lerobot.datasets.lerobot_dataset import (
)
from lerobot.datasets.utils import (
create_branch,
hw_to_dataset_features,
flatten_dict,
unflatten_dict,
)
from lerobot.envs.factory import make_env_config
from lerobot.policies.factory import make_policy_config
from lerobot.robots import make_robot_from_config
from tests.fixtures.constants import DUMMY_CHW, DUMMY_HWC, DUMMY_REPO_ID
from tests.mocks.mock_robot import MockRobotConfig
from tests.utils import require_x86_64_kernel
@@ -68,17 +69,12 @@ def test_same_attributes_defined(tmp_path, lerobot_dataset_factory):
objects have the same sets of attributes defined.
"""
# Instantiate both ways
robot = make_robot_from_config(MockRobotConfig())
action_features = hw_to_dataset_features(robot.action_features, "action", True)
obs_features = hw_to_dataset_features(robot.observation_features, "observation", True)
dataset_features = {**action_features, **obs_features}
features = {"state": {"dtype": "float32", "shape": (1,), "names": None}}
root_create = tmp_path / "create"
dataset_create = LeRobotDataset.create(
repo_id=DUMMY_REPO_ID, fps=30, features=dataset_features, root=root_create
)
dataset_create = LeRobotDataset.create(repo_id=DUMMY_REPO_ID, fps=30, features=features, root=root_create)
root_init = tmp_path / "init"
dataset_init = lerobot_dataset_factory(root=root_init, total_episodes=1, total_frames=1)
dataset_init = lerobot_dataset_factory(root=root_init)
init_attr = set(vars(dataset_init).keys())
create_attr = set(vars(dataset_create).keys())
@@ -103,41 +99,13 @@ def test_dataset_initialization(tmp_path, lerobot_dataset_factory):
assert dataset.num_frames == len(dataset)
# TODO(rcadene, aliberts): do not run LeRobotDataset.create, instead refactor LeRobotDatasetMetadata.create
# and test the small resulting function that validates the features
def test_dataset_feature_with_forward_slash_raises_error():
from lerobot.constants import HF_LEROBOT_HOME
dataset_dir = HF_LEROBOT_HOME / "lerobot/test/with/slash"
# make sure the directory does not exist
if dataset_dir.exists():
dataset_dir.rmdir()
with pytest.raises(ValueError):
LeRobotDataset.create(
repo_id="lerobot/test/with/slash",
fps=30,
features={"a/b": {"dtype": "float32", "shape": 2, "names": None}},
)
def test_add_frame_missing_task(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (1,), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
with pytest.raises(
ValueError, match="Feature mismatch in `frame` dictionary:\nMissing features: {'task'}\n"
):
dataset.add_frame({"state": torch.randn(1)})
def test_add_frame_missing_feature(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (1,), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
with pytest.raises(
ValueError, match="Feature mismatch in `frame` dictionary:\nMissing features: {'state'}\n"
):
dataset.add_frame({"task": "Dummy task"})
dataset.add_frame({"wrong_feature": torch.randn(1)}, task="Dummy task")
def test_add_frame_extra_feature(tmp_path, empty_lerobot_dataset_factory):
@@ -146,7 +114,7 @@ def test_add_frame_extra_feature(tmp_path, empty_lerobot_dataset_factory):
with pytest.raises(
ValueError, match="Feature mismatch in `frame` dictionary:\nExtra features: {'extra'}\n"
):
dataset.add_frame({"state": torch.randn(1), "task": "Dummy task", "extra": "dummy_extra"})
dataset.add_frame({"state": torch.randn(1), "extra": "dummy_extra"}, task="Dummy task")
def test_add_frame_wrong_type(tmp_path, empty_lerobot_dataset_factory):
@@ -155,7 +123,7 @@ def test_add_frame_wrong_type(tmp_path, empty_lerobot_dataset_factory):
with pytest.raises(
ValueError, match="The feature 'state' of dtype 'float16' is not of the expected dtype 'float32'.\n"
):
dataset.add_frame({"state": torch.randn(1, dtype=torch.float16), "task": "Dummy task"})
dataset.add_frame({"state": torch.randn(1, dtype=torch.float16)}, task="Dummy task")
def test_add_frame_wrong_shape(tmp_path, empty_lerobot_dataset_factory):
@@ -165,7 +133,7 @@ def test_add_frame_wrong_shape(tmp_path, empty_lerobot_dataset_factory):
ValueError,
match=re.escape("The feature 'state' of shape '(1,)' does not have the expected shape '(2,)'.\n"),
):
dataset.add_frame({"state": torch.randn(1), "task": "Dummy task"})
dataset.add_frame({"state": torch.randn(1)}, task="Dummy task")
def test_add_frame_wrong_shape_python_float(tmp_path, empty_lerobot_dataset_factory):
@@ -177,7 +145,7 @@ def test_add_frame_wrong_shape_python_float(tmp_path, empty_lerobot_dataset_fact
"The feature 'state' is not a 'np.ndarray'. Expected type is 'float32', but type '<class 'float'>' provided instead.\n"
),
):
dataset.add_frame({"state": 1.0, "task": "Dummy task"})
dataset.add_frame({"state": 1.0}, task="Dummy task")
def test_add_frame_wrong_shape_torch_ndim_0(tmp_path, empty_lerobot_dataset_factory):
@@ -187,7 +155,7 @@ def test_add_frame_wrong_shape_torch_ndim_0(tmp_path, empty_lerobot_dataset_fact
ValueError,
match=re.escape("The feature 'state' of shape '()' does not have the expected shape '(1,)'.\n"),
):
dataset.add_frame({"state": torch.tensor(1.0), "task": "Dummy task"})
dataset.add_frame({"state": torch.tensor(1.0)}, task="Dummy task")
def test_add_frame_wrong_shape_numpy_ndim_0(tmp_path, empty_lerobot_dataset_factory):
@@ -199,13 +167,13 @@ def test_add_frame_wrong_shape_numpy_ndim_0(tmp_path, empty_lerobot_dataset_fact
"The feature 'state' is not a 'np.ndarray'. Expected type is 'float32', but type '<class 'numpy.float32'>' provided instead.\n"
),
):
dataset.add_frame({"state": np.float32(1.0), "task": "Dummy task"})
dataset.add_frame({"state": np.float32(1.0)}, task="Dummy task")
def test_add_frame(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (1,), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
dataset.add_frame({"state": torch.randn(1), "task": "Dummy task"})
dataset.add_frame({"state": torch.randn(1)}, task="Dummy task")
dataset.save_episode()
assert len(dataset) == 1
@@ -217,7 +185,7 @@ def test_add_frame(tmp_path, empty_lerobot_dataset_factory):
def test_add_frame_state_1d(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (2,), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
dataset.add_frame({"state": torch.randn(2), "task": "Dummy task"})
dataset.add_frame({"state": torch.randn(2)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["state"].shape == torch.Size([2])
@@ -226,7 +194,7 @@ def test_add_frame_state_1d(tmp_path, empty_lerobot_dataset_factory):
def test_add_frame_state_2d(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (2, 4), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
dataset.add_frame({"state": torch.randn(2, 4), "task": "Dummy task"})
dataset.add_frame({"state": torch.randn(2, 4)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["state"].shape == torch.Size([2, 4])
@@ -235,7 +203,7 @@ def test_add_frame_state_2d(tmp_path, empty_lerobot_dataset_factory):
def test_add_frame_state_3d(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (2, 4, 3), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
dataset.add_frame({"state": torch.randn(2, 4, 3), "task": "Dummy task"})
dataset.add_frame({"state": torch.randn(2, 4, 3)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["state"].shape == torch.Size([2, 4, 3])
@@ -244,7 +212,7 @@ def test_add_frame_state_3d(tmp_path, empty_lerobot_dataset_factory):
def test_add_frame_state_4d(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (2, 4, 3, 5), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
dataset.add_frame({"state": torch.randn(2, 4, 3, 5), "task": "Dummy task"})
dataset.add_frame({"state": torch.randn(2, 4, 3, 5)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["state"].shape == torch.Size([2, 4, 3, 5])
@@ -253,7 +221,7 @@ def test_add_frame_state_4d(tmp_path, empty_lerobot_dataset_factory):
def test_add_frame_state_5d(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (2, 4, 3, 5, 1), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
dataset.add_frame({"state": torch.randn(2, 4, 3, 5, 1), "task": "Dummy task"})
dataset.add_frame({"state": torch.randn(2, 4, 3, 5, 1)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["state"].shape == torch.Size([2, 4, 3, 5, 1])
@@ -262,7 +230,7 @@ def test_add_frame_state_5d(tmp_path, empty_lerobot_dataset_factory):
def test_add_frame_state_numpy(tmp_path, empty_lerobot_dataset_factory):
features = {"state": {"dtype": "float32", "shape": (1,), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
dataset.add_frame({"state": np.array([1], dtype=np.float32), "task": "Dummy task"})
dataset.add_frame({"state": np.array([1], dtype=np.float32)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["state"].ndim == 0
@@ -271,7 +239,7 @@ def test_add_frame_state_numpy(tmp_path, empty_lerobot_dataset_factory):
def test_add_frame_string(tmp_path, empty_lerobot_dataset_factory):
features = {"caption": {"dtype": "string", "shape": (1,), "names": None}}
dataset = empty_lerobot_dataset_factory(root=tmp_path / "test", features=features)
dataset.add_frame({"caption": "Dummy caption", "task": "Dummy task"})
dataset.add_frame({"caption": "Dummy caption"}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["caption"] == "Dummy caption"
@@ -286,7 +254,7 @@ def test_add_frame_image_wrong_shape(image_dataset):
),
):
c, h, w = DUMMY_CHW
dataset.add_frame({"image": torch.randn(c, w, h), "task": "Dummy task"})
dataset.add_frame({"image": torch.randn(c, w, h)}, task="Dummy task")
def test_add_frame_image_wrong_range(image_dataset):
@@ -299,14 +267,14 @@ def test_add_frame_image_wrong_range(image_dataset):
Hence the image won't be saved on disk and save_episode will raise `FileNotFoundError`.
"""
dataset = image_dataset
dataset.add_frame({"image": np.random.rand(*DUMMY_CHW) * 255, "task": "Dummy task"})
dataset.add_frame({"image": np.random.rand(*DUMMY_CHW) * 255}, task="Dummy task")
with pytest.raises(FileNotFoundError):
dataset.save_episode()
def test_add_frame_image(image_dataset):
dataset = image_dataset
dataset.add_frame({"image": np.random.rand(*DUMMY_CHW), "task": "Dummy task"})
dataset.add_frame({"image": np.random.rand(*DUMMY_CHW)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["image"].shape == torch.Size(DUMMY_CHW)
@@ -314,7 +282,7 @@ def test_add_frame_image(image_dataset):
def test_add_frame_image_h_w_c(image_dataset):
dataset = image_dataset
dataset.add_frame({"image": np.random.rand(*DUMMY_HWC), "task": "Dummy task"})
dataset.add_frame({"image": np.random.rand(*DUMMY_HWC)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["image"].shape == torch.Size(DUMMY_CHW)
@@ -323,7 +291,7 @@ def test_add_frame_image_h_w_c(image_dataset):
def test_add_frame_image_uint8(image_dataset):
dataset = image_dataset
image = np.random.randint(0, 256, DUMMY_HWC, dtype=np.uint8)
dataset.add_frame({"image": image, "task": "Dummy task"})
dataset.add_frame({"image": image}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["image"].shape == torch.Size(DUMMY_CHW)
@@ -332,7 +300,7 @@ def test_add_frame_image_uint8(image_dataset):
def test_add_frame_image_pil(image_dataset):
dataset = image_dataset
image = np.random.randint(0, 256, DUMMY_HWC, dtype=np.uint8)
dataset.add_frame({"image": Image.fromarray(image), "task": "Dummy task"})
dataset.add_frame({"image": Image.fromarray(image)}, task="Dummy task")
dataset.save_episode()
assert dataset[0]["image"].shape == torch.Size(DUMMY_CHW)
@@ -351,13 +319,6 @@ def test_image_array_to_pil_image_wrong_range_float_0_255():
# - [ ] test push_to_hub
# - [ ] test smaller methods
# TODO(rcadene):
# - [ ] fix code so that old test_factory + backward pass
# - [ ] write new unit tests to test save_episode + getitem
# - [ ] save_episode : case where new dataset, concatenate same file, write new file (meta/episodes, data, videos)
# - [ ]
# - [ ] remove old tests
@pytest.mark.parametrize(
"env_name, repo_id, policy_name",
@@ -377,8 +338,9 @@ def test_factory(env_name, repo_id, policy_name):
# TODO(rcadene, aliberts): remove dataset download
dataset=DatasetConfig(repo_id=repo_id, episodes=[0]),
env=make_env_config(env_name),
policy=make_policy_config(policy_name),
policy=make_policy_config(policy_name, push_to_hub=False),
)
cfg.validate()
dataset = make_dataset(cfg)
delta_timestamps = dataset.delta_timestamps
@@ -465,6 +427,30 @@ def test_multidataset_frames():
assert torch.equal(sub_dataset_item[k], dataset_item[k])
# TODO(aliberts): Move to more appropriate location
def test_flatten_unflatten_dict():
d = {
"obs": {
"min": 0,
"max": 1,
"mean": 2,
"std": 3,
},
"action": {
"min": 4,
"max": 5,
"mean": 6,
"std": 7,
},
}
original_d = deepcopy(d)
d = unflatten_dict(flatten_dict(d))
# test equality between nested dicts
assert json.dumps(original_d, sort_keys=True) == json.dumps(d, sort_keys=True), f"{original_d} != {d}"
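# For reference, flatten_dict plausibly maps the nested dict above to flat keys
# such as {"obs/min": 0, ..., "action/std": 7}; the "/" separator is an
# assumption of this note, the test only relies on the round trip being lossless.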
@pytest.mark.parametrize(
"repo_id",
[
@@ -511,22 +497,38 @@ def test_backward_compatibility(repo_id):
)
# test2 first frames of first episode
i = dataset.meta.episodes[0]["dataset_from_index"]
i = dataset.episode_data_index["from"][0].item()
load_and_compare(i)
load_and_compare(i + 1)
# test 2 frames at the middle of first episode
i = int(
(dataset.meta.episodes[0]["dataset_to_index"] - dataset.meta.episodes[0]["dataset_from_index"]) / 2
)
i = int((dataset.episode_data_index["to"][0].item() - dataset.episode_data_index["from"][0].item()) / 2)
load_and_compare(i)
load_and_compare(i + 1)
# test 2 last frames of first episode
i = dataset.meta.episodes[0]["dataset_to_index"]
i = dataset.episode_data_index["to"][0].item()
load_and_compare(i - 2)
load_and_compare(i - 1)
# TODO(rcadene): Enable testing on second and last episode
# We currently cant because our test dataset only contains the first episode
# # test 2 first frames of second episode
# i = dataset.episode_data_index["from"][1].item()
# load_and_compare(i)
# load_and_compare(i + 1)
# # test 2 last frames of second episode
# i = dataset.episode_data_index["to"][1].item()
# load_and_compare(i - 2)
# load_and_compare(i - 1)
# # test 2 last frames of last episode
# i = dataset.episode_data_index["to"][-1].item()
# load_and_compare(i - 2)
# load_and_compare(i - 1)
@pytest.mark.skip("Requires internet access")
def test_create_branch():
@@ -552,3 +554,20 @@ def test_create_branch():
# Clean
api.delete_repo(repo_id, repo_type=repo_type)
def test_dataset_feature_with_forward_slash_raises_error():
from lerobot.constants import HF_LEROBOT_HOME
dataset_dir = HF_LEROBOT_HOME / "lerobot/test/with/slash"
# make sure the directory does not exist
if dataset_dir.exists():
dataset_dir.rmdir()
with pytest.raises(ValueError):
LeRobotDataset.create(
repo_id="lerobot/test/with/slash",
fps=30,
features={"a/b": {"dtype": "float32", "shape": 2, "names": None}},
)
+140
View File
@@ -11,15 +11,83 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from itertools import accumulate
import datasets
import numpy as np
import pyarrow.compute as pc
import pytest
import torch
from lerobot.datasets.utils import (
check_delta_timestamps,
check_timestamps_sync,
get_delta_indices,
)
from tests.fixtures.constants import DUMMY_MOTOR_FEATURES
def calculate_total_episode(
hf_dataset: datasets.Dataset, raise_if_not_contiguous: bool = True
) -> int:
episode_indices = sorted(hf_dataset.unique("episode_index"))
total_episodes = len(episode_indices)
if raise_if_not_contiguous and episode_indices != list(range(total_episodes)):
raise ValueError("episode_index values are not sorted and contiguous.")
return total_episodes
def calculate_episode_data_index(hf_dataset: datasets.Dataset) -> dict[str, np.ndarray]:
episode_lengths = []
table = hf_dataset.data.table
total_episodes = calculate_total_episode(hf_dataset)
for ep_idx in range(total_episodes):
ep_table = table.filter(pc.equal(table["episode_index"], ep_idx))
episode_lengths.insert(ep_idx, len(ep_table))
cumulative_lengths = list(accumulate(episode_lengths))
return {
"from": np.array([0] + cumulative_lengths[:-1], dtype=np.int64),
"to": np.array(cumulative_lengths, dtype=np.int64),
}
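# Worked example: for three episodes of lengths [2, 1, 3], cumulative lengths
# are [2, 3, 6], so the returned index is {"from": [0, 2, 3], "to": [2, 3, 6]},
# i.e. episode i spans dataset rows [from[i], to[i]).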
@pytest.fixture(scope="module")
def synced_timestamps_factory(hf_dataset_factory):
def _create_synced_timestamps(fps: int = 30) -> tuple[np.ndarray, np.ndarray, dict[str, np.ndarray]]:
hf_dataset = hf_dataset_factory(fps=fps)
timestamps = torch.stack(hf_dataset["timestamp"]).numpy()
episode_indices = torch.stack(hf_dataset["episode_index"]).numpy()
episode_data_index = calculate_episode_data_index(hf_dataset)
return timestamps, episode_indices, episode_data_index
return _create_synced_timestamps
@pytest.fixture(scope="module")
def unsynced_timestamps_factory(synced_timestamps_factory):
def _create_unsynced_timestamps(
fps: int = 30, tolerance_s: float = 1e-4
) -> tuple[np.ndarray, np.ndarray, dict[str, np.ndarray]]:
timestamps, episode_indices, episode_data_index = synced_timestamps_factory(fps=fps)
timestamps[30] += tolerance_s * 1.1 # Modify a single timestamp just outside tolerance
return timestamps, episode_indices, episode_data_index
return _create_unsynced_timestamps
@pytest.fixture(scope="module")
def slightly_off_timestamps_factory(synced_timestamps_factory):
def _create_slightly_off_timestamps(
fps: int = 30, tolerance_s: float = 1e-4
) -> tuple[np.ndarray, np.ndarray, dict[str, np.ndarray]]:
timestamps, episode_indices, episode_data_index = synced_timestamps_factory(fps=fps)
timestamps[30] += tolerance_s * 0.9 # Modify a single timestamp just inside tolerance
return timestamps, episode_indices, episode_data_index
return _create_slightly_off_timestamps
@pytest.fixture(scope="module")
def valid_delta_timestamps_factory():
def _create_valid_delta_timestamps(
@@ -68,6 +136,78 @@ def delta_indices_factory():
return _delta_indices
def test_check_timestamps_sync_synced(synced_timestamps_factory):
fps = 30
tolerance_s = 1e-4
timestamps, ep_idx, ep_data_index = synced_timestamps_factory(fps)
result = check_timestamps_sync(
timestamps=timestamps,
episode_indices=ep_idx,
episode_data_index=ep_data_index,
fps=fps,
tolerance_s=tolerance_s,
)
assert result is True
def test_check_timestamps_sync_unsynced(unsynced_timestamps_factory):
fps = 30
tolerance_s = 1e-4
timestamps, ep_idx, ep_data_index = unsynced_timestamps_factory(fps, tolerance_s)
with pytest.raises(ValueError):
check_timestamps_sync(
timestamps=timestamps,
episode_indices=ep_idx,
episode_data_index=ep_data_index,
fps=fps,
tolerance_s=tolerance_s,
)
def test_check_timestamps_sync_unsynced_no_exception(unsynced_timestamps_factory):
fps = 30
tolerance_s = 1e-4
timestamps, ep_idx, ep_data_index = unsynced_timestamps_factory(fps, tolerance_s)
result = check_timestamps_sync(
timestamps=timestamps,
episode_indices=ep_idx,
episode_data_index=ep_data_index,
fps=fps,
tolerance_s=tolerance_s,
raise_value_error=False,
)
assert result is False
def test_check_timestamps_sync_slightly_off(slightly_off_timestamps_factory):
fps = 30
tolerance_s = 1e-4
timestamps, ep_idx, ep_data_index = slightly_off_timestamps_factory(fps, tolerance_s)
result = check_timestamps_sync(
timestamps=timestamps,
episode_indices=ep_idx,
episode_data_index=ep_data_index,
fps=fps,
tolerance_s=tolerance_s,
)
assert result is True
def test_check_timestamps_sync_single_timestamp():
fps = 30
tolerance_s = 1e-4
timestamps, ep_idx = np.array([0.0]), np.array([0])
episode_data_index = {"to": np.array([1]), "from": np.array([0])}
result = check_timestamps_sync(
timestamps=timestamps,
episode_indices=ep_idx,
episode_data_index=episode_data_index,
fps=fps,
tolerance_s=tolerance_s,
)
assert result is True
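# Numeric sketch of the tolerance rule the fixtures above encode: consecutive
# timestamps within an episode should differ by 1/fps, up to tolerance_s.
fps, tolerance_s = 30, 1e-4
expected_gap = 1 / fps
assert abs((expected_gap + 0.9 * tolerance_s) - expected_gap) <= tolerance_s  # "slightly off" passes
assert abs((expected_gap + 1.1 * tolerance_s) - expected_gap) > tolerance_s  # "unsynced" fails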
def test_check_delta_timestamps_valid(valid_delta_timestamps_factory):
fps = 30
tolerance_s = 1e-4
+5 -7
View File
@@ -32,7 +32,7 @@ def test_drop_n_first_frames():
)
dataset.set_transform(hf_transform_to_torch)
episode_data_index = calculate_episode_data_index(dataset)
sampler = EpisodeAwareSampler(episode_data_index["from"], episode_data_index["to"], drop_n_first_frames=1)
sampler = EpisodeAwareSampler(episode_data_index, drop_n_first_frames=1)
assert sampler.indices == [1, 4, 5]
assert len(sampler) == 3
assert list(sampler) == [1, 4, 5]
@@ -48,7 +48,7 @@ def test_drop_n_last_frames():
)
dataset.set_transform(hf_transform_to_torch)
episode_data_index = calculate_episode_data_index(dataset)
sampler = EpisodeAwareSampler(episode_data_index["from"], episode_data_index["to"], drop_n_last_frames=1)
sampler = EpisodeAwareSampler(episode_data_index, drop_n_last_frames=1)
assert sampler.indices == [0, 3, 4]
assert len(sampler) == 3
assert list(sampler) == [0, 3, 4]
@@ -64,9 +64,7 @@ def test_episode_indices_to_use():
)
dataset.set_transform(hf_transform_to_torch)
episode_data_index = calculate_episode_data_index(dataset)
sampler = EpisodeAwareSampler(
episode_data_index["from"], episode_data_index["to"], episode_indices_to_use=[0, 2]
)
sampler = EpisodeAwareSampler(episode_data_index, episode_indices_to_use=[0, 2])
assert sampler.indices == [0, 1, 3, 4, 5]
assert len(sampler) == 5
assert list(sampler) == [0, 1, 3, 4, 5]
@@ -82,11 +80,11 @@ def test_shuffle():
)
dataset.set_transform(hf_transform_to_torch)
episode_data_index = calculate_episode_data_index(dataset)
sampler = EpisodeAwareSampler(episode_data_index["from"], episode_data_index["to"], shuffle=False)
sampler = EpisodeAwareSampler(episode_data_index, shuffle=False)
assert sampler.indices == [0, 1, 2, 3, 4, 5]
assert len(sampler) == 6
assert list(sampler) == [0, 1, 2, 3, 4, 5]
sampler = EpisodeAwareSampler(episode_data_index["from"], episode_data_index["to"], shuffle=True)
sampler = EpisodeAwareSampler(episode_data_index, shuffle=True)
assert sampler.indices == [0, 1, 2, 3, 4, 5]
assert len(sampler) == 6
assert set(sampler) == {0, 1, 2, 3, 4, 5}
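# A minimal sketch of plugging the sampler into a torch DataLoader, assuming
# EpisodeAwareSampler exposes the __iter__/__len__ protocol exercised above;
# the batch size is arbitrary.
def make_episode_aware_loader(dataset, episode_data_index):
    from torch.utils.data import DataLoader

    sampler = EpisodeAwareSampler(episode_data_index, drop_n_last_frames=1, shuffle=True)
    return DataLoader(dataset, batch_size=8, sampler=sampler)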
+1 -32
View File
@@ -14,20 +14,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from copy import deepcopy
import torch
from datasets import Dataset
from huggingface_hub import DatasetCard
from lerobot.datasets.push_dataset_to_hub.utils import calculate_episode_data_index
from lerobot.datasets.utils import (
create_lerobot_dataset_card,
flatten_dict,
hf_transform_to_torch,
unflatten_dict,
)
from lerobot.datasets.utils import create_lerobot_dataset_card, hf_transform_to_torch
def test_default_parameters():
@@ -61,26 +53,3 @@ def test_calculate_episode_data_index():
episode_data_index = calculate_episode_data_index(dataset)
assert torch.equal(episode_data_index["from"], torch.tensor([0, 2, 3]))
assert torch.equal(episode_data_index["to"], torch.tensor([2, 3, 6]))
def test_flatten_unflatten_dict():
d = {
"obs": {
"min": 0,
"max": 1,
"mean": 2,
"std": 3,
},
"action": {
"min": 4,
"max": 5,
"mean": 6,
"std": 7,
},
}
original_d = deepcopy(d)
d = unflatten_dict(flatten_dict(d))
# test equality between nested dicts
assert json.dumps(original_d, sort_keys=True) == json.dumps(d, sort_keys=True), f"{original_d} != {d}"
+2 -2
View File
@@ -29,8 +29,8 @@ DUMMY_MOTOR_FEATURES = {
},
}
DUMMY_CAMERA_FEATURES = {
"laptop": {"shape": (64, 96, 3), "names": ["height", "width", "channels"], "info": None},
"phone": {"shape": (64, 96, 3), "names": ["height", "width", "channels"], "info": None},
"laptop": {"shape": (480, 640, 3), "names": ["height", "width", "channels"], "info": None},
"phone": {"shape": (480, 640, 3), "names": ["height", "width", "channels"], "info": None},
}
DEFAULT_FPS = 30
DUMMY_VIDEO_INFO = {
+95 -210
View File
@@ -12,7 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import shutil
from functools import partial
from pathlib import Path
from typing import Protocol
@@ -20,25 +19,19 @@ from unittest.mock import patch
import datasets
import numpy as np
import pandas as pd
import PIL.Image
import pytest
import torch
from datasets import Dataset
from lerobot.datasets.lerobot_dataset import CODEBASE_VERSION, LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_FEATURES,
DEFAULT_VIDEO_FILE_SIZE_IN_MB,
DEFAULT_PARQUET_PATH,
DEFAULT_VIDEO_PATH,
flatten_dict,
get_hf_features_from_features,
hf_transform_to_torch,
)
from lerobot.datasets.video_utils import encode_video_frames
from tests.fixtures.constants import (
DEFAULT_FPS,
DUMMY_CAMERA_FEATURES,
@@ -53,10 +46,10 @@ class LeRobotDatasetFactory(Protocol):
def __call__(self, *args, **kwargs) -> LeRobotDataset: ...
def get_task_index(tasks: datasets.Dataset, task: str) -> int:
# TODO(rcadene): a bit complicated no? ^^
task_idx = tasks.loc[task].task_index.item()
return task_idx
def get_task_index(task_dicts: dict, task: str) -> int:
tasks = {d["task_index"]: d["task"] for d in task_dicts.values()}
task_to_task_index = {task: task_idx for task_idx, task in tasks.items()}
return task_to_task_index[task]
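# Example: with task_dicts = {0: {"task_index": 0, "task": "Perform action 0."}}
# (the structure produced by tasks_factory below), get_task_index returns 0 for
# "Perform action 0.".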
@pytest.fixture(scope="session")
@@ -69,49 +62,15 @@ def img_tensor_factory():
@pytest.fixture(scope="session")
def img_array_factory():
def _create_img_array(height=100, width=100, channels=3, dtype=np.uint8, content=None) -> np.ndarray:
if content is None:
# Original random noise behavior
if np.issubdtype(dtype, np.unsignedinteger):
# Int array in [0, 255] range
img_array = np.random.randint(0, 256, size=(height, width, channels), dtype=dtype)
elif np.issubdtype(dtype, np.floating):
# Float array in [0, 1] range
img_array = np.random.rand(height, width, channels).astype(dtype)
else:
raise ValueError(dtype)
def _create_img_array(height=100, width=100, channels=3, dtype=np.uint8) -> np.ndarray:
if np.issubdtype(dtype, np.unsignedinteger):
# Int array in [0, 255] range
img_array = np.random.randint(0, 256, size=(height, width, channels), dtype=dtype)
elif np.issubdtype(dtype, np.floating):
# Float array in [0, 1] range
img_array = np.random.rand(height, width, channels).astype(dtype)
else:
# Create image with text content using OpenCV
import cv2
# Create white background
img_array = np.ones((height, width, channels), dtype=np.uint8) * 255
# Font settings
font = cv2.FONT_HERSHEY_SIMPLEX
font_scale = max(0.5, height / 200) # Scale font with image size
font_color = (0, 0, 0) # Black text
thickness = max(1, int(height / 100))
# Get text size to center it
text_size = cv2.getTextSize(content, font, font_scale, thickness)[0]
text_x = (width - text_size[0]) // 2
text_y = (height + text_size[1]) // 2
# Put text on image
cv2.putText(img_array, content, (text_x, text_y), font, font_scale, font_color, thickness)
# Handle single channel case
if channels == 1:
img_array = cv2.cvtColor(img_array, cv2.COLOR_BGR2GRAY)
img_array = img_array[:, :, np.newaxis]
# Convert to target dtype
if np.issubdtype(dtype, np.floating):
img_array = img_array.astype(dtype) / 255.0
else:
img_array = img_array.astype(dtype)
raise ValueError(dtype)
return img_array
return _create_img_array
@@ -158,10 +117,9 @@ def info_factory(features_factory):
total_frames: int = 0,
total_tasks: int = 0,
total_videos: int = 0,
total_chunks: int = 0,
chunks_size: int = DEFAULT_CHUNK_SIZE,
data_files_size_in_mb: float = DEFAULT_DATA_FILE_SIZE_IN_MB,
video_files_size_in_mb: float = DEFAULT_VIDEO_FILE_SIZE_IN_MB,
data_path: str = DEFAULT_DATA_PATH,
data_path: str = DEFAULT_PARQUET_PATH,
video_path: str = DEFAULT_VIDEO_PATH,
motor_features: dict = DUMMY_MOTOR_FEATURES,
camera_features: dict = DUMMY_CAMERA_FEATURES,
@@ -175,9 +133,8 @@ def info_factory(features_factory):
"total_frames": total_frames,
"total_tasks": total_tasks,
"total_videos": total_videos,
"total_chunks": total_chunks,
"chunks_size": chunks_size,
"data_files_size_in_mb": data_files_size_in_mb,
"video_files_size_in_mb": video_files_size_in_mb,
"fps": fps,
"splits": {},
"data_path": data_path,
@@ -218,26 +175,41 @@ def stats_factory():
return _create_stats
@pytest.fixture(scope="session")
def episodes_stats_factory(stats_factory):
def _create_episodes_stats(
features: dict[str],
total_episodes: int = 3,
) -> dict:
episodes_stats = {}
for episode_index in range(total_episodes):
episodes_stats[episode_index] = {
"episode_index": episode_index,
"stats": stats_factory(features),
}
return episodes_stats
return _create_episodes_stats
@pytest.fixture(scope="session")
def tasks_factory():
def _create_tasks(total_tasks: int = 3) -> pd.DataFrame:
ids = list(range(total_tasks))
tasks = [f"Perform action {i}." for i in ids]
df = pd.DataFrame({"task_index": ids}, index=tasks)
return df
def _create_tasks(total_tasks: int = 3) -> dict:
tasks = {}
for task_index in range(total_tasks):
task_dict = {"task_index": task_index, "task": f"Perform action {task_index}."}
tasks[task_index] = task_dict
return tasks
return _create_tasks
@pytest.fixture(scope="session")
def episodes_factory(tasks_factory, stats_factory):
def episodes_factory(tasks_factory):
def _create_episodes(
features: dict[str],
fps: int = DEFAULT_FPS,
total_episodes: int = 3,
total_frames: int = 400,
video_keys: list[str] | None = None,
tasks: pd.DataFrame | None = None,
tasks: dict | None = None,
multi_task: bool = False,
):
if total_episodes <= 0 or total_frames <= 0:
@@ -245,142 +217,66 @@ def episodes_factory(tasks_factory, stats_factory):
if total_frames < total_episodes:
raise ValueError("total_length must be greater than or equal to num_episodes.")
if tasks is None:
if not tasks:
min_tasks = 2 if multi_task else 1
total_tasks = random.randint(min_tasks, total_episodes)
tasks = tasks_factory(total_tasks)
num_tasks_available = len(tasks)
if total_episodes < num_tasks_available and not multi_task:
if total_episodes < len(tasks) and not multi_task:
raise ValueError("The number of tasks should be less than the number of episodes.")
# Generate random lengths that sum up to total_frames
lengths = np.random.multinomial(total_frames, [1 / total_episodes] * total_episodes).tolist()
# Create empty dictionaries with all keys
d = {
"episode_index": [],
"meta/episodes/chunk_index": [],
"meta/episodes/file_index": [],
"data/chunk_index": [],
"data/file_index": [],
"dataset_from_index": [],
"dataset_to_index": [],
"tasks": [],
"length": [],
}
if video_keys is not None:
for video_key in video_keys:
d[f"videos/{video_key}/chunk_index"] = []
d[f"videos/{video_key}/file_index"] = []
d[f"videos/{video_key}/from_timestamp"] = []
d[f"videos/{video_key}/to_timestamp"] = []
tasks_list = [task_dict["task"] for task_dict in tasks.values()]
num_tasks_available = len(tasks_list)
for stats_key in flatten_dict({"stats": stats_factory(features)}):
d[stats_key] = []
num_frames = 0
remaining_tasks = list(tasks.index)
episodes = {}
remaining_tasks = tasks_list.copy()
for ep_idx in range(total_episodes):
num_tasks_in_episode = random.randint(1, min(3, num_tasks_available)) if multi_task else 1
tasks_to_sample = remaining_tasks if len(remaining_tasks) > 0 else list(tasks.index)
tasks_to_sample = remaining_tasks if remaining_tasks else tasks_list
episode_tasks = random.sample(tasks_to_sample, min(num_tasks_in_episode, len(tasks_to_sample)))
if remaining_tasks:
for task in episode_tasks:
remaining_tasks.remove(task)
d["episode_index"].append(ep_idx)
# TODO(rcadene): remove heuristic of only one file
d["meta/episodes/chunk_index"].append(0)
d["meta/episodes/file_index"].append(0)
d["data/chunk_index"].append(0)
d["data/file_index"].append(0)
d["dataset_from_index"].append(num_frames)
d["dataset_to_index"].append(num_frames + lengths[ep_idx])
d["tasks"].append(episode_tasks)
d["length"].append(lengths[ep_idx])
episodes[ep_idx] = {
"episode_index": ep_idx,
"tasks": episode_tasks,
"length": lengths[ep_idx],
}
if video_keys is not None:
for video_key in video_keys:
d[f"videos/{video_key}/chunk_index"].append(0)
d[f"videos/{video_key}/file_index"].append(0)
d[f"videos/{video_key}/from_timestamp"].append(num_frames / fps)
d[f"videos/{video_key}/to_timestamp"].append((num_frames + lengths[ep_idx]) / fps)
# Add stats columns like "stats/action/max"
for stats_key, stats in flatten_dict({"stats": stats_factory(features)}).items():
d[stats_key].append(stats)
num_frames += lengths[ep_idx]
return Dataset.from_dict(d)
return episodes
return _create_episodes
@pytest.fixture(scope="session")
def create_videos(info_factory, img_array_factory):
def _create_video_directory(
root: Path,
info: dict | None = None,
total_episodes: int = 3,
total_frames: int = 150,
total_tasks: int = 1,
):
if info is None:
info = info_factory(
total_episodes=total_episodes, total_frames=total_frames, total_tasks=total_tasks
)
video_feats = {key: feats for key, feats in info["features"].items() if feats["dtype"] == "video"}
for key, ft in video_feats.items():
# create and save images with identifiable content
tmp_dir = root / "tmp_images"
tmp_dir.mkdir(parents=True, exist_ok=True)
for frame_index in range(info["total_frames"]):
content = f"{key}-{frame_index}"
img = img_array_factory(height=ft["shape"][0], width=ft["shape"][1], content=content)
pil_img = PIL.Image.fromarray(img)
path = tmp_dir / f"frame-{frame_index:06d}.png"
pil_img.save(path)
video_path = root / DEFAULT_VIDEO_PATH.format(video_key=key, chunk_index=0, file_index=0)
video_path.parent.mkdir(parents=True, exist_ok=True)
# Use the global fps from info, not video-specific fps which might not exist
encode_video_frames(tmp_dir, video_path, fps=info["fps"])
shutil.rmtree(tmp_dir)
return _create_video_directory
@pytest.fixture(scope="session")
def hf_dataset_factory(features_factory, tasks_factory, episodes_factory, img_array_factory):
def _create_hf_dataset(
features: dict | None = None,
tasks: pd.DataFrame | None = None,
episodes: datasets.Dataset | None = None,
tasks: list[dict] | None = None,
episodes: list[dict] | None = None,
fps: int = DEFAULT_FPS,
) -> datasets.Dataset:
if tasks is None:
if not tasks:
tasks = tasks_factory()
if features is None:
if not episodes:
episodes = episodes_factory()
if not features:
features = features_factory()
if episodes is None:
episodes = episodes_factory(features, fps)
timestamp_col = np.array([], dtype=np.float32)
frame_index_col = np.array([], dtype=np.int64)
episode_index_col = np.array([], dtype=np.int64)
task_index = np.array([], dtype=np.int64)
for ep_dict in episodes:
for ep_dict in episodes.values():
timestamp_col = np.concatenate((timestamp_col, np.arange(ep_dict["length"]) / fps))
frame_index_col = np.concatenate((frame_index_col, np.arange(ep_dict["length"], dtype=int)))
episode_index_col = np.concatenate(
(episode_index_col, np.full(ep_dict["length"], ep_dict["episode_index"], dtype=int))
)
# Slightly incorrect, but for simplicity, we assign to all frames the first task defined in the episode metadata.
# TODO(rcadene): assign the tasks of the episode per chunks of frames
ep_task_index = get_task_index(tasks, ep_dict["tasks"][0])
task_index = np.concatenate((task_index, np.full(ep_dict["length"], ep_task_index, dtype=int)))
@@ -390,8 +286,8 @@ def hf_dataset_factory(features_factory, tasks_factory, episodes_factory, img_ar
for key, ft in features.items():
if ft["dtype"] == "image":
robot_cols[key] = [
img_array_factory(height=ft["shape"][1], width=ft["shape"][0], content=f"{key}-{i}")
for i in range(len(index_col))
img_array_factory(height=ft["shapes"][1], width=ft["shapes"][0])
for _ in range(len(index_col))
]
elif ft["shape"][0] > 1 and ft["dtype"] != "video":
robot_cols[key] = np.random.random((len(index_col), ft["shape"][0])).astype(ft["dtype"])
@@ -418,6 +314,7 @@ def hf_dataset_factory(features_factory, tasks_factory, episodes_factory, img_ar
def lerobot_dataset_metadata_factory(
info_factory,
stats_factory,
episodes_stats_factory,
tasks_factory,
episodes_factory,
mock_snapshot_download_factory,
@@ -427,29 +324,29 @@ def lerobot_dataset_metadata_factory(
repo_id: str = DUMMY_REPO_ID,
info: dict | None = None,
stats: dict | None = None,
tasks: pd.DataFrame | None = None,
episodes: datasets.Dataset | None = None,
episodes_stats: list[dict] | None = None,
tasks: list[dict] | None = None,
episodes: list[dict] | None = None,
) -> LeRobotDatasetMetadata:
if info is None:
if not info:
info = info_factory()
if stats is None:
if not stats:
stats = stats_factory(features=info["features"])
if tasks is None:
if not episodes_stats:
episodes_stats = episodes_stats_factory(
features=info["features"], total_episodes=info["total_episodes"]
)
if not tasks:
tasks = tasks_factory(total_tasks=info["total_tasks"])
if episodes is None:
video_keys = [key for key, ft in info["features"].items() if ft["dtype"] == "video"]
if not episodes:
episodes = episodes_factory(
features=info["features"],
fps=info["fps"],
total_episodes=info["total_episodes"],
total_frames=info["total_frames"],
video_keys=video_keys,
tasks=tasks,
total_episodes=info["total_episodes"], total_frames=info["total_frames"], tasks=tasks
)
mock_snapshot_download = mock_snapshot_download_factory(
info=info,
stats=stats,
episodes_stats=episodes_stats,
tasks=tasks,
episodes=episodes,
)
@@ -469,6 +366,7 @@ def lerobot_dataset_metadata_factory(
def lerobot_dataset_factory(
info_factory,
stats_factory,
episodes_stats_factory,
tasks_factory,
episodes_factory,
hf_dataset_factory,
@@ -482,63 +380,50 @@ def lerobot_dataset_factory(
total_frames: int = 150,
total_tasks: int = 1,
multi_task: bool = False,
use_videos: bool = True,
info: dict | None = None,
stats: dict | None = None,
tasks: pd.DataFrame | None = None,
episodes_metadata: datasets.Dataset | None = None,
episodes_stats: list[dict] | None = None,
tasks: list[dict] | None = None,
episode_dicts: list[dict] | None = None,
hf_dataset: datasets.Dataset | None = None,
data_files_size_in_mb: float = DEFAULT_DATA_FILE_SIZE_IN_MB,
chunks_size: int = DEFAULT_CHUNK_SIZE,
**kwargs,
) -> LeRobotDataset:
# Instantiate objects
if info is None:
if not info:
info = info_factory(
total_episodes=total_episodes,
total_frames=total_frames,
total_tasks=total_tasks,
use_videos=use_videos,
data_files_size_in_mb=data_files_size_in_mb,
chunks_size=chunks_size,
total_episodes=total_episodes, total_frames=total_frames, total_tasks=total_tasks
)
if stats is None:
if not stats:
stats = stats_factory(features=info["features"])
if tasks is None:
if not episodes_stats:
episodes_stats = episodes_stats_factory(features=info["features"], total_episodes=total_episodes)
if not tasks:
tasks = tasks_factory(total_tasks=info["total_tasks"])
if episodes_metadata is None:
video_keys = [key for key, ft in info["features"].items() if ft["dtype"] == "video"]
episodes_metadata = episodes_factory(
features=info["features"],
fps=info["fps"],
if not episode_dicts:
episode_dicts = episodes_factory(
total_episodes=info["total_episodes"],
total_frames=info["total_frames"],
video_keys=video_keys,
tasks=tasks,
multi_task=multi_task,
)
if hf_dataset is None:
hf_dataset = hf_dataset_factory(
features=info["features"], tasks=tasks, episodes=episodes_metadata, fps=info["fps"]
)
if not hf_dataset:
hf_dataset = hf_dataset_factory(tasks=tasks, episodes=episode_dicts, fps=info["fps"])
# Write data on disk
mock_snapshot_download = mock_snapshot_download_factory(
info=info,
stats=stats,
episodes_stats=episodes_stats,
tasks=tasks,
episodes=episodes_metadata,
episodes=episode_dicts,
hf_dataset=hf_dataset,
data_files_size_in_mb=data_files_size_in_mb,
chunks_size=chunks_size,
)
mock_metadata = lerobot_dataset_metadata_factory(
root=root,
repo_id=repo_id,
info=info,
stats=stats,
episodes_stats=episodes_stats,
tasks=tasks,
episodes=episodes_metadata,
episodes=episode_dicts,
)
with (
patch("lerobot.datasets.lerobot_dataset.LeRobotDatasetMetadata") as mock_metadata_patch,
+59 -150
View File
@@ -11,181 +11,92 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import json
from pathlib import Path
import datasets
import numpy as np
import pandas as pd
import jsonlines
import pyarrow.compute as pc
import pyarrow.parquet as pq
import pytest
from datasets import Dataset
from lerobot.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
get_hf_dataset_size_in_mb,
update_chunk_file_indices,
write_episodes,
write_info,
write_stats,
write_tasks,
EPISODES_PATH,
EPISODES_STATS_PATH,
INFO_PATH,
STATS_PATH,
TASKS_PATH,
)
def write_hf_dataset(
hf_dataset: Dataset,
local_dir: Path,
data_file_size_mb: float | None = None,
chunk_size: int | None = None,
):
"""
Writes a Hugging Face Dataset to one or more Parquet files in a structured directory format.
If the dataset fits within `data_file_size_mb`, it is saved as a single file.
Otherwise, the dataset is split into multiple smaller Parquet files, each not exceeding the size limit.
The file and chunk indices are managed to organize the output files in a hierarchical structure,
e.g., `data/chunk-000/file-000.parquet`, `data/chunk-000/file-001.parquet`, etc.
This function ensures that episodes are not split across multiple files.
Args:
hf_dataset (Dataset): The Hugging Face Dataset to be written to disk.
local_dir (Path): The root directory where the dataset files will be stored.
data_file_size_mb (float, optional): Maximum size of a parquet data file, in MB. Defaults to DEFAULT_DATA_FILE_SIZE_IN_MB.
chunk_size (int, optional): Maximum number of files in a chunk folder before a new one is created. Defaults to DEFAULT_CHUNK_SIZE.
"""
if data_file_size_mb is None:
data_file_size_mb = DEFAULT_DATA_FILE_SIZE_IN_MB
if chunk_size is None:
chunk_size = DEFAULT_CHUNK_SIZE
dataset_size_in_mb = get_hf_dataset_size_in_mb(hf_dataset)
if dataset_size_in_mb <= data_file_size_mb:
# If the dataset is small enough, write it to a single file.
path = local_dir / DEFAULT_DATA_PATH.format(chunk_index=0, file_index=0)
path.parent.mkdir(parents=True, exist_ok=True)
hf_dataset.to_parquet(path)
return
# If the dataset is too large, split it into smaller chunks, keeping episodes whole.
episode_indices = np.array(hf_dataset["episode_index"])
episode_boundaries = np.where(np.diff(episode_indices) != 0)[0] + 1
episode_starts = np.concatenate(([0], episode_boundaries))
episode_ends = np.concatenate((episode_boundaries, [len(hf_dataset)]))
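# e.g. an episode_index column of [0, 0, 0, 1, 1] gives boundaries [3], starts [0, 3], ends [3, 5]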
num_episodes = len(episode_starts)
current_episode_idx = 0
chunk_idx, file_idx = 0, 0
while current_episode_idx < num_episodes:
shard_start_row = episode_starts[current_episode_idx]
shard_end_row = episode_ends[current_episode_idx]
next_episode_to_try_idx = current_episode_idx + 1
while next_episode_to_try_idx < num_episodes:
potential_shard_end_row = episode_ends[next_episode_to_try_idx]
dataset_shard_candidate = hf_dataset.select(range(shard_start_row, potential_shard_end_row))
shard_size_mb = get_hf_dataset_size_in_mb(dataset_shard_candidate)
if shard_size_mb > data_file_size_mb:
break
else:
shard_end_row = potential_shard_end_row
next_episode_to_try_idx += 1
dataset_shard = hf_dataset.select(range(shard_start_row, shard_end_row))
if (
shard_start_row == episode_starts[current_episode_idx]
and shard_end_row == episode_ends[current_episode_idx]
):
shard_size_mb = get_hf_dataset_size_in_mb(dataset_shard)
if shard_size_mb > data_file_size_mb:
logging.warning(
f"Episode with index {hf_dataset[shard_start_row.item()]['episode_index']} has size {shard_size_mb:.2f}MB, "
f"which is larger than data_file_size_mb ({data_file_size_mb}MB). "
"Writing it to a separate shard anyway to preserve episode integrity."
)
# Define the path for the current shard and ensure the directory exists.
path = local_dir / DEFAULT_DATA_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
path.parent.mkdir(parents=True, exist_ok=True)
# Write the shard to a Parquet file.
dataset_shard.to_parquet(path)
# Update chunk and file indices for the next iteration.
chunk_idx, file_idx = update_chunk_file_indices(chunk_idx, file_idx, chunk_size)
current_episode_idx = next_episode_to_try_idx
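For reference, a minimal sketch of exercising this helper on a toy dataset. The feature names, the size limit, and the output path are illustrative; `write_hf_dataset` and the lerobot utils imported above are assumed to be in scope:
```
from pathlib import Path

from datasets import Dataset

# Two episodes (3 and 2 frames); episode_index must be grouped so that
# np.diff can locate episode boundaries.
toy = Dataset.from_dict(
    {
        "episode_index": [0, 0, 0, 1, 1],
        "frame_index": [0, 1, 2, 0, 1],
        "action": [[0.0], [0.1], [0.2], [0.3], [0.4]],
    }
)

# With a deliberately tiny size limit, each episode becomes its own shard
# (with a size warning, since even a single episode exceeds the limit):
# data/chunk-000/file-000.parquet and data/chunk-000/file-001.parquet.
write_hf_dataset(toy, Path("/tmp/toy_dataset"), data_file_size_mb=1e-6, chunk_size=1000)
```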
@pytest.fixture(scope="session")
def create_info(info_factory):
def _create_info(dir: Path, info: dict | None = None):
if info is None:
def info_path(info_factory):
def _create_info_json_file(dir: Path, info: dict | None = None) -> Path:
if not info:
info = info_factory()
write_info(info, dir)
fpath = dir / INFO_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with open(fpath, "w") as f:
json.dump(info, f, indent=4, ensure_ascii=False)
return fpath
return _create_info
return _create_info_json_file
@pytest.fixture(scope="session")
def create_stats(stats_factory):
def _create_stats(dir: Path, stats: dict | None = None):
if stats is None:
def stats_path(stats_factory):
def _create_stats_json_file(dir: Path, stats: dict | None = None) -> Path:
if not stats:
stats = stats_factory()
write_stats(stats, dir)
fpath = dir / STATS_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with open(fpath, "w") as f:
json.dump(stats, f, indent=4, ensure_ascii=False)
return fpath
return _create_stats
# @pytest.fixture(scope="session")
# def create_episodes_stats(episodes_stats_factory):
# def _create_episodes_stats(dir: Path, episodes_stats: Dataset | None = None):
# if episodes_stats is None:
# episodes_stats = episodes_stats_factory()
# write_episodes_stats(episodes_stats, dir)
# return _create_episodes_stats
return _create_stats_json_file
@pytest.fixture(scope="session")
def create_tasks(tasks_factory):
def _create_tasks(dir: Path, tasks: pd.DataFrame | None = None):
if tasks is None:
def episodes_stats_path(episodes_stats_factory):
def _create_episodes_stats_jsonl_file(dir: Path, episodes_stats: list[dict] | None = None) -> Path:
if not episodes_stats:
episodes_stats = episodes_stats_factory()
fpath = dir / EPISODES_STATS_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with jsonlines.open(fpath, "w") as writer:
writer.write_all(episodes_stats.values())
return fpath
return _create_episodes_stats_jsonl_file
@pytest.fixture(scope="session")
def tasks_path(tasks_factory):
def _create_tasks_jsonl_file(dir: Path, tasks: list | None = None) -> Path:
if not tasks:
tasks = tasks_factory()
write_tasks(tasks, dir)
fpath = dir / TASKS_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with jsonlines.open(fpath, "w") as writer:
writer.write_all(tasks.values())
return fpath
return _create_tasks
return _create_tasks_jsonl_file
@pytest.fixture(scope="session")
def create_episodes(episodes_factory):
def _create_episodes(dir: Path, episodes: datasets.Dataset | None = None):
if episodes is None:
# TODO(rcadene): add features, fps as arguments
def episode_path(episodes_factory):
def _create_episodes_jsonl_file(dir: Path, episodes: list | None = None) -> Path:
if not episodes:
episodes = episodes_factory()
write_episodes(episodes, dir)
fpath = dir / EPISODES_PATH
fpath.parent.mkdir(parents=True, exist_ok=True)
with jsonlines.open(fpath, "w") as writer:
writer.write_all(episodes.values())
return fpath
return _create_episodes
@pytest.fixture(scope="session")
def create_hf_dataset(hf_dataset_factory):
def _create_hf_dataset(
dir: Path,
hf_dataset: datasets.Dataset | None = None,
data_file_size_in_mb: float | None = None,
chunk_size: int | None = None,
):
if hf_dataset is None:
hf_dataset = hf_dataset_factory()
write_hf_dataset(hf_dataset, dir, data_file_size_in_mb, chunk_size)
return _create_hf_dataset
return _create_episodes_jsonl_file
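The jsonl-backed fixtures above all share the same write pattern; a minimal round-trip sketch (the path is hypothetical) shows what a consumer reads back:
```
import jsonlines

# Dicts written via writer.write_all(...) come back as plain dicts, one per line.
with jsonlines.open("meta/episodes.jsonl") as reader:
    episodes = list(reader)
assert {"episode_index", "tasks", "length"} <= episodes[0].keys()
```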
@pytest.fixture(scope="session")
@@ -193,8 +104,7 @@ def single_episode_parquet_path(hf_dataset_factory, info_factory):
def _create_single_episode_parquet(
dir: Path, ep_idx: int = 0, hf_dataset: datasets.Dataset | None = None, info: dict | None = None
) -> Path:
raise NotImplementedError()
if info is None:
if not info:
info = info_factory()
if hf_dataset is None:
hf_dataset = hf_dataset_factory()
@@ -217,8 +127,7 @@ def multi_episode_parquet_path(hf_dataset_factory, info_factory):
def _create_multi_episode_parquet(
dir: Path, hf_dataset: datasets.Dataset | None = None, info: dict | None = None
) -> Path:
raise NotImplementedError()
if info is None:
if not info:
info = info_factory()
if hf_dataset is None:
hf_dataset = hf_dataset_factory()
+62 -76
View File
@@ -14,19 +14,15 @@
from pathlib import Path
import datasets
import pandas as pd
import pytest
from huggingface_hub.utils import filter_repo_objects
from lerobot.datasets.utils import (
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
DEFAULT_EPISODES_PATH,
DEFAULT_TASKS_PATH,
DEFAULT_VIDEO_PATH,
EPISODES_PATH,
EPISODES_STATS_PATH,
INFO_PATH,
STATS_PATH,
TASKS_PATH,
)
from tests.fixtures.constants import LEROBOT_TEST_DIR
@@ -34,16 +30,17 @@ from tests.fixtures.constants import LEROBOT_TEST_DIR
@pytest.fixture(scope="session")
def mock_snapshot_download_factory(
info_factory,
create_info,
info_path,
stats_factory,
create_stats,
stats_path,
episodes_stats_factory,
episodes_stats_path,
tasks_factory,
create_tasks,
tasks_path,
episodes_factory,
create_episodes,
episode_path,
single_episode_parquet_path,
hf_dataset_factory,
create_hf_dataset,
create_videos,
):
"""
This factory allows patching snapshot_download so that, when called, it creates the expected files rather
@@ -53,93 +50,82 @@ def mock_snapshot_download_factory(
def _mock_snapshot_download_func(
info: dict | None = None,
stats: dict | None = None,
tasks: pd.DataFrame | None = None,
episodes: datasets.Dataset | None = None,
episodes_stats: list[dict] | None = None,
tasks: list[dict] | None = None,
episodes: list[dict] | None = None,
hf_dataset: datasets.Dataset | None = None,
data_files_size_in_mb: float = DEFAULT_DATA_FILE_SIZE_IN_MB,
chunks_size: int = DEFAULT_CHUNK_SIZE,
):
if info is None:
info = info_factory(data_files_size_in_mb=data_files_size_in_mb, chunks_size=chunks_size)
if stats is None:
if not info:
info = info_factory()
if not stats:
stats = stats_factory(features=info["features"])
if tasks is None:
tasks = tasks_factory(total_tasks=info["total_tasks"])
if episodes is None:
episodes = episodes_factory(
features=info["features"],
fps=info["fps"],
total_episodes=info["total_episodes"],
total_frames=info["total_frames"],
tasks=tasks,
if not episodes_stats:
episodes_stats = episodes_stats_factory(
features=info["features"], total_episodes=info["total_episodes"]
)
if hf_dataset is None:
if not tasks:
tasks = tasks_factory(total_tasks=info["total_tasks"])
if not episodes:
episodes = episodes_factory(
total_episodes=info["total_episodes"], total_frames=info["total_frames"], tasks=tasks
)
if not hf_dataset:
hf_dataset = hf_dataset_factory(tasks=tasks, episodes=episodes, fps=info["fps"])
def _extract_episode_index_from_path(fpath: str) -> int:
path = Path(fpath)
if path.suffix == ".parquet" and path.stem.startswith("episode_"):
episode_index = int(path.stem[len("episode_") :]) # 'episode_000000' -> 0
return episode_index
else:
return None
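# e.g. 'data/chunk-000/episode_000012.parquet' -> 12, while 'meta/info.json' -> None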
def _mock_snapshot_download(
repo_id: str, # TODO(rcadene): repo_id should be used, no?
repo_id: str,
local_dir: str | Path | None = None,
allow_patterns: str | list[str] | None = None,
ignore_patterns: str | list[str] | None = None,
*args,
**kwargs,
) -> str:
if local_dir is None:
if not local_dir:
local_dir = LEROBOT_TEST_DIR
# List all possible files
all_files = [
INFO_PATH,
STATS_PATH,
# TODO(rcadene): remove naive chunk 0 file 0?
DEFAULT_TASKS_PATH.format(chunk_index=0, file_index=0),
DEFAULT_EPISODES_PATH.format(chunk_index=0, file_index=0),
DEFAULT_DATA_PATH.format(chunk_index=0, file_index=0),
]
all_files = []
meta_files = [INFO_PATH, STATS_PATH, EPISODES_STATS_PATH, TASKS_PATH, EPISODES_PATH]
all_files.extend(meta_files)
video_keys = [key for key, feats in info["features"].items() if feats["dtype"] == "video"]
for key in video_keys:
all_files.append(DEFAULT_VIDEO_PATH.format(video_key=key, chunk_index=0, file_index=0))
data_files = []
for episode_dict in episodes.values():
ep_idx = episode_dict["episode_index"]
ep_chunk = ep_idx // info["chunks_size"]
data_path = info["data_path"].format(episode_chunk=ep_chunk, episode_index=ep_idx)
data_files.append(data_path)
all_files.extend(data_files)
allowed_files = filter_repo_objects(
all_files, allow_patterns=allow_patterns, ignore_patterns=ignore_patterns
)
request_info = False
request_tasks = False
request_episodes = False
request_stats = False
request_data = False
request_videos = False
# Create allowed files
for rel_path in allowed_files:
if rel_path.startswith("meta/info.json"):
request_info = True
elif rel_path.startswith("meta/stats"):
request_stats = True
elif rel_path.startswith("meta/tasks"):
request_tasks = True
elif rel_path.startswith("meta/episodes"):
request_episodes = True
elif rel_path.startswith("data/"):
request_data = True
elif rel_path.startswith("videos/"):
request_videos = True
if rel_path.startswith("data/"):
episode_index = _extract_episode_index_from_path(rel_path)
if episode_index is not None:
_ = single_episode_parquet_path(local_dir, episode_index, hf_dataset, info)
if rel_path == INFO_PATH:
_ = info_path(local_dir, info)
elif rel_path == STATS_PATH:
_ = stats_path(local_dir, stats)
elif rel_path == EPISODES_STATS_PATH:
_ = episodes_stats_path(local_dir, episodes_stats)
elif rel_path == TASKS_PATH:
_ = tasks_path(local_dir, tasks)
elif rel_path == EPISODES_PATH:
_ = episode_path(local_dir, episodes)
else:
raise ValueError(f"{rel_path} not supported.")
if request_info:
create_info(local_dir, info)
if request_stats:
create_stats(local_dir, stats)
if request_tasks:
create_tasks(local_dir, tasks)
if request_episodes:
create_episodes(local_dir, episodes)
if request_data:
create_hf_dataset(local_dir, hf_dataset, data_files_size_in_mb, chunks_size)
if request_videos:
create_videos(root=local_dir, info=info)
pass
return str(local_dir)
return _mock_snapshot_download
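A minimal sketch of consuming this factory; in the suite it is typically used as a `side_effect` for `patch("lerobot.datasets.lerobot_dataset.snapshot_download")`, and the `meta/info.json` assertion assumes `INFO_PATH` keeps its usual value:
```
from pathlib import Path

def test_mock_download_writes_local_files(tmp_path, mock_snapshot_download_factory):
    mock_download = mock_snapshot_download_factory()
    # Calling the mock writes the fixture files locally instead of hitting the Hub.
    local_dir = mock_download("dummy/repo", local_dir=tmp_path)
    assert (Path(local_dir) / "meta" / "info.json").is_file()
```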
+2 -5
View File
@@ -71,11 +71,7 @@ def dummy_dataset_metadata(lerobot_dataset_metadata_factory, info_factory, tmp_p
},
}
info = info_factory(
total_episodes=1,
total_frames=1,
total_tasks=1,
camera_features=camera_features,
motor_features=motor_features,
total_episodes=1, total_frames=1, camera_features=camera_features, motor_features=motor_features
)
ds_meta = lerobot_dataset_metadata_factory(root=tmp_path / "init", info=info)
return ds_meta
@@ -144,6 +140,7 @@ def test_policy(ds_repo_id, env_name, env_kwargs, policy_name, policy_kwargs):
Note: We test various combinations of policy and dataset. The combinations are by no means exhaustive,
and for now we add tests as we see fit.
"""
train_cfg = TrainPipelineConfig(
# TODO(rcadene, aliberts): remove dataset download
dataset=DatasetConfig(repo_id=ds_repo_id, episodes=[0]),
+2 -19
View File
@@ -14,8 +14,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from unittest.mock import patch
from lerobot.calibrate import CalibrateConfig, calibrate
from lerobot.record import DatasetRecordConfig, RecordConfig, record
from lerobot.replay import DatasetReplayConfig, ReplayConfig, replay
@@ -69,14 +67,7 @@ def test_record_and_resume(tmp_path):
assert dataset.meta.total_tasks == 1
cfg.resume = True
# Mock the revision to prevent Hub calls during resume
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(tmp_path / "record")
dataset = record(cfg)
dataset = record(cfg)
assert dataset.meta.total_episodes == dataset.num_episodes == 2
assert dataset.meta.total_frames == dataset.num_frames == 6
@@ -112,12 +103,4 @@ def test_record_and_replay(tmp_path):
)
record(record_cfg)
# Mock the revision to prevent Hub calls during replay
with (
patch("lerobot.datasets.lerobot_dataset.get_safe_version") as mock_get_safe_version,
patch("lerobot.datasets.lerobot_dataset.snapshot_download") as mock_snapshot_download,
):
mock_get_safe_version.return_value = "v3.0"
mock_snapshot_download.return_value = str(tmp_path / "record_and_replay")
replay(replay_cfg)
replay(replay_cfg)
+1 -1
View File
@@ -384,7 +384,7 @@ def test_to_lerobot_dataset(tmp_path):
elif feature == "next.done":
assert torch.equal(value, buffer.dones[i])
elif feature == "observation.image":
# Tensor -> numpy is not precise, so we have some diff there
# Tensor -> numpy is not precise, so we have some diff there
# TODO: Check and fix it
torch.testing.assert_close(value, buffer.states["observation.image"][i], rtol=0.3, atol=0.003)
elif feature == "observation.state":