add support for robomind2lerobot (#26)

Co-authored-by: HaomingSong <haomingsong24@gmail.com>
Author: Qizhi Chen
Date: 2025-05-12 20:16:13 +08:00
Committed by: GitHub
parent 37f856f50b
commit 7af65ba23a
16 changed files with 1445 additions and 1 deletion
README.md +2 -1
@@ -19,6 +19,7 @@ A curated collection of utilities for [LeRobot Projects](https://github.com/hugg
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.05.12\]** We now support data conversion from RoboMIND to LeRobot! 🔥🔥🔥
- **\[2025.04.20\]** We added a dataset version converter from LeRobot v2.0 to LeRobot v2.1! 🔥🔥🔥
- **\[2025.04.15\]** We added a dataset merging tool for combining multi-source LeRobot datasets! 🔥🔥🔥
- **\[2025.04.14\]** We now support data conversion from AgiBotWorld to LeRobot! 🔥🔥🔥
@@ -31,7 +32,7 @@ A curated collection of utilities for [LeRobot Projects](https://github.com/hugg
- [x] [Open X-Embodiment to LeRobot](./openx2lerobot/README.md)
- [x] [AgiBot-World to LeRobot](./agibot2lerobot/README.md)
- [ ] RoboMIND to LeRobot
- [x] [RoboMIND to LeRobot](./robomind2lerobot/README.md)
- [ ] LeRobot to RLDS
- **Version Conversion**:
robomind2lerobot/README.md +271
@@ -0,0 +1,271 @@
# RoboMIND to LeRobot
RoboMIND (Multi-embodiment Intelligence Normative Data for Robot Manipulation) is a dataset containing 107k demonstration trajectories across 479 diverse tasks involving 96 object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robot-related information, including multi-view observations, proprioceptive robot state information, and linguistic task descriptions. (Copied from [docs](https://x-humanoid-robomind.github.io/))
## 🚀 What's New in This Script
In this conversion script, we have made several key improvements:
- **Preservation of RoboMIND's Original Information** 🧠: We have preserved as much of RoboMIND's original information as possible, with field names strictly adhering to the original dataset's naming conventions to ensure compatibility and consistency.
- **State and Action as Dictionaries** 🧾: The traditional one-dimensional state and action vectors have been split into dictionaries, allowing greater flexibility in designing custom states and actions and enabling modular, scalable handling.
Dataset Structure of `meta/info.json`:
```json
{
"codebase_version": "v2.1", // lastest lerobot format
"robot_type": "franka_3rgb", // specific robot type
"fps": 30, // control frequency
"features": {
"observation.images.image_key": {
"dtype": "video",
"shape": [
720,
1280,
3
],
"names": [
"height",
"width",
"rgb"
],
"info": {
"video.height": 720,
"video.width": 1280,
"video.codec": "av1",
"video.pix_fmt": "yuv420p",
"video.is_depth_map": false,
"video.fps": 30,
"video.channels": 3,
"has_audio": false
}
},
        // for more state keys, see configs
"observation.states.end_effector": {
"dtype": "float32",
"shape": [
6
],
"names": {
"motors": [
"x",
"y",
"z",
"r",
"p",
"y"
]
}
},
...
        // for more action keys, see configs
"actions.joint_position": {
"dtype": "float32",
"shape": [
8
],
"names": {
"motors": [
"joint_0",
"joint_1",
"joint_2",
"joint_3",
"joint_4",
"joint_5",
"joint_6",
"gripper"
]
}
},
...
}
}
```
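As a minimal sketch of what the dictionary-style features look like when read back (the `repo_id` and `root` values below are placeholders for one locally converted task, not fixed names):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Placeholder repo_id/root: point these at a single converted task directory.
dataset = LeRobotDataset(
    "franka_1rgb/bread_in_basket",
    root="/path/to/local/benchmark1_0_release/franka_1rgb/bread_in_basket",
)

frame = dataset[0]
# States and actions are keyed dictionaries instead of one flat vector:
ee_state = frame["observation.states.end_effector"]  # shape (6,): x, y, z, r, p, y
action = frame["actions.joint_position"]             # shape (8,): 7 joints + gripper
```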
## Installation
1. Install LeRobot:
Follow instructions in [official repo](https://github.com/huggingface/lerobot?tab=readme-ov-file#installation).
2. Install other dependencies:
We use Ray for parallel conversion, which significantly speeds up data processing by distributing the workload across multiple cores or nodes (if any).
```bash
pip install h5py
pip install -U "ray[default]"
```
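As an optional sanity check (a sketch, not part of the conversion pipeline), you can confirm Ray sees your machine's resources; the conversion script queries the same `ray.available_resources()` API to size its worker pool:

```python
import ray

# Start a local Ray runtime (or connect to an existing cluster).
ray.init()
print(ray.available_resources())  # e.g. {'CPU': 64.0, 'memory': ...}
ray.shutdown()
```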
## Get started
> [!IMPORTANT]
> 1. If you want to save depth images when converting the dataset, modify the `_assert_type_and_shape()` function in [lerobot/common/datasets/compute_stats.py](https://github.com/huggingface/lerobot/blob/main/lerobot/common/datasets/compute_stats.py).
>
> ```python
> def _assert_type_and_shape(stats_list: list[dict[str, dict]]):
> for i in range(len(stats_list)):
> for fkey in stats_list[i]:
> for k, v in stats_list[i][fkey].items():
> if not isinstance(v, np.ndarray):
> raise ValueError(
> f"Stats must be composed of numpy array, but key '{k}' of feature '{fkey}' is of type '{type(v)}' instead."
> )
> if v.ndim == 0:
> raise ValueError("Number of dimensions must be at least 1, and is 0 instead.")
> if k == "count" and v.shape != (1,):
> raise ValueError(f"Shape of 'count' must be (1), but is {v.shape} instead.")
> # bypass depth check
> if "image" in fkey and k != "count":
> if "depth" not in fkey and v.shape != (3, 1, 1):
> raise ValueError(f"Shape of '{k}' must be (3,1,1), but is {v.shape} instead.")
> if "depth" in fkey and v.shape != (1, 1, 1):
> raise ValueError(f"Shape of '{k}' must be (1,1,1), but is {v.shape} instead.")
> ```
>
> 2. Because storage formats differ across platforms, the dataset needs to be organized into the following layout before running the script:
> ```bash
> /path/to/robomind/
> ├── benchmark1_0_release
> │ ├── h5_agilex_3rgb
> │ │ ├── 10_packplate
> │ │ ├── ...
> │ ├── h5_franka_1rgb
> │ │ ├── bread_in_basket
> │ └── ...
> ├── benchmark1_1_release
> │ ├── h5_agilex_3rgb
> │ │ ├── 20_takecorn_2
> │ │ ├── ...
> │ ├── h5_franka_3rgb
> │ │ ├── apples_placed_on_a_ceramic_plate
> │ └── ...
> ├── benchmark1_2_release
> │ ├── h5_franka_3rgb
> │ │ └── 241223_upright_cup
> │ └── h5_sim_franka_3rgb
> │ ├── 408-place_upright_mug_on_the_left_middle
> │ ├── ...
> ├── language_description_annotation_json
> │ ├── h5_agilex_3rgb.json
> │ ├── h5_franka_1rgb.json
> │ ├── h5_franka_3rgb.json
> │ ├── h5_simulation_franka.json
> │ ├── h5_tienkung_xsens.json
> │ └── h5_ur_1rgb.json
> └── RoboMIND_v1_2_instr.csv
> ```
> [!NOTE]
> The conversion speed of this script is limited by the performance of the physical machine running it, including **CPU cores and memory**. We recommend **2 CPU cores per task** for optimal performance; however, each task requires approximately 20 GiB of memory. To avoid running out of memory, you may need to increase the number of CPU cores per task (which reduces how many tasks run concurrently) depending on your system's available memory. For example, a 64-core machine with 2 cores per task runs up to 32 tasks at once, which needs roughly 640 GiB of memory.
### Download the source code:
```bash
git clone https://github.com/Tavish9/any4lerobot.git
```
### Modify paths in `convert.sh`:
There are three benchmarks, each with several embodiments, including `agilex_3rgb`, `franka_1rgb`, `franka_3rgb`, `franka_fr3_dual`, `tienkung_gello_1rgb`, `tienkung_prod1_gello_1rgb`, `tienkung_xsens_1rgb`, `ur_1rgb`.
```bash
python robomind_h5.py \
--src-path /path/to/robomind/ \
--output-path /path/to/local \
--benchmark benchmark1_1_release \
--embodiments agilex_3rgb franka_1rgb \
--cpus-per-task 2
```
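The script also accepts `--save-depth` to convert depth streams and `--debug` to process a single task without launching Ray; see the argument parser in `robomind_h5.py` below.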
### Execute the script:
#### For single node
```bash
cd robomind2lerobot && bash convert.sh
```
#### For multiple nodes
**Direct Access to Nodes (2-node example)**
On Node 1:
```bash
ray start --head --port=6379
```
On Node 2:
```bash
ray start --address='node_1_ip:6379'
```
On either node, check the Ray cluster status and start the script:
```bash
ray status
cd robomind2lerobot && bash convert.sh
```
**Slurm-managed System**
```bash
#!/bin/bash
#SBATCH --job-name=ray-cluster
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --partition=partition
# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# if we detect a space character in the head node IP, we'll
# convert it to an ipv4 address. This step is optional.
if [[ "$head_node_ip" == *" "* ]]; then
IFS=' ' read -ra ADDR <<<"$head_node_ip"
if [[ ${#ADDR[0]} -gt 16 ]]; then
head_node_ip=${ADDR[1]}
else
head_node_ip=${ADDR[0]}
fi
echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
fi
port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"
echo "Starting HEAD at $head_node"
srun --nodes=1 --ntasks=1 -w "$head_node" \
ray start --head \
--node-ip-address="$head_node_ip" \
--port=$port \
--block &
sleep 10
# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))
for ((i = 1; i <= worker_num; i++)); do
node_i=${nodes_array[$i]}
echo "Starting WORKER $i at $node_i"
srun --nodes=1 --ntasks=1 -w "$node_i" \
ray start \
--address "$ip_head" \
--block &
sleep 5
done
sleep 10
cd robomind2lerobot && bash convert.sh
```
**Other Community Supported Cluster Managers**
See the [doc](https://docs.ray.io/en/latest/cluster/vms/user-guides/community/index.html) for more details.
robomind2lerobot/convert.sh +8
@@ -0,0 +1,8 @@
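# Disable HDF5 file locking (needed for parallel readers) and Ray log deduplication.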
export HDF5_USE_FILE_LOCKING=FALSE
export RAY_DEDUP_LOGS=0
python robomind_h5.py \
--src-path /path/to/robomind/ \
--output-path /path/to/local \
--benchmark benchmark1_1_release \
--embodiments agilex_3rgb franka_1rgb franka_3rgb franka_fr3_dual tienkung_gello_1rgb tienkung_prod1_gello_1rgb tienkung_xsens_1rgb ur_1rgb \
--cpus-per-task 2
robomind2lerobot/libtcmalloc.so.4.5.3: Binary file not shown.
robomind2lerobot/robomind_h5.py +415
@@ -0,0 +1,415 @@
import argparse
import gc
import json
import logging
import shutil
from pathlib import Path
import datasets
import numpy as np
import pandas as pd
import ray
import torch
from lerobot.common.datasets.compute_stats import aggregate_stats
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.common.datasets.utils import (
check_timestamps_sync,
get_episode_data_index,
get_hf_features_from_features,
hf_transform_to_torch,
validate_episode_buffer,
validate_frame,
write_episode,
write_episode_stats,
write_info,
)
from lerobot.common.datasets.video_utils import get_safe_default_codec
from lerobot.common.robot_devices.robots.utils import Robot
from ray.runtime_env import RuntimeEnv
from robomind_uitls.configs import ROBOMIND_CONFIG
from robomind_uitls.lerobot_uitls import compute_episode_stats, generate_features_from_config
from robomind_uitls.robomind_uitls import load_local_dataset
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
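# Subclass that additionally records train/val split ranges and per-episode action configs.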
class RoboMINDDatasetMetadata(LeRobotDatasetMetadata):
def save_episode(
self,
split,
episode_index: int,
episode_length: int,
episode_tasks: list[str],
episode_stats: dict[str, dict],
action_config: dict[str, str | dict],
) -> None:
self.info["total_episodes"] += 1
self.info["total_frames"] += episode_length
chunk = self.get_episode_chunk(episode_index)
if chunk >= self.total_chunks:
self.info["total_chunks"] += 1
if split == "train":
self.info["splits"]["train"] = f"0:{self.info['total_episodes']}"
self.train_count = self.info["total_episodes"]
elif "val" in split:
self.info["splits"]["validation"] = f"{self.train_count}:{self.info['total_episodes']}"
self.info["total_videos"] += len(self.video_keys)
if len(self.video_keys) > 0:
self.update_video_info()
write_info(self.info, self.root)
episode_dict = {
"episode_index": episode_index,
"tasks": episode_tasks,
"length": episode_length,
**({"action_config": action_config} if action_config else {}),
}
self.episodes[episode_index] = episode_dict
write_episode(episode_dict, self.root)
self.episodes_stats[episode_index] = episode_stats
self.stats = aggregate_stats([self.stats, episode_stats]) if self.stats else episode_stats
write_episode_stats(episode_index, episode_stats, self.root)
class RoboMINDDataset(LeRobotDataset):
@classmethod
def create(
cls,
repo_id: str,
fps: int,
root: str | Path | None = None,
robot: Robot | None = None,
robot_type: str | None = None,
features: dict | None = None,
use_videos: bool = True,
tolerance_s: float = 1e-4,
image_writer_processes: int = 0,
image_writer_threads: int = 0,
video_backend: str | None = None,
) -> "LeRobotDataset":
"""Create a LeRobot Dataset from scratch in order to record data."""
obj = cls.__new__(cls)
obj.meta = RoboMINDDatasetMetadata.create(
repo_id=repo_id,
fps=fps,
root=root,
robot=robot,
robot_type=robot_type,
features=features,
use_videos=use_videos,
)
obj.repo_id = obj.meta.repo_id
obj.root = obj.meta.root
obj.revision = None
obj.tolerance_s = tolerance_s
obj.image_writer = None
if image_writer_processes or image_writer_threads:
obj.start_image_writer(image_writer_processes, image_writer_threads)
# TODO(aliberts, rcadene, alexander-soare): Merge this with OnlineBuffer/DataBuffer
obj.episode_buffer = obj.create_episode_buffer()
obj.episodes = None
obj.hf_dataset = obj.create_hf_dataset()
obj.image_transforms = None
obj.delta_timestamps = None
obj.delta_indices = None
obj.episode_data_index = None
obj.video_backend = video_backend if video_backend is not None else get_safe_default_codec()
return obj
def create_hf_dataset(self) -> datasets.Dataset:
features = get_hf_features_from_features(self.features)
ft_dict = {col: [] for col in features}
hf_dataset = datasets.Dataset.from_dict(ft_dict, features=features, split="train")
# TODO(aliberts): hf_dataset.set_format("torch")
hf_dataset.set_transform(hf_transform_to_torch)
return hf_dataset
def add_frame(self, frame: dict) -> None:
"""
This function only adds the frame to the episode_buffer. Apart from images — which are written in a
temporary directory — nothing is written to disk. To save those frames, the 'save_episode()' method
then needs to be called.
"""
# Convert torch to numpy if needed
for name in frame:
if isinstance(frame[name], torch.Tensor):
frame[name] = frame[name].numpy()
validate_frame(frame, self.features)
if self.episode_buffer is None:
self.episode_buffer = self.create_episode_buffer()
# Automatically add frame_index and timestamp to episode buffer
frame_index = self.episode_buffer["size"]
timestamp = frame.pop("timestamp") if "timestamp" in frame else frame_index / self.fps
self.episode_buffer["frame_index"].append(frame_index)
self.episode_buffer["timestamp"].append(timestamp)
# Add frame features to episode_buffer
for key, value in frame.items():
if key == "task":
# Note: we associate the task in natural language to its task index during `save_episode`
self.episode_buffer["task"].append(frame["task"])
continue
if key not in self.features:
raise ValueError(
f"An element of the frame is not in the features. '{key}' not in '{self.features.keys()}'."
)
if self.features[key]["dtype"] in ["video"]:
img_path = self._get_image_file_path(
episode_index=self.episode_buffer["episode_index"], image_key=key, frame_index=frame_index
)
if frame_index == 0:
img_path.parent.mkdir(parents=True, exist_ok=True)
self._save_image(value, img_path)
self.episode_buffer[key].append(str(img_path))
else:
self.episode_buffer[key].append(value)
self.episode_buffer["size"] += 1
def save_episode(
self, split, action_config: dict, episode_data: dict | None = None, keep_images: bool = False
) -> None:
"""
This will save to disk the current episode in self.episode_buffer.
Args:
episode_data (dict | None, optional): Dict containing the episode data to save. If None, this will
save the current episode in self.episode_buffer, which is filled with 'add_frame'. Defaults to
None.
"""
if not episode_data:
episode_buffer = self.episode_buffer
validate_episode_buffer(episode_buffer, self.meta.total_episodes, self.features)
# size and task are special cases that won't be added to hf_dataset
episode_length = episode_buffer.pop("size")
tasks = episode_buffer.pop("task")
episode_tasks = list(set(tasks))
episode_index = episode_buffer["episode_index"]
episode_buffer["index"] = np.arange(self.meta.total_frames, self.meta.total_frames + episode_length)
episode_buffer["episode_index"] = np.full((episode_length,), episode_index)
# Add new tasks to the tasks dictionary
for task in episode_tasks:
task_index = self.meta.get_task_index(task)
if task_index is None:
self.meta.add_task(task)
# Given tasks in natural language, find their corresponding task indices
episode_buffer["task_index"] = np.array([self.meta.get_task_index(task) for task in tasks])
for key, ft in self.features.items():
# index, episode_index, task_index are already processed above, and image and video
# are processed separately by storing image path and frame info as meta data
if key in ["index", "episode_index", "task_index"] or ft["dtype"] in ["video"]:
continue
episode_buffer[key] = np.stack(episode_buffer[key]).squeeze()
self._wait_image_writer()
self._save_episode_table(episode_buffer, episode_index)
ep_stats = compute_episode_stats(episode_buffer, self.features)
if len(self.meta.video_keys) > 0:
video_paths = self.encode_episode_videos(episode_index)
for key in self.meta.video_keys:
episode_buffer[key] = video_paths[key]
        # `meta.save_episode` must be executed after encoding the videos
self.meta.save_episode(split, episode_index, episode_length, episode_tasks, ep_stats, action_config)
ep_data_index = get_episode_data_index(self.meta.episodes, [episode_index])
ep_data_index_np = {k: t.numpy() for k, t in ep_data_index.items()}
check_timestamps_sync(
episode_buffer["timestamp"],
episode_buffer["episode_index"],
ep_data_index_np,
self.fps,
self.tolerance_s,
)
if not keep_images:
# delete images
img_dir = self.root / "images"
if img_dir.is_dir():
shutil.rmtree(self.root / "images")
if not episode_data: # Reset the buffer
self.episode_buffer = self.create_episode_buffer()
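# Yield (task_type, split paths, resolved output dir, language instruction) for each task under one embodiment.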
def get_all_tasks(src_path: Path, output_path: Path, embodiment: str):
output_path = output_path / src_path.name / embodiment
src_path = src_path / f"h5_{embodiment}"
if src_path.exists():
df = pd.read_csv(src_path.parent.parent / "RoboMIND_v1_2_instr.csv", index_col=0).drop_duplicates()
instruction_dict = df.set_index("task")["instruction"].to_dict()
for task_type in src_path.iterdir():
yield (
task_type.name,
{"train": task_type / "success_episodes" / "train", "val": task_type / "success_episodes" / "val"},
(output_path / task_type.name).resolve(),
instruction_dict[task_type.name],
)
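# Convert a single RoboMIND task into a LeRobotDataset, episode by episode; executed as a Ray remote task in parallel mode.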
def save_as_lerobot_dataset(task: tuple[dict, Path, str], src_path, benchmark, embodiment, save_depth):
task_type, splits, local_dir, task_instruction = task
config = ROBOMIND_CONFIG[embodiment]
    # HACK: image shapes are not consistent across these tasks...
if "1_0" in benchmark:
match embodiment:
case "tienkung_gello_1rgb":
if task_type in (
"clean_table_2_241211",
"clean_table_3_241210",
"clean_table_3_241211",
"place_paper_cup_dustbin_241212",
"place_plate_table_241211",
"place_plate_table_241211_12",
"place_plate_table_241212",
):
for value in config["images"].values():
value["shape"] = (720, 1280) + (value["shape"][2],)
case "tienkung_xsens_1rgb":
if task_type == "switch_manipulation":
for value in config["images"].values():
value["shape"] = (720, 1280) + (value["shape"][2],)
features = generate_features_from_config(config)
if local_dir.exists():
shutil.rmtree(local_dir)
if not save_depth:
features = dict(filter(lambda item: "depth" not in item[0], features.items()))
dataset: RoboMINDDataset = RoboMINDDataset.create(
repo_id=f"{embodiment}/{local_dir.name}",
root=local_dir,
fps=30,
robot_type=embodiment,
features=features,
)
logging.info(f"start processing for {benchmark}, {embodiment}, {task_type}, saving to {local_dir}")
for split, path in splits.items():
action_config_path = src_path / "language_description_annotation_json" / f"h5_{embodiment}.json"
if action_config_path.exists():
action_config = json.load(open(action_config_path))
action_config = {
Path(config["id"]).parent.name: config["response"]
for config in action_config
if local_dir.name in config["id"] and split in config["id"]
}
else:
action_config = {}
for episode_path in path.glob("**/trajectory.hdf5"):
status, raw_dataset, err = load_local_dataset(episode_path, config, save_depth)
if status and len(raw_dataset) >= 50:
for frame_data in raw_dataset:
frame_data.update({"task": task_instruction})
dataset.add_frame(frame_data)
dataset.save_episode(split, action_config.get(episode_path.parent.parent.name, {}))
logging.info(f"process done for {path}, len {len(raw_dataset)}")
else:
logging.warning(f"Skipped {episode_path}: len of dataset:{len(raw_dataset)} or {str(err)}")
gc.collect()
del dataset
def main(
src_path: Path,
output_path: Path,
benchmark: str,
embodiments: list[str],
cpus_per_task: int,
save_depth: bool,
debug: bool = False,
):
if debug:
tasks = get_all_tasks(src_path / benchmark, output_path, embodiments[0])
save_as_lerobot_dataset(next(tasks), src_path, benchmark, embodiments[0], save_depth)
else:
runtime_env = RuntimeEnv(
env_vars={
"HDF5_USE_FILE_LOCKING": "FALSE",
"HF_DATASETS_DISABLE_PROGRESS_BARS": "TRUE",
"LD_PRELOAD": str(Path(__file__).resolve().parent / "libtcmalloc.so.4.5.3"),
}
)
ray.init(runtime_env=runtime_env)
resources = ray.available_resources()
cpus = int(resources["CPU"])
logging.info(f"Available CPUs: {cpus}, num_cpus_per_task: {cpus_per_task}")
remote_task = ray.remote(save_as_lerobot_dataset).options(num_cpus=cpus_per_task)
futures = []
for embodiment in embodiments:
tasks = get_all_tasks(src_path / benchmark, output_path, embodiment)
for task in tasks:
futures.append((task[1], remote_task.remote(task, src_path, benchmark, embodiment, save_depth)))
for task_path, future in futures:
try:
ray.get(future)
except Exception as e:
logging.error(f"Exception occurred for {task_path['train']}")
with open("output.txt", "a") as f:
f.write(f"{task_path['train']}, exception details: {str(e)}\n")
ray.shutdown()
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--src-path", type=Path, required=True)
parser.add_argument(
"--benchmark",
type=str,
choices=["benchmark1_0_release", "benchmark1_1_release", "benchmark1_2_release"],
default="benchmark1_1_release",
)
parser.add_argument("--output-path", type=Path, required=True)
parser.add_argument(
"--embodiments",
type=str,
nargs="+",
help=str(
[
"agilex_3rgb",
"franka_1rgb",
"franka_3rgb",
"franka_fr3_dual",
"tienkung_gello_1rgb",
"tienkung_prod1_gello_1rgb",
"tienkung_xsens_1rgb",
"ur_1rgb",
]
),
default=["agilex_3rgb"],
)
parser.add_argument("--cpus-per-task", type=int, default=2)
parser.add_argument("--save-depth", action="store_true")
parser.add_argument("--debug", action="store_true")
args = parser.parse_args()
main(**vars(args))
robomind2lerobot/robomind_uitls/configs/__init__.py +21
@@ -0,0 +1,21 @@
from .agilex_3rgb import AgileX_3RGB_Config
from .franka_1rgb import Franka_1RGB_Config
from .franka_3rgb import Franka_3RGB_Config
from .franka_fr3_dual_arm import Franka_Fr3_Dual_Arm_Config
from .tienkung_gello_1rgb import Tien_Kung_Gello_1RGB_Config
from .tienkung_prod1_gello_1rgb import Tien_Kung_Prod1_Gello_1RGB_Config
from .tienkung_xsens_1rgb import Tien_Kung_Xsens_1RGB_Config
from .ur_1rgb import UR_1RGB_Config
ROBOMIND_CONFIG = dict(
agilex_3rgb=AgileX_3RGB_Config,
franka_1rgb=Franka_1RGB_Config,
franka_3rgb=Franka_3RGB_Config,
franka_fr3_dual=Franka_Fr3_Dual_Arm_Config,
sim_franka_3rgb="",
sim_tienkung_1rgb="",
tienkung_gello_1rgb=Tien_Kung_Gello_1RGB_Config,
tienkung_prod1_gello_1rgb=Tien_Kung_Prod1_Gello_1RGB_Config,
tienkung_xsens_1rgb=Tien_Kung_Xsens_1RGB_Config,
ur_1rgb=UR_1RGB_Config,
)
robomind2lerobot/robomind_uitls/configs/agilex_3rgb.py +118
@@ -0,0 +1,118 @@
AgileX_3RGB_Config = {
"images": {
"camera_front": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_left_wrist": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_right_wrist": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_front_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
"camera_left_wrist_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
"camera_right_wrist_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
},
"states": {
"end_effector_left": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["x", "y", "z", "rx", "ry", "rz", "rw"]},
},
"end_effector_right": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["x", "y", "z", "rx", "ry", "rz", "rw"]},
},
"joint_effort_left": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_effort_right": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_position_left": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_position_right": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_velocity_left": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_velocity_right": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
},
"actions": {
"end_effector_left": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["x", "y", "z", "rx", "ry", "rz", "rw"]},
},
"end_effector_right": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["x", "y", "z", "rx", "ry", "rz", "rw"]},
},
"joint_effort_left": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_effort_right": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_position_left": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_position_right": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_velocity_left": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
"joint_velocity_right": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
},
}
robomind2lerobot/robomind_uitls/configs/franka_1rgb.py +37
@@ -0,0 +1,37 @@
Franka_1RGB_Config = {
"images": {
"camera_top": {
"dtype": "video",
"shape": (720, 1280, 3),
"names": ["height", "width", "rgb"],
},
"camera_top_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
},
"states": {
"end_effector": {
"dtype": "float32",
"shape": (6,),
"names": {"motors": ["x", "y", "z", "r", "p", "y"]},
},
"joint_position": {
"dtype": "float32",
"shape": (8,),
"names": {
"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6", "gripper"]
},
},
},
"actions": {
"joint_position": {
"dtype": "float32",
"shape": (8,),
"names": {
"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6", "gripper"]
},
},
},
}
robomind2lerobot/robomind_uitls/configs/franka_3rgb.py +57
@@ -0,0 +1,57 @@
Franka_3RGB_Config = {
"images": {
"camera_top": {
"dtype": "video",
"shape": (720, 1280, 3),
"names": ["height", "width", "rgb"],
},
"camera_left": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_right": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_top_depth": {
"dtype": "image",
"shape": (720, 1280, 1),
"names": ["height", "width", "channel"],
},
"camera_left_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
"camera_right_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
},
"states": {
"end_effector": {
"dtype": "float32",
"shape": (6,),
"names": {"motors": ["x", "y", "z", "r", "p", "y"]},
},
"joint_position": {
"dtype": "float32",
"shape": (8,),
"names": {
"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6", "gripper"]
},
},
},
"actions": {
"joint_position": {
"dtype": "float32",
"shape": (8,),
"names": {
"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "joint_6", "gripper"]
},
},
},
}
robomind2lerobot/robomind_uitls/configs/franka_fr3_dual_arm.py +101
@@ -0,0 +1,101 @@
Franka_Fr3_Dual_Arm_Config = {
"images": {
"camera_front": {
"dtype": "video",
"shape": (720, 1280, 3),
"names": ["height", "width", "rgb"],
},
"camera_top": {
"dtype": "video",
"shape": (720, 1280, 3),
"names": ["height", "width", "rgb"],
},
"camera_left": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_right": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_front_depth": {
"dtype": "image",
"shape": (720, 1280, 1),
"names": ["height", "width", "channel"],
},
"camera_top_depth": {
"dtype": "image",
"shape": (720, 1280, 1),
"names": ["height", "width", "channel"],
},
"camera_left_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
"camera_right_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
},
"states": {
"end_effector": {
"dtype": "float32",
"shape": (12,),
"names": {"motors": ["left_xyzrpy", "right_xyzrpy"]},
},
"joint_position": {
"dtype": "float32",
"shape": (16,),
"names": {
"motors": [
"left_joint_0",
"left_joint_1",
"left_joint_2",
"left_joint_3",
"left_joint_4",
"left_joint_5",
"left_joint_6",
"left_gripper",
"right_joint_0",
"right_joint_1",
"right_joint_2",
"right_joint_3",
"right_joint_4",
"right_joint_5",
"right_joint_6",
"right_gripper",
]
},
},
},
"actions": {
"joint_position": {
"dtype": "float32",
"shape": (16,),
"names": {
"motors": [
"left_joint_0",
"left_joint_1",
"left_joint_2",
"left_joint_3",
"left_joint_4",
"left_joint_5",
"left_joint_6",
"left_gripper",
"right_joint_0",
"right_joint_1",
"right_joint_2",
"right_joint_3",
"right_joint_4",
"right_joint_5",
"right_joint_6",
"right_gripper",
]
},
},
},
}
robomind2lerobot/robomind_uitls/configs/tienkung_gello_1rgb.py +66
@@ -0,0 +1,66 @@
Tien_Kung_Gello_1RGB_Config = {
"images": {
"camera_top": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_top_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
},
"states": {
"joint_position": {
"dtype": "float32",
"shape": (16,),
"names": {
"motors": [
"left_arm_0",
"left_arm_1",
"left_arm_2",
"left_arm_3",
"left_arm_4",
"left_arm_5",
"left_arm_6",
"left hand_closure",
"right_arm_0",
"right_arm_1",
"right_arm_2",
"right_arm_3",
"right_arm_4",
"right_arm_5",
"right_arm_6",
"right hand closure",
]
},
},
},
"actions": {
"joint_position": {
"dtype": "float32",
"shape": (16,),
"names": {
"motors": [
"left_arm_0",
"left_arm_1",
"left_arm_2",
"left_arm_3",
"left_arm_4",
"left_arm_5",
"left_arm_6",
"left hand_closure",
"right_arm_0",
"right_arm_1",
"right_arm_2",
"right_arm_3",
"right_arm_4",
"right_arm_5",
"right_arm_6",
"right hand closure",
]
},
},
},
}
robomind2lerobot/robomind_uitls/configs/tienkung_prod1_gello_1rgb.py +66
@@ -0,0 +1,66 @@
Tien_Kung_Prod1_Gello_1RGB_Config = {
"images": {
"camera_top": {
"dtype": "video",
"shape": (720, 1280, 3),
"names": ["height", "width", "rgb"],
},
"camera_top_depth": {
"dtype": "image",
"shape": (720, 1280, 1),
"names": ["height", "width", "channel"],
},
},
"states": {
"joint_position": {
"dtype": "float32",
"shape": (16,),
"names": {
"motors": [
"left_arm_0",
"left_arm_1",
"left_arm_2",
"left_arm_3",
"left_arm_4",
"left_arm_5",
"left_arm_6",
"left hand_closure",
"right_arm_0",
"right_arm_1",
"right_arm_2",
"right_arm_3",
"right_arm_4",
"right_arm_5",
"right_arm_6",
"right hand closure",
]
},
},
},
"actions": {
"joint_position": {
"dtype": "float32",
"shape": (16,),
"names": {
"motors": [
"left_arm_0",
"left_arm_1",
"left_arm_2",
"left_arm_3",
"left_arm_4",
"left_arm_5",
"left_arm_6",
"left hand_closure",
"right_arm_0",
"right_arm_1",
"right_arm_2",
"right_arm_3",
"right_arm_4",
"right_arm_5",
"right_arm_6",
"right hand closure",
]
},
},
},
}
robomind2lerobot/robomind_uitls/configs/tienkung_xsens_1rgb.py +102
@@ -0,0 +1,102 @@
Tien_Kung_Xsens_1RGB_Config = {
"images": {
"camera_top": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_top_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
},
"states": {
"end_effector": {
"dtype": "float32",
"shape": (12,),
"names": {
"motors": [
"left_little_finger",
"left_ring_finger",
"left_middle_finger",
"left_index_finger",
"left_thumb0_for_bending",
"left_thumb1_for_rotation",
"right_little_finger",
"right_ring_finger",
"right_middle_finger",
"right_index_finger",
"right_thumb0_for_bending",
"right_thumb1_for_rotation",
]
},
},
"joint_position": {
"dtype": "float32",
"shape": (14,),
"names": {
"motors": [
"left_arm_0",
"left_arm_1",
"left_arm_2",
"left_arm_3",
"left_arm_4",
"left_arm_5",
"left_arm_6",
"right_arm_0",
"right_arm_1",
"right_arm_2",
"right_arm_3",
"right_arm_4",
"right_arm_5",
"right_arm_6",
]
},
},
},
"actions": {
"end_effector": {
"dtype": "float32",
"shape": (12,),
"names": {
"motors": [
"left_little_finger",
"left_ring_finger",
"left_middle_finger",
"left_index_finger",
"left_thumb0_for_bending",
"left_thumb1_for_rotation",
"right_little_finger",
"right_ring_finger",
"right_middle_finger",
"right_index_finger",
"right_thumb0_for_bending",
"right_thumb1_for_rotation",
]
},
},
"joint_position": {
"dtype": "float32",
"shape": (14,),
"names": {
"motors": [
"left_arm_0",
"left_arm_1",
"left_arm_2",
"left_arm_3",
"left_arm_4",
"left_arm_5",
"left_arm_6",
"right_arm_0",
"right_arm_1",
"right_arm_2",
"right_arm_3",
"right_arm_4",
"right_arm_5",
"right_arm_6",
]
},
},
},
}
robomind2lerobot/robomind_uitls/configs/ur_1rgb.py +33
@@ -0,0 +1,33 @@
UR_1RGB_Config = {
"images": {
"camera_top": {
"dtype": "video",
"shape": (480, 640, 3),
"names": ["height", "width", "rgb"],
},
"camera_top_depth": {
"dtype": "image",
"shape": (480, 640, 1),
"names": ["height", "width", "channel"],
},
},
"states": {
"end_effector": {
"dtype": "float32",
"shape": (6,),
"names": {"motors": ["x", "y", "z", "r", "p", "y"]},
},
"joint_position": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
},
"actions": {
"joint_position": {
"dtype": "float32",
"shape": (7,),
"names": {"motors": ["joint_0", "joint_1", "joint_2", "joint_3", "joint_4", "joint_5", "gripper"]},
},
},
}
robomind2lerobot/robomind_uitls/lerobot_uitls.py +74
@@ -0,0 +1,74 @@
import numpy as np
import torchvision
from lerobot.common.datasets.compute_stats import auto_downsample_height_width, get_feature_stats, sample_indices
from lerobot.common.datasets.utils import load_image_as_numpy
torchvision.set_video_backend("pyav")
def generate_features_from_config(config):
    features = {}
    for key, value in config["images"].items():
        features[f"observation.images.{key}"] = value
    for key, value in config["states"].items():
        features[f"observation.states.{key}"] = value
    for key, value in config["actions"].items():
        features[f"actions.{key}"] = value
    return features
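# Subsample frames for stats computation: accepts a list of image file paths (RGB) or a raw np.ndarray (depth frames).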
def sample_images(data):
    if isinstance(data, list):
        image_paths = data
        sampled_indices = sample_indices(len(image_paths))
        images = None
        for i, idx in enumerate(sampled_indices):
            path = image_paths[idx]
            img = load_image_as_numpy(path, dtype=np.uint8, channel_first=True)
            img = auto_downsample_height_width(img)
            if images is None:
                images = np.empty((len(sampled_indices), *img.shape), dtype=np.uint8)
            images[i] = img
    elif isinstance(data, np.ndarray):
        frames_array = data[:, None, :, :]  # Shape: [T, 1, H, W]
        sampled_indices = sample_indices(len(frames_array))
        images = None
        for i, idx in enumerate(sampled_indices):
            img = frames_array[idx]
            img = auto_downsample_height_width(img)
            if images is None:
                images = np.empty((len(sampled_indices), *img.shape), dtype=np.uint8)
            images[i] = img
    return images
def compute_episode_stats(episode_data: dict[str, list[str] | np.ndarray], features: dict) -> dict:
ep_stats = {}
for key, data in episode_data.items():
if features[key]["dtype"] == "string":
continue # HACK: we should receive np.arrays of strings
elif features[key]["dtype"] in ["image", "video"]:
ep_ft_array = sample_images(data)
axes_to_reduce = (0, 2, 3) # keep channel dim
keepdims = True
else:
ep_ft_array = data # data is already a np.ndarray
axes_to_reduce = 0 # compute stats over the first axis
keepdims = data.ndim == 1 # keep as np.array
ep_stats[key] = get_feature_stats(ep_ft_array, axis=axes_to_reduce, keepdims=keepdims)
if features[key]["dtype"] in ["image", "video"]:
value_norm = 1.0 if "depth" in key else 255.0
ep_stats[key] = {
k: v if k == "count" else np.squeeze(v / value_norm, axis=0) for k, v in ep_stats[key].items()
}
return ep_stats
robomind2lerobot/robomind_uitls/robomind_uitls.py +74
@@ -0,0 +1,74 @@
from pathlib import Path
import cv2
import h5py
import numpy as np
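# Decode a sequence of camera frames from the HDF5 file: try cv2.imdecode first,
# then fall back to reshaping the raw buffer using known RoboMIND resolutions.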
def decode_images(camera_key, input_images):
if "depth" not in camera_key:
rgb_images = []
camera_rgb_images = input_images
for camera_rgb_image in camera_rgb_images:
camera_rgb_image = np.array(camera_rgb_image)
rgb = cv2.imdecode(camera_rgb_image, cv2.IMREAD_COLOR)
if rgb is None:
rgb = np.frombuffer(camera_rgb_image, dtype=np.uint8)
if rgb.size == 2764800:
rgb = rgb.reshape(720, 1280, 3)
elif rgb.size == 921600:
rgb = rgb.reshape(480, 640, 3)
rgb_images.append(rgb)
rgb_images = np.asarray(rgb_images)
return rgb_images
else:
depth_images = []
camera_depth_images = input_images
for camera_depth_image in camera_depth_images:
if isinstance(camera_depth_image, np.ndarray):
depth_array = camera_depth_image
else:
depth_array = np.frombuffer(camera_depth_image, dtype=np.uint8)
depth = cv2.imdecode(depth_array, cv2.IMREAD_UNCHANGED)
if depth is None:
if depth_array.size == 921600:
depth = depth_array.reshape(720, 1280)
elif depth_array.size == 307200:
depth = depth_array.reshape(480, 640)
depth_images.append(depth)
depth_images = np.asarray(depth_images)[..., None]
return depth_images
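# Load one episode from trajectory.hdf5 and return (success, list of per-frame dicts, error).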
def load_local_dataset(episode_path: Path, config: dict, save_depth: bool):
try:
images = {}
states = {}
actions = {}
with h5py.File(episode_path, "r") as file:
for key in config["images"]:
if save_depth and "depth" in key:
image_key = f"observations/depth_images/{key[:-6]}"
elif "depth" not in key:
image_key = f"observations/rgb_images/{key}"
else:
continue
images[f"observation.images.{key}"] = decode_images(image_key, file[image_key])
for key in config["states"]:
states[f"observation.states.{key}"] = np.array(file[f"puppet/{key}"], dtype=np.float32)
for key in config["actions"]:
actions[f"actions.{key}"] = np.array(file[f"master/{key}"], dtype=np.float32)
num_frames = len(next(iter(states.values())))
frames = [
{
**{key: value[i] for key, value in images.items() if save_depth or "depth" not in key},
**{key: value[i] for key, value in states.items()},
**{key: value[i] for key, value in actions.items()},
}
for i in range(num_frames)
]
return True, frames, ""
except (FileNotFoundError, OSError, KeyError) as e:
return False, [], e