[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci
pre-commit-ci[bot]
2025-07-17 13:56:44 +00:00
committed by Michel Aractingi
parent e05d22cb7b
commit 788dde3a34
5 changed files with 24 additions and 9 deletions
+2 -2
@@ -218,7 +218,7 @@ Under the hood, the `LeRobotDataset` format makes use of several ways to seriali
 Here are the important details and internal structure organization of a typical `LeRobotDataset` instantiated with `dataset = LeRobotDataset("lerobot/aloha_static_coffee")`. The exact features will change from dataset to dataset but not the main aspects:
-```
+````
 dataset attributes:
 ├ hf_dataset: a Hugging Face dataset (backed by Arrow/parquet). Typical features example:
 │ ├ observation.images.cam_high (VideoFrame):
@@ -278,7 +278,7 @@ python -m lerobot.scripts.eval \
 --eval.n_episodes=10 \
 --policy.use_amp=false \
 --policy.device=cuda
-```
+````
 Note: After training your own policy, you can re-evaluate the checkpoints with:
+16 -2
@@ -7,6 +7,7 @@ This tutorial explains how to port large-scale robotic datasets to the LeRobot D
 Dataset v3.0 fundamentally changes how data is organized and stored:
 **v2.1 Structure (Episode-based)**:
 ```
 dataset/
 ├── data/chunk-000/episode_000000.parquet
@@ -16,6 +17,7 @@ dataset/
 ```
 **v3.0 Structure (File-based)**:
 ```
 dataset/
 ├── data/chunk-000/file-000.parquet # Multiple episodes per file
@@ -30,16 +32,19 @@ This transition from individual episode files to file-based chunks dramatically
 Dataset v3.0 introduces significant improvements for handling large datasets:
 ### 🏗️ **Enhanced File Organization**
 - **File-based structure**: Episodes are now grouped into chunked files rather than individual episode files
 - **Configurable file sizes**: for data and video files
 - **Improved storage efficiency**: Better compression and reduced overhead
 ### 📊 **Modern Metadata Management**
 - **Parquet-based metadata**: Replaced JSON Lines with efficient parquet format
 - **Structured episode access**: Direct pandas DataFrame access via `dataset.meta.episodes`
 - **Per-episode statistics**: Enhanced statistics tracking at episode level
 ### 🚀 **Performance Enhancements**
 - **Memory-mapped access**: Improved RAM usage through PyArrow memory mapping
 - **Faster loading**: Significantly reduced dataset initialization time
 - **Better scalability**: Designed for datasets with millions of episodes
@@ -56,6 +61,7 @@ Before porting large datasets, ensure you have:
 ## Understanding the DROID Dataset
 [DROID 1.0.1](https://droid-dataset.github.io/droid/the-droid-dataset) is an excellent example of a large-scale robotic dataset:
 - **Size**: 1.7TB (RLDS format), 8.7TB (raw data)
 - **Structure**: 2048 pre-defined TensorFlow dataset shards
 - **Content**: 76,000+ robot manipulation trajectories from Franka Emika Panda robots
@@ -64,6 +70,7 @@ Before porting large datasets, ensure you have:
 - **Hosting**: Google Cloud Storage with public access via `gsutil`
 The dataset contains diverse manipulation demonstrations with:
 - Multiple camera views (wrist camera, exterior cameras)
 - Natural language task descriptions
 - Robot proprioceptive state and actions
@@ -109,6 +116,7 @@ DROID_FEATURES = {
 ### Step 1: Install Dependencies
 For DROID specifically:
 ```bash
 pip install tensorflow
 pip install tensorflow_datasets
@@ -133,6 +141,7 @@ gsutil -m cp -r gs://gresearch/robotics/droid_100 /your/data/
 > [!WARNING]
 > Large datasets require substantial time and storage:
+>
 > - **Full DROID (1.7TB)**: Several days to download depending on bandwidth
 > - **Processing time**: 7+ days for local porting of full dataset
 > - **Upload time**: 3+ days to push to Hugging Face Hub
@@ -150,6 +159,7 @@ python examples/port_datasets/droid_rlds/port_droid.py \
 ### Development and Testing
 For development, you can port a single shard:
 ```bash
 python examples/port_datasets/droid_rlds/port_droid.py \
 --raw-dir /your/data/droid/1.0.1 \
@@ -173,6 +183,7 @@ pip install datatrove # Hugging Face's distributed processing library
 ### Step 2: Configure Your SLURM Environment
 Find your partition information:
 ```bash
 sinfo --format="%R" # List available partitions
 sinfo -N -p your_partition -h -o "%N cpus=%c mem=%m" # Check resources
@@ -206,21 +217,25 @@ python examples/port_datasets/droid_rlds/slurm_port_shards.py \
 ### Step 4: Monitor Progress
 Check running jobs:
 ```bash
 squeue -u $USER
 ```
 Monitor overall progress:
 ```bash
 jobs_status /your/logs
 ```
 Inspect individual job logs:
 ```bash
 less /your/logs/port_droid/slurm_jobs/JOB_ID_WORKER_ID.out
 ```
 Debug failed jobs:
 ```bash
 failed_logs /your/logs/port_droid
 ```
@@ -280,8 +295,6 @@ dataset/
 This replaces the old episode-per-file structure with efficient, optimally-sized chunks.
 ## Migrating from Dataset v2.1
 If you have existing datasets in v2.1 format, use the migration tool:
@@ -292,6 +305,7 @@ python src/lerobot/datasets/v30/convert_dataset_v21_to_v30.py \
 ```
 This automatically:
 - Converts file structure to v3.0 format
 - Migrates metadata from JSON Lines to parquet
 - Aggregates statistics and creates per-episode stats
+1 -1
@@ -18,7 +18,7 @@ import logging
 import shutil
 import tempfile
 from pathlib import Path
-from typing import Callable
+from collections.abc import Callable
 import datasets
 import numpy as np
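Reviewer note: this import swap should not change behavior on Python 3.9+, where `collections.abc.Callable` can be subscripted in annotations in the same way as the deprecated `typing.Callable` alias. A minimal sketch, in which the `apply_twice` helper is hypothetical and not part of the repository:

```python
from collections.abc import Callable


def apply_twice(fn: Callable[[int], int], value: int) -> int:
    """Apply a one-argument integer function twice (illustrative helper)."""
    return fn(fn(value))


# The annotation is subscripted exactly as typing.Callable would be.
result = apply_twice(lambda x: x + 1, 0)

# The ABC also supports runtime checks.
is_callable = isinstance(apply_twice, Callable)
```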
+3 -2
@@ -13,7 +13,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import Iterator, Union
+from typing import Union
+from collections.abc import Iterator
 import torch
@@ -23,7 +24,7 @@ class EpisodeAwareSampler:
         self,
         dataset_from_indices: list[int],
         dataset_to_indices: list[int],
-        episode_indices_to_use: Union[list, None] = None,
+        episode_indices_to_use: list | None = None,
         drop_n_first_frames: int = 0,
         drop_n_last_frames: int = 0,
         shuffle: bool = False,
+2 -2
@@ -345,7 +345,7 @@ def get_audio_info(video_path: Path | str) -> dict:
"json", "json",
         str(video_path),
     ]
-    result = subprocess.run(ffprobe_audio_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
+    result = subprocess.run(ffprobe_audio_cmd, capture_output=True, text=True)
     if result.returncode != 0:
         raise RuntimeError(f"Error running ffprobe: {result.stderr}")
@@ -381,7 +381,7 @@ def get_video_info(video_path: Path | str) -> dict:
"json", "json",
         str(video_path),
     ]
-    result = subprocess.run(ffprobe_video_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
+    result = subprocess.run(ffprobe_video_cmd, capture_output=True, text=True)
     if result.returncode != 0:
         raise RuntimeError(f"Error running ffprobe: {result.stderr}")
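Reviewer note: `capture_output=True` (available since Python 3.7) is shorthand for passing `stdout=subprocess.PIPE` and `stderr=subprocess.PIPE`, so both `ffprobe` call sites should behave as before. A small sketch of the equivalence, using the current Python interpreter as a stand-in for `ffprobe`, which may not be installed:

```python
import subprocess
import sys

# Stand-in command that writes to both streams; the real code invokes ffprobe.
cmd = [sys.executable, "-c", "import sys; print('out'); print('err', file=sys.stderr)"]

old_style = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
new_style = subprocess.run(cmd, capture_output=True, text=True)

# Both forms capture the same return code, stdout, and stderr.
same_result = (
    old_style.returncode == new_style.returncode
    and old_style.stdout == new_style.stdout
    and old_style.stderr == new_style.stderr
)
```

The two options cannot be combined: passing `capture_output=True` together with `stdout` or `stderr` raises `ValueError`.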