Compare commits

1 Commits

Author SHA1 Message Date
CarolinePascal dbb32ead5f fix(video benchmark)
* fixing typos on PyAV decoders names
* adding torchcodec among video backends
* updating images datasets to v3.0
2025-09-12 17:03:36 +02:00
4 changed files with 12 additions and 167 deletions
+9 -21
@@ -37,14 +37,14 @@ from tqdm import tqdm
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.datasets.video_utils import (
decode_video_frames_torchvision,
decode_video_frames,
encode_video_frames,
)
from lerobot.utils.benchmark import TimeBenchmark
BASE_ENCODING = OrderedDict(
[
("vcodec", "libx264"),
("vcodec", "h264"),
("pix_fmt", "yuv444p"),
("g", 2),
("crf", None),
@@ -147,18 +147,6 @@ def sample_timestamps(timestamps_mode: str, ep_num_images: int, fps: int) -> lis
return [idx / fps for idx in frame_indexes]
def decode_video_frames(
video_path: str,
timestamps: list[float],
tolerance_s: float,
backend: str,
) -> torch.Tensor:
if backend in ["pyav", "video_reader"]:
return decode_video_frames_torchvision(video_path, timestamps, tolerance_s, backend)
else:
raise NotImplementedError(backend)
def benchmark_decoding(
imgs_dir: Path,
video_path: Path,
@@ -406,9 +394,9 @@ if __name__ == "__main__":
nargs="*",
default=[
"lerobot/pusht_image",
"aliberts/aloha_mobile_shrimp_image",
"aliberts/paris_street",
"aliberts/kitchen",
"CarolinePascal/aloha_mobile_shrimp_image",
"CarolinePascal/paris_street",
"CarolinePascal/kitchen",
],
help="Datasets repo-ids to test against. First episodes only are used. Must be images.",
)
@@ -416,7 +404,7 @@ if __name__ == "__main__":
"--vcodec",
type=str,
nargs="*",
default=["libx264", "hevc", "libsvtav1"],
default=["h264", "hevc", "libsvtav1"],
help="Video codecs to be tested",
)
parser.add_argument(
@@ -446,7 +434,7 @@ if __name__ == "__main__":
# nargs="*",
# default=[0, 1],
# help="Use the fastdecode tuning option. 0 disables it. "
# "For libx264 and libx265/hevc, only 1 is possible. "
# "For h264 and h265/hevc, only 1 is possible. "
# "For libsvtav1, 1, 2 or 3 are possible values with a higher number meaning a faster decoding optimization",
# )
parser.add_argument(
@@ -465,8 +453,8 @@ if __name__ == "__main__":
"--backends",
type=str,
nargs="*",
default=["pyav", "video_reader"],
help="Torchvision decoding backend to be tested.",
default=["torchcodec", "pyav", "video_reader"],
help="Video decoding backend to be tested.",
)
parser.add_argument(
"--num-samples",
-4
@@ -50,7 +50,3 @@
- local: backwardcomp
title: Backward compatibility
title: "About"
- sections:
- local: datasets
title: "The LeRobotDataset Format"
-title: "Datasets"
-140
@@ -1,140 +0,0 @@
# The LeRobotDataset Format
`LeRobotDataset` is a standardized dataset format designed to address the specific needs of robot learning research.
To this end, it provides unified and convenient access to robotics data across modalities, including sensorimotor readings, multiple camera feeds, and teleoperation status.
`LeRobotDataset` also stores general information about the collected data, such as the task performed by the teleoperator, the kind of robot used, and measurement details like the frames per second at which the image and robot-state streams are recorded.
Therefore, `LeRobotDataset` provides a unified interface for handling multi-modal, time-series data, and it integrates seamlessly with the PyTorch and Hugging Face ecosystems.
`LeRobotDataset` is designed to be easily extensible and customizable by users, and it already supports openly available data coming from a variety of embodiments, ranging from manipulator platforms like the SO-100 and ALOHA-2, to real-world humanoid data, simulation datasets and self-driving car datasets.
This dataset format is built to be both efficient for training and flexible enough to accommodate the diverse data types encountered in robotics, while promoting reproducibility and ease of use for users.
## The Format's Design
A core design choice behind `LeRobotDataset` is separating the underlying data storage from the user-facing API.
This allows for efficient serialization and storage while presenting the data in an intuitive, ready-to-use format.
A dataset is always organized into three main components:
1. **Tabular Data**: Low-dimensional, high-frequency data such as joint states and actions is stored in efficient [Apache Parquet](https://parquet.apache.org/) files and typically offloaded to the more mature `datasets` library, providing fast, memory-mapped access.
2. **Visual Data**: To handle large volumes of camera data, frames are concatenated and encoded into MP4 files. Frames from the same episode are always grouped together into the same video, and multiple videos are grouped together by camera. To reduce stress on the file system, groups of videos for the same camera view are also broken into multiple sub-directories once a given threshold number is reached.
3. **Metadata**: A collection of JSON files which describes the dataset's structure in terms of its metadata, serving as the relational counterpart to both the tabular and visual dimensions of data. Metadata include the different feature schemas, frame rates, normalization statistics, and episode boundaries.
For scalability, and to support datasets with potentially millions of trajectories resulting in hundreds of millions or billions of individual camera frames, we merge data from different episodes into the same high-level structure.
Concretely, this means that any given tabular collection and video will not typically contain information about one episode only, but rather a concatenation of the information available in multiple episodes.
This keeps the pressure on the file system manageable, both locally and on remote storage providers like Hugging Face, at the expense of relying more heavily on the metadata, which is used, for example, to reconstruct where a given episode starts or ends.
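
The bookkeeping this implies can be sketched as follows. This is a minimal illustration under stated assumptions, not LeRobot's actual implementation: `build_episode_starts` and `locate_frame` are hypothetical helpers, and the episode lengths are made-up values standing in for what the episode metadata would record.

```python
from bisect import bisect_right
from itertools import accumulate


def build_episode_starts(episode_lengths: list[int]) -> list[int]:
    """Return the global index of the first frame of each episode."""
    # Exclusive prefix sum: episode i starts where episodes 0..i-1 end.
    return [0, *accumulate(episode_lengths)][:-1]


def locate_frame(global_index: int, episode_starts: list[int]) -> tuple[int, int]:
    """Map a global frame index to (episode_index, frame_index_within_episode).

    Assumes 0 <= global_index < total number of frames.
    """
    if global_index < 0:
        raise IndexError("global_index must be non-negative")
    # Last episode whose first frame is at or before global_index.
    episode_index = bisect_right(episode_starts, global_index) - 1
    return episode_index, global_index - episode_starts[episode_index]


# Hypothetical lengths (in frames) of three episodes stored in one file.
episode_lengths = [120, 80, 200]
episode_starts = build_episode_starts(episode_lengths)
print(locate_frame(130, episode_starts))
```

Storing only per-episode offsets keeps the metadata small while still allowing a binary search from any global frame index back to its episode.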
An example structure for a given `LeRobotDataset` would appear as follows:
```bash
lerobot/svla_so101_pickplace
├── data/
│   └── chunk-000/
│       ├── file_000000.parquet
│       └── ...
├── meta/
│   ├── episodes/
│   │   ├── chunk-000/
│   │   │   └── file_000000.parquet
│   │   └── ...
│   ├── info.json
│   ├── stats.json
│   └── tasks.jsonl
└── videos/
    └── chunk-000/
        ├── observation.images.wrist_camera/
        │   ├── file_000000.mp4
        │   └── ...
        └── ...
```
- **`meta/info.json`**: This is the central metadata file. It contains the complete dataset schema, defining all features (e.g., `observation.state`, `action`), their shapes, and data types. It also stores crucial information like the dataset's frames-per-second (`fps`), codebase version, and the path templates used to locate data and video files.
- **`meta/stats.json`**: This file stores aggregated statistics (mean, std, min, max) for each feature across the entire dataset. These are used for data normalization and are accessible via `dataset.meta.stats`.
- **`meta/tasks.jsonl`**: Contains the mapping from natural language task descriptions to integer task indices, which are used for task-conditioned policy training.
- **`meta/episodes/`**: This directory contains metadata about each individual episode, such as its length, corresponding task, and pointers to where its data is stored. For scalability, this information is stored in chunked Parquet files rather than a single large JSON file.
- **`data/`**: Contains the core frame-by-frame tabular data in Parquet files. To improve performance and handle large datasets, data from **multiple episodes are concatenated into larger files**. These files are organized into chunked subdirectories to keep file sizes manageable. Therefore, a single file typically contains data for more than one episode.
- **`videos/`**: Contains the MP4 video files for all visual observation streams. Similar to the `data/` directory, video footage from **multiple episodes is concatenated into single MP4 files**. This strategy significantly reduces the number of files in the dataset, which is more efficient for modern filesystems. The path structure (`/videos/<camera_key>/<chunk>/file_...mp4`) allows the data loader to locate the correct video file and then seek to the precise timestamp for a given frame.
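
Seeking inside a concatenated video can be sketched as below. This is an illustration only: `episode_start_s` is a hypothetical stand-in for whatever offset the episode metadata records, and both helper names are invented for this example. The tolerance check mirrors the `tolerance_s` argument taken by the decoding functions in the benchmark script, but the logic shown is an assumption about how such a check could work.

```python
def frame_timestamp_in_file(episode_start_s: float, frame_index: int, fps: int) -> float:
    """Timestamp (in seconds) of an episode's frame inside a concatenated video."""
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    # Offset of the episode within the file, plus the frame's time in the episode.
    return episode_start_s + frame_index / fps


def closest_frame(query_s: float, frame_timestamps_s: list[float], tolerance_s: float) -> int:
    """Index of the decoded frame closest to query_s, if within tolerance_s."""
    closest = min(
        range(len(frame_timestamps_s)),
        key=lambda i: abs(frame_timestamps_s[i] - query_s),
    )
    if abs(frame_timestamps_s[closest] - query_s) > tolerance_s:
        raise ValueError(f"No frame within {tolerance_s} s of {query_s} s")
    return closest


fps = 30
# Hypothetical: the episode of interest starts 12 s into the MP4 file.
query_s = frame_timestamp_in_file(episode_start_s=12.0, frame_index=45, fps=fps)
# Timestamps of the frames available in a 20 s file at 30 fps.
file_timestamps_s = [i / fps for i in range(600)]
frame_position = closest_frame(query_s, file_timestamps_s, tolerance_s=1e-4)
print(query_s, frame_position)
```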
## Code Example: Using `LeRobotDataset` with `torch.utils.data.DataLoader`
This section provides an overview of how to access datasets hosted on Hugging Face using the `LeRobotDataset` class.
Every dataset on the Hugging Face Hub containing the three main pillars presented above (Tabular and Visual Data, as well as relational Metadata) can be accessed with a single line.
Most reinforcement learning (RL) and behavioral cloning (BC) algorithms tend to operate on stacks of observations and actions.
For instance, RL algorithms typically use a history of previous observations `[o_{t-H}, ..., o_{t}]` to mitigate partial observability.
BC algorithms are instead typically trained to regress chunks of multiple actions rather than single controls.
To accommodate the specifics of robot learning training, `LeRobotDataset` provides a native windowing operation: using `delta_timestamps`, we can request frames a given number of _seconds_ before and after any given observation.
Unavailable frames are padded appropriately, and a padding mask is returned to indicate which frames are padding.
Notably, this all happens within the `LeRobotDataset` and is entirely transparent to higher-level wrappers such as `torch.utils.data.DataLoader`.
Conveniently, by using `LeRobotDataset` with a PyTorch `DataLoader`, one can automatically collate the individual sample dictionaries from the dataset into a single dictionary of batched tensors.
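
The windowing and padding logic described above can be sketched in plain Python. This is not LeRobot's implementation: `window_indices` is a hypothetical helper, and padding by repeating the nearest in-episode frame is an assumption made for illustration (the text above only states that unavailable frames are padded and that a mask is provided).

```python
def window_indices(
    index: int,
    episode_start: int,
    episode_end: int,
    delta_timestamps: list[float],
    fps: int,
) -> tuple[list[int], list[bool]]:
    """Turn time offsets (seconds) into frame indices plus a padding mask.

    episode_start is inclusive and episode_end is exclusive.
    """
    indices, is_pad = [], []
    for delta_s in delta_timestamps:
        # Convert the offset in seconds into an offset in frames.
        target = index + round(delta_s * fps)
        # Clamp to the episode so a window never leaks into a neighbor episode.
        clamped = min(max(target, episode_start), episode_end - 1)
        indices.append(clamped)
        # A frame counts as padding whenever clamping had to move it.
        is_pad.append(clamped != target)
    return indices, is_pad


# Second frame of a 50-frame episode recorded at 10 fps:
# the frame 0.2 s earlier does not exist, so it is padded.
indices, is_pad = window_indices(
    index=1,
    episode_start=0,
    episode_end=50,
    delta_timestamps=[-0.2, -0.1, 0.0],
    fps=10,
)
print(indices, is_pad)
```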
```python
import torch

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Load from the Hugging Face Hub (will be cached locally)
dataset = LeRobotDataset("lerobot/svla_so101_pickplace")

# Get the frame at index 100 in the dataset
sample = dataset[100]
print(sample)
# The sample is a dictionary of tensors
# {
#     'observation.state': tensor([...]),
#     'action': tensor([...]),
#     'observation.images.wrist_camera': tensor([C, H, W]),
#     'timestamp': tensor(1.234),
#     ...
# }

delta_timestamps = {
    # Frames 0.2 s and 0.1 s *before* each observation, plus the observation itself
    "observation.images.wrist_camera": [-0.2, -0.1, 0.0]
}
dataset = LeRobotDataset(
    "lerobot/svla_so101_pickplace",
    delta_timestamps=delta_timestamps,
)

# Accessing an index now returns a stack of frames for the specified key
sample = dataset[100]
# The image tensor will now have a time dimension
# 'observation.images.wrist_camera' has shape [T, C, H, W], where T=3
print(sample["observation.images.wrist_camera"].shape)

batch_size = 16
# Wrap the dataset in a DataLoader to process it in batches for training
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
)

# Iterate over the DataLoader in a training loop
num_epochs = 1
device = "cuda" if torch.cuda.is_available() else "cpu"
for epoch in range(num_epochs):
    for batch in data_loader:
        # 'batch' is a dictionary where each value is a batch of tensors.
        # For example, batch['action'] will have a shape of [batch_size, action_dim].
        # If using delta_timestamps, a batched image tensor might have a
        # shape of [batch_size, T, C, H, W].
        # Move data to the appropriate device (e.g., GPU)
        observations = batch["observation.state"].to(device)
        actions = batch["action"].to(device)
        images = batch["observation.images.wrist_camera"].to(device)
        # Next do amazing_model.forward(batch)
        ...
```
## Streaming
`LeRobotDataset` now also supports streaming mode.
You can stream data from a large dataset hosted on the Hugging Face Hub by just replacing the dataset definition with:
```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
# Streams frames from the Hugging Face Hub
dataset = StreamingLeRobotDataset("lerobot/svla_so101_pickplace")
```
Streaming datasets support high-performance batch processing (ca. 80-100 it/s, depending on connectivity) and high levels of frame randomization, a key feature for behavioral cloning algorithms that otherwise operate on highly non-i.i.d. data.
+3 -2
@@ -440,8 +440,9 @@ class LeRobotDataset(torch.utils.data.Dataset):
download_videos (bool, optional): Flag to download the videos. Note that when set to True but the
video files are already present on local disk, they won't be downloaded again. Defaults to
True.
video_backend (str | None, optional): Video backend to use for decoding videos. Defaults to torchcodec when available int the platform; otherwise, defaults to 'pyav'.
You can also use the 'pyav' decoder used by Torchvision, which used to be the default option, or 'video_reader' which is another decoder of Torchvision.
video_backend (str | None, optional): Video backend to use for decoding videos. Defaults to 'torchcodec'
when available on the platform; otherwise, defaults to torchvision's default backend : 'pyav'.
You can also use 'video_reader' which is another decoder of torchvision.
batch_encoding_size (int, optional): Number of episodes to accumulate before batch encoding videos.
Set to 1 for immediate encoding (default), or higher for batched encoding. Defaults to 1.
"""
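
The fallback described in the updated docstring can be sketched as follows. This is a minimal illustration, assuming that "available on the platform" means the `torchcodec` package can be imported; `pick_default_video_backend` is a hypothetical name and not necessarily how LeRobot performs the check.

```python
import importlib.util


def pick_default_video_backend() -> str:
    """Return 'torchcodec' when that package is importable, otherwise 'pyav'."""
    # find_spec looks the package up without actually importing it.
    if importlib.util.find_spec("torchcodec") is not None:
        return "torchcodec"
    # Assumption: fall back to torchvision's default backend, as the docstring states.
    return "pyav"


print(pick_default_video_backend())
```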