From 04125492e47ec264f1147676aa0d17347ddfaa08 Mon Sep 17 00:00:00 2001 From: Steven Palma Date: Tue, 12 May 2026 16:59:11 +0200 Subject: [PATCH] fix(datasets): expand torchcodec platform coverage + rewrite pyav fallback for torchvision >0.26 (#3588) * fix(deps): better versioning control for torchcodec * refactor(video_utils): replace torchvision with pyav * adding Torchcodec version to lerobot-info * chore(benchmarks): delete video benchmark --------- Co-authored-by: Maximellerbach --- benchmarks/video/README.md | 288 -------------- benchmarks/video/run_video_benchmark.py | 488 ------------------------ pyproject.toml | 13 +- src/lerobot/datasets/video_utils.py | 113 +++--- src/lerobot/scripts/lerobot_info.py | 1 + uv.lock | 64 ++-- 6 files changed, 112 insertions(+), 855 deletions(-) delete mode 100644 benchmarks/video/README.md delete mode 100644 benchmarks/video/run_video_benchmark.py diff --git a/benchmarks/video/README.md b/benchmarks/video/README.md deleted file mode 100644 index 1feee69c4..000000000 --- a/benchmarks/video/README.md +++ /dev/null @@ -1,288 +0,0 @@ -# Video benchmark - -## Questions - -What is the optimal trade-off between: - -- maximizing loading time with random access, -- minimizing memory space on disk, -- maximizing success rate of policies, -- compatibility across devices/platforms for decoding videos (e.g. video players, web browsers). - -How to encode videos? - -- Which video codec (`-vcodec`) to use? h264, h265, AV1? -- What pixel format to use (`-pix_fmt`)? `yuv444p` or `yuv420p`? -- How much compression (`-crf`)? No compression with `0`, intermediate compression with `25` or extreme with `50+`? -- Which frequency to chose for key frames (`-g`)? A key frame every `10` frames? - -How to decode videos? - -- Which `decoder`? `torchvision`, `torchaudio`, `ffmpegio`, `decord`, or `nvc`? -- What scenarios to use for the requesting timestamps during benchmark? 
(`timestamps_mode`) - -## Variables - -**Image content & size** -We don't expect the same optimal settings for a dataset of images from a simulation, or from real-world in an apartment, or in a factory, or outdoor, or with lots of moving objects in the scene, etc. Similarly, loading times might not vary linearly with the image size (resolution). -For these reasons, we run this benchmark on four representative datasets: - -- `lerobot/pusht_image`: (96 x 96 pixels) simulation with simple geometric shapes, fixed camera. -- `lerobot/aloha_mobile_shrimp_image`: (480 x 640 pixels) real-world indoor, moving camera. -- `lerobot/paris_street`: (720 x 1280 pixels) real-world outdoor, moving camera. -- `lerobot/kitchen`: (1080 x 1920 pixels) real-world indoor, fixed camera. - -Note: The datasets used for this benchmark need to be image datasets, not video datasets. - -**Data augmentations** -We might revisit this benchmark and find better settings if we train our policies with various data augmentations to make them more robust (e.g. robust to color changes, compression, etc.). - -### Encoding parameters - -| parameter | values | -| ----------- | ------------------------------------------------------------ | -| **vcodec** | `libx264`, `libx265`, `libsvtav1` | -| **pix_fmt** | `yuv444p`, `yuv420p` | -| **g** | `1`, `2`, `3`, `4`, `5`, `6`, `10`, `15`, `20`, `40`, `None` | -| **crf** | `0`, `5`, `10`, `15`, `20`, `25`, `30`, `40`, `50`, `None` | - -Note that `crf` value might be interpreted differently by various video codecs. In other words, the same value used with one codec doesn't necessarily translate into the same compression level with another codec. In fact, the default value (`None`) isn't the same amongst the different video codecs. Importantly, it is also the case for many other ffmpeg arguments like `g` which specifies the frequency of the key frames. 
- -For a comprehensive list and documentation of these parameters, see the ffmpeg documentation depending on the video codec used: - -- h264: https://trac.ffmpeg.org/wiki/Encode/H.264 -- h265: https://trac.ffmpeg.org/wiki/Encode/H.265 -- AV1: https://trac.ffmpeg.org/wiki/Encode/AV1 - -### Decoding parameters - -**Decoder** -We tested two video decoding backends from torchvision: - -- `pyav` -- `video_reader` (requires to build torchvision from source) - -**Requested timestamps** -Given the way video decoding works, once a keyframe has been loaded, the decoding of subsequent frames is fast. -This of course is affected by the `-g` parameter during encoding, which specifies the frequency of the keyframes. Given our typical use cases in robotics policies which might request a few timestamps in different random places, we want to replicate these use cases with the following scenarios: - -- `1_frame`: 1 frame, -- `2_frames`: 2 consecutive frames (e.g. `[t, t + 1 / fps]`), -- `6_frames`: 6 consecutive frames (e.g. `[t + i / fps for i in range(6)]`) - -Note that this differs significantly from a typical use case like watching a movie, in which every frame is loaded sequentially from the beginning to the end and it's acceptable to have big values for `-g`. - -Additionally, because some policies might request single timestamps that are a few frames apart, we also have the following scenario: - -- `2_frames_4_space`: 2 frames with 4 consecutive frames of spacing in between (e.g `[t, t + 5 / fps]`), - -However, due to how video decoding is implemented with `pyav`, we don't have access to an accurate seek so in practice this scenario is essentially the same as `6_frames` since all 6 frames between `t` and `t + 5 / fps` will be decoded. - -## Metrics - -**Data compression ratio (lower is better)** -`video_images_size_ratio` is the ratio of the memory space on disk taken by the encoded video over the memory space taken by the original images. 
For instance, `video_images_size_ratio=25%` means that the video takes 4 times less memory space on disk compared to the original images. - -**Loading time ratio (lower is better)** -`video_images_load_time_ratio` is the ratio of the time it takes to decode frames from the video at a given timestamps over the time it takes to load the exact same original images. Lower is better. For instance, `video_images_load_time_ratio=200%` means that decoding from video is 2 times slower than loading the original images. - -**Average Mean Square Error (lower is better)** -`avg_mse` is the average mean square error between each decoded frame and its corresponding original image over all requested timestamps, and also divided by the number of pixels in the image to be comparable when switching to different image sizes. - -**Average Peak Signal to Noise Ratio (higher is better)** -`avg_psnr` measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Higher PSNR indicates better quality. - -**Average Structural Similarity Index Measure (higher is better)** -`avg_ssim` evaluates the perceived quality of images by comparing luminance, contrast, and structure. SSIM values range from -1 to 1, where 1 indicates perfect similarity. - -One aspect that can't be measured here with those metrics is the compatibility of the encoding across platforms, in particular on web browser, for visualization purposes. -h264, h265 and AV1 are all commonly used codecs and should not pose an issue. However, the chroma subsampling (`pix_fmt`) format might affect compatibility: - -- `yuv420p` is more widely supported across various platforms, including web browsers. -- `yuv444p` offers higher color fidelity but might not be supported as broadly. - - - -## How the benchmark works - -The benchmark evaluates both encoding and decoding of video frames on the first episode of each dataset. 
- -**Encoding:** for each `vcodec` and `pix_fmt` pair, we use a default value for `g` and `crf` upon which we change a single value (either `g` or `crf`) to one of the specified values (we don't test every combination of those as this would be computationally too heavy). -This gives a unique set of encoding parameters which is used to encode the episode. - -**Decoding:** Then, for each of those unique encodings, we iterate through every combination of the decoding parameters `backend` and `timestamps_mode`. For each of them, we record the metrics of a number of samples (given by `--num-samples`). This is parallelized for efficiency and the number of processes can be controlled with `--num-workers`. Ideally, it's best to have a `--num-samples` that is divisible by `--num-workers`. - -Intermediate results saved for each `vcodec` and `pix_fmt` combination in csv tables. -These are then all concatenated to a single table ready for analysis. - -## Caveats - -We tried to measure the most impactful parameters for both encoding and decoding. However, for computational reasons we can't test out every combination. - -Additional encoding parameters exist that are not included in this benchmark. In particular: - -- `-preset` which allows for selecting encoding presets. This represents a collection of options that will provide a certain encoding speed to compression ratio. By leaving this parameter unspecified, it is considered to be `medium` for libx264 and libx265 and `8` for libsvtav1. -- `-tune` which allows to optimize the encoding for certain aspects (e.g. film quality, fast decoding, etc.). - -See the documentation mentioned above for more detailed info on these settings and for a more comprehensive list of other parameters. - -Similarly on the decoding side, other decoders exist but are not implemented in our current benchmark. 
To name a few: - -- `torchaudio` -- `ffmpegio` -- `decord` -- `nvc` - -Note as well that since we are mostly interested in the performance at decoding time (also because encoding is done only once before uploading a dataset), we did not measure encoding times nor have any metrics regarding encoding. -However, besides the necessity to build ffmpeg from source, encoding did not pose any issue and it didn't take a significant amount of time during this benchmark. - -## Install - -Building ffmpeg from source is required to include libx265 and libaom/libsvtav1 (av1) video codecs ([compilation guide](https://trac.ffmpeg.org/wiki/CompilationGuide/Ubuntu)). - -**Note:** While you still need to build torchvision with a conda-installed `ffmpeg<4.3` to use the `video_reader` decoder (as described in [#220](https://github.com/huggingface/lerobot/pull/220)), you also need another version which is custom-built with all the video codecs for encoding. For the script to then use that version, you can prepend the command above with `PATH="$HOME/bin:$PATH"`, which is where ffmpeg should be built. - -## Adding a video decoder - -Right now, we're only benchmarking the two video decoder available with torchvision: `pyav` and `video_reader`. 
-You can easily add a new decoder to benchmark by adding it to this function in the script: - -```diff -def decode_video_frames( - video_path: str, - timestamps: list[float], - tolerance_s: float, - backend: str, -) -> torch.Tensor: - if backend in ["pyav", "video_reader"]: - return decode_video_frames_torchvision( - video_path, timestamps, tolerance_s, backend - ) -+ elif backend == ["your_decoder"]: -+ return your_decoder_function( -+ video_path, timestamps, tolerance_s, backend -+ ) - else: - raise NotImplementedError(backend) -``` - -## Example - -For a quick run, you can try these parameters: - -```bash -python benchmark/video/run_video_benchmark.py \ - --output-dir outputs/video_benchmark \ - --repo-ids \ - lerobot/pusht_image \ - lerobot/aloha_mobile_shrimp_image \ - --vcodec libx264 libx265 \ - --pix-fmt yuv444p yuv420p \ - --g 2 20 None \ - --crf 10 40 None \ - --timestamps-modes 1_frame 2_frames \ - --backends pyav video_reader \ - --num-samples 5 \ - --num-workers 5 \ - --save-frames 0 -``` - -## Results - -### Reproduce - -We ran the benchmark with the following parameters: - -```bash -# h264 and h265 encodings -python benchmark/video/run_video_benchmark.py \ - --output-dir outputs/video_benchmark \ - --repo-ids \ - lerobot/pusht_image \ - lerobot/aloha_mobile_shrimp_image \ - lerobot/paris_street \ - lerobot/kitchen \ - --vcodec libx264 libx265 \ - --pix-fmt yuv444p yuv420p \ - --g 1 2 3 4 5 6 10 15 20 40 None \ - --crf 0 5 10 15 20 25 30 40 50 None \ - --timestamps-modes 1_frame 2_frames 6_frames \ - --backends pyav video_reader \ - --num-samples 50 \ - --num-workers 5 \ - --save-frames 1 - -# av1 encoding (only compatible with yuv420p and pyav decoder) -python benchmark/video/run_video_benchmark.py \ - --output-dir outputs/video_benchmark \ - --repo-ids \ - lerobot/pusht_image \ - lerobot/aloha_mobile_shrimp_image \ - lerobot/paris_street \ - lerobot/kitchen \ - --vcodec libsvtav1 \ - --pix-fmt yuv420p \ - --g 1 2 3 4 5 6 10 15 20 40 None \ - --crf 0 
5 10 15 20 25 30 40 50 None \ - --timestamps-modes 1_frame 2_frames 6_frames \ - --backends pyav \ - --num-samples 50 \ - --num-workers 5 \ - --save-frames 1 -``` - -The full results are available [here](https://docs.google.com/spreadsheets/d/1OYJB43Qu8fC26k_OyoMFgGBBKfQRCi4BIuYitQnq3sw/edit?usp=sharing) - -### Parameters selected for LeRobotDataset - -Considering these results, we chose what we think is the best set of encoding parameter: - -- vcodec: `libsvtav1` -- pix-fmt: `yuv420p` -- g: `2` -- crf: `30` - -Since we're using av1 encoding, we're choosing the `pyav` decoder as `video_reader` does not support it (and `pyav` doesn't require a custom build of `torchvision`). - -### Summary - -These tables show the results for `g=2` and `crf=30`, using `timestamps-modes=6_frames` and `backend=pyav` - -| video_images_size_ratio | vcodec | pix_fmt | | | | -| --------------------------------- | ---------- | ------- | --------- | --------- | --------- | -| | libx264 | | libx265 | | libsvtav1 | -| repo_id | yuv420p | yuv444p | yuv420p | yuv444p | yuv420p | -| lerobot/pusht_image | **16.97%** | 17.58% | 18.57% | 18.86% | 22.06% | -| lerobot/aloha_mobile_shrimp_image | 2.14% | 2.11% | 1.38% | **1.37%** | 5.59% | -| lerobot/paris_street | 2.12% | 2.13% | **1.54%** | **1.54%** | 4.43% | -| lerobot/kitchen | 1.40% | 1.39% | **1.00%** | **1.00%** | 2.52% | - -| video_images_load_time_ratio | vcodec | pix_fmt | | | | -| --------------------------------- | ------- | ------- | -------- | ------- | --------- | -| | libx264 | | libx265 | | libsvtav1 | -| repo_id | yuv420p | yuv444p | yuv420p | yuv444p | yuv420p | -| lerobot/pusht_image | 6.45 | 5.19 | **1.90** | 2.12 | 2.47 | -| lerobot/aloha_mobile_shrimp_image | 11.80 | 7.92 | 0.71 | 0.85 | **0.48** | -| lerobot/paris_street | 2.21 | 2.05 | 0.36 | 0.49 | **0.30** | -| lerobot/kitchen | 1.46 | 1.46 | 0.28 | 0.51 | **0.26** | - -| | | vcodec | pix_fmt | | | | -| --------------------------------- | -------- | -------- | ------------ 
| -------- | --------- | ------------ | -| | | libx264 | | libx265 | | libsvtav1 | -| repo_id | metric | yuv420p | yuv444p | yuv420p | yuv444p | yuv420p | -| lerobot/pusht_image | avg_mse | 2.90E-04 | **2.03E-04** | 3.13E-04 | 2.29E-04 | 2.19E-04 | -| | avg_psnr | 35.44 | 37.07 | 35.49 | **37.30** | 37.20 | -| | avg_ssim | 98.28% | **98.85%** | 98.31% | 98.84% | 98.72% | -| lerobot/aloha_mobile_shrimp_image | avg_mse | 2.76E-04 | 2.59E-04 | 3.17E-04 | 3.06E-04 | **1.30E-04** | -| | avg_psnr | 35.91 | 36.21 | 35.88 | 36.09 | **40.17** | -| | avg_ssim | 95.19% | 95.18% | 95.00% | 95.05% | **97.73%** | -| lerobot/paris_street | avg_mse | 6.89E-04 | 6.70E-04 | 4.03E-03 | 4.02E-03 | **3.09E-04** | -| | avg_psnr | 33.48 | 33.68 | 32.05 | 32.15 | **35.40** | -| | avg_ssim | 93.76% | 93.75% | 89.46% | 89.46% | **95.46%** | -| lerobot/kitchen | avg_mse | 2.50E-04 | 2.24E-04 | 4.28E-04 | 4.18E-04 | **1.53E-04** | -| | avg_psnr | 36.73 | 37.33 | 36.56 | 36.75 | **39.12** | -| | avg_ssim | 95.47% | 95.58% | 95.52% | 95.53% | **96.82%** | diff --git a/benchmarks/video/run_video_benchmark.py b/benchmarks/video/run_video_benchmark.py deleted file mode 100644 index 064a84b48..000000000 --- a/benchmarks/video/run_video_benchmark.py +++ /dev/null @@ -1,488 +0,0 @@ -#!/usr/bin/env python - -# Copyright 2024 The HuggingFace Inc. team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Assess the performance of video decoding in various configurations. 
- -This script will benchmark different video encoding and decoding parameters. -See the provided README.md or run `python benchmark/video/run_video_benchmark.py --help` for usage info. -""" - -import argparse -import datetime as dt -import itertools -import random -import shutil -from collections import OrderedDict -from concurrent.futures import ThreadPoolExecutor, as_completed -from pathlib import Path -from threading import Lock - -import einops -import numpy as np -import pandas as pd -import PIL -import torch -from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity -from tqdm import tqdm - -from lerobot.datasets.lerobot_dataset import LeRobotDataset -from lerobot.datasets.video_utils import ( - decode_video_frames, - encode_video_frames, -) -from lerobot.utils.constants import OBS_IMAGE -from lerobot.utils.utils import TimerManager - -BASE_ENCODING = OrderedDict( - [ - ("vcodec", "libx264"), - ("pix_fmt", "yuv444p"), - ("g", 2), - ("crf", None), - # TODO(aliberts): Add fastdecode - # ("fastdecode", 0), - ] -) - - -# TODO(rcadene, aliberts): move to `utils.py` folder when we want to refactor -def parse_int_or_none(value) -> int | None: - if value.lower() == "none": - return None - try: - return int(value) - except ValueError as e: - raise argparse.ArgumentTypeError(f"Invalid int or None: {value}") from e - - -def check_datasets_formats(repo_ids: list) -> None: - for repo_id in repo_ids: - dataset = LeRobotDataset(repo_id) - if len(dataset.meta.video_keys) > 0: - raise ValueError( - f"Use only image dataset for running this benchmark. 
Video dataset provided: {repo_id}" - ) - - -def get_directory_size(directory: Path) -> int: - total_size = 0 - for item in directory.rglob("*"): - if item.is_file(): - total_size += item.stat().st_size - return total_size - - -def load_original_frames(imgs_dir: Path, timestamps: list[float], fps: int) -> torch.Tensor: - frames = [] - for ts in timestamps: - idx = int(ts * fps) - frame = PIL.Image.open(imgs_dir / f"frame-{idx:06d}.png") - frame = torch.from_numpy(np.array(frame)) - frame = frame.type(torch.float32) / 255 - frame = einops.rearrange(frame, "h w c -> c h w") - frames.append(frame) - return torch.stack(frames) - - -def save_decoded_frames( - imgs_dir: Path, save_dir: Path, frames: torch.Tensor, timestamps: list[float], fps: int -) -> None: - if save_dir.exists() and len(list(save_dir.glob("frame-*.png"))) == len(timestamps): - return - - save_dir.mkdir(parents=True, exist_ok=True) - for i, ts in enumerate(timestamps): - idx = int(ts * fps) - frame_hwc = (frames[i].permute((1, 2, 0)) * 255).type(torch.uint8).cpu().numpy() - PIL.Image.fromarray(frame_hwc).save(save_dir / f"frame-{idx:06d}_decoded.png") - shutil.copyfile(imgs_dir / f"frame-{idx:06d}.png", save_dir / f"frame-{idx:06d}_original.png") - - -def save_first_episode(imgs_dir: Path, dataset: LeRobotDataset) -> None: - episode_index = 0 - ep_num_images = dataset.meta.episodes["length"][episode_index] - if imgs_dir.exists() and len(list(imgs_dir.glob("frame-*.png"))) == ep_num_images: - return - - imgs_dir.mkdir(parents=True, exist_ok=True) - hf_dataset = dataset.hf_dataset.with_format(None) - - # We only save images from the first camera - img_keys = [key for key in hf_dataset.features if key.startswith(OBS_IMAGE)] - imgs_dataset = hf_dataset.select_columns(img_keys[0]) - - for i, item in enumerate( - tqdm(imgs_dataset, desc=f"saving {dataset.repo_id} first episode images", leave=False) - ): - img = item[img_keys[0]] - img.save(str(imgs_dir / f"frame-{i:06d}.png"), quality=100) - - if i >= 
ep_num_images - 1: - break - - -def sample_timestamps(timestamps_mode: str, ep_num_images: int, fps: int) -> list[float]: - # Start at 5 to allow for 2_frames_4_space and 6_frames - idx = random.randint(5, ep_num_images - 1) - match timestamps_mode: - case "1_frame": - frame_indexes = [idx] - case "2_frames": - frame_indexes = [idx - 1, idx] - case "2_frames_4_space": - frame_indexes = [idx - 5, idx] - case "6_frames": - frame_indexes = [idx - i for i in range(6)][::-1] - case _: - raise ValueError(timestamps_mode) - - return [idx / fps for idx in frame_indexes] - - -def benchmark_decoding( - imgs_dir: Path, - video_path: Path, - timestamps_mode: str, - backend: str, - ep_num_images: int, - fps: int, - num_samples: int = 50, - num_workers: int = 4, - save_frames: bool = False, -) -> dict: - def process_sample(sample: int, lock: Lock): - time_benchmark = TimerManager(log=False) - timestamps = sample_timestamps(timestamps_mode, ep_num_images, fps) - num_frames = len(timestamps) - result = { - "psnr_values": [], - "ssim_values": [], - "mse_values": [], - } - - with time_benchmark, lock: - frames = decode_video_frames(video_path, timestamps=timestamps, tolerance_s=5e-1, backend=backend) - result["load_time_video_ms"] = (time_benchmark.last * 1000) / num_frames - - with time_benchmark: - original_frames = load_original_frames(imgs_dir, timestamps, fps) - result["load_time_images_ms"] = (time_benchmark.last * 1000) / num_frames - - frames_np, original_frames_np = frames.numpy(), original_frames.numpy() - for i in range(num_frames): - result["mse_values"].append(mean_squared_error(original_frames_np[i], frames_np[i])) - result["psnr_values"].append( - peak_signal_noise_ratio(original_frames_np[i], frames_np[i], data_range=1.0) - ) - result["ssim_values"].append( - structural_similarity(original_frames_np[i], frames_np[i], data_range=1.0, channel_axis=0) - ) - - if save_frames and sample == 0: - save_dir = video_path.with_suffix("") / f"{timestamps_mode}_{backend}" - 
save_decoded_frames(imgs_dir, save_dir, frames, timestamps, fps) - - return result - - load_times_video_ms = [] - load_times_images_ms = [] - mse_values = [] - psnr_values = [] - ssim_values = [] - - # A sample is a single set of decoded frames specified by timestamps_mode (e.g. a single frame, 2 frames, etc.). - # For each sample, we record metrics (loading time and quality metrics) which are then averaged over all samples. - # As these samples are independent, we run them in parallel threads to speed up the benchmark. - # Use a single shared lock for all worker threads - shared_lock = Lock() - with ThreadPoolExecutor(max_workers=num_workers) as executor: - futures = [executor.submit(process_sample, i, shared_lock) for i in range(num_samples)] - for future in tqdm(as_completed(futures), total=num_samples, desc="samples", leave=False): - result = future.result() - load_times_video_ms.append(result["load_time_video_ms"]) - load_times_images_ms.append(result["load_time_images_ms"]) - psnr_values.extend(result["psnr_values"]) - ssim_values.extend(result["ssim_values"]) - mse_values.extend(result["mse_values"]) - - avg_load_time_video_ms = float(np.array(load_times_video_ms).mean()) - avg_load_time_images_ms = float(np.array(load_times_images_ms).mean()) - video_images_load_time_ratio = avg_load_time_video_ms / avg_load_time_images_ms - - return { - "avg_load_time_video_ms": avg_load_time_video_ms, - "avg_load_time_images_ms": avg_load_time_images_ms, - "video_images_load_time_ratio": video_images_load_time_ratio, - "avg_mse": float(np.mean(mse_values)), - "avg_psnr": float(np.mean(psnr_values)), - "avg_ssim": float(np.mean(ssim_values)), - } - - -def benchmark_encoding_decoding( - dataset: LeRobotDataset, - video_path: Path, - imgs_dir: Path, - encoding_cfg: dict, - decoding_cfg: dict, - num_samples: int, - num_workers: int, - save_frames: bool, - overwrite: bool = False, - seed: int = 1337, -) -> list[dict]: - fps = dataset.fps - - if overwrite or not 
video_path.is_file(): - tqdm.write(f"encoding {video_path}") - encode_video_frames( - imgs_dir=imgs_dir, - video_path=video_path, - fps=fps, - vcodec=encoding_cfg["vcodec"], - pix_fmt=encoding_cfg["pix_fmt"], - g=encoding_cfg.get("g"), - crf=encoding_cfg.get("crf"), - # fast_decode=encoding_cfg.get("fastdecode"), - overwrite=True, - ) - - episode_index = 0 - ep_num_images = dataset.meta.episodes["length"][episode_index] - width, height = tuple(dataset[0][dataset.meta.camera_keys[0]].shape[-2:]) - num_pixels = width * height - video_size_bytes = video_path.stat().st_size - images_size_bytes = get_directory_size(imgs_dir) - video_images_size_ratio = video_size_bytes / images_size_bytes - - random.seed(seed) - benchmark_table = [] - for timestamps_mode in tqdm( - decoding_cfg["timestamps_modes"], desc="decodings (timestamps_modes)", leave=False - ): - for backend in tqdm(decoding_cfg["backends"], desc="decodings (backends)", leave=False): - benchmark_row = benchmark_decoding( - imgs_dir, - video_path, - timestamps_mode, - backend, - ep_num_images, - fps, - num_samples, - num_workers, - save_frames, - ) - benchmark_row.update( - **{ - "repo_id": dataset.repo_id, - "resolution": f"{width} x {height}", - "num_pixels": num_pixels, - "video_size_bytes": video_size_bytes, - "images_size_bytes": images_size_bytes, - "video_images_size_ratio": video_images_size_ratio, - "timestamps_mode": timestamps_mode, - "backend": backend, - }, - **encoding_cfg, - ) - benchmark_table.append(benchmark_row) - - return benchmark_table - - -def main( - output_dir: Path, - repo_ids: list[str], - vcodec: list[str], - pix_fmt: list[str], - g: list[int], - crf: list[int], - # fastdecode: list[int], - timestamps_modes: list[str], - backends: list[str], - num_samples: int, - num_workers: int, - save_frames: bool, -): - check_datasets_formats(repo_ids) - encoding_benchmarks = { - "g": g, - "crf": crf, - # "fastdecode": fastdecode, - } - decoding_benchmarks = { - "timestamps_modes": timestamps_modes, 
- "backends": backends, - } - headers = ["repo_id", "resolution", "num_pixels"] - headers += list(BASE_ENCODING.keys()) - headers += [ - "timestamps_mode", - "backend", - "video_size_bytes", - "images_size_bytes", - "video_images_size_ratio", - "avg_load_time_video_ms", - "avg_load_time_images_ms", - "video_images_load_time_ratio", - "avg_mse", - "avg_psnr", - "avg_ssim", - ] - file_paths = [] - for video_codec in tqdm(vcodec, desc="encodings (vcodec)"): - for pixel_format in tqdm(pix_fmt, desc="encodings (pix_fmt)", leave=False): - benchmark_table = [] - for repo_id in tqdm(repo_ids, desc="encodings (datasets)", leave=False): - dataset = LeRobotDataset(repo_id) - imgs_dir = output_dir / "images" / dataset.repo_id.replace("/", "_") - # We only use the first episode - save_first_episode(imgs_dir, dataset) - for duet in [ - dict(zip(encoding_benchmarks.keys(), unique_combination, strict=False)) - for unique_combination in itertools.product(*encoding_benchmarks.values()) - ]: - encoding_cfg = BASE_ENCODING.copy() - encoding_cfg["vcodec"] = video_codec - encoding_cfg["pix_fmt"] = pixel_format - for key, value in duet.items(): - encoding_cfg[key] = value - args_path = Path("_".join(str(value) for value in encoding_cfg.values())) - video_path = output_dir / "videos" / args_path / f"{repo_id.replace('/', '_')}.mp4" - benchmark_table += benchmark_encoding_decoding( - dataset, - video_path, - imgs_dir, - encoding_cfg, - decoding_benchmarks, - num_samples, - num_workers, - save_frames, - ) - - # Save intermediate results - benchmark_df = pd.DataFrame(benchmark_table, columns=headers) - now = dt.datetime.now() - csv_path = ( - output_dir - / f"{now:%Y-%m-%d}_{now:%H-%M-%S}_{video_codec}_{pixel_format}_{num_samples}-samples.csv" - ) - benchmark_df.to_csv(csv_path, header=True, index=False) - file_paths.append(csv_path) - del benchmark_df - - # Concatenate all results - df_list = [pd.read_csv(csv_path) for csv_path in file_paths] - concatenated_df = pd.concat(df_list, 
ignore_index=True) - concatenated_path = output_dir / f"{now:%Y-%m-%d}_{now:%H-%M-%S}_all_{num_samples}-samples.csv" - concatenated_df.to_csv(concatenated_path, header=True, index=False) - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument( - "--output-dir", - type=Path, - default=Path("outputs/video_benchmark"), - help="Directory where the video benchmark outputs are written.", - ) - parser.add_argument( - "--repo-ids", - type=str, - nargs="*", - default=[ - "lerobot/pusht_image", - "lerobot/aloha_mobile_shrimp_image", - "lerobot/paris_street", - "lerobot/kitchen", - ], - help="Datasets repo-ids to test against. First episodes only are used. Must be images.", - ) - parser.add_argument( - "--vcodec", - type=str, - nargs="*", - default=["h264", "hevc", "libsvtav1"], - help="Video codecs to be tested", - ) - parser.add_argument( - "--pix-fmt", - type=str, - nargs="*", - default=["yuv444p", "yuv420p"], - help="Pixel formats (chroma subsampling) to be tested", - ) - parser.add_argument( - "--g", - type=parse_int_or_none, - nargs="*", - default=[1, 2, 3, 4, 5, 6, 10, 15, 20, 40, 100, None], - help="Group of pictures sizes to be tested.", - ) - parser.add_argument( - "--crf", - type=parse_int_or_none, - nargs="*", - default=[0, 5, 10, 15, 20, 25, 30, 40, 50, None], - help="Constant rate factors to be tested.", - ) - # parser.add_argument( - # "--fastdecode", - # type=int, - # nargs="*", - # default=[0, 1], - # help="Use the fastdecode tuning option. 0 disables it. " - # "For libx264 and libx265/hevc, only 1 is possible. 
" - # "For libsvtav1, 1, 2 or 3 are possible values with a higher number meaning a faster decoding optimization", - # ) - parser.add_argument( - "--timestamps-modes", - type=str, - nargs="*", - default=[ - "1_frame", - "2_frames", - "2_frames_4_space", - "6_frames", - ], - help="Timestamps scenarios to be tested.", - ) - parser.add_argument( - "--backends", - type=str, - nargs="*", - default=["torchcodec", "pyav"], - help="Torchvision decoding backend to be tested.", - ) - parser.add_argument( - "--num-samples", - type=int, - default=50, - help="Number of samples for each encoding x decoding config.", - ) - parser.add_argument( - "--num-workers", - type=int, - default=10, - help="Number of processes for parallelized sample processing.", - ) - parser.add_argument( - "--save-frames", - type=int, - default=0, - help="Whether to save decoded frames or not. Enter a non-zero number for true.", - ) - args = parser.parse_args() - main(**vars(args)) diff --git a/pyproject.toml b/pyproject.toml index 870f7b62b..f983134ab 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -99,7 +99,18 @@ dataset = [ "pandas>=2.0.0,<3.0.0", # NOTE: Transitive dependency of datasets "pyarrow>=21.0.0,<30.0.0", # NOTE: Transitive dependency of datasets "lerobot[av-dep]", - "torchcodec>=0.3.0,<0.12.0; sys_platform != 'win32' and (sys_platform != 'linux' or (platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l')) and (sys_platform != 'darwin' or platform_machine != 'x86_64')", # NOTE: Windows support starts at version 0.7 (needs torch==2.8), ffmpeg>=8 support starts at version 0.8.1 (needs torch==2.9), system-wide ffmpeg support starts at version 0.10 (needs torch==2.10), 0.11 needs torch==2.11, 0.12 needs torch==2.12. + + # NOTE: torchcodec wheel availability matrix (PyPI): + # - linux x86_64/amd64 + macOS arm64 : wheels since 0.3.0 (the historic supported set). + # - win32 x86_64 : wheels since 0.7.0 (needs torch>=2.8). 
+ # - linux aarch64/arm64 : wheels since 0.11.0 (needs torch>=2.11). + # - macOS x86_64 (Intel) and linux armv7l: no wheels in any released version -> fall through to the PyAV decoder. + # Each platform gets its own line so the resolver picks the minimum version that has a wheel for it. + + # Other torch/torchcodec pairings (informational): 0.8.1 = ffmpeg>=8 support, 0.10 = system-wide ffmpeg support, 0.12 needs torch==2.12. + "torchcodec>=0.3.0,<0.12.0; (sys_platform == 'linux' and (platform_machine == 'x86_64' or platform_machine == 'AMD64')) or (sys_platform == 'darwin' and platform_machine == 'arm64')", + "torchcodec>=0.7.0,<0.12.0; sys_platform == 'win32'", + "torchcodec>=0.11.0,<0.12.0; sys_platform == 'linux' and (platform_machine == 'aarch64' or platform_machine == 'arm64')", "jsonlines>=4.0.0,<5.0.0", ] training = [ diff --git a/src/lerobot/datasets/video_utils.py b/src/lerobot/datasets/video_utils.py index 512dc6d9b..00ff09ee7 100644 --- a/src/lerobot/datasets/video_utils.py +++ b/src/lerobot/datasets/video_utils.py @@ -33,7 +33,6 @@ import fsspec import numpy as np import pyarrow as pa import torch -import torchvision from datasets.features.features import register_feature from PIL import Image @@ -132,7 +131,9 @@ def decode_video_frames( video_path (Path): Path to the video file. timestamps (list[float]): List of timestamps to extract frames. tolerance_s (float): Allowed deviation in seconds for frame retrieval. - backend (str, optional): Backend to use for decoding. Defaults to "torchcodec" when available in the platform; otherwise, defaults to "pyav". + backend (str, optional): Backend to use for decoding. Defaults to "torchcodec" when available + in the platform; otherwise, defaults to "pyav". The legacy value "video_reader" is + accepted for one release as an alias for "pyav" and will be removed in a future version. return_uint8 (bool): If True, return raw uint8 frames without float32 normalization. 
This reduces memory for DataLoader IPC; normalization can be done on GPU afterward. @@ -145,85 +146,87 @@ def decode_video_frames( backend = get_safe_default_codec() if backend == "torchcodec": return decode_video_frames_torchcodec(video_path, timestamps, tolerance_s, return_uint8=return_uint8) - elif backend in ["pyav", "video_reader"]: - return decode_video_frames_torchvision( - video_path, timestamps, tolerance_s, backend, return_uint8=return_uint8 - ) + elif backend == "pyav": + return decode_video_frames_pyav(video_path, timestamps, tolerance_s, return_uint8=return_uint8) + elif backend == "video_reader": + logger.warning("backend='video_reader' is deprecated and now aliases to 'pyav'.") + return decode_video_frames_pyav(video_path, timestamps, tolerance_s, return_uint8=return_uint8) else: raise ValueError(f"Unsupported video backend: {backend}") -def decode_video_frames_torchvision( +def decode_video_frames_pyav( video_path: Path | str, timestamps: list[float], tolerance_s: float, - backend: str = "pyav", log_loaded_timestamps: bool = False, return_uint8: bool = False, ) -> torch.Tensor: - """Loads frames associated to the requested timestamps of a video + """Loads frames associated with the requested timestamps of a video using PyAV. - The backend can be either "pyav" (default) or "video_reader". - "video_reader" requires installing torchvision from source, see: - https://github.com/pytorch/vision/blob/main/torchvision/csrc/io/decoder/gpu/README.rst - (note that you need to compile against ffmpeg<4.3) + This is the fallback decoder for platforms where torchcodec has no wheel (currently macOS + x86_64 and linux armv7l — see the torchcodec block in pyproject.toml for the full matrix). + On supported platforms, prefer `decode_video_frames_torchcodec`, which is faster and supports + accurate seek. - While both use cpu, "video_reader" is supposedly faster than "pyav" but requires additional setup. 
- For more info on video decoding, see `benchmark/video/README.md` + PyAV doesn't support accurate seek: we seek to the nearest preceding keyframe and decode + forward until we have covered the requested timestamp range. The number of key frames in a + video can be adjusted at encoding time to trade off decoding speed against file size. - See torchvision doc for more info on these two backends: - https://pytorch.org/vision/0.18/index.html?highlight=backend#torchvision.set_video_backend + Args: + video_path: Path to the video file. + timestamps: List of timestamps (in seconds) to extract frames for. + tolerance_s: Allowed deviation in seconds between a queried timestamp and the closest + decoded frame. + log_loaded_timestamps: When True, log every decoded frame's timestamp at INFO level. + return_uint8: When True, return raw uint8 frames (C, H, W). Otherwise, return float32 in + [0, 1] range. - Note: Video benefits from inter-frame compression. Instead of storing every frame individually, - the encoder stores a reference frame (or a key frame) and subsequent frames as differences relative to - that key frame. As a consequence, to access a requested frame, we need to load the preceding key frame, - and all subsequent frames until reaching the requested frame. The number of key frames in a video - can be adjusted during encoding to take into account decoding time and video size in bytes. + Returns: + torch.Tensor of shape (len(timestamps), C, H, W). 
""" - video_path = str(video_path) - - # set backend - keyframes_only = False - torchvision.set_video_backend(backend) - if backend == "pyav": - keyframes_only = True # pyav doesn't support accurate seek - - # set a video stream reader # TODO(rcadene): also load audio stream at the same time - reader = torchvision.io.VideoReader(video_path, "video") + video_path = str(video_path) # set the first and last requested timestamps # Note: previous timestamps are usually loaded, since we need to access the previous key frame first_ts = min(timestamps) last_ts = max(timestamps) - # access closest key frame of the first requested frame - # Note: closest key frame timestamp is usually smaller than `first_ts` (e.g. key frame can be the first frame of the video) - # for details on what `seek` is doing see: https://pyav.basswood-io.com/docs/stable/api/container.html?highlight=inputcontainer#av.container.InputContainer.seek - reader.seek(first_ts, keyframes_only=keyframes_only) + loaded_frames: list[torch.Tensor] = [] + loaded_ts: list[float] = [] - # load all frames until last requested frame - loaded_frames = [] - loaded_ts = [] - for frame in reader: - current_ts = frame["pts"] - if log_loaded_timestamps: - logger.info(f"frame loaded at timestamp={current_ts:.4f}") - loaded_frames.append(frame["data"]) - loaded_ts.append(current_ts) - if current_ts >= last_ts: - break + # Seek + decode. `container.seek(offset)` with no `stream` argument expects the offset in + # av.time_base units (microseconds). `backward=True` lands us on the nearest keyframe at or + # before `first_ts`, so we can then decode forward until we cover `last_ts`. 
See: + # https://pyav.basswood-io.com/docs/stable/api/container.html#av.container.InputContainer.seek + with av.open(video_path) as container: + stream = container.streams.video[0] + container.seek(int(first_ts * av.time_base), backward=True) - if backend == "pyav": - reader.container.close() + for frame in container.decode(stream): + if frame.pts is None: + continue + current_ts = float(frame.pts * stream.time_base) + if log_loaded_timestamps: + logger.info(f"frame loaded at timestamp={current_ts:.4f}") + # Convert to CHW uint8 to match torchcodec's output layout. + arr = frame.to_ndarray(format="rgb24") # H, W, 3 + loaded_frames.append(torch.from_numpy(arr).permute(2, 0, 1).contiguous()) + loaded_ts.append(current_ts) + if current_ts >= last_ts: + break - reader = None + if not loaded_frames: + raise FrameTimestampError( + f"No frames could be decoded from {video_path} in the timestamp range [{first_ts}, {last_ts}]." + ) query_ts = torch.tensor(timestamps) - loaded_ts = torch.tensor(loaded_ts) + loaded_ts_t = torch.tensor(loaded_ts) # compute distances between each query timestamp and timestamps of all loaded frames - dist = torch.cdist(query_ts[:, None], loaded_ts[:, None], p=1) + dist = torch.cdist(query_ts[:, None], loaded_ts_t[:, None], p=1) min_, argmin_ = dist.min(1) is_within_tol = min_ < tolerance_s @@ -234,14 +237,14 @@ def decode_video_frames_torchvision( " This might be due to synchronization issues with timestamps during data collection." " To be safe, we advise to ignore this item during training." 
f"\nqueried timestamps: {query_ts}" - f"\nloaded timestamps: {loaded_ts}" + f"\nloaded timestamps: {loaded_ts_t}" f"\nvideo: {video_path}" - f"\nbackend: {backend}" + f"\nbackend: pyav" ) # get closest frames to the query timestamps closest_frames = torch.stack([loaded_frames[idx] for idx in argmin_]) - closest_ts = loaded_ts[argmin_] + closest_ts = loaded_ts_t[argmin_] if log_loaded_timestamps: logger.info(f"{closest_ts=}") diff --git a/src/lerobot/scripts/lerobot_info.py b/src/lerobot/scripts/lerobot_info.py index 2092db48b..a057836e5 100644 --- a/src/lerobot/scripts/lerobot_info.py +++ b/src/lerobot/scripts/lerobot_info.py @@ -92,6 +92,7 @@ def get_sys_info() -> dict[str, str]: info.update( { "PyTorch version": torch_version, + "Torchcodec version": get_package_version("torchcodec"), "Is PyTorch built with CUDA support?": str(torch_cuda_available), "Cuda version": cuda_version, "GPU model": gpu_model, diff --git a/uv.lock b/uv.lock index 28b906d89..2ffaf9fe4 100644 --- a/uv.lock +++ b/uv.lock @@ -2,13 +2,17 @@ version = 1 revision = 3 requires-python = ">=3.12" resolution-markers = [ - "python_full_version >= '3.15' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version >= '3.15' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version >= '3.15' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", "python_full_version >= '3.15' and platform_machine == 's390x' and sys_platform == 'linux'", - "python_full_version == '3.14.*' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine 
!= 's390x' and sys_platform == 'linux'", - "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version == '3.14.*' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version == '3.14.*' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", + "(python_full_version == '3.13.*' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version == '3.13.*' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", "python_full_version == '3.14.*' and platform_machine == 's390x' and sys_platform == 'linux'", "python_full_version == '3.13.*' and platform_machine == 's390x' and sys_platform == 'linux'", - "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version < '3.13' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version < '3.13' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", 
"python_full_version < '3.13' and platform_machine == 's390x' and sys_platform == 'linux'", "(python_full_version >= '3.15' and platform_machine == 'aarch64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'arm64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'armv7l' and sys_platform == 'linux')", "(python_full_version == '3.14.*' and platform_machine == 'aarch64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'arm64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'armv7l' and sys_platform == 'linux')", @@ -1134,7 +1138,7 @@ name = "decord" version = "0.6.0" source = { registry = "https://pypi.org/simple" } dependencies = [ - { name = "numpy", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x') or (platform_machine != 's390x' and sys_platform != 'linux')" }, + { name = "numpy", marker = "(platform_machine != 's390x' and sys_platform != 'linux') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux')" }, ] wheels = [ { url = "https://files.pythonhosted.org/packages/11/79/936af42edf90a7bd4e41a6cac89c913d4b47fa48a26b042d5129a9242ee3/decord-0.6.0-py3-none-manylinux2010_x86_64.whl", hash = "sha256:51997f20be8958e23b7c4061ba45d0efcd86bffd5fe81c695d0befee0d442976", size = 13602299, upload-time = "2021-06-14T21:30:55.486Z" }, @@ -2729,7 +2733,7 @@ all = [ { name = "scikit-image" }, { name = "scipy" }, { name = "teleop" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker 
= "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" }, { name = "torchdiffeq" }, { name = "transformers" }, { name = "wandb" }, @@ -2742,7 +2746,7 @@ aloha = [ { name = "pandas" }, { name = "pyarrow" }, { name = "scipy" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" }, ] async = [ { name = "contourpy" }, @@ -2766,7 +2770,7 @@ core-scripts = [ { name = "pynput" }, { name = "pyserial" }, { name = "rerun-sdk" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or 
sys_platform == 'win32'" }, ] damiao = [ { name = "python-can" }, @@ -2777,7 +2781,7 @@ dataset = [ { name = "jsonlines" }, { name = "pandas" }, { name = "pyarrow" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" }, ] dataset-viz = [ { name = "av" }, @@ -2786,7 +2790,7 @@ dataset-viz = [ { name = "pandas" }, { name = "pyarrow" }, { name = "rerun-sdk" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" }, ] deepdiff-dep = [ { name = "deepdiff" }, @@ -2888,7 +2892,7 @@ libero = [ { name = "pandas" }, { name = "pyarrow" }, { name = "scipy" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 
'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" }, { name = "transformers" }, ] matplotlib-dep = [ @@ -2903,7 +2907,7 @@ metaworld = [ { name = "pandas" }, { name = "pyarrow" }, { name = "scipy" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" }, ] multi-task-dit = [ { name = "diffusers" }, @@ -2944,7 +2948,7 @@ pusht = [ { name = "pandas" }, { name = "pyarrow" }, { name = "pymunk" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 
'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" }, ] pygame-dep = [ { name = "pygame" }, @@ -2996,7 +3000,7 @@ training = [ { name = "jsonlines" }, { name = "pandas" }, { name = "pyarrow" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux') or (platform_machine != 'x86_64' and sys_platform == 'darwin') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32')" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin') or (platform_machine == 'AMD64' and sys_platform == 'linux') or (platform_machine == 'aarch64' and sys_platform == 'linux') or (platform_machine == 'arm64' and sys_platform == 'linux') or (platform_machine == 'x86_64' and sys_platform == 'linux') or sys_platform == 'win32'" }, { name = "wandb" }, ] transformers-dep = [ @@ -3209,7 +3213,9 @@ requires-dist = [ { name = "timm", marker = "extra == 'groot'", specifier = ">=1.0.0,<1.1.0" }, { name = "torch", marker = "sys_platform != 'linux'", specifier = ">=2.7,<2.12.0" }, { name = "torch", marker = "sys_platform == 'linux'", specifier = ">=2.7,<2.12.0", index = "https://download.pytorch.org/whl/cu128" }, - { name = "torchcodec", marker = "(platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and sys_platform == 'linux' and extra == 'dataset') or (platform_machine != 'x86_64' and sys_platform == 'darwin' and extra == 'dataset') or (sys_platform != 'darwin' and sys_platform != 'linux' and sys_platform != 'win32' and extra == 'dataset')", specifier = ">=0.3.0,<0.12.0" }, + { name = "torchcodec", marker = "(platform_machine == 'arm64' and sys_platform == 'darwin' and extra == 'dataset') or (platform_machine == 'AMD64' and sys_platform == 'linux' and extra == 'dataset') or 
(platform_machine == 'x86_64' and sys_platform == 'linux' and extra == 'dataset')", specifier = ">=0.3.0,<0.12.0" }, + { name = "torchcodec", marker = "(platform_machine == 'aarch64' and sys_platform == 'linux' and extra == 'dataset') or (platform_machine == 'arm64' and sys_platform == 'linux' and extra == 'dataset')", specifier = ">=0.11.0,<0.12.0" }, + { name = "torchcodec", marker = "sys_platform == 'win32' and extra == 'dataset'", specifier = ">=0.7.0,<0.12.0" }, { name = "torchdiffeq", marker = "extra == 'wallx'", specifier = ">=0.2.4,<0.3.0" }, { name = "torchvision", marker = "sys_platform != 'linux'", specifier = ">=0.22.0,<0.27.0" }, { name = "torchvision", marker = "sys_platform == 'linux'", specifier = ">=0.22.0,<0.27.0", index = "https://download.pytorch.org/whl/cu128" }, @@ -4260,6 +4266,7 @@ dependencies = [ { name = "protobuf" }, ] wheels = [ + { url = "https://files.pythonhosted.org/packages/81/b1/d111b1df656761f980d9e298a60039a9cb66036b1d039e777537743d0ac3/onnxruntime-1.26.0-cp312-cp312-macosx_14_0_arm64.whl", hash = "sha256:05b028781b322ad74b57ce5b50aa5280bb1fe96ceec334628ade681e0b24c1ac", size = 18016624, upload-time = "2026-05-12T00:41:01.735Z" }, { url = "https://files.pythonhosted.org/packages/f6/a0/3f9d896a0385a36bd04345d6d0b802821a5782adde562e7e135f6bb71c73/onnxruntime-1.26.0-cp312-cp312-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:91f2bb870a4b9224eba0a6728c1fa7a9e552b8e59e1083c51fbbc3d013f2b5c0", size = 16052692, upload-time = "2026-05-08T19:07:13.829Z" }, { url = "https://files.pythonhosted.org/packages/7c/43/2a4e04f8dbeffad19bbcced4bcd4289bf478921518437404d6b92bdf213b/onnxruntime-1.26.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9b6dd70599005bd1bf29779f04a91978b92b5e719c11a20068a8f8e535f725b6", size = 18185439, upload-time = "2026-05-08T19:07:36.299Z" }, { url = 
"https://files.pythonhosted.org/packages/44/fc/026d0a7162b9c2153dac292baea9e027c42304dc1d9dc6f8ff5b4cfbaedd/onnxruntime-1.26.0-cp312-cp312-win_amd64.whl", hash = "sha256:a26374dc7fbcaae593601086b242120e13f2310558df0991da6dd8b8fac00414", size = 13026427, upload-time = "2026-05-08T19:08:03.503Z" }, @@ -6275,13 +6282,17 @@ name = "torch" version = "2.11.0+cu128" source = { registry = "https://download.pytorch.org/whl/cu128" } resolution-markers = [ - "python_full_version >= '3.15' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version >= '3.15' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version >= '3.15' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", "python_full_version >= '3.15' and platform_machine == 's390x' and sys_platform == 'linux'", - "python_full_version == '3.14.*' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", - "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version == '3.14.*' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version == '3.14.*' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 
'x86_64' and sys_platform == 'linux'", + "(python_full_version == '3.13.*' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version == '3.13.*' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", "python_full_version == '3.14.*' and platform_machine == 's390x' and sys_platform == 'linux'", "python_full_version == '3.13.*' and platform_machine == 's390x' and sys_platform == 'linux'", - "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version < '3.13' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version < '3.13' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", "python_full_version < '3.13' and platform_machine == 's390x' and sys_platform == 'linux'", "(python_full_version >= '3.15' and platform_machine == 'aarch64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'arm64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'armv7l' and sys_platform == 'linux')", "(python_full_version == '3.14.*' and platform_machine == 'aarch64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'arm64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'armv7l' and sys_platform == 
'linux')", @@ -6325,12 +6336,15 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/64/85/38f4843ff2a6bf7dfb71a153acd99024dadb96749965a67524c2f1cc1894/torchcodec-0.11.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:57056e91d1d883d0fb77ca7759e304be9c0bdb4ea0e37bde5c2e361347063b8c", size = 4368988, upload-time = "2026-04-14T18:24:51.46Z" }, { url = "https://files.pythonhosted.org/packages/4b/85/3b41034b0f1289423745f918ace2a1e1e86b9c578c2e2461b6afcbb5354a/torchcodec-0.11.1-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:f1aee486a84247fcaa67870ac5005aa8d382a9839e91e476fa71b5b3d9fda9b7", size = 2397532, upload-time = "2026-04-14T18:24:53.368Z" }, { url = "https://files.pythonhosted.org/packages/ca/a9/a2b6ee3e84c55bdd0c45fd991dde71c95a99115ec9e26938b212b4545dcf/torchcodec-0.11.1-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:6c26e90e7aa982302644d0af8cb706318682bb390f48a80ecbfeab03499acd04", size = 2329883, upload-time = "2026-04-14T18:24:55.467Z" }, + { url = "https://files.pythonhosted.org/packages/82/48/683114a4ed6b59f76b6919532a5db0f4068787be26bab92cc18a1dfa6794/torchcodec-0.11.1-cp312-cp312-win_amd64.whl", hash = "sha256:3fd2d10e0e0a5f455c1c87dc1380b3bd43b77dd5eeeaf479470643b1c04a2dd2", size = 1921066, upload-time = "2026-04-14T18:24:57.102Z" }, { url = "https://files.pythonhosted.org/packages/2c/61/a8985a7561ef651e409deeac151a0ed5cef763db9577db5cc49c2f5eaab2/torchcodec-0.11.1-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:915fbe20068ec77486fbbeaf0c627c89c7376445f27d215b7489c0a03c64fd4c", size = 4289805, upload-time = "2026-04-14T18:24:59.124Z" }, { url = "https://files.pythonhosted.org/packages/7a/31/c4ec0304dd169a9b2b7fa0dd1d5d659d3cccc975b98ac88c498fe6dd7196/torchcodec-0.11.1-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:3755de03c96afd37410cba68198225d11cd6431a32f2161a0019791a4a853305", size = 2399057, upload-time = "2026-04-14T18:25:00.782Z" }, { url = 
"https://files.pythonhosted.org/packages/5d/b2/85ad7a81f387e40983c21bc94da0c333974afb41f38c3a85d25875274187/torchcodec-0.11.1-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:5eee69971cec1147a03b8a6b678b5dfbeff0b2c71ed7929e488391f9fbcd630c", size = 2332721, upload-time = "2026-04-14T18:25:02.518Z" }, + { url = "https://files.pythonhosted.org/packages/ad/ca/5c66f21d2a12039450e9dd4d9d7c480019dfbe9e8a87696a3c3a827c1e37/torchcodec-0.11.1-cp313-cp313-win_amd64.whl", hash = "sha256:67b34e5733636588ebe0f15082bbb90a8ce1472ccb8bb1a656ec28958a208919", size = 1920990, upload-time = "2026-04-14T18:25:04.269Z" }, { url = "https://files.pythonhosted.org/packages/c4/b7/8d6ee76fca0cfefec01402f33c11766455da2b8460cb9191cdc34f8defc0/torchcodec-0.11.1-cp314-cp314-macosx_12_0_arm64.whl", hash = "sha256:a00ef79e847644f91c9995de021062adc851916b16244d26c0a7a04569710508", size = 4408290, upload-time = "2026-04-14T18:25:05.967Z" }, { url = "https://files.pythonhosted.org/packages/1e/1e/e37bd46ffac9eec1a9afc32c5097cd83b0de1e865021f7f953c5142919f4/torchcodec-0.11.1-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:170a3efea64f0cd2c21cee0a233a9e13c67a704b5c5e7ef9aeda31e747ac6885", size = 2402232, upload-time = "2026-04-14T18:25:08.026Z" }, { url = "https://files.pythonhosted.org/packages/8f/d0/a9173dbfa011cc2224f7489e50844b9f62110050bbdbd9d29485e7f1e0e2/torchcodec-0.11.1-cp314-cp314-manylinux_2_28_x86_64.whl", hash = "sha256:db66ddce36a6fa35f30fbe1d78b57289fcb53f8f43c1c85923edbe339540c665", size = 2334158, upload-time = "2026-04-14T18:25:09.77Z" }, + { url = "https://files.pythonhosted.org/packages/18/96/6ee0e26547976dc55a69042ce895747a34221eab348931e975141d80d25e/torchcodec-0.11.1-cp314-cp314-win_amd64.whl", hash = "sha256:3fd9ef8302b261d3db5585e42be4a3138c5c240a822031642cdf1f82ea3db5b7", size = 1925002, upload-time = "2026-04-14T18:25:11.718Z" }, ] [[package]] @@ -6400,13 +6414,17 @@ name = "torchvision" version = "0.26.0+cu128" source = { registry = 
"https://download.pytorch.org/whl/cu128" } resolution-markers = [ - "python_full_version >= '3.15' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version >= '3.15' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version >= '3.15' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", "python_full_version >= '3.15' and platform_machine == 's390x' and sys_platform == 'linux'", - "python_full_version == '3.14.*' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", - "python_full_version == '3.13.*' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version == '3.14.*' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version == '3.14.*' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", + "(python_full_version == '3.13.*' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version == '3.13.*' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version == '3.13.*' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and 
platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", "python_full_version == '3.14.*' and platform_machine == 's390x' and sys_platform == 'linux'", "python_full_version == '3.13.*' and platform_machine == 's390x' and sys_platform == 'linux'", - "python_full_version < '3.13' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and sys_platform == 'linux'", + "(python_full_version < '3.13' and platform_machine == 'AMD64' and sys_platform == 'linux') or (python_full_version < '3.13' and platform_machine == 'x86_64' and sys_platform == 'linux')", + "python_full_version < '3.13' and platform_machine != 'AMD64' and platform_machine != 'aarch64' and platform_machine != 'arm64' and platform_machine != 'armv7l' and platform_machine != 's390x' and platform_machine != 'x86_64' and sys_platform == 'linux'", "python_full_version < '3.13' and platform_machine == 's390x' and sys_platform == 'linux'", "(python_full_version >= '3.15' and platform_machine == 'aarch64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'arm64' and sys_platform == 'linux') or (python_full_version >= '3.15' and platform_machine == 'armv7l' and sys_platform == 'linux')", "(python_full_version == '3.14.*' and platform_machine == 'aarch64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'arm64' and sys_platform == 'linux') or (python_full_version == '3.14.*' and platform_machine == 'armv7l' and sys_platform == 'linux')",