Compare commits

...

12 Commits

Author SHA1 Message Date
Francesco Capuano eacb638299 fix: tests 2025-11-08 13:32:08 +00:00
Francesco Capuano 927c6ac3c5 add: parallel, distributed aggregation of multiple datasets with a tree-based thread pool 2025-11-08 13:32:08 +00:00
Francesco Capuano 13a429e5c7 fix: DEFAULT FEATURES must be present when creating metadata. If not, then we raise. this is a first step towards standardazing the dataset format, or otherwise (as it is now) everything would be allowed 2025-11-08 13:32:08 +00:00
Francesco Capuano c87fd37736 add: num workers to dataset tools while you're at it 2025-11-08 13:32:08 +00:00
Francesco Capuano bb5676ee5a add: number of workers for merging datasets 2025-11-08 13:32:07 +00:00
Steven Palma a4aa316470 fix(dataset): fix data access bottleneck for faster training (#2408) 2025-11-07 21:54:44 +01:00
Michel Aractingi f6b16f6d97 fix(dataset_tools) Critical bug in modify features (#2342)
* fix bug in `_copy_data_with_feature_changes`

* Update src/lerobot/datasets/dataset_tools.py

Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
Signed-off-by: Michel Aractingi <michel.aractingi@huggingface.co>

* add missing import

---------

Signed-off-by: Michel Aractingi <michel.aractingi@huggingface.co>
Co-authored-by: Caroline Pascal <caroline8.pascal@gmail.com>
2025-11-04 15:56:41 +01:00
Jade Choghari df0c335a5a feat(sim): EnvHub - allow loading envs from the hub (#2121)
* add env from the hub support

* add safe loading

* changes

* add tests, docs

* more

* style/cleaning

* order

---------

Co-authored-by: Michel Aractingi <michel.aractingi@huggingface.co>
2025-11-04 14:52:46 +01:00
Jade Choghari 87ed3a2b6e dep(upgrade): add libero as a pypi package (#2365)
* add changes

* Update pyproject.toml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Jade Choghari <chogharijade@gmail.com>

* add openpi-transformers

Signed-off-by: Jade Choghari <chogharijade@gmail.com>

* new changes

Signed-off-by: Jade Choghari <chogharijade@gmail.com>

* Update hf-libero version in pyproject.toml

Signed-off-by: Jade Choghari <chogharijade@gmail.com>

---------

Signed-off-by: Jade Choghari <chogharijade@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-04 10:43:52 +01:00
Jade Choghari d57d1aa197 fix(make_policy): rename mapping edge cases in training (#2332)
* fix bug

* update fixes

* add hf license

* more fixes

* add transformers

* iterate on review

* more fixes

* more fixes

* add a False test

* reduce img size

* reduce img size

* skip the test

* add

* add style
2025-10-31 13:08:42 +01:00
Caroline Pascal 3f8c5d9809 fix(video_key typo): fixing video_key typo in update_video_info (#2323) 2025-10-28 09:41:33 +01:00
Steven Palma d1548e1d13 docs(install): imrpove groot and libero installation instructions (#2314) 2025-10-26 15:37:41 +08:00
17 changed files with 1224 additions and 90 deletions
+2
@@ -41,6 +41,8 @@
title: NVIDIA GR00T N1.5
title: "Policies"
- sections:
- local: envhub
title: Environments from the Hub
- local: il_sim
title: Imitation Learning in Sim
- local: libero
+424
@@ -0,0 +1,424 @@
# Loading Environments from the Hub
The **EnvHub** feature allows you to load simulation environments directly from the Hugging Face Hub with a single line of code. This unlocks a powerful new model for collaboration: instead of environments being locked away inside monolithic libraries, anyone can publish custom environments and share them with the community.
## Overview
With EnvHub, you can:
- Load environments from the Hub instantly
- Share your custom simulation tasks with the community
- Version control your environments using Git
- Distribute complex physics simulations without packaging hassles
## Quick Start
Loading an environment from the Hub is as simple as:
```python
from lerobot.envs.factory import make_env
# Load a hub environment (requires explicit consent to run remote code)
env = make_env("lerobot/cartpole-env", trust_remote_code=True)
```
<Tip warning={true}>
**Security Notice**: Loading environments from the Hub executes Python code
from third-party repositories. Only use `trust_remote_code=True` with
repositories you trust. We strongly recommend pinning to a specific commit
hash for reproducibility and security.
</Tip>
## What is EnvHub?
EnvHub is a framework that allows researchers and developers to:
1. **Publish environments** to the Hugging Face Hub as Git repositories
2. **Load environments** dynamically without installing them as packages
3. **Version and track** environment changes using Git semantics
4. **Discover** new simulation tasks shared by the community
This design means you can go from discovering an interesting environment on the Hub to running experiments in seconds, without worrying about dependency conflicts or complex installation procedures.
## Repository Structure
To make your environment loadable from the Hub, your repository must contain at minimum:
### Required Files
**`env.py`** (or custom Python file)
- Must expose a `make_env(n_envs: int, use_async_envs: bool)` function
- This function should return one of:
- A `gym.vector.VectorEnv` (most common)
- A single `gym.Env` (will be automatically wrapped)
- A dict mapping `{suite_name: {task_id: VectorEnv}}` (for multi-task benchmarks)
### Optional Files
**`requirements.txt`**
- List any additional dependencies your environment needs
- Users will need to install these manually before loading your environment (one way to do this is sketched after this list)
**`README.md`**
- Document your environment: what task it implements, observation/action spaces, rewards, etc.
- Include usage examples and any special setup instructions
**`.gitignore`**
- Exclude unnecessary files from your repository
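Since these dependencies are not installed automatically, one way to fetch and install a hub environment's `requirements.txt` before loading it is sketched below (the repo id is illustrative):
```python
import subprocess
import sys

from huggingface_hub import hf_hub_download

# Download the requirements.txt declared by the (illustrative) hub environment
req_file = hf_hub_download(repo_id="username/my-custom-env", filename="requirements.txt")

# Install the listed dependencies into the current Python environment
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", req_file])
```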
### Example Repository Structure
```
my-environment-repo/
├── env.py # Main environment definition (required)
├── requirements.txt # Dependencies (optional)
├── README.md # Documentation (recommended)
├── assets/ # Images, videos, etc. (optional)
│ └── demo.gif
└── configs/ # Config files if needed (optional)
└── task_config.yaml
```
## Creating Your Environment Repository
### Step 1: Define Your Environment
Create an `env.py` file with a `make_env` function:
```python
# env.py
import gymnasium as gym
def make_env(n_envs: int = 1, use_async_envs: bool = False):
"""
Create vectorized environments for your custom task.
Args:
n_envs: Number of parallel environments
use_async_envs: Whether to use AsyncVectorEnv or SyncVectorEnv
Returns:
gym.vector.VectorEnv or dict mapping suite names to vectorized envs
"""
def _make_single_env():
# Create your custom environment
return gym.make("CartPole-v1")
# Choose vector environment type
env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
# Create vectorized environment
vec_env = env_cls([_make_single_env for _ in range(n_envs)])
return vec_env
```
### Step 2: Test Locally
Before uploading, test your environment locally:
```python
from lerobot.envs.utils import _load_module_from_path, _call_make_env, _normalize_hub_result
# Load your module
module = _load_module_from_path("./env.py")
# Test the make_env function
result = _call_make_env(module, n_envs=2, use_async_envs=False)
normalized = _normalize_hub_result(result)
# Verify it works
suite_name = next(iter(normalized))
env = normalized[suite_name][0]
obs, info = env.reset()
print(f"Observation shape: {obs.shape if hasattr(obs, 'shape') else type(obs)}")
env.close()
```
### Step 3: Upload to the Hub
Upload your repository to Hugging Face:
```bash
# Install huggingface_hub if needed
pip install huggingface_hub
# Login to Hugging Face
huggingface-cli login
# Create a new repository
huggingface-cli repo create my-custom-env --type space --org my-org
# Initialize git and push
git init
git add .
git commit -m "Initial environment implementation"
git remote add origin https://huggingface.co/my-org/my-custom-env
git push -u origin main
```
Alternatively, use the `huggingface_hub` Python API:
```python
from huggingface_hub import HfApi
api = HfApi()
# Create repository
api.create_repo("my-custom-env", repo_type="space")
# Upload files
api.upload_folder(
folder_path="./my-env-folder",
repo_id="username/my-custom-env",
repo_type="space",
)
```
## Loading Environments from the Hub
### Basic Usage
```python
from lerobot.envs.factory import make_env
# Load from the hub
envs_dict = make_env(
"username/my-custom-env",
n_envs=4,
trust_remote_code=True
)
# Access the environment
suite_name = next(iter(envs_dict))
env = envs_dict[suite_name][0]
# Use it like any gym environment
obs, info = env.reset()
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
```
### Advanced: Pinning to Specific Versions
For reproducibility and security, pin to a specific Git revision:
```python
# Pin to a specific branch
env = make_env("username/my-env@main", trust_remote_code=True)
# Pin to a specific commit (recommended for papers/experiments)
env = make_env("username/my-env@abc123def456", trust_remote_code=True)
# Pin to a tag
env = make_env("username/my-env@v1.0.0", trust_remote_code=True)
```
### Custom File Paths
If your environment definition is not in `env.py`:
```python
# Load from a custom file
env = make_env("username/my-env:custom_env.py", trust_remote_code=True)
# Combine with version pinning
env = make_env("username/my-env@v1.0:envs/task_a.py", trust_remote_code=True)
```
### Async Environments
For better performance with multiple environments:
```python
envs_dict = make_env(
"username/my-env",
n_envs=8,
use_async_envs=True, # Use AsyncVectorEnv for parallel execution
trust_remote_code=True
)
```
## URL Format Reference
The hub URL format supports several patterns:
| Pattern | Description | Example |
| -------------------- | ------------------------------ | -------------------------------------- |
| `user/repo` | Load `env.py` from main branch | `make_env("lerobot/pusht-env")` |
| `user/repo@revision` | Load from specific revision | `make_env("lerobot/pusht-env@main")` |
| `user/repo:path` | Load custom file | `make_env("lerobot/envs:pusht.py")` |
| `user/repo@rev:path` | Revision + custom file | `make_env("lerobot/envs@v1:pusht.py")` |
## Multi-Task Environments
For benchmarks with multiple tasks (like LIBERO), return a nested dictionary:
```python
def make_env(n_envs: int = 1, use_async_envs: bool = False):
env_cls = gym.vector.AsyncVectorEnv if use_async_envs else gym.vector.SyncVectorEnv
# Return dict: {suite_name: {task_id: VectorEnv}}
return {
"suite_1": {
0: env_cls([lambda: gym.make("Task1-v0") for _ in range(n_envs)]),
1: env_cls([lambda: gym.make("Task2-v0") for _ in range(n_envs)]),
},
"suite_2": {
0: env_cls([lambda: gym.make("Task3-v0") for _ in range(n_envs)]),
}
}
```
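Assuming such a multi-task environment were published as `username/multi-task-env` (an illustrative repo id), a consumer could iterate over suites and tasks in the returned mapping:
```python
from lerobot.envs.factory import make_env

# Load the multi-task environment from the Hub (repo id is illustrative)
envs_dict = make_env("username/multi-task-env", n_envs=2, trust_remote_code=True)

for suite_name, tasks in envs_dict.items():
    for task_id, vec_env in tasks.items():
        obs, info = vec_env.reset()
        print(f"{suite_name} / task {task_id}: {vec_env.num_envs} parallel envs")
        vec_env.close()
```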
## Security Considerations
<Tip warning={true}>
**Important**: The `trust_remote_code=True` flag is required to execute
environment code from the Hub. This is by design for security.
</Tip>
When loading environments from the Hub:
1. **Review the code first**: Visit the repository and inspect `env.py` before loading
2. **Pin to commits**: Use specific commit hashes for reproducibility
3. **Check dependencies**: Review `requirements.txt` for suspicious packages
4. **Use trusted sources**: Prefer official organizations or well-known researchers
5. **Sandbox if needed**: Run untrusted code in isolated environments (containers, VMs)
Example of safe usage:
```python
# ❌ BAD: Loading without inspection
env = make_env("random-user/untrusted-env", trust_remote_code=True)
# ✅ GOOD: Review code, then pin to specific commit
# 1. Visit https://huggingface.co/trusted-org/verified-env
# 2. Review the env.py file
# 3. Copy the commit hash
env = make_env("trusted-org/verified-env@a1b2c3d4", trust_remote_code=True)
```
## Example: CartPole from the Hub
Here's a complete example using the reference CartPole environment:
```python
from lerobot.envs.factory import make_env
import numpy as np
# Load the environment
envs_dict = make_env("lerobot/cartpole-env", n_envs=4, trust_remote_code=True)
# Get the vectorized environment
suite_name = next(iter(envs_dict))
env = envs_dict[suite_name][0]
# Run a simple episode
obs, info = env.reset()
done = np.zeros(env.num_envs, dtype=bool)
total_reward = np.zeros(env.num_envs)
while not done.all():
# Random policy
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
total_reward += reward
done = terminated | truncated
print(f"Average reward: {total_reward.mean():.2f}")
env.close()
```
## Benefits of EnvHub
### For Environment Authors
- **Easy distribution**: No PyPI packaging required
- **Version control**: Use Git for environment versioning
- **Rapid iteration**: Push updates instantly
- **Documentation**: Hub README renders beautifully
- **Community**: Reach LeRobot users directly
### For Researchers
- **Quick experiments**: Load any environment in one line
- **Reproducibility**: Pin to specific commits
- **Discovery**: Browse environments on the Hub
- **No conflicts**: No need to install conflicting packages
### For the Community
- **Growing ecosystem**: More diverse simulation tasks
- **Standardization**: Common `make_env` API
- **Collaboration**: Fork and improve existing environments
- **Accessibility**: Lower barrier to sharing research
## Troubleshooting
### "Refusing to execute remote code"
You must explicitly pass `trust_remote_code=True`:
```python
env = make_env("user/repo", trust_remote_code=True)
```
### "Module X not found"
The hub environment has dependencies you need to install:
```bash
# Check the repo's requirements.txt and install dependencies
pip install gymnasium numpy
```
### "make_env not found in module"
Your `env.py` must expose a `make_env` function:
```python
def make_env(n_envs: int, use_async_envs: bool):
# Your implementation
pass
```
### Environment returns wrong type
The `make_env` function must return one of the following (a minimal sketch follows the list):
- A `gym.vector.VectorEnv`, or
- A single `gym.Env`, or
- A dict `{suite_name: {task_id: VectorEnv}}`
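For instance, a minimal `make_env` that returns a single environment is valid; the loader wraps it into a `SyncVectorEnv` automatically:
```python
import gymnasium as gym

def make_env(n_envs: int = 1, use_async_envs: bool = False):
    # Returning a single gym.Env is accepted; it will be wrapped into a
    # SyncVectorEnv (effectively one environment) by the loader.
    return gym.make("CartPole-v1")
```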
## Best Practices
1. **Document your environment**: Include observation/action space descriptions, reward structure, and termination conditions in your README
2. **Add requirements.txt**: List all dependencies with versions
3. **Test thoroughly**: Verify your environment works locally before pushing
4. **Use semantic versioning**: Tag releases with version numbers (see the tagging example after this list)
5. **Add examples**: Include usage examples in your README
6. **Keep it simple**: Minimize dependencies when possible
7. **License your work**: Add a LICENSE file to clarify usage terms
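As an illustration of the semantic-versioning practice above, a release tag can be created on the Hub and then pinned by users; the repo id and tag below are illustrative:
```python
from huggingface_hub import HfApi

api = HfApi()

# Tag the current revision of the (illustrative) environment repo as a release
api.create_tag(
    "username/my-custom-env",
    tag="v1.0.0",
    tag_message="First stable release",
    repo_type="space",
)

# Users can then pin to that tag when loading:
# env = make_env("username/my-custom-env@v1.0.0", trust_remote_code=True)
```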
## Future Directions
The EnvHub ecosystem enables exciting possibilities:
- **GPU-accelerated physics**: Share Isaac Gym or Brax environments
- **Photorealistic rendering**: Distribute environments with advanced graphics
- **Multi-agent scenarios**: Complex interaction tasks
- **Real-world simulators**: Digital twins of physical setups
- **Procedural generation**: Infinite task variations
- **Domain randomization**: Pre-configured DR pipelines
As more researchers and developers contribute, the diversity and quality of available environments will grow, benefiting the entire robotics learning community.
## See Also
- [Hugging Face Hub Documentation](https://huggingface.co/docs/hub/en/index)
- [Gymnasium Documentation](https://gymnasium.farama.org/index.html)
- [Example Hub Environment](https://huggingface.co/lerobot/cartpole-env)
+4 -1
@@ -40,7 +40,7 @@ python -c "import flash_attn; print(f'Flash Attention {flash_attn.__version__} i
3. Install LeRobot by running:
```bash
pip install lerobot[groot] # consider also installing libero,dev and test tags
pip install lerobot[groot]
```
## Usage
@@ -83,6 +83,9 @@ accelerate launch \
### Libero Benchmark Results
> [!NOTE]
> Follow our instructions for Libero usage: [Libero](./libero)
GR00T has demonstrated strong performance on the Libero benchmark suite. To compare and test its LeRobot implementation, we finetuned the GR00T N1.5 model for 30k steps on the Libero dataset and compared the results to the GR00T reference results.
| Benchmark | LeRobot Implementation | GR00T Reference |
+5
@@ -28,6 +28,11 @@ LIBERO is now part of our **multi-eval supported simulation**, meaning you can b
To Install LIBERO, after following LeRobot official instructions, just do:
`pip install -e ".[libero]"`
> [!NOTE]
> For lerobot 0.4.0, if you want to install the `libero` tag, you will have to run: `pip install "lerobot[libero]@git+https://github.com/huggingface/lerobot.git"`.
>
> This will be fixed in the next patch release.
### Single-suite evaluation
Evaluate a policy on one LIBERO suite:
+5
@@ -28,6 +28,11 @@ As described by Physical Intelligence, while AI has achieved remarkable success
pip install -e ".[pi]"
```
> [!NOTE]
> For lerobot 0.4.0, if you want to install the `pi` tag, you will have to run: `pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git"`.
>
> This will be fixed in the next patch release.
## Training Data and Capabilities
π₀ is trained on the largest robot interaction dataset to date, combining three key data sources:
+5
@@ -36,6 +36,11 @@ This diverse training mixture creates a "curriculum" that enables generalization
pip install -e ".[pi]"
```
> [!NOTE]
> For lerobot 0.4.0, if you want to install the `pi` tag, you will have to run: `pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git"`.
>
> This will be fixed in the next patch release.
## Usage
To use π₀.₅ in your LeRobot configuration, specify the policy type as:
+1 -1
@@ -142,7 +142,7 @@ video_benchmark = ["scikit-image>=0.23.2,<0.26.0", "pandas>=2.2.2,<2.4.0"]
# Simulation
aloha = ["gym-aloha>=0.1.2,<0.2.0"]
pusht = ["gym-pusht>=0.1.5,<0.2.0", "pymunk>=6.6.0,<7.0.0"] # TODO: Fix pymunk version in gym-pusht instead
libero = ["lerobot[transformers-dep]", "libero @ git+https://github.com/huggingface/lerobot-libero.git@main#egg=libero"]
libero = ["lerobot[transformers-dep]", "hf-libero>=0.1.3,<0.2.0"]
metaworld = ["metaworld==3.0.0"]
# All
+217 -41
@@ -15,8 +15,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import contextlib
import logging
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import pandas as pd
@@ -107,6 +109,7 @@ def update_meta_data(
dst_meta,
meta_idx,
data_idx,
data_file_map,
videos_idx,
):
"""Updates metadata DataFrame with new chunk, file, and timestamp indices.
@@ -127,8 +130,25 @@ def update_meta_data(
df["meta/episodes/chunk_index"] = df["meta/episodes/chunk_index"] + meta_idx["chunk"]
df["meta/episodes/file_index"] = df["meta/episodes/file_index"] + meta_idx["file"]
df["data/chunk_index"] = df["data/chunk_index"] + data_idx["chunk"]
df["data/file_index"] = df["data/file_index"] + data_idx["file"]
# Remap data chunk/file indices per-source-file using the actual destination
# file chosen during data aggregation. A flat offset is incorrect when
# multiple source files are concatenated into a single destination file.
if data_file_map:
new_data_chunk = []
new_data_file = []
for idx in df.index:
src_chunk = int(df.at[idx, "data/chunk_index"]) # original source file location
src_file = int(df.at[idx, "data/file_index"]) # original source file location
dst_chunk, dst_file = data_file_map.get(
(src_chunk, src_file), (src_chunk + data_idx["chunk"], src_file + data_idx["file"])
)
new_data_chunk.append(dst_chunk)
new_data_file.append(dst_file)
df["data/chunk_index"] = new_data_chunk
df["data/file_index"] = new_data_file
else:
df["data/chunk_index"] = df["data/chunk_index"] + data_idx["chunk"]
df["data/file_index"] = df["data/file_index"] + data_idx["file"]
for key, video_idx in videos_idx.items():
# Store original video file indices before updating
orig_chunk_col = f"videos/{key}/chunk_index"
@@ -166,7 +186,7 @@ def update_meta_data(
return df
def aggregate_datasets(
def _aggregate_datasets(
repo_ids: list[str],
aggr_repo_id: str,
roots: list[Path] | None = None,
@@ -175,39 +195,24 @@ def aggregate_datasets(
video_files_size_in_mb: float | None = None,
chunk_size: int | None = None,
):
"""Aggregates multiple LeRobot datasets into a single unified dataset.
"""Serial aggregation kernel: combines datasets into a destination dataset.
This is the main function that orchestrates the aggregation process by:
1. Loading and validating all source dataset metadata
2. Creating a new destination dataset with unified tasks
3. Aggregating videos, data, and metadata from all source datasets
4. Finalizing the aggregated dataset with proper statistics
Args:
repo_ids: List of repository IDs for the datasets to aggregate.
aggr_repo_id: Repository ID for the aggregated output dataset.
roots: Optional list of root paths for the source datasets.
aggr_root: Optional root path for the aggregated dataset.
data_files_size_in_mb: Maximum size for data files in MB (defaults to DEFAULT_DATA_FILE_SIZE_IN_MB)
video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
This function performs a single-process aggregation. It assumes it is the
sole writer for its destination `aggr_root`.
"""
logging.info("Start aggregate_datasets")
if data_files_size_in_mb is None:
data_files_size_in_mb = DEFAULT_DATA_FILE_SIZE_IN_MB
if video_files_size_in_mb is None:
video_files_size_in_mb = DEFAULT_VIDEO_FILE_SIZE_IN_MB
if chunk_size is None:
chunk_size = DEFAULT_CHUNK_SIZE
all_metadata = (
[LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
if roots is None
else [
LeRobotDatasetMetadata(repo_id, root=root) for repo_id, root in zip(repo_ids, roots, strict=False)
# Build metadata objects, supporting a per-dataset "root" that may be None.
# When root is provided we load from the local filesystem, otherwise from Hub cache.
if roots is None:
all_metadata = [LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
else:
all_metadata = [
(
LeRobotDatasetMetadata(repo_id, root=root)
if root is not None
else LeRobotDatasetMetadata(repo_id)
)
for repo_id, root in zip(repo_ids, roots, strict=False)
]
)
fps, robot_type, features = validate_all_metadata(all_metadata)
video_keys = [key for key in features if features[key]["dtype"] == "video"]
@@ -237,9 +242,11 @@ def aggregate_datasets(
for src_meta in tqdm.tqdm(all_metadata, desc="Copy data and videos"):
videos_idx = aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size)
data_idx = aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size)
data_idx, data_file_map = aggregate_data(
src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_size
)
meta_idx = aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx)
meta_idx = aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, data_file_map, videos_idx)
dst_meta.info["total_episodes"] += src_meta.total_episodes
dst_meta.info["total_frames"] += src_meta.total_frames
@@ -248,6 +255,168 @@ def aggregate_datasets(
logging.info("Aggregation complete.")
def aggregate_datasets(
repo_ids: list[str],
aggr_repo_id: str,
roots: list[Path] | None = None,
aggr_root: Path | None = None,
data_files_size_in_mb: float | None = None,
video_files_size_in_mb: float | None = None,
chunk_size: int | None = None,
num_workers: int | None = None,
tmp_root: Path | None = None,
):
"""Aggregates multiple LeRobot datasets into a single unified dataset.
This is the main function that orchestrates the aggregation process by:
1. Loading and validating all source dataset metadata
2. Creating a new destination dataset with unified tasks
3. Aggregating videos, data, and metadata from all source datasets
4. Finalizing the aggregated dataset with proper statistics
Args:
repo_ids: List of repository IDs for the datasets to aggregate.
aggr_repo_id: Repository ID for the aggregated output dataset.
roots: Optional list of root paths for the source datasets.
aggr_root: Optional root path for the aggregated dataset.
data_files_size_in_mb: Maximum size for data files in MB (defaults to DEFAULT_DATA_FILE_SIZE_IN_MB)
video_files_size_in_mb: Maximum size for video files in MB (defaults to DEFAULT_VIDEO_FILE_SIZE_IN_MB)
chunk_size: Maximum number of files per chunk (defaults to DEFAULT_CHUNK_SIZE)
num_workers: When > 1, performs a tree-based parallel reduction using a thread pool
tmp_root: Optional base directory to store intermediate reduction outputs
"""
logging.info("Start aggregate_datasets")
if data_files_size_in_mb is None:
data_files_size_in_mb = DEFAULT_DATA_FILE_SIZE_IN_MB
if video_files_size_in_mb is None:
video_files_size_in_mb = DEFAULT_VIDEO_FILE_SIZE_IN_MB
if chunk_size is None:
chunk_size = DEFAULT_CHUNK_SIZE
if num_workers is None or num_workers <= 1:
# Run aggregation sequentially
_aggregate_datasets(
repo_ids=repo_ids,
aggr_repo_id=aggr_repo_id,
aggr_root=aggr_root,
roots=roots,
data_files_size_in_mb=data_files_size_in_mb,
video_files_size_in_mb=video_files_size_in_mb,
chunk_size=chunk_size,
)
# Uses a parallel fan-out/fan-in strategy when num_workers is provided
elif num_workers > 1:
# Validate across all metadata early to fail fast
all_metadata_for_validation = (
[LeRobotDatasetMetadata(repo_id) for repo_id in repo_ids]
if roots is None
else [
LeRobotDatasetMetadata(repo_id, root=root)
for repo_id, root in zip(repo_ids, roots, strict=False)
]
)
validate_all_metadata(all_metadata_for_validation)
# Clamp workers to a sensible upper bound (pairs per round)
num_workers = min(num_workers, max(1, len(repo_ids) // 2))
# Choose a base temporary root for intermediate merge results
if tmp_root is not None:
base_tmp_root = tmp_root
elif aggr_root is not None:
base_tmp_root = aggr_root.parent / f".{aggr_repo_id}__tmp"
else:
base_tmp_root = Path.cwd() / f".{aggr_repo_id}__tmp"
base_tmp_root.mkdir(parents=True, exist_ok=True)
current_repo_ids: list[str] = list(repo_ids)
# Always maintain a roots list aligned with repo_ids. Use None for Hub-backed inputs.
current_roots: list[Path | None] = list(roots) if roots is not None else [None] * len(repo_ids)
try:
level = 0
while len(current_repo_ids) > 1:
next_repo_ids: list[str] = []
next_roots: list[Path | None] = []
futures = []
with ThreadPoolExecutor(max_workers=num_workers) as executor:
group_index = 0
i = 0
while i < len(current_repo_ids):
group_repo_ids = current_repo_ids[i : i + 2]
group_roots = current_roots[i : i + 2]
if len(group_repo_ids) == 1:
# Carry over singleton to next level
next_repo_ids.append(group_repo_ids[0])
next_roots.append(group_roots[0])
i += 1
continue
out_repo_id = f"{aggr_repo_id}__reduce_l{level}_g{group_index}"
out_root = base_tmp_root / f"reduce_l{level}_g{group_index}"
futures.append(
executor.submit(
_aggregate_datasets,
group_repo_ids,
out_repo_id,
group_roots,
out_root,
data_files_size_in_mb,
video_files_size_in_mb,
chunk_size,
)
)
next_repo_ids.append(out_repo_id)
next_roots.append(out_root)
i += 2
group_index += 1
for f in as_completed(futures):
# Bubble up any exception raised inside tasks
f.result()
# Cleanup previous level temporary outputs that won't be used again
base_resolved = base_tmp_root.resolve()
keep_set = {nr.resolve() for nr in next_roots if nr is not None}
for prev_root in current_roots:
if prev_root is None:
continue
# Suppress per-iteration to keep cleaning other roots even if one fails
with contextlib.suppress(Exception):
pr = prev_root.resolve()
if pr not in keep_set and base_resolved in pr.parents:
shutil.rmtree(prev_root, ignore_errors=True)
current_repo_ids = next_repo_ids
current_roots = next_roots # aligned list of Path|None after first level
level += 1
# Final copy/aggregation into the desired output
_aggregate_datasets(
repo_ids=current_repo_ids,
aggr_repo_id=aggr_repo_id,
roots=current_roots,
aggr_root=aggr_root,
data_files_size_in_mb=data_files_size_in_mb,
video_files_size_in_mb=video_files_size_in_mb,
chunk_size=chunk_size,
)
finally:
# Remove all temporary reduction artifacts
with contextlib.suppress(Exception):
shutil.rmtree(base_tmp_root, ignore_errors=True)
logging.info("Aggregation complete.")
return
def aggregate_videos(src_meta, dst_meta, videos_idx, video_files_size_in_mb, chunk_size):
"""Aggregates video chunks from a source dataset into the destination dataset.
@@ -366,6 +535,9 @@ def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_si
unique_chunk_file_ids = sorted(unique_chunk_file_ids)
# Map source (chunk,file) -> destination (chunk,file) actually used during write
src_to_dst_file: dict[tuple[int, int], tuple[int, int]] = {}
for src_chunk_idx, src_file_idx in unique_chunk_file_ids:
src_path = src_meta.root / DEFAULT_DATA_PATH.format(
chunk_index=src_chunk_idx, file_index=src_file_idx
@@ -373,7 +545,7 @@ def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_si
df = pd.read_parquet(src_path)
df = update_data_df(df, src_meta, dst_meta)
data_idx = append_or_create_parquet_file(
data_idx, used_chunk, used_file = append_or_create_parquet_file(
df,
src_path,
data_idx,
@@ -383,11 +555,12 @@ def aggregate_data(src_meta, dst_meta, data_idx, data_files_size_in_mb, chunk_si
contains_images=len(dst_meta.image_keys) > 0,
aggr_root=dst_meta.root,
)
src_to_dst_file[(src_chunk_idx, src_file_idx)] = (used_chunk, used_file)
return data_idx
return data_idx, src_to_dst_file
def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx):
def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, data_file_map, videos_idx):
"""Aggregates metadata from a source dataset into the destination dataset.
Reads source metadata files, updates all indices and timestamps,
@@ -421,10 +594,11 @@ def aggregate_metadata(src_meta, dst_meta, meta_idx, data_idx, videos_idx):
dst_meta,
meta_idx,
data_idx,
data_file_map,
videos_idx,
)
meta_idx = append_or_create_parquet_file(
meta_idx, _m_used_chunk, _m_used_file = append_or_create_parquet_file(
df,
src_path,
meta_idx,
@@ -478,7 +652,7 @@ def append_or_create_parquet_file(
to_parquet_with_hf_images(df, dst_path)
else:
df.to_parquet(dst_path)
return idx
return idx, idx["chunk"], idx["file"]
src_size = get_parquet_file_size_in_mb(src_path)
dst_size = get_parquet_file_size_in_mb(dst_path)
@@ -489,17 +663,19 @@ def append_or_create_parquet_file(
new_path.parent.mkdir(parents=True, exist_ok=True)
final_df = df
target_path = new_path
used_chunk, used_file = idx["chunk"], idx["file"]
else:
existing_df = pd.read_parquet(dst_path)
final_df = pd.concat([existing_df, df], ignore_index=True)
target_path = dst_path
used_chunk, used_file = idx["chunk"], idx["file"]
if contains_images:
to_parquet_with_hf_images(final_df, target_path)
else:
final_df.to_parquet(target_path)
return idx
return idx, used_chunk, used_file
def finalize_aggregation(aggr_meta, all_metadata):
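For context, a hedged sketch of how the new `num_workers` argument to `aggregate_datasets` (whose signature appears in the diff above) might be invoked; the repo ids and output root are illustrative:
```python
from pathlib import Path

from lerobot.datasets.aggregate import aggregate_datasets

# Merge four datasets using the tree-based parallel reduction (num_workers > 1);
# repo ids and the output root below are illustrative.
aggregate_datasets(
    repo_ids=["user/ds_a", "user/ds_b", "user/ds_c", "user/ds_d"],
    aggr_repo_id="user/ds_merged",
    aggr_root=Path("/data/ds_merged"),
    num_workers=2,
)
```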
+17 -19
@@ -39,6 +39,7 @@ from lerobot.datasets.aggregate import aggregate_datasets
from lerobot.datasets.compute_stats import aggregate_stats
from lerobot.datasets.lerobot_dataset import LeRobotDataset, LeRobotDatasetMetadata
from lerobot.datasets.utils import (
DATA_DIR,
DEFAULT_CHUNK_SIZE,
DEFAULT_DATA_FILE_SIZE_IN_MB,
DEFAULT_DATA_PATH,
@@ -233,6 +234,7 @@ def merge_datasets(
datasets: list[LeRobotDataset],
output_repo_id: str,
output_dir: str | Path | None = None,
num_workers: int | None = None,
) -> LeRobotDataset:
"""Merge multiple LeRobotDatasets into a single dataset.
@@ -256,6 +258,7 @@ def merge_datasets(
aggr_repo_id=output_repo_id,
roots=roots,
aggr_root=output_dir,
num_workers=num_workers,
)
merged_dataset = LeRobotDataset(
@@ -328,7 +331,7 @@ def modify_features(
if repo_id is None:
repo_id = f"{dataset.repo_id}_modified"
output_dir = Path(output_dir) if output_dir is not None else HF_LEROBOT_HOME / repo_id
output_dir = Path(output_dir, exists_ok=True) if output_dir is not None else HF_LEROBOT_HOME / repo_id
new_features = dataset.meta.features.copy()
@@ -962,28 +965,23 @@ def _copy_data_with_feature_changes(
remove_features: list[str] | None = None,
) -> None:
"""Copy data while adding or removing features."""
if dataset.meta.episodes is None:
dataset.meta.episodes = load_episodes(dataset.meta.root)
data_dir = dataset.root / DATA_DIR
parquet_files = sorted(data_dir.glob("*/*.parquet"))
# Map file paths to episode indices to extract chunk/file indices
file_to_episodes: dict[Path, set[int]] = {}
for ep_idx in range(dataset.meta.total_episodes):
file_path = dataset.meta.get_data_file_path(ep_idx)
if file_path not in file_to_episodes:
file_to_episodes[file_path] = set()
file_to_episodes[file_path].add(ep_idx)
if not parquet_files:
raise ValueError(f"No parquet files found in {data_dir}")
frame_idx = 0
for src_path in tqdm(sorted(file_to_episodes.keys()), desc="Processing data files"):
df = pd.read_parquet(dataset.root / src_path).reset_index(drop=True)
for src_path in tqdm(parquet_files, desc="Processing data files"):
df = pd.read_parquet(src_path).reset_index(drop=True)
# Get chunk_idx and file_idx from the source file's first episode
episodes_in_file = file_to_episodes[src_path]
first_ep_idx = min(episodes_in_file)
src_ep = dataset.meta.episodes[first_ep_idx]
chunk_idx = src_ep["data/chunk_index"]
file_idx = src_ep["data/file_index"]
relative_path = src_path.relative_to(dataset.root)
chunk_dir = relative_path.parts[1]
file_name = relative_path.parts[2]
chunk_idx = int(chunk_dir.split("-")[1])
file_idx = int(file_name.split("-")[1].split(".")[0])
if remove_features:
df = df.drop(columns=remove_features, errors="ignore")
@@ -1009,7 +1007,7 @@ def _copy_data_with_feature_changes(
df[feature_name] = feature_slice
frame_idx = end_idx
# Write using the preserved chunk_idx and file_idx from source
# Write using the same chunk/file structure as source
dst_path = new_meta.root / DEFAULT_DATA_PATH.format(chunk_index=chunk_idx, file_index=file_idx)
dst_path.parent.mkdir(parents=True, exist_ok=True)
+21 -8
@@ -430,9 +430,7 @@ class LeRobotDatasetMetadata:
video_keys = [video_key] if video_key is not None else self.video_keys
for key in video_keys:
if not self.features[key].get("info", None):
video_path = self.root / self.video_path.format(
video_key=video_key, chunk_index=0, file_index=0
)
video_path = self.root / self.video_path.format(video_key=key, chunk_index=0, file_index=0)
self.info["features"][key]["info"] = get_video_info(video_path)
def update_chunk_settings(
@@ -942,11 +940,26 @@ class LeRobotDataset(torch.utils.data.Dataset):
return query_timestamps
def _query_hf_dataset(self, query_indices: dict[str, list[int]]) -> dict:
return {
key: torch.stack(self.hf_dataset[q_idx][key])
for key, q_idx in query_indices.items()
if key not in self.meta.video_keys
}
"""
Query dataset for indices across keys, skipping video keys.
Tries column-first [key][indices] for speed, falls back to row-first.
Args:
query_indices: Dict mapping keys to index lists to retrieve
Returns:
Dict with stacked tensors of queried data (video keys excluded)
"""
result: dict = {}
for key, q_idx in query_indices.items():
if key in self.meta.video_keys:
continue
try:
result[key] = torch.stack(self.hf_dataset[key][q_idx])
except (KeyError, TypeError, IndexError):
result[key] = torch.stack(self.hf_dataset[q_idx][key])
return result
def _query_videos(self, query_timestamps: dict[str, list[float]], ep_idx: int) -> dict[str, torch.Tensor]:
"""Note: When using data workers (e.g. DataLoader with num_workers>0), do not call this function
+28 -3
@@ -19,6 +19,7 @@ import gymnasium as gym
from gymnasium.envs.registration import registry as gym_registry
from lerobot.envs.configs import AlohaEnv, EnvConfig, LiberoEnv, PushtEnv
from lerobot.envs.utils import _call_make_env, _download_hub_file, _import_hub_module, _normalize_hub_result
def make_env_config(env_type: str, **kwargs) -> EnvConfig:
@@ -33,15 +34,24 @@ def make_env_config(env_type: str, **kwargs) -> EnvConfig:
def make_env(
cfg: EnvConfig, n_envs: int = 1, use_async_envs: bool = False
cfg: EnvConfig | str,
n_envs: int = 1,
use_async_envs: bool = False,
hub_cache_dir: str | None = None,
trust_remote_code: bool = False,
) -> dict[str, dict[int, gym.vector.VectorEnv]]:
"""Makes a gym vector environment according to the config.
"""Makes a gym vector environment according to the config or Hub reference.
Args:
cfg (EnvConfig): the config of the environment to instantiate.
cfg (EnvConfig | str): Either an `EnvConfig` object describing the environment to build locally,
or a Hugging Face Hub repository identifier (e.g. `"username/repo"`). In the latter case,
the repo must include a Python file (usually `env.py`).
n_envs (int, optional): The number of parallelized env to return. Defaults to 1.
use_async_envs (bool, optional): Whether to return an AsyncVectorEnv or a SyncVectorEnv. Defaults to
False.
hub_cache_dir (str | None): Optional cache path for downloaded hub files.
trust_remote_code (bool): **Explicit consent** to execute remote code from the Hub.
Default False must be set to True to import/exec hub `env.py`.
Raises:
ValueError: if n_envs < 1
@@ -54,6 +64,21 @@ def make_env(
- For single-task environments: a single suite entry (cfg.type) with task_id=0.
"""
# if user passed a hub id string (e.g., "username/repo", "username/repo@main:env.py")
# simplified: only support hub-provided `make_env`
if isinstance(cfg, str):
# _download_hub_file will raise the same RuntimeError if trust_remote_code is False
repo_id, file_path, local_file, revision = _download_hub_file(cfg, trust_remote_code, hub_cache_dir)
# import and surface clear import errors
module = _import_hub_module(local_file, repo_id)
# call the hub-provided make_env
raw_result = _call_make_env(module, n_envs=n_envs, use_async_envs=use_async_envs)
# normalize the return into {suite: {task_id: vec_env}}
return _normalize_hub_result(raw_result)
if n_envs < 1:
raise ValueError("`n_envs` must be at least 1")
+132
@@ -13,6 +13,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import importlib.util
import os
import warnings
from collections.abc import Mapping, Sequence
from functools import singledispatch
@@ -22,6 +24,7 @@ import einops
import gymnasium as gym
import numpy as np
import torch
from huggingface_hub import hf_hub_download, snapshot_download
from torch import Tensor
from lerobot.configs.types import FeatureType, PolicyFeature
@@ -195,3 +198,132 @@ def _(envs: Sequence) -> None:
@close_envs.register
def _(env: gym.Env) -> None:
_close_single_env(env)
# helper to safely load a python file as a module
def _load_module_from_path(path: str, module_name: str | None = None):
module_name = module_name or f"hub_env_{os.path.basename(path).replace('.', '_')}"
spec = importlib.util.spec_from_file_location(module_name, path)
if spec is None:
raise ImportError(f"Could not load module spec for {module_name} from {path}")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module) # type: ignore
return module
# helper to parse hub string (supports "user/repo", "user/repo@rev", optional path)
# examples:
# "user/repo" -> will look for env.py at repo root
# "user/repo@main:envs/my_env.py" -> explicit revision and path
def _parse_hub_url(hub_uri: str):
# very small parser: [repo_id][@revision][:path]
# repo_id is required (user/repo or org/repo)
revision = None
file_path = "env.py"
if "@" in hub_uri:
repo_and_rev, *rest = hub_uri.split(":", 1)
repo_id, rev = repo_and_rev.split("@", 1)
revision = rev
if rest:
file_path = rest[0]
else:
repo_id, *rest = hub_uri.split(":", 1)
if rest:
file_path = rest[0]
return repo_id, revision, file_path
def _download_hub_file(
cfg_str: str,
trust_remote_code: bool,
hub_cache_dir: str | None,
) -> tuple[str, str, str, str]:
"""
Parse `cfg_str` (hub URL), enforce `trust_remote_code`, and return
(repo_id, file_path, local_file, revision).
"""
if not trust_remote_code:
raise RuntimeError(
f"Refusing to execute remote code from the Hub for '{cfg_str}'. "
"Executing hub env modules runs arbitrary Python code from third-party repositories. "
"If you trust this repo and understand the risks, call `make_env(..., trust_remote_code=True)` "
"and prefer pinning to a specific revision: 'user/repo@<commit-hash>:env.py'."
)
repo_id, revision, file_path = _parse_hub_url(cfg_str)
try:
local_file = hf_hub_download(
repo_id=repo_id, filename=file_path, revision=revision, cache_dir=hub_cache_dir
)
except Exception as e:
# fallback to snapshot download
snapshot_dir = snapshot_download(repo_id=repo_id, revision=revision, cache_dir=hub_cache_dir)
local_file = os.path.join(snapshot_dir, file_path)
if not os.path.exists(local_file):
raise FileNotFoundError(
f"Could not find {file_path} in repository {repo_id}@{revision or 'main'}"
) from e
return repo_id, file_path, local_file, revision
def _import_hub_module(local_file: str, repo_id: str) -> Any:
"""
Import the downloaded file as a module and surface helpful import error messages.
"""
module_name = f"hub_env_{repo_id.replace('/', '_')}"
try:
module = _load_module_from_path(local_file, module_name=module_name)
except ModuleNotFoundError as e:
missing = getattr(e, "name", None) or str(e)
raise ModuleNotFoundError(
f"Hub env '{repo_id}:{os.path.basename(local_file)}' failed to import because the dependency "
f"'{missing}' is not installed locally.\n\n"
) from e
except ImportError as e:
raise ImportError(
f"Failed to load hub env module '{repo_id}:{os.path.basename(local_file)}'. Import error: {e}\n\n"
) from e
return module
def _call_make_env(module: Any, n_envs: int, use_async_envs: bool) -> Any:
"""
Ensure module exposes make_env and call it.
"""
if not hasattr(module, "make_env"):
raise AttributeError(
f"The hub module {getattr(module, '__name__', 'hub_module')} must expose `make_env(n_envs=int, use_async_envs=bool)`."
)
entry_fn = module.make_env
return entry_fn(n_envs=n_envs, use_async_envs=use_async_envs)
def _normalize_hub_result(result: Any) -> dict[str, dict[int, gym.vector.VectorEnv]]:
"""
Normalize possible return types from hub `make_env` into the mapping:
{ suite_name: { task_id: vector_env } }
Accepts:
- dict (assumed already correct)
- gym.vector.VectorEnv
- gym.Env (will be wrapped into SyncVectorEnv)
"""
if isinstance(result, dict):
return result
# VectorEnv: use its spec.id if available
if isinstance(result, gym.vector.VectorEnv):
suite_name = getattr(result, "spec", None) and getattr(result.spec, "id", None) or "hub_env"
return {suite_name: {0: result}}
# Single Env: wrap into SyncVectorEnv
if isinstance(result, gym.Env):
vec = gym.vector.SyncVectorEnv([lambda: result])
suite_name = getattr(result, "spec", None) and getattr(result.spec, "id", None) or "hub_env"
return {suite_name: {0: vec}}
raise ValueError(
"Hub `make_env` must return either a mapping {suite: {task_id: vec_env}}, "
"a gym.vector.VectorEnv, or a single gym.Env."
)
+4 -16
@@ -38,6 +38,7 @@ from lerobot.policies.sac.configuration_sac import SACConfig
from lerobot.policies.sac.reward_model.configuration_classifier import RewardClassifierConfig
from lerobot.policies.smolvla.configuration_smolvla import SmolVLAConfig
from lerobot.policies.tdmpc.configuration_tdmpc import TDMPCConfig
from lerobot.policies.utils import validate_visual_features_consistency
from lerobot.policies.vqbet.configuration_vqbet import VQBeTConfig
from lerobot.processor import PolicyAction, PolicyProcessorPipeline
from lerobot.processor.converters import (
@@ -420,20 +421,7 @@ def make_policy(
# policy = torch.compile(policy, mode="reduce-overhead")
if not rename_map:
expected_features = set(cfg.input_features.keys()) | set(cfg.output_features.keys())
provided_features = set(features.keys())
if expected_features and provided_features != expected_features:
missing = expected_features - provided_features
extra = provided_features - expected_features
# TODO (jadechoghari): provide a dynamic rename map suggestion to the user.
raise ValueError(
f"Feature mismatch between dataset/environment and policy config.\n"
f"- Missing features: {sorted(missing) if missing else 'None'}\n"
f"- Extra features: {sorted(extra) if extra else 'None'}\n\n"
f"Please ensure your dataset and policy use consistent feature names.\n"
f"If your dataset uses different observation keys (e.g., cameras named differently), "
f"use the `--rename_map` argument, for example:\n"
f' --rename_map=\'{{"observation.images.left": "observation.images.camera1", '
f'"observation.images.top": "observation.images.camera2"}}\''
)
validate_visual_features_consistency(cfg, features)
# TODO: (jadechoghari) - add a check_state(cfg, features) and check_action(cfg, features)
return policy
+41
@@ -22,6 +22,8 @@ import numpy as np
import torch
from torch import nn
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import FeatureType, PolicyFeature
from lerobot.datasets.utils import build_dataset_frame
from lerobot.processor import PolicyAction, RobotAction, RobotObservation
from lerobot.utils.constants import ACTION, OBS_STR
@@ -198,3 +200,42 @@ def make_robot_action(action_tensor: PolicyAction, ds_features: dict[str, dict])
f"{name}": float(action_tensor[i]) for i, name in enumerate(action_names)
}
return act_processed_policy
def raise_feature_mismatch_error(
provided_features: set[str],
expected_features: set[str],
) -> None:
"""
Raises a standardized ValueError for feature mismatches between dataset/environment and policy config.
"""
missing = expected_features - provided_features
extra = provided_features - expected_features
# TODO (jadechoghari): provide a dynamic rename map suggestion to the user.
raise ValueError(
f"Feature mismatch between dataset/environment and policy config.\n"
f"- Missing features: {sorted(missing) if missing else 'None'}\n"
f"- Extra features: {sorted(extra) if extra else 'None'}\n\n"
f"Please ensure your dataset and policy use consistent feature names.\n"
f"If your dataset uses different observation keys (e.g., cameras named differently), "
f"use the `--rename_map` argument, for example:\n"
f' --rename_map=\'{{"observation.images.left": "observation.images.camera1", '
f'"observation.images.top": "observation.images.camera2"}}\''
)
def validate_visual_features_consistency(
cfg: PreTrainedConfig,
features: dict[str, PolicyFeature],
) -> None:
"""
Validates visual feature consistency between a policy config and provided dataset/environment features.
Args:
cfg (PreTrainedConfig): The model or policy configuration containing input_features and type.
features (Dict[str, PolicyFeature]): A mapping of feature names to PolicyFeature objects.
"""
expected_visuals = {k for k, v in cfg.input_features.items() if v.type == FeatureType.VISUAL}
provided_visuals = {k for k, v in features.items() if v.type == FeatureType.VISUAL}
if not provided_visuals.issubset(expected_visuals):
raise_feature_mismatch_error(provided_visuals, expected_visuals)
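For reference, a hedged sketch of how the new `raise_feature_mismatch_error` helper behaves; the feature names are illustrative:
```python
from lerobot.policies.utils import raise_feature_mismatch_error

provided = {"observation.images.top", "observation.state"}
expected = {"observation.images.camera1", "observation.state"}

# Raises ValueError listing the missing/extra feature names and pointing
# the user to the --rename_map argument.
raise_feature_mismatch_error(provided, expected)
```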
@@ -103,6 +103,7 @@ class SplitConfig:
class MergeConfig:
type: str = "merge"
repo_ids: list[str] | None = None
num_workers: int | None = None
@dataclass
@@ -215,6 +216,7 @@ def handle_merge(cfg: EditDatasetConfig) -> None:
datasets,
output_repo_id=cfg.repo_id,
output_dir=output_dir,
num_workers=cfg.operation.num_workers,
)
logging.info(f"Merged dataset saved to {output_dir}")
+159 -1
@@ -17,6 +17,7 @@ import importlib
from dataclasses import dataclass, field
import gymnasium as gym
import numpy as np
import pytest
import torch
from gymnasium.envs.registration import register, registry as gym_registry
@@ -26,7 +27,11 @@ import lerobot
from lerobot.configs.types import PolicyFeature
from lerobot.envs.configs import EnvConfig
from lerobot.envs.factory import make_env, make_env_config
from lerobot.envs.utils import preprocess_observation
from lerobot.envs.utils import (
_normalize_hub_result,
_parse_hub_url,
preprocess_observation,
)
from tests.utils import require_env
OBS_TYPES = ["state", "pixels", "pixels_agent_pos"]
@@ -108,3 +113,156 @@ def test_factory_custom_gym_id():
finally:
if gym_id in gym_registry:
del gym_registry[gym_id]
# Hub environment loading tests
def test_make_env_hub_url_parsing():
"""Test URL parsing for hub environment references."""
# simple repo_id
repo_id, revision, file_path = _parse_hub_url("user/repo")
assert repo_id == "user/repo"
assert revision is None
assert file_path == "env.py"
# repo with revision
repo_id, revision, file_path = _parse_hub_url("user/repo@main")
assert repo_id == "user/repo"
assert revision == "main"
assert file_path == "env.py"
# repo with custom file path
repo_id, revision, file_path = _parse_hub_url("user/repo:custom_env.py")
assert repo_id == "user/repo"
assert revision is None
assert file_path == "custom_env.py"
# repo with revision and custom file path
repo_id, revision, file_path = _parse_hub_url("user/repo@v1.0:envs/my_env.py")
assert repo_id == "user/repo"
assert revision == "v1.0"
assert file_path == "envs/my_env.py"
# repo with commit hash
repo_id, revision, file_path = _parse_hub_url("org/repo@abc123def456")
assert repo_id == "org/repo"
assert revision == "abc123def456"
assert file_path == "env.py"
def test_normalize_hub_result():
"""Test normalization of different return types from hub make_env."""
# test with VectorEnv (most common case)
mock_vec_env = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1")])
result = _normalize_hub_result(mock_vec_env)
assert isinstance(result, dict)
assert len(result) == 1
suite_name = next(iter(result))
assert 0 in result[suite_name]
assert isinstance(result[suite_name][0], gym.vector.VectorEnv)
mock_vec_env.close()
# test with single Env
mock_env = gym.make("CartPole-v1")
result = _normalize_hub_result(mock_env)
assert isinstance(result, dict)
suite_name = next(iter(result))
assert 0 in result[suite_name]
assert isinstance(result[suite_name][0], gym.vector.VectorEnv)
result[suite_name][0].close()
# test with dict (already normalized)
mock_vec_env = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1")])
input_dict = {"my_suite": {0: mock_vec_env}}
result = _normalize_hub_result(input_dict)
assert result == input_dict
assert "my_suite" in result
assert 0 in result["my_suite"]
mock_vec_env.close()
# test with invalid type
with pytest.raises(ValueError, match="Hub `make_env` must return"):
_normalize_hub_result("invalid_type")
def test_make_env_from_hub_requires_trust_remote_code():
"""Test that loading from hub requires explicit trust_remote_code=True."""
hub_id = "lerobot/cartpole-env"
# Should raise RuntimeError when trust_remote_code=False (default)
with pytest.raises(RuntimeError, match="Refusing to execute remote code"):
make_env(hub_id, trust_remote_code=False)
# Should also raise when not specified (defaults to False)
with pytest.raises(RuntimeError, match="Refusing to execute remote code"):
make_env(hub_id)
@pytest.mark.parametrize(
"hub_id",
[
"lerobot/cartpole-env",
"lerobot/cartpole-env@main",
"lerobot/cartpole-env:env.py",
],
)
def test_make_env_from_hub_with_trust(hub_id):
"""Test loading environment from Hugging Face Hub with trust_remote_code=True."""
# load environment from hub
envs_dict = make_env(hub_id, n_envs=2, trust_remote_code=True)
# verify structure
assert isinstance(envs_dict, dict)
assert len(envs_dict) >= 1
# get the first suite and task
suite_name = next(iter(envs_dict))
task_id = next(iter(envs_dict[suite_name]))
env = envs_dict[suite_name][task_id]
# verify it's a vector environment
assert isinstance(env, gym.vector.VectorEnv)
assert env.num_envs == 2
# test basic environment interaction
obs, info = env.reset()
assert obs is not None
assert isinstance(obs, (dict, np.ndarray))
# take a random action
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
assert obs is not None
assert isinstance(reward, np.ndarray)
assert len(reward) == 2
# clean up
env.close()
def test_make_env_from_hub_async():
"""Test loading hub environment with async vector environments."""
hub_id = "lerobot/cartpole-env"
# load with async envs
envs_dict = make_env(hub_id, n_envs=2, use_async_envs=True, trust_remote_code=True)
suite_name = next(iter(envs_dict))
task_id = next(iter(envs_dict[suite_name]))
env = envs_dict[suite_name][task_id]
# verify it's an async vector environment
assert isinstance(env, gym.vector.AsyncVectorEnv)
assert env.num_envs == 2
# test basic interaction
obs, info = env.reset()
assert obs is not None
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
assert len(reward) == 2
# clean up
env.close()
+157
@@ -0,0 +1,157 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Visual Feature Consistency Tests
This module tests the `validate_visual_features_consistency` function,
which ensures that visual features (camera observations) in a dataset/env
match the expectations defined in a policy configuration.
The purpose of this check is to prevent mismatches between what a policy expects
(e.g., `observation.images.camera1`, `camera2`, `camera3`) and what a dataset or
environment actually provides (e.g., `observation.images.top`, `side`, or fewer cameras).
"""
from pathlib import Path
import numpy as np
import pytest
from lerobot.configs.default import DatasetConfig
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.train import TrainPipelineConfig
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.factory import make_policy_config
from lerobot.scripts.lerobot_train import train
from lerobot.utils.utils import auto_select_torch_device
pytest.importorskip("transformers")
DUMMY_REPO_ID = "dummy/repo"
@pytest.fixture
def temp_dir(tmp_path):
return tmp_path
DUMMY_STATE_DIM = 6
DUMMY_ACTION_DIM = 6
IMAGE_SIZE = 8
DEVICE = auto_select_torch_device()
def make_dummy_dataset(camera_keys, tmp_path):
"""Creates a minimal dummy dataset for testing rename_mapping logic."""
features = {
"action": {"dtype": "float32", "shape": (DUMMY_ACTION_DIM,), "names": None},
"observation.state": {"dtype": "float32", "shape": (DUMMY_STATE_DIM,), "names": None},
}
for cam in camera_keys:
features[f"observation.images.{cam}"] = {
"dtype": "image",
"shape": (IMAGE_SIZE, IMAGE_SIZE, 3),
"names": ["height", "width", "channel"],
}
dataset = LeRobotDataset.create(
repo_id=DUMMY_REPO_ID,
fps=30,
features=features,
root=tmp_path / "_dataset",
)
root = tmp_path / "_dataset"
for ep_idx in range(2):
for _ in range(3):
frame = {
"action": np.random.randn(DUMMY_ACTION_DIM).astype(np.float32),
"observation.state": np.random.randn(DUMMY_STATE_DIM).astype(np.float32),
}
for cam in camera_keys:
frame[f"observation.images.{cam}"] = np.random.randint(
0, 255, size=(IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.uint8
)
frame["task"] = f"task_{ep_idx}"
dataset.add_frame(frame)
dataset.save_episode()
dataset.finalize()
return dataset, root
def custom_validate(train_config: TrainPipelineConfig, policy_path: str, empty_cameras: int):
train_config.policy = PreTrainedConfig.from_pretrained(policy_path)
train_config.policy.pretrained_path = Path(policy_path)
# override empty_cameras and push_to_hub for testing
train_config.policy.empty_cameras = empty_cameras
train_config.policy.push_to_hub = False
if train_config.use_policy_training_preset:
train_config.optimizer = train_config.policy.get_optimizer_preset()
train_config.scheduler = train_config.policy.get_scheduler_preset()
return train_config
@pytest.mark.skip(reason="Skipping this test as it results in OOM")
@pytest.mark.parametrize(
"camera_keys, empty_cameras, rename_map, expect_success",
[
# case 1: dataset has fewer cameras than policy (3 instead of 4), but we specify empty_cameras=1 for smolvla, pi0, pi05
(["camera1", "camera2", "camera3"], 1, {}, True),
# case 2: dataset has 2 cameras with different names, rename_mapping provided
(
["top", "side"],
0,
{
"observation.images.top": "observation.images.camera1",
"observation.images.side": "observation.images.camera2",
},
True,
),
# case 3: dataset has 2 cameras, policy expects 3, names do not match, no empty_cameras
(["top", "side"], 0, {}, False),
# TODO: case 4: dataset has 2 cameras, policy expects 3, no rename_map, no empty_cameras, should raise for smolvla
# (["camera1", "camera2"], 0, {}, False),
],
)
def test_train_with_camera_mismatch(camera_keys, empty_cameras, rename_map, expect_success, tmp_path):
"""Tests that training works or fails depending on camera/feature alignment."""
_dataset, root = make_dummy_dataset(camera_keys, tmp_path)
pretrained_path = "lerobot/smolvla_base"
dataset_config = DatasetConfig(repo_id=DUMMY_REPO_ID, root=root)
policy_config = make_policy_config(
"smolvla",
optimizer_lr=0.01,
push_to_hub=False,
pretrained_path=pretrained_path,
device=DEVICE,
)
policy_config.empty_cameras = empty_cameras
train_config = TrainPipelineConfig(
dataset=dataset_config,
policy=policy_config,
rename_map=rename_map,
output_dir=tmp_path / "_output",
steps=1,
)
train_config = custom_validate(train_config, policy_path=pretrained_path, empty_cameras=empty_cameras)
# HACK: disable the internal CLI validation step for tests; validation was already done via custom_validate
train_config.validate = lambda: None
if expect_success:
train(train_config)
else:
with pytest.raises(ValueError):
train(train_config)