Mirror of https://github.com/huggingface/lerobot.git (synced 2026-05-15 08:39:49 +00:00)
Merge branch 'main' into feature/add-multitask-dit
@@ -37,6 +37,8 @@
      title: SmolVLA
    - local: pi0
      title: π₀ (Pi0)
+   - local: pi0fast
+     title: π₀-FAST (Pi0Fast)
    - local: pi05
      title: π₀.₅ (Pi05)
    - local: groot
@@ -0,0 +1,182 @@
# π₀-FAST (Pi0-FAST)

π₀-FAST is a **Vision-Language-Action model for general robot control** that uses autoregressive next-token prediction to model continuous robot actions.

## Model Overview

π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called **FAST (Frequency-space Action Sequence Tokenization)**. This enables training autoregressive VLAs on highly dexterous tasks where standard binning-based discretization fails, while training **up to 5x faster** than diffusion-based approaches like π₀.
### Why FAST?

Standard approaches to robot action tokenization use simple per-dimension, per-timestep binning schemes. While adequate for simple behaviors, this breaks down rapidly for complex, dexterous skills that require precision and high-frequency control.

FAST solves this by compressing action sequences with signal-processing techniques, producing a dense sequence of action tokens that can be predicted autoregressively, just like language tokens.

### How FAST Tokenization Works

The FAST tokenizer compresses action sequences through the following steps:

1. **Normalize**: Take a continuous action chunk of shape `(H, D)`, where `H` is the horizon and `D` is the action dimension, and normalize it with one of the supported normalization methods (quantile normalization is recommended because it is robust to outliers).

2. **Discrete Cosine Transform (DCT)**: Apply the DCT (via scipy) to each action dimension separately. The DCT is the frequency transform at the heart of common image and audio codecs such as JPEG and MP3.

3. **Quantization**: Round the coefficients and drop insignificant ones for each action dimension, producing a sparse frequency matrix.

4. **Flatten**: Flatten the matrix into a 1D vector, with low-frequency components first.

5. **Byte Pair Encoding (BPE)**: Train a BPE tokenizer to compress the quantized DCT coefficients into dense action tokens, typically achieving **10x compression** over naive binning-based tokenization.

This approach can turn **any existing VLM** into a VLA by training it to predict FAST tokens.
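The DCT, quantization, and flatten steps (BPE aside) can be sketched in a few lines of NumPy/SciPy. This is an illustrative reconstruction, not the actual FAST implementation; the `scale` value mirrors the tokenizer script's default:

```python
import numpy as np
from scipy.fft import dct, idct

def fast_compress(chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """DCT each action dimension over time, then quantize to integers (steps 2-3)."""
    coeffs = dct(chunk, axis=0, norm="ortho")         # per-dimension DCT over the horizon
    return np.round(coeffs * scale).astype(np.int64)  # rounding drops small coefficients

def fast_decompress(q: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Invert the quantized DCT back to a continuous action chunk."""
    return idct(q / scale, axis=0, norm="ortho")

rng = np.random.default_rng(0)
chunk = np.clip(rng.normal(scale=0.3, size=(10, 6)), -1, 1)  # normalized (H, D) chunk
tokens = fast_compress(chunk).flatten()                      # step 4: low frequencies first
recon = fast_decompress(tokens.reshape(10, 6))
print(np.abs(chunk - recon).max())                           # small, lossy reconstruction error
```

The real tokenizer then runs BPE over these integer sequences to merge frequent coefficient patterns into single tokens.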
## Installation Requirements

1. Install LeRobot by following our [Installation Guide](./installation).
2. Install π₀-FAST dependencies by running:

```bash
pip install -e ".[pi]"
```

> [!NOTE]
> For lerobot 0.4.0, if you want to install the `pi` extra, you will have to run: `pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git"`.
>
> This will be fixed in the next patch release.
## Training a Custom FAST Tokenizer

You have two options for the FAST tokenizer:

1. **Use the pre-trained tokenizer**: The `physical-intelligence/fast` tokenizer was trained on 1M+ real robot action sequences and works as a general-purpose tokenizer.
2. **Train your own tokenizer**: For maximum performance on your specific dataset, you can train the tokenizer on your own data.

### Training Your Own Tokenizer

```bash
python src/lerobot/policies/pi0_fast/train_fast_tokenizer.py \
    --repo_id "user/my-lerobot-dataset" \
    --action_horizon 10 \
    --encoded_dims "0:6" \
    --vocab_size 1024 \
    --scale 10.0 \
    --normalization_mode QUANTILES \
    --output_dir "./my_fast_tokenizer" \
    --push_to_hub \
    --hub_repo_id "username/my-action-tokenizer"
```
### Key Tokenizer Parameters

| Parameter              | Description                                                                       | Default      |
| ---------------------- | --------------------------------------------------------------------------------- | ------------ |
| `--repo_id`            | LeRobot dataset repository ID                                                     | Required     |
| `--action_horizon`     | Number of future actions in each chunk                                            | `10`         |
| `--encoded_dims`       | Comma-separated dimension ranges to encode (e.g., `"0:6,7:23"`)                   | `"0:6,7:23"` |
| `--vocab_size`         | BPE vocabulary size                                                               | `1024`       |
| `--scale`              | DCT scaling factor for quantization                                               | `10.0`       |
| `--normalization_mode` | Normalization mode (`MEAN_STD`, `MIN_MAX`, `QUANTILES`, `QUANTILE10`, `IDENTITY`) | `QUANTILES`  |
| `--sample_fraction`    | Fraction of chunks to sample per episode                                          | `0.1`        |
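The `--encoded_dims` ranges appear to follow Python slice semantics (start-inclusive, end-exclusive). A small parser sketch under that assumption; `parse_encoded_dims` is a hypothetical helper name, not the script's actual function:

```python
def parse_encoded_dims(spec: str) -> list[int]:
    """Expand a spec such as "0:6,7:23" into explicit dimension indices."""
    dims: list[int] = []
    for part in spec.split(","):
        start, end = map(int, part.split(":"))
        dims.extend(range(start, end))  # end-exclusive, assumed
    return dims

print(parse_encoded_dims("0:6"))  # [0, 1, 2, 3, 4, 5]
```

Under this reading, the default `"0:6,7:23"` encodes 22 dimensions and skips dimension 6.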
## Usage

To use π₀-FAST in LeRobot, specify the policy type as:

```bash
policy.type=pi0_fast
```
## Training

To train π₀-FAST, use the LeRobot training script:

```bash
python src/lerobot/scripts/lerobot_train.py \
    --dataset.repo_id=your_dataset \
    --policy.type=pi0_fast \
    --output_dir=./outputs/pi0fast_training \
    --job_name=pi0fast_training \
    --policy.pretrained_path=lerobot/pi0_fast_base \
    --policy.dtype=bfloat16 \
    --policy.gradient_checkpointing=true \
    --policy.chunk_size=10 \
    --policy.n_action_steps=10 \
    --policy.max_action_tokens=256 \
    --steps=100000 \
    --batch_size=4 \
    --policy.device=cuda
```
### Key Training Parameters

| Parameter                         | Description                                         | Default                      |
| --------------------------------- | --------------------------------------------------- | ---------------------------- |
| `--policy.gradient_checkpointing` | Significantly reduces memory usage during training  | `false`                      |
| `--policy.dtype`                  | Model precision; `bfloat16` enables mixed precision | `float32`                    |
| `--policy.chunk_size`             | Number of action steps to predict (action horizon)  | `50`                         |
| `--policy.n_action_steps`         | Number of action steps to execute                   | `50`                         |
| `--policy.max_action_tokens`      | Maximum number of FAST tokens per action chunk      | `256`                        |
| `--policy.action_tokenizer_name`  | FAST tokenizer to use                               | `physical-intelligence/fast` |
| `--policy.compile_model`          | Enable `torch.compile` for faster training          | `false`                      |
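`chunk_size` and `n_action_steps` define a receding-horizon loop: the model predicts `chunk_size` actions, the runtime executes the first `n_action_steps`, then re-predicts. A schematic sketch of that loop; the `predict` callable is a stand-in, not the LeRobot API:

```python
def receding_horizon(predict, total_steps: int, n_action_steps: int):
    """Run a control loop that executes n_action_steps of each predicted chunk."""
    executed = []
    while len(executed) < total_steps:
        chunk = predict()                    # stand-in: returns chunk_size actions
        assert n_action_steps <= len(chunk)  # mirrors PI0FastConfig.__post_init__
        executed.extend(chunk[:n_action_steps])
    return executed[:total_steps]

# Toy predictor emitting a fixed 10-action chunk (chunk_size=10, n_action_steps=10)
trace = receding_horizon(lambda: list(range(10)), total_steps=25, n_action_steps=10)
print(len(trace))  # 25
```

Setting `n_action_steps` below `chunk_size` re-plans more often at the cost of more forward passes.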
## Inference

### KV-Caching for Fast Inference

π₀-FAST supports **KV-caching**, a widely used optimization in LLM inference. It caches the attention keys and values from previous decoding steps, avoiding redundant computation during autoregressive decoding.

```bash
# KV-caching is enabled by default
policy.use_kv_cache=true
```
### Inference Example

```python
from lerobot.policies.pi0_fast import PI0FastPolicy

# Load the policy
policy = PI0FastPolicy.from_pretrained("your-model-path")

# During inference
actions = policy.predict_action_chunk(batch)
```
## Model Architecture

π₀-FAST uses a PaliGemma-based architecture:

- **Vision Encoder**: SigLIP vision tower for image understanding
- **Language Model**: Gemma 2B for processing language instructions and predicting action tokens

The model takes images, text instructions, and robot state as input, and outputs discrete FAST tokens that are decoded back into continuous actions.
## Configuration Options

| Parameter            | Description                                     | Default    |
| -------------------- | ----------------------------------------------- | ---------- |
| `paligemma_variant`  | VLM backbone variant (`gemma_300m`, `gemma_2b`) | `gemma_2b` |
| `max_state_dim`      | Maximum state vector dimension (padded)         | `32`       |
| `max_action_dim`     | Maximum action vector dimension (padded)        | `32`       |
| `temperature`        | Sampling temperature (`0.0` for greedy)         | `0.0`      |
| `max_decoding_steps` | Maximum autoregressive decoding steps           | `256`      |
| `use_kv_cache`       | Enable KV caching for faster inference          | `true`     |
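State and action vectors shorter than `max_state_dim`/`max_action_dim` are zero-padded by a `pad_vector` helper in `modeling_pi0_fast` (not shown in this diff). A minimal NumPy sketch of the padding behavior the table describes, assuming zero-padding of the last dimension:

```python
import numpy as np

def pad_vector(x: np.ndarray, target_dim: int) -> np.ndarray:
    """Zero-pad the last dimension of x up to target_dim (sketch, not the actual helper)."""
    if x.shape[-1] >= target_dim:
        return x
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, target_dim - x.shape[-1])]
    return np.pad(x, pad_width)

state = np.random.randn(2, 14)  # e.g. a batch of 14-DoF robot states
padded = pad_vector(state, 32)  # padded to max_state_dim
print(padded.shape)             # (2, 32)
```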
## Comparison with π₀

| Feature               | π₀                        | π₀-FAST                      |
| --------------------- | ------------------------- | ---------------------------- |
| Action Representation | Flow Matching (Diffusion) | Autoregressive Tokens (FAST) |
| Training Speed        | 1x                        | **5x faster**                |
| Dexterity             | High                      | High                         |
| Inference Method      | Iterative Denoising       | Autoregressive Decoding      |
| KV-Caching            | N/A                       | Supported                    |
## License

This model is released under the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).

## References

- [FAST: Efficient Robot Action Tokenization](https://www.physicalintelligence.company/research/fast) - Physical Intelligence blog
- [OpenPI Repository](https://github.com/Physical-Intelligence/openpi) - Original implementation
- [FAST Tokenizer on Hugging Face](https://huggingface.co/physical-intelligence/fast) - Pre-trained tokenizer
@@ -127,7 +127,7 @@ wallx = [
    "torchdiffeq==0.2.5",
    "qwen_vl_utils==0.0.11"
]
-pi = ["transformers @ git+https://github.com/huggingface/transformers.git@fix/lerobot_openpi"]
+pi = ["transformers @ git+https://github.com/huggingface/transformers.git@fix/lerobot_openpi", "scipy>=1.10.1,<1.15"]
smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0", "safetensors>=0.4.3,<1.0.0"]
multi_task_dit = ["lerobot[transformers-dep]"]
groot = [
@@ -17,6 +17,7 @@ from .diffusion.configuration_diffusion import DiffusionConfig as DiffusionConfi
from .groot.configuration_groot import GrootConfig as GrootConfig
from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
from .pi0.configuration_pi0 import PI0Config as PI0Config
+from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
from .pi05.configuration_pi05 import PI05Config as PI05Config
from .smolvla.configuration_smolvla import SmolVLAConfig as SmolVLAConfig
from .smolvla.processor_smolvla import SmolVLANewLineProcessor
@@ -31,6 +32,7 @@ __all__ = [
    "MultiTaskDiTConfig",
    "PI0Config",
    "PI05Config",
+   "PI0FastConfig",
    "SmolVLAConfig",
    "SARMConfig",
    "TDMPCConfig",
@@ -95,6 +95,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
        from lerobot.policies.pi0.modeling_pi0 import PI0Policy

        return PI0Policy
+    elif name == "pi0_fast":
+        from lerobot.policies.pi0_fast.modeling_pi0_fast import PI0FastPolicy
+
+        return PI0FastPolicy
    elif name == "pi05":
        from lerobot.policies.pi05.modeling_pi05 import PI05Policy
@@ -0,0 +1,21 @@
#!/usr/bin/env python

# Copyright 2025 Physical Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .configuration_pi0_fast import PI0FastConfig
from .modeling_pi0_fast import PI0FastPolicy
from .processor_pi0_fast import make_pi0_fast_pre_post_processors

__all__ = ["PI0FastConfig", "PI0FastPolicy", "make_pi0_fast_pre_post_processors"]
@@ -0,0 +1,161 @@
#!/usr/bin/env python

# Copyright 2025 Physical Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass, field

from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import FeatureType, NormalizationMode, PolicyFeature
from lerobot.optim.optimizers import AdamWConfig
from lerobot.optim.schedulers import CosineDecayWithWarmupSchedulerConfig
from lerobot.policies.rtc.configuration_rtc import RTCConfig

DEFAULT_IMAGE_SIZE = 224


@PreTrainedConfig.register_subclass("pi0_fast")
@dataclass
class PI0FastConfig(PreTrainedConfig):
    paligemma_variant: str = "gemma_2b"
    action_expert_variant: str = "gemma_300m"
    dtype: str = "float32"  # Options: "bfloat16", "float32"

    chunk_size: int = 50  # Number of action steps to predict; called "action_horizon" in openpi
    n_action_steps: int = 50  # Number of action steps to execute

    # Shorter state and action vectors will be padded to these dimensions
    max_state_dim: int = 32
    max_action_dim: int = 32
    max_action_tokens: int = 256

    # Real-Time Chunking (RTC) configuration
    rtc_config: RTCConfig | None = None

    image_resolution: tuple[int, int] = (
        DEFAULT_IMAGE_SIZE,
        DEFAULT_IMAGE_SIZE,
    )  # see openpi `preprocessing_pytorch.py`

    # Add empty images. Used to add empty cameras when no image features are present.
    empty_cameras: int = 0

    tokenizer_max_length: int = 200  # see openpi `__post_init__`
    text_tokenizer_name: str = "google/paligemma-3b-pt-224"
    action_tokenizer_name: str = "physical-intelligence/fast"
    temperature: float = 0.0
    max_decoding_steps: int = 256
    fast_skip_tokens: int = 128

    # Whether to validate that decoded action tokens start with the "Action: " prefix
    validate_action_token_prefix: bool = True

    # Whether to use the KV cache for faster autoregressive decoding
    use_kv_cache: bool = True

    normalization_mapping: dict[str, NormalizationMode] = field(
        default_factory=lambda: {
            "VISUAL": NormalizationMode.IDENTITY,
            "STATE": NormalizationMode.MEAN_STD,
            "ACTION": NormalizationMode.MEAN_STD,
        }
    )

    # Training settings
    gradient_checkpointing: bool = False  # Enable gradient checkpointing for memory optimization
    compile_model: bool = False  # Whether to use torch.compile for model optimization
    compile_mode: str = "max-autotune"  # Torch compile mode
    device: str | None = None  # Device to use for the model (None = auto-detect)

    # Optimizer settings: see openpi `AdamW`
    optimizer_lr: float = 2.5e-5  # see openpi `CosineDecaySchedule: peak_lr`
    optimizer_betas: tuple[float, float] = (0.9, 0.95)
    optimizer_eps: float = 1e-8
    optimizer_weight_decay: float = 0.01
    optimizer_grad_clip_norm: float = 1.0

    # Scheduler settings: see openpi `CosineDecaySchedule`
    # Note: these will auto-scale if --steps < scheduler_decay_steps.
    # For example, --steps=3000 will scale warmup to 100 and decay to 3000.
    scheduler_warmup_steps: int = 1_000
    scheduler_decay_steps: int = 30_000
    scheduler_decay_lr: float = 2.5e-6

    def __post_init__(self):
        super().__post_init__()

        # Validate configuration
        if self.n_action_steps > self.chunk_size:
            raise ValueError(
                f"n_action_steps ({self.n_action_steps}) cannot be greater than chunk_size ({self.chunk_size})"
            )

        if self.paligemma_variant not in ["gemma_300m", "gemma_2b"]:
            raise ValueError(f"Invalid paligemma_variant: {self.paligemma_variant}")

        if self.dtype not in ["bfloat16", "float32"]:
            raise ValueError(f"Invalid dtype: {self.dtype}")

    def validate_features(self) -> None:
        """Validate and set up input/output features."""
        for i in range(self.empty_cameras):
            key = f"observation.images.empty_camera_{i}"
            empty_camera = PolicyFeature(
                type=FeatureType.VISUAL,
                shape=(3, *self.image_resolution),  # Use the configured image resolution
            )
            self.input_features[key] = empty_camera

        if "observation.state" not in self.input_features:
            state_feature = PolicyFeature(
                type=FeatureType.STATE,
                shape=(self.max_state_dim,),  # Padded to max_state_dim
            )
            self.input_features["observation.state"] = state_feature

        if "action" not in self.output_features:
            action_feature = PolicyFeature(
                type=FeatureType.ACTION,
                shape=(self.max_action_dim,),  # Padded to max_action_dim
            )
            self.output_features["action"] = action_feature

    def get_optimizer_preset(self) -> AdamWConfig:
        return AdamWConfig(
            lr=self.optimizer_lr,
            betas=self.optimizer_betas,
            eps=self.optimizer_eps,
            weight_decay=self.optimizer_weight_decay,
            grad_clip_norm=self.optimizer_grad_clip_norm,
        )

    def get_scheduler_preset(self):
        return CosineDecayWithWarmupSchedulerConfig(
            peak_lr=self.optimizer_lr,
            decay_lr=self.scheduler_decay_lr,
            num_warmup_steps=self.scheduler_warmup_steps,
            num_decay_steps=self.scheduler_decay_steps,
        )

    @property
    def observation_delta_indices(self) -> None:
        return None

    @property
    def action_delta_indices(self) -> list:
        return list(range(self.chunk_size))

    @property
    def reward_delta_indices(self) -> None:
        return None
File diff suppressed because it is too large
@@ -0,0 +1,177 @@
#!/usr/bin/env python

# Copyright 2025 Physical Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from copy import deepcopy
from dataclasses import dataclass
from typing import Any

import numpy as np
import torch

from lerobot.configs.types import PipelineFeatureType, PolicyFeature
from lerobot.policies.pi0_fast.configuration_pi0_fast import PI0FastConfig
from lerobot.policies.pi0_fast.modeling_pi0_fast import pad_vector
from lerobot.processor import (
    ActionTokenizerProcessorStep,
    AddBatchDimensionProcessorStep,
    DeviceProcessorStep,
    NormalizerProcessorStep,
    PolicyAction,
    PolicyProcessorPipeline,
    ProcessorStep,
    ProcessorStepRegistry,
    RenameObservationsProcessorStep,
    TokenizerProcessorStep,
    UnnormalizerProcessorStep,
)
from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
from lerobot.processor.core import EnvTransition, TransitionKey
from lerobot.utils.constants import (
    OBS_STATE,
    POLICY_POSTPROCESSOR_DEFAULT_NAME,
    POLICY_PREPROCESSOR_DEFAULT_NAME,
)


@ProcessorStepRegistry.register(name="pi0_fast_prepare_state_tokenizer_processor_step")
@dataclass
class Pi0FastPrepareStateAndLanguageTokenizerProcessorStep(ProcessorStep):
    """
    Processor step to prepare the state and tokenize the language input.
    """

    max_state_dim: int = 32
    task_key: str = "task"

    def __call__(self, transition: EnvTransition) -> EnvTransition:
        transition = transition.copy()

        state = transition.get(TransitionKey.OBSERVATION, {}).get(OBS_STATE)
        if state is None:
            raise ValueError("State is required for PI0Fast")
        tasks = transition.get(TransitionKey.COMPLEMENTARY_DATA, {}).get(self.task_key)
        if tasks is None:
            raise ValueError("No task found in complementary data")

        # TODO: check if this is necessary
        state = deepcopy(state)

        # Prepare state (pad to max_state_dim)
        state = pad_vector(state, self.max_state_dim)

        # State should already be normalized to [-1, 1] by the NormalizerProcessorStep that runs before
        # this step. Discretize into 256 bins (see openpi `PaligemmaTokenizer.tokenize()`).
        state_np = state.cpu().numpy()
        discretized_states = np.digitize(state_np, bins=np.linspace(-1, 1, 256 + 1)[:-1]) - 1

        full_prompts = []
        for i, task in enumerate(tasks):
            cleaned_text = task.strip().replace("_", " ").replace("\n", " ")
            state_str = " ".join(map(str, discretized_states[i]))
            full_prompt = f"Task: {cleaned_text}, State: {state_str};\n"
            full_prompts.append(full_prompt)

        transition[TransitionKey.COMPLEMENTARY_DATA][self.task_key] = full_prompts
        return transition

    def transform_features(
        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
        """
        This step does not alter the feature definitions.
        """
        return features


def make_pi0_fast_pre_post_processors(
    config: PI0FastConfig,
    dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
) -> tuple[
    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
    PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
    """
    Constructs pre-processor and post-processor pipelines for the PI0Fast policy.

    The pre-processing pipeline prepares input data for the model by:
    1. Renaming features to match pretrained configurations.
    2. Adding a batch dimension.
    3. Normalizing input and output features based on dataset statistics.
    4. Building the "Task: ..., State: ...;" prompt from the normalized, discretized state.
    5. Tokenizing the text prompt using the PaliGemma tokenizer.
    6. Tokenizing the target actions with the FAST action tokenizer.
    7. Moving all data to the specified device.

    The post-processing pipeline handles the model's output by:
    1. Unnormalizing the output features to their original scale.
    2. Moving data to the CPU.

    Args:
        config: The configuration object for the PI0Fast policy.
        dataset_stats: A dictionary of statistics for normalization.

    Returns:
        A tuple containing the configured pre-processor and post-processor pipelines.
    """
    input_steps: list[ProcessorStep] = [
        RenameObservationsProcessorStep(rename_map={}),  # To mimic the same processor as the pretrained one
        AddBatchDimensionProcessorStep(),
        # NOTE: NormalizerProcessorStep MUST come before Pi0FastPrepareStateAndLanguageTokenizerProcessorStep
        # because the tokenizer step expects normalized state in the [-1, 1] range for discretization.
        NormalizerProcessorStep(
            features={**config.input_features, **config.output_features},
            norm_map=config.normalization_mapping,
            stats=dataset_stats,
        ),
        Pi0FastPrepareStateAndLanguageTokenizerProcessorStep(max_state_dim=config.max_state_dim),
        TokenizerProcessorStep(
            tokenizer_name=config.text_tokenizer_name,
            max_length=config.tokenizer_max_length,
            padding_side="right",
            padding="max_length",
        ),
        ActionTokenizerProcessorStep(
            action_tokenizer_name=config.action_tokenizer_name,
            max_action_tokens=config.max_action_tokens,
            fast_skip_tokens=config.fast_skip_tokens,
            paligemma_tokenizer_name=config.text_tokenizer_name,
        ),
        DeviceProcessorStep(device=config.device),
    ]

    output_steps: list[ProcessorStep] = [
        UnnormalizerProcessorStep(
            features=config.output_features, norm_map=config.normalization_mapping, stats=dataset_stats
        ),
        DeviceProcessorStep(device="cpu"),
    ]

    return (
        PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
            steps=input_steps,
            name=POLICY_PREPROCESSOR_DEFAULT_NAME,
        ),
        PolicyProcessorPipeline[PolicyAction, PolicyAction](
            steps=output_steps,
            name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
            to_transition=policy_action_to_transition,
            to_output=transition_to_policy_action,
        ),
    )
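As a standalone illustration of the state discretization performed by `Pi0FastPrepareStateAndLanguageTokenizerProcessorStep`: the bin edges and prompt format below follow the step's code, while the state values and task string are made-up toy inputs (a single unbatched state rather than a batch):

```python
import numpy as np

state = np.array([-1.0, -0.5, 0.0, 0.999])      # a toy normalized state in [-1, 1]
bins = np.linspace(-1, 1, 256 + 1)[:-1]          # 256 left bin edges, as in the step above
discretized = np.digitize(state, bins=bins) - 1  # integer bins in [0, 255]
prompt = f"Task: pick up the cube, State: {' '.join(map(str, discretized))};\n"
print(prompt)  # Task: pick up the cube, State: 0 64 128 255;
```

Each state dimension becomes an integer word in the text prompt, which is why normalization must run before this step.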
@@ -0,0 +1,539 @@
|
||||
"""Train FAST tokenizer for action encoding.
|
||||
|
||||
This script:
|
||||
1. Loads action chunks from LeRobotDataset (with sampling)
|
||||
2. Applies delta transforms and per-timestamp normalization
|
||||
3. Trains FAST tokenizer on specified action dimensions
|
||||
4. Saves tokenizer to assets directory
|
||||
5. Reports compression statistics
|
||||
"""
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import tyro
|
||||
from huggingface_hub import HfApi
|
||||
from transformers import AutoProcessor
|
||||
|
||||
from lerobot.configs.types import NormalizationMode
|
||||
from lerobot.datasets.lerobot_dataset import LeRobotDataset
|
||||
|
||||
|
||||
def apply_delta_transform(state: np.ndarray, actions: np.ndarray, delta_dims: list[int] | None) -> np.ndarray:
|
||||
"""Apply delta transform to specified dimensions.
|
||||
|
||||
Args:
|
||||
state: Current state [D]
|
||||
actions: Future actions [D]
|
||||
delta_dims: List of dimension indices to apply delta transform to
|
||||
|
||||
Returns:
|
||||
Transformed actions [D]
|
||||
"""
|
||||
if delta_dims is None or len(delta_dims) == 0:
|
||||
return actions
|
||||
|
||||
delta_actions = actions.copy()
|
||||
for dim in delta_dims:
|
||||
delta_actions[dim] = actions[dim] - state[dim]
|
||||
|
||||
return delta_actions
|
||||
|
||||
|
||||
def apply_normalization(
|
||||
data: np.ndarray,
|
||||
stats: dict[str, np.ndarray],
|
||||
mode: NormalizationMode,
|
||||
eps: float = 1e-8,
|
||||
) -> np.ndarray:
|
||||
"""Apply normalization to data based on the specified mode.
|
||||
|
||||
Args:
|
||||
data: Data to normalize [N, H, D] or [D]
|
||||
stats: Dictionary of statistics (mean, std, min, max, q01, q99, q10, q90)
|
||||
mode: Normalization mode to apply
|
||||
eps: Small epsilon for numerical stability
|
||||
|
||||
Returns:
|
||||
Normalized data with the same shape as input
|
||||
"""
|
||||
if mode == NormalizationMode.IDENTITY:
|
||||
return data
|
||||
|
||||
if mode == NormalizationMode.MEAN_STD:
|
||||
mean = stats.get("mean")
|
||||
std = stats.get("std")
|
||||
if mean is None or std is None:
|
||||
raise ValueError("MEAN_STD mode requires 'mean' and 'std' in stats")
|
||||
return (data - mean) / np.maximum(std, eps)
|
||||
|
||||
if mode == NormalizationMode.MIN_MAX:
|
||||
min_val = stats.get("min")
|
||||
max_val = stats.get("max")
|
||||
if min_val is None or max_val is None:
|
||||
raise ValueError("MIN_MAX mode requires 'min' and 'max' in stats")
|
||||
denom = np.maximum(max_val - min_val, eps)
|
||||
return 2.0 * (data - min_val) / denom - 1.0
|
||||
|
||||
if mode == NormalizationMode.QUANTILES:
|
||||
q01 = stats.get("q01")
|
||||
q99 = stats.get("q99")
|
||||
if q01 is None or q99 is None:
|
||||
raise ValueError("QUANTILES mode requires 'q01' and 'q99' in stats")
|
||||
denom = np.maximum(q99 - q01, eps)
|
||||
# Clip to quantile range then normalize to [-1, 1]
|
||||
clipped = np.clip(data, q01, q99)
|
||||
return 2.0 * (clipped - q01) / denom - 1.0
|
||||
|
||||
if mode == NormalizationMode.QUANTILE10:
|
||||
q10 = stats.get("q10")
|
||||
q90 = stats.get("q90")
|
||||
if q10 is None or q90 is None:
|
||||
raise ValueError("QUANTILE10 mode requires 'q10' and 'q90' in stats")
|
||||
denom = np.maximum(q90 - q10, eps)
|
||||
# Clip to quantile range then normalize to [-1, 1]
|
||||
clipped = np.clip(data, q10, q90)
|
||||
        return 2.0 * (clipped - q10) / denom - 1.0

    raise ValueError(f"Unsupported normalization mode: {mode}")


def process_episode(args):
    """Process single episode and return action chunks."""
    dataset, ep_idx, action_horizon, delta_dims, sample_fraction, state_key, use_delta_transform = args

    try:
        # get episode info
        ep_info = dataset.meta.episodes[ep_idx]
        from_idx = ep_info["dataset_from_index"]
        to_idx = ep_info["dataset_to_index"]
        ep_length = to_idx - from_idx

        if ep_length < action_horizon:
            return None

        # load all frames in episode
        # if dataset has episode filtering, we need to use the mapping
        states = []
        actions = []

        for abs_idx in range(from_idx, to_idx):
            # map absolute index to relative index if needed
            if dataset._absolute_to_relative_idx is not None:
                if abs_idx not in dataset._absolute_to_relative_idx:
                    # this episode's frames aren't in the filtered dataset
                    return None
                rel_idx = dataset._absolute_to_relative_idx[abs_idx]
            else:
                rel_idx = abs_idx

            frame = dataset.hf_dataset[rel_idx]

            # get state (could be from observation.state or other state key)
            if state_key in frame:
                state = (
                    frame[state_key].numpy()
                    if torch.is_tensor(frame[state_key])
                    else np.array(frame[state_key])
                )
            else:
                # if no state key, use zeros (no delta transform)
                state = np.zeros_like(
                    frame["action"].numpy() if torch.is_tensor(frame["action"]) else np.array(frame["action"])
                )

            action = (
                frame["action"].numpy() if torch.is_tensor(frame["action"]) else np.array(frame["action"])
            )

            states.append(state)
            actions.append(action)

        states = np.array(states)
        actions = np.array(actions)

        # create action chunks (sliding window)
        # all actions in a chunk are relative to the FIRST state in that chunk
        action_chunks = []

        for i in range(len(states) - action_horizon + 1):
            current_state = states[i]  # First state in chunk
            future_absolute_actions = actions[i : i + action_horizon]

            if use_delta_transform:
                # relative actions
                delta_chunk = np.zeros_like(future_absolute_actions)
                for t in range(action_horizon):
                    delta_chunk[t] = apply_delta_transform(
                        current_state,
                        future_absolute_actions[t],
                        delta_dims,
                    )
                action_chunks.append(delta_chunk)
            else:
                # absolute actions (no delta)
                action_chunks.append(future_absolute_actions)

        if len(action_chunks) == 0:
            return None

        action_chunks = np.array(action_chunks)

        # sample chunks
        if sample_fraction < 1.0:
            n_chunks = len(action_chunks)
            n_samples = max(1, int(n_chunks * sample_fraction))
            episode_seed = hash(ep_idx) % (2**31)
            rng = np.random.RandomState(episode_seed)
            indices = rng.choice(n_chunks, size=n_samples, replace=False)
            action_chunks = action_chunks[indices]

        return action_chunks

    except Exception as e:
        print(f"Error processing episode {ep_idx}: {e}")
        import traceback

        traceback.print_exc()
        return None

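The sliding-window chunking in `process_episode` can be sketched in isolation. This is a minimal numpy example with made-up shapes, not LeRobot code:

```python
import numpy as np

# Toy episode: 6 timesteps, 3-dimensional actions
actions = np.arange(18, dtype=np.float64).reshape(6, 3)
action_horizon = 4

# One chunk per valid start index: len(actions) - action_horizon + 1 windows,
# each containing the next `action_horizon` actions
chunks = np.stack(
    [actions[i : i + action_horizon] for i in range(len(actions) - action_horizon + 1)]
)
print(chunks.shape)  # (3, 4, 3)
```

Episodes shorter than the horizon yield no valid window, which is why `process_episode` returns `None` in that case.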
def train_fast_tokenizer(
    action_chunks: np.ndarray,
    vocab_size: int = 1024,
    scale: float = 10.0,
) -> AutoProcessor:
    """
    Train FAST tokenizer (BPE on DCT coefficients) on action chunks.

    Uses the .fit() method to train a new tokenizer on the provided data.

    Args:
        action_chunks: Array of action chunks [N, H, D] where N=num_chunks, H=horizon, D=action_dim
        vocab_size: BPE vocabulary size
        scale: DCT scaling factor for quantization

    Returns:
        Trained FAST tokenizer
    """
    print(f"Training FAST tokenizer on {len(action_chunks)} action chunks...")
    print(f"Action chunk shape: {action_chunks.shape}")
    print(f"Vocab size: {vocab_size}")
    print(f"DCT scale: {scale}")

    # download the tokenizer source code (not pretrained weights)
    # we'll train a new tokenizer on our own data
    base_tokenizer = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)

    # convert action_chunks array to list of arrays (expected by .fit())
    action_data_list = [action_chunks[i] for i in range(len(action_chunks))]

    # train the new tokenizer on our action data using .fit()
    # this trains the BPE tokenizer on DCT coefficients
    print("Training new tokenizer (this may take a few minutes)...")
    tokenizer = base_tokenizer.fit(
        action_data_list,
        scale=scale,
        vocab_size=vocab_size,
        time_horizon=action_chunks.shape[1],  # action_horizon
        action_dim=action_chunks.shape[2],  # encoded dimensions
    )
    print("✓ Tokenizer training complete!")

    # validate it works
    sample_chunk = action_chunks[0]
    encoded = tokenizer(sample_chunk[None])[0]
    if isinstance(encoded, list):
        encoded = np.array(encoded)
    print(f"Sample encoding: {len(encoded)} tokens for chunk shape {sample_chunk.shape}")

    return tokenizer

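As a rough intuition for why FAST compresses well, here is a numpy-only sketch of the DCT-and-quantize step on a smooth toy trajectory. The real tokenizer uses scipy's DCT followed by BPE; the horizon, scale, and signal here are purely illustrative:

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    # Orthonormal DCT-II basis as an explicit matrix (numpy only, to keep the sketch self-contained)
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] *= np.sqrt(1.0 / n)
    m[1:] *= np.sqrt(2.0 / n)
    return m

# Toy action chunk: horizon H=10, action dim D=2, smooth trajectories
H = 10
t = np.linspace(0.0, 1.0, H)
chunk = np.stack([np.sin(2 * np.pi * t), t], axis=-1)  # (H, D)

# DCT each dimension, then quantize with a scale factor (as FAST does before BPE)
scale = 10.0
M = dct_matrix(H)
coeffs = M @ chunk
quantized = np.round(coeffs * scale)

# Smooth signals concentrate energy in low frequencies, so many quantized
# coefficients are zero and the token sequence compresses well under BPE
reconstructed = M.T @ (quantized / scale)
max_err = np.abs(reconstructed - chunk).max()
print(f"nonzero: {np.count_nonzero(quantized)}/{quantized.size}, max error: {max_err:.4f}")
```

Because the DCT basis is orthonormal, the only reconstruction error comes from quantization, and a larger `scale` trades longer token sequences for higher fidelity.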
def compute_compression_stats(tokenizer, action_chunks: np.ndarray):
    """Compute compression statistics."""
    print("\nComputing compression statistics...")

    # sample for stats (use max 1000 chunks for speed)
    sample_size = min(1000, len(action_chunks))
    sample_indices = np.random.RandomState(42).choice(len(action_chunks), size=sample_size, replace=False)
    sample_chunks = action_chunks[sample_indices]

    token_lengths = []
    for chunk in sample_chunks:
        encoded = tokenizer(chunk[None])[0]
        if isinstance(encoded, list):
            token_lengths.append(len(encoded))
        else:
            token_lengths.append(encoded.shape[0] if hasattr(encoded, "shape") else len(encoded))

    token_lengths = np.array(token_lengths)

    # compression ratio: (H * D) / avg_tokens
    input_size = action_chunks.shape[1] * action_chunks.shape[2]
    avg_tokens = np.mean(token_lengths)
    compression_ratio = input_size / avg_tokens

    stats = {
        "compression_ratio": float(compression_ratio),
        "mean_token_length": float(np.mean(token_lengths)),
        "p99_token_length": float(np.percentile(token_lengths, 99)),
        "min_token_length": float(np.min(token_lengths)),
        "max_token_length": float(np.max(token_lengths)),
    }

    print("Compression Statistics:")
    print(f"  Average compression ratio: {stats['compression_ratio']:.2f}x")
    print(f"  Mean token length: {stats['mean_token_length']:.1f}")
    print(f"  P99 token length: {stats['p99_token_length']:.0f}")
    print(f"  Min token length: {stats['min_token_length']:.0f}")
    print(f"  Max token length: {stats['max_token_length']:.0f}")

    return stats

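The compression ratio reported above is simply the raw chunk size `H * D` divided by the mean token count. A quick worked example with hypothetical numbers:

```python
action_horizon, action_dim = 10, 22      # H, D (illustrative values)
token_lengths = [40, 44, 36, 48]         # hypothetical per-chunk token counts

input_size = action_horizon * action_dim  # 220 raw scalars per chunk
avg_tokens = sum(token_lengths) / len(token_lengths)  # 42.0
compression_ratio = input_size / avg_tokens
print(f"{compression_ratio:.2f}x")  # 220 / 42 ≈ 5.24x
```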
def main(
    repo_id: str,
    root: str | None = None,
    action_horizon: int = 10,
    max_episodes: int | None = None,
    sample_fraction: float = 0.1,
    encoded_dims: str = "0:6,7:23",
    delta_dims: str | None = None,
    use_delta_transform: bool = False,
    state_key: str = "observation.state",
    normalization_mode: str = "QUANTILES",
    vocab_size: int = 1024,
    scale: float = 10.0,
    output_dir: str | None = None,
    push_to_hub: bool = False,
    hub_repo_id: str | None = None,
    hub_private: bool = False,
):
    """
    Train FAST tokenizer for action encoding.

    Args:
        repo_id: LeRobot dataset repository ID
        root: Root directory for dataset (default: ~/.cache/huggingface/lerobot)
        action_horizon: Number of future actions in each chunk
        max_episodes: Max episodes to use (None = all episodes in dataset)
        sample_fraction: Fraction of chunks to sample per episode
        encoded_dims: Comma-separated dimension ranges to encode (e.g., "0:6,7:23")
        delta_dims: Comma-separated dimension indices for delta transform (e.g., "0,1,2,3,4,5")
        use_delta_transform: Whether to apply delta transform (relative actions vs absolute actions)
        state_key: Dataset key for state observations (default: "observation.state")
        normalization_mode: Normalization mode (MEAN_STD, MIN_MAX, QUANTILES, QUANTILE10, IDENTITY)
        vocab_size: FAST vocabulary size (BPE vocab size)
        scale: DCT scaling factor (default: 10.0)
        output_dir: Directory to save tokenizer (default: ./fast_tokenizer_{repo_id})
        push_to_hub: Whether to push the tokenizer to Hugging Face Hub
        hub_repo_id: Hub repository ID (e.g., "username/tokenizer-name"). If None, uses output_dir name
        hub_private: Whether to create a private repository on the Hub
    """
    # load dataset
    print(f"Loading dataset: {repo_id}")
    dataset = LeRobotDataset(repo_id=repo_id, root=root)
    print(f"Dataset loaded: {dataset.num_episodes} episodes, {dataset.num_frames} frames")

    # parse normalization mode
    try:
        norm_mode = NormalizationMode(normalization_mode)
    except ValueError as err:
        raise ValueError(
            f"Invalid normalization_mode: {normalization_mode}. "
            f"Must be one of: {', '.join([m.value for m in NormalizationMode])}"
        ) from err
    print(f"Normalization mode: {norm_mode.value}")

    # parse encoded dimensions
    encoded_dim_ranges = []
    for range_str in encoded_dims.split(","):
        start, end = map(int, range_str.strip().split(":"))
        encoded_dim_ranges.append((start, end))

    total_encoded_dims = sum(end - start for start, end in encoded_dim_ranges)
    print(f"Encoding {total_encoded_dims} dimensions: {encoded_dims}")

    # parse delta dimensions
    delta_dim_list = None
    if delta_dims is not None and delta_dims.strip():
        delta_dim_list = [int(d.strip()) for d in delta_dims.split(",")]
        print(f"Delta dimensions: {delta_dim_list}")
    else:
        print("No delta dimensions specified")

    print(f"Use delta transform: {use_delta_transform}")
    if use_delta_transform and (delta_dim_list is None or len(delta_dim_list) == 0):
        print("Warning: use_delta_transform=True but no delta_dims specified. No delta will be applied.")

    print(f"Action horizon: {action_horizon}")
    print(f"State key: {state_key}")

    # determine episodes to process
    num_episodes = dataset.num_episodes
    if max_episodes is not None:
        num_episodes = min(max_episodes, num_episodes)

    print(f"Processing {num_episodes} episodes...")

    # process episodes sequentially (to avoid pickling issues with dataset)
    all_chunks = []
    for ep_idx in range(num_episodes):
        if ep_idx % 10 == 0:
            print(f"  Processing episode {ep_idx}/{num_episodes}...")

        chunks = process_episode(
            (dataset, ep_idx, action_horizon, delta_dim_list, sample_fraction, state_key, use_delta_transform)
        )
        if chunks is not None:
            all_chunks.append(chunks)

    # concatenate all chunks
    all_chunks = np.concatenate(all_chunks, axis=0)
    print(f"Collected {len(all_chunks)} action chunks")

    # extract only encoded dimensions FIRST (before normalization)
    encoded_chunks = []
    for start, end in encoded_dim_ranges:
        encoded_chunks.append(all_chunks[:, :, start:end])
    encoded_chunks = np.concatenate(encoded_chunks, axis=-1)  # [N, H, D_encoded]
    print(f"Extracted {encoded_chunks.shape[-1]} encoded dimensions")

    # apply normalization to encoded dimensions
    print("\nBefore normalization - overall stats:")
    print(f"  Min: {np.min(encoded_chunks):.4f}, Max: {np.max(encoded_chunks):.4f}")
    print(f"  Mean: {np.mean(encoded_chunks):.4f}, Std: {np.std(encoded_chunks):.4f}")

    # get normalization stats from dataset
    norm_stats = dataset.meta.stats
    if norm_stats is not None and "action" in norm_stats:
        action_stats = norm_stats["action"]

        # build encoded dimension indices
        encoded_dim_indices = []
        for start, end in encoded_dim_ranges:
            encoded_dim_indices.extend(range(start, end))
        encoded_dim_indices = np.array(encoded_dim_indices)

        # extract stats for encoded dimensions only
        encoded_stats = {}
        for stat_name, stat_values in action_stats.items():
            if isinstance(stat_values, (list, np.ndarray)):
                stat_array = np.array(stat_values)
                if len(stat_array) > max(encoded_dim_indices):
                    encoded_stats[stat_name] = stat_array[encoded_dim_indices]

        if encoded_stats:
            print(f"\nNormalization stats for encoded dimensions (mode: {norm_mode.value}):")
            for stat_name, stat_values in encoded_stats.items():
                print(
                    f"  {stat_name}: shape={stat_values.shape}, "
                    f"range=[{np.min(stat_values):.4f}, {np.max(stat_values):.4f}]"
                )

            # apply normalization based on mode
            try:
                encoded_chunks = apply_normalization(encoded_chunks, encoded_stats, norm_mode, eps=1e-8)
                print(f"\nApplied {norm_mode.value} normalization")
            except ValueError as e:
                print(f"Warning: {e}. Using raw actions without normalization.")

            print("\nAfter normalization - overall stats:")
            print(f"  Min: {np.min(encoded_chunks):.4f}, Max: {np.max(encoded_chunks):.4f}")
            print(f"  Mean: {np.mean(encoded_chunks):.4f}, Std: {np.std(encoded_chunks):.4f}")

            print("\nPer-dimension stats (after normalization):")
            for d in range(encoded_chunks.shape[-1]):
                dim_data = encoded_chunks[:, :, d]
                print(
                    f"  Dim {d}: min={np.min(dim_data):7.4f}, max={np.max(dim_data):7.4f}, "
                    f"mean={np.mean(dim_data):7.4f}, std={np.std(dim_data):7.4f}"
                )
        else:
            print("Warning: Could not extract stats for encoded dimensions, using raw actions")
    else:
        print("Warning: No normalization stats found in dataset, using raw actions")

    print(f"Encoded chunks shape: {encoded_chunks.shape}")

    # train FAST tokenizer
    tokenizer = train_fast_tokenizer(
        encoded_chunks,
        vocab_size=vocab_size,
        scale=scale,
    )

    # compute compression statistics
    compression_stats = compute_compression_stats(tokenizer, encoded_chunks)

    # save tokenizer
    if output_dir is None:
        output_dir = f"fast_tokenizer_{repo_id.replace('/', '_')}"
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    tokenizer.save_pretrained(output_path)

    # save metadata
    metadata = {
        "repo_id": repo_id,
        "vocab_size": vocab_size,
        "scale": scale,
        "encoded_dims": encoded_dims,
        "encoded_dim_ranges": encoded_dim_ranges,
        "total_encoded_dims": total_encoded_dims,
        "delta_dims": delta_dims,
        "delta_dim_list": delta_dim_list,
        "use_delta_transform": use_delta_transform,
        "state_key": state_key,
        "normalization_mode": norm_mode.value,
        "action_horizon": action_horizon,
        "num_training_chunks": len(encoded_chunks),
        "compression_stats": compression_stats,
    }

    with open(output_path / "metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)

    print(f"\nSaved FAST tokenizer to {output_path}")
    print(f"Metadata: {json.dumps(metadata, indent=2)}")

    # push to Hugging Face Hub if requested
    if push_to_hub:
        # determine the hub repository ID
        if hub_repo_id is None:
            hub_repo_id = output_path.name
            print(f"\nNo hub_repo_id provided, using: {hub_repo_id}")

        print(f"\nPushing tokenizer to Hugging Face Hub: {hub_repo_id}")
        print(f"  Private: {hub_private}")

        try:
            # use the tokenizer's push_to_hub method
            tokenizer.push_to_hub(
                repo_id=hub_repo_id,
                private=hub_private,
                commit_message=f"Upload FAST tokenizer trained on {repo_id}",
            )

            # also upload the metadata.json file separately
            api = HfApi()
            api.upload_file(
                path_or_fileobj=str(output_path / "metadata.json"),
                path_in_repo="metadata.json",
                repo_id=hub_repo_id,
                repo_type="model",
                commit_message="Upload tokenizer metadata",
            )

            print(f"Successfully pushed tokenizer to: https://huggingface.co/{hub_repo_id}")
        except Exception as e:
            print(f"Error pushing to hub: {e}")
            print("  Make sure you're logged in with `huggingface-cli login`")


if __name__ == "__main__":
    tyro.cli(main)
@@ -75,7 +75,7 @@ from .policy_robot_bridge import (
     RobotActionToPolicyActionProcessorStep,
 )
 from .rename_processor import RenameObservationsProcessorStep
-from .tokenizer_processor import TokenizerProcessorStep
+from .tokenizer_processor import ActionTokenizerProcessorStep, TokenizerProcessorStep

 __all__ = [
     "ActionProcessorStep",
@@ -122,6 +122,7 @@ __all__ = [
     "AddBatchDimensionProcessorStep",
     "RobotProcessorPipeline",
     "TokenizerProcessorStep",
+    "ActionTokenizerProcessorStep",
     "Torch2NumpyActionProcessorStep",
     "RobotActionToPolicyActionProcessorStep",
     "PolicyActionToRobotActionProcessorStep",
@@ -23,22 +23,29 @@ token IDs and attention masks, which are then added to the observation dictionar

 from __future__ import annotations

 import logging
 from dataclasses import dataclass, field
 from typing import TYPE_CHECKING, Any

 import torch

 from lerobot.configs.types import FeatureType, PipelineFeatureType, PolicyFeature
-from lerobot.utils.constants import OBS_LANGUAGE_ATTENTION_MASK, OBS_LANGUAGE_TOKENS
+from lerobot.utils.constants import (
+    ACTION_TOKEN_MASK,
+    ACTION_TOKENS,
+    OBS_LANGUAGE_ATTENTION_MASK,
+    OBS_LANGUAGE_TOKENS,
+)
 from lerobot.utils.import_utils import _transformers_available

 from .core import EnvTransition, TransitionKey
-from .pipeline import ObservationProcessorStep, ProcessorStepRegistry
+from .pipeline import ActionProcessorStep, ObservationProcessorStep, ProcessorStepRegistry

 # Conditional import for type checking and lazy loading
 if TYPE_CHECKING or _transformers_available:
-    from transformers import AutoTokenizer
+    from transformers import AutoProcessor, AutoTokenizer
 else:
+    AutoProcessor = None
     AutoTokenizer = None
@@ -268,3 +275,256 @@ class TokenizerProcessorStep(ObservationProcessorStep):
        )

        return features


@dataclass
@ProcessorStepRegistry.register(name="action_tokenizer_processor")
class ActionTokenizerProcessorStep(ActionProcessorStep):
    """
    Processor step that tokenizes action data using a FAST action tokenizer.

    This step takes action tensors from an `EnvTransition`, tokenizes them using
    a Hugging Face `transformers` AutoProcessor (such as the Physical Intelligence "fast" tokenizer),
    and returns the tokenized action.

    Requires the `transformers` library to be installed.

    Attributes:
        action_tokenizer_name: The name of a pretrained processor from the Hugging Face Hub (e.g., "physical-intelligence/fast").
        action_tokenizer_input_object: A pre-initialized processor/tokenizer object. If provided, `action_tokenizer_name` is ignored.
        trust_remote_code: Whether to trust remote code when loading the tokenizer (required for some tokenizers).
        action_tokenizer: The internal tokenizer/processor instance, loaded during initialization.
        paligemma_tokenizer_name: The name of a pretrained PaliGemma tokenizer from the Hugging Face Hub (e.g., "google/paligemma-3b-pt-224").
    """

    action_tokenizer_name: str | None = None
    action_tokenizer_input_object: Any | None = None
    trust_remote_code: bool = True
    max_action_tokens: int = 256
    fast_skip_tokens: int = 128
    paligemma_tokenizer_name: str = "google/paligemma-3b-pt-224"
    # Internal tokenizer instances (not part of the config)
    action_tokenizer: Any = field(default=None, init=False, repr=False)
    _paligemma_tokenizer: Any = field(default=None, init=False, repr=False)

    def __post_init__(self):
        """
        Initializes the action tokenizer after the dataclass is created.

        It checks for the availability of the `transformers` library and loads the tokenizer
        either from a provided object or by name from the Hugging Face Hub.

        Raises:
            ImportError: If the `transformers` library is not installed.
            ValueError: If neither `action_tokenizer_input_object` nor `action_tokenizer_name` is provided.
        """
        if not _transformers_available:
            raise ImportError(
                "The 'transformers' library is not installed. "
                "Please install it with `pip install 'lerobot[transformers-dep]'` to use ActionTokenizerProcessorStep."
            )

        if self.action_tokenizer_input_object is not None:
            self.action_tokenizer = self.action_tokenizer_input_object
        elif self.action_tokenizer_name is not None:
            if AutoProcessor is None:
                raise ImportError("AutoProcessor is not available")
            self.action_tokenizer = AutoProcessor.from_pretrained(
                self.action_tokenizer_name, trust_remote_code=self.trust_remote_code
            )
        else:
            raise ValueError(
                "Either 'action_tokenizer_input_object' or 'action_tokenizer_name' must be provided. "
                "Pass a tokenizer object directly or a tokenizer name to auto-load."
            )

        self._paligemma_tokenizer = AutoTokenizer.from_pretrained(
            self.paligemma_tokenizer_name,
            trust_remote_code=self.trust_remote_code,
            add_eos_token=True,
            add_bos_token=False,
        )

    def __call__(self, transition: EnvTransition) -> EnvTransition:
        """
        Applies action tokenization to the transition.

        This overrides the base class to handle both tokens and mask.

        Args:
            transition: The input transition with action data.

        Returns:
            The processed transition with tokenized actions and mask in complementary data.
        """
        self._current_transition = transition.copy()
        new_transition = self._current_transition

        action = new_transition.get(TransitionKey.ACTION)
        if action is None:
            # During inference, no action is available, so skip tokenization
            return new_transition

        # Tokenize and get both tokens and mask
        tokens, mask = self._tokenize_action(action)

        # Store tokens and mask in complementary data
        complementary_data = new_transition.get(TransitionKey.COMPLEMENTARY_DATA, {})
        if complementary_data is None:
            complementary_data = {}
        complementary_data[ACTION_TOKEN_MASK] = mask
        complementary_data[ACTION_TOKENS] = tokens
        new_transition[TransitionKey.COMPLEMENTARY_DATA] = complementary_data
        return new_transition

    def _act_tokens_to_paligemma_tokens(self, tokens: torch.Tensor) -> torch.Tensor:
        """Converts action tokens to PaliGemma tokens."""
        return self._paligemma_tokenizer.vocab_size - 1 - self.fast_skip_tokens - tokens

    def _tokenize_action(self, action: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Tokenizes the action tensor and creates a mask.

        Args:
            action: The input action tensor to tokenize. Shape: (B, H, action_dim) or (H, action_dim)

        Returns:
            A tuple of (tokens, mask) where:
              - tokens: Tensor of token IDs with shape (B, max_action_tokens)
              - mask: Boolean mask with shape (B, max_action_tokens), True for real tokens, False for padding
        """
        if action is None:
            raise ValueError("Action cannot be None")

        # Get the device of the input action
        device = action.device if isinstance(action, torch.Tensor) else None

        # Handle single sample (add batch dimension)
        single_sample = action.dim() == 1
        if single_sample:
            action = action.unsqueeze(0)

        batch_size = action.shape[0]

        # Tokenize the action batch.
        # The fast tokenizer expects action data and returns token IDs.
        tokens_list = []
        masks_list = []

        for i in range(batch_size):
            # Tokenize single action (move to CPU first, as the tokenizer uses scipy, which requires numpy)
            action_cpu = action[i : i + 1].cpu()
            tokens = self.action_tokenizer(action_cpu)

            # Convert to a tensor if it's a list
            if isinstance(tokens, list) or not isinstance(tokens, torch.Tensor):
                tokens = torch.tensor(tokens, dtype=torch.long, device=action.device)
            else:
                # Move tokens back to the same device as the input action
                tokens = tokens.to(device=action.device)

            # Flatten to 1D if needed
            if tokens.dim() > 1:
                tokens = tokens.flatten()

            bos_id = self._paligemma_tokenizer.bos_token_id
            # add bos
            tokens = torch.cat(
                [
                    torch.tensor([bos_id], device=action.device),
                    torch.tensor(
                        self._paligemma_tokenizer.encode("Action: ", add_special_tokens=False),
                        device=action.device,
                    ),
                    self._act_tokens_to_paligemma_tokens(tokens),
                    torch.tensor(self._paligemma_tokenizer.encode("|"), device=action.device),
                ]
            )

            # Truncate or pad to max_action_tokens
            if len(tokens) > self.max_action_tokens:
                logging.warning(
                    f"Token length ({len(tokens)}) exceeds max length ({self.max_action_tokens}), truncating. "
                    "Consider increasing the `max_action_tokens` in your model config if this happens frequently."
                )
                tokens = tokens[: self.max_action_tokens]
                mask = torch.ones(self.max_action_tokens, dtype=torch.bool, device=action.device)
            else:
                mask = torch.cat(
                    [
                        torch.ones(len(tokens), dtype=torch.bool, device=action.device),
                        torch.zeros(
                            self.max_action_tokens - len(tokens), dtype=torch.bool, device=action.device
                        ),
                    ]
                )
                # Pad tokens with zeros
                tokens = torch.nn.functional.pad(tokens, (0, self.max_action_tokens - len(tokens)), value=0)

            tokens_list.append(tokens)
            masks_list.append(mask)

        # Stack into batched tensors
        tokens_batch = torch.stack(tokens_list, dim=0)  # (B, max_action_tokens)
        masks_batch = torch.stack(masks_list, dim=0)  # (B, max_action_tokens)

        # Remove the batch dimension if the input was a single sample
        if single_sample:
            tokens_batch = tokens_batch.squeeze(0)
            masks_batch = masks_batch.squeeze(0)

        # Move to the same device as the input
        if device is not None:
            tokens_batch = tokens_batch.to(device)
            masks_batch = masks_batch.to(device)

        return tokens_batch, masks_batch

    def action(self, action: torch.Tensor) -> torch.Tensor:
        """
        This method is not used directly since we override __call__;
        it is required by the ActionProcessorStep ABC.
        """
        tokens, _ = self._tokenize_action(action)
        return tokens

    def get_config(self) -> dict[str, Any]:
        """
        Returns the serializable configuration of the processor.

        Note: The tokenizer object itself is not serialized. If the processor was initialized
        with a tokenizer name, that name will be included in the config.

        Returns:
            A dictionary with the processor's configuration parameters.
        """
        config = {
            "trust_remote_code": self.trust_remote_code,
            "max_action_tokens": self.max_action_tokens,
        }

        # Only save the tokenizer name if it was used to create the tokenizer
        if self.action_tokenizer_name is not None and self.action_tokenizer_input_object is None:
            config["action_tokenizer_name"] = self.action_tokenizer_name

        return config

    def transform_features(
        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
        """
        Updates feature definitions to reflect tokenized actions.

        This updates the policy features dictionary to indicate that the action
        has been tokenized into a sequence of token IDs with shape (max_action_tokens,).

        Args:
            features: The dictionary of existing policy features.

        Returns:
            The updated dictionary of policy features.
        """
        return features
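The `_act_tokens_to_paligemma_tokens` mapping above folds FAST action tokens into the top of the PaliGemma vocabulary with an affine flip, which is its own inverse. A small sketch in plain Python (the vocabulary size here is an assumption for illustration, not read from the real tokenizer):

```python
# Assumed values for illustration; in the processor, vocab_size comes from the PaliGemma tokenizer
vocab_size = 257152
fast_skip_tokens = 128  # reserved slots at the very end of the vocabulary

def act_to_paligemma(token: int) -> int:
    # Map action token k to vocab_size - 1 - fast_skip_tokens - k
    return vocab_size - 1 - fast_skip_tokens - token

def paligemma_to_act(token: int) -> int:
    # The same affine flip inverts the mapping
    return vocab_size - 1 - fast_skip_tokens - token

mapped = [act_to_paligemma(k) for k in (0, 1, 500)]
print(mapped)  # [257023, 257022, 256523]
```

Action token 0 lands at the highest usable ID, and larger action tokens move downward, so the FAST vocabulary never collides with PaliGemma's text tokens.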
@@ -28,6 +28,8 @@ OBS_LANGUAGE_TOKENS = OBS_LANGUAGE + ".tokens"
 OBS_LANGUAGE_ATTENTION_MASK = OBS_LANGUAGE + ".attention_mask"

 ACTION = "action"
+ACTION_TOKENS = ACTION + ".tokens"
+ACTION_TOKEN_MASK = ACTION + ".token_mask"
 REWARD = "next.reward"
 TRUNCATED = "next.truncated"
 DONE = "next.done"
@@ -63,6 +63,7 @@ def is_package_available(pkg_name: str, return_version: bool = False) -> tuple[b

 _transformers_available = is_package_available("transformers")
 _peft_available = is_package_available("peft")
+_scipy_available = is_package_available("scipy")


 def make_device_from_device_class(config: ChoiceRegistry) -> Any:
@@ -0,0 +1,504 @@
|
||||
#!/usr/bin/env python
|
||||
|
||||
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""Test script to verify PI0Fast policy integration with LeRobot vs the original implementation"""
|
||||
# ruff: noqa: E402
|
||||
|
||||
import os
|
||||
import random
|
||||
from copy import deepcopy
|
||||
from typing import Any
|
||||
|
||||
import numpy as np
|
||||
import pytest
|
||||
import torch
|
||||
|
||||
pytest.importorskip("transformers")
|
||||
pytest.importorskip("scipy")
|
||||
pytestmark = pytest.mark.skipif(
|
||||
os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
|
||||
reason="This test requires accepting the model license",
|
||||
)
|
||||
|
||||
from lerobot.policies.pi0_fast.configuration_pi0_fast import PI0FastConfig
|
||||
from lerobot.policies.pi0_fast.modeling_pi0_fast import PI0FastPolicy
|
||||
from lerobot.policies.pi0_fast.processor_pi0_fast import make_pi0_fast_pre_post_processors
|
||||
from lerobot.processor import PolicyAction, PolicyProcessorPipeline # noqa: E402
|
||||
from lerobot.utils.constants import (
|
||||
ACTION_TOKEN_MASK,
|
||||
ACTION_TOKENS,
|
||||
OBS_IMAGES,
|
||||
OBS_LANGUAGE_ATTENTION_MASK,
|
||||
OBS_LANGUAGE_TOKENS,
|
||||
OBS_STATE,
|
||||
) # noqa: E402
|
||||
from tests.utils import require_cuda # noqa: E402
|
||||
|
||||
# Constants
|
||||
DUMMY_ACTION_DIM = 7
|
||||
DUMMY_STATE_DIM = 20
|
||||
IMAGE_HEIGHT = 224
|
||||
IMAGE_WIDTH = 224
|
||||
NUM_VIEWS = 2 # Number of camera views
|
||||
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
|
||||
MODEL_PATH_LEROBOT = "lerobot/pi0fast-base"
|
||||
|
||||
# Expected action token shape: (batch_size, max_decoding_steps)
|
||||
EXPECTED_ACTION_TOKENS_SHAPE = (1, 2)
|
||||
|
||||
# Expected first 5 action tokens (for reproducibility check)
|
||||
EXPECTED_ACTION_TOKENS_FIRST_5 = torch.tensor([255657, 255362])
|
||||
|
||||
# Expected actions after detokenization
|
||||
EXPECTED_ACTIONS_SHAPE = (1, 2, 32) # (batch_size, n_action_steps, action_dim)
|
||||
EXPECTED_ACTIONS_MEAN = 0.04419417306780815
|
||||
EXPECTED_ACTIONS_STD = 0.26231569051742554
|
||||
EXPECTED_ACTIONS_FIRST_5 = torch.tensor([0.0000, 1.4849, 0.0000, 0.0000, 0.0000])
|
||||
|
||||
|
||||
def set_seed_all(seed: int):
|
||||
"""Set random seed for all RNG sources to ensure reproducibility."""
|
||||
random.seed(seed)
|
||||
np.random.seed(seed)
|
||||
torch.manual_seed(seed)
|
||||
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.manual_seed(seed)
|
||||
torch.cuda.manual_seed_all(seed)
|
||||
|
||||
# Set deterministic behavior
|
||||
torch.backends.cudnn.deterministic = True
|
||||
torch.backends.cudnn.benchmark = False
|
||||
torch.use_deterministic_algorithms(True, warn_only=True)
|
||||
|
||||
|
||||
def instantiate_lerobot_pi0_fast(
    from_pretrained: bool = False,
    model_path: str = MODEL_PATH_LEROBOT,
) -> tuple[
    Any,  # Policy
    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
    PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
    """Instantiate LeRobot PI0Fast policy with preprocessor and postprocessor."""
    if from_pretrained:
        policy = PI0FastPolicy.from_pretrained(
            pretrained_name_or_path=model_path,
            strict=True,
        )
        policy.config.validate_action_token_prefix = False
        policy.config.max_action_tokens = 2
        policy.config.max_decoding_steps = 2
        policy.config.chunk_size = 2
        policy.config.n_action_steps = 2
    else:
        config = PI0FastConfig(
            n_action_steps=2,
            max_action_dim=DUMMY_ACTION_DIM,
            max_state_dim=DUMMY_STATE_DIM,
            device=DEVICE,
            validate_action_token_prefix=False,
            max_action_tokens=2,
            max_decoding_steps=2,
            chunk_size=2,
        )
        policy = PI0FastPolicy(config)

    policy.to(DEVICE)
    policy.config.device = DEVICE
    preprocessor, postprocessor = make_pi0_fast_pre_post_processors(
        config=policy.config,
        dataset_stats=None,  # Pass None for dataset_stats to disable normalization
    )

    return policy, preprocessor, postprocessor


def create_dummy_data(device=DEVICE):
    """Create dummy data for testing both implementations."""
    batch_size = 1
    prompt = "Pick up the red block and place it in the bin"

    # Create random RGB images in [0, 255] uint8 range (as PIL images would be);
    # conversion to [0, 1] float32 is left to the LeRobot preprocessor.
    def fake_rgb(h, w):
        arr = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
        t = torch.from_numpy(arr).permute(2, 0, 1)  # HWC -> CHW
        return t

    batch = {
        f"{OBS_IMAGES}.base_0_rgb": torch.stack(
            [fake_rgb(IMAGE_HEIGHT, IMAGE_WIDTH) for _ in range(batch_size)]
        ).to(device),
        f"{OBS_IMAGES}.left_wrist_0_rgb": torch.stack(
            [fake_rgb(IMAGE_HEIGHT, IMAGE_WIDTH) for _ in range(batch_size)]
        ).to(device),
        f"{OBS_IMAGES}.right_wrist_0_rgb": torch.stack(
            [fake_rgb(IMAGE_HEIGHT, IMAGE_WIDTH) for _ in range(batch_size)]
        ).to(device),
        OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=device),
        "task": [prompt for _ in range(batch_size)],
    }

    return batch


# Pytest fixtures
@pytest.fixture(scope="module")
def pi0_fast_components():
    """Fixture to instantiate and provide all PI0Fast components for tests."""
    print(f"\nTesting with DEVICE='{DEVICE}'")
    print("\n[Setup] Instantiating LeRobot PI0Fast policy...")
    policy_obj, preprocessor_obj, postprocessor_obj = instantiate_lerobot_pi0_fast(from_pretrained=True)
    print("Model loaded successfully")
    yield policy_obj, preprocessor_obj, postprocessor_obj


@pytest.fixture(scope="module")
def policy(pi0_fast_components):
    """Fixture to provide the PI0Fast policy for tests."""
    return pi0_fast_components[0]


@pytest.fixture(scope="module")
def preprocessor(pi0_fast_components):
    """Fixture to provide the PI0Fast preprocessor for tests."""
    return pi0_fast_components[1]


@require_cuda
def test_pi0_fast_preprocessor_alignment(policy, preprocessor):
    """Test that LeRobot PI0Fast preprocessor produces expected outputs."""
    print("\n" + "=" * 80)
    print("Test: PI0Fast Preprocessor Outputs")
    print("=" * 80)

    set_seed_all(42)

    print("\nCreating dummy data...")
    batch = create_dummy_data()

    print("\n[LeRobot] Preprocessing...")
    lerobot_observation = preprocessor(deepcopy(batch))

    print("\nVerifying preprocessor outputs:")
    print("-" * 80)

    # Expected keys from PI0Fast preprocessing
    expected_keys = [
        "observation.images.base_0_rgb",
        "observation.images.left_wrist_0_rgb",
        "observation.images.right_wrist_0_rgb",
        "observation.state",
        "observation.language_tokens",
        "observation.language_attention_mask",
    ]

    for key in expected_keys:
        if key in lerobot_observation:
            shape = tuple(lerobot_observation[key].shape)
            print(f"\nKey: {key}")
            print(f"Shape: {shape}")
            print(f"Dtype: {lerobot_observation[key].dtype}")
        else:
            print(f"\nKey '{key}' not found in inputs!")

    # Check language tokens shape
    if "observation.language_tokens" in lerobot_observation:
        lang_tokens = lerobot_observation["observation.language_tokens"]
        print(f"\nLanguage tokens shape: {lang_tokens.shape}")
        # Should have batch dimension and max_length from tokenizer
        assert lang_tokens.dim() == 2, f"Expected 2D tensor, got {lang_tokens.dim()}D"

    print("\nPreprocessor outputs verified!")


@require_cuda
def test_pi0_fast_action_generation(policy, preprocessor):
    """Test PI0Fast LeRobot implementation generates expected actions."""
    print("\n" + "=" * 80)
    print("Test: PI0Fast Action Generation Against Expected Values")
    print("=" * 80)

    set_seed_all(42)

    print("\nCreating dummy data...")
    batch = create_dummy_data()

    print("\n[LeRobot] Running inference...")
    lerobot_observation = preprocessor(deepcopy(batch))

    # Reset seed for inference
    torch.manual_seed(42)
    with torch.no_grad():
        lerobot_actions = policy.predict_action_chunk(lerobot_observation)
        lerobot_actions = lerobot_actions.float().cpu()

    print(f"LeRobot actions shape: {lerobot_actions.shape}")
    print(f"LeRobot actions mean: {lerobot_actions.mean().item():.6f}")
    print(f"LeRobot actions std: {lerobot_actions.std().item():.6f}")
    print(f"LeRobot actions first 5: {lerobot_actions[0, 0, :5]}")

    print("\nExpected values (from original PI0Fast):")
    print(f"Expected actions shape: {EXPECTED_ACTIONS_SHAPE}")
    print(f"Expected actions mean: {EXPECTED_ACTIONS_MEAN:.6f}")
    print(f"Expected actions std: {EXPECTED_ACTIONS_STD:.6f}")
    print(f"Expected actions first 5: {EXPECTED_ACTIONS_FIRST_5}")

    print("\nAction Comparison:")
    print("-" * 80)

    # Compare shapes
    actual_shape = tuple(lerobot_actions.shape)
    print(f"Actual shape: {actual_shape}")

    assert actual_shape == EXPECTED_ACTIONS_SHAPE, (
        f"Shape mismatch: {actual_shape} vs {EXPECTED_ACTIONS_SHAPE}"
    )
    print(f"Shape matches: {actual_shape}")

    # Compare statistics
    actual_mean = lerobot_actions.mean().item()
    actual_std = lerobot_actions.std().item()

    print(f"\nMean: {actual_mean:.6f} (expected: {EXPECTED_ACTIONS_MEAN:.6f})")
    print(f"Std: {actual_std:.6f} (expected: {EXPECTED_ACTIONS_STD:.6f})")

    # Compare first 5 actions
    actual_first_5 = lerobot_actions[0, 0, :5]
    print("\nFirst 5 actions comparison:")
    print(f"  Actual: {actual_first_5}")
    print(f"  Expected: {EXPECTED_ACTIONS_FIRST_5}")

    first_5_diff = torch.abs(actual_first_5 - EXPECTED_ACTIONS_FIRST_5)
    print(f"  Max diff: {first_5_diff.max().item():.6e}")
    print(f"  Mean diff: {first_5_diff.mean().item():.6e}")

    # Check with different tolerances
    tolerances = [1e-5, 1e-4, 1e-3, 1e-2]
    for tol in tolerances:
        is_close = torch.allclose(actual_first_5, EXPECTED_ACTIONS_FIRST_5, atol=tol)
        status = "Success" if is_close else "Failure"
        print(f"{status}: First 5 actions close (atol={tol}): {is_close}")

    # Assert with reasonable tolerance
    tolerance = 1e-3
    assert torch.allclose(actual_first_5, EXPECTED_ACTIONS_FIRST_5, atol=tolerance), (
        f"First 5 actions differ by more than tolerance ({tolerance})"
    )
    print(f"\nSuccess: Actions match expected values within tolerance ({tolerance})!")

    print("\nAction generation test completed (values printed for reference)!")


@require_cuda
def test_pi0_fast_inference_reproducibility(policy, preprocessor):
    """Test that PI0Fast inference is reproducible with the same seed."""
    print("\n" + "=" * 80)
    print("Test: PI0Fast Inference Reproducibility")
    print("=" * 80)

    print("\nCreating dummy data...")
    batch = create_dummy_data()

    # First inference
    print("\n[Run 1] Running inference...")
    set_seed_all(42)
    lerobot_observation = preprocessor(deepcopy(batch))
    with torch.no_grad():
        actions_1 = policy.predict_action_chunk(lerobot_observation)
        actions_1 = actions_1.float().cpu()

    # Second inference with same seed
    print("\n[Run 2] Running inference with same seed...")
    set_seed_all(42)
    lerobot_observation = preprocessor(deepcopy(batch))
    with torch.no_grad():
        actions_2 = policy.predict_action_chunk(lerobot_observation)
        actions_2 = actions_2.float().cpu()

    print("\nComparing two runs:")
    print("-" * 80)
    if torch.allclose(actions_1, actions_2, atol=1e-8):
        print("Inference is perfectly reproducible!")
    else:
        diff = torch.abs(actions_1 - actions_2)
        print("Small differences detected:")
        print(f"  Max diff: {diff.max().item():.6e}")
        print(f"  Mean diff: {diff.mean().item():.6e}")

    assert torch.allclose(actions_1, actions_2, atol=1e-6), "Inference should be reproducible!"

    print("\nInference is reproducible!")


@require_cuda
def test_pi0_fast_forward_pass_logits(policy, preprocessor):
    """Run a PI0Fast forward pass and sanity-check the resulting loss."""
    print("\n" + "=" * 80)
    print("Test: PI0Fast Forward Pass Logits")
    print("=" * 80)

    set_seed_all(42)

    print("\nCreating dummy data with action tokens...")
    batch = create_dummy_data()

    # Preprocess the batch
    lerobot_observation = preprocessor(deepcopy(batch))

    # The forward pass requires action tokens; create dummy ones for testing
    # (in practice, these come from the FAST tokenizer)
    batch_size = 1
    max_action_tokens = policy.config.max_action_tokens

    dummy_action_tokens = torch.randint(
        0, 1000, (batch_size, max_action_tokens), dtype=torch.long, device=DEVICE
    )
    dummy_action_masks = torch.ones(batch_size, max_action_tokens, dtype=torch.bool, device=DEVICE)

    # Add action tokens to the observation
    lerobot_observation[ACTION_TOKENS] = dummy_action_tokens
    lerobot_observation[ACTION_TOKEN_MASK] = dummy_action_masks

    print("\n[LeRobot] Running forward pass...")
    policy.train()
    with torch.no_grad():
        loss, loss_dict = policy.forward(lerobot_observation)

    print(f"Loss: {loss.item():.6f}")
    print(f"FAST Loss: {loss_dict['ce_loss']:.6f}")

    print("\nForward pass completed successfully!")

    # The loss should be a finite, positive value
    assert loss.item() > 0, "Loss should be positive"
    assert not torch.isnan(loss), "Loss should not be NaN"
    assert not torch.isinf(loss), "Loss should not be infinite"

    print("\nForward pass test passed!")


@require_cuda
def test_pi0_fast_action_token_sampling(policy, preprocessor):
    """Test PI0Fast action token sampling (autoregressive decoding)."""
    print("\n" + "=" * 80)
    print("Test: PI0Fast Action Token Sampling")
    print("=" * 80)

    set_seed_all(42)

    print("\nCreating dummy data...")
    batch = create_dummy_data()

    print("\n[LeRobot] Preprocessing...")
    lerobot_observation = preprocessor(deepcopy(batch))

    # Prepare inputs for model
    images, img_masks = policy._preprocess_images(lerobot_observation)
    tokens = lerobot_observation[OBS_LANGUAGE_TOKENS]
    masks = lerobot_observation[OBS_LANGUAGE_ATTENTION_MASK]

    print("\n[LeRobot] Sampling action tokens...")
    torch.manual_seed(42)
    with torch.no_grad():
        action_tokens = policy.model.sample_actions_fast(
            images,
            img_masks,
            tokens,
            masks,
            max_decoding_steps=2,
            temperature=0.0,  # Greedy decoding for reproducibility
        )

    print(f"Action tokens shape: {action_tokens.shape}")
    print(f"Action tokens first 10: {action_tokens[0, :10].tolist()}")

    print("\nExpected values (from original PI0Fast):")
    print(f"Expected shape: {EXPECTED_ACTION_TOKENS_SHAPE}")
    print(f"Expected first 5: {EXPECTED_ACTION_TOKENS_FIRST_5.tolist()}")

    # Verify shape
    actual_shape = tuple(action_tokens.shape)
    print(f"\nActual shape: {actual_shape}")

    assert actual_shape == EXPECTED_ACTION_TOKENS_SHAPE, (
        f"Shape mismatch: {actual_shape} vs {EXPECTED_ACTION_TOKENS_SHAPE}"
    )

    # Compare first 5 tokens
    actual_first_5 = action_tokens[0, :5].cpu()
    assert torch.equal(actual_first_5, EXPECTED_ACTION_TOKENS_FIRST_5), (
        f"First 5 tokens mismatch: {actual_first_5} vs {EXPECTED_ACTION_TOKENS_FIRST_5}"
    )

    print("\nAction token sampling test completed!")


@require_cuda
def test_pi0_fast_detokenization(policy, preprocessor):
    """Test PI0Fast action detokenization (FAST decoding)."""
    print("\n" + "=" * 80)
    print("Test: PI0Fast Action Detokenization")
    print("=" * 80)

    set_seed_all(42)

    print("\nCreating dummy data...")
    batch = create_dummy_data()

    print("\n[LeRobot] Preprocessing...")
    lerobot_observation = preprocessor(deepcopy(batch))

    # Prepare inputs for model
    images, img_masks = policy._preprocess_images(lerobot_observation)
    tokens = lerobot_observation[OBS_LANGUAGE_TOKENS]
    masks = lerobot_observation[OBS_LANGUAGE_ATTENTION_MASK]

    print("\n[LeRobot] Sampling action tokens...")
    torch.manual_seed(42)
    with torch.no_grad():
        action_tokens = policy.model.sample_actions_fast(
            images,
            img_masks,
            tokens,
            masks,
            max_decoding_steps=2,
            temperature=0.0,
        )

    print(f"Action tokens shape: {action_tokens.shape}")

    # Detokenize
    print("\n[LeRobot] Detokenizing action tokens...")
    action_horizon = policy.config.n_action_steps
    action_dim = policy.config.output_features["action"].shape[0]

    try:
        continuous_actions = policy.detokenize_actions(
            action_tokens, action_horizon=action_horizon, action_dim=action_dim
        )
        print(f"Continuous actions shape: {continuous_actions.shape}")
        print(f"Continuous actions mean: {continuous_actions.mean().item():.6f}")
        print(f"Continuous actions std: {continuous_actions.std().item():.6f}")
        print(f"Continuous actions first 5: {continuous_actions[0, 0, :5]}")
        print("\nDetokenization successful!")
    except Exception as e:
        print(f"\nDetokenization failed with error: {e}")
        print("This may be expected if the action tokens are not valid FAST tokens.")
        print("The test will pass as long as the sampling works correctly.")