Merge branch 'main' into feature/add-multitask-dit

This commit is contained in:
Bryson Jones
2026-01-10 17:04:08 -08:00
committed by GitHub
15 changed files with 3214 additions and 5 deletions
+2
@@ -37,6 +37,8 @@
title: SmolVLA
- local: pi0
title: π₀ (Pi0)
- local: pi0fast
title: π₀-FAST (Pi0Fast)
- local: pi05
title: π₀.₅ (Pi05)
- local: groot
+182
@@ -0,0 +1,182 @@
# π₀-FAST (Pi0-FAST)
π₀-FAST is a **Vision-Language-Action model for general robot control** that uses autoregressive next-token prediction to model continuous robot actions.
## Model Overview
π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called **FAST (Frequency-space Action Sequence Tokenization)**. This enables training autoregressive VLAs on highly dexterous tasks that are impossible with standard binning-based discretization, while training **up to 5x faster** than diffusion-based approaches like π₀.
### Why FAST?
Standard approaches to robot action tokenization use simple per-dimension, per-timestep binning schemes. This works for simple behaviors but breaks down quickly on complex, dexterous skills that require precise, high-frequency control.
FAST solves this by compressing action sequences using signal processing techniques, resulting in a dense sequence of action tokens that can be predicted autoregressively—just like language tokens.
### How FAST Tokenization Works
The FAST tokenizer compresses action sequences through the following steps:
1. **Normalize**: Take a continuous action chunk of shape `(H, D)`, where `H` is the horizon and `D` is the action dimension, and normalize it with one of the supported normalization methods (quantile normalization is recommended because it handles outliers).
2. **Discrete Cosine Transform (DCT)**: Apply the DCT (via scipy) to each action dimension separately. The DCT is a transform widely used in image and audio compression codecs (JPEG, MP3).
3. **Quantization**: Round and remove insignificant coefficients for each action dimension, producing a sparse frequency matrix.
4. **Flatten**: Flatten the matrix into a 1D vector, with low-frequency components first.
5. **Byte Pair Encoding (BPE)**: Train a BPE tokenizer to compress the DCT coefficients into dense action tokens, typically achieving **10x compression** over prior tokenization approaches.
This approach can transform **any existing VLM** into a VLA by training it to predict these FAST tokens.
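To make the compression concrete, below is a minimal sketch of steps 2 to 4 (DCT, quantization, flattening) for a single normalized action chunk. It is illustrative only: the actual tokenizer is the `physical-intelligence/fast` processor, which additionally fits a BPE model over these integer coefficients (step 5).
```python
import numpy as np
from scipy.fft import dct

def dct_compress(chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Compress a normalized action chunk of shape (H, D) with values in [-1, 1]."""
    coeffs = dct(chunk, axis=0, norm="ortho")         # per-dimension DCT along the time axis
    quantized = np.round(coeffs * scale).astype(int)  # rounding zeroes out insignificant coefficients
    # Row-major flatten puts low-frequency components first.
    return quantized.flatten()

flat_coeffs = dct_compress(np.random.uniform(-1, 1, size=(10, 6)))
print(flat_coeffs.shape)  # (60,) integer coefficients, ready for BPE compression
```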
## Installation Requirements
1. Install LeRobot by following our [Installation Guide](./installation).
2. Install π₀-FAST dependencies by running:
```bash
pip install -e ".[pi]"
```
> [!NOTE]
> For lerobot 0.4.0, to install the `pi` extra you will need to run: `pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git"`.
>
> This will be fixed in the next patch release.
## Training a Custom FAST Tokenizer
You have two options for the FAST tokenizer:
1. **Use the pre-trained tokenizer**: The `physical-intelligence/fast` tokenizer was trained on 1M+ real robot action sequences and works as a general-purpose tokenizer.
2. **Train your own tokenizer**: For maximum performance on your specific dataset, you can finetune the tokenizer on your own data.
### Training Your Own Tokenizer
```bash
python src/lerobot/policies/pi0_fast/train_fast_tokenizer.py \
--repo_id "user/my-lerobot-dataset" \
--action_horizon 10 \
--encoded_dims "0:6" \
--vocab_size 1024 \
--scale 10.0 \
--normalization_mode QUANTILES \
--output_dir "./my_fast_tokenizer" \
--push_to_hub \
--hub_repo_id "username/my-action-tokenizer"
```
### Key Tokenizer Parameters
| Parameter | Description | Default |
| ---------------------- | --------------------------------------------------------------------------------- | ------------ |
| `--repo_id` | LeRobot dataset repository ID | Required |
| `--action_horizon` | Number of future actions in each chunk | `10` |
| `--encoded_dims` | Comma-separated dimension ranges to encode (e.g., `"0:6,7:23"`) | `"0:6,7:23"` |
| `--vocab_size` | BPE vocabulary size | `1024` |
| `--scale` | DCT scaling factor for quantization | `10.0` |
| `--normalization_mode` | Normalization mode (`MEAN_STD`, `MIN_MAX`, `QUANTILES`, `QUANTILE10`, `IDENTITY`) | `QUANTILES` |
| `--sample_fraction` | Fraction of chunks to sample per episode | `0.1` |
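After training, you can sanity-check the saved tokenizer by loading it with `transformers` and encoding a dummy chunk. This is a hedged sketch: the directory matches the `--output_dir` used above, and the chunk shape must match the `--action_horizon` and encoded dimensions the tokenizer was trained on.
```python
import numpy as np
from transformers import AutoProcessor

tokenizer = AutoProcessor.from_pretrained("./my_fast_tokenizer", trust_remote_code=True)

# One normalized action chunk of shape (action_horizon, encoded_dims)
chunk = np.random.uniform(-1, 1, size=(10, 6))
tokens = tokenizer(chunk[None])[0]  # encode a batch of one chunk
print(f"{len(tokens)} tokens for a chunk of shape {chunk.shape}")
```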
## Usage
To use π₀-FAST in LeRobot, set the policy type on the command line:
```bash
--policy.type=pi0_fast
```
## Training
For training π₀-FAST, you can use the LeRobot training script:
```bash
python src/lerobot/scripts/lerobot_train.py \
--dataset.repo_id=your_dataset \
--policy.type=pi0_fast \
--output_dir=./outputs/pi0fast_training \
--job_name=pi0fast_training \
--policy.pretrained_path=lerobot/pi0_fast_base \
--policy.dtype=bfloat16 \
--policy.gradient_checkpointing=true \
--policy.chunk_size=10 \
--policy.n_action_steps=10 \
--policy.max_action_tokens=256 \
--steps=100000 \
--batch_size=4 \
--policy.device=cuda
```
### Key Training Parameters
| Parameter | Description | Default |
| -------------------------------------- | -------------------------------------------------- | ---------------------------- |
| `--policy.gradient_checkpointing=true` | Reduces memory usage significantly during training | `false` |
| `--policy.dtype=bfloat16` | Use mixed precision training for efficiency | `float32` |
| `--policy.chunk_size` | Number of action steps to predict (action horizon) | `50` |
| `--policy.n_action_steps` | Number of action steps to execute | `50` |
| `--policy.max_action_tokens` | Maximum number of FAST tokens per action chunk | `256` |
| `--policy.action_tokenizer_name` | FAST tokenizer to use | `physical-intelligence/fast` |
| `--policy.compile_model=true` | Enable torch.compile for faster training | `false` |
## Inference
### KV-Caching for Fast Inference
π₀-FAST supports **KV-caching**, a widely used optimization in LLM inference. This caches the key-value pairs from the attention mechanism, avoiding redundant computation during autoregressive decoding.
```bash
# KV-caching is enabled by default
--policy.use_kv_cache=true
```
### Inference Example
```python
from lerobot.policies.pi0_fast import PI0FastPolicy

# Load the policy
policy = PI0FastPolicy.from_pretrained("your-model-path")

# During inference, `batch` is a preprocessed observation dict
actions = policy.predict_action_chunk(batch)
```
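For a complete call, inputs first pass through the policy's pre-processing pipeline (state padding and discretization, prompt building, PaliGemma tokenization) and the predicted chunk passes through the post-processing pipeline. The sketch below loosely mirrors the integration tests added in this PR; the checkpoint path, camera key, and state size are placeholders for your own setup, and `dataset_stats=None` disables normalization.
```python
import torch

from lerobot.policies.pi0_fast import PI0FastPolicy
from lerobot.policies.pi0_fast.processor_pi0_fast import make_pi0_fast_pre_post_processors

policy = PI0FastPolicy.from_pretrained("lerobot/pi0_fast_base")
preprocessor, postprocessor = make_pi0_fast_pre_post_processors(
    config=policy.config,
    dataset_stats=None,  # pass dataset stats here to enable normalization
)

batch = {
    "observation.images.base_0_rgb": torch.randint(0, 256, (1, 3, 224, 224), dtype=torch.uint8),
    "observation.state": torch.randn(1, 32),
    "task": ["pick up the red block and place it in the bin"],
}

with torch.no_grad():
    observation = preprocessor(batch)
    actions = policy.predict_action_chunk(observation)

actions = postprocessor(actions)  # unnormalize and move to CPU
```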
## Model Architecture
π₀-FAST uses a PaliGemma-based architecture:
- **Vision Encoder**: SigLIP vision tower for image understanding
- **Language Model**: Gemma 2B for processing language instructions and predicting action tokens
The model takes images, text instructions, and robot state as input, and outputs discrete FAST tokens that are decoded back to continuous actions.
## Configuration Options
| Parameter | Description | Default |
| -------------------- | ----------------------------------------------- | ---------- |
| `paligemma_variant` | VLM backbone variant (`gemma_300m`, `gemma_2b`) | `gemma_2b` |
| `max_state_dim` | Maximum state vector dimension (padded) | `32` |
| `max_action_dim` | Maximum action vector dimension (padded) | `32` |
| `temperature` | Sampling temperature (0.0 for greedy) | `0.0` |
| `max_decoding_steps` | Maximum decoding steps | `256` |
| `use_kv_cache` | Enable KV caching for faster inference | `true` |
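When building the policy in Python rather than through the CLI, these options map directly onto `PI0FastConfig` fields. A brief sketch (the values shown are the defaults from the table above):
```python
from lerobot.policies.pi0_fast import PI0FastConfig

config = PI0FastConfig(
    paligemma_variant="gemma_2b",  # or "gemma_300m"
    max_state_dim=32,
    max_action_dim=32,
    temperature=0.0,         # 0.0 means greedy decoding
    max_decoding_steps=256,
    use_kv_cache=True,
)
```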
## Comparison with π₀
| Feature | π₀ | π₀-FAST |
| --------------------- | ------------------------- | ---------------------------- |
| Action Representation | Flow Matching (Diffusion) | Autoregressive Tokens (FAST) |
| Training Speed | 1x | **5x faster** |
| Dexterity | High | High |
| Inference Method | Iterative Denoising | Autoregressive Decoding |
| KV-Caching | N/A | Supported |
## License
This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).
## References
- [FAST: Efficient Robot Action Tokenization](https://www.physicalintelligence.company/research/fast) - Physical Intelligence Blog
- [OpenPI Repository](https://github.com/Physical-Intelligence/openpi) - Original implementation
- [FAST Tokenizer on Hugging Face](https://huggingface.co/physical-intelligence/fast) - Pre-trained tokenizer
+1 -1
@@ -127,7 +127,7 @@ wallx = [
"torchdiffeq==0.2.5",
"qwen_vl_utils==0.0.11"
]
pi = ["transformers @ git+https://github.com/huggingface/transformers.git@fix/lerobot_openpi"]
pi = ["transformers @ git+https://github.com/huggingface/transformers.git@fix/lerobot_openpi", "scipy>=1.10.1,<1.15"]
smolvla = ["lerobot[transformers-dep]", "num2words>=0.5.14,<0.6.0", "accelerate>=1.7.0,<2.0.0", "safetensors>=0.4.3,<1.0.0"]
multi_task_dit = ["lerobot[transformers-dep]"]
groot = [
+2
@@ -17,6 +17,7 @@ from .diffusion.configuration_diffusion import DiffusionConfig as DiffusionConfi
from .groot.configuration_groot import GrootConfig as GrootConfig
from .multi_task_dit.configuration_multi_task_dit import MultiTaskDiTConfig as MultiTaskDiTConfig
from .pi0.configuration_pi0 import PI0Config as PI0Config
from .pi0_fast.configuration_pi0_fast import PI0FastConfig as PI0FastConfig
from .pi05.configuration_pi05 import PI05Config as PI05Config
from .smolvla.configuration_smolvla import SmolVLAConfig as SmolVLAConfig
from .smolvla.processor_smolvla import SmolVLANewLineProcessor
@@ -31,6 +32,7 @@ __all__ = [
"MultiTaskDiTConfig",
"PI0Config",
"PI05Config",
"PI0FastConfig",
"SmolVLAConfig",
"SARMConfig",
"TDMPCConfig",
+4
@@ -95,6 +95,10 @@ def get_policy_class(name: str) -> type[PreTrainedPolicy]:
from lerobot.policies.pi0.modeling_pi0 import PI0Policy
return PI0Policy
elif name == "pi0_fast":
from lerobot.policies.pi0_fast.modeling_pi0_fast import PI0FastPolicy
return PI0FastPolicy
elif name == "pi05":
from lerobot.policies.pi05.modeling_pi05 import PI05Policy
+21
@@ -0,0 +1,21 @@
#!/usr/bin/env python
# Copyright 2025 Physical Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .configuration_pi0_fast import PI0FastConfig
from .modeling_pi0_fast import PI0FastPolicy
from .processor_pi0_fast import make_pi0_fast_pre_post_processors
__all__ = ["PI0FastConfig", "PI0FastPolicy", "make_pi0_fast_pre_post_processors"]
@@ -0,0 +1,161 @@
#!/usr/bin/env python
# Copyright 2025 Physical Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from lerobot.configs.policies import PreTrainedConfig
from lerobot.configs.types import FeatureType, NormalizationMode, PolicyFeature
from lerobot.optim.optimizers import AdamWConfig
from lerobot.optim.schedulers import CosineDecayWithWarmupSchedulerConfig
from lerobot.policies.rtc.configuration_rtc import RTCConfig
DEFAULT_IMAGE_SIZE = 224
@PreTrainedConfig.register_subclass("pi0_fast")
@dataclass
class PI0FastConfig(PreTrainedConfig):
paligemma_variant: str = "gemma_2b"
action_expert_variant: str = "gemma_300m"
dtype: str = "float32" # Options: "bfloat16", "float32"
chunk_size: int = 50 # Number of action steps to predict, in openpi called "action_horizon"
n_action_steps: int = 50 # Number of action steps to execute
# Shorter state and action vectors will be padded to these dimensions
max_state_dim: int = 32
max_action_dim: int = 32
max_action_tokens: int = 256
# Real-Time Chunking (RTC) configuration
rtc_config: RTCConfig | None = None
image_resolution: tuple[int, int] = (
DEFAULT_IMAGE_SIZE,
DEFAULT_IMAGE_SIZE,
) # see openpi `preprocessing_pytorch.py`
# Add empty images. Used to add empty cameras when no image features are present.
empty_cameras: int = 0
tokenizer_max_length: int = 200 # see openpi `__post_init__`
text_tokenizer_name: str = "google/paligemma-3b-pt-224"
action_tokenizer_name: str = "physical-intelligence/fast"
temperature: float = 0.0
max_decoding_steps: int = 256
fast_skip_tokens: int = 128
# Whether to validate that decoded action tokens start with "Action: " prefix
validate_action_token_prefix: bool = True
# Whether to use KV cache for faster autoregressive decoding
use_kv_cache: bool = True
normalization_mapping: dict[str, NormalizationMode] = field(
default_factory=lambda: {
"VISUAL": NormalizationMode.IDENTITY,
"STATE": NormalizationMode.MEAN_STD, # Pi0Fast uses quantiles for state
"ACTION": NormalizationMode.MEAN_STD, # Pi0Fast uses quantiles for action
}
)
# Training settings
gradient_checkpointing: bool = False # Enable gradient checkpointing for memory optimization
compile_model: bool = False # Whether to use torch.compile for model optimization
compile_mode: str = "max-autotune" # Torch compile mode
device: str | None = None # Device to use for the model (None = auto-detect)
# Optimizer settings: see openpi `AdamW`
optimizer_lr: float = 2.5e-5 # see openpi `CosineDecaySchedule: peak_lr`
optimizer_betas: tuple[float, float] = (0.9, 0.95)
optimizer_eps: float = 1e-8
optimizer_weight_decay: float = 0.01
optimizer_grad_clip_norm: float = 1.0
# Scheduler settings: see openpi `CosineDecaySchedule`
# Note: These will auto-scale if --steps < scheduler_decay_steps
# For example, --steps=3000 will scale warmup to 100 and decay to 3000
scheduler_warmup_steps: int = 1_000
scheduler_decay_steps: int = 30_000
scheduler_decay_lr: float = 2.5e-6
def __post_init__(self):
super().__post_init__()
# Validate configuration
if self.n_action_steps > self.chunk_size:
raise ValueError(
f"n_action_steps ({self.n_action_steps}) cannot be greater than chunk_size ({self.chunk_size})"
)
if self.paligemma_variant not in ["gemma_300m", "gemma_2b"]:
raise ValueError(f"Invalid paligemma_variant: {self.paligemma_variant}")
if self.dtype not in ["bfloat16", "float32"]:
raise ValueError(f"Invalid dtype: {self.dtype}")
def validate_features(self) -> None:
"""Validate and set up input/output features."""
for i in range(self.empty_cameras):
key = f"observation.images.empty_camera_{i}"
empty_camera = PolicyFeature(
type=FeatureType.VISUAL,
shape=(3, *self.image_resolution), # Use configured image resolution
)
self.input_features[key] = empty_camera
if "observation.state" not in self.input_features:
state_feature = PolicyFeature(
type=FeatureType.STATE,
shape=(self.max_state_dim,), # Padded to max_state_dim
)
self.input_features["observation.state"] = state_feature
if "action" not in self.output_features:
action_feature = PolicyFeature(
type=FeatureType.ACTION,
shape=(self.max_action_dim,), # Padded to max_action_dim
)
self.output_features["action"] = action_feature
def get_optimizer_preset(self) -> AdamWConfig:
return AdamWConfig(
lr=self.optimizer_lr,
betas=self.optimizer_betas,
eps=self.optimizer_eps,
weight_decay=self.optimizer_weight_decay,
grad_clip_norm=self.optimizer_grad_clip_norm,
)
def get_scheduler_preset(self):
return CosineDecayWithWarmupSchedulerConfig(
peak_lr=self.optimizer_lr,
decay_lr=self.scheduler_decay_lr,
num_warmup_steps=self.scheduler_warmup_steps,
num_decay_steps=self.scheduler_decay_steps,
)
@property
def observation_delta_indices(self) -> None:
return None
@property
def action_delta_indices(self) -> list:
return list(range(self.chunk_size))
@property
def reward_delta_indices(self) -> None:
return None
File diff suppressed because it is too large
@@ -0,0 +1,177 @@
#!/usr/bin/env python
# Copyright 2025 Physical Intelligence and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from copy import deepcopy
from dataclasses import dataclass
from typing import Any
import numpy as np
import torch
from lerobot.configs.types import PipelineFeatureType, PolicyFeature
from lerobot.policies.pi0_fast.configuration_pi0_fast import PI0FastConfig
from lerobot.policies.pi0_fast.modeling_pi0_fast import pad_vector
from lerobot.processor import (
ActionTokenizerProcessorStep,
AddBatchDimensionProcessorStep,
DeviceProcessorStep,
NormalizerProcessorStep,
PolicyAction,
PolicyProcessorPipeline,
ProcessorStep,
ProcessorStepRegistry,
RenameObservationsProcessorStep,
TokenizerProcessorStep,
UnnormalizerProcessorStep,
)
from lerobot.processor.converters import policy_action_to_transition, transition_to_policy_action
from lerobot.processor.core import EnvTransition, TransitionKey
from lerobot.utils.constants import (
OBS_STATE,
POLICY_POSTPROCESSOR_DEFAULT_NAME,
POLICY_PREPROCESSOR_DEFAULT_NAME,
)
@ProcessorStepRegistry.register(name="pi0_fast_prepare_state_tokenizer_processor_step")
@dataclass
class Pi0FastPrepareStateAndLanguageTokenizerProcessorStep(ProcessorStep):
"""
Processor step to prepare the state and tokenize the language input.
"""
max_state_dim: int = 32
task_key: str = "task"
def __call__(self, transition: EnvTransition) -> EnvTransition:
transition = transition.copy()
state = transition.get(TransitionKey.OBSERVATION, {}).get(OBS_STATE)
if state is None:
raise ValueError("State is required for PI0Fast")
tasks = transition.get(TransitionKey.COMPLEMENTARY_DATA, {}).get(self.task_key)
if tasks is None:
raise ValueError("No task found in complementary data")
# TODO: check if this is necessary
state = deepcopy(state)
# Prepare state (pad to max_state_dim)
state = pad_vector(state, self.max_state_dim)
# State should already be normalized to [-1, 1] by the NormalizerProcessorStep that runs before this step
# Discretize into 256 bins (see openpi `PaligemmaTokenizer.tokenize()`)
state_np = state.cpu().numpy()
discretized_states = np.digitize(state_np, bins=np.linspace(-1, 1, 256 + 1)[:-1]) - 1
full_prompts = []
for i, task in enumerate(tasks):
cleaned_text = task.strip().replace("_", " ").replace("\n", " ")
state_str = " ".join(map(str, discretized_states[i]))
full_prompt = f"Task: {cleaned_text}, State: {state_str};\n"
full_prompts.append(full_prompt)
transition[TransitionKey.COMPLEMENTARY_DATA][self.task_key] = full_prompts
return transition
def transform_features(
self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
"""
This step does not alter the feature definitions.
"""
return features
def make_pi0_fast_pre_post_processors(
config: PI0FastConfig,
dataset_stats: dict[str, dict[str, torch.Tensor]] | None = None,
) -> tuple[
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
"""
Constructs pre-processor and post-processor pipelines for the PI0Fast policy.
The pre-processing pipeline prepares input data for the model by:
1. Renaming features to match pretrained configurations.
2. Adding a batch dimension.
3. Normalizing input and output features based on dataset statistics.
4. Padding the state, discretizing it into 256 bins, and building the "Task: ..., State: ..." prompt.
5. Tokenizing the text prompt with the PaliGemma tokenizer.
6. Tokenizing the action chunk with the FAST action tokenizer (skipped at inference when no action is present).
7. Moving all data to the specified device.
The post-processing pipeline handles the model's output by:
1. Unnormalizing the output features to their original scale.
2. Moving data to the CPU.
Args:
config: The configuration object for the PI0Fast policy.
dataset_stats: A dictionary of statistics for normalization.
Returns:
A tuple containing the configured pre-processor and post-processor pipelines.
"""
# Add remaining processors
input_steps: list[ProcessorStep] = [
RenameObservationsProcessorStep(rename_map={}), # To mimic the same processor as pretrained one
AddBatchDimensionProcessorStep(),
# NOTE: NormalizerProcessorStep MUST come before Pi0FastPrepareStateAndLanguageTokenizerProcessorStep
# because the tokenizer step expects normalized state in [-1, 1] range for discretization
NormalizerProcessorStep(
features={**config.input_features, **config.output_features},
norm_map=config.normalization_mapping,
stats=dataset_stats,
),
Pi0FastPrepareStateAndLanguageTokenizerProcessorStep(max_state_dim=config.max_state_dim),
TokenizerProcessorStep(
tokenizer_name=config.text_tokenizer_name,
max_length=config.tokenizer_max_length,
padding_side="right",
padding="max_length",
),
ActionTokenizerProcessorStep(
action_tokenizer_name=config.action_tokenizer_name,
max_action_tokens=config.max_action_tokens,
fast_skip_tokens=config.fast_skip_tokens,
paligemma_tokenizer_name=config.text_tokenizer_name,
),
DeviceProcessorStep(device=config.device),
]
output_steps: list[ProcessorStep] = [
UnnormalizerProcessorStep(
features=config.output_features, norm_map=config.normalization_mapping, stats=dataset_stats
),
DeviceProcessorStep(device="cpu"),
]
return (
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
steps=input_steps,
name=POLICY_PREPROCESSOR_DEFAULT_NAME,
),
PolicyProcessorPipeline[PolicyAction, PolicyAction](
steps=output_steps,
name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
to_transition=policy_action_to_transition,
to_output=transition_to_policy_action,
),
)
@@ -0,0 +1,539 @@
"""Train FAST tokenizer for action encoding.
This script:
1. Loads action chunks from LeRobotDataset (with sampling)
2. Applies delta transforms and per-timestamp normalization
3. Trains FAST tokenizer on specified action dimensions
4. Saves tokenizer to assets directory
5. Reports compression statistics
"""
import json
from pathlib import Path
import numpy as np
import torch
import tyro
from huggingface_hub import HfApi
from transformers import AutoProcessor
from lerobot.configs.types import NormalizationMode
from lerobot.datasets.lerobot_dataset import LeRobotDataset
def apply_delta_transform(state: np.ndarray, actions: np.ndarray, delta_dims: list[int] | None) -> np.ndarray:
"""Apply delta transform to specified dimensions.
Args:
state: Current state [D]
actions: Future actions [D]
delta_dims: List of dimension indices to apply delta transform to
Returns:
Transformed actions [D]
"""
if delta_dims is None or len(delta_dims) == 0:
return actions
delta_actions = actions.copy()
for dim in delta_dims:
delta_actions[dim] = actions[dim] - state[dim]
return delta_actions
def apply_normalization(
data: np.ndarray,
stats: dict[str, np.ndarray],
mode: NormalizationMode,
eps: float = 1e-8,
) -> np.ndarray:
"""Apply normalization to data based on the specified mode.
Args:
data: Data to normalize [N, H, D] or [D]
stats: Dictionary of statistics (mean, std, min, max, q01, q99, q10, q90)
mode: Normalization mode to apply
eps: Small epsilon for numerical stability
Returns:
Normalized data with the same shape as input
"""
if mode == NormalizationMode.IDENTITY:
return data
if mode == NormalizationMode.MEAN_STD:
mean = stats.get("mean")
std = stats.get("std")
if mean is None or std is None:
raise ValueError("MEAN_STD mode requires 'mean' and 'std' in stats")
return (data - mean) / np.maximum(std, eps)
if mode == NormalizationMode.MIN_MAX:
min_val = stats.get("min")
max_val = stats.get("max")
if min_val is None or max_val is None:
raise ValueError("MIN_MAX mode requires 'min' and 'max' in stats")
denom = np.maximum(max_val - min_val, eps)
return 2.0 * (data - min_val) / denom - 1.0
if mode == NormalizationMode.QUANTILES:
q01 = stats.get("q01")
q99 = stats.get("q99")
if q01 is None or q99 is None:
raise ValueError("QUANTILES mode requires 'q01' and 'q99' in stats")
denom = np.maximum(q99 - q01, eps)
# Clip to quantile range then normalize to [-1, 1]
clipped = np.clip(data, q01, q99)
return 2.0 * (clipped - q01) / denom - 1.0
if mode == NormalizationMode.QUANTILE10:
q10 = stats.get("q10")
q90 = stats.get("q90")
if q10 is None or q90 is None:
raise ValueError("QUANTILE10 mode requires 'q10' and 'q90' in stats")
denom = np.maximum(q90 - q10, eps)
# Clip to quantile range then normalize to [-1, 1]
clipped = np.clip(data, q10, q90)
return 2.0 * (clipped - q10) / denom - 1.0
raise ValueError(f"Unsupported normalization mode: {mode}")
def process_episode(args):
"""Process single episode and return action chunks."""
dataset, ep_idx, action_horizon, delta_dims, sample_fraction, state_key, use_delta_transform = args
try:
# get episode info
ep_info = dataset.meta.episodes[ep_idx]
from_idx = ep_info["dataset_from_index"]
to_idx = ep_info["dataset_to_index"]
ep_length = to_idx - from_idx
if ep_length < action_horizon:
return None
# load all frames in episode
# if dataset has episode filtering, we need to use the mapping
states = []
actions = []
for abs_idx in range(from_idx, to_idx):
# map absolute index to relative index if needed
if dataset._absolute_to_relative_idx is not None:
if abs_idx not in dataset._absolute_to_relative_idx:
# this episode's frames aren't in the filtered dataset
return None
rel_idx = dataset._absolute_to_relative_idx[abs_idx]
else:
rel_idx = abs_idx
frame = dataset.hf_dataset[rel_idx]
# get state (could be from observation.state or other state key)
if state_key in frame:
state = (
frame[state_key].numpy()
if torch.is_tensor(frame[state_key])
else np.array(frame[state_key])
)
else:
# if no state key, use zeros (no delta transform)
state = np.zeros_like(
frame["action"].numpy() if torch.is_tensor(frame["action"]) else np.array(frame["action"])
)
action = (
frame["action"].numpy() if torch.is_tensor(frame["action"]) else np.array(frame["action"])
)
states.append(state)
actions.append(action)
states = np.array(states)
actions = np.array(actions)
# create action chunks (sliding window)
# all actions in a chunk are relative to the FIRST state in that chunk
action_chunks = []
for i in range(len(states) - action_horizon + 1):
current_state = states[i] # First state in chunk
future_absolute_actions = actions[i : i + action_horizon]
if use_delta_transform:
# relative actions
delta_chunk = np.zeros_like(future_absolute_actions)
for t in range(action_horizon):
delta_chunk[t] = apply_delta_transform(
current_state,
future_absolute_actions[t],
delta_dims,
)
action_chunks.append(delta_chunk)
else:
# absolute actions (no delta)
action_chunks.append(future_absolute_actions)
if len(action_chunks) == 0:
return None
action_chunks = np.array(action_chunks)
# sample chunks
if sample_fraction < 1.0:
n_chunks = len(action_chunks)
n_samples = max(1, int(n_chunks * sample_fraction))
episode_seed = hash(ep_idx) % (2**31)
rng = np.random.RandomState(episode_seed)
indices = rng.choice(n_chunks, size=n_samples, replace=False)
action_chunks = action_chunks[indices]
return action_chunks
except Exception as e:
print(f"Error processing episode {ep_idx}: {e}")
import traceback
traceback.print_exc()
return None
def train_fast_tokenizer(
action_chunks: np.ndarray,
vocab_size: int = 1024,
scale: float = 10.0,
) -> AutoProcessor:
"""
Train FAST tokenizer (BPE on DCT coefficients) on action chunks.
Uses the .fit() method to train a new tokenizer on the provided data.
Args:
action_chunks: Array of action chunks [N, H, D] where N=num_chunks, H=horizon, D=action_dim
vocab_size: BPE vocabulary size
scale: DCT scaling factor for quantization
Returns:
Trained FAST tokenizer
"""
print(f"Training FAST tokenizer on {len(action_chunks)} action chunks...")
print(f"Action chunk shape: {action_chunks.shape}")
print(f"Vocab size: {vocab_size}")
print(f"DCT scale: {scale}")
# download the tokenizer source code (not pretrained weights)
# we'll train a new tokenizer on our own data
base_tokenizer = AutoProcessor.from_pretrained("physical-intelligence/fast", trust_remote_code=True)
# convert action_chunks array to list of arrays (expected by .fit())
action_data_list = [action_chunks[i] for i in range(len(action_chunks))]
# train the new tokenizer on our action data using .fit()
# this trains the BPE tokenizer on DCT coefficients
print("Training new tokenizer (this may take a few minutes)...")
tokenizer = base_tokenizer.fit(
action_data_list,
scale=scale,
vocab_size=vocab_size,
time_horizon=action_chunks.shape[1], # action_horizon
action_dim=action_chunks.shape[2], # encoded dimensions
)
print("✓ Tokenizer training complete!")
# validate it works
sample_chunk = action_chunks[0]
encoded = tokenizer(sample_chunk[None])[0]
if isinstance(encoded, list):
encoded = np.array(encoded)
print(f"Sample encoding: {len(encoded)} tokens for chunk shape {sample_chunk.shape}")
return tokenizer
def compute_compression_stats(tokenizer, action_chunks: np.ndarray):
"""Compute compression statistics."""
print("\nComputing compression statistics...")
# sample for stats (use max 1000 chunks for speed)
sample_size = min(1000, len(action_chunks))
sample_indices = np.random.RandomState(42).choice(len(action_chunks), size=sample_size, replace=False)
sample_chunks = action_chunks[sample_indices]
token_lengths = []
for chunk in sample_chunks:
encoded = tokenizer(chunk[None])[0]
if isinstance(encoded, list):
token_lengths.append(len(encoded))
else:
token_lengths.append(encoded.shape[0] if hasattr(encoded, "shape") else len(encoded))
token_lengths = np.array(token_lengths)
# compression ratio: (H * D) / avg_tokens
input_size = action_chunks.shape[1] * action_chunks.shape[2]
avg_tokens = np.mean(token_lengths)
compression_ratio = input_size / avg_tokens
stats = {
"compression_ratio": float(compression_ratio),
"mean_token_length": float(np.mean(token_lengths)),
"p99_token_length": float(np.percentile(token_lengths, 99)),
"min_token_length": float(np.min(token_lengths)),
"max_token_length": float(np.max(token_lengths)),
}
print("Compression Statistics:")
print(f" Average compression ratio: {stats['compression_ratio']:.2f}x")
print(f" Mean token length: {stats['mean_token_length']:.1f}")
print(f" P99 token length: {stats['p99_token_length']:.0f}")
print(f" Min token length: {stats['min_token_length']:.0f}")
print(f" Max token length: {stats['max_token_length']:.0f}")
return stats
def main(
repo_id: str,
root: str | None = None,
action_horizon: int = 10,
max_episodes: int | None = None,
sample_fraction: float = 0.1,
encoded_dims: str = "0:6,7:23",
delta_dims: str | None = None,
use_delta_transform: bool = False,
state_key: str = "observation.state",
normalization_mode: str = "QUANTILES",
vocab_size: int = 1024,
scale: float = 10.0,
output_dir: str | None = None,
push_to_hub: bool = False,
hub_repo_id: str | None = None,
hub_private: bool = False,
):
"""
Train FAST tokenizer for action encoding.
Args:
repo_id: LeRobot dataset repository ID
root: Root directory for dataset (default: ~/.cache/huggingface/lerobot)
action_horizon: Number of future actions in each chunk
max_episodes: Max episodes to use (None = all episodes in dataset)
sample_fraction: Fraction of chunks to sample per episode
encoded_dims: Comma-separated dimension ranges to encode (e.g., "0:6,7:23")
delta_dims: Comma-separated dimension indices for delta transform (e.g., "0,1,2,3,4,5")
use_delta_transform: Whether to apply delta transform (relative actions vs absolute actions)
state_key: Dataset key for state observations (default: "observation.state")
normalization_mode: Normalization mode (MEAN_STD, MIN_MAX, QUANTILES, QUANTILE10, IDENTITY)
vocab_size: FAST vocabulary size (BPE vocab size)
scale: DCT scaling factor (default: 10.0)
output_dir: Directory to save tokenizer (default: ./fast_tokenizer_{repo_id})
push_to_hub: Whether to push the tokenizer to Hugging Face Hub
hub_repo_id: Hub repository ID (e.g., "username/tokenizer-name"). If None, uses output_dir name
hub_private: Whether to create a private repository on the Hub
"""
# load dataset
print(f"Loading dataset: {repo_id}")
dataset = LeRobotDataset(repo_id=repo_id, root=root)
print(f"Dataset loaded: {dataset.num_episodes} episodes, {dataset.num_frames} frames")
# parse normalization mode
try:
norm_mode = NormalizationMode(normalization_mode)
except ValueError as err:
raise ValueError(
f"Invalid normalization_mode: {normalization_mode}. "
f"Must be one of: {', '.join([m.value for m in NormalizationMode])}"
) from err
print(f"Normalization mode: {norm_mode.value}")
# parse encoded dimensions
encoded_dim_ranges = []
for range_str in encoded_dims.split(","):
start, end = map(int, range_str.strip().split(":"))
encoded_dim_ranges.append((start, end))
total_encoded_dims = sum(end - start for start, end in encoded_dim_ranges)
print(f"Encoding {total_encoded_dims} dimensions: {encoded_dims}")
# parse delta dimensions
delta_dim_list = None
if delta_dims is not None and delta_dims.strip():
delta_dim_list = [int(d.strip()) for d in delta_dims.split(",")]
print(f"Delta dimensions: {delta_dim_list}")
else:
print("No delta dimensions specified")
print(f"Use delta transform: {use_delta_transform}")
if use_delta_transform and (delta_dim_list is None or len(delta_dim_list) == 0):
print("Warning: use_delta_transform=True but no delta_dims specified. No delta will be applied.")
print(f"Action horizon: {action_horizon}")
print(f"State key: {state_key}")
# determine episodes to process
num_episodes = dataset.num_episodes
if max_episodes is not None:
num_episodes = min(max_episodes, num_episodes)
print(f"Processing {num_episodes} episodes...")
# process episodes sequentially (to avoid pickling issues with dataset)
all_chunks = []
for ep_idx in range(num_episodes):
if ep_idx % 10 == 0:
print(f" Processing episode {ep_idx}/{num_episodes}...")
chunks = process_episode(
(dataset, ep_idx, action_horizon, delta_dim_list, sample_fraction, state_key, use_delta_transform)
)
if chunks is not None:
all_chunks.append(chunks)
# concatenate all chunks
all_chunks = np.concatenate(all_chunks, axis=0)
print(f"Collected {len(all_chunks)} action chunks")
# extract only encoded dimensions FIRST (before normalization)
encoded_chunks = []
for start, end in encoded_dim_ranges:
encoded_chunks.append(all_chunks[:, :, start:end])
encoded_chunks = np.concatenate(encoded_chunks, axis=-1) # [N, H, D_encoded]
print(f"Extracted {encoded_chunks.shape[-1]} encoded dimensions")
# apply normalization to encoded dimensions
print("\nBefore normalization - overall stats:")
print(f" Min: {np.min(encoded_chunks):.4f}, Max: {np.max(encoded_chunks):.4f}")
print(f" Mean: {np.mean(encoded_chunks):.4f}, Std: {np.std(encoded_chunks):.4f}")
# get normalization stats from dataset
norm_stats = dataset.meta.stats
if norm_stats is not None and "action" in norm_stats:
action_stats = norm_stats["action"]
# build encoded dimension indices
encoded_dim_indices = []
for start, end in encoded_dim_ranges:
encoded_dim_indices.extend(range(start, end))
encoded_dim_indices = np.array(encoded_dim_indices)
# extract stats for encoded dimensions only
encoded_stats = {}
for stat_name, stat_values in action_stats.items():
if isinstance(stat_values, (list, np.ndarray)):
stat_array = np.array(stat_values)
if len(stat_array) > max(encoded_dim_indices):
encoded_stats[stat_name] = stat_array[encoded_dim_indices]
if encoded_stats:
print(f"\nNormalization stats for encoded dimensions (mode: {norm_mode.value}):")
for stat_name, stat_values in encoded_stats.items():
print(
f" {stat_name}: shape={stat_values.shape}, "
f"range=[{np.min(stat_values):.4f}, {np.max(stat_values):.4f}]"
)
# apply normalization based on mode
try:
encoded_chunks = apply_normalization(encoded_chunks, encoded_stats, norm_mode, eps=1e-8)
print(f"\nApplied {norm_mode.value} normalization")
except ValueError as e:
print(f"Warning: {e}. Using raw actions without normalization.")
print("\nAfter normalization - overall stats:")
print(f" Min: {np.min(encoded_chunks):.4f}, Max: {np.max(encoded_chunks):.4f}")
print(f" Mean: {np.mean(encoded_chunks):.4f}, Std: {np.std(encoded_chunks):.4f}")
print("\nPer-dimension stats (after normalization):")
for d in range(encoded_chunks.shape[-1]):
dim_data = encoded_chunks[:, :, d]
print(
f" Dim {d}: min={np.min(dim_data):7.4f}, max={np.max(dim_data):7.4f}, "
f"mean={np.mean(dim_data):7.4f}, std={np.std(dim_data):7.4f}"
)
else:
print("Warning: Could not extract stats for encoded dimensions, using raw actions")
else:
print("Warning: No normalization stats found in dataset, using raw actions")
print(f"Encoded chunks shape: {encoded_chunks.shape}")
# train FAST tokenizer
tokenizer = train_fast_tokenizer(
encoded_chunks,
vocab_size=vocab_size,
scale=scale,
)
# compute compression statistics
compression_stats = compute_compression_stats(tokenizer, encoded_chunks)
# save tokenizer
if output_dir is None:
output_dir = f"fast_tokenizer_{repo_id.replace('/', '_')}"
output_path = Path(output_dir)
output_path.mkdir(parents=True, exist_ok=True)
tokenizer.save_pretrained(output_path)
# save metadata
metadata = {
"repo_id": repo_id,
"vocab_size": vocab_size,
"scale": scale,
"encoded_dims": encoded_dims,
"encoded_dim_ranges": encoded_dim_ranges,
"total_encoded_dims": total_encoded_dims,
"delta_dims": delta_dims,
"delta_dim_list": delta_dim_list,
"use_delta_transform": use_delta_transform,
"state_key": state_key,
"normalization_mode": norm_mode.value,
"action_horizon": action_horizon,
"num_training_chunks": len(encoded_chunks),
"compression_stats": compression_stats,
}
with open(output_path / "metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
print(f"\nSaved FAST tokenizer to {output_path}")
print(f"Metadata: {json.dumps(metadata, indent=2)}")
# push to Hugging Face Hub if requested
if push_to_hub:
# determine the hub repository ID
if hub_repo_id is None:
hub_repo_id = output_path.name
print(f"\nNo hub_repo_id provided, using: {hub_repo_id}")
print(f"\nPushing tokenizer to Hugging Face Hub: {hub_repo_id}")
print(f" Private: {hub_private}")
try:
# use the tokenizer's push_to_hub method
tokenizer.push_to_hub(
repo_id=hub_repo_id,
private=hub_private,
commit_message=f"Upload FAST tokenizer trained on {repo_id}",
)
# also upload the metadata.json file separately
api = HfApi()
api.upload_file(
path_or_fileobj=str(output_path / "metadata.json"),
path_in_repo="metadata.json",
repo_id=hub_repo_id,
repo_type="model",
commit_message="Upload tokenizer metadata",
)
print(f"Successfully pushed tokenizer to: https://huggingface.co/{hub_repo_id}")
except Exception as e:
print(f"Error pushing to hub: {e}")
print(" Make sure you're logged in with `huggingface-cli login`")
if __name__ == "__main__":
tyro.cli(main)
+2 -1
@@ -75,7 +75,7 @@ from .policy_robot_bridge import (
RobotActionToPolicyActionProcessorStep,
)
from .rename_processor import RenameObservationsProcessorStep
from .tokenizer_processor import TokenizerProcessorStep
from .tokenizer_processor import ActionTokenizerProcessorStep, TokenizerProcessorStep
__all__ = [
"ActionProcessorStep",
@@ -122,6 +122,7 @@ __all__ = [
"AddBatchDimensionProcessorStep",
"RobotProcessorPipeline",
"TokenizerProcessorStep",
"ActionTokenizerProcessorStep",
"Torch2NumpyActionProcessorStep",
"RobotActionToPolicyActionProcessorStep",
"PolicyActionToRobotActionProcessorStep",
+263 -3
@@ -23,22 +23,29 @@ token IDs and attention masks, which are then added to the observation dictionar
from __future__ import annotations
import logging
from dataclasses import dataclass, field
from typing import TYPE_CHECKING, Any
import torch
from lerobot.configs.types import FeatureType, PipelineFeatureType, PolicyFeature
from lerobot.utils.constants import OBS_LANGUAGE_ATTENTION_MASK, OBS_LANGUAGE_TOKENS
from lerobot.utils.constants import (
ACTION_TOKEN_MASK,
ACTION_TOKENS,
OBS_LANGUAGE_ATTENTION_MASK,
OBS_LANGUAGE_TOKENS,
)
from lerobot.utils.import_utils import _transformers_available
from .core import EnvTransition, TransitionKey
from .pipeline import ObservationProcessorStep, ProcessorStepRegistry
from .pipeline import ActionProcessorStep, ObservationProcessorStep, ProcessorStepRegistry
# Conditional import for type checking and lazy loading
if TYPE_CHECKING or _transformers_available:
from transformers import AutoTokenizer
from transformers import AutoProcessor, AutoTokenizer
else:
AutoProcessor = None
AutoTokenizer = None
@@ -268,3 +275,256 @@ class TokenizerProcessorStep(ObservationProcessorStep):
)
return features
@dataclass
@ProcessorStepRegistry.register(name="action_tokenizer_processor")
class ActionTokenizerProcessorStep(ActionProcessorStep):
"""
Processor step to tokenize action data using a fast action tokenizer.
This step takes action tensors from an `EnvTransition`, tokenizes them using
a Hugging Face `transformers` AutoProcessor (such as the Physical Intelligence "fast" tokenizer),
and returns the tokenized action.
Requires the `transformers` library to be installed.
Attributes:
action_tokenizer_name: The name of a pretrained action processor on the Hugging Face Hub (e.g., "physical-intelligence/fast").
action_tokenizer_input_object: A pre-initialized processor/tokenizer object. If provided, `action_tokenizer_name` is ignored.
trust_remote_code: Whether to trust remote code when loading the tokenizer (required for some tokenizers).
max_action_tokens: Maximum number of action tokens per chunk; longer sequences are truncated and shorter ones padded.
fast_skip_tokens: Offset used when mapping FAST token IDs into the PaliGemma vocabulary.
paligemma_tokenizer_name: The name of a pretrained PaliGemma tokenizer from the Hugging Face Hub (e.g., "google/paligemma-3b-pt-224").
action_tokenizer: The internal tokenizer/processor instance, loaded during initialization.
action_tokenizer_name: str | None = None
action_tokenizer_input_object: Any | None = None
trust_remote_code: bool = True
max_action_tokens: int = 256
fast_skip_tokens: int = 128
paligemma_tokenizer_name: str = "google/paligemma-3b-pt-224"
# Internal tokenizer instance (not part of the config)
action_tokenizer: Any = field(default=None, init=False, repr=False)
_paligemma_tokenizer: Any = field(default=None, init=False, repr=False)
def __post_init__(self):
"""
Initializes the action tokenizer after the dataclass is created.
It checks for the availability of the `transformers` library and loads the tokenizer
either from a provided object or by name from the Hugging Face Hub.
Raises:
ImportError: If the `transformers` library is not installed.
ValueError: If neither `action_tokenizer_input_object` nor `action_tokenizer_name` is provided.
"""
if not _transformers_available:
raise ImportError(
"The 'transformers' library is not installed. "
"Please install it with `pip install 'lerobot[transformers-dep]'` to use ActionTokenizerProcessorStep."
)
if self.action_tokenizer_input_object is not None:
self.action_tokenizer = self.action_tokenizer_input_object
elif self.action_tokenizer_name is not None:
if AutoProcessor is None:
raise ImportError("AutoProcessor is not available")
self.action_tokenizer = AutoProcessor.from_pretrained(
self.action_tokenizer_name, trust_remote_code=self.trust_remote_code
)
else:
raise ValueError(
"Either 'action_tokenizer' or 'action_tokenizer_name' must be provided. "
"Pass a tokenizer object directly or a tokenizer name to auto-load."
)
self._paligemma_tokenizer = AutoTokenizer.from_pretrained(
self.paligemma_tokenizer_name,
trust_remote_code=self.trust_remote_code,
add_eos_token=True,
add_bos_token=False,
)
def __call__(self, transition: EnvTransition) -> EnvTransition:
"""
Applies action tokenization to the transition.
This overrides the base class to handle both tokens and mask.
Args:
transition: The input transition with action data.
Returns:
The processed transition with tokenized actions and mask in complementary data.
"""
self._current_transition = transition.copy()
new_transition = self._current_transition
action = new_transition.get(TransitionKey.ACTION)
if action is None:
# During inference, no action is available, skip tokenization
return new_transition
# Tokenize and get both tokens and mask
tokens, mask = self._tokenize_action(action)
# Store mask in complementary data
complementary_data = new_transition.get(TransitionKey.COMPLEMENTARY_DATA, {})
if complementary_data is None:
complementary_data = {}
complementary_data[ACTION_TOKEN_MASK] = mask
complementary_data[ACTION_TOKENS] = tokens
new_transition[TransitionKey.COMPLEMENTARY_DATA] = complementary_data
return new_transition
def _act_tokens_to_paligemma_tokens(self, tokens: torch.Tensor) -> torch.Tensor:
"""
Converts action tokens to PaliGemma tokens.
"""
return self._paligemma_tokenizer.vocab_size - 1 - self.fast_skip_tokens - tokens
def _tokenize_action(self, action: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
"""
Tokenizes the action tensor and creates a mask.
Args:
action: The input action tensor to tokenize. Shape: (B, H, action_dim) or (H, action_dim,)
Returns:
A tuple of (tokens, mask) where:
- tokens: Tensor of token IDs with shape (B, max_action_tokens)
- mask: Boolean mask with shape (B, max_action_tokens), True for real tokens, False for padding
"""
if action is None:
raise ValueError("Action cannot be None")
# Get the device and dtype of the input action
device = action.device if isinstance(action, torch.Tensor) else None
# Handle single sample (add batch dimension)
single_sample = action.dim() == 1
if single_sample:
action = action.unsqueeze(0)
batch_size = action.shape[0]
# Tokenize the action batch
# The fast tokenizer expects action data and returns token IDs
tokens_list = []
masks_list = []
for i in range(batch_size):
# Tokenize single action (move to CPU first as tokenizer uses scipy which requires numpy)
action_cpu = action[i : i + 1].cpu()
tokens = self.action_tokenizer(action_cpu)
# Convert lists/arrays to a torch tensor if needed
if isinstance(tokens, list) or not isinstance(tokens, torch.Tensor):
tokens = torch.tensor(tokens, dtype=torch.long, device=action.device)
else:
# Move tokens back to the same device as input action
tokens = tokens.to(device=action.device)
# Flatten to 1D if needed
if tokens.dim() > 1:
tokens = tokens.flatten()
bos_id = self._paligemma_tokenizer.bos_token_id
# add bos
tokens = torch.cat(
[
torch.tensor([bos_id], device=action.device),
torch.tensor(
self._paligemma_tokenizer.encode("Action: ", add_special_tokens=False),
device=action.device,
),
self._act_tokens_to_paligemma_tokens(tokens),
torch.tensor(self._paligemma_tokenizer.encode("|"), device=action.device),
]
)
# Truncate or pad to max_action_tokens
if len(tokens) > self.max_action_tokens:
logging.warning(
f"Token length ({len(tokens)}) exceeds max length ({self.max_action_tokens}), truncating. "
"Consider increasing the `max_action_tokens` in your model config if this happens frequently."
)
tokens = tokens[: self.max_action_tokens]
mask = torch.ones(self.max_action_tokens, dtype=torch.bool, device=action.device)
else:
mask = torch.cat(
[
torch.ones(len(tokens), dtype=torch.bool, device=action.device),
torch.zeros(
self.max_action_tokens - len(tokens), dtype=torch.bool, device=action.device
),
]
)
# Pad tokens with zeros
tokens = torch.nn.functional.pad(tokens, (0, self.max_action_tokens - len(tokens)), value=0)
tokens_list.append(tokens)
masks_list.append(mask)
# Stack into batched tensors
tokens_batch = torch.stack(tokens_list, dim=0) # (B, max_action_tokens)
masks_batch = torch.stack(masks_list, dim=0) # (B, max_action_tokens)
# Remove batch dimension if input was single sample
if single_sample:
tokens_batch = tokens_batch.squeeze(0)
masks_batch = masks_batch.squeeze(0)
# Move to the same device as the input
if device is not None:
tokens_batch = tokens_batch.to(device)
masks_batch = masks_batch.to(device)
return tokens_batch, masks_batch
def action(self, action: torch.Tensor) -> torch.Tensor:
"""
This method is not used since we override __call__.
Required by ActionProcessorStep ABC.
"""
tokens, _ = self._tokenize_action(action)
return tokens
def get_config(self) -> dict[str, Any]:
"""
Returns the serializable configuration of the processor.
Note: The tokenizer object itself is not serialized. If the processor was initialized
with a tokenizer name, that name will be included in the config.
Returns:
A dictionary with the processor's configuration parameters.
"""
config = {
"trust_remote_code": self.trust_remote_code,
"max_action_tokens": self.max_action_tokens,
}
# Only save tokenizer_name if it was used to create the tokenizer
if self.action_tokenizer_name is not None and self.action_tokenizer_input_object is None:
config["action_tokenizer_name"] = self.action_tokenizer_name
return config
def transform_features(
self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
"""
Updates feature definitions to reflect tokenized actions.
This updates the policy features dictionary to indicate that the action
has been tokenized into a sequence of token IDs with shape (max_action_tokens,).
Args:
features: The dictionary of existing policy features.
Returns:
The updated dictionary of policy features.
"""
return features
+2
@@ -28,6 +28,8 @@ OBS_LANGUAGE_TOKENS = OBS_LANGUAGE + ".tokens"
OBS_LANGUAGE_ATTENTION_MASK = OBS_LANGUAGE + ".attention_mask"
ACTION = "action"
ACTION_TOKENS = ACTION + ".tokens"
ACTION_TOKEN_MASK = ACTION + ".token_mask"
REWARD = "next.reward"
TRUNCATED = "next.truncated"
DONE = "next.done"
+1
@@ -63,6 +63,7 @@ def is_package_available(pkg_name: str, return_version: bool = False) -> tuple[b
_transformers_available = is_package_available("transformers")
_peft_available = is_package_available("peft")
_scipy_available = is_package_available("scipy")
def make_device_from_device_class(config: ChoiceRegistry) -> Any:
@@ -0,0 +1,504 @@
#!/usr/bin/env python
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Test script to verify PI0Fast policy integration with LeRobot vs the original implementation"""
# ruff: noqa: E402
import os
import random
from copy import deepcopy
from typing import Any
import numpy as np
import pytest
import torch
pytest.importorskip("transformers")
pytest.importorskip("scipy")
pytestmark = pytest.mark.skipif(
os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
reason="This test requires accepting the model license",
)
from lerobot.policies.pi0_fast.configuration_pi0_fast import PI0FastConfig
from lerobot.policies.pi0_fast.modeling_pi0_fast import PI0FastPolicy
from lerobot.policies.pi0_fast.processor_pi0_fast import make_pi0_fast_pre_post_processors
from lerobot.processor import PolicyAction, PolicyProcessorPipeline # noqa: E402
from lerobot.utils.constants import (
ACTION_TOKEN_MASK,
ACTION_TOKENS,
OBS_IMAGES,
OBS_LANGUAGE_ATTENTION_MASK,
OBS_LANGUAGE_TOKENS,
OBS_STATE,
) # noqa: E402
from tests.utils import require_cuda # noqa: E402
# Constants
DUMMY_ACTION_DIM = 7
DUMMY_STATE_DIM = 20
IMAGE_HEIGHT = 224
IMAGE_WIDTH = 224
NUM_VIEWS = 2 # Number of camera views
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_PATH_LEROBOT = "lerobot/pi0fast-base"
# Expected action token shape: (batch_size, max_decoding_steps)
EXPECTED_ACTION_TOKENS_SHAPE = (1, 2)
# Expected first 5 action tokens (for reproducibility check)
EXPECTED_ACTION_TOKENS_FIRST_5 = torch.tensor([255657, 255362])
# Expected actions after detokenization
EXPECTED_ACTIONS_SHAPE = (1, 2, 32) # (batch_size, n_action_steps, action_dim)
EXPECTED_ACTIONS_MEAN = 0.04419417306780815
EXPECTED_ACTIONS_STD = 0.26231569051742554
EXPECTED_ACTIONS_FIRST_5 = torch.tensor([0.0000, 1.4849, 0.0000, 0.0000, 0.0000])
def set_seed_all(seed: int):
"""Set random seed for all RNG sources to ensure reproducibility."""
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Set deterministic behavior
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True, warn_only=True)
def instantiate_lerobot_pi0_fast(
from_pretrained: bool = False,
model_path: str = MODEL_PATH_LEROBOT,
) -> tuple[
Any, # Policy
PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
PolicyProcessorPipeline[PolicyAction, PolicyAction],
]:
"""Instantiate LeRobot PI0Fast policy with preprocessor and postprocessor."""
if from_pretrained:
policy = PI0FastPolicy.from_pretrained(
pretrained_name_or_path=model_path,
strict=True,
)
policy.config.validate_action_token_prefix = False
policy.config.max_action_tokens = 2
policy.config.max_decoding_steps = 2
policy.config.chunk_size = 2
policy.config.n_action_steps = 2
else:
config = PI0FastConfig(
n_action_steps=2,
max_action_dim=DUMMY_ACTION_DIM,
max_state_dim=DUMMY_STATE_DIM,
device=DEVICE,
validate_action_token_prefix=False,
max_action_tokens=2,
max_decoding_steps=2,
chunk_size=2,
)
policy = PI0FastPolicy(config)
policy.to(DEVICE)
policy.config.device = DEVICE
preprocessor, postprocessor = make_pi0_fast_pre_post_processors(
config=policy.config,
dataset_stats=None, # Pass None for dataset_stats to disable normalization
)
return policy, preprocessor, postprocessor
def create_dummy_data(device=DEVICE):
"""Create dummy data for testing both implementations."""
batch_size = 1
prompt = "Pick up the red block and place it in the bin"
# Create random RGB images in [0, 255] uint8 range (as PIL images would be)
# Then convert to [0, 1] float32 range for LeRobot
def fake_rgb(h, w):
arr = np.random.randint(0, 255, (h, w, 3), dtype=np.uint8)
t = torch.from_numpy(arr).permute(2, 0, 1) # CHW
return t
batch = {
f"{OBS_IMAGES}.base_0_rgb": torch.stack(
[fake_rgb(IMAGE_HEIGHT, IMAGE_WIDTH) for _ in range(batch_size)]
).to(device),
f"{OBS_IMAGES}.left_wrist_0_rgb": torch.stack(
[fake_rgb(IMAGE_HEIGHT, IMAGE_WIDTH) for _ in range(batch_size)]
).to(device),
f"{OBS_IMAGES}.right_wrist_0_rgb": torch.stack(
[fake_rgb(IMAGE_HEIGHT, IMAGE_WIDTH) for _ in range(batch_size)]
).to(device),
OBS_STATE: torch.randn(batch_size, DUMMY_STATE_DIM, dtype=torch.float32, device=device),
"task": [prompt for _ in range(batch_size)],
}
return batch
# Pytest fixtures
@pytest.fixture(scope="module")
def pi0_fast_components():
"""Fixture to instantiate and provide all PI0Fast components for tests."""
print(f"\nTesting with DEVICE='{DEVICE}'")
print("\n[Setup] Instantiating LeRobot PI0Fast policy...")
policy_obj, preprocessor_obj, postprocessor_obj = instantiate_lerobot_pi0_fast(from_pretrained=True)
print("Model loaded successfully")
yield policy_obj, preprocessor_obj, postprocessor_obj
@pytest.fixture(scope="module")
def policy(pi0_fast_components):
"""Fixture to provide the PI0Fast policy for tests."""
return pi0_fast_components[0]
@pytest.fixture(scope="module")
def preprocessor(pi0_fast_components):
"""Fixture to provide the PI0Fast preprocessor for tests."""
return pi0_fast_components[1]
@require_cuda
def test_pi0_fast_preprocessor_alignment(policy, preprocessor):
"""Test that LeRobot PI0Fast preprocessor produces expected outputs."""
print("\n" + "=" * 80)
print("Test: PI0Fast Preprocessor Outputs")
print("=" * 80)
set_seed_all(42)
print("\nCreating dummy data...")
batch = create_dummy_data()
print("\n[LeRobot] Preprocessing...")
lerobot_observation = preprocessor(deepcopy(batch))
print("\nVerifying preprocessor outputs:")
print("-" * 80)
# Expected keys from PI0Fast preprocessing
expected_keys = [
"observation.images.base_0_rgb",
"observation.images.left_wrist_0_rgb",
"observation.images.right_wrist_0_rgb",
"observation.state",
"observation.language_tokens",
"observation.language_attention_mask",
]
for key in expected_keys:
if key in lerobot_observation:
shape = tuple(lerobot_observation[key].shape)
print(f"\nKey: {key}")
print(f"Shape: {shape}")
print(f"Dtype: {lerobot_observation[key].dtype}")
else:
print(f"\nKey '{key}' not found in inputs!")
# Check language tokens shape
if "observation.language_tokens" in lerobot_observation:
lang_tokens = lerobot_observation["observation.language_tokens"]
print(f"\nLanguage tokens shape: {lang_tokens.shape}")
# Should have batch dimension and max_length from tokenizer
assert lang_tokens.dim() == 2, f"Expected 2D tensor, got {lang_tokens.dim()}D"
print("\nPreprocessor outputs verified!")
@require_cuda
def test_pi0_fast_action_generation(policy, preprocessor):
"""Test PI0Fast LeRobot implementation generates expected actions."""
print("\n" + "=" * 80)
print("Test: PI0Fast Action Generation Against Expected Values")
print("=" * 80)
set_seed_all(42)
print("\nCreating dummy data...")
batch = create_dummy_data()
print("\n[LeRobot] Running inference...")
lerobot_observation = preprocessor(deepcopy(batch))
# Reset seed for inference
torch.manual_seed(42)
with torch.no_grad():
lerobot_actions = policy.predict_action_chunk(lerobot_observation)
lerobot_actions = lerobot_actions.float().cpu()
print(f"LeRobot actions shape: {lerobot_actions.shape}")
print(f"LeRobot actions mean: {lerobot_actions.mean().item():.6f}")
print(f"LeRobot actions std: {lerobot_actions.std().item():.6f}")
print(f"LeRobot actions first 5: {lerobot_actions[0, 0, :5]}")
print("\nExpected values (from original PI0Fast):")
print(f"Expected actions shape: {EXPECTED_ACTIONS_SHAPE}")
print(f"Expected actions mean: {EXPECTED_ACTIONS_MEAN:.6f}")
print(f"Expected actions std: {EXPECTED_ACTIONS_STD:.6f}")
print(f"Expected actions first 5: {EXPECTED_ACTIONS_FIRST_5}")
print("\nAction Comparison:")
print("-" * 80)
# Compare shapes
actual_shape = tuple(lerobot_actions.shape)
print(f"Actual shape: {actual_shape}")
assert actual_shape == EXPECTED_ACTIONS_SHAPE, (
f"Shape mismatch: {actual_shape} vs {EXPECTED_ACTIONS_SHAPE}"
)
print(f"Shape matches: {actual_shape}")
# Compare statistics
actual_mean = lerobot_actions.mean().item()
actual_std = lerobot_actions.std().item()
print(f"\nMean: {actual_mean:.6f} (expected: {EXPECTED_ACTIONS_MEAN:.6f})")
print(f"Std: {actual_std:.6f} (expected: {EXPECTED_ACTIONS_STD:.6f})")
# Compare first 5 actions
actual_first_5 = lerobot_actions[0, 0, :5]
print("\nFirst 5 actions comparison:")
print(f" Actual: {actual_first_5}")
print(f" Expected: {EXPECTED_ACTIONS_FIRST_5}")
first_5_diff = torch.abs(actual_first_5 - EXPECTED_ACTIONS_FIRST_5)
print(f" Max diff: {first_5_diff.max().item():.6e}")
print(f" Mean diff: {first_5_diff.mean().item():.6e}")
# Check with different tolerances
tolerances = [1e-5, 1e-4, 1e-3, 1e-2]
for tol in tolerances:
is_close = torch.allclose(actual_first_5, EXPECTED_ACTIONS_FIRST_5, atol=tol)
status = "Success" if is_close else "Failure"
print(f"{status}: First 5 actions close (atol={tol}): {is_close}")
# Assert with reasonable tolerance
tolerance = 1e-3
assert torch.allclose(actual_first_5, EXPECTED_ACTIONS_FIRST_5, atol=tolerance), (
f"First 5 actions differ by more than tolerance ({tolerance})"
)
print(f"\nSuccess: Actions match expected values within tolerance ({tolerance})!")
print("\nAction generation test completed (values printed for reference)!")
@require_cuda
def test_pi0_fast_inference_reproducibility(policy, preprocessor):
"""Test that PI0Fast inference is reproducible with the same seed."""
print("\n" + "=" * 80)
print("Test: PI0Fast Inference Reproducibility")
print("=" * 80)
print("\nCreating dummy data...")
batch = create_dummy_data()
# First inference
print("\n[Run 1] Running inference...")
set_seed_all(42)
lerobot_observation = preprocessor(deepcopy(batch))
with torch.no_grad():
actions_1 = policy.predict_action_chunk(lerobot_observation)
actions_1 = actions_1.float().cpu()
# Second inference with same seed
print("\n[Run 2] Running inference with same seed...")
set_seed_all(42)
lerobot_observation = preprocessor(deepcopy(batch))
with torch.no_grad():
actions_2 = policy.predict_action_chunk(lerobot_observation)
actions_2 = actions_2.float().cpu()
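    # With identical seeds and inputs, the two runs should produce (near) identical action chunks.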
print("\nComparing two runs:")
print("-" * 80)
if torch.allclose(actions_1, actions_2, atol=1e-8):
print("Inference is perfectly reproducible!")
else:
diff = torch.abs(actions_1 - actions_2)
print("Small differences detected:")
print(f" Max diff: {diff.max().item():.6e}")
print(f" Mean diff: {diff.mean().item():.6e}")
assert torch.allclose(actions_1, actions_2, atol=1e-6), "Inference should be reproducible!"
print("\nInference is reproducible!")
@require_cuda
def test_pi0_fast_forward_pass_logits(policy, preprocessor):
"""Test PI0Fast forward pass and compare logits against expected values."""
print("\n" + "=" * 80)
print("Test: PI0Fast Forward Pass Logits")
print("=" * 80)
set_seed_all(42)
print("\nCreating dummy data with action tokens...")
batch = create_dummy_data()
# Preprocess the batch
lerobot_observation = preprocessor(deepcopy(batch))
# For forward pass, we need action tokens
# Create dummy action tokens for testing
batch_size = 1
max_action_tokens = policy.config.max_action_tokens
# Create dummy action tokens (in practice, these come from the FAST tokenizer)
dummy_action_tokens = torch.randint(
0, 1000, (batch_size, max_action_tokens), dtype=torch.long, device=DEVICE
)
dummy_action_masks = torch.ones(batch_size, max_action_tokens, dtype=torch.bool, device=DEVICE)
# Add action tokens to the observation
lerobot_observation[ACTION_TOKENS] = dummy_action_tokens
lerobot_observation[ACTION_TOKEN_MASK] = dummy_action_masks
print("\n[LeRobot] Running forward pass...")
policy.train()
with torch.no_grad():
loss, loss_dict = policy.forward(lerobot_observation)
print(f"Loss: {loss.item():.6f}")
print(f"FAST Loss: {loss_dict['ce_loss']:.6f}")
print("\nForward pass completed successfully!")
print(f"Loss value: {loss.item():.6f}")
# The loss should be a positive value
assert loss.item() > 0, "Loss should be positive"
assert not torch.isnan(loss), "Loss should not be NaN"
assert not torch.isinf(loss), "Loss should not be infinite"
print("\nForward pass test passed!")
@require_cuda
def test_pi0_fast_action_token_sampling(policy, preprocessor):
"""Test PI0Fast action token sampling (autoregressive decoding)."""
print("\n" + "=" * 80)
print("Test: PI0Fast Action Token Sampling")
print("=" * 80)
set_seed_all(42)
print("\nCreating dummy data...")
batch = create_dummy_data()
print("\n[LeRobot] Preprocessing...")
lerobot_observation = preprocessor(deepcopy(batch))
# Prepare inputs for model
images, img_masks = policy._preprocess_images(lerobot_observation)
tokens = lerobot_observation[OBS_LANGUAGE_TOKENS]
masks = lerobot_observation[OBS_LANGUAGE_ATTENTION_MASK]
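    # sample_actions_fast autoregressively decodes discrete action tokens conditioned on the image and language prefix.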
print("\n[LeRobot] Sampling action tokens...")
torch.manual_seed(42)
with torch.no_grad():
action_tokens = policy.model.sample_actions_fast(
images,
img_masks,
tokens,
masks,
max_decoding_steps=2,
temperature=0.0, # Greedy decoding for reproducibility
)
print(f"Action tokens shape: {action_tokens.shape}")
print(f"Action tokens first 10: {action_tokens[0, :10].tolist()}")
print("\nExpected values (from original PI0Fast):")
print(f"Expected shape: {EXPECTED_ACTION_TOKENS_SHAPE}")
print(f"Expected first 5: {EXPECTED_ACTION_TOKENS_FIRST_5.tolist()}")
# Verify shape
actual_shape = tuple(action_tokens.shape)
print(f"\nActual shape: {actual_shape}")
assert actual_shape == EXPECTED_ACTION_TOKENS_SHAPE, (
f"Shape mismatch: {actual_shape} vs {EXPECTED_ACTION_TOKENS_SHAPE}"
)
# Compare first 5 tokens
actual_first_5 = action_tokens[0, :5].cpu()
assert torch.equal(actual_first_5, EXPECTED_ACTION_TOKENS_FIRST_5), (
f"First 5 tokens mismatch: {actual_first_5} vs {EXPECTED_ACTION_TOKENS_FIRST_5}"
)
print("\nAction token sampling test completed!")
@require_cuda
def test_pi0_fast_detokenization(policy, preprocessor):
"""Test PI0Fast action detokenization (FAST decoding)."""
print("\n" + "=" * 80)
print("Test: PI0Fast Action Detokenization")
print("=" * 80)
set_seed_all(42)
print("\nCreating dummy data...")
batch = create_dummy_data()
print("\n[LeRobot] Preprocessing...")
lerobot_observation = preprocessor(deepcopy(batch))
# Prepare inputs for model
images, img_masks = policy._preprocess_images(lerobot_observation)
tokens = lerobot_observation[OBS_LANGUAGE_TOKENS]
masks = lerobot_observation[OBS_LANGUAGE_ATTENTION_MASK]
print("\n[LeRobot] Sampling action tokens...")
torch.manual_seed(42)
with torch.no_grad():
action_tokens = policy.model.sample_actions_fast(
images,
img_masks,
tokens,
masks,
max_decoding_steps=2,
temperature=0.0,
)
print(f"Action tokens shape: {action_tokens.shape}")
# Detokenize
print("\n[LeRobot] Detokenizing action tokens...")
action_horizon = policy.config.n_action_steps
action_dim = policy.config.output_features["action"].shape[0]
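    # Detokenization maps the discrete FAST tokens back to a continuous (action_horizon, action_dim) chunk.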
try:
continuous_actions = policy.detokenize_actions(
action_tokens, action_horizon=action_horizon, action_dim=action_dim
)
print(f"Continuous actions shape: {continuous_actions.shape}")
print(f"Continuous actions mean: {continuous_actions.mean().item():.6f}")
print(f"Continuous actions std: {continuous_actions.std().item():.6f}")
print(f"Continuous actions first 5: {continuous_actions[0, 0, :5]}")
print("\nDetokenization successful!")
except Exception as e:
print(f"\nDetokenization failed with error: {e}")
print("This may be expected if the action tokens are not valid FAST tokens.")
print("The test will pass as long as the sampling works correctly.")