# π₀-FAST (Pi0-FAST)

π₀-FAST is a **Vision-Language-Action model for general robot control** that uses autoregressive next-token prediction to model continuous robot actions.

## Model Overview

π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called **FAST (Frequency-space Action Sequence Tokenization)**. This enables training autoregressive VLAs on highly dexterous tasks that are impossible with standard binning-based discretization, while training **up to 5x faster** than diffusion-based approaches like π₀.

<img
  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pifast.png"
  alt="An overview of Pi0-FAST"
  width="85%"
/>

### Why FAST?

Standard approaches for robot action tokenization use simple per-dimension, per-timestep binning schemes. While passable for simple behaviors, this rapidly breaks down for complex and dexterous skills that require precision and high-frequency control.

FAST solves this by compressing action sequences using signal processing techniques, resulting in a dense sequence of action tokens that can be predicted autoregressively—just like language tokens.

### How FAST Tokenization Works

The FAST tokenizer compresses action sequences through the following steps:

1. **Normalize**: Take a continuous action chunk of shape `(H, D)`, where `H` is the horizon and `D` is the action dimension. Normalize using one of the supported normalization methods (quantiles are recommended to handle outliers).

2. **Discrete Cosine Transform (DCT)**: Apply the DCT (via scipy) to each action dimension separately. The DCT is a transform widely used in lossy image and audio codecs (JPEG, MP3).

3. **Quantization**: Round and drop insignificant coefficients for each action dimension, producing a sparse frequency matrix.

4. **Flatten**: Flatten the matrix into a 1D vector, with low-frequency components first.

5. **Byte Pair Encoding (BPE)**: Train a BPE tokenizer to compress the DCT coefficients into dense action tokens, typically achieving **10x compression** over prior tokenization approaches.
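
A minimal numpy/scipy sketch of steps 1–4 (the BPE stage is omitted, and the exact quantization and flattening details here are illustrative assumptions, not LeRobot's implementation):

```python
import numpy as np
from scipy.fft import dct, idct

def fast_encode(chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Steps 2-4: per-dimension DCT, quantize, flatten (chunk is (H, D), already normalized)."""
    coeffs = dct(chunk, axis=0, norm="ortho")              # DCT over time, per action dimension
    quantized = np.round(coeffs * scale).astype(np.int64)  # insignificant coefficients collapse to 0
    return quantized.flatten()                             # row-major: low frequencies first

def fast_decode(tokens: np.ndarray, horizon: int, dim: int, scale: float = 10.0) -> np.ndarray:
    """Invert the sketch: reshape, rescale, inverse DCT."""
    return idct(tokens.reshape(horizon, dim) / scale, axis=0, norm="ortho")

# Round trip on a smooth (H=10, D=6) chunk: quantization loss stays small
chunk = np.stack([np.linspace(0.0, 1.0, 10)] * 6, axis=1)
recon = fast_decode(fast_encode(chunk), horizon=10, dim=6)
```

In a real tokenizer, the flattened integer coefficients would then be compressed with the BPE vocabulary from step 5.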

This approach can transform **any existing VLM** into a VLA by training it to predict these FAST tokens.

## Installation Requirements

1. Install LeRobot by following our [Installation Guide](./installation).
2. Install π₀-FAST dependencies by running:

```bash
pip install -e ".[pi]"
```

> [!NOTE]
> For lerobot 0.4.0, if you want to install the `pi` extra, you will have to run: `pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git"`.
>
> This will be resolved in the next patch release.

## Training a Custom FAST Tokenizer

You have two options for the FAST tokenizer:

1. **Use the pre-trained tokenizer**: The `lerobot/fast-action-tokenizer` tokenizer was trained on 1M+ real robot action sequences and works as a general-purpose tokenizer.
2. **Train your own tokenizer**: For maximum performance on your specific dataset, you can finetune the tokenizer on your own data.

### Training Your Own Tokenizer

```bash
lerobot-train-tokenizer \
    --repo_id "user/my-lerobot-dataset" \
    --action_horizon 10 \
    --encoded_dims "0:6" \
    --vocab_size 1024 \
    --scale 10.0 \
    --normalization_mode QUANTILES \
    --output_dir "./my_fast_tokenizer" \
    --push_to_hub \
    --hub_repo_id "username/my-action-tokenizer"
```

### Key Tokenizer Parameters

| Parameter              | Description                                                                       | Default      |
| ---------------------- | --------------------------------------------------------------------------------- | ------------ |
| `--repo_id`            | LeRobot dataset repository ID                                                     | Required     |
| `--action_horizon`     | Number of future actions in each chunk                                            | `10`         |
| `--encoded_dims`       | Comma-separated dimension ranges to encode (e.g., `"0:6,7:23"`)                   | `"0:6,7:23"` |
| `--vocab_size`         | BPE vocabulary size                                                               | `1024`       |
| `--scale`              | DCT scaling factor for quantization                                               | `10.0`       |
| `--normalization_mode` | Normalization mode (`MEAN_STD`, `MIN_MAX`, `QUANTILES`, `QUANTILE10`, `IDENTITY`) | `QUANTILES`  |
| `--sample_fraction`    | Fraction of chunks to sample per episode                                          | `0.1`        |

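For illustration, the `--encoded_dims` string can be read as a comma-separated list of `start:end` dimension ranges. This tiny parser (`parse_encoded_dims` is a hypothetical helper, not part of LeRobot, and the half-open range convention is an assumption) shows the format:

```python
def parse_encoded_dims(spec: str) -> list[tuple[int, int]]:
    """Parse a comma-separated list of `start:end` dimension ranges, e.g. "0:6,7:23"."""
    ranges = []
    for part in spec.split(","):
        start, end = part.split(":")
        ranges.append((int(start), int(end)))
    return ranges

print(parse_encoded_dims("0:6,7:23"))  # [(0, 6), (7, 23)]
```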
## Usage

To use π₀-FAST in LeRobot, specify the policy type as:

```bash
--policy.type=pi0_fast
```

## Training

For training π₀-FAST, you can use the LeRobot training script:

```bash
lerobot-train \
    --dataset.repo_id=your_dataset \
    --policy.type=pi0_fast \
    --output_dir=./outputs/pi0fast_training \
    --job_name=pi0fast_training \
    --policy.pretrained_path=lerobot/pi0_fast_base \
    --policy.dtype=bfloat16 \
    --policy.gradient_checkpointing=true \
    --policy.chunk_size=10 \
    --policy.n_action_steps=10 \
    --policy.max_action_tokens=256 \
    --steps=100000 \
    --batch_size=4 \
    --policy.device=cuda
```

### Key Training Parameters

| Parameter                              | Description                                        | Default                         |
| -------------------------------------- | -------------------------------------------------- | ------------------------------- |
| `--policy.gradient_checkpointing=true` | Reduces memory usage significantly during training | `false`                         |
| `--policy.dtype=bfloat16`              | Use mixed precision training for efficiency        | `float32`                       |
| `--policy.chunk_size`                  | Number of action steps to predict (action horizon) | `50`                            |
| `--policy.n_action_steps`              | Number of action steps to execute                  | `50`                            |
| `--policy.max_action_tokens`           | Maximum number of FAST tokens per action chunk     | `256`                           |
| `--policy.action_tokenizer_name`       | FAST tokenizer to use                              | `lerobot/fast-action-tokenizer` |
| `--policy.compile_model=true`          | Enable torch.compile for faster training           | `false`                         |

## Inference

### KV-Caching for Fast Inference

π₀-FAST supports **KV-caching**, a widely used optimization in LLM inference. This caches the key-value pairs from the attention mechanism, avoiding redundant computation during autoregressive decoding.

```bash
# KV-caching is enabled by default
--policy.use_kv_cache=true
```

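A toy single-head sketch of the idea (illustrative numpy code, not the model's actual attention): each decoding step computes the key/value for the new token once, appends them to the cache, and attends over the cache instead of recomputing the whole prefix.

```python
import numpy as np

def attend(q, keys, vals):
    """Single-query softmax attention over cached keys/values."""
    scores = keys @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ vals

rng = np.random.default_rng(0)
d = 8
cache_k, cache_v = [], []
outputs = []
for step in range(5):          # autoregressive decoding loop
    x = rng.normal(size=d)     # stand-in for the new token's hidden state
    cache_k.append(x)          # cache this step's key/value once...
    cache_v.append(x)
    # ...and reuse all earlier ones instead of recomputing them
    outputs.append(attend(x, np.stack(cache_k), np.stack(cache_v)))
```

The cached result at each step matches a full recompute over the prefix, which is why caching changes cost but not output.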
### Inference Example
```python
from lerobot.policies.pi0_fast import PI0FastPolicy

# Load the policy
policy = PI0FastPolicy.from_pretrained("your-model-path")

# During inference
actions = policy.predict_action_chunk(batch)
```

## Model Architecture

π₀-FAST uses a PaliGemma-based architecture:

- **Vision Encoder**: SigLIP vision tower for image understanding
- **Language Model**: Gemma 2B for processing language instructions and predicting action tokens

The model takes images, text instructions, and robot state as input, and outputs discrete FAST tokens that are decoded back to continuous actions.

## Configuration Options

| Parameter            | Description                                      | Default    |
| -------------------- | ------------------------------------------------ | ---------- |
| `paligemma_variant`  | VLM backbone variant (`gemma_300m`, `gemma_2b`)  | `gemma_2b` |
| `max_state_dim`      | Maximum state vector dimension (padded)          | `32`       |
| `max_action_dim`     | Maximum action vector dimension (padded)         | `32`       |
| `temperature`        | Sampling temperature (`0.0` for greedy decoding) | `0.0`      |
| `max_decoding_steps` | Maximum decoding steps                           | `256`      |
| `use_kv_cache`       | Enable KV caching for faster inference           | `true`     |

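`temperature` behaves as in standard LLM decoding: `0.0` selects the argmax token greedily, while larger values sample from a softened distribution. A minimal sketch of that rule (illustrative, not LeRobot's decoding code):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 0.0, rng=None) -> int:
    """Greedy argmax when temperature == 0.0, otherwise softmax sampling."""
    if temperature == 0.0:
        return int(np.argmax(logits))
    rng = rng or np.random.default_rng()
    z = logits / temperature
    p = np.exp(z - z.max())   # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))

print(sample_token(np.array([0.1, 2.5, -1.0])))  # greedy -> 1
```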
## Comparison with π₀

| Feature               | π₀                        | π₀-FAST                      |
| --------------------- | ------------------------- | ---------------------------- |
| Action Representation | Flow Matching (Diffusion) | Autoregressive Tokens (FAST) |
| Training Speed        | 1x                        | **5x faster**                |
| Dexterity             | High                      | High                         |
| Inference Method      | Iterative Denoising       | Autoregressive Decoding      |
| KV-Caching            | N/A                       | Supported                    |

## Reproducing π₀-FAST results

We reproduce the results of π₀-FAST on the LIBERO benchmark using the LeRobot implementation. We take the LeRobot π₀-FAST base model [lerobot/pi0fast-base](https://huggingface.co/lerobot/pi0fast-base) and finetune it for an additional 40k steps in bfloat16, with a batch size of 256 on 8 H100 GPUs, using the [HuggingFace LIBERO dataset](https://huggingface.co/datasets/HuggingFaceVLA/libero).

The finetuned model can be found here:

- **π₀-FAST LIBERO**: [lerobot/pi0fast-libero](https://huggingface.co/lerobot/pi0fast-libero)

With the following training command:

```bash
lerobot-train \
    --dataset.repo_id=lerobot/libero \
    --output_dir=outputs/libero_pi0fast \
    --job_name=libero_pi0fast \
    --policy.path=lerobot/pi0fast_base \
    --policy.dtype=bfloat16 \
    --steps=100000 \
    --save_freq=20000 \
    --batch_size=4 \
    --policy.device=cuda \
    --policy.scheduler_warmup_steps=4000 \
    --policy.scheduler_decay_steps=100000 \
    --policy.scheduler_decay_lr=1e-5 \
    --policy.gradient_checkpointing=true \
    --policy.chunk_size=10 \
    --policy.n_action_steps=10 \
    --policy.max_action_tokens=256 \
    --policy.empty_cameras=1
```

We then evaluate the finetuned model using the LeRobot LIBERO implementation, by running the following command:
```bash
tasks="libero_object,libero_spatial,libero_goal,libero_10"
lerobot-eval \
    --policy.path=lerobot/pi0fast-libero \
    --policy.max_action_tokens=256 \
    --env.type=libero \
    --policy.gradient_checkpointing=false \
    --env.task=${tasks} \
    --eval.batch_size=1 \
    --eval.n_episodes=1 \
    --rename_map='{"observation.images.image":"observation.images.base_0_rgb","observation.images.image2":"observation.images.left_wrist_0_rgb"}'
```

**Note:** We set `n_action_steps=10`, similar to the original OpenPI implementation.

### Results

We obtain the following results on the LIBERO benchmark:

| Model       | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average  |
| ----------- | -------------- | ------------- | ----------- | --------- | -------- |
| **π₀-FAST** | 70.0           | 100.0         | 100.0       | 60.0      | **82.5** |

The full evaluation output folder, including videos, is available [here](https://drive.google.com/drive/folders/1HXpwPTRm4hx6g1sF2P7OOqGG0TwPU7LQ?usp=sharing).

## License

This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).

## References

- [FAST: Efficient Robot Action Tokenization](https://www.physicalintelligence.company/research/fast) - Physical Intelligence Blog
- [OpenPI Repository](https://github.com/Physical-Intelligence/openpi) - Original implementation
- [FAST Tokenizer on Hugging Face](https://huggingface.co/physical-intelligence/fast) - Pre-trained tokenizer