diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 1a4558f93..46e549939 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -22,7 +22,11 @@
       title: "Tutorials"
 - sections:
     - local: smolvla
-      title: Finetune SmolVLA
+      title: SmolVLA
+    - local: pi0
+      title: π₀ (Pi0)
+    - local: pi05
+      title: π₀.₅ (Pi05)
   title: "Policies"
 - sections:
     - local: hope_jr
diff --git a/docs/source/pi0.mdx b/docs/source/pi0.mdx
new file mode 100644
index 000000000..315c642e6
--- /dev/null
+++ b/docs/source/pi0.mdx
@@ -0,0 +1,109 @@
+# π₀ (Pi0)
+
+π₀ is a **Vision-Language-Action model for general robot control** from Physical Intelligence. The LeRobot implementation is adapted from their open-source [OpenPI](https://github.com/Physical-Intelligence/openpi) repository.
+
+## Model Overview
+
+π₀ represents a breakthrough in robotics as the first general-purpose robot foundation model developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi0). Unlike traditional robots, which are narrow specialists programmed for repetitive motions, π₀ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.
+
+### The Vision for Physical Intelligence
+
+As described by Physical Intelligence, while AI has achieved remarkable success in digital domains, from chess-playing to drug discovery, human intelligence still dramatically outpaces AI in the physical world. To paraphrase Moravec's paradox, winning a game of chess is an "easy" problem for AI, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. π₀ is a first step toward artificial physical intelligence that lets users simply ask robots to perform any task they want, just as they can with large language models.
+
+### Architecture and Approach
+
+π₀ combines several key innovations:
+
+- **Flow Matching**: Augments a pre-trained VLM with continuous action outputs via flow matching (a variant of diffusion models), as sketched below
+- **Cross-Embodiment Training**: Trained on data from 8 distinct robot platforms, including UR5e, Bimanual UR5e, Franka, Bimanual Trossen, Bimanual ARX, Mobile Trossen, and Mobile Fibocom
+- **Internet-Scale Pre-training**: Inherits semantic knowledge from a pre-trained 3B-parameter Vision-Language Model
+- **High-Frequency Control**: Outputs motor commands at up to 50 Hz for real-time dexterous manipulation
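+
+At inference time, flow matching turns action generation into numerically integrating a learned velocity field from Gaussian noise toward an action chunk. The sketch below is illustrative only — `velocity_net` and `obs_embedding` are stand-in names, not the LeRobot API — but it shows the core loop:
+
+```python
+import torch
+
+def sample_actions(velocity_net, obs_embedding, horizon=50, action_dim=32, num_steps=10):
+    """Illustrative flow-matching sampler: Euler-integrate a learned
+    velocity field v(a_t, t | obs) from noise to an action chunk."""
+    a_t = torch.randn(1, horizon, action_dim)    # start from Gaussian noise
+    dt = 1.0 / num_steps
+    for k in range(num_steps):
+        t = torch.full((1,), k * dt)             # current flow time in [0, 1)
+        v = velocity_net(a_t, t, obs_embedding)  # predicted velocity field
+        a_t = a_t + dt * v                       # one Euler step toward the data
+    return a_t  # a chunk of `horizon` actions, executed at high frequency
+```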
+
+## Installation Requirements
+
+⚠️ **Warning**: This policy requires patching the Hugging Face `transformers` library.
+
+### Prerequisites
+
+1. Ensure you have the exact version installed:
+
+   ```bash
+   pip show transformers
+   ```
+
+   It must be version **4.53.2**.
+
+2. Apply the custom patches:
+
+   ```bash
+   cp -r ./src/lerobot/policies/pi0_openpi/transformers_replace/* \
+     $(python -c "import transformers, os; print(os.path.dirname(transformers.__file__))")
+   ```
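+
+Before copying the files, it is worth failing fast if the installed version does not match the one the patches target. A minimal guard (an illustrative addition, not part of the official steps):
+
+```bash
+# Abort if transformers is not exactly the version the patches were written for
+python -c "import transformers; assert transformers.__version__ == '4.53.2', transformers.__version__"
+```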
+
+### What the patches do
+
+- Add support for **AdaRMS** (adaptive RMSNorm) conditioning
+- Correctly control the precision of activations
+- Allow the KV cache to be used without updates
+
+**Important Notes:**
+
+- This modifies your `transformers` installation in place
+- The changes persist until you reinstall `transformers` or recreate the environment
+
+### Restoring Clean State
+
+To undo the patches and restore a clean installation:
+
+```bash
+pip uninstall transformers
+pip install transformers==4.53.2
+```
+
+## Training Data and Capabilities
+
+π₀ is trained on the largest robot interaction dataset to date, combining three key data sources:
+
+1. **Internet-Scale Pre-training**: Vision-language data from the web for semantic understanding
+2. **Open X-Embodiment Dataset**: Open-source robot manipulation datasets
+3. **Physical Intelligence Dataset**: A large and diverse dataset of dexterous tasks across 8 distinct robots
+
+## Usage
+
+To use π₀ in LeRobot, specify the policy type as:
+
+```bash
+policy.type=pi0_openpi
+```
+
+## Training
+
+To train π₀, use the standard LeRobot training script with the appropriate configuration:
+
+```bash
+python src/lerobot/scripts/train.py \
+  --dataset.repo_id=your_dataset \
+  --policy.type=pi0_openpi \
+  --output_dir=./outputs/pi0_training \
+  --job_name=pi0_training \
+  --policy.pretrained_path=pepijn223/pi0_base_fp32 \
+  --policy.repo_id=your_repo_id \
+  --policy.compile_model=true \
+  --policy.gradient_checkpointing=true \
+  --policy.dtype=bfloat16 \
+  --steps=3000 \
+  --policy.scheduler_decay_steps=3000 \
+  --policy.device=cuda \
+  --batch_size=32
+```
+
+### Key Training Parameters
+
+- **`--policy.compile_model=true`**: Enables model compilation for faster training
+- **`--policy.gradient_checkpointing=true`**: Significantly reduces memory usage during training
+- **`--policy.dtype=bfloat16`**: Uses mixed-precision training for efficiency
+- **`--policy.pretrained_path=pepijn223/pi0_base_fp32`**: The base π₀ model to finetune; options are `pepijn223/pi0_base_fp32`, `pepijn223/pi0_libero_fp32`, and `pepijn223/pi0_droid_fp32`
+- **`--batch_size=32`**: Batch size for training; adapt this to your GPU memory
+
+## License
+
+This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).
diff --git a/docs/source/pi05.mdx b/docs/source/pi05.mdx
new file mode 100644
index 000000000..49d06da26
--- /dev/null
+++ b/docs/source/pi05.mdx
@@ -0,0 +1,135 @@
+# π₀.₅ (Pi05)
+
+π₀.₅ is a **Vision-Language-Action model with open-world generalization** from Physical Intelligence. The LeRobot implementation is adapted from their open-source [OpenPI](https://github.com/Physical-Intelligence/openpi) repository.
+
+## Model Overview
+
+π₀.₅ represents a significant evolution from π₀, developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi05) to address the biggest challenge in robotics: **open-world generalization**. While robots can perform impressive feats in controlled environments, π₀.₅ is designed to generalize to entirely new environments and situations that were never seen during training.
+
+### The Generalization Challenge
+
+As Physical Intelligence explains, the fundamental challenge isn't performing feats of agility or dexterity, but generalization: the ability to correctly perform tasks in new settings with new objects. Consider a robot cleaning different homes: each home has different objects in different places. Generalization must occur at multiple levels:
+
+- **Physical Level**: Understanding how to pick up a spoon (by the handle) or a plate (by the edge), even with unseen objects in cluttered environments
+- **Semantic Level**: Understanding task semantics, such as where clothes and shoes belong (the laundry hamper, not the bed) and which tools are appropriate for cleaning a spill
+- **Environmental Level**: Adapting to "messy" real-world environments like homes, grocery stores, offices, and hospitals
+
+### Co-Training on Heterogeneous Data
+
+The breakthrough innovation in π₀.₅ is **co-training on heterogeneous data sources**. The model learns from:
+
+1. **Multimodal Web Data**: Image captioning, visual question answering, object detection
+2. **Verbal Instructions**: Humans coaching robots through complex tasks step by step
+3. **Subtask Commands**: High-level semantic behavior labels (e.g., "pick up the pillow" for an unmade bed)
+4. **Cross-Embodiment Robot Data**: Data from various robot platforms with different capabilities
+5. **Multi-Environment Data**: Static robots deployed across many different homes
+6. **Mobile Manipulation Data**: ~400 hours of mobile robot demonstrations
+
+This diverse training mixture creates a "curriculum" that enables generalization across physical, visual, and semantic levels simultaneously.
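+
+Mechanically, co-training amounts to drawing each training batch from several datasets at once, with a fixed weight per source. The snippet below is a toy illustration with made-up source names and weights; the real mixture and its proportions are defined by the OpenPI training recipe:
+
+```python
+import random
+
+# Hypothetical data sources: in practice each would be a full dataset object.
+sources = {
+    "web_multimodal": ["caption_example", "vqa_example"],
+    "verbal_instructions": ["coaching_example"],
+    "subtask_commands": ["pick_up_the_pillow"],
+    "robot_data": ["mobile_manipulation_episode"],
+}
+# Hypothetical mixture weights; these are NOT the published proportions.
+weights = {"web_multimodal": 0.4, "verbal_instructions": 0.1,
+           "subtask_commands": 0.2, "robot_data": 0.3}
+
+def sample_batch(batch_size=32):
+    """Draw a batch whose composition follows the mixture weights."""
+    names = random.choices(list(sources), weights=[weights[n] for n in sources], k=batch_size)
+    return [random.choice(sources[n]) for n in names]
+```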
+
+## Installation Requirements
+
+⚠️ **Warning**: This policy requires patching the Hugging Face `transformers` library.
+
+### Prerequisites
+
+1. Ensure you have the exact version installed:
+
+   ```bash
+   pip show transformers
+   ```
+
+   It must be version **4.53.2**.
+
+2. Apply the custom patches:
+
+   ```bash
+   cp -r ./src/lerobot/policies/pi05_openpi/transformers_replace/* \
+     $(python -c "import transformers, os; print(os.path.dirname(transformers.__file__))")
+   ```
+
+### What the patches do
+
+- Add support for **AdaRMS** (adaptive RMSNorm) conditioning
+- Correctly control the precision of activations
+- Allow the KV cache to be used without updates
+
+**Important Notes:**
+
+- This modifies your `transformers` installation in place
+- The changes persist until you reinstall `transformers` or recreate the environment
+
+### Restoring Clean State
+
+To undo the patches and restore a clean installation:
+
+```bash
+pip uninstall transformers
+pip install transformers==4.53.2
+```
+
+## Usage
+
+To use π₀.₅ in your LeRobot configuration, specify the policy type as:
+
+```bash
+policy.type=pi05_openpi
+```
+
+## Training
+
+### Training Command Example
+
+Here's a complete training command for finetuning the base π₀.₅ model on your own dataset:
+
+```bash
+python src/lerobot/scripts/train.py \
+  --dataset.repo_id=your_dataset \
+  --policy.type=pi05_openpi \
+  --output_dir=./outputs/pi05_training \
+  --job_name=pi05_training \
+  --policy.pretrained_path=pepijn223/pi05_base_fp32 \
+  --policy.repo_id=your_repo_id \
+  --policy.compile_model=true \
+  --policy.gradient_checkpointing=true \
+  --wandb.enable=true \
+  --policy.dtype=bfloat16 \
+  --steps=3000 \
+  --policy.scheduler_decay_steps=3000 \
+  --policy.device=cuda \
+  --batch_size=32
+```
+
+### Key Training Parameters
+
+- **`--policy.compile_model=true`**: Enables model compilation for faster training
+- **`--policy.gradient_checkpointing=true`**: Significantly reduces memory usage during training
+- **`--policy.dtype=bfloat16`**: Uses mixed-precision training for efficiency
+- **`--policy.pretrained_path=pepijn223/pi05_base_fp32`**: The base π₀.₅ model to finetune; options are `pepijn223/pi05_base_fp32`, `pepijn223/pi05_libero_fp32`, and `pepijn223/pi05_droid_fp32`
+- **`--batch_size=32`**: Batch size for training; adapt this to your GPU memory
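+
+If the default batch size exceeds your GPU memory, reducing it while keeping the memory-saving flags is a reasonable starting point. A hypothetical lower-memory variant of the command above (same flags as documented here, smaller batch; adjust to your hardware):
+
+```bash
+python src/lerobot/scripts/train.py \
+  --dataset.repo_id=your_dataset \
+  --policy.type=pi05_openpi \
+  --policy.pretrained_path=pepijn223/pi05_base_fp32 \
+  --policy.gradient_checkpointing=true \
+  --policy.dtype=bfloat16 \
+  --policy.device=cuda \
+  --batch_size=8
+```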
+
+## Performance Results
+
+### Libero Benchmark Results
+
+π₀.₅ has demonstrated strong performance on the Libero benchmark suite:
+
+| Benchmark      | LeRobot implementation | OpenPI reference (finetuned 30k steps) |
+| -------------- | ---------------------- | -------------------------------------- |
+| Libero Spatial | 98.0%                  | 98.8%                                  |
+| Libero Object  | 99.0%                  | 98.2%                                  |
+| Libero Goal    | 97.0%                  | 98.0%                                  |
+| Libero 10      | 93.0%                  | 92.4%                                  |
+| **Average**    | **96.75%**             | **96.85%**                             |
+
+These results demonstrate π₀.₅'s strong generalization capabilities across diverse robotic manipulation tasks.
+
+## License
+
+This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).
diff --git a/docs/source/smolvla.mdx b/docs/source/smolvla.mdx
index 89c475a90..d25bbcd09 100644
--- a/docs/source/smolvla.mdx
+++ b/docs/source/smolvla.mdx
@@ -1,4 +1,4 @@
-# Finetune SmolVLA
+# SmolVLA
 
 SmolVLA is Hugging Face’s lightweight foundation model for robotics. Designed for easy fine-tuning on LeRobot datasets, it helps accelerate your development!
 
diff --git a/src/lerobot/policies/pi0_openpi/README.md b/src/lerobot/policies/pi0_openpi/README.md
index 7ea3a0e16..06bbdbdaa 100644
--- a/src/lerobot/policies/pi0_openpi/README.md
+++ b/src/lerobot/policies/pi0_openpi/README.md
@@ -1,7 +1,7 @@
 # π₀ (pi0)
 
 This repository contains the Hugging Face port of **π₀**, adapted from [OpenPI](https://github.com/Physical-Intelligence/openpi) by Physical Intelligence.
-It is designed as a **Vision-Language-Action flow model for general robot control**.
+It is designed as a **Vision-Language-Action model for general robot control**.
 
 ---