Mirror of https://github.com/huggingface/lerobot.git, synced 2026-05-15 00:29:52 +00:00

docs: improve assets (#2777)

* add assets
* add libero results pifast
* update
* update
* update size
* update names
* update training tokenizer
@@ -12,6 +12,12 @@ Developers and researchers can post-train GR00T N1.5 with their own real or synt
 GR00T N1.5 (specifically the GR00T-N1.5-3B model) is built using pre-trained vision and language encoders. It utilizes a flow matching action transformer to model a chunk of actions, conditioned on vision, language, and proprioception (sketched below).

+<img
+  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-groot-paper1%20(1).png"
+  alt="An overview of GR00T"
+  width="80%"
+/>
+
 Its strong performance comes from being trained on an expansive and diverse humanoid dataset, which includes:

 - Real captured data from robots.
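To make the flow-matching action head above concrete, here is a minimal, self-contained PyTorch sketch. Every name and size, and the pooled conditioning vector, are illustrative assumptions, not the GR00T-N1.5 implementation (the doc says the real action head is a transformer; an MLP keeps the sketch short). The network regresses the straight-line velocity from noise to an action chunk, and sampling integrates that field with a few Euler steps:

```python
# Minimal flow-matching sketch for an action-chunk head (illustrative only;
# module names, sizes, and the conditioning vector are assumptions).
import torch
import torch.nn as nn

CHUNK, ACT_DIM, CTX_DIM = 16, 7, 512  # hypothetical chunk length, action dim, encoder width

class VelocityNet(nn.Module):
    """Stand-in for the flow-matching action transformer."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(CHUNK * ACT_DIM + CTX_DIM + 1, 1024), nn.GELU(),
            nn.Linear(1024, CHUNK * ACT_DIM),
        )

    def forward(self, a_t, t, ctx):
        # a_t: (B, CHUNK, ACT_DIM) noisy action chunk; t: (B, 1) flow time;
        # ctx: (B, CTX_DIM) pooled vision/language/proprioception features.
        x = torch.cat([a_t.flatten(1), ctx, t], dim=-1)
        return self.net(x).view(-1, CHUNK, ACT_DIM)

def flow_matching_loss(model, actions, ctx):
    """Regress the straight-line velocity (actions - noise) at a random time t."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    a_t = (1 - t)[:, :, None] * noise + t[:, :, None] * actions  # linear interpolation path
    return ((model(a_t, t, ctx) - (actions - noise)) ** 2).mean()

@torch.no_grad()
def sample_chunk(model, ctx, steps=10):
    """Euler-integrate the learned velocity field from noise to an action chunk."""
    a = torch.randn(ctx.shape[0], CHUNK, ACT_DIM, device=ctx.device)
    for i in range(steps):
        t = torch.full((ctx.shape[0], 1), i / steps, device=ctx.device)
        a = a + model(a, t, ctx) / steps
    return a
```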
@@ -6,6 +6,12 @@
 π₀ represents a breakthrough in robotics as the first general-purpose robot foundation model developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi0). Unlike traditional robot programs that are narrow specialists programmed for repetitive motions, π₀ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.

+<img
+  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pi0%20(1).png"
+  alt="An overview of Pi0"
+  width="85%"
+/>
+
 ### The Vision for Physical Intelligence

 As described by Physical Intelligence, while AI has achieved remarkable success in digital domains, from chess-playing to drug discovery, human intelligence still dramatically outpaces AI in the physical world. To paraphrase Moravec's paradox, winning a game of chess is an "easy" problem for AI, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. π₀ represents a first step toward developing artificial physical intelligence that enables users to simply ask robots to perform any task they want, just as they can with large language models.
@@ -6,6 +6,12 @@
 π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called **FAST (Frequency-space Action Sequence Tokenization)**. This enables training autoregressive VLAs on highly dexterous tasks that are impossible with standard binning-based discretization, while training **up to 5x faster** than diffusion-based approaches like π₀.

+<img
+  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pifast.png"
+  alt="An overview of Pi0-FAST"
+  width="85%"
+/>
+
 ### Why FAST?

 Standard approaches to robot action tokenization use simple per-dimension, per-timestep binning schemes. While passable for simple behaviors, this rapidly breaks down for complex and dexterous skills that require precision and high-frequency control.
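The contrast is easy to see in a small numerical sketch. The following is illustrative only, not the actual FAST pipeline (which additionally compresses the quantized coefficients with a BPE step); the scale constant and shapes are assumptions:

```python
# Illustrative contrast: naive per-timestep binning vs. a DCT-based,
# frequency-space scheme in the spirit of FAST (not the official pipeline).
import numpy as np
from scipy.fft import dct, idct

def bin_tokenize(chunk: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Naive baseline: one token per dimension per timestep, values in [-1, 1]."""
    return np.digitize(chunk, np.linspace(-1, 1, n_bins - 1))

def fast_style_tokenize(chunk: np.ndarray, scale: float = 100.0) -> np.ndarray:
    """DCT along time, then round; `scale` trades precision for sparsity.
    High-frequency coefficients of smooth trajectories land near zero."""
    return np.round(dct(chunk, axis=0, norm="ortho") * scale).astype(np.int32)

def fast_style_detokenize(tokens: np.ndarray, scale: float = 100.0) -> np.ndarray:
    return idct(tokens.astype(np.float64) / scale, axis=0, norm="ortho")

rng = np.random.default_rng(0)
chunk = np.cumsum(rng.normal(0, 0.01, size=(50, 6)), axis=0)  # smooth 50-step, 6-dim chunk
tokens = fast_style_tokenize(chunk)
print("binning tokens:", bin_tokenize(chunk).size)         # always T*D = 300
print("nonzero DCT tokens:", np.count_nonzero(tokens))     # far fewer for smooth chunks
print("max reconstruction error:", np.abs(fast_style_detokenize(tokens) - chunk).max())
```

For a smooth action chunk, most high-frequency DCT coefficients round to zero, so the token sequence is dominated by long zero runs that a BPE stage can merge, whereas the binning baseline always emits one token per dimension per timestep.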
@@ -53,7 +59,7 @@ You have two options for the FAST tokenizer:
 ### Training Your Own Tokenizer

 ```bash
-python src/lerobot/policies/pi0_fast/train_fast_tokenizer.py \
+lerobot-train-tokenizer \
   --repo_id "user/my-lerobot-dataset" \
   --action_horizon 10 \
   --encoded_dims "0:6" \
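For the other option, the pretrained universal tokenizer, loading from the Hub looks roughly like this. This is a hedged sketch: the repo id and the `decode` keyword arguments follow the tokenizer published by Physical Intelligence and should be checked against its model card:

```python
# Hedged sketch: load the pretrained universal FAST tokenizer from the Hub.
# Repo id and decode() arguments are assumptions; verify against the model card.
import numpy as np
from transformers import AutoProcessor

tokenizer = AutoProcessor.from_pretrained(
    "physical-intelligence/fast", trust_remote_code=True
)
action_chunk = np.random.rand(1, 10, 6)  # (batch, action_horizon, action_dim), normalized
tokens = tokenizer(action_chunk)         # -> list of token-id sequences
decoded = tokenizer.decode(tokens, time_horizon=10, action_dim=6)
```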
@@ -90,7 +96,7 @@ policy.type=pi0_fast
 For training π₀-FAST, you can use the LeRobot training script:

 ```bash
-python src/lerobot/scripts/lerobot_train.py \
+lerobot-train \
   --dataset.repo_id=your_dataset \
   --policy.type=pi0_fast \
   --output_dir=./outputs/pi0fast_training \
@@ -171,6 +177,64 @@ The model takes images, text instructions, and robot state as input, and outputs
 | Inference Method | Iterative Denoising | Autoregressive Decoding |
 | KV-Caching | N/A | Supported |

+## Reproducing π₀-FAST results
+
+We reproduce the results of π₀-FAST on the LIBERO benchmark using the LeRobot implementation. We take the LeRobot π₀-FAST base model [lerobot/pi0fast-base](https://huggingface.co/lerobot/pi0fast-base) and finetune it for an additional 40k steps in bfloat16, with a batch size of 256, on 8 H100 GPUs, using the [HuggingFace LIBERO dataset](https://huggingface.co/datasets/HuggingFaceVLA/libero).
+
+The finetuned model can be found here:
+
+- **π₀-FAST LIBERO**: [lerobot/pi0fast-libero](https://huggingface.co/lerobot/pi0fast-libero)
+
+It was trained with the following command:
+
+```bash
+lerobot-train \
+  --dataset.repo_id=lerobot/libero \
+  --output_dir=outputs/libero_pi0fast \
+  --job_name=libero_pi0fast \
+  --policy.path=lerobot/pi0fast_base \
+  --policy.dtype=bfloat16 \
+  --steps=100000 \
+  --save_freq=20000 \
+  --batch_size=4 \
+  --policy.device=cuda \
+  --policy.scheduler_warmup_steps=4000 \
+  --policy.scheduler_decay_steps=100000 \
+  --policy.scheduler_decay_lr=1e-5 \
+  --policy.gradient_checkpointing=true \
+  --policy.chunk_size=10 \
+  --policy.n_action_steps=10 \
+  --policy.max_action_tokens=256 \
+  --policy.empty_cameras=1
+```
+
+We then evaluate the finetuned model using the LeRobot LIBERO implementation by running the following command:
+
+```bash
+tasks="libero_object,libero_spatial,libero_goal,libero_10"
+lerobot-eval \
+  --policy.path=lerobot/pi0fast-libero \
+  --policy.max_action_tokens=256 \
+  --env.type=libero \
+  --policy.gradient_checkpointing=false \
+  --env.task=${tasks} \
+  --eval.batch_size=1 \
+  --eval.n_episodes=1 \
+  --rename_map='{"observation.images.image":"observation.images.base_0_rgb","observation.images.image2":"observation.images.left_wrist_0_rgb"}'
+```
+
+**Note:** We set `n_action_steps=10`, similar to the original OpenPI implementation.
+
+### Results
+
+We obtain the following results on the LIBERO benchmark:
+
+| Model | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average |
+| ----------- | -------------- | ------------- | ----------- | --------- | -------- |
+| **π₀-FAST** | 70.0 | 100.0 | 100.0 | 60.0 | **82.5** |
+
+The full evaluation output folder, including videos, is available [here](https://drive.google.com/drive/folders/1HXpwPTRm4hx6g1sF2P7OOqGG0TwPU7LQ?usp=sharing).
+
+## License
+
+This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).
@@ -4,6 +4,12 @@ SARM (Stage-Aware Reward Modeling) is a video-based reward modeling framework fo
 **Paper**: [SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation](https://arxiv.org/abs/2509.25358)

+<img
+  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-sarm.png"
+  alt="An overview of SARM"
+  width="80%"
+/>
+
 ## Why Reward Models?

 Standard behavior cloning treats all demonstration frames equally, but real-world robot datasets are messy: they contain hesitations, corrections, and variable-quality trajectories. Reward models solve this by learning a generalizable notion of **task progress** from demonstrations: given video frames and a task description, they predict how close the robot is to completing the task (0→1). This learned "progress signal" can be used in multiple ways; two promising applications are (1) **weighted imitation learning** (RA-BC), where high-progress frames receive more weight during policy training, and (2) **reinforcement learning**, where the reward model provides dense rewards for online or offline policy improvement.
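As a minimal illustration of application (1), the sketch below scales a per-frame behavior-cloning loss by a weight derived from the predicted progress signal. The exponential weighting is an assumption for illustration, not SARM's exact RA-BC scheme:

```python
# Minimal sketch of progress-weighted behavior cloning in the spirit of RA-BC;
# the weighting function is an illustrative assumption, not SARM's exact scheme.
import torch

def rabc_weighted_bc_loss(
    pred_actions: torch.Tensor,    # (B, ACT_DIM) policy output
    expert_actions: torch.Tensor,  # (B, ACT_DIM) demonstration actions
    progress_delta: torch.Tensor,  # (B,) reward model's predicted progress gain per frame
    temperature: float = 0.1,
) -> torch.Tensor:
    """Behavior cloning where frames that advance task progress weigh more."""
    per_sample = ((pred_actions - expert_actions) ** 2).mean(dim=-1)  # (B,)
    # Frames with larger predicted progress gain dominate; hesitations and
    # corrections (delta <= 0) are down-weighted. Normalized so the weights
    # average to 1 and keep the loss scale stable.
    weights = torch.softmax(progress_delta / temperature, dim=0) * progress_delta.numel()
    return (weights.detach() * per_sample).mean()
```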
@@ -8,6 +8,12 @@ X Square Robot's WALL-OSS is now integrated into Hugging Face's LeRobot ecos
 The WALL-OSS team is building an embodied foundation model to capture and compress the world's most valuable data: the continuous, high-fidelity stream of physical interaction. By creating a direct feedback loop between the model's decisions and the body's lived experience, WALL-OSS aims to enable the emergence of a truly generalizable intelligence, one that understands not just how the world works, but how to act effectively within it.

+<img
+  src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/walloss-lerobot-paper.png"
+  alt="An overview of WALL-OSS"
+  width="85%"
+/>
+
 Technically, WALL-OSS introduces a tightly coupled multimodal Mixture-of-Experts (MoE) architecture that integrates both discrete and continuous action modeling strategies (illustrated in the sketch after this list). Through a two-stage training pipeline (Inspiration → Integration), the model gradually unifies semantic reasoning and high-frequency action generation. Its core innovations include:

 - **Embodied perception–enhanced multimodal pretraining**: Large-scale training on unified vision–language–action data to strengthen spatial, causal, and manipulation understanding.
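The sketch referenced above: a purely hypothetical shared trunk feeding both a discrete action-token head and a continuous action head, trained with a joint loss. None of this is the WALL-OSS architecture; it only illustrates what integrating discrete and continuous action modeling can mean:

```python
# Purely illustrative sketch (not the WALL-OSS architecture): a shared trunk
# feeding a discrete action-token head and a continuous action head, trained
# jointly so semantic reasoning and high-frequency control share features.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, ACT_DIM, HID = 1024, 7, 512  # hypothetical sizes

class TwoHeadPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(HID, HID), nn.GELU())
        self.token_head = nn.Linear(HID, VOCAB)   # discrete action tokens
        self.cont_head = nn.Linear(HID, ACT_DIM)  # continuous actions

    def forward(self, feats: torch.Tensor):
        h = self.trunk(feats)
        return self.token_head(h), self.cont_head(h)

def joint_loss(policy, feats, token_targets, cont_targets, alpha=0.5):
    """Mix a cross-entropy loss on tokens with a regression loss on actions."""
    logits, cont = policy(feats)
    return alpha * F.cross_entropy(logits, token_targets) + (1 - alpha) * F.mse_loss(cont, cont_targets)
```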