docs: improve assets (#2777)

* add assets

* add libero results pifast:

* update

* update

* update size

* update naems:
:

* update training tokenizer
This commit is contained in:
Jade Choghari
2026-01-12 13:33:28 +01:00
committed by GitHub
parent 91ff9c4975
commit 473f1bd0e0
8 changed files with 129 additions and 7 deletions
+5 -5
View File
@@ -100,11 +100,11 @@ lerobot-train \
--dataset.repo_id=lerobot/aloha_mobile_cabinet
```
| Category | Models |
| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Imitation Learning** | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md) |
| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon) |
| **VLAs Models** | [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx) |
| Category | Models |
| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Imitation Learning** | [ACT](./docs/source/policy_act_README.md), [Diffusion](./docs/source/policy_diffusion_README.md), [VQ-BeT](./docs/source/policy_vqbet_README.md) |
| **Reinforcement Learning** | [HIL-SERL](./docs/source/hilserl.mdx), [TDMPC](./docs/source/policy_tdmpc_README.md) & QC-FQL (coming soon) |
| **VLAs Models** | [Pi0Fast](./docs/source/pi0fast.mdx), [Pi0.5](./docs/source/pi05.mdx), [GR00T N1.5](./docs/source/policy_groot_README.md), [SmolVLA](./docs/source/policy_smolvla_README.md), [XVLA](./docs/source/xvla.mdx) |
Similarly to the hardware, you can easily implement your own policy & leverage LeRobot's data collection, training, and visualization tools, and share your model to the HF Hub
+6
View File
@@ -12,6 +12,12 @@ Developers and researchers can post-train GR00T N1.5 with their own real or synt
GR00T N1.5 (specifically the GR00T-N1.5-3B model) is built using pre-trained vision and language encoders. It utilizes a flow matching action transformer to model a chunk of actions, conditioned on vision, language, and proprioception.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-groot-paper1%20(1).png"
alt="An overview of GR00T"
width="80%"
/>
Its strong performance comes from being trained on an expansive and diverse humanoid dataset, which includes:
- Real captured data from robots.
+6
View File
@@ -6,6 +6,12 @@
π₀ represents a breakthrough in robotics as the first general-purpose robot foundation model developed by [Physical Intelligence](https://www.physicalintelligence.company/blog/pi0). Unlike traditional robot programs that are narrow specialists programmed for repetitive motions, π₀ is designed to be a generalist policy that can understand visual inputs, interpret natural language instructions, and control a variety of different robots across diverse tasks.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pi0%20(1).png"
alt="An overview of Pi0"
width="85%"
/>
### The Vision for Physical Intelligence
As described by Physical Intelligence, while AI has achieved remarkable success in digital domains, from chess-playing to drug discovery, human intelligence still dramatically outpaces AI in the physical world. To paraphrase Moravec's paradox, winning a game of chess represents an "easy" problem for AI, but folding a shirt or cleaning up a table requires solving some of the most difficult engineering problems ever conceived. π₀ represents a first step toward developing artificial physical intelligence that enables users to simply ask robots to perform any task they want, just like they can with large language models.
+66 -2
View File
@@ -6,6 +6,12 @@
π₀-FAST combines the power of Vision-Language Models with a novel action tokenization approach called **FAST (Frequency-space Action Sequence Tokenization)**. This enables training autoregressive VLAs on highly dexterous tasks that are impossible with standard binning-based discretization, while training **up to 5x faster** than diffusion-based approaches like π₀.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-pifast.png"
alt="An overview of Pi0-FAST"
width="85%"
/>
### Why FAST?
Standard approaches for robot action tokenization use simple per-dimension, per-timestep binning schemes. While passable for simple behaviors, this rapidly breaks down for complex and dexterous skills that require precision and high-frequency control.
@@ -53,7 +59,7 @@ You have two options for the FAST tokenizer:
### Training Your Own Tokenizer
```bash
python src/lerobot/policies/pi0_fast/train_fast_tokenizer.py \
lerobot-train-tokenizer \
--repo_id "user/my-lerobot-dataset" \
--action_horizon 10 \
--encoded_dims "0:6" \
@@ -90,7 +96,7 @@ policy.type=pi0_fast
For training π₀-FAST, you can use the LeRobot training script:
```bash
python src/lerobot/scripts/lerobot_train.py \
lerobot-train \
--dataset.repo_id=your_dataset \
--policy.type=pi0_fast \
--output_dir=./outputs/pi0fast_training \
@@ -171,6 +177,64 @@ The model takes images, text instructions, and robot state as input, and outputs
| Inference Method | Iterative Denoising | Autoregressive Decoding |
| KV-Caching | N/A | Supported |
## Reproducing π₀Fast results
We reproduce the results of π₀Fast on the LIBERO benchmark using the LeRobot implementation. We take the LeRobot PiFast base model [lerobot/pi0fast-base](https://huggingface.co/lerobot/pi0fast-base) and finetune for an additional 40kk steps in bfloat16, with batch size of 256 on 8 H100 GPUs using the [HuggingFace LIBERO dataset](https://huggingface.co/datasets/HuggingFaceVLA/libero).
The finetuned model can be found here:
- **π₀Fast LIBERO**: [lerobot/pi0fast-libero](https://huggingface.co/lerobot/pi0fast-libero)
With the following training command:
```bash
lerobot-train \
--dataset.repo_id=lerobot/libero \
--output_dir=outputs/libero_pi0fast \
--job_name=libero_pi0fast \
--policy.path=lerobot/pi0fast_base \
--policy.dtype=bfloat16 \
--steps=100000 \
--save_freq=20000 \
--batch_size=4 \
--policy.device=cuda \
--policy.scheduler_warmup_steps=4000 \
--policy.scheduler_decay_steps=100000 \
--policy.scheduler_decay_lr=1e-5 \
--policy.gradient_checkpointing=true \
--policy.chunk_size=10 \
--policy.n_action_steps=10 \
--policy.max_action_tokens=256 \
--policy.empty_cameras=1 \
```
We then evaluate the finetuned model using the LeRobot LIBERO implementation, by running the following command:
```bash
tasks="libero_object,libero_spatial,libero_goal,libero_10"
lerobot-eval \
--policy.path=lerobot/pi0fast-libero \
--policy.max_action_tokens=256 \
--env.type=libero \
--policy.gradient_checkpointing=false \
--env.task=${tasks} \
--eval.batch_size=1 \
--eval.n_episodes=1 \
--rename_map='{"observation.images.image":"observation.images.base_0_rgb","observation.images.image2":"observation.images.left_wrist_0_rgb"}'
```
**Note:** We set `n_action_steps=10`, similar to the original OpenPI implementation.
### Results
We obtain the following results on the LIBERO benchmark:
| Model | LIBERO Spatial | LIBERO Object | LIBERO Goal | LIBERO 10 | Average |
| ----------- | -------------- | ------------- | ----------- | --------- | -------- |
| **π₀-fast** | 70.0 | 100.0 | 100.0 | 60.0 | **82.5** |
The full evaluation output folder, including videos, is available [here](https://drive.google.com/drive/folders/1HXpwPTRm4hx6g1sF2P7OOqGG0TwPU7LQ?usp=sharing)
## License
This model follows the **Apache 2.0 License**, consistent with the original [OpenPI repository](https://github.com/Physical-Intelligence/openpi).
+6
View File
@@ -4,6 +4,12 @@ SARM (Stage-Aware Reward Modeling) is a video-based reward modeling framework fo
**Paper**: [SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation](https://arxiv.org/abs/2509.25358)
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/lerobot-sarm.png"
alt="An overview of SARM"
width="80%"
/>
## Why Reward Models?
Standard behavior cloning treats all demonstration frames equally, but real-world robot datasets are messy. They contain hesitations, corrections, and variable-quality trajectories. Reward models solve this by learning a generalizable notion of **task progress** from demonstrations: given video frames and a task description, they predict how close the robot is to completing the task (0→1). This learned "progress signal" can be used in multiple ways, two promising applications are: (1) **weighted imitation learning** (RA-BC), where high-progress frames receive more weight during policy training, and (2) **reinforcement learning**, where the reward model provides dense rewards for online or offline policy improvement.
+6
View File
@@ -8,6 +8,12 @@ X Square Robots WALL-OSS is now integrated into Hugging Faces LeRobot ecos
The WALL-OSS team is building the embodied foundation model to capture and compress the world's most valuable data: the continuous, high-fidelity stream of physical interaction. By creating a direct feedback loop between the model's decisions and the body's lived experience, the emergence of a truly generalizable intelligence is enabled—one that understands not just how the world works, but how to act effectively within it.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobot/walloss-lerobot-paper.png"
alt="An overview of WALL-OSS"
width="85%"
/>
Technically, WALL-OSS introduces a tightly coupled multimodal architecture (tightly-coupled MoE structure) that integrates both discrete and continuous action modeling strategies. Through a two-stage training pipeline (Inspiration → Integration), the model gradually unifies semantic reasoning and high-frequency action generation. Its core innovations include:
- **Embodied perceptionenhanced multimodal pretraining**: Large-scale training on unified visionlanguageaction data to strengthen spatial, causal, and manipulation understanding.
+1
View File
@@ -197,6 +197,7 @@ lerobot-setup-motors="lerobot.scripts.lerobot_setup_motors:main"
lerobot-teleoperate="lerobot.scripts.lerobot_teleoperate:main"
lerobot-eval="lerobot.scripts.lerobot_eval:main"
lerobot-train="lerobot.scripts.lerobot_train:main"
lerobot-train-tokenizer="lerobot.scripts.lerobot_train_tokenizer:main"
lerobot-dataset-viz="lerobot.scripts.lerobot_dataset_viz:main"
lerobot-info="lerobot.scripts.lerobot_info:main"
lerobot-find-joint-limits="lerobot.scripts.lerobot_find_joint_limits:main"
@@ -1,3 +1,16 @@
# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Train FAST tokenizer for action encoding.
This script:
@@ -6,6 +19,26 @@ This script:
3. Trains FAST tokenizer on specified action dimensions
4. Saves tokenizer to assets directory
5. Reports compression statistics
Example:
```shell
lerobot-train-tokenizer \
--repo_id=user/dataset_name \
--action_horizon=10 \
--max_episodes=100 \
--sample_fraction=0.1 \
--encoded_dims="0:6" \
--delta_dims="0,1,2,3,4,5" \
--use_delta_transform=true \
--state_key="observation.state" \
--normalization_mode="QUANTILES" \
--vocab_size=1024 \
--scale=10.0 \
--output_dir="./fast_tokenizer_dataset_name" \
--push_to_hub=true \
--hub_repo_id="user/fast_tokenizer_dataset_name" \
--hub_private=false
"""
import json