From fbcf118dcb106cf1a246940ba891b72b533cee1b Mon Sep 17 00:00:00 2001 From: Jade Choghari Date: Wed, 26 Nov 2025 15:32:30 +0100 Subject: [PATCH] add xvla docs --- docs/source/_toctree.yml | 2 + docs/source/xvla.mdx | 461 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 463 insertions(+) create mode 100644 docs/source/xvla.mdx diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml index 2f9715ce1..a05b5a705 100644 --- a/docs/source/_toctree.yml +++ b/docs/source/_toctree.yml @@ -37,6 +37,8 @@ title: π₀.₅ (Pi05) - local: groot title: NVIDIA GR00T N1.5 + - local: xvla + title: X-VLA title: "Policies" - sections: - local: async diff --git a/docs/source/xvla.mdx b/docs/source/xvla.mdx new file mode 100644 index 000000000..94f29f33a --- /dev/null +++ b/docs/source/xvla.mdx @@ -0,0 +1,461 @@ +# X-VLA: The First Soft-Prompted Robot Foundation Model for Any Robot, Any Task + +## Overview + +For years, robotics has aspired to build agents that can follow natural human instructions and operate dexterously across many environments and robot bodies. Recent breakthroughs in LLMs and VLMs suggest a path forward: extend these foundation-model architectures to embodied control by grounding them in actions. This has led to the rise of Vision-Language-Action (VLA) models, with the hope that a single generalist model could combine broad semantic understanding with robust manipulation skills. + +But training such models is difficult. Robot data is fragmented across platforms, sensors, embodiments, and collection protocols. Heterogeneity appears everywhere: different arm configurations, different action spaces, different camera setups, different visual domains, and different task distributions. These inconsistencies create major distribution shifts that make pretraining unstable and adaptation unreliable. + +Inspired by meta-learning and prompt learning, we ask: **"What if a VLA model could learn the structure of each robot and dataset the same way LLMs learn tasks, through prompts?"** + +**X-VLA** is a soft-prompted, flow-matching VLA framework that treats each hardware setup as a "task" and encodes it using a small set of learnable embeddings. These **Soft Prompts** capture embodiment and domain-specific variations, guiding the Transformer from the earliest stages of multimodal fusion. With this mechanism, X-VLA can reconcile diverse robot morphologies, data types, and sensor setups within a single unified architecture. + + + +Built from pure Transformer encoders, X-VLA scales naturally with model size and dataset diversity. Across 6 simulation benchmarks and 3 real robots, Soft Prompts consistently outperform existing methods in handling hardware and domain differences. X-VLA-0.9B, trained on 290K episodes spanning seven robotic platforms, learns an embodiment-agnostic generalist policy in Phase I, and adapts efficiently to new robots in Phase II simply by learning a new set of prompts, while keeping the backbone frozen. + + + +With only 1% of parameters tuned (9M), X-VLA-0.9B achieves near-π₀ performance on LIBERO and Simpler-WidowX, despite using **300× fewer trainable parameters**. It also demonstrates strong real-world dexterity with minimal demonstrations, including folding cloths in under two minutes. + + + +X-VLA shows that generalist robot intelligence does not require increasingly complex architectures, only the right way to absorb heterogeneity. Soft Prompts offer a simple, scalable mechanism for unifying diverse robotic data, paving the way toward adaptable, cross-embodiment robot foundation models. + +--- + +## Installation + +After installing LeRobot, install the X-VLA dependencies: + +```bash +pip install -e .[xvla] +``` + +After the new release, you'll be able to do: + +```bash +pip install lerobot[xvla] +``` + +--- + +## Quick Start + +### Basic Usage + +To use X-VLA in your LeRobot configuration, specify the policy type as: + +```bash +policy.type=xvla +``` + +### Evaluating Pre-trained Checkpoints + +Example evaluation with LIBERO: + +```bash +lerobot-eval \ + --policy.path="lerobot/xvla-libero" \ + --env.type=libero \ + --env.task=libero_spatial,libero_goal,libero_10 \ + --env.control_mode=absolute \ + --eval.batch_size=1 \ + --eval.n_episodes=1 \ + --env.episode_length=800 \ + --seed=142 +``` + +--- + +## Available Checkpoints + +### 🎯 Base Model + +**[lerobot/xvla-base](https://huggingface.co/lerobot/xvla-base)** + +A 0.9B parameter instantiation of X-VLA, trained with a carefully designed data processing and learning recipe. The training pipeline consists of two phases: + +- **Phase I: Pretraining** - Pretrained on 290K episodes from Droid, Robomind, and Agibot, spanning seven platforms across five types of robotic arms (single-arm to bi-manual setups). By leveraging soft prompts to absorb embodiment-specific variations, the model learns an embodiment-agnostic generalist policy. + +- **Phase II: Domain Adaptation** - Adapted to deployable policies for target domains. A new set of soft prompts is introduced and optimized to encode the hardware configuration of the novel domain, while the pretrained backbone remains frozen. + +### 🎮 Simulation Checkpoints + +**[lerobot/xvla-libero](https://huggingface.co/lerobot/xvla-libero)** + +Achieves 93% success rate on LIBERO benchmarks. Fine-tuned from the base model for simulation tasks. + +**[lerobot/xvla-widowx](https://huggingface.co/lerobot/xvla-widowx)** + +Fine-tuned on BridgeData for pick-and-place experiments on compact WidowX platforms. Demonstrates robust manipulation capabilities. + +### 🤖 Real-World Checkpoints + +**[lerobot/xvla-folding](https://huggingface.co/lerobot/xvla-folding)** + +A fine-tuned dexterous manipulation model trained on the high-quality Soft-FOLD cloth folding dataset. Achieves 100% success rate over 2 hours of continuous cloth folding. + +**[lerobot/xvla-agibot-world](https://huggingface.co/lerobot/xvla-agibot-world)** + +Optimized for AgileX robot dexterous manipulation tasks. + +**[lerobot/xvla-google-robot](https://huggingface.co/lerobot/xvla-google-robot)** + +Adapted for Google Robot platforms. + +--- + +## Training X-VLA + +### Recommended Training Configuration + +When fine-tuning X-VLA for a new embodiment or task, we recommend the following freezing strategy: + +```bash +lerobot-train \ + --dataset.repo_id=YOUR_DATASET \ + --output_dir=./outputs/xvla_training \ + --job_name=xvla_training \ + --policy.path="lerobot/xvla-base" \ + --policy.repo_id="HF_USER/xvla-your-robot" \ + --steps=3000 \ + --policy.device=cuda \ + --policy.freeze_vision_encoder=True \ + --policy.freeze_language_encoder=True \ + --policy.train_policy_transformer=True \ + --policy.train_soft_prompts=True \ + --policy.action_mode=YOUR_ACTION_MODE +``` + +### Training Parameters Explained + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `freeze_vision_encoder` | `True` | Freeze the VLM vision encoder weights | +| `freeze_language_encoder` | `True` | Freeze the VLM language encoder weights | +| `train_policy_transformer` | `True` | Allow policy transformer layers to train | +| `train_soft_prompts` | `True` | Allow soft prompts to train | + +**💡 Best Practice**: For Phase II adaptation to new embodiments, freeze the VLM encoders and only train the policy transformer and soft prompts. This provides excellent sample efficiency with minimal compute. + +### Example: Training on Bimanual Robot + +```bash +lerobot-train \ + --dataset.repo_id=pepijn223/bimanual-so100-handover-cube \ + --output_dir=./outputs/xvla_bimanual \ + --job_name=xvla_so101_training \ + --policy.path="lerobot/xvla-base" \ + --policy.repo_id="YOUR_USERNAME/xvla-biso101" \ + --steps=3000 \ + --policy.device=cuda \ + --policy.action_mode=so101_bimanual \ + --policy.freeze_vision_encoder=True \ + --policy.freeze_language_encoder=True \ + --policy.train_policy_transformer=True \ + --policy.train_soft_prompts=True +``` + +--- + +## Core Concepts + +### 1. Action Modes + +X-VLA uses an **Action Registry** system to handle different action spaces and embodiments. The `action_mode` parameter defines how actions are processed, what loss functions are used, and how predictions are post-processed. + +#### Available Action Modes + +| Action Mode | Action Dim | Description | Use Case | +|-------------|------------|-------------|----------| +| `ee6d` | 20 | End-effector with xyz, 6D rotation, gripper | Dual-arm setups with spatial control | +| `joint` | 14 | Joint-space with gripper | Direct joint control robots | +| `agibot_ee6d` | 20 | AGI-bot variant with MSE loss | AGI-bot platforms | +| `franka_joint7` | 7 | Franka Panda 7-joint control | Franka robots without gripper | +| `so101_bimanual` | 20 (model), 12 (real) | SO101 bimanual robot | Bimanual manipulation tasks | + +#### Why Action Modes Matter + +When you have a pretrained checkpoint like `lerobot/xvla-base` trained with `action_dim=20`, and you want to train on a dataset with a different action dimension (e.g., 14 for bimanual arms), you can't simply trim the action dimension. The action mode orchestrates: + +1. **Loss Computation**: Different loss functions for different action components (MSE for joints, BCE for grippers, etc.) +2. **Preprocessing**: Zeroing out gripper channels, padding dimensions +3. **Postprocessing**: Applying sigmoid to gripper logits, trimming padding + +#### Example: BimanualSO101 Action Space + +The `so101_bimanual` action mode handles the mismatch between model output (20D) and real robot control (12D): + +```python +# Model outputs 20 dimensions for compatibility +dim_action = 20 + +# Real robot only needs 12 dimensions +# [left_arm (6), right_arm (6)] = [joints (5) + gripper (1)] × 2 +REAL_DIM = 12 + +# Preprocessing: Pad 12D actions to 20D for training +# Postprocessing: Trim 20D predictions to 12D for deployment +``` + +See the [action_hub.py](/home/jade_choghari/robot/lerobot/src/lerobot/policies/xvla/action_hub.py) implementation for details. + +### 2. Domain IDs + +Domain IDs are learnable identifiers for different robot configurations and camera setups. They allow X-VLA to distinguish between: + +- Different robots (Robot 1 vs Robot 2) +- Different camera configurations (cam1 vs cam2) +- Different combinations (Robot1-cam1-cam2 vs Robot1-cam1 vs Robot2-cam1) + +#### Setting Domain IDs + +**During Training**: By default, domain_id is set to 0 for general training. + +**During Evaluation**: Specify the domain_id that matches your checkpoint's training configuration. + +```python +# Example: LIBERO checkpoint uses domain_id=3 +domain_id = 3 +``` + +The domain_id is automatically added to observations by the `XVLAAddDomainIdProcessorStep` in the preprocessing pipeline. + +### 3. Processor Steps + +X-VLA requires specific preprocessing and postprocessing steps for proper operation. + +#### Required Preprocessing Steps + +1. **XVLAImageToFloatProcessorStep**: Converts images from [0, 255] to [0, 1] range +2. **XVLAImageNetNormalizeProcessorStep**: Applies ImageNet normalization (required for VLM backbone) +3. **XVLAAddDomainIdProcessorStep**: Adds domain_id to observations + +#### Example Custom Processor + +For LIBERO environments, a custom processor handles the specific observation format: + +```python +from lerobot.policies.xvla.processor_xvla import LiberoProcessorStep + +processor = LiberoProcessorStep() +# Handles robot_state dictionary, converts rotation matrices to 6D representation +# Applies 180° image rotation for camera convention +``` + +### 4. Configuration Parameters + +Key configuration parameters for X-VLA: + +```python +# Observation and action +n_obs_steps: int = 1 # Number of observation timesteps +chunk_size: int = 32 # Action sequence length +n_action_steps: int = 32 # Number of action steps to execute + +# Model architecture +hidden_size: int = 1024 # Transformer hidden dimension +depth: int = 24 # Number of transformer layers +num_heads: int = 16 # Number of attention heads +num_domains: int = 30 # Maximum number of domain IDs +len_soft_prompts: int = 32 # Length of soft prompt embeddings + +# Action space +action_mode: str = "ee6d" # Action space type +use_proprio: bool = True # Use proprioceptive state +max_state_dim: int = 32 # Maximum state dimension + +# Vision +num_image_views: int | None # Number of camera views +resize_imgs_with_padding: tuple[int, int] | None # Target image size with padding + +# Training +num_denoising_steps: int = 10 # Flow matching denoising steps +``` + +--- + +## Creating Custom Action Modes + +If your robot has a unique action space, you can create a custom action mode: + +### Step 1: Define Your Action Space + +```python +from lerobot.policies.xvla.action_hub import BaseActionSpace, register_action +import torch.nn as nn + +@register_action("my_custom_robot") +class MyCustomActionSpace(BaseActionSpace): + """Custom action space for my robot.""" + + dim_action = 15 # Your robot's action dimension + gripper_idx = (7, 14) # Gripper channel indices + + def __init__(self): + super().__init__() + self.mse = nn.MSELoss() + self.bce = nn.BCEWithLogitsLoss() + + def compute_loss(self, pred, target): + """Define your loss computation.""" + # Example: MSE for joints, BCE for grippers + joints_loss = self.mse(pred[:, :, :7], target[:, :, :7]) + gripper_loss = self.bce(pred[:, :, self.gripper_idx], + target[:, :, self.gripper_idx]) + + return { + "joints_loss": joints_loss, + "gripper_loss": gripper_loss, + } + + def preprocess(self, proprio, action, mode="train"): + """Preprocess actions before training.""" + # Example: Zero out grippers in proprioception + proprio_m = proprio.clone() + action_m = action.clone() if action is not None else None + proprio_m[..., self.gripper_idx] = 0.0 + if action_m is not None: + action_m[..., self.gripper_idx] = 0.0 + return proprio_m, action_m + + def postprocess(self, action): + """Post-process predictions for deployment.""" + # Example: Apply sigmoid to gripper logits + action[..., self.gripper_idx] = torch.sigmoid(action[..., self.gripper_idx]) + return action +``` + +### Step 2: Use Your Custom Action Mode + +```bash +lerobot-train \ + --policy.action_mode=my_custom_robot \ + --dataset.repo_id=YOUR_DATASET \ + --policy.path="lerobot/xvla-base" \ + ... +``` + +--- + +## Advanced Topics + +### Multi-Camera Support + +X-VLA supports multiple camera views through the `num_image_views` parameter: + +```python +# Configure for 3 camera views +policy.num_image_views=3 + +# Add empty cameras if you have fewer physical cameras +policy.empty_cameras=1 # Adds 1 zero-padded camera view +``` + +### Custom Preprocessing Pipeline + +Create a custom preprocessing pipeline for your environment: + +```python +from lerobot.processor import PolicyProcessorPipeline +from lerobot.policies.xvla.processor_xvla import ( + XVLAImageToFloatProcessorStep, + XVLAImageNetNormalizeProcessorStep, + XVLAAddDomainIdProcessorStep, +) + +# Build custom pipeline +preprocessor = PolicyProcessorPipeline( + steps=[ + YourCustomProcessorStep(), # Your custom processing + XVLAImageToFloatProcessorStep(), # Required: convert to float + XVLAImageNetNormalizeProcessorStep(), # Required: ImageNet norm + XVLAAddDomainIdProcessorStep(domain_id=5), # Your domain ID + ] +) +``` + +### Handling Different Action Dimensions + +When your dataset has fewer action dimensions than the pretrained model: + +**Option 1**: Use padding (automatic in most action modes) +```python +# Model expects 20D, dataset has 12D +# Action mode handles padding internally +action_mode = "so101_bimanual" # Pads 12 → 20 +``` + +**Option 2**: Create a custom action mode that maps dimensions explicitly +```python +@register_action("my_mapped_action") +class MappedActionSpace(BaseActionSpace): + dim_action = 20 + REAL_DIM = 12 + + def _pad_to_model_dim(self, x): + # Custom padding logic + ... +``` + +--- + +## Troubleshooting + +### Common Issues + +**Issue**: "Action dimension mismatch" +- **Solution**: Check that your `action_mode` matches your robot's action space. Create a custom action mode if needed. + +**Issue**: "Image values outside [0, 1] range" +- **Solution**: Ensure images are preprocessed with `XVLAImageToFloatProcessorStep` before normalization. + +**Issue**: "Domain ID not found" +- **Solution**: Make sure `XVLAAddDomainIdProcessorStep` is in your preprocessing pipeline with the correct domain_id. + +**Issue**: "Low success rate on new embodiment" +- **Solution**: + 1. Verify your action_mode is correct + 2. Check that soft prompts are being trained (`train_soft_prompts=True`) + 3. Ensure proper preprocessing (ImageNet normalization, domain_id) + 4. Consider increasing training steps + +**Issue**: "Out of memory during training" +- **Solution**: + 1. Reduce `chunk_size` (e.g., from 32 to 16) + 2. Enable gradient checkpointing + 3. Reduce batch size + 4. Freeze more components + +--- + +## Citation + +If you use X-VLA in your research, please cite: + +```bibtex +@article{zheng2025x, + title = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model}, + author = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui + and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others}, + journal = {arXiv preprint arXiv:2510.10274}, + year = {2025} +} +``` + +--- + +## Additional Resources + +- [X-VLA Paper](https://arxiv.org) (coming soon) +- [LeRobot Documentation](https://github.com/huggingface/lerobot) +- [Action Registry Implementation](/home/jade_choghari/robot/lerobot/src/lerobot/policies/xvla/action_hub.py) +- [Processor Implementation](/home/jade_choghari/robot/lerobot/src/lerobot/policies/xvla/processor_xvla.py) +- [Model Configuration](/home/jade_choghari/robot/lerobot/src/lerobot/policies/xvla/configuration_xvla.py) + +--- + +## Contributing + +We welcome contributions! If you've implemented a new action mode or processor for your robot, please consider submitting a PR to help the community.