merge rac

2026-07-09 19:11:44 +00:00 · 2025-12-30 10:37:48 +01:00
parent 202a493c14 27eeff7535
commit 9833b84bf8
71 changed files with 7895 additions and 782 deletions
@@ -41,7 +41,13 @@
    title: NVIDIA GR00T N1.5
  - local: xvla
    title: X-VLA
+  - local: walloss
+    title: WALL-OSS
  title: "Policies"
+- sections:
+  - local: sarm
+    title: SARM
+  title: "Reward Models"
 - sections:
  - local: async
    title: Use Async Inference
@@ -0,0 +1,35 @@
+# WALL-OSS
+
+This repository contains the Hugging Face port of **WALL-OSS**, a Vision-Language-Action model for cross-embodiment robotic control based on Qwen2.5-VL with flow matching/FAST action prediction.
+
+---
+
+## Model Overview
+
+| Feature            | Description                                           |
+| ------------------ | ----------------------------------------------------- |
+| Base Model         | Qwen2.5-VL (Vision-Language Model)                    |
+| Action Prediction  | Flow Matching (diffusion) or FAST (discrete tokens)   |
+| Architecture       | Mixture of Experts (MoE) with action-specific routing |
+| Multi-Modal Inputs | Vision (images/videos), Language, Proprioception      |
+
+---
+
+## Citation
+
+If you use this work, please cite:
+
+```bibtex
+@article{zhai2025igniting,
+    title   = {Igniting VLMs Toward the Embodied Space},
+    author  = {Zhai, Andy and Liu, Brae and Fang, Bruno and Cai, Chalse and Ma, Ellie and Yin, Ethan and Wang, Hao and Zhou, Hugo and Wang, James and Shi, Lights and Liang, Lucy and Wang, Make and Wang, Qian and Gan, Roy and Yu, Ryan and Li, Shalfun and Liu, Starrick and Chen, Sylas and Chen, Vincent and Xu, Zach},
+    journal = {arXiv preprint arXiv:2509.11766},
+    year    = {2025}
+}
+```
+
+---
+
+## License
+
+This port follows the **Apache 2.0 License**.
@@ -0,0 +1,273 @@
+# RaC: Recovery and Correction Training
+
+RaC (Recovery and Correction) is a human-in-the-loop data collection and training paradigm that improves robot policy performance on long-horizon tasks by explicitly teaching recovery and correction behaviors.
+
+**Key References:**
+- [RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction](https://arxiv.org/abs/2509.07953) (Hu et al., 2025)
+- [HG-DAgger: Interactive Imitation Learning with Human Experts](https://arxiv.org/abs/1810.02890) (Kelly et al., 2019)
+- [π∗0.6: a VLA That Learns From Experience](https://pi.website/blog/pistar06) (Physical Intelligence, 2025)
+- [SARM: Stage-Aware Reward Modeling](https://arxiv.org/abs/2509.25358) (Chen et al., 2025)
+
+---
+
+## Why RaC? The Problem with Standard Data Collection
+
+### Standard Behavioral Cloning Data Collection Limitations
+
+Standard behavior cloning trains policies only on successful demonstrations, which makes it sensitive to distribution shift and compounding errors: during deployment, small errors can cascade and push the robot into states never seen during training.
+This is where RaC and methods like DAgger and HG-DAgger come in.
+
+### Prior Human-in-the-Loop Methods
+
+**DAgger** (Dataset Aggregation) addresses distribution shift by:
+- Running the novice policy to collect states
+- Querying the expert for the correct actions at those states
+- Aggregating the new labels into the training set
+
+**HG-DAgger** (Human-Gated DAgger) improves on DAgger by:
+- Giving the human full control authority during interventions
+- Letting the human take over when the situation is unsafe, provide a correction, and return control
+- Producing better action labels, because the human has uninterrupted control
+
+### RaC
+
+RaC explicitly collects **recovery + correction** data:
+
+```
+BC/DAgger:   policy → mistake → human corrects → continue
+RaC:         policy → mistake → human RECOVERS (teleop back) → CORRECTS → END
+```
+
+The critical insight is **Rule 1 (Recover then Correct)**:
+- Every intervention starts with human teleoperating back to an in-distribution state
+- Then human provides correction to complete the current subtask
+- Both segments are recorded as training data
+- This teaches the policy: "when things go wrong, go back and retry"
+
+**Rule 2 (Terminate after Intervention)**:
+- Episode ends after correction completes
+- Avoids mixed policy/human data on later subtasks
+- Keeps data distribution clean
+
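The two rules can be made concrete with a small sketch. The helper below is hypothetical (it is not part of LeRobot or of `rac_data_collection.py`) and assumes each recorded frame carries a tag saying who was in control, `"policy"` or `"human"`. Under Rule 2, a valid episode is a run of policy frames followed by a run of human frames (recovery, then correction), with no return to the policy afterwards.

```python
from typing import List, Tuple


def split_rac_episode(controllers: List[str]) -> Tuple[range, range]:
    """Split one episode into (policy frame indices, human frame indices).

    `controllers[i]` is "policy" or "human" for frame i. This tagging scheme
    is an assumption made for illustration only.
    """
    # Index of the first human-controlled frame, or the episode length if none.
    first_human = next(
        (i for i, c in enumerate(controllers) if c == "human"), len(controllers)
    )
    # Rule 2: once the human intervenes, the episode ends under human control.
    if any(c != "human" for c in controllers[first_human:]):
        raise ValueError("policy frames after an intervention violate Rule 2")
    return range(0, first_human), range(first_human, len(controllers))


policy_frames, human_frames = split_rac_episode(["policy"] * 3 + ["human"] * 2)
print(list(policy_frames), list(human_frames))
```

Both index ranges would be kept as training data: the human range covers the recovery segment as well as the correction.
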
+---
+
+## Comparison Table
+
+| Method | Data Type | Recovery Behavior | Correction Behavior |
+|--------|-----------|-------------------|---------------------|
+| BC | Success only | ✗ | ✗ |
+| DAgger | Success + corrections | ✗ | ✓ |
+| HG-DAgger | Success + corrections | Sometimes | ✓ |
+| RaC | Success + recovery + correction | ✓ Explicit | ✓ |
+
+---
+
+## The RaC Pipeline
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│                         RaC Training Pipeline                           │
+├─────────────────────────────────────────────────────────────────────────┤
+│                                                                         │
+│  1. PRE-TRAINING (Standard BC)                                          │
+│     └─> Train initial policy on clean demonstrations                    │
+│                                                                         │
+│  2. RAC DATA COLLECTION (Human-in-the-loop)                             │
+│     ├─> Policy runs autonomously                                        │
+│     ├─> Human monitors and intervenes when failure imminent             │
+│     │   ├─> RECOVERY: Human teleoperates robot back to good state       │
+│     │   └─> CORRECTION: Human completes the current subtask             │
+│     └─> Episode terminates after correction (Rule 2)                    │
+│                                                                         │
+│  3. REWARD LABELING (Optional: SARM)                                    │
+│     └─> Compute progress rewards for advantage-weighted training        │
+│                                                                         │
+│  4. FINE-TUNING                                                         │
+│     └─> Train on combined demos + RaC data (optionally with RA-BC)      │
+│                                                                         │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Step-by-Step Guide
+
+### Step 1: Pre-train a Base Policy
+
+First, train a policy on your demonstration dataset:
+
+```bash
+python src/lerobot/scripts/lerobot_train.py \
+    --dataset.repo_id=your-username/demo-dataset \
+    --policy.type=pi0 \
+    --output_dir=outputs/pretrain \
+    --batch_size=32 \
+    --steps=50000
+```
+
+### Step 2: Collect RaC Data
+
+Run the RaC data collection script with your pre-trained policy:
+
+```bash
+python examples/rac/rac_data_collection.py \
+    --robot.type=so100_follower \
+    --robot.port=/dev/tty.usbmodem58760431541 \
+    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}}" \
+    --teleop.type=so100_leader \
+    --teleop.port=/dev/tty.usbmodem58760431551 \
+    --policy.path=outputs/pretrain/checkpoints/last/pretrained_model \
+    --dataset.repo_id=your-username/rac-dataset \
+    --dataset.single_task="Pick up the cube and place it in the bowl" \
+    --dataset.num_episodes=50
+```
+
+**Keyboard Controls:**
+
+| Key | Action |
+|-----|--------|
+| **SPACE** | Start intervention (take control) |
+| **→** | End episode (save) |
+| **ESC** | Stop recording session |
+
+**The RaC Protocol:**
+
+1. Watch the policy run autonomously
+2. When you see imminent failure, press **SPACE** to intervene
+3. **RECOVERY**: Teleoperate the robot back to a good in-distribution state
+4. **CORRECTION**: Use teleoperator to complete the subtask
+5. Press **→** to save and end episode
+
+The recovery segment (teleoperating back to a good state) is recorded as training data, which teaches the policy how to recover from errors.
+
+### Step 3: (Optional) Compute SARM Rewards
+
+For advantage-weighted training (RA-BC / Pi0.6-style), compute SARM progress values:
+
+```bash
+python src/lerobot/policies/sarm/compute_rabc_weights.py \
+    --dataset-repo-id your-username/rac-dataset \
+    --reward-model-path your-username/sarm-model \
+    --head-mode sparse \
+    --push-to-hub
+```
+
+### Step 4: Fine-tune Policy
+
+Fine-tune on the RaC data:
+
+```bash
+# Without RA-BC (standard fine-tuning)
+python src/lerobot/scripts/lerobot_train.py \
+    --dataset.repo_id=your-username/rac-dataset \
+    --policy.type=pi0 \
+    --policy.pretrained_path=outputs/pretrain/checkpoints/last/pretrained_model \
+    --output_dir=outputs/rac_finetune \
+    --steps=20000
+
+# With RA-BC (advantage-weighted, Pi0.6-style)
+python src/lerobot/scripts/lerobot_train.py \
+    --dataset.repo_id=your-username/rac-dataset \
+    --policy.type=pi0 \
+    --policy.pretrained_path=outputs/pretrain/checkpoints/last/pretrained_model \
+    --output_dir=outputs/rac_finetune_rabc \
+    --use_rabc=true \
+    --rabc_kappa=0.01 \
+    --steps=20000
+```
+
+---
+
+## Connection to Pi0.6 / RECAP
+
+Pi0.6's RECAP method shares similar principles:
+- Collect autonomous rollouts + expert interventions
+- Use value function to compute **advantages**: A(s,a) = V(s') - V(s)
+- **Advantage conditioning**: Weight training based on expected improvement
+
+In LeRobot, we can use **SARM** as the value function:
+- SARM progress φ(s) ∈ [0,1] measures task completion
+- Progress delta = φ(s') - φ(s) approximates advantage
+- RA-BC uses these to weight training samples (higher weight for good corrections)
+
+---
+
+## Tips for Effective RaC Collection
+
+### When to Intervene
+
+Intervene when you see:
+- Robot about to make an irreversible mistake
+- Robot hesitating or showing uncertain behavior
+- Robot deviating from expected trajectory
+
+### Recovery: Teleoperating Back to Good State
+
+During recovery, teleoperate the robot back to a state where:
+- The robot is in a familiar, in-distribution configuration
+- The current subtask can still be completed
+- The recovery trajectory itself is informative training data
+
+### Quality of Corrections
+
+During correction:
+- Provide **confident, clean** trajectories
+- Complete the current subtask fully
+- Don't overcorrect or add unnecessary movements
+
+---
+
+## Iterative Improvement
+
+RaC can be applied iteratively:
+
+```
+┌─────────────────────────────────────────────────────────────────────────┐
+│  Policy v0 (demos)                                                      │
+│       ↓                                                                 │
+│  RaC Collection (target current failure modes) → Policy v1              │
+│       ↓                                                                 │
+│  RaC Collection (target new failure modes) → Policy v2                  │
+│       ↓                                                                 │
+│  ... (repeat until satisfactory performance)                            │
+└─────────────────────────────────────────────────────────────────────────┘
+```
+
+Each iteration:
+1. Deploy current policy
+2. Collect RaC interventions on failure cases
+3. Fine-tune on accumulated data
+
+---
+
+## References
+
+```bibtex
+@article{hu2025rac,
+  title={RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction},
+  author={Hu, Zheyuan and Wu, Robyn and Enock, Naveen and Li, Jasmine and Kadakia, Riya and Erickson, Zackory and Kumar, Aviral},
+  journal={arXiv preprint arXiv:2509.07953},
+  year={2025}
+}
+
+@article{kelly2019hgdagger,
+  title={HG-DAgger: Interactive Imitation Learning with Human Experts},
+  author={Kelly, Michael and Sidrane, Chelsea and Driggs-Campbell, Katherine and Kochenderfer, Mykel J},
+  journal={arXiv preprint arXiv:1810.02890},
+  year={2019}
+}
+
+@article{pi2025recap,
+  title={π∗0.6: a VLA That Learns From Experience},
+  author={Physical Intelligence},
+  year={2025}
+}
+
+@article{chen2025sarm,
+  title={SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation},
+  author={Chen, Qianzhong and Yu, Justin and Schwager, Mac and Abbeel, Pieter and Shentu, Yide and Wu, Philipp},
+  journal={arXiv preprint arXiv:2509.25358},
+  year={2025}
+}
+```
+
@@ -0,0 +1,586 @@
+# SARM: Stage-Aware Reward Modeling
+
+SARM (Stage-Aware Reward Modeling) is a video-based reward modeling framework for long-horizon robot manipulation tasks. This guide covers how to train SARM reward models and optionally use them with Reward-Aligned Behavior Cloning (RA-BC).
+
+**Paper**: [SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation](https://arxiv.org/abs/2509.25358)
+
+## Why Reward Models?
+
+Standard behavior cloning treats all demonstration frames equally, but real-world robot datasets are messy. They contain hesitations, corrections, and variable-quality trajectories. Reward models solve this by learning a generalizable notion of **task progress** from demonstrations: given video frames and a task description, they predict how close the robot is to completing the task (0→1). This learned "progress signal" can be used in multiple ways. Two promising applications are: (1) **weighted imitation learning** (RA-BC), where high-progress frames receive more weight during policy training, and (2) **reinforcement learning**, where the reward model provides dense rewards for online or offline policy improvement.
+
+## Overview
+
+SARM has the following features:
+
+1. **Stage-aware architecture**: Jointly predicts the high-level task stage and fine-grained progress within each stage
+2. **Subtask annotations**: Uses natural language subtask annotations to derive consistent progress labels
+3. **Temporal proportions**: Computes dataset-level priors (α̅\_k) for each subtask to normalize progress across variable-length demonstrations
+
+SARM trains on a compact **stage+tau** target for each frame:
+
+- **stage**: integer stage index `k ∈ {0, ..., K-1}`
+- **τ (tau)**: within-stage progress `τ ∈ [0, 1]`
+- **target encoding**: `y = k + τ` (this is what the dataset processor produces)
+
+At inference time (and in downstream RA-BC), SARM converts the raw `k + τ` value into a **normalized progress** in `[0, 1]` using dataset-level **temporal proportions** `α̅_k` (stored in `meta/temporal_proportions_*.json`).
+
+This matches **Formula (2)** from the paper:
+
+```
+progress_t = P_{k-1} + α̅_k × τ_t
+```
+
+Where:
+
+- `τ_t = (t - s_k) / (e_k - s_k)` is within-subtask normalized time
+- `P_{k-1}` is cumulative prior (sum of previous subtask proportions)
+- `α̅_k` is the temporal proportion for subtask k
+
+This ensures identical task states map to consistent progress values, even across demonstrations of different lengths.
+
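As a worked illustration of the stage+tau encoding and of Formula (2), here is a minimal sketch in plain Python. The function names and the example boundaries and proportions are made up for this example; the actual logic lives in the SARM processor and may differ in details such as how boundary frames are assigned.

```python
from typing import List, Tuple


def encode_target(t: int, boundaries: List[Tuple[int, int]]) -> float:
    """Return y = k + tau for frame t, given (s_k, e_k) frames per subtask."""
    for k, (start, end) in enumerate(boundaries):
        if start <= t <= end:
            tau = (t - start) / (end - start) if end > start else 1.0
            return k + tau
    raise ValueError(f"frame {t} is outside all subtask boundaries")


def normalized_progress(y: float, proportions: List[float]) -> float:
    """Map y = k + tau to [0, 1] via progress = P_{k-1} + alpha_k * tau."""
    k = min(int(y), len(proportions) - 1)  # y == K means the last stage is done
    tau = min(max(y - k, 0.0), 1.0)
    cumulative_prior = sum(proportions[:k])  # P_{k-1}
    return cumulative_prior + proportions[k] * tau


boundaries = [(0, 10), (10, 30)]  # two subtasks (illustrative values)
proportions = [0.25, 0.75]        # temporal proportions, summing to 1
y = encode_target(20, boundaries)  # stage 1, halfway through
print(y, normalized_progress(y, proportions))  # prints: 1.5 0.625
```

Because the proportions are dataset-level, the same `k + τ` value maps to the same progress regardless of how long a particular demonstration took.
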
+## Inputs and Targets (What the new code expects)
+
+SARM is trained through its processor (`src/lerobot/policies/sarm/processor_sarm.py`), which:
+
+- **Encodes** images and task text with CLIP (ViT-B/32) into `video_features` and `text_features`
+- **Pads/truncates** robot state into `state_features` (up to `max_state_dim`)
+- **Builds targets** as `sparse_targets` (and `dense_targets` in `dense_only`/`dual`) using the stage+tau encoding `y = k + τ`
+- **Masks rewind frames** using a per-sample `lengths` tensor (rewind is a training-time augmentation)
+
+At minimum, each training sample needs:
+
+- `task` (string): task description
+- `policy.image_key` images and `policy.state_key` states from the dataset
+
+---
+
+## Annotation Modes
+
+You can choose from **3 annotation modes** that determine how progress labels are computed:
+
+| Mode           | Annotations Required | Heads                        | Use Case                                                     |
+| -------------- | -------------------- | ---------------------------- | ------------------------------------------------------------ |
+| `single_stage` | None                 | Sparse only                  | Simple tasks, quick experiments, no VLM needed               |
+| `dense_only`   | Dense (VLM)          | Dual (sparse auto-generated) | Detailed subtask tracking without defining high-level stages |
+| `dual`         | Sparse + Dense (VLM) | Dual                         | Full SARM paper setup with both granularities                |
+
+### Mode Details
+
+<hfoptions id="mode_explanation">
+<hfoption id="single_stage">
+
+**No annotations required.** The entire episode is treated as a single stage called `"task"`, and progress is linear from 0 to 1 over the episode duration.
+
+- **Sparse head**: 1 stage ("task"), linear progress
+- **Dense head**: Not used
+- **Best for**: Simple tasks, quick experiments, or when VLM annotation is not available
+
+## Set Up Your Environment
+
+1. Install LeRobot by following our [Installation Guide](./installation).
+2. Install SARM dependencies by running:
+
+```bash
+pip install -e ".[sarm]"
+```
+
+Workflow:
+
+```
+1. Train SARM → 2. Visualize predictions → 3. (Optional) Train policy with RA-BC
+```
+
+</hfoption>
+<hfoption id="dense_only">
+
+**Only dense (fine-grained) annotations from a VLM.** The sparse head automatically uses a single `"task"` stage covering the full episode, while the dense head learns detailed subtask progression.
+
+- **Sparse head**: 1 stage ("task"), linear progress (auto-generated)
+- **Dense head**: Multiple fine-grained stages from VLM annotations
+- **Best for**: When you want detailed subtask tracking but don't need to define high-level stages
+
+Workflow:
+
+```
+1. Annotate (dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
+```
+
+</hfoption>
+<hfoption id="dual">
+
+**Both sparse and dense annotations from VLM.** Full dual-head mode as described in the SARM paper, with both high-level (sparse) and fine-grained (dense) stage predictions.
+
+- **Sparse head**: High-level stages from VLM annotations
+- **Dense head**: Fine-grained stages from VLM annotations
+- **Best for**: Complex multi-stage tasks where both granularities are useful
+
+Workflow:
+
+```
+1. Annotate (sparse+dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
+```
+
+</hfoption>
+</hfoptions>
+
+---
+
+## Step 1: Subtask Annotation
+
+<hfoptions id="annotation_mode">
+<hfoption id="single_stage">
+
+**No annotation required!** Skip this step entirely. The model will use the episode's task description and compute linear progress automatically.
+
+</hfoption>
+<hfoption id="dense_only">
+
+Generate **dense (fine-grained) annotations only** using a VLM. The sparse stage will be auto-generated.
+
+```bash
+python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
+  --repo-id your-username/your-dataset \
+  --dense-only \
+  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
+  --video-key observation.images.base \
+  --num-workers 4 \
+  --push-to-hub
+```
+
+**What gets saved:**
+
+- `meta/temporal_proportions_sparse.json` - Auto-generated sparse proportions (`{"task": 1.0}`)
+- `meta/temporal_proportions_dense.json` - Dense temporal proportions
+- Per-episode columns in `episodes/*.parquet`:
+  - `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
+  - (also time-based columns: `dense_subtask_start_times`, `dense_subtask_end_times`)
+
+</hfoption>
+<hfoption id="dual">
+
+Generate **both sparse (high-level) and dense (fine-grained) annotations** using a VLM.
+
+```bash
+python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
+  --repo-id your-username/your-dataset \
+  --sparse-subtasks "Bring arms up from starting position,Fold the towel (3 folds in total)" \
+  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
+  --video-key observation.images.base \
+  --num-workers 4 \
+  --push-to-hub
+```
+
+**What gets saved:**
+
+- `meta/temporal_proportions_sparse.json` - Sparse temporal proportions
+- `meta/temporal_proportions_dense.json` - Dense temporal proportions
+- Per-episode columns in `episodes/*.parquet`:
+  - `sparse_subtask_names`, `sparse_subtask_start_frames`, `sparse_subtask_end_frames`
+  - `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
+  - (also time-based columns: `*_subtask_start_times`, `*_subtask_end_times`)
+
+</hfoption>
+</hfoptions>
+
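To give a feel for what the `temporal_proportions_*.json` files contain, the sketch below computes per-subtask proportions from per-episode start and end frames. It assumes each proportion is the subtask's share of its episode, averaged over episodes, and that end frames are exclusive. The dictionary keys are simplified stand-ins for the `*_subtask_names`, `*_subtask_start_frames`, and `*_subtask_end_frames` columns, and the annotation script's exact computation may differ.

```python
from collections import defaultdict
from typing import Dict, List


def temporal_proportions(episodes: List[dict]) -> Dict[str, float]:
    """Average, over episodes, the fraction of each episode spent in each subtask."""
    sums: Dict[str, float] = defaultdict(float)
    for ep in episodes:
        lengths = [e - s for s, e in zip(ep["start_frames"], ep["end_frames"])]
        total = sum(lengths)
        for name, length in zip(ep["names"], lengths):
            sums[name] += length / total
    props = {name: s / len(episodes) for name, s in sums.items()}
    norm = sum(props.values())  # renormalize so the proportions sum to 1
    return {name: p / norm for name, p in props.items()}


episodes = [
    {"names": ["reach", "fold"], "start_frames": [0, 20], "end_frames": [20, 100]},
    {"names": ["reach", "fold"], "start_frames": [0, 40], "end_frames": [40, 100]},
]
props = temporal_proportions(episodes)
print(props)  # roughly {"reach": 0.3, "fold": 0.7}
```
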
+### Annotation Arguments
+
+| Argument               | Description                                                                     |
+| ---------------------- | ------------------------------------------------------------------------------- |
+| `--repo-id`            | HuggingFace dataset repository ID                                               |
+| `--sparse-subtasks`    | Comma-separated list of high-level subtask names                                |
+| `--dense-subtasks`     | Comma-separated list of fine-grained subtask names                              |
+| `--dense-only`         | Generate only dense annotations (auto-creates sparse "task" stage)              |
+| `--video-key`          | Camera/video key to use (e.g., `observation.images.top`)                        |
+| `--num-workers`        | Number of parallel GPU workers (default: 1)                                     |
+| `--episodes`           | Specific episode indices to annotate (default: all)                             |
+| `--skip-existing`      | Skip episodes that already have annotations                                     |
+| `--model`              | VLM model (default: `Qwen/Qwen3-VL-30B-A3B-Instruct`)                           |
+| `--num-visualizations` | Number of episodes to visualize after annotation (default: 5, set to 0 to skip) |
+
+> **Note**: After annotation completes, 5 episodes are automatically visualized by default. Use `--num-visualizations 0` to skip this step.
+
+---
+
+## Step 2: Verify Annotations
+
+<hfoptions id="verify_mode">
+<hfoption id="single_stage">
+
+**No verification needed!** Skip this step.
+
+</hfoption>
+<hfoption id="dense_only">
+
+Visualize annotations using the `--visualize-only` flag:
+
+```bash
+python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
+  --repo-id your-username/your-dataset \
+  --visualize-only \
+  --visualize-type dense \
+  --num-visualizations 5 \
+  --video-key observation.images.base \
+  --output-dir ./subtask_viz
+```
+
+</hfoption>
+<hfoption id="dual">
+
+Visualize annotations using the `--visualize-only` flag:
+
+```bash
+python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
+  --repo-id your-username/your-dataset \
+  --visualize-only \
+  --visualize-type both \
+  --num-visualizations 5 \
+  --video-key observation.images.base \
+  --output-dir ./subtask_viz
+```
+
+</hfoption>
+</hfoptions>
+
+This generates visualizations showing video frames with subtask boundaries overlaid and a timeline of subtasks.
+
+### Visualization Arguments
+
+| Argument               | Description                                                    |
+| ---------------------- | -------------------------------------------------------------- |
+| `--visualize-only`     | Only visualize existing annotations (no generation)            |
+| `--num-visualizations` | Number of episodes to visualize (default: 5)                   |
+| `--visualize-type`     | Type of annotations to visualize: `sparse`, `dense`, or `both` |
+
+**Tip**: If annotations are inaccurate, adjust your subtask descriptions to be more specific and re-run.
+
+---
+
+## Step 3: Train SARM
+
+<hfoptions id="train_mode">
+<hfoption id="single_stage">
+
+Train with **no annotations** - uses linear progress from 0 to 1:
+
+```bash
+python src/lerobot/scripts/lerobot_train.py \
+  --dataset.repo_id=your-username/your-dataset \
+  --policy.type=sarm \
+  --policy.annotation_mode=single_stage \
+  --policy.image_key=observation.images.base \
+  --output_dir=outputs/train/sarm_single \
+  --batch_size=32 \
+  --steps=5000 \
+  --wandb.enable=true \
+  --wandb.project=sarm \
+  --policy.repo_id=your-username/your-model-name
+```
+
+</hfoption>
+<hfoption id="dense_only">
+
+Train with **dense annotations only** (sparse auto-generated):
+
+```bash
+python src/lerobot/scripts/lerobot_train.py \
+  --dataset.repo_id=your-username/your-dataset \
+  --policy.type=sarm \
+  --policy.annotation_mode=dense_only \
+  --policy.image_key=observation.images.base \
+  --output_dir=outputs/train/sarm_dense \
+  --batch_size=32 \
+  --steps=5000 \
+  --wandb.enable=true \
+  --wandb.project=sarm \
+  --policy.repo_id=your-username/your-model-name
+```
+
+</hfoption>
+<hfoption id="dual">
+
+Train with **both sparse and dense annotations**:
+
+```bash
+python src/lerobot/scripts/lerobot_train.py \
+  --dataset.repo_id=your-username/your-dataset \
+  --policy.type=sarm \
+  --policy.annotation_mode=dual \
+  --policy.image_key=observation.images.base \
+  --output_dir=outputs/train/sarm_dual \
+  --batch_size=32 \
+  --steps=5000 \
+  --wandb.enable=true \
+  --wandb.project=sarm \
+  --policy.repo_id=your-username/your-model-name
+```
+
+</hfoption>
+</hfoptions>
+
+### Multi-GPU Training
+
+To train on multiple GPUs, launch the training script with `accelerate launch --multi_gpu --num_processes=4` in place of `python`.
+
+### Training Arguments
+
+| Argument                   | Description                                                       | Default                  |
+| -------------------------- | ----------------------------------------------------------------- | ------------------------ |
+| `--policy.annotation_mode` | `single_stage`, `dense_only`, or `dual`                           | `single_stage`           |
+| `--policy.image_key`       | Camera key for images                                             | `observation.images.top` |
+| `--policy.state_key`       | Key for joint states                                              | `observation.state`      |
+| `--policy.n_obs_steps`     | Observation history steps (total obs frames = `n_obs_steps + 1`)  | `8`                      |
+| `--policy.frame_gap`       | Gap (in frames) between sampled observations (at 30 fps: 30 ≈ 1s) | `30`                     |
+
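To see how `n_obs_steps` and `frame_gap` interact, the sketch below lists which frames would feed one sample. It assumes the history looks backward from the current frame and clamps at the start of the episode. That direction and the clamping are assumptions for illustration; the real sampling is defined by the SARM config and processor.

```python
from typing import List


def observation_indices(t: int, n_obs_steps: int = 8, frame_gap: int = 30) -> List[int]:
    """Frame indices for one sample: n_obs_steps + 1 frames, frame_gap apart."""
    # Walk backward from frame t; indices before the episode start clamp to 0.
    return [max(t - i * frame_gap, 0) for i in range(n_obs_steps, -1, -1)]


print(observation_indices(240))  # [0, 30, 60, 90, 120, 150, 180, 210, 240]
print(observation_indices(60))   # [0, 0, 0, 0, 0, 0, 0, 30, 60]
```

With the defaults, one sample spans `8 × 30 = 240` frames, or about 8 seconds of video at 30 fps.
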
+---
+
+## Step 4: Visualize Predictions
+
+Use `compute_rabc_weights.py` with `--visualize-only` to visualize model predictions (and, if available, annotation-derived targets) without writing a parquet file.
+
+<hfoptions id="viz_mode">
+<hfoption id="single_stage">
+
+```bash
+python src/lerobot/policies/sarm/compute_rabc_weights.py \
+  --dataset-repo-id your-username/your-dataset \
+  --reward-model-path your-username/sarm-model \
+  --visualize-only \
+  --num-visualizations 5 \
+  --head-mode sparse \
+  --output-dir ./sarm_viz
+```
+
+</hfoption>
+<hfoption id="dense_only">
+
+```bash
+python src/lerobot/policies/sarm/compute_rabc_weights.py \
+  --dataset-repo-id your-username/your-dataset \
+  --reward-model-path your-username/sarm-model \
+  --visualize-only \
+  --num-visualizations 5 \
+  --head-mode dense \
+  --output-dir ./sarm_viz
+```
+
+</hfoption>
+<hfoption id="dual">
+
+```bash
+python src/lerobot/policies/sarm/compute_rabc_weights.py \
+  --dataset-repo-id your-username/your-dataset \
+  --reward-model-path your-username/sarm-model \
+  --visualize-only \
+  --num-visualizations 5 \
+  --head-mode both \
+  --output-dir ./sarm_viz
+```
+
+</hfoption>
+</hfoptions>
+
+The visualization shows:
+
+- **Progress plot**: Predicted progress (and optional annotation-derived “GT” when available and `--stride 1`)
+- **Stage probabilities**: Stacked area plot of predicted stage probabilities
+- **Sample frames**: Key frames from the episode with progress/stage labels
+
+### Visualization Arguments
+
+| Argument               | Description                                               |
+| ---------------------- | --------------------------------------------------------- |
+| `--visualize-only`     | Only visualize predictions (no RABC computation)          |
+| `--num-visualizations` | Number of episodes to visualize (default: 5)              |
+| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`            |
+| `--stride`             | Compute every N frames, interpolate the rest (default: 1) |
+
+---
+
+## Step 5 (Optional): Train Policy with RA-BC
+
+Reward-Aligned Behavior Cloning (RA-BC) uses the trained SARM model to weight training samples based on predicted progress improvement. This requires two steps:
+
+1. **Precompute progress values** for all frames using the trained SARM model
+2. **Train policy** with RA-BC weighting using the precomputed values
+
+### How RA-BC Works
+
+For each training sample, RA-BC computes the progress delta:
+
+```
+r_i = φ(o_{t+Δ}) - φ(o_t)
+```
+
+Where `φ` is the SARM progress prediction and `Δ` is the policy's `chunk_size`. Samples with positive progress (good demonstrations) get higher weights, while samples with negative or zero progress get down-weighted.
+
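A minimal sketch of the delta computation for a single episode is shown below. The function name is hypothetical, and clamping the look-ahead to the last frame near the end of the episode is an assumption; the training script may treat those frames differently.

```python
from typing import List


def progress_deltas(progress: List[float], chunk_size: int) -> List[float]:
    """r_t = phi(o_{t+chunk_size}) - phi(o_t), computed within one episode."""
    last = len(progress) - 1
    return [
        progress[min(t + chunk_size, last)] - progress[t]
        for t in range(len(progress))
    ]


# Toy per-frame progress values with a small regression at frame 3.
deltas = progress_deltas([0.0, 0.1, 0.3, 0.2, 0.6], chunk_size=2)
print(deltas)
```

Deltas must not cross episode boundaries, which is why the sketch operates on one episode at a time.
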
+The weighting follows **Equations 8-9** from the paper:
+
+- **Soft weight**: `w̃_i = clip((r_i − (μ − 2σ)) / (4σ + ε), 0, 1)`
+- **Final weight**: `w_i = 𝟙{r_i > κ} + 𝟙{0 ≤ r_i ≤ κ} × w̃_i`
+
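The two equations translate fairly directly into code. The sketch below is illustrative rather than the LeRobot implementation: it assumes `μ` and `σ` are the mean and population standard deviation of the deltas passed in (the training script may use running or dataset-level statistics), and the value of `ε` is a placeholder.

```python
import statistics
from typing import List


def rabc_weights(deltas: List[float], kappa: float = 0.01, eps: float = 1e-6) -> List[float]:
    """Per-sample weights following the soft-weight and final-weight formulas."""
    mu = statistics.fmean(deltas)
    sigma = statistics.pstdev(deltas)
    weights = []
    for r in deltas:
        # Soft weight: clip((r - (mu - 2*sigma)) / (4*sigma + eps), 0, 1)
        soft = min(max((r - (mu - 2 * sigma)) / (4 * sigma + eps), 0.0), 1.0)
        if r > kappa:
            weights.append(1.0)   # clearly improving: full weight
        elif r >= 0:
            weights.append(soft)  # small progress: soft weight
        else:
            weights.append(0.0)   # negative progress: dropped
    return weights


weights = rabc_weights([-0.05, 0.0, 0.005, 0.02, 0.1])
print(weights)
```

In this toy batch the first sample gets weight 0, the last two get weight 1, and the two samples with `0 ≤ r_i ≤ κ` receive intermediate soft weights.
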
+### Step 5a: Compute SARM Progress Values
+
+First, run the SARM model on all frames in your dataset to compute progress values:
+
+```bash
+python src/lerobot/policies/sarm/compute_rabc_weights.py \
+  --dataset-repo-id your-username/your-dataset \
+  --reward-model-path your-username/sarm-model \
+  --head-mode sparse \
+  --num-visualizations 5 \
+  --push-to-hub
+```
+
+This script:
+
+- Processes all frames and computes progress values
+- Saves progress values to a parquet file next to the dataset on disk (defaults to `<dataset_root>/sarm_progress.parquet`)
+- Generates visualizations of the first N episodes (default: 5)
+
+**Arguments:**
+
+| Argument               | Description                                                    | Default    |
+| ---------------------- | -------------------------------------------------------------- | ---------- |
+| `--reward-model-path`  | Path to trained SARM model                                     | (required) |
+| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`                 | `sparse`   |
+| `--device`             | Device for inference                                           | `cuda`     |
+| `--visualize-only`     | Only visualize predictions (no RA-BC computation)              | `false`    |
+| `--num-visualizations` | Number of episodes to visualize (default: 5, set to 0 to skip) | `5`        |
+
+**Output format** (`sarm_progress.parquet`):
+
+| Column            | Description                                    |
+| ----------------- | ---------------------------------------------- |
+| `index`           | Global frame index in dataset                  |
+| `episode_index`   | Episode number                                 |
+| `frame_index`     | Local frame index within episode               |
+| `progress_sparse` | Sparse head progress value [0, 1]              |
+| `progress_dense`  | Dense head progress value [0, 1] (if computed) |
+
+### Step 5b: Train Policy with RA-BC
+
+Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`). Currently PI0, PI0.5 and SmolVLA are supported with RA-BC:
+
+```bash
+python src/lerobot/scripts/lerobot_train.py \
+  --dataset.repo_id=your-username/your-dataset \
+  --policy.type=pi0 \
+  --use_rabc=true \
+  --rabc_head_mode=sparse \
+  --rabc_kappa=0.01 \
+  --output_dir=outputs/train/policy_rabc \
+  --batch_size=32 \
+  --steps=40000
+```
+
+The training script automatically:
+
+- Loads the precomputed progress values from the parquet file
+- Uses the policy's `chunk_size` to compute progress deltas (Δ)
+- Computes sample weights based on progress improvement
+- Applies weighted loss during training
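+
+The progress-delta step above can be sketched as follows. This is a minimal illustration, not the training script's actual code: the function name `progress_deltas` is hypothetical, and clamping the look-ahead index at the last frame of the episode is an assumption about how the episode boundary is handled.
+
+```python
+def progress_deltas(progress, chunk_size):
+    """Return progress[t + chunk_size] - progress[t] for each frame of ONE episode.
+
+    Assumption: look-ahead indices past the end of the episode are clamped
+    to the last frame. The real script may treat the boundary differently.
+    """
+    last = len(progress) - 1
+    return [progress[min(t + chunk_size, last)] - progress[t] for t in range(len(progress))]
+
+
+# Per-frame progress values for a single episode (e.g. the `progress_sparse` column).
+episode_progress = [0.0, 0.1, 0.2, 0.2, 0.1, 0.5]
+deltas = progress_deltas(episode_progress, chunk_size=2)
+
+# Frames whose delta is negative are the ones RA-BC would give zero weight.
+regressing_frames = [t for t, d in enumerate(deltas) if d < 0]
+print(deltas, regressing_frames)
+```
+
+Computing deltas per episode (rather than over the global `index`) keeps the look-ahead from crossing into the next episode.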
+
+**RA-BC Arguments:**
+
+| Argument               | Description                                                | Default                            |
+| ---------------------- | ---------------------------------------------------------- | ---------------------------------- |
+| `--use_rabc`           | Enable RA-BC sample weighting                              | `false`                            |
+| `--rabc_progress_path` | Path to progress parquet file (auto-detected from dataset) | `sarm_progress.parquet` in dataset |
+| `--rabc_head_mode`     | Which SARM head's progress to use: `sparse` or `dense`     | `sparse`                           |
+| `--rabc_kappa`         | Threshold κ for high-quality samples                       | `0.01`                             |
+
+### Tuning RA-BC Kappa
+
+The `kappa` parameter is the threshold that determines which samples get full weight (w=1). Understanding how to tune it is critical for RA-BC to work effectively.
+
+**How the weighting works:**
+
+| Condition           | Weight                  |
+| ------------------- | ----------------------- |
+| `delta > kappa`     | 1.0 (hard threshold)    |
+| `0 ≤ delta ≤ kappa` | Soft weight from Eq. 8  |
+| `delta < 0`         | 0.0 (negative progress) |
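+
+The three cases in the table can be sketched in Python. The names `rabc_weight` and `mean_weight` are hypothetical, and the middle branch uses a simple linear ramp `delta / kappa` purely as a stand-in: it is not the soft weight from Eq. 8, only a placeholder that shows where that term plugs in.
+
+```python
+def rabc_weight(delta, kappa):
+    """Piecewise sample weight following the table above."""
+    if kappa <= 0:
+        raise ValueError("kappa must be positive")
+    if delta > kappa:
+        return 1.0  # hard threshold: full weight
+    if delta < 0:
+        return 0.0  # negative progress: sample is ignored
+    # Stand-in for the soft weight (NOT Eq. 8): linear ramp from 0 to 1.
+    return delta / kappa
+
+
+def mean_weight(deltas, kappa):
+    """Average weight over a batch, analogous to what `rabc_mean_weight` reports."""
+    return sum(rabc_weight(d, kappa) for d in deltas) / len(deltas)
+
+
+batch_deltas = [-0.02, 0.0, 0.005, 0.02, 0.04]
+print(mean_weight(batch_deltas, kappa=0.01))  # two of five samples exceed kappa
+print(mean_weight(batch_deltas, kappa=0.05))  # no sample exceeds kappa
+```
+
+Raising `kappa` moves more samples out of the full-weight branch, which is why the mean weight drops as `kappa` grows.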
+
+**Diagnosing kappa issues:**
+
+Monitor these WandB metrics during training:
+
+| Metric             | Healthy Range | Problem Indicator         |
+| ------------------ | ------------- | ------------------------- |
+| `rabc_mean_weight` | 0.3 - 0.8     | ≈ 1.0 means kappa too low |
+| `rabc_delta_mean`  | > 0           | Should be positive        |
+| `rabc_delta_std`   | > 0           | Variance in data quality  |
+
+**If `rabc_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft-weighting entirely. RA-BC becomes equivalent to vanilla BC.
+
+**Setting kappa based on your data:**
+
+The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30fps). For your dataset, check the logged `rabc_delta_mean` and `rabc_delta_std`:
+
+```bash
+# If delta_mean ≈ 0.03 and delta_std ≈ 0.02:
+# Most deltas fall in range [0.01, 0.05]
+
+# Option 1: Set kappa = delta_mean (medium selectivity)
+--rabc_kappa=0.03
+
+# Option 2: Set kappa = delta_mean + delta_std (high selectivity)
+--rabc_kappa=0.05
+
+# Option 3: Set kappa = delta_mean + 2*delta_std (very selective)
+--rabc_kappa=0.07
+```
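+
+The same arithmetic as a small helper, in case you want to derive the candidates from logged values. The function name `kappa_candidates` and the dictionary keys are hypothetical; the inputs are the `rabc_delta_mean` and `rabc_delta_std` values read off your training logs.
+
+```python
+def kappa_candidates(delta_mean, delta_std):
+    """Return the three kappa options described above, keyed by selectivity."""
+    return {
+        "medium": delta_mean,                        # Option 1
+        "high": delta_mean + delta_std,              # Option 2
+        "very_selective": delta_mean + 2 * delta_std,  # Option 3
+    }
+
+
+# Values from the worked example: delta_mean ≈ 0.03, delta_std ≈ 0.02.
+for name, kappa in kappa_candidates(0.03, 0.02).items():
+    print(f"{name}: --rabc_kappa={kappa:.2f}")
+```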
+
+**When RA-BC may not help:**
+
+If your dataset is already high quality (consistent progress across all demonstrations), RA-BC won't provide much benefit since there's nothing to filter.
+
+### Multi-GPU Training with RA-BC
+
+```bash
+accelerate launch \
+  --multi_gpu \
+  --num_processes=4 \
+  src/lerobot/scripts/lerobot_train.py \
+  --dataset.repo_id=your-username/your-dataset \
+  --policy.type=pi0 \
+  --use_rabc=true \
+  --rabc_kappa=0.01 \
+  --output_dir=outputs/train/policy_rabc \
+  --batch_size=32 \
+  --steps=40000
+```
+
+---
+
+## Tips & Best Practices
+
+### Choosing a Mode
+
+- **Start with `single_stage`** for quick experiments - no annotation overhead
+- Use **`dense_only`** when you want detailed progress tracking but tasks don't have clear high-level stages
+- Use **`dual`** for complex tasks where both coarse and fine-grained progress is meaningful
+
+### Annotation Quality
+
+1. **Be specific with subtask names**: Instead of "fold", use "grab near side and fold toward center"
+2. **Verify with visualization**: Always check a few episodes before training
+3. **Consistent naming**: Use the same subtask names across all episodes
+
+### RA-BC
+
+1. **Train SARM first**: RA-BC quality depends entirely on SARM quality
+2. **Monitor `rabc_mean_weight`**: If it's ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))
+
+---
+
+## Citation
+
+```bibtex
+@article{chen2025sarm,
+  title={SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation},
+  author={Chen, Qianzhong and Yu, Justin and Schwager, Mac and Abbeel, Pieter and Shentu, Yide and Wu, Philipp},
+  journal={arXiv preprint arXiv:2509.25358},
+  year={2025}
+}
+```
@@ -0,0 +1,74 @@
+# WALL-OSS
+
+WALL-OSS is an open-source foundation model for embodied intelligence, proposed by the [X Square Robot](https://x2robot.com/en/research/68bc2cde8497d7f238dde690) team in 2025. The LeRobot implementation is adapted from their open-source [WallX](https://github.com/X-Square-Robot/wall-x) repository.
+
+WALL-OSS is now integrated into Hugging Face's LeRobot ecosystem through a collaboration between the LeRobot and X Square Robot teams. You can post-train, evaluate, and deploy WALL-OSS directly through LeRobot, which we hope makes it easier for the open-source robotics community to customize and deploy WALL-OSS foundation models. For details, see the WALL-OSS [paper](https://arxiv.org/pdf/2509.11766) and [code](https://github.com/X-Square-Robot/wall-x).
+
+## Model Overview
+
+The WALL-OSS team is building an embodied foundation model to capture and compress what it sees as the world's most valuable data: the continuous, high-fidelity stream of physical interaction. The goal is a direct feedback loop between the model's decisions and the robot's physical experience, leading toward a generalizable intelligence that understands not just how the world works, but how to act effectively within it.
+
+Technically, WALL-OSS introduces a tightly coupled Mixture-of-Experts (MoE) multimodal architecture that integrates both discrete and continuous action modeling. Through a two-stage training pipeline (Inspiration → Integration), the model gradually unifies semantic reasoning and high-frequency action generation. Its core innovations include:
+
+- **Embodied perception–enhanced multimodal pretraining**: Large-scale training on unified vision–language–action data to strengthen spatial, causal, and manipulation understanding.
+- **Unified Cross-Level Chain-of-Thought (Uni-CoT)**: A single differentiable framework that unifies high-level instruction reasoning, sub-task decomposition, and fine-grained action synthesis, forming a continuous chain from “understanding” to “execution.”
+- **Mixture-of-Experts (MoE) action heads**: Dynamically activating experts depending on the task phase and modeling actions in discrete or continuous space to maintain stable VLM priors.
+- **Two-stage training paradigm**:
+  - **Inspiration stage**: Injecting discrete action priors to strengthen spatial understanding and semantic-action alignment.
+  - **Integration stage**: Using flow matching to achieve high-frequency continuous control.
+
+## Installation Requirements
+
+1. Install LeRobot by following our [Installation Guide](./installation).
+2. Install WallX dependencies by running:
+
+   ```bash
+   pip install -e ".[wallx]"
+   ```
+
+## Usage
+
+To use WallX in LeRobot, set the policy type on the command line:
+
+```bash
+--policy.type=wall_x
+```
+
+## Training
+
+To train WallX, use the standard LeRobot training script with the appropriate configuration:
+
+```bash
+python src/lerobot/scripts/lerobot_train.py \
+    --dataset.repo_id=your_dataset \
+    --policy.type=wall_x \
+    --output_dir=./outputs/wallx_training \
+    --job_name=wallx_training \
+    --policy.repo_id=your_repo_id \
+    --policy.pretrained_name_or_path=x-square-robot/wall-oss-flow \
+    --policy.prediction_mode=diffusion \
+    --policy.attn_implementation=eager \
+    --steps=3000 \
+    --policy.device=cuda \
+    --batch_size=32
+```
+
+### Training Arguments
+
+| Argument                       | Description                                                                                                                                                   |
+| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `--dataset.repo_id`            | The Hugging Face Hub repository ID for your training dataset (e.g., `lerobot/aloha_sim_insertion_human`)                                                      |
+| `--policy.type`                | Specifies using the WallX policy architecture                                                                                                                 |
+| `--output_dir`                 | Local directory where training checkpoints and logs will be saved                                                                                             |
+| `--job_name`                   | A name identifier for this training run (used in logging/tracking)                                                                                            |
+| `--policy.repo_id`             | Your Hugging Face Hub repo ID where the trained model will be pushed                                                                                          |
+| `--policy.pretrained_name_or_path` | Hub ID or path of the pretrained WallX weights to initialize from (the official WALL-OSS checkpoint, e.g. `x-square-robot/wall-oss-flow`) |
+| `--policy.prediction_mode`     | The action prediction strategy: `diffusion` or `fast` - `diffusion` uses iterative denoising for action generation, `fast` uses next token prediction instead |
+| `--policy.attn_implementation` | Attention implementation backend - `eager` uses standard PyTorch attention (alternatives include `flash_attention_2` or `sdpa`)                               |
+| `--steps`                      | Total number of training steps to run                                                                                                                         |
+| `--policy.device`              | Device to train on (`cuda` for GPU, `cpu` for CPU)                                                                                                            |
+| `--batch_size`                 | Number of samples per training batch                                                                                                                          |
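+
+If you launch runs from Python, the arguments above can be assembled and sanity-checked before starting a job. This is a sketch under stated assumptions: `build_train_args` is a hypothetical helper, not part of LeRobot, and it only validates the two `--policy.prediction_mode` values listed in the table.
+
+```python
+import shlex
+
+PREDICTION_MODES = {"diffusion", "fast"}  # values listed in the table above
+
+
+def build_train_args(dataset_repo_id, output_dir, pretrained, prediction_mode="diffusion",
+                     steps=3000, batch_size=32, device="cuda"):
+    """Build the argument list for the training script (hypothetical helper)."""
+    if prediction_mode not in PREDICTION_MODES:
+        raise ValueError(f"prediction_mode must be one of {sorted(PREDICTION_MODES)}")
+    return [
+        "python", "src/lerobot/scripts/lerobot_train.py",
+        f"--dataset.repo_id={dataset_repo_id}",
+        "--policy.type=wall_x",
+        f"--output_dir={output_dir}",
+        f"--policy.pretrained_name_or_path={pretrained}",
+        f"--policy.prediction_mode={prediction_mode}",
+        f"--steps={steps}",
+        f"--policy.device={device}",
+        f"--batch_size={batch_size}",
+    ]
+
+
+args = build_train_args("your_dataset", "./outputs/wallx_training", "x-square-robot/wall-oss-flow")
+print(shlex.join(args))  # pass `args` to subprocess.run(...) to actually launch the job
+```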
+
+## License
+
+This model follows the **Apache 2.0 License**, consistent with the original [WallX repository](https://github.com/X-Square-Robot/wall-x).