# SARM: Stage-Aware Reward Modeling

SARM (Stage-Aware Reward Modeling) is a video-based reward modeling framework for long-horizon robot manipulation tasks. This guide covers how to train SARM reward models and optionally use them with Reward-Aligned Behavior Cloning (RA-BC).

**Paper**: [SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation](https://arxiv.org/abs/2509.25358)

## Why Reward Models?

Standard behavior cloning treats all demonstration frames equally, but real-world robot datasets are messy: they contain hesitations, corrections, and trajectories of variable quality. Reward models address this by learning a generalizable notion of **task progress** from demonstrations: given video frames and a task description, they predict how close the robot is to completing the task (0→1).

This learned progress signal can be used in multiple ways. Two promising applications are (1) **weighted imitation learning** (RA-BC), where high-progress frames receive more weight during policy training, and (2) **reinforcement learning**, where the reward model provides dense rewards for online or offline policy improvement.

## Overview

SARM has the following features:

1. **Stage-aware architecture**: Jointly predicts the high-level task stage and fine-grained progress within each stage
2. **Subtask annotations**: Uses natural language subtask annotations to derive consistent progress labels
3. **Temporal proportions**: Computes dataset-level priors (α̅_k) for each subtask to normalize progress across variable-length demonstrations

SARM trains on a compact **stage+tau** target for each frame:

- **stage**: integer stage index `k ∈ {0, ..., K-1}`
- **τ (tau)**: within-stage progress `τ ∈ [0, 1]`
- **target encoding**: `y = k + τ` (this is what the dataset processor produces)

At inference time (and in downstream RA-BC), SARM converts the raw `k + τ` value into a **normalized progress** in `[0, 1]` using dataset-level **temporal proportions** `α̅_k` (stored in `meta/temporal_proportions_*.json`). This matches **Formula (2)** from the paper:

```
progress_t = P_{k-1} + α̅_k × τ_t
```

Where:

- `τ_t = (t - s_k) / (e_k - s_k)` is the within-subtask normalized time
- `P_{k-1}` is the cumulative prior (sum of the previous subtasks' proportions)
- `α̅_k` is the temporal proportion for subtask k

This ensures identical task states map to consistent progress values, even across demonstrations of different lengths.
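As a worked example of Formula (2), the sketch below converts a raw `k + τ` prediction into normalized progress. The function and variable names are illustrative (this is not the library API); in practice the proportions come from `meta/temporal_proportions_*.json`.

```python
# Illustrative sketch of Formula (2): convert a raw stage+tau prediction
# y = k + tau into normalized progress P_{k-1} + alpha_bar_k * tau.
# Names are illustrative, not the library API.


def normalize_progress(y: float, proportions: list[float]) -> float:
    """proportions[k] is alpha_bar_k, e.g. list(json.load(...).values())
    from meta/temporal_proportions_*.json (assuming subtask order is kept)."""
    k = min(int(y), len(proportions) - 1)    # stage index k
    tau = y - k                              # within-stage progress tau in [0, 1]
    cumulative_prior = sum(proportions[:k])  # P_{k-1}
    return cumulative_prior + proportions[k] * tau


# Three subtasks averaging 20%, 50%, and 30% of an episode (made-up values):
proportions = [0.2, 0.5, 0.3]
print(normalize_progress(0.5, proportions))   # 0.10 — halfway through stage 0
print(normalize_progress(1.5, proportions))   # 0.45 — halfway through stage 1
print(normalize_progress(2.99, proportions))  # ~1.0 — nearly complete
```

Note how "halfway through stage 1" maps to 0.45 rather than 0.5: longer subtasks occupy a proportionally larger slice of the progress range.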
## Inputs and Targets (What the new code expects)

SARM is trained through its processor (`src/lerobot/policies/sarm/processor_sarm.py`), which:

- **Encodes** images and task text with CLIP (ViT-B/32) into `video_features` and `text_features`
- **Pads/truncates** robot state into `state_features` (up to `max_state_dim`)
- **Builds targets** as `sparse_targets` (and `dense_targets` in `dense_only`/`dual`) using the stage+tau encoding `y = k + τ`
- **Masks rewind frames** using a per-sample `lengths` tensor (rewind is a training-time augmentation)

At minimum, each training sample needs:

- `task` (string): task description
- `policy.image_key` images and `policy.state_key` states from the dataset

---

## Annotation Modes

You can choose from **3 annotation modes** that determine how progress labels are computed:

| Mode           | Annotations Required | Heads                        | Use Case                                                      |
| -------------- | -------------------- | ---------------------------- | ------------------------------------------------------------- |
| `single_stage` | None                 | Sparse only                  | Simple tasks, quick experiments, no VLM needed                |
| `dense_only`   | Dense (VLM)          | Dual (sparse auto-generated) | Detailed subtask tracking without defining high-level stages  |
| `dual`         | Sparse + Dense (VLM) | Dual                         | Full SARM paper setup with both granularities                 |

### Mode Details

#### `single_stage`

**No annotations required.** The entire episode is treated as a single stage called `"task"`, and progress is linear from 0 to 1 over the episode duration.

- **Sparse head**: 1 stage ("task"), linear progress
- **Dense head**: Not used
- **Best for**: Simple tasks, quick experiments, or when VLM annotation is not available

Workflow:

```
1. Train SARM → 2. Visualize predictions → 3. (Optional) Train policy with RA-BC
```

#### `dense_only`

**Only dense (fine-grained) annotations from a VLM.** The sparse head automatically uses a single `"task"` stage covering the full episode, while the dense head learns detailed subtask progression.

- **Sparse head**: 1 stage ("task"), linear progress (auto-generated)
- **Dense head**: Multiple fine-grained stages from VLM annotations
- **Best for**: When you want detailed subtask tracking but don't need to define high-level stages

Workflow:

```
1. Annotate (dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
```

#### `dual`

**Both sparse and dense annotations from a VLM.** Full dual-head mode as described in the SARM paper, with both high-level (sparse) and fine-grained (dense) stage predictions.

- **Sparse head**: High-level stages from VLM annotations
- **Dense head**: Fine-grained stages from VLM annotations
- **Best for**: Complex multi-stage tasks where both granularities are useful

Workflow:

```
1. Annotate (sparse+dense) → 2. Verify → 3. Train SARM → 4. Visualize → 5. (Optional) Train policy with RA-BC
```

## Set Up Your Environment

1. Install LeRobot by following our [Installation Guide](./installation).
2. Install SARM dependencies by running:

```bash
pip install -e ".[sarm]"
```
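Before annotating, it may help to see how the stage+tau targets described above are derived from the subtask boundaries that Step 1 produces (the `*_subtask_start_frames` / `*_subtask_end_frames` columns). This is an illustrative sketch, not the processor's actual code, and the boundaries are made up:

```python
# Illustrative sketch (not the actual processor code) of the stage+tau
# target encoding y = k + tau for one frame, given subtask boundaries.


def stage_tau_target(t: int, starts: list[int], ends: list[int]) -> float:
    """Encode frame index t as y = k + tau, where k is the subtask
    containing t and tau is normalized time within that subtask."""
    for k, (s_k, e_k) in enumerate(zip(starts, ends)):
        if s_k <= t < e_k:
            tau = (t - s_k) / (e_k - s_k)  # tau_t = (t - s_k) / (e_k - s_k)
            return k + tau
    # Convention for illustration: frames at/after the last end are complete.
    return float(len(starts))


# Two subtasks covering frames [0, 60) and [60, 150) (made-up boundaries):
starts, ends = [0, 60], [60, 150]
print(stage_tau_target(30, starts, ends))   # 0.5 — halfway through subtask 0
print(stage_tau_target(105, starts, ends))  # 1.5 — halfway through subtask 1
```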
---

## Step 1: Subtask Annotation

### `single_stage`

**No annotation required!** Skip this step entirely. The model will use the episode's task description and compute linear progress automatically.

### `dense_only`

Generate **dense (fine-grained) annotations only** using a VLM. The sparse stage will be auto-generated.

```bash
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --dense-only \
  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
  --video-key observation.images.base \
  --num-workers 4 \
  --push-to-hub
```

**What gets saved:**

- `meta/temporal_proportions_sparse.json` - Auto-generated sparse proportions (`{"task": 1.0}`)
- `meta/temporal_proportions_dense.json` - Dense temporal proportions
- Per-episode columns in `episodes/*.parquet`:
  - `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
  - (also time-based columns: `dense_subtask_start_times`, `dense_subtask_end_times`)

### `dual`

Generate **both sparse (high-level) and dense (fine-grained) annotations** using a VLM.

```bash
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --sparse-subtasks "Bring arms up from starting position,Fold the towel (3 folds in total)" \
  --dense-subtasks "Bring robot arms up from starting position,Grab near side and do 1st fold,Grab side and do 2nd fold,Grab side and do 3rd fold to finish folding" \
  --video-key observation.images.base \
  --num-workers 4 \
  --push-to-hub
```

**What gets saved:**

- `meta/temporal_proportions_sparse.json` - Sparse temporal proportions
- `meta/temporal_proportions_dense.json` - Dense temporal proportions
- Per-episode columns in `episodes/*.parquet`:
  - `sparse_subtask_names`, `sparse_subtask_start_frames`, `sparse_subtask_end_frames`
  - `dense_subtask_names`, `dense_subtask_start_frames`, `dense_subtask_end_frames`
  - (also time-based columns: `*_subtask_start_times`, `*_subtask_end_times`)

### Annotation Arguments

| Argument               | Description                                                                      |
| ---------------------- | -------------------------------------------------------------------------------- |
| `--repo-id`            | HuggingFace dataset repository ID                                                 |
| `--sparse-subtasks`    | Comma-separated list of high-level subtask names                                  |
| `--dense-subtasks`     | Comma-separated list of fine-grained subtask names                                |
| `--dense-only`         | Generate only dense annotations (auto-creates sparse "task" stage)                |
| `--video-key`          | Camera/video key to use (e.g., `observation.images.top`)                          |
| `--num-workers`        | Number of parallel GPU workers (default: 1)                                       |
| `--episodes`           | Specific episode indices to annotate (default: all)                               |
| `--skip-existing`      | Skip episodes that already have annotations                                       |
| `--model`              | VLM model (default: `Qwen/Qwen3-VL-30B-A3B-Instruct`)                             |
| `--num-visualizations` | Number of episodes to visualize after annotation (default: 5, set to 0 to skip)   |

> **Note**: After annotation completes, 5 episodes are automatically visualized by default. Use `--num-visualizations 0` to skip this step.
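If you prefer to inspect the saved annotations programmatically rather than through the visualizer, a sketch like the following works. The file and column names are the ones listed above, but the local cache location is an assumption that may differ in your setup:

```python
# Sketch: inspect annotation outputs for a locally cached dataset.
# The cache path below is an assumption (it may differ in your setup);
# file/column names follow the layout described above.
import json
from pathlib import Path

import pandas as pd

root = Path("~/.cache/huggingface/lerobot/your-username/your-dataset").expanduser()

# Dataset-level temporal proportions (alpha_bar_k per subtask).
proportions = json.loads((root / "meta/temporal_proportions_dense.json").read_text())
print(proportions)  # e.g. {"Grab near side and do 1st fold": 0.27, ...}

# Per-episode subtask boundaries.
episodes = pd.concat(
    pd.read_parquet(p) for p in sorted((root / "episodes").glob("*.parquet"))
)
print(episodes[["dense_subtask_names",
                "dense_subtask_start_frames",
                "dense_subtask_end_frames"]].head())
```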
---

## Step 2: Verify Annotations

### `single_stage`

**No verification needed!** Skip this step.

### `dense_only`

Visualize the dense annotations using the `--visualize-only` flag:

```bash
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --visualize-only \
  --visualize-type dense \
  --num-visualizations 5 \
  --video-key observation.images.base \
  --output-dir ./subtask_viz
```

### `dual`

Visualize both annotation levels using the `--visualize-only` flag:

```bash
python src/lerobot/data_processing/sarm_annotations/subtask_annotation.py \
  --repo-id your-username/your-dataset \
  --visualize-only \
  --visualize-type both \
  --num-visualizations 5 \
  --video-key observation.images.base \
  --output-dir ./subtask_viz
```

This generates visualizations showing video frames with subtask boundaries overlaid and a timeline of the subtasks.

### Visualization Arguments

| Argument               | Description                                                     |
| ---------------------- | --------------------------------------------------------------- |
| `--visualize-only`     | Only visualize existing annotations (no generation)             |
| `--num-visualizations` | Number of episodes to visualize (default: 5)                    |
| `--visualize-type`     | Type of annotations to visualize: `sparse`, `dense`, or `both`  |

**Tip**: If annotations are inaccurate, adjust your subtask descriptions to be more specific and re-run.

---

## Step 3: Train SARM

### `single_stage`

Train with **no annotations** - uses linear progress from 0 to 1:

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=single_stage \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_single \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-model-name
```

### `dense_only`

Train with **dense annotations only** (sparse auto-generated):

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=dense_only \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_dense \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-model-name
```

### `dual`

Train with **both sparse and dense annotations**:

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=sarm \
  --policy.annotation_mode=dual \
  --policy.image_key=observation.images.base \
  --output_dir=outputs/train/sarm_dual \
  --batch_size=32 \
  --steps=5000 \
  --wandb.enable=true \
  --wandb.project=sarm \
  --policy.repo_id=your-username/your-model-name
```

### Multi-GPU Training

Prefix the training command with `accelerate launch --multi_gpu --num_processes=4` to train on multiple GPUs.

### Training Arguments

| Argument                   | Description                                                        | Default                  |
| -------------------------- | ------------------------------------------------------------------ | ------------------------ |
| `--policy.annotation_mode` | `single_stage`, `dense_only`, or `dual`                            | `single_stage`           |
| `--policy.image_key`       | Camera key for images                                              | `observation.images.top` |
| `--policy.state_key`       | Key for joint states                                               | `observation.state`      |
| `--policy.n_obs_steps`     | Observation history steps (total obs frames = `n_obs_steps + 1`)   | `8`                      |
| `--policy.frame_gap`       | Gap (in frames) between sampled observations (at 30 fps: 30 ≈ 1s)  | `30`                     |
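To make the last two arguments concrete: with the defaults `n_obs_steps=8` and `frame_gap=30`, each sample sees the current frame plus eight history frames spaced about one second apart at 30 fps. The sketch below shows the sampled indices; clamping before the episode start is an assumption for illustration, not necessarily the loader's exact behavior.

```python
# Sketch of which frame indices an observation window covers, given the
# n_obs_steps / frame_gap arguments above. Clamping at the episode start
# is an assumption for illustration, not necessarily what the loader does.


def observation_indices(t: int, n_obs_steps: int = 8, frame_gap: int = 30) -> list[int]:
    """Current frame t plus n_obs_steps history frames, frame_gap apart."""
    return [max(0, t - i * frame_gap) for i in range(n_obs_steps, -1, -1)]


print(observation_indices(300))  # [60, 90, ..., 270, 300] — 9 frames total
print(observation_indices(45))   # early history frames clamp to frame 0
```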
---

## Step 4: Visualize Predictions

Use `compute_rabc_weights.py` with `--visualize-only` to visualize model predictions (and, if available, annotation-derived targets) without writing a parquet file.

### Sparse head

```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/sarm-model \
  --visualize-only \
  --num-visualizations 5 \
  --head-mode sparse \
  --output-dir ./sarm_viz
```

### Dense head

```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/sarm-model \
  --visualize-only \
  --num-visualizations 5 \
  --head-mode dense \
  --output-dir ./sarm_viz
```

### Both heads

```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/sarm-model \
  --visualize-only \
  --num-visualizations 5 \
  --head-mode both \
  --output-dir ./sarm_viz
```

The visualization shows:

- **Progress plot**: Predicted progress (and an optional annotation-derived “GT” curve when annotations are available and `--stride 1` is used)
- **Stage probabilities**: Stacked area plot of predicted stage probabilities
- **Sample frames**: Key frames from the episode with progress/stage labels

### Visualization Arguments

| Argument               | Description                                                |
| ---------------------- | ---------------------------------------------------------- |
| `--visualize-only`     | Only visualize predictions (no RA-BC computation)          |
| `--num-visualizations` | Number of episodes to visualize (default: 5)               |
| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`             |
| `--stride`             | Compute every N frames, interpolate the rest (default: 1)  |

---

## Step 5 (Optional): Train Policy with RA-BC

Reward-Aligned Behavior Cloning (RA-BC) uses the trained SARM model to weight training samples based on predicted progress improvement. This requires two steps:

1. **Precompute progress values** for all frames using the trained SARM model
2. **Train the policy** with RA-BC weighting using the precomputed values

### How RA-BC Works

For each training sample, RA-BC computes the progress delta:

```
r_i = φ(o_{t+Δ}) - φ(o_t)
```

where `φ` is the SARM progress prediction and `Δ` is the policy's `chunk_size`. Samples with positive progress (good demonstrations) get higher weights, while samples with negative or zero progress are down-weighted.
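To make the delta concrete: given a per-frame progress array for one episode (as produced in Step 5a below) and the policy's `chunk_size`, `r_i` can be computed as in this sketch. The training script does this internally; the names and the end-of-episode handling here are illustrative.

```python
# Illustrative sketch: per-frame deltas r_i = phi(o_{t+Delta}) - phi(o_t)
# for a single episode, with Delta = chunk_size. The training script does
# this internally; names here are illustrative.
import numpy as np


def progress_deltas(progress: np.ndarray, chunk_size: int) -> np.ndarray:
    """progress[t] is phi(o_t) for one episode."""
    # One assumption for illustration: frames within chunk_size of the
    # episode end compare against the final frame's progress.
    future = np.concatenate([progress[chunk_size:],
                             np.full(chunk_size, progress[-1])])
    return future - progress


# A clean demo episode where progress rises linearly from 0 to 1:
phi = np.linspace(0.0, 1.0, 300)
r = progress_deltas(phi, chunk_size=50)
print(r[:3])  # ≈ 0.167 each: 50 frames ahead ≈ 50/299 more progress
```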
The weighting follows **Equations 8-9** from the paper:

- **Soft weight**: `w̃_i = clip((r_i − (μ − 2σ)) / (4σ + ε), 0, 1)`
- **Final weight**: `w_i = 𝟙{r_i > κ} + 𝟙{0 ≤ r_i ≤ κ} × w̃_i`

where `μ` and `σ` are the mean and standard deviation of the deltas `r_i`, and `κ` is the `--rabc_kappa` threshold.

### Step 5a: Compute SARM Progress Values

First, run the SARM model on all frames in your dataset to compute progress values:

```bash
python src/lerobot/policies/sarm/compute_rabc_weights.py \
  --dataset-repo-id your-username/your-dataset \
  --reward-model-path your-username/sarm-model \
  --head-mode sparse \
  --num-visualizations 5 \
  --push-to-hub
```

This script:

- Processes all frames and computes progress values
- Saves the progress values to a parquet file next to the dataset on disk (defaults to `sarm_progress.parquet` in the dataset root)
- Generates visualizations of the first N episodes (default: 5)

**Arguments:**

| Argument               | Description                                        | Default    |
| ---------------------- | -------------------------------------------------- | ---------- |
| `--reward-model-path`  | Path to trained SARM model                         | (required) |
| `--head-mode`          | SARM head to use: `sparse`, `dense`, or `both`     | `sparse`   |
| `--device`             | Device for inference                               | `cuda`     |
| `--visualize-only`     | Only visualize predictions (no RA-BC computation)  | `false`    |
| `--num-visualizations` | Number of episodes to visualize (set to 0 to skip) | `5`        |

**Output format** (`sarm_progress.parquet`):

| Column            | Description                                    |
| ----------------- | ---------------------------------------------- |
| `index`           | Global frame index in dataset                  |
| `episode_index`   | Episode number                                 |
| `frame_index`     | Local frame index within episode               |
| `progress_sparse` | Sparse head progress value [0, 1]              |
| `progress_dense`  | Dense head progress value [0, 1] (if computed) |

### Step 5b: Train Policy with RA-BC

Once you have the progress file, train your policy with RA-BC weighting. The progress file is auto-detected from the dataset path (`sarm_progress.parquet`). RA-BC currently supports PI0, PI0.5, and SmolVLA:

```bash
python src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=pi0 \
  --use_rabc=true \
  --rabc_head_mode=sparse \
  --rabc_kappa=0.01 \
  --output_dir=outputs/train/policy_rabc \
  --batch_size=32 \
  --steps=40000
```

The training script automatically:

- Loads the precomputed progress values from the parquet file
- Uses the policy's `chunk_size` to compute progress deltas (Δ)
- Computes sample weights based on progress improvement
- Applies the weighted loss during training

**RA-BC Arguments:**

| Argument               | Description                                                 | Default                            |
| ---------------------- | ----------------------------------------------------------- | ---------------------------------- |
| `--use_rabc`           | Enable RA-BC sample weighting                               | `false`                            |
| `--rabc_progress_path` | Path to progress parquet file (auto-detected from dataset)  | `sarm_progress.parquet` in dataset |
| `--rabc_head_mode`     | Which SARM head's progress to use: `sparse` or `dense`      | `sparse`                           |
| `--rabc_kappa`         | Threshold κ for high-quality samples                        | `0.01`                             |

### Tuning RA-BC Kappa

The `kappa` parameter is the threshold that determines which samples get full weight (w=1). Understanding how to tune it is critical for RA-BC to work effectively.

**How the weighting works:**

| Condition           | Weight                  |
| ------------------- | ----------------------- |
| `delta > kappa`     | 1.0 (hard threshold)    |
| `0 ≤ delta ≤ kappa` | Soft weight from Eq. 8  |
| `delta < 0`         | 0.0 (negative progress) |
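In code, this weighting looks roughly like the sketch below (illustrative, not the training script's exact implementation), reusing deltas `r` computed as in the earlier snippet:

```python
# Sketch of Equations 8-9: soft weight w_tilde, then the final weight w.
# mu/sigma are statistics of the deltas r; illustrative, not the exact code.
import numpy as np


def rabc_weights(r: np.ndarray, kappa: float = 0.01, eps: float = 1e-8) -> np.ndarray:
    mu, sigma = r.mean(), r.std()
    # Eq. 8: w_tilde = clip((r - (mu - 2*sigma)) / (4*sigma + eps), 0, 1)
    w_soft = np.clip((r - (mu - 2.0 * sigma)) / (4.0 * sigma + eps), 0.0, 1.0)
    # Eq. 9: full weight above kappa, soft weight on [0, kappa], zero below 0
    return np.where(r > kappa, 1.0, np.where(r >= 0.0, w_soft, 0.0))


# Made-up deltas from a mixed-quality dataset: mostly forward progress,
# with some stalls and regressions.
rng = np.random.default_rng(0)
r = rng.normal(loc=0.02, scale=0.02, size=1000)
w = rabc_weights(r)
print(w.mean())  # compare against the rabc_mean_weight metric logged in WandB
```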
**Diagnosing kappa issues:**

Monitor these WandB metrics during training:

| Metric             | Healthy Range | Problem Indicator         |
| ------------------ | ------------- | ------------------------- |
| `rabc_mean_weight` | 0.3 - 0.8     | ≈ 1.0 means kappa too low |
| `rabc_delta_mean`  | > 0           | Should be positive        |
| `rabc_delta_std`   | > 0           | Variance in data quality  |

**If `rabc_mean_weight ≈ 1.0`:** Your kappa is too low. Most samples have `delta > kappa` and bypass the soft weighting entirely, so RA-BC becomes equivalent to vanilla BC.

**Setting kappa based on your data:**

The default `kappa=0.01` was tuned for the paper's T-shirt folding task (~90s episodes at 30 fps). For your dataset, check the logged `rabc_delta_mean` and `rabc_delta_std`:

```
# If delta_mean ≈ 0.03 and delta_std ≈ 0.02,
# most deltas fall in the range [0.01, 0.05].

# Option 1: kappa = delta_mean (medium selectivity)
--rabc_kappa=0.03

# Option 2: kappa = delta_mean + delta_std (high selectivity)
--rabc_kappa=0.05

# Option 3: kappa = delta_mean + 2*delta_std (very selective)
--rabc_kappa=0.07
```

**When RA-BC may not help:** If your dataset is already high quality (consistent progress across all demonstrations), RA-BC won't provide much benefit, since there is nothing to filter.

### Multi-GPU Training with RA-BC

```bash
accelerate launch \
  --multi_gpu \
  --num_processes=4 \
  src/lerobot/scripts/lerobot_train.py \
  --dataset.repo_id=your-username/your-dataset \
  --policy.type=pi0 \
  --use_rabc=true \
  --rabc_kappa=0.01 \
  --output_dir=outputs/train/policy_rabc \
  --batch_size=32 \
  --steps=40000
```

---

## Tips & Best Practices

### Choosing a Mode

- **Start with `single_stage`** for quick experiments - no annotation overhead
- Use **`dense_only`** when you want detailed progress tracking but tasks don't have clear high-level stages
- Use **`dual`** for complex tasks where both coarse and fine-grained progress are meaningful

### Annotation Quality

1. **Be specific with subtask names**: Instead of "fold", use "grab near side and fold toward center"
2. **Verify with visualization**: Always check a few episodes before training
3. **Consistent naming**: Use the same subtask names across all episodes

### RA-BC

1. **Train SARM first**: RA-BC quality depends entirely on SARM quality
2. **Monitor `rabc_mean_weight`**: If it is ≈ 1.0, increase kappa (see [Tuning RA-BC Kappa](#tuning-ra-bc-kappa))

---

## Citation

```bibtex
@article{chen2025sarm,
  title={SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation},
  author={Chen, Qianzhong and Yu, Justin and Schwager, Mac and Abbeel, Pieter and Shentu, Yide and Wu, Philipp},
  journal={arXiv preprint arXiv:2509.25358},
  year={2025}
}
```