feat(rewards): add TOPReward reward model

2026-07-17 23:11:45 +00:00 · 2026-05-19 18:00:18 +02:00
parent d38eb89f71
commit 70ad322676
14 changed files with 2230 additions and 3 deletions
@@ -73,6 +73,8 @@
 - sections:
  - local: sarm
    title: SARM
+  - local: topreward
+    title: TOPReward
  title: "Reward Models"
 - sections:
  - local: inference
@@ -0,0 +1,191 @@
+# TOPReward
+
+TOPReward is a **zero-shot reward model** that uses token log-probabilities from an off-the-shelf vision-language model (VLM) as a robotic reward signal. Given a video trajectory and a task instruction, it returns the log-probability the VLM assigns to the statement that the video completes the instruction. No fine-tuning is required.
+
+**Paper**: [TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics](https://arxiv.org/abs/2602.19313)
+**Project**: [topreward.github.io](https://topreward.github.io/webpage/)
+**Original code**: [github.com/TOPReward/TOPReward](https://github.com/TOPReward/TOPReward)
+**Default backbone**: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+
+## Overview
+
+TOPReward asks a generic VLM how likely a task instruction is, **conditioned on the video** of a robot trying to complete that task. Concretely, given:
+
+- A trajectory video (a sequence of frames).
+- A task instruction (e.g. _"open the drawer"_).
+
+it builds a chat prompt of the form
+
+```text
+<video>
+"The above video shows a robot manipulation trajectory that completes the
+ following task: <instruction> Decide whether the above statement is True
+ or not. The answer is: True"
+```
+
+forwards it through the VLM, label-masks everything except the very last token, and reads back the log-probability of that token — by default the literal `"True"` that closes the suffix template. The resulting `log P("True" | video + prompt + instruction)` is the reward, and answers the question "given this video, how strongly does the VLM agree that the instruction is satisfied?".
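The scoring step above can be sketched without any model at all. The snippet below is a minimal illustration, not the LeRobot implementation: it assumes the common causal-LM convention that masked label positions carry the value `-100` and that `logits[t]` is the prediction for token `t + 1`. The helper names and the toy logits are made up for the example.

```python
import math

IGNORE_INDEX = -100  # label value excluded from scoring (assumed convention)


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def masked_sequence_logprob(logits, labels):
    """Sum log P(labels[t] | tokens[:t]) over positions that are not masked.

    logits[t] is the next-token distribution after reading token t,
    so the prediction for position t comes from logits[t - 1].
    """
    total = 0.0
    for t in range(1, len(labels)):
        if labels[t] == IGNORE_INDEX:
            continue  # masked position: contributes nothing to the score
        total += log_softmax(logits[t - 1])[labels[t]]
    return total


# Toy example: vocabulary of 3 tokens, sequence of 4 tokens.
token_ids = [0, 2, 1, 1]
# Mask everything except the very last token.
labels = [IGNORE_INDEX] * (len(token_ids) - 1) + [token_ids[-1]]
logits = [
    [0.1, 0.2, 0.3],
    [1.0, 0.0, -1.0],
    [0.0, 2.0, 0.0],  # distribution used to score the final token
    [0.5, 0.5, 0.5],  # unused: nothing follows the final token
]
reward = masked_sequence_logprob(logits, labels)
```

With only the final label left unmasked, the sum reduces to a single term, the log-probability of the closing token given everything before it, which is the quantity described above.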
+
+Because the method depends only on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the prompt-assembly and label-masking logic.
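As an illustration of the prompt-assembly half of that wrapper, the template from the Overview can be written as a small string builder. This is a sketch only: the constant and function names are hypothetical, the exact whitespace and quoting used by the real wrapper are assumptions, and the video itself is assumed to be supplied to the VLM separately from this text.

```python
# Text pieces copied from the Overview template; the split points are an assumption.
PROMPT_PREFIX = (
    "The above video shows a robot manipulation trajectory that completes the "
    "following task: "
)
PROMPT_SUFFIX = " Decide whether the above statement is True or not. The answer is: "
ANSWER_TOKEN = "True"  # the token whose log-probability is read back


def build_prompt(instruction: str) -> str:
    """Return the text part of the prompt, ending with the answer token."""
    return f"{PROMPT_PREFIX}{instruction.strip()}{PROMPT_SUFFIX}{ANSWER_TOKEN}"


prompt = build_prompt("open the drawer")
```

Keeping the answer token as the final piece of the string is what lets the scoring step mask every position except the last one.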
+
+## Installation Requirements
+
+1. Install LeRobot following the [Installation Guide](./installation).
+2. Install the TOPReward optional extra:
+
+```bash
+pip install -e ".[topreward]"
+```
+
+or, with `uv` from a source checkout:
+
+```bash
+uv sync --extra topreward
+```
+
+This pulls in `transformers` and `qwen-vl-utils`. The first time you run TOPReward, Hugging Face will also download the VLM weights from the Hub (~16 GB for Qwen3-VL-8B-Instruct). A GPU is strongly recommended.
+
+## Model Inputs and Outputs
+
+TOPReward expects:
+
+- A trajectory video or sequence of frames.
+- A natural-language task description.
+
+In LeRobot datasets the preprocessor reads:
+
+| Config field              | Default                     | Meaning                                                                        |
+| ------------------------- | --------------------------- | ------------------------------------------------------------------------------ |
+| `reward_model.image_key`  | `observation.images.top`    | Camera observation used by TOPReward                                           |
+| `reward_model.task_key`   | `task`                      | Key in complementary data that stores the task string                          |
+| `reward_model.max_frames` | `16`                        | Cap on frames per sample (`compute_reward` only; `predict_curves` bypasses it) |
+| `reward_model.fps`        | `2.0`                       | Metadata passed to the Qwen video processor                                    |
+| `reward_model.vlm_name`   | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM                                      |
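The table does not say how the `max_frames` cap is applied. One common way to cap a clip is to pick evenly spaced frame indices that keep the first and last frame. The sketch below shows that approach under that assumption; the function name is hypothetical, and the actual preprocessor may truncate or sample differently.

```python
def cap_frame_indices(num_frames: int, max_frames: int) -> list[int]:
    """Pick at most `max_frames` evenly spaced indices from range(num_frames)."""
    if num_frames <= 0 or max_frames <= 0:
        raise ValueError("num_frames and max_frames must be positive")
    if num_frames <= max_frames:
        return list(range(num_frames))  # short clip: keep every frame
    if max_frames == 1:
        return [num_frames - 1]  # keep only the final frame
    # Spread indices from the first frame to the last frame inclusive.
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


indices = cap_frame_indices(num_frames=40, max_frames=16)
```

Because the step is larger than one whenever the clip exceeds the cap, the rounded indices are strictly increasing and no frame is selected twice.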
+
+The model returns:
+
+- `compute_reward(batch)`: one log-probability per sample. Higher = better task–video alignment. When `success_threshold` is finite, returns the binary thresholded value instead.
+- `predict_curves(batch, num_prefixes=None)`: per-frame progress curve in `[0, 1]` (min-max normalised log-probs over prefix lengths). `num_prefixes=None` is fully dense; `num_prefixes=15` matches the upstream sparse-dense default with linear interpolation between anchors.
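The progress curve described above combines two simple operations: min-max normalisation of the prefix log-probabilities and, in the sparse case, linear interpolation between anchor frames. The sketch below is a pure-Python illustration under stated assumptions: the function names are hypothetical, anchors are given as strictly increasing frame indices, and a flat set of scores maps to all zeros. It is not the `predict_curves` implementation.

```python
def min_max_normalise(values, eps=1e-8):
    """Rescale values to [0, 1]; a (near-)constant input maps to zeros (assumption)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span < eps:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def interpolate_curve(anchor_frames, anchor_values, num_frames):
    """Linearly interpolate anchor values onto frames 0 .. num_frames - 1.

    `anchor_frames` must be strictly increasing frame indices.
    """
    curve = []
    j = 0  # index of the anchor at or to the left of the current frame
    for t in range(num_frames):
        if t <= anchor_frames[0]:
            curve.append(anchor_values[0])
            continue
        if t >= anchor_frames[-1]:
            curve.append(anchor_values[-1])
            continue
        while anchor_frames[j + 1] < t:
            j += 1
        left, right = anchor_frames[j], anchor_frames[j + 1]
        w = (t - left) / (right - left)
        curve.append((1.0 - w) * anchor_values[j] + w * anchor_values[j + 1])
    return curve


# Toy example: 3 anchors over a 5-frame clip, with made-up log-probabilities.
anchor_frames = [0, 2, 4]
anchor_logprobs = [-3.0, -2.0, -1.0]
progress = interpolate_curve(anchor_frames, min_max_normalise(anchor_logprobs), 5)
```

Normalising before or after interpolation gives the same curve here, because linear interpolation never exceeds the minimum and maximum of the anchor values.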
+
+## Usage
+
+### Load the reward model directly
+
+```python
+from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel
+
+cfg = TOPRewardConfig(
+    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+    device="cuda",
+)
+reward_model = TOPRewardModel(cfg)
+```
+
+TOPReward has no weights of its own to download. Constructing the model fetches the VLM named by `vlm_name` from the Hub.
+
+### Score a clip + task
+
+```python
+import numpy as np
+from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX
+
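+# Placeholder inputs; replace them with a real clip and instruction.
+# frames: np.ndarray, shape (T, H, W, C), dtype uint8
+frames = np.zeros((8, 224, 224, 3), dtype=np.uint8)
+# task: str
+task = "open the drawer"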
+batch = {
+    f"{TOPREWARD_FEATURE_PREFIX}frames": [frames],
+    f"{TOPREWARD_FEATURE_PREFIX}task": [task],
+}
+reward = reward_model.compute_reward(batch)  # tensor of shape (1,)
+```
+
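+For a per-frame progress curve over the same clip (here scored at 15 anchor prefixes and linearly interpolated in between; pass `num_prefixes=None` for the fully dense variant):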
+
+```python
+out = reward_model.predict_curves(batch, num_prefixes=15)
+# .cpu() is a no-op for CPU tensors and avoids an error if the output lives on the GPU.
+progress = out["progress"][0].cpu().numpy()  # shape (T,), values in [0, 1]
+```
+
+### Use the reward factory
+
+```python
+from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors
+
+cfg = make_reward_model_config(
+    "topreward",
+    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+    device="cuda",
+    image_key="observation.images.top",
+)
+reward_model = make_reward_model(cfg)
+preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
+```
+
+The preprocessor writes normalised frames and task strings under the `observation.topreward.*` namespace; the model reads them in `compute_reward`.
+
+### Offline dataset labeling
+
+Offline labeling mirrors the SARM / Robometer RA-BC flow: write `topreward_progress.parquet` once, then reuse it for training (RA-BC) and visualisation (overlay videos):
+
+```bash
+# Fully dense per-frame labeling
+uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
+    --dataset-repo-id lerobot/libero_10_image \
+    --device cuda
+
+# Sparse-dense (15 anchors per episode, matches upstream)
+uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
+    --dataset-repo-id lerobot/libero_10_image \
+    --num-prefixes 15 \
+    --device cuda
+```
+
+Then render the SARM-style progress overlay for any episode:
+
+```bash
+uv run examples/dataset/create_progress_videos.py \
+    --repo-id lerobot/libero_10_image \
+    --episode 0 \
+    --progress-file topreward_progress.parquet \
+    --gif
+```
+
+## Publishing a named TOPReward configuration
+
+Because TOPReward stores no weights of its own, "publishing a TOPReward model" amounts to writing the LeRobot `config.json` (≈ 1 KB) that pins the VLM id, prompt and reduction:
+
+```python
+from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel
+
+cfg = TOPRewardConfig(
+    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+    reduction="mean",
+    fps=2.0,
+)
+TOPRewardModel(cfg).save_pretrained("./topreward-qwen3vl-8b")
+# Push the directory to the Hub via `huggingface-cli` or `HfApi.upload_folder`.
+```
+
+Reloading restores the same configuration (no weight download for TOPReward itself; the VLM is re-fetched via `vlm_name`):
+
+```python
+reloaded = TOPRewardModel.from_pretrained("./topreward-qwen3vl-8b")
+```
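To make the "config only" point concrete, the round trip can be mimicked with a plain dataclass and a JSON file. This is a toy stand-in, not the LeRobot `save_pretrained` / `from_pretrained` code: the class `ToyRewardConfig` and both helper functions are hypothetical, and only the field names `vlm_name`, `reduction` and `fps` come from the example above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyRewardConfig:
    """Hypothetical stand-in for a weight-free reward model config."""

    vlm_name: str = "Qwen/Qwen3-VL-8B-Instruct"
    reduction: str = "mean"
    fps: float = 2.0


def save_config(cfg: ToyRewardConfig, directory: str) -> Path:
    """Write the config as config.json inside `directory`."""
    path = Path(directory) / "config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(cfg), indent=2))
    return path


def load_config(directory: str) -> ToyRewardConfig:
    """Rebuild the config from config.json; no model weights are involved."""
    data = json.loads((Path(directory) / "config.json").read_text())
    return ToyRewardConfig(**data)


with tempfile.TemporaryDirectory() as tmp:
    written = save_config(ToyRewardConfig(fps=4.0), tmp)
    size_bytes = written.stat().st_size  # a few dozen bytes for three fields
    restored = load_config(tmp)
```

The saved artifact is just a small text file, so reloading it restores the settings, and the VLM weights still have to be fetched separately through the stored `vlm_name`.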
+
+## References
+
+- [TOPReward project page](https://topreward.github.io/webpage/)
+- [TOPReward paper](https://arxiv.org/abs/2602.19313)
+- [Original TOPReward code](https://github.com/TOPReward/TOPReward)
+- [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+
+## Citation
+
+```bibtex
+@article{chen2026topreward,
+  title={TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
+  author={Chen, Shirui and Harrison, Cole and Lee, Ying-Chun and Yang, Angela Jin and
+          Ren, Zhongzheng and Ratliff, Lillian J and Duan, Jiafei and Fox, Dieter and
+          Krishna, Ranjay},
+  journal={arXiv preprint arXiv:2602.19313},
+  year={2026}
+}
+```