
# TOPReward

TOPReward is a **zero-shot reward model** that uses token log-probabilities from an off-the-shelf vision-language model (VLM) as a reward signal for robotics. Given a video trajectory and a task instruction, it returns the log-probability the VLM assigns to the statement that the instruction was completed; no fine-tuning is required.

- **Paper**: [TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics](https://arxiv.org/abs/2602.19313)
- **Project**: [topreward.github.io](https://topreward.github.io/webpage/)
- **Original code**: [github.com/TOPReward/TOPReward](https://github.com/TOPReward/TOPReward)
- **Default backbone**: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)

## Overview

TOPReward asks a generic VLM how likely a task instruction is, **conditioned on the video** of a robot trying to complete that task. Concretely, given:

- A trajectory video (a sequence of frames).
- A task instruction (e.g. _"open the drawer"_).

it builds a chat prompt of the form

```text
<video>
"The above video shows a robot manipulation trajectory that completes the
 following task: <instruction> Decide whether the above statement is True
 or not. The answer is: True"
```

forwards it through the VLM, masks the labels of every token except the last one, and reads back the log-probability of that token, which by default is the literal `"True"` that closes the suffix template. The resulting `log P("True" | video + prompt + instruction)` is the reward. It answers the question: given this video, how strongly does the VLM agree that the instruction is satisfied?
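The masking-and-readout step can be illustrated without a real VLM. The sketch below is a toy, pure-Python version, not the LeRobot implementation. It assumes the usual causal-LM convention in which the logits at position `i` score the token at position `i + 1`, and it uses `-100` as the ignore label, as Hugging Face loss functions do. The helper names and the three-token vocabulary are hypothetical.

```python
import math

IGNORE_INDEX = -100  # assumed ignore label, as in Hugging Face causal-LM losses


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def mask_all_but_last(input_ids):
    """Keep only the final token as a label; ignore everything else."""
    return [IGNORE_INDEX] * (len(input_ids) - 1) + [input_ids[-1]]


def masked_logprob(logits_per_position, input_ids):
    """Sum the log-probabilities of all unmasked labels (here: the last token)."""
    labels = mask_all_but_last(input_ids)
    total = 0.0
    for pos in range(1, len(input_ids)):
        if labels[pos] == IGNORE_INDEX:
            continue
        # Causal shift: logits at pos - 1 score the token at pos.
        total += log_softmax(logits_per_position[pos - 1])[labels[pos]]
    return total


# Toy vocabulary (hypothetical): 0 = prompt token, 1 = "False", 2 = "True".
input_ids = [0, 0, 2]
logits = [[0.0, 0.0, 0.0], [0.1, 1.0, 3.0], [0.0, 0.0, 0.0]]
reward = masked_logprob(logits, input_ids)
print(f"log P('True' | context) = {reward:.3f}")
```

Only the logits at the second-to-last position contribute, because every other label is masked. The reward is always at most zero, and it approaches zero as the model becomes more certain of `"True"`.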

Because the method depends only on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the prompt-assembly and label-masking logic.

## Installation Requirements

1. Install LeRobot following the [Installation Guide](./installation).
2. Install the TOPReward optional extra:

```bash
pip install -e ".[topreward]"
```

or, with `uv` from a source checkout:

```bash
uv sync --extra topreward
```

This pulls in `transformers` and `qwen-vl-utils`. The first time you run TOPReward, Hugging Face will also download the VLM weights from the Hub (~16 GB for Qwen3-VL-8B-Instruct). A GPU is strongly recommended.
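As a rough sanity check on the download size, the arithmetic can be sketched as below. Both inputs are assumptions read off the model name and common practice rather than figures from the model card: about 8 billion parameters, each stored as a 16-bit float. The function name is hypothetical.

```python
def weights_size_gb(num_params, bytes_per_param):
    """Approximate checkpoint size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: about 8e9 parameters at 2 bytes each (16-bit floats).
print(weights_size_gb(8e9, 2))  # 16.0
```

Peak GPU memory at inference time will be higher than this, since activations for the video tokens come on top of the weights.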

## Model Inputs and Outputs

TOPReward expects:

- A trajectory video or sequence of frames.
- A natural-language task description.

When run on LeRobot datasets, the preprocessor is controlled by these config fields:

| Config field              | Default                     | Meaning                                                                 |
| ------------------------- | --------------------------- | ----------------------------------------------------------------------- |
| `reward_model.image_key`  | `observation.images.top`    | Camera observation used by TOPReward                                    |
| `reward_model.task_key`   | `task`                      | Key in complementary data that stores the task string                   |
| `reward_model.max_frames` | `16`                        | Per-sample frame cap (`compute_reward` only; `predict_curves` skips it) |
| `reward_model.fps`        | `2.0`                       | Metadata passed to the Qwen video processor                             |
| `reward_model.vlm_name`   | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM                               |
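To make the `max_frames` cap concrete, here is one plausible way to reduce a long clip to at most 16 frames. Uniform spacing from the first to the last frame is an assumption; the actual selection strategy used by the preprocessor is not documented here, and the function name is hypothetical.

```python
def subsample_indices(num_frames, max_frames):
    """Pick at most max_frames indices, evenly spaced from first to last frame."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    if max_frames == 1:
        return [num_frames - 1]  # assumption: keep the final frame
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


# A 100-frame clip capped at 16 frames; a 10-frame clip is left untouched.
long_clip = subsample_indices(100, 16)
short_clip = subsample_indices(10, 16)
print(len(long_clip), long_clip[0], long_clip[-1])
print(short_clip)
```

Keeping the last frame matters for a reward that judges task completion, since the final state of the scene usually carries most of the evidence.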

The model returns:

- `compute_reward(batch)`: one log-probability per sample. Higher = better task–video alignment. When `success_threshold` is finite, returns the binary thresholded value instead.
- `predict_curves(batch, num_prefixes=None)`: per-frame progress curve in `[0, 1]` (min-max normalised log-probs over prefix lengths). `num_prefixes=None` is fully dense; `num_prefixes=15` matches the upstream sparse-dense default with linear interpolation between anchors.
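The relationship between prefix anchors, linear interpolation and min-max normalisation can be sketched in plain Python. This is an illustration under assumptions, not the LeRobot code: anchors are assumed to be evenly spaced frame indices, a constant curve is assumed to map to zeros, and all function names are hypothetical.

```python
def anchor_indices(num_frames, num_prefixes=None):
    """Frame indices at which the VLM is queried (None means every frame)."""
    if num_prefixes is None or num_prefixes >= num_frames:
        return list(range(num_frames))
    if num_prefixes == 1:
        return [num_frames - 1]
    step = (num_frames - 1) / (num_prefixes - 1)
    return sorted({round(i * step) for i in range(num_prefixes)})


def interpolate(anchors, values, num_frames):
    """Linearly interpolate anchor values onto every frame index."""
    curve = []
    for t in range(num_frames):
        if t <= anchors[0]:
            curve.append(values[0])
        elif t >= anchors[-1]:
            curve.append(values[-1])
        else:
            k = max(i for i, a in enumerate(anchors) if a <= t)
            weight = (t - anchors[k]) / (anchors[k + 1] - anchors[k])
            curve.append(values[k] + weight * (values[k + 1] - values[k]))
    return curve


def minmax_normalise(values):
    """Rescale to [0, 1]; a constant curve maps to all zeros (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


# Toy log-probabilities at three anchors of a five-frame clip.
anchors = anchor_indices(num_frames=5, num_prefixes=3)
anchor_logprobs = [-4.0, -2.0, 0.0]
progress = minmax_normalise(interpolate(anchors, anchor_logprobs, num_frames=5))
print(progress)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because min-max normalisation is an affine map and linear interpolation never leaves the range of the anchor values, normalising before or after interpolation gives the same curve. The sparse variant trades temporal resolution for fewer VLM forward passes: 15 per episode instead of one per frame.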

## Usage

### Load the reward model directly

```python
from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel

cfg = TOPRewardConfig(
    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
)
reward_model = TOPRewardModel(cfg)
```

There is no `from_pretrained` weight download for TOPReward itself; the VLM is fetched from the Hub when the model is constructed.

### Score a clip + task

```python
import numpy as np
from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX

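# Placeholder inputs for illustration; substitute a real clip and instruction.
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)  # shape (T, H, W, C), dtype uint8
task = "open the drawer"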
batch = {
    f"{TOPREWARD_FEATURE_PREFIX}frames": [frames],
    f"{TOPREWARD_FEATURE_PREFIX}task": [task],
}
reward = reward_model.compute_reward(batch)  # tensor of shape (1,)
```

For a per-frame progress curve over the same clip (here computed from 15 prefix anchors, with linear interpolation in between):

```python
out = reward_model.predict_curves(batch, num_prefixes=15)
progress = out["progress"][0].detach().cpu().numpy()  # shape (T,), values in [0, 1]
```

### Use the reward factory

```python
from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors

cfg = make_reward_model_config(
    "topreward",
    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
    image_key="observation.images.top",
)
reward_model = make_reward_model(cfg)
preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
```

The preprocessor writes normalised frames and task strings under the `observation.topreward.*` namespace; the model reads them from there in `compute_reward`.
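The key layout can be pictured with a small stand-in for the preprocessor. This sketch only shows the renaming into the namespace and skips frame normalisation entirely. The prefix value `"observation.topreward."` is an assumption inferred from the namespace named above, and `to_topreward_batch` is a hypothetical helper, not a LeRobot function.

```python
PREFIX = "observation.topreward."  # assumed value of TOPREWARD_FEATURE_PREFIX


def to_topreward_batch(samples, image_key="observation.images.top", task_key="task"):
    """Collect per-sample frames and task strings under the TOPReward namespace."""
    return {
        f"{PREFIX}frames": [s[image_key] for s in samples],
        f"{PREFIX}task": [s[task_key] for s in samples],
    }


# Frames are stand-in strings here; real samples would hold image arrays.
samples = [
    {"observation.images.top": ["frame0", "frame1"], "task": "open the drawer"},
    {"observation.images.top": ["frame0"], "task": "close the drawer"},
]
batch = to_topreward_batch(samples)
print(sorted(batch))
```

Namespacing the keys this way keeps the reward model's inputs separate from the raw dataset keys, so changing `image_key` in the config does not change what the model reads.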

### Offline dataset labeling

Offline labeling follows the SARM / Robometer RA-BC flow. Write a `topreward_progress.parquet` once, then reuse it for training (RA-BC) and visualisation (overlay videos):

```bash
# Fully dense per-frame labeling
uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
    --dataset-repo-id lerobot/libero_10_image \
    --device cuda

# Sparse-dense (15 anchors per episode, matches upstream)
uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
    --dataset-repo-id lerobot/libero_10_image \
    --num-prefixes 15 \
    --device cuda
```

Then render the SARM-style progress overlay for any episode:

```bash
uv run examples/dataset/create_progress_videos.py \
    --repo-id lerobot/libero_10_image \
    --episode 0 \
    --progress-file topreward_progress.parquet \
    --gif
```

## Publishing a named TOPReward configuration

Because TOPReward stores no weights of its own, "publishing a TOPReward model" amounts to writing the LeRobot `config.json` (≈ 1 KB) that pins the VLM id, prompt and reduction:

```python
from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel

cfg = TOPRewardConfig(
    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
    reduction="mean",
    fps=2.0,
)
TOPRewardModel(cfg).save_pretrained("./topreward-qwen3vl-8b")
# Push the directory to the Hub via `huggingface-cli` or `HfApi.upload_folder`.
```

Reloading restores the same configuration (no weight download for TOPReward itself; the VLM is re-fetched via `vlm_name`):

```python
reloaded = TOPRewardModel.from_pretrained("./topreward-qwen3vl-8b")
```
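To show why such a checkpoint is so small, the round trip can be mimicked with a dataclass and the standard `json` module. `ToyTOPRewardConfig`, `save_config` and `load_config` are hypothetical stand-ins covering only the three fields set above; the real `config.json` written by `save_pretrained` may contain more fields and a different layout.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyTOPRewardConfig:
    """Hypothetical stand-in for TOPRewardConfig with three of its fields."""

    vlm_name: str = "Qwen/Qwen3-VL-8B-Instruct"
    reduction: str = "mean"
    fps: float = 2.0


def save_config(cfg, directory):
    """Write the config as JSON and return the file path."""
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(asdict(cfg), indent=2))
    return path


def load_config(directory):
    """Rebuild the config from the JSON file in the given directory."""
    data = json.loads((Path(directory) / "config.json").read_text())
    return ToyTOPRewardConfig(**data)


with tempfile.TemporaryDirectory() as tmp:
    path = save_config(ToyTOPRewardConfig(), tmp)
    size_bytes = path.stat().st_size
    restored = load_config(tmp)

print(size_bytes, restored == ToyTOPRewardConfig())
```

Everything needed to reproduce the reward model is the VLM id plus a handful of settings, which is why the published artifact stays well under a kilobyte in this toy version.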

## References

- [TOPReward project page](https://topreward.github.io/webpage/)
- [TOPReward paper](https://arxiv.org/abs/2602.19313)
- [Original TOPReward code](https://github.com/TOPReward/TOPReward)
- [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)

## Citation

```bibtex
@article{chen2026topreward,
  title={TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
  author={Chen, Shirui and Harrison, Cole and Lee, Ying-Chun and Yang, Angela Jin and
          Ren, Zhongzheng and Ratliff, Lillian J and Duan, Jiafei and Fox, Dieter and
          Krishna, Ranjay},
  journal={arXiv preprint arXiv:2602.19313},
  year={2026}
}
```