# TOPReward

TOPReward is a **zero-shot reward model** that extracts token log-probabilities from an off-the-shelf vision-language model (VLM) as a robotic reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood that the instruction is true, with no fine-tuning required.

**Paper**: [TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics](https://arxiv.org/abs/2602.19313)
**Project**: [topreward.github.io](https://topreward.github.io/webpage/)
**Original code**: [github.com/TOPReward/TOPReward](https://github.com/TOPReward/TOPReward)
**Default backbone**: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)

## Overview

TOPReward asks a generic VLM how likely a task instruction is, **conditioned on the video** of a robot trying to complete that task. Concretely, given:

- A trajectory video (a sequence of frames).
- A task instruction (e.g. _"open the drawer"_).

it builds a chat prompt of the form

```text
<video>
"The above video shows a robot manipulation trajectory that completes the
following task: <instruction> Decide whether the above statement is True
or not. The answer is: True"
```

forwards it through the VLM, label-masks everything except the very last token, and reads back the log-probability of that token (by default the literal `"True"` that closes the suffix template). The resulting `log P("True" | video + prompt + instruction)` is the reward; it answers the question "given this video, how strongly does the VLM agree that the instruction is satisfied?"

Because the method depends only on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the prompt-assembly and label-masking logic.
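The scoring step described in this section can be illustrated with a toy example. The sketch below is not the LeRobot implementation: it uses plain Python lists in place of VLM logits, and the helper names (`log_softmax`, `last_token_logprob`) and the three-token vocabulary are invented for illustration. It also assumes the logits are already aligned so that row `t` scores token `t`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def last_token_logprob(step_logits, token_ids):
    """Sum token log-probs under a label mask that keeps only the last token.

    step_logits[t] holds the logits used to predict token_ids[t]
    (assumed to be already shifted/aligned).
    """
    # Label mask: ignore every position except the final one.
    keep = [False] * (len(token_ids) - 1) + [True]
    total = 0.0
    for logits, token, use in zip(step_logits, token_ids, keep):
        if use:
            total += log_softmax(logits)[token]
    return total


# Toy sequence of two tokens over a three-token vocabulary.
# Pretend token id 0 is "True" and that it closes the prompt.
step_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, -1.0]]
token_ids = [1, 0]

reward = last_token_logprob(step_logits, token_ids)
print(reward)  # a negative number; closer to 0 means stronger agreement
```

Because only the final position survives the mask, the earlier prompt tokens condition the prediction but contribute nothing to the score.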

## Installation Requirements

1. Install LeRobot following the [Installation Guide](./installation).
2. Install the TOPReward optional extra:

```bash
pip install -e ".[topreward]"
```

or, with `uv` from a source checkout:

```bash
uv sync --extra topreward
```

This pulls in `transformers` and `qwen-vl-utils`. The first time you run TOPReward, Hugging Face will also download the VLM weights from the Hub (~16 GB for Qwen3-VL-8B-Instruct). A GPU is strongly recommended.
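A quick way to check that the optional extra installed correctly is to ask Python whether the two packages can be found. This is a minimal sketch and not part of LeRobot: the helper name `missing_topreward_deps` is hypothetical, and it assumes `qwen-vl-utils` is importable under the module name `qwen_vl_utils`.

```python
from importlib.util import find_spec


def missing_topreward_deps(modules=("transformers", "qwen_vl_utils")):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_topreward_deps()
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("TOPReward dependencies look importable.")
```

`find_spec` only locates the modules without importing them, so the check is cheap and does not load the VLM stack.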

## Model Inputs and Outputs

TOPReward expects:

- A trajectory video or sequence of frames.
- A natural-language task description.

In LeRobot datasets the preprocessor reads:

| Config field              | Default                     | Meaning                                                                        |
| ------------------------- | --------------------------- | ------------------------------------------------------------------------------ |
| `reward_model.image_key`  | `observation.images.top`    | Camera observation used by TOPReward                                           |
| `reward_model.task_key`   | `task`                      | Key in complementary data that stores the task string                          |
| `reward_model.max_frames` | `16`                        | Cap on frames per sample (`compute_reward` only; `predict_curves` bypasses it) |
| `reward_model.fps`        | `2.0`                       | Metadata passed to the Qwen video processor                                    |
| `reward_model.vlm_name`   | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM                                      |

The model returns:

- `compute_reward(batch)`: one log-probability per sample. Higher means better task–video alignment. When `success_threshold` is finite, it returns the binary thresholded value instead.
- `predict_curves(batch, num_prefixes=None)`: a per-frame progress curve in `[0, 1]` (min-max normalised log-probs over prefix lengths). `num_prefixes=None` is fully dense; `num_prefixes=15` matches the upstream sparse-dense default, with linear interpolation between anchors.
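To make the progress-curve post-processing concrete, here is a small pure-Python sketch of min-max normalisation followed by linear interpolation between sparse anchors. It is illustrative only: the function names, the anchor positions, and the order of the two steps (normalise first, then interpolate) are assumptions, not a description of what `predict_curves` does internally.

```python
from bisect import bisect_right


def min_max_normalise(values):
    """Rescale values to [0, 1]; a flat input maps to all zeros (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def interpolate_anchors(anchor_frames, anchor_values, num_frames):
    """Linearly interpolate anchor values onto frame indices 0..num_frames-1.

    anchor_frames must be sorted and strictly increasing.
    """
    curve = []
    for t in range(num_frames):
        k = bisect_right(anchor_frames, t)  # number of anchors at or before t
        if k == 0:
            curve.append(anchor_values[0])  # before the first anchor: hold its value
        elif k == len(anchor_frames):
            curve.append(anchor_values[-1])  # at or after the last anchor
        else:
            t0, t1 = anchor_frames[k - 1], anchor_frames[k]
            v0, v1 = anchor_values[k - 1], anchor_values[k]
            weight = (t - t0) / (t1 - t0)
            curve.append(v0 + weight * (v1 - v0))
    return curve


# Hypothetical log-probs measured at three prefix lengths of a 5-frame clip.
anchor_frames = [0, 2, 4]
anchor_logprobs = [-4.0, -2.0, -1.0]

progress = interpolate_anchors(anchor_frames, min_max_normalise(anchor_logprobs), num_frames=5)
print(progress)
```

With fewer anchors than frames, the VLM is queried only at the anchor prefixes and the frames in between receive interpolated values, which is the trade-off the sparse-dense setting makes.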

## Usage

### Load the reward model directly

```python
from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel

cfg = TOPRewardConfig(
    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
)
reward_model = TOPRewardModel(cfg)
```

There is no `from_pretrained` weight download for TOPReward itself; the VLM is fetched from the Hub on construction.

### Score a clip + task

```python
import numpy as np
from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX

# frames: np.ndarray, shape (T, H, W, C), dtype uint8
# task: str
# Placeholder inputs so the snippet runs end to end; replace with a real clip.
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)
task = "open the drawer"

batch = {
    f"{TOPREWARD_FEATURE_PREFIX}frames": [frames],
    f"{TOPREWARD_FEATURE_PREFIX}task": [task],
}
reward = reward_model.compute_reward(batch)  # tensor of shape (1,)
```

For a dense per-frame curve over the same clip:

```python
out = reward_model.predict_curves(batch, num_prefixes=15)
# `.cpu()` is a no-op if the tensor already lives on the CPU.
progress = out["progress"][0].cpu().numpy()  # shape (T,), values in [0, 1]
```

### Use the reward factory

```python
from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors

cfg = make_reward_model_config(
    "topreward",
    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
    device="cuda",
    image_key="observation.images.top",
)
reward_model = make_reward_model(cfg)
preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
```

The preprocessor writes normalised frames and task strings under the `observation.topreward.*` namespace; the model reads them in `compute_reward`.

### Offline dataset labeling

This mirrors the SARM / Robometer RA-BC flow: write a `topreward_progress.parquet` once, then reuse it for training (RA-BC) and visualisation (overlay videos):

```bash
# Fully dense per-frame labeling
uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
  --dataset-repo-id lerobot/libero_10_image \
  --device cuda

# Sparse-dense (15 anchors per episode, matches upstream)
uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
  --dataset-repo-id lerobot/libero_10_image \
  --num-prefixes 15 \
  --device cuda
```

Then render the SARM-style progress overlay for any episode:

```bash
uv run examples/dataset/create_progress_videos.py \
  --repo-id lerobot/libero_10_image \
  --episode 0 \
  --progress-file topreward_progress.parquet \
  --gif
```

## Publishing a named TOPReward configuration

Because TOPReward stores no weights of its own, "publishing a TOPReward model" amounts to writing the LeRobot `config.json` (≈ 1 KB) that pins the VLM id, prompt and reduction:

```python
from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel

cfg = TOPRewardConfig(
    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
    reduction="mean",
    fps=2.0,
)
TOPRewardModel(cfg).save_pretrained("./topreward-qwen3vl-8b")
# Push the directory to the Hub via `huggingface-cli` or `HfApi.upload_folder`.
```
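The upload step mentioned in the final comment could look like the sketch below, which uses `HfApi.create_repo` and `HfApi.upload_folder` from `huggingface_hub`. The function name `push_config_dir` and the repo id in the example call are placeholders, and the sketch assumes you are already authenticated with the Hub (for example through a stored access token).

```python
from pathlib import Path


def push_config_dir(folder, repo_id):
    """Upload a saved TOPReward config directory to the Hugging Face Hub."""
    folder = Path(folder)
    # Fail early, before any network call, if the directory was never saved.
    if not (folder / "config.json").is_file():
        raise FileNotFoundError(f"no config.json found in {folder}")

    # Imported lazily so the check above works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # Create the model repo if it does not exist yet, then upload the folder.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=str(folder), repo_id=repo_id, repo_type="model")


# Example call (placeholder repo id; requires Hub credentials):
# push_config_dir("./topreward-qwen3vl-8b", "your-username/topreward-qwen3vl-8b")
```

Since the directory holds only a small configuration file, the upload itself is quick; anyone loading it later still downloads the VLM weights separately.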

Reloading restores the same configuration (no weight download for TOPReward itself; the VLM is re-fetched via `vlm_name`):

```python
reloaded = TOPRewardModel.from_pretrained("./topreward-qwen3vl-8b")
```
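To confirm what a saved directory actually pins, you can read its `config.json` with the standard library. This is a hedged sketch: the helper name `read_pinned_fields` is hypothetical, and it assumes the fields appear as flat top-level keys named `vlm_name`, `reduction` and `fps`, which may not match the real file layout.

```python
import json
from pathlib import Path


def read_pinned_fields(model_dir, keys=("vlm_name", "reduction", "fps")):
    """Return the selected top-level entries of <model_dir>/config.json, if present."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return {key: config[key] for key in keys if key in config}


model_dir = Path("./topreward-qwen3vl-8b")
if (model_dir / "config.json").exists():
    print(read_pinned_fields(model_dir))
else:
    print("No saved configuration found at", model_dir)
```

Keys that are missing from the file are skipped rather than raising, so the helper stays usable if the configuration schema differs from the assumption above.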

## References

- [TOPReward project page](https://topreward.github.io/webpage/)
- [TOPReward paper](https://arxiv.org/abs/2602.19313)
- [Original TOPReward code](https://github.com/TOPReward/TOPReward)
- [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)

## Citation

```bibtex
@article{chen2026topreward,
  title={TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
  author={Chen, Shirui and Harrison, Cole and Lee, Ying-Chun and Yang, Angela Jin and
          Ren, Zhongzheng and Ratliff, Lillian J and Duan, Jiafei and Fox, Dieter and
          Krishna, Ranjay},
  journal={arXiv preprint arXiv:2602.19313},
  year={2026}
}
```