refactor(rewards): clean up TOPReward processor/model

2026-07-23 17:56:07 +00:00 · 2026-05-20 17:39:21 +02:00
parent 70ad322676
commit f6ecb7b955
7 changed files with 568 additions and 928 deletions
@@ -23,9 +23,20 @@ it builds a chat prompt of the form
 or not. The answer is: True"
 ```

-forwards it through the VLM, label-masks everything except the very last token, and reads back the log-probability of that token — by default the literal `"True"` that closes the suffix template. The resulting `log P("True" | video + prompt + instruction)` is the reward, and answers the question "given this video, how strongly does the VLM agree that the instruction is satisfied?".
+forwards it through the VLM, label-masks everything except the very last token, and reads back the log-probability of that token — by default the literal `"True"` that closes the suffix template. The resulting `log P("True" | video + prompt + instruction)` is the reward.
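
As a reading aid for the scoring step described in this hunk, the following is a minimal pure-Python sketch of "mask every label except the last token, then read that token's log-probability". It is not the LeRobot implementation: `mask_all_but_last`, `masked_log_prob`, and the toy logits are hypothetical names and values, and the use of `-100` as the ignore label is assumed from the usual PyTorch / `transformers` convention rather than taken from this commit.

```python
import math

IGNORE_INDEX = -100  # conventional "do not score this position" label (assumed here)


def mask_all_but_last(input_ids):
    """Return labels that keep only the final token as a scoring target."""
    if not input_ids:
        raise ValueError("input_ids must not be empty")
    return [IGNORE_INDEX] * (len(input_ids) - 1) + [input_ids[-1]]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def masked_log_prob(logits, labels):
    """Sum log P(labels[t] | tokens before t) over unmasked positions.

    logits[t] is the next-token distribution after reading tokens 0..t,
    so the target at position t is scored with logits[t - 1].
    """
    total = 0.0
    for t in range(1, len(labels)):
        if labels[t] == IGNORE_INDEX:
            continue
        total += log_softmax(logits[t - 1])[labels[t]]
    return total


# Toy example: 3 tokens, vocabulary of 4. Token id 2 plays the role of "True".
input_ids = [3, 1, 2]
logits = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, math.log(3.0), 0.0],  # softmax puts 3/6 = 0.5 on token id 2
    [1.0, 2.0, 3.0, 4.0],            # unused: nothing follows the last token
]
labels = mask_all_but_last(input_ids)
reward = masked_log_prob(logits, labels)  # log(0.5)
```

With only one unmasked label, the sum collapses to the single term `log P("True" | ...)`, which is the reward described above.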

-Because the method only depends on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the prompt assembly + label-masking logic.
+Because the method only depends on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the label-masking logic. The processor owns the tokeniser and builds the full chat prompt (EO-1/Robometer pattern).
+
+## What the LeRobot integration covers
+
+- Standard `reward_model.type=topreward` configuration through LeRobot.
+- VLM loading via the `transformers` `Qwen3VLForConditionalGeneration` API.
+- Prompt assembly + tokenisation in the processor (matching upstream `QwenClient.compute_instruction_reward`).
+- `compute_reward()` returns one scalar log-prob per sample.
+- LeRobot reward-model save/load — `save_pretrained` writes only `config.json` (the VLM is identified by `vlm_name`).
+- An offline labeling script that writes a `topreward_progress.parquet` (SARM-compatible schema) for RA-BC training and overlay videos.
+
+The current LeRobot port supports the **Qwen3-VL client only**. Other upstream clients (Gemini, OpenAI, Gemma, Molmo) can be added as follow-up extras.

 ## Installation Requirements

@@ -53,18 +64,17 @@ TOPReward expects:

 In LeRobot datasets the preprocessor reads:

-| Config field              | Default                     | Meaning                                                                 |
-| ------------------------- | --------------------------- | ----------------------------------------------------------------------- |
-| `reward_model.image_key`  | `observation.images.top`    | Camera observation used by TOPReward                                    |
-| `reward_model.task_key`   | `task`                      | Key in complementary data that stores the task string                   |
-| `reward_model.max_frames` | `16`                        | Cap on frames per sample (compute_reward only; predict_curves bypasses) |
-| `reward_model.fps`        | `2.0`                       | Metadata passed to the Qwen video processor                             |
-| `reward_model.vlm_name`   | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM                               |
+| Config field              | Default                     | Meaning                                       |
+| ------------------------- | --------------------------- | --------------------------------------------- |
+| `reward_model.image_key`  | `observation.images.top`    | Camera observation used by TOPReward          |
+| `reward_model.task_key`   | `task`                      | Key in complementary data for the task string |
+| `reward_model.max_frames` | `16`                        | Cap on frames per sample                      |
+| `reward_model.fps`        | `2.0`                       | Metadata passed to the Qwen video processor   |
+| `reward_model.vlm_name`   | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM     |

 The model returns:

- `compute_reward(batch)`: one log-probability per sample. Higher = better task–video alignment. When `success_threshold` is finite, returns the binary thresholded value instead.
- `predict_curves(batch, num_prefixes=None)`: per-frame progress curve in `[0, 1]` (min-max normalised log-probs over prefix lengths). `num_prefixes=None` is fully dense; `num_prefixes=15` matches the upstream sparse-dense default with linear interpolation between anchors.
+- `compute_reward(batch)`: one log-probability per sample. Higher = better task-video alignment. When `success_threshold` is finite, returns the binary thresholded value instead.
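
The thresholding behaviour mentioned above could look roughly like the sketch below. The diff does not show the actual code, so everything here is an assumption made for illustration: the helper name `apply_success_threshold`, the non-finite default, the `>=` comparison direction, and the `1.0` / `0.0` output values.

```python
import math


def apply_success_threshold(log_prob, success_threshold=math.inf):
    """Return the raw log-prob, or a binary value when a finite threshold is set.

    Hypothetical helper: the comparison direction and the 1.0 / 0.0 outputs
    are assumptions, not taken from the LeRobot source.
    """
    if not math.isfinite(success_threshold):
        return log_prob  # no threshold configured: pass the log-prob through
    return 1.0 if log_prob >= success_threshold else 0.0


raw = apply_success_threshold(-0.7)                            # stays -0.7
hit = apply_success_threshold(-0.7, success_threshold=-1.0)    # 1.0
miss = apply_success_threshold(-2.5, success_threshold=-1.0)   # 0.0
```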

 ## Usage

@@ -80,30 +90,6 @@ cfg = TOPRewardConfig(
 reward_model = TOPRewardModel(cfg)
 ```

-There is no `from_pretrained` weight download for TOPReward itself — the VLM is fetched from the Hub on construction.
-
-### Score a clip + task
-
-```python
-import numpy as np
-from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX
-
-# frames: np.ndarray, shape (T, H, W, C), dtype uint8
-# task: str
-batch = {
-    f"{TOPREWARD_FEATURE_PREFIX}frames": [frames],
-    f"{TOPREWARD_FEATURE_PREFIX}task": [task],
-}
-reward = reward_model.compute_reward(batch)  # tensor of shape (1,)
-```
-
-For a dense per-frame curve over the same clip:
-
-```python
-out = reward_model.predict_curves(batch, num_prefixes=15)
-progress = out["progress"][0].numpy()  # shape (T,), values in [0, 1]
-```
-
 ### Use the reward factory

 ```python
@@ -119,26 +105,21 @@ reward_model = make_reward_model(cfg)
 preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
 ```

-The preprocessor writes normalised frames + task strings under the `observation.topreward.*` namespace; the model reads them in `compute_reward`.
+The preprocessor tokenises the full prompt (video + prefix + instruction suffix) and writes the Qwen-VL tensors and `prompt_length` under `observation.topreward.*`. The model reads those tensors, label-masks based on `prompt_length`, and extracts the log-prob reward.

 ### Offline dataset labeling

-Mirror the SARM / Robometer RA-BC flow — write a `topreward_progress.parquet` once, then reuse it for training (RA-BC) and visualisation (overlay videos):
+Write a `topreward_progress.parquet` for RA-BC training and overlay videos:

 ```bash
-# Fully dense per-frame labeling
-uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
-    --dataset-repo-id lerobot/libero_10_image \
-    --device cuda
-
 # Sparse-dense (15 anchors per episode, matches upstream)
 uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
    --dataset-repo-id lerobot/libero_10_image \
-    --num-prefixes 15 \
+    --num-samples 15 \
    --device cuda
 ```

-Then render the SARM-style progress overlay for any episode:
+Then render the progress overlay for any episode:

 ```bash
 uv run examples/dataset/create_progress_videos.py \
@@ -148,27 +129,28 @@ uv run examples/dataset/create_progress_videos.py \
    --gif
 ```

-## Publishing a named TOPReward configuration
+## Configuration Notes

-Because TOPReward stores no weights of its own, "publishing a TOPReward model" amounts to writing the LeRobot `config.json` (≈ 1 KB) that pins the VLM id, prompt and reduction:
+### Prompt knobs

-```python
-from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel
+The default prompt mirrors the upstream paper:

-cfg = TOPRewardConfig(
-    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
-    reduction="mean",
-    fps=2.0,
-)
-TOPRewardModel(cfg).save_pretrained("./topreward-qwen3vl-8b")
-# Push the directory to the Hub via `huggingface-cli` or `HfApi.upload_folder`.
+```text
+prompt_prefix = "The above video shows a robot manipulation trajectory that completes the following task: "
+prompt_suffix_template = "{instruction} Decide whether the above statement is True or not. The answer is: True"
 ```

-Reloading restores the same configuration (no weight download for TOPReward itself; the VLM is re-fetched via `vlm_name`):
+Both are exposed on `TOPRewardConfig` for ablation. The suffix template **must** contain `{instruction}`.
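
To illustrate how the two prompt knobs combine, here is a small sketch of the text-only part of prompt assembly. The two default strings are copied from the block above; `build_prompt_text` is a hypothetical helper, and the real processor additionally attaches the video and tokenises the result, which this sketch omits.

```python
PROMPT_PREFIX = (
    "The above video shows a robot manipulation trajectory "
    "that completes the following task: "
)
PROMPT_SUFFIX_TEMPLATE = (
    "{instruction} Decide whether the above statement is True or not. "
    "The answer is: True"
)


def build_prompt_text(
    instruction,
    prefix=PROMPT_PREFIX,
    suffix_template=PROMPT_SUFFIX_TEMPLATE,
):
    """Join prefix and filled-in suffix; reject templates without the placeholder."""
    if "{instruction}" not in suffix_template:
        raise ValueError("suffix template must contain '{instruction}'")
    return prefix + suffix_template.format(instruction=instruction)


text = build_prompt_text("Put the bowl on the plate.")
```

Because the suffix ends in the literal `True`, the last token of `text` is the one whose log-probability becomes the reward.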

-```python
-reloaded = TOPRewardModel.from_pretrained("./topreward-qwen3vl-8b")
-```
+### Chat template
+
+`add_chat_template=True` wraps the full prompt (including the instruction) with the tokenizer's chat template before tokenisation. The default is `False`, matching the upstream paper's main experiments.
+
+## Limitations
+
+- The current LeRobot port is **inference-only and zero-shot**; `forward()` is not overridden and `is_trainable` returns `False`.
+- Only the **Qwen3-VL family** is supported; other upstream clients are out of scope.
+- TOPReward inherits the underlying VLM's biases.

 ## References

@@ -189,3 +171,7 @@ reloaded = TOPRewardModel.from_pretrained("./topreward-qwen3vl-8b")
  year={2026}
 }
 ```
+
+## License
+
+The original TOPReward codebase is MIT-licensed. The LeRobot port follows the LeRobot Apache 2.0 license; the wrapped Qwen3-VL weights are subject to the original Qwen license.