optimize topreward input processing (#3660)

fix(rewards/topreward): fix pyproject extra typo and simplify processor (#3653)
Add lerobot[topreward] extra to all in pyproject.toml, drop the redundant labels arg in scoring, and collapse the dead-branch shape check in the encoder processor.
2026-05-26 14:09:47 +00:00 · 2026-05-25 16:07:45 +02:00 · 2026-05-23 00:27:09 +02:00 · 2026-05-21 11:05:02 +02:00 · 2026-05-20 17:39:21 +02:00 · 2026-05-19 18:00:18 +02:00
15 changed files with 1891 additions and 3 deletions
@@ -73,6 +73,8 @@
 - sections:
  - local: sarm
    title: SARM
+  - local: topreward
+    title: TOPReward
  title: "Reward Models"
 - sections:
  - local: inference
@@ -0,0 +1,177 @@
+# TOPReward
+
+TOPReward is a **zero-shot reward model** that extracts token log-probabilities from an off-the-shelf vision-language model (VLM) as a robotic reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood that the instruction is true — no fine-tuning required.
+
+**Paper**: [TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics](https://arxiv.org/abs/2602.19313)
+**Project**: [topreward.github.io](https://topreward.github.io/webpage/)
+**Original code**: [github.com/TOPReward/TOPReward](https://github.com/TOPReward/TOPReward)
+**Default backbone**: [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+
+## Overview
+
+TOPReward asks a generic VLM how likely a task instruction is, **conditioned on the video** of a robot trying to complete that task. Concretely, given:
+
+- A trajectory video (a sequence of frames).
+- A task instruction (e.g. _"open the drawer"_).
+
+it builds a chat prompt of the form
+
+```text
+<video>
+"The above video shows a robot manipulation trajectory that completes the
+ following task: <instruction> Decide whether the above statement is True
+ or not. The answer is: True"
+```
+
+forwards it through the VLM, label-masks everything except the very last token, and reads back the log-probability of that token — by default the literal `"True"` that closes the suffix template. The resulting `log P("True" | video + prompt + instruction)` is the reward.
+
+Because the method only depends on a frozen VLM, TOPReward is **zero-shot**: there are no fine-tuned weights to host. The "model" in LeRobot is a small wrapper around `transformers`' `Qwen3VLForConditionalGeneration` plus the label-masking logic. The processor owns the tokeniser and builds the full chat prompt (EO-1/Robometer pattern).
+
+## What the LeRobot integration covers
+
+- Standard `reward_model.type=topreward` configuration through LeRobot.
+- VLM loading via the `transformers` `Qwen3VLForConditionalGeneration` API.
+- Prompt assembly + tokenisation in the processor (matching upstream `QwenClient.compute_instruction_reward`).
+- `compute_reward()` returns one scalar log-prob per sample.
+- LeRobot reward-model save/load — `save_pretrained` writes only `config.json` (the VLM is identified by `vlm_name`).
+- An offline labeling script that writes a `topreward_progress.parquet` (SARM-compatible schema) for RA-BC training and progress-overlay videos.
+
+The current LeRobot port supports the **Qwen3-VL client only**. Other upstream clients (Gemini, OpenAI, Gemma, Molmo) can be added as follow-up extras.
+
+## Installation Requirements
+
+1. Install LeRobot following the [Installation Guide](./installation).
+2. Install the TOPReward optional extra:
+
+```bash
+pip install -e ".[topreward]"
+```
+
+or, with `uv` from a source checkout:
+
+```bash
+uv sync --extra topreward
+```
+
+This pulls in `transformers`. The first time you run TOPReward, Hugging Face will also download the VLM weights from the Hub (~16 GB for Qwen3-VL-8B-Instruct). A GPU is strongly recommended.
+
+## Model Inputs and Outputs
+
+TOPReward expects:
+
+- A trajectory video or sequence of frames.
+- A natural-language task description.
+
+In LeRobot datasets the preprocessor reads:
+
+| Config field              | Default                     | Meaning                                       |
+| ------------------------- | --------------------------- | --------------------------------------------- |
+| `reward_model.image_key`  | `observation.images.top`    | Camera observation used by TOPReward          |
+| `reward_model.task_key`   | `task`                      | Key in complementary data for the task string |
+| `reward_model.max_frames` | `16`                        | Cap on frames per sample                      |
+| `reward_model.fps`        | `2.0`                       | Metadata passed to the Qwen video processor   |
+| `reward_model.vlm_name`   | `Qwen/Qwen3-VL-8B-Instruct` | Hugging Face Hub id of the underlying VLM     |
+
+The model returns:
+
+- `compute_reward(batch)`: one log-probability per sample. Higher = better task-video alignment. When `success_threshold` is finite, returns the binary thresholded value instead.
+
+## Usage
+
+### Load the reward model directly
+
+```python
+from lerobot.rewards.topreward import TOPRewardConfig, TOPRewardModel
+
+cfg = TOPRewardConfig(
+    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+    device="cuda",
+)
+reward_model = TOPRewardModel(cfg)
+```
+
+### Use the reward factory
+
+```python
+from lerobot.rewards import make_reward_model, make_reward_model_config, make_reward_pre_post_processors
+
+cfg = make_reward_model_config(
+    "topreward",
+    vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+    device="cuda",
+    image_key="observation.images.top",
+)
+reward_model = make_reward_model(cfg)
+preprocessor, postprocessor = make_reward_pre_post_processors(cfg)
+```
+
+The preprocessor tokenises the full prompt (video + prefix + instruction suffix) and writes the Qwen-VL tensors and `prompt_length` under `observation.topreward.*`. The model reads those tensors, label-masks based on `prompt_length`, and extracts the log-prob reward.
+
+### Offline dataset labeling
+
+Write a `topreward_progress.parquet` for RA-BC training and overlay videos:
+
+```bash
+# Sparse-dense (15 anchors per episode, matches upstream)
+uv run python -m lerobot.rewards.topreward.compute_rabc_weights \
+    --dataset-repo-id lerobot/libero_10_image \
+    --num-samples 15 \
+    --device cuda
+```
+
+Then render the progress overlay for any episode:
+
+```bash
+uv run examples/dataset/create_progress_videos.py \
+    --repo-id lerobot/libero_10_image \
+    --episode 0 \
+    --progress-file topreward_progress.parquet \
+    --gif
+```
+
+## Configuration Notes
+
+### Prompt knobs
+
+The default prompt mirrors the upstream paper:
+
+```text
+prompt_prefix = "The above video shows a robot manipulation trajectory that completes the following task: "
+prompt_suffix_template = "{instruction} Decide whether the above statement is True or not. The answer is: True"
+```
+
+Both are exposed on `TOPRewardConfig` for ablation. The suffix template **must** contain `{instruction}`.
+
+### Chat template
+
+`add_chat_template=True` wraps the full prompt (including instruction) with the tokenizer's chat template before tokenisation. Default is `False`, matching the upstream paper's main experiments.
+
+## Limitations
+
+- The current LeRobot port is **inference-only and zero-shot**; `forward()` is not overridden and `is_trainable` returns `False`.
+- Only the **Qwen3-VL family** is supported; other upstream clients are out of scope.
+- TOPReward inherits the underlying VLM's biases.
+
+## References
+
+- [TOPReward project page](https://topreward.github.io/webpage/)
+- [TOPReward paper](https://arxiv.org/abs/2602.19313)
+- [Original TOPReward code](https://github.com/TOPReward/TOPReward)
+- [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+
+## Citation
+
+```bibtex
+@article{chen2026topreward,
+  title={TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
+  author={Chen, Shirui and Harrison, Cole and Lee, Ying-Chun and Yang, Angela Jin and
+          Ren, Zhongzheng and Ratliff, Lillian J and Duan, Jiafei and Fox, Dieter and
+          Krishna, Ranjay},
+  journal={arXiv preprint arXiv:2602.19313},
+  year={2026}
+}
+```
+
+## License
+
+The original TOPReward codebase is MIT-licensed. The LeRobot port follows the LeRobot Apache 2.0 license; the wrapped Qwen3-VL weights are subject to the original Qwen license.
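The scoring rule described in the TOPReward overview (read back the log-probability of the final token, then optionally threshold it) can be sketched without any model. This is a minimal, dependency-free illustration on toy logits, not the LeRobot implementation: `log_softmax`, `last_token_log_prob`, and `threshold_reward` are hypothetical helper names, and the logits are made up. It assumes the usual causal-LM convention that the logits at position `t` score the token at position `t + 1`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum_exp for x in logits]


def last_token_log_prob(logits_per_position, token_ids):
    """Log-probability of the final token given everything before it.

    Assumption: logits_per_position[t] scores token_ids[t + 1] (causal LM shift),
    so the final token is scored by the logits at the second-to-last position.
    """
    target = token_ids[-1]
    scoring_position = len(token_ids) - 2
    return log_softmax(logits_per_position[scoring_position])[target]


def threshold_reward(log_prob, success_threshold=float("-inf")):
    """Return the raw log-prob, or a 0/1 value when the threshold is finite."""
    if math.isfinite(success_threshold):
        return 1.0 if log_prob > success_threshold else 0.0
    return log_prob


# Toy example: vocabulary of 4 tokens, sequence of 3 tokens, uniform logits.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
toy_tokens = [1, 2, 3]
reward = last_token_log_prob(toy_logits, toy_tokens)  # equals log(1/4)
```

With the default `success_threshold` of negative infinity the raw log-probability passes through unchanged; a finite threshold turns it into a binary success signal, which is the behaviour the docs describe for `compute_reward()`.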
@@ -209,6 +209,7 @@ groot = [
    "flash-attn>=2.5.9,<3.0.0 ; sys_platform != 'darwin'"
 ]
 sarm = ["lerobot[transformers-dep]", "pydantic>=2.0.0,<3.0.0", "faker>=33.0.0,<35.0.0", "lerobot[matplotlib-dep]", "lerobot[qwen-vl-utils-dep]"]
+topreward = ["lerobot[transformers-dep]"]
 xvla = ["lerobot[transformers-dep]"]
 eo1 = ["lerobot[transformers-dep]", "lerobot[qwen-vl-utils-dep]"]
 hilserl = ["lerobot[transformers-dep]", "lerobot[dataset]", "gym-hil>=0.1.13,<0.2.0", "lerobot[grpcio-dep]", "lerobot[placo-dep]"]
@@ -286,6 +287,7 @@ all = [
    "lerobot[libero]; sys_platform == 'linux'",
    "lerobot[metaworld]",
    "lerobot[sarm]",
+    "lerobot[topreward]",
    "lerobot[peft]",
    # "lerobot[unitree_g1]", TODO: Unitree requires specific installation instructions for unitree_sdk2
 ]
@@ -21,11 +21,13 @@ from .factory import (
 )
 from .pretrained import PreTrainedRewardModel as PreTrainedRewardModel
 from .sarm.configuration_sarm import SARMConfig as SARMConfig
+from .topreward.configuration_topreward import TOPRewardConfig as TOPRewardConfig

 __all__ = [
    # Configuration classes
    "RewardClassifierConfig",
    "SARMConfig",
+    "TOPRewardConfig",
    # Base class
    "PreTrainedRewardModel",
    # Factory functions
@@ -26,6 +26,7 @@ from lerobot.processor import PolicyAction, PolicyProcessorPipeline
 from .classifier.configuration_classifier import RewardClassifierConfig
 from .pretrained import PreTrainedRewardModel
 from .sarm.configuration_sarm import SARMConfig
+from .topreward.configuration_topreward import TOPRewardConfig


 def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
@@ -37,7 +38,7 @@ def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:

    Args:
        name: The name of the reward model. Supported names are "reward_classifier",
-              "sarm".
+              "sarm", "topreward".

    Returns:
        The reward model class corresponding to the given name.
@@ -53,6 +54,10 @@ def get_reward_model_class(name: str) -> type[PreTrainedRewardModel]:
        from lerobot.rewards.sarm.modeling_sarm import SARMRewardModel

        return SARMRewardModel
+    elif name == "topreward":
+        from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+        return TOPRewardModel
    else:
        try:
            return _get_reward_model_cls_from_name(name=name)
@@ -69,7 +74,7 @@ def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:

    Args:
        reward_type: The type of the reward model. Supported types include
-                     "reward_classifier", "sarm".
+                     "reward_classifier", "sarm", "topreward".
        **kwargs: Keyword arguments to be passed to the configuration class constructor.

    Returns:
@@ -82,6 +87,8 @@ def make_reward_model_config(reward_type: str, **kwargs) -> RewardModelConfig:
        return RewardClassifierConfig(**kwargs)
    elif reward_type == "sarm":
        return SARMConfig(**kwargs)
+    elif reward_type == "topreward":
+        return TOPRewardConfig(**kwargs)
    else:
        try:
            config_cls = RewardModelConfig.get_choice_class(reward_type)
@@ -162,6 +169,14 @@ def make_reward_pre_post_processors(
            dataset_meta=kwargs.get("dataset_meta"),
        )

+    elif isinstance(reward_cfg, TOPRewardConfig):
+        from lerobot.rewards.topreward.processor_topreward import make_topreward_pre_post_processors
+
+        return make_topreward_pre_post_processors(
+            config=reward_cfg,
+            dataset_stats=kwargs.get("dataset_stats"),
+        )
+
    else:
        try:
            processors = _make_processors_from_reward_model_config(
@@ -0,0 +1,19 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .configuration_topreward import TOPRewardConfig
+from .modeling_topreward import TOPRewardModel
+from .processor_topreward import make_topreward_pre_post_processors
+
+__all__ = ["TOPRewardConfig", "TOPRewardModel", "make_topreward_pre_post_processors"]
@@ -0,0 +1,353 @@
+#!/usr/bin/env python
+
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Compute per-frame TOPReward progress curves for a LeRobot dataset.
+
+For each episode, scores trajectory prefixes of increasing length using
+the TOPReward reward model, min-max normalises the raw log-prob rewards per episode,
+and writes a parquet file with one row per frame.
+
+The parquet uses the same schema as SARM's :mod:`lerobot.rewards.sarm.compute_rabc_weights`.
+
+Usage:
+    # Sparse-dense mode (15 anchors per episode, matches upstream)
+    python -m lerobot.rewards.topreward.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --num-samples 15
+
+    # Use a different VLM backbone
+    python -m lerobot.rewards.topreward.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --vlm-name Qwen/Qwen3-VL-4B-Instruct
+"""
+
+from __future__ import annotations
+
+import argparse
+import logging
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import pyarrow as pa
+import pyarrow.parquet as pq
+import torch
+from tqdm import tqdm
+
+from lerobot.datasets import LeRobotDataset
+from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig
+from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+from lerobot.rewards.topreward.processor_topreward import TOPRewardEncoderProcessorStep
+from lerobot.types import TransitionKey
+
+DEFAULT_OUTPUT_FILENAME = "topreward_progress.parquet"
+
+
+def get_reward_model_path_from_parquet(parquet_path: Path) -> str | None:
+    """Read ``reward_model_path`` from parquet metadata if available."""
+    if not parquet_path.exists():
+        return None
+    try:
+        metadata = pq.read_metadata(parquet_path).schema.to_arrow_schema().metadata
+        if metadata and b"reward_model_path" in metadata:
+            return metadata[b"reward_model_path"].decode()
+    except Exception:  # nosec B110
+        return None
+    return None
+
+
+def _resolve_task(sample: dict[str, Any], default: str) -> str:
+    """Best-effort task extraction from a dataset sample."""
+    task = sample.get("task")
+    if isinstance(task, str) and task:
+        return task
+    return default
+
+
+def normalize_rewards(rewards: list[float] | np.ndarray) -> np.ndarray:
+    """Min-max normalise raw log-prob rewards into ``[0, 1]``."""
+    rewards_arr = np.asarray(rewards, dtype=np.float64)
+    if rewards_arr.size == 0:
+        return rewards_arr.astype(np.float32)
+    if rewards_arr.size == 1:
+        return np.array([1.0], dtype=np.float32)
+    r_min, r_max = rewards_arr.min(), rewards_arr.max()
+    if r_max == r_min:
+        return np.ones_like(rewards_arr, dtype=np.float32)
+    return ((rewards_arr - r_min) / (r_max - r_min)).astype(np.float32)
+
+
+def compute_instruction_rewards_for_prefixes(
+    model: TOPRewardModel,
+    encoder: TOPRewardEncoderProcessorStep,
+    dataset: LeRobotDataset,
+    ep_start: int,
+    num_frames: int,
+    task: str,
+    image_key: str,
+    num_samples: int | None,
+    device: str,
+) -> np.ndarray:
+    """Score an episode via prefix sweep and return a per-frame normalised curve."""
+    if num_samples is None or num_samples >= num_frames:
+        prefix_lengths = np.arange(1, num_frames + 1, dtype=np.int64)
+    else:
+        prefix_lengths = np.unique(np.linspace(1, num_frames, num_samples).round().astype(np.int64))
+
+    episode_frames = torch.stack([dataset[ep_start + i][image_key] for i in range(num_frames)])
+    rewards: list[float] = []
+    for length in prefix_lengths:
+        frames = episode_frames[: int(length)].unsqueeze(0)  # (1, T, C, H, W)
+
+        transition = {
+            TransitionKey.OBSERVATION: {image_key: frames},
+            TransitionKey.COMPLEMENTARY_DATA: {"task": task},
+        }
+        encoded = encoder(transition)
+        obs = encoded[TransitionKey.OBSERVATION]
+        batch = {
+            key: value.to(device) if isinstance(value, torch.Tensor) else value for key, value in obs.items()
+        }
+
+        with torch.no_grad():
+            reward = model.compute_reward(batch)
+        rewards.append(float(reward.item()))
+
+    normalized_rewards = normalize_rewards(rewards)
+
+    if prefix_lengths.shape[0] == num_frames:
+        return normalized_rewards
+
+    return np.interp(
+        np.arange(1, num_frames + 1, dtype=np.float64),
+        prefix_lengths.astype(np.float64),
+        normalized_rewards.astype(np.float64),
+    ).astype(np.float32)
+
+
+def compute_topreward_progress(
+    dataset_repo_id: str,
+    reward_model_path: str | None = None,
+    vlm_name: str | None = None,
+    output_path: str | None = None,
+    device: str = "cuda",
+    num_samples: int | None = None,
+    fps: float | None = None,
+    episodes: list[int] | None = None,
+) -> Path:
+    """Run TOPReward over a dataset and write per-frame progress."""
+    if reward_model_path is not None:
+        logging.info(f"Loading TOPReward config from: {reward_model_path}")
+        model = TOPRewardModel.from_pretrained(reward_model_path)
+        config = model.config
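+        config.device = device
+        if fps is not None:
+            # Apply the --fps override on this path too; otherwise it is silently ignored.
+            config.fps = fps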
+        if vlm_name is not None and vlm_name != config.vlm_name:
+            logging.info(f"Overriding vlm_name from config: {config.vlm_name} -> {vlm_name}")
+            config.vlm_name = vlm_name
+            model = TOPRewardModel(config)
+    else:
+        config_kwargs: dict[str, Any] = {"device": device}
+        if vlm_name is not None:
+            config_kwargs["vlm_name"] = vlm_name
+        if fps is not None:
+            config_kwargs["fps"] = fps
+        config = TOPRewardConfig(**config_kwargs)
+        logging.info(f"Constructing TOPReward with VLM: {config.vlm_name}")
+        model = TOPRewardModel(config)
+
+    model.to(device).eval()
+
+    encoder = TOPRewardEncoderProcessorStep(
+        vlm_name=config.vlm_name,
+        image_key=config.image_key,
+        task_key=config.task_key,
+        default_task=config.default_task,
+        max_frames=None,  # no tail-crop: we control prefix length explicitly
+        fps=config.fps,
+        prompt_prefix=config.prompt_prefix,
+        prompt_suffix_template=config.prompt_suffix_template,
+        add_chat_template=config.add_chat_template,
+        max_length=config.max_input_length,
+    )
+
+    image_key = config.image_key
+
+    logging.info(f"Loading dataset: {dataset_repo_id}")
+    dataset = LeRobotDataset(dataset_repo_id, download_videos=True)
+    logging.info(f"Dataset: {dataset.num_episodes} episodes, {dataset.num_frames} frames")
+
+    episode_indices = list(range(dataset.num_episodes)) if episodes is None else episodes
+    logging.info(f"Processing {len(episode_indices)} episode(s)")
+
+    all_index: list[int] = []
+    all_episode: list[int] = []
+    all_frame: list[int] = []
+    all_progress: list[float] = []
+
+    for episode_idx in tqdm(episode_indices, desc="Episodes"):
+        ep = dataset.meta.episodes[episode_idx]
+        ep_start = int(ep["dataset_from_index"])
+        ep_end = int(ep["dataset_to_index"])
+        num_frames = ep_end - ep_start
+        if num_frames <= 0:
+            continue
+
+        first_sample = dataset[ep_start]
+        task = _resolve_task(first_sample, default=config.default_task or "perform the task")
+
+        per_frame = compute_instruction_rewards_for_prefixes(
+            model=model,
+            encoder=encoder,
+            dataset=dataset,
+            ep_start=ep_start,
+            num_frames=num_frames,
+            task=task,
+            image_key=image_key,
+            num_samples=num_samples,
+            device=device,
+        )
+
+        for local in range(num_frames):
+            all_index.append(ep_start + local)
+            all_episode.append(episode_idx)
+            all_frame.append(local)
+            all_progress.append(float(per_frame[local]))
+
+        if device.startswith("cuda"):
+            torch.cuda.empty_cache()
+
+    table = pa.table(
+        {
+            "index": np.asarray(all_index, dtype=np.int64),
+            "episode_index": np.asarray(all_episode, dtype=np.int64),
+            "frame_index": np.asarray(all_frame, dtype=np.int64),
+            "progress_sparse": np.asarray(all_progress, dtype=np.float32),
+        }
+    )
+
+    schema_metadata: dict[bytes, bytes] = {b"vlm_name": config.vlm_name.encode()}
+    if reward_model_path is not None:
+        schema_metadata[b"reward_model_path"] = reward_model_path.encode()
+    table = table.replace_schema_metadata(schema_metadata)
+
+    out = Path(dataset.root) / DEFAULT_OUTPUT_FILENAME if output_path is None else Path(output_path)
+    out.parent.mkdir(parents=True, exist_ok=True)
+    pq.write_table(table, out)
+    logging.info(f"Saved {len(table)} frame values to {out}")
+
+    progress_arr = np.asarray(all_progress, dtype=np.float32)
+    if progress_arr.size:
+        logging.info(
+            f"Progress: mean={float(progress_arr.mean()):.4f}, "
+            f"std={float(progress_arr.std()):.4f}, "
+            f"min={float(progress_arr.min()):.4f}, "
+            f"max={float(progress_arr.max()):.4f}"
+        )
+    return out
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Compute per-frame TOPReward progress curves for RA-BC weighting.",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+    # Sparse-dense mode (matches upstream TOPReward num_samples=15)
+    python -m lerobot.rewards.topreward.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --num-samples 15
+
+    # Use a smaller VLM
+    python -m lerobot.rewards.topreward.compute_rabc_weights \\
+        --dataset-repo-id lerobot/libero_10_image \\
+        --vlm-name Qwen/Qwen3-VL-4B-Instruct
+        """,
+    )
+    parser.add_argument(
+        "--dataset-repo-id", type=str, required=True, help="HuggingFace dataset repo id or local path."
+    )
+    parser.add_argument(
+        "--reward-model-path",
+        type=str,
+        default=None,
+        help="Optional local directory or Hub repo holding a saved TOPReward config.json.",
+    )
+    parser.add_argument("--vlm-name", type=str, default=None, help="Override the VLM backbone (HF Hub id).")
+    parser.add_argument("--output-path", type=str, default=None, help="Output parquet path.")
+    parser.add_argument("--device", type=str, default="cuda", help="Device to use (default: cuda).")
+    parser.add_argument(
+        "--num-samples",
+        type=int,
+        default=None,
+        help="Number of anchor prefixes scored per episode. Omit to score every frame (dense). 15 matches upstream.",
+    )
+    parser.add_argument(
+        "--episodes",
+        type=int,
+        nargs="+",
+        default=None,
+        help="Process only these episode indices (e.g. --episodes 0 or --episodes 0 5 10).",
+    )
+    parser.add_argument("--fps", type=float, default=None, help="Override TOPRewardConfig.fps.")
+    parser.add_argument(
+        "--push-to-hub", action="store_true", help="Upload to the dataset repo on HuggingFace Hub."
+    )
+
+    args = parser.parse_args()
+
+    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+
+    output_path = compute_topreward_progress(
+        dataset_repo_id=args.dataset_repo_id,
+        reward_model_path=args.reward_model_path,
+        vlm_name=args.vlm_name,
+        output_path=args.output_path,
+        device=args.device,
+        num_samples=args.num_samples,
+        fps=args.fps,
+        episodes=args.episodes,
+    )
+
+    print(f"\nTOPReward progress saved to: {output_path}")
+
+    if args.push_to_hub:
+        from huggingface_hub import HfApi
+
+        api = HfApi()
+        hub_path = DEFAULT_OUTPUT_FILENAME
+
+        print(f"\nUploading to Hub: {args.dataset_repo_id}/{hub_path}")
+        api.upload_file(
+            path_or_fileobj=str(output_path),
+            path_in_repo=hub_path,
+            repo_id=args.dataset_repo_id,
+            repo_type="dataset",
+        )
+        print(
+            "Successfully uploaded to: "
+            f"https://huggingface.co/datasets/{args.dataset_repo_id}/blob/main/{hub_path}"
+        )
+
+        print("\nTo use in training, add to your config:")
+        print("  use_rabc: true")
+        print(f"  rabc_progress_path: hf://datasets/{args.dataset_repo_id}/{hub_path}")
+        print("  rabc_head_mode: sparse")
+    else:
+        print("\nTo use in training, add to your config:")
+        print("  use_rabc: true")
+        print(f"  rabc_progress_path: {output_path}")
+        print("  rabc_head_mode: sparse")
+
+
+if __name__ == "__main__":
+    main()
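The prefix sweep in `compute_instruction_rewards_for_prefixes` (pick anchor prefix lengths, min-max normalise the anchor rewards, then linearly interpolate back to one value per frame) can be restated without NumPy to make the arithmetic easy to follow. This is a rough sketch, not the script's code: the helper names `anchor_prefix_lengths`, `min_max_normalize`, and `interpolate_per_frame` are hypothetical, the raw rewards are invented, and it assumes `num_samples >= 1`.

```python
def anchor_prefix_lengths(num_frames, num_samples=None):
    """Evenly spaced prefix lengths in [1, num_frames], deduplicated and sorted."""
    if num_samples is None or num_samples >= num_frames:
        return list(range(1, num_frames + 1))  # dense: score every prefix
    if num_samples == 1:
        return [1]
    step = (num_frames - 1) / (num_samples - 1)
    return sorted({round(1 + i * step) for i in range(num_samples)})


def min_max_normalize(rewards):
    """Map raw log-prob rewards into [0, 1]; constant input maps to all ones."""
    if not rewards:
        return []
    low, high = min(rewards), max(rewards)
    if high == low:
        return [1.0] * len(rewards)
    return [(r - low) / (high - low) for r in rewards]


def interpolate_per_frame(anchors, values, num_frames):
    """Linear interpolation from anchor values to frames 1..num_frames."""
    if len(anchors) == 1:
        return [values[0]] * num_frames
    curve = []
    seg = 0  # index of the anchor segment that contains the current frame
    for frame in range(1, num_frames + 1):
        while seg < len(anchors) - 2 and frame > anchors[seg + 1]:
            seg += 1
        x0, x1 = anchors[seg], anchors[seg + 1]
        t = min(max((frame - x0) / (x1 - x0), 0.0), 1.0)
        curve.append(values[seg] + t * (values[seg + 1] - values[seg]))
    return curve


# Toy episode: 5 frames, 3 anchors, made-up raw log-prob rewards.
anchors = anchor_prefix_lengths(num_frames=5, num_samples=3)
normalized = min_max_normalize([-4.0, -2.0, -1.0])
curve = interpolate_per_frame(anchors, normalized, num_frames=5)
```

Because normalisation is applied per episode, the resulting curve is only comparable within an episode: every episode with non-constant rewards spans the full `[0, 1]` range regardless of how confident the VLM actually was.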
@@ -0,0 +1,146 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+from lerobot.configs import FeatureType, NormalizationMode, PolicyFeature
+from lerobot.configs.rewards import RewardModelConfig
+from lerobot.utils.constants import OBS_IMAGES
+
+# Default prompt scaffolding from the upstream TOPReward paper / reference
+# implementation (``QwenClient.compute_instruction_reward``). The prompt
+# scores the terminal ``True`` token in ``f"{instruction} ... True"``
+# given the video.
+DEFAULT_PROMPT_PREFIX = (
+    "The above video shows a robot manipulation trajectory that completes the following task: "
+)
+DEFAULT_PROMPT_SUFFIX_TEMPLATE = (
+    "{instruction} Decide whether the above statement is True or not. The answer is: True"
+)
+
+
+@RewardModelConfig.register_subclass("topreward")
+@dataclass
+class TOPRewardConfig(RewardModelConfig):
+    """Configuration for the TOPReward zero-shot reward model.
+
+    TOPReward is **zero-shot**: it has no learnable parameters of its own.
+    The "model" is a generic vision-language model (default
+    ``Qwen/Qwen3-VL-8B-Instruct``) used with a fixed prompt to extract
+    token log-probabilities as a reward signal. There is therefore no
+    fine-tuned checkpoint to host: ``pretrained_path`` is unused at
+    runtime — the model identity is :attr:`vlm_name` (an HF Hub id).
+
+    Args:
+        vlm_name: Hugging Face Hub id of the underlying VLM. Must be a
+            Qwen3-VL family model (the only client implemented in this
+            LeRobot port).
+        torch_dtype: Torch dtype name passed to the VLM loader
+            (``"auto"``, ``"bfloat16"``, ``"float16"``, ...).
+        attn_implementation: ``transformers`` attention implementation
+            (e.g. ``"flash_attention_2"``, ``"sdpa"``). Defaults to
+            ``None`` so the upstream picks the best available.
+        image_key: Observation key that holds the trajectory frames.
+        task_key: Complementary-data key that holds the task instruction.
+        default_task: Fallback instruction when ``task_key`` is absent.
+        max_frames: Cap on the number of frames fed to the VLM per
+            sample. ``None`` = use all frames.
+        fps: Frames-per-second metadata for the Qwen video processor.
+        prompt_prefix: Text shown to the VLM right after the video and
+            before the suffix template.
+        prompt_suffix_template: Suffix appended after ``prompt_prefix``.
+            Must contain ``{instruction}``; the VLM scores the
+            log-likelihood of the tokens that follow the prefix.
+        add_chat_template: If ``True``, wrap the full prompt with the
+            tokenizer's chat template before tokenisation (matches
+            upstream ``add_chat_template=True``).
+        success_threshold: Optional log-prob threshold. If finite,
+            :meth:`TOPRewardModel.compute_reward` returns
+            ``(reward > success_threshold).float()`` instead of the raw
+            log-prob.
+        max_input_length: Hard limit on the total tokenized input length;
+            samples that exceed it raise a ``ValueError``.
+    """
+
+    # Path to a local LeRobot dir or HF repo that holds a ``config.json``
+    # snapshot of this TOPRewardConfig. The VLM weights themselves are
+    # always identified by ``vlm_name``.
+    pretrained_path: str | None = None
+
+    vlm_name: str = "Qwen/Qwen3-VL-8B-Instruct"
+    torch_dtype: str = "auto"
+    attn_implementation: str | None = None
+
+    image_key: str = OBS_IMAGES + ".top"
+    task_key: str = "task"
+    default_task: str | None = None
+    max_frames: int | None = 16
+    fps: float = 2.0
+
+    prompt_prefix: str = DEFAULT_PROMPT_PREFIX
+    prompt_suffix_template: str = DEFAULT_PROMPT_SUFFIX_TEMPLATE
+    add_chat_template: bool = False
+
+    success_threshold: float = float("-inf")
+    max_input_length: int = 32768
+
+    license: str | None = "mit"  # matches upstream TOPReward
+    tags: list[str] | None = field(
+        default_factory=lambda: ["reward-model", "vision-language", "qwen3-vl", "zero-shot"]
+    )
+
+    input_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    output_features: dict[str, PolicyFeature] = field(default_factory=dict)
+    normalization_mapping: dict[str, NormalizationMode] = field(
+        default_factory=lambda: {
+            "VISUAL": NormalizationMode.IDENTITY,
+            "REWARD": NormalizationMode.IDENTITY,
+        }
+    )
+
+    def __post_init__(self) -> None:
+        super().__post_init__()
+        if self.max_frames is not None and self.max_frames < 1:
+            raise ValueError(f"max_frames must be >= 1, got {self.max_frames}")
+        if self.fps <= 0:
+            raise ValueError(f"fps must be > 0, got {self.fps}")
+        if "{instruction}" not in self.prompt_suffix_template:
+            raise ValueError(
+                "prompt_suffix_template must contain `{instruction}` so the model "
+                "scores the log-likelihood of the task suffix."
+            )
+        if self.max_input_length <= 0:
+            raise ValueError(f"max_input_length must be > 0, got {self.max_input_length}")
+
+        if self.image_key not in self.input_features:
+            self.input_features[self.image_key] = PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL)
+        self.output_features.setdefault("reward", PolicyFeature(shape=(1,), type=FeatureType.REWARD))
+
+    @property
+    def observation_delta_indices(self) -> list[int] | None:
+        return None
+
+    @property
+    def action_delta_indices(self) -> None:
+        return None
+
+    @property
+    def reward_delta_indices(self) -> None:
+        return None
+
+    def validate_features(self) -> None:
+        if self.image_key not in self.input_features:
+            raise ValueError(f"TOPReward requires image input feature {self.image_key!r}")
@@ -0,0 +1,238 @@
+# Copyright 2026 Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang,
+# Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
+# and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics.
+
+Paper:         https://arxiv.org/abs/2602.19313
+Project:       https://topreward.github.io/webpage/
+Original code: https://github.com/TOPReward/TOPReward
+Backbone:      https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct  (default)
+
+TOPReward is a **zero-shot** reward model: it has no fine-tuned weights of
+its own. Given a video trajectory and a task instruction, it asks an
+off-the-shelf VLM how likely the instruction is, conditioned on the video,
+and returns that log-likelihood as the reward signal.
+
+Inference recipe:
+
+1. The processor builds a chat-style prompt, tokenises it, and emits
+   ``input_ids``, ``attention_mask``, vision tensors, and ``labels``.
+   The processor label-masks everything except the terminal answer token with
+   ``-100``.
+2. Forward the full token sequence through the VLM.
+3. Read the terminal answer token log-probability from the logits as the
+   scalar reward.
+
+With the default ``prompt_suffix_template``, the only unmasked token is the
+literal ``"True"`` at the end — the reward is
+``log P("True" | video + prompt + instruction)``.
+
+This LeRobot port is **inference-only and not trainable** — :meth:`forward`
+is intentionally inherited from :class:`PreTrainedRewardModel` and raises
+``NotImplementedError``, making :attr:`PreTrainedRewardModel.is_trainable`
+return ``False``.
+
+Because the VLM weights live on the Hugging Face Hub under their canonical
+id (``Qwen/Qwen3-VL-8B-Instruct`` etc.) and TOPReward never modifies them,
+:meth:`_save_pretrained` and :meth:`from_pretrained` are overridden so a
+TOPReward LeRobot "checkpoint" is a single ``config.json`` (the VLM is
+re-fetched from the Hub at load time).
+"""
+
+from __future__ import annotations
+
+import builtins
+import logging
+import os
+from pathlib import Path
+from tempfile import TemporaryDirectory
+from typing import TYPE_CHECKING, Any, TypeVar
+
+import numpy as np
+import torch
+from huggingface_hub import HfApi, hf_hub_download
+from huggingface_hub.constants import CONFIG_NAME
+from huggingface_hub.errors import HfHubHTTPError
+from torch import Tensor
+from torch.nn.functional import cross_entropy
+
+from lerobot.configs.rewards import RewardModelConfig
+from lerobot.rewards.pretrained import PreTrainedRewardModel
+from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig
+from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX, TOPREWARD_INPUT_KEYS
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING:
+    from lerobot.configs.train import TrainPipelineConfig
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import Qwen3VLForConditionalGeneration
+else:
+    Qwen3VLForConditionalGeneration = None  # type: ignore[assignment]
+
+logger = logging.getLogger(__name__)
+
+T = TypeVar("T", bound="TOPRewardModel")
+
+
+def _torch_dtype(name: str) -> torch.dtype | str:
+    """Resolve a torch dtype name; ``"auto"`` is passed through verbatim."""
+    if name == "auto":
+        return "auto"
+    dtype = getattr(torch, name, None)
+    if isinstance(dtype, torch.dtype):
+        return dtype
+    raise ValueError(f"Unknown torch dtype: {name!r}")
+
+
+class TOPRewardModel(PreTrainedRewardModel):
+    """TOPReward zero-shot reward model."""
+
+    name = "topreward"
+    config_class = TOPRewardConfig
+
+    def __init__(self, config: TOPRewardConfig) -> None:
+        require_package("transformers", extra="topreward")
+        super().__init__(config)
+        self.config = config
+
+        torch_dtype = _torch_dtype(config.torch_dtype)
+        model_kwargs: dict[str, Any] = {"dtype": torch_dtype, "trust_remote_code": True}
+        if config.attn_implementation is not None:
+            model_kwargs["attn_implementation"] = config.attn_implementation
+
+        self.model = Qwen3VLForConditionalGeneration.from_pretrained(config.vlm_name, **model_kwargs)
+
+    def compute_reward(self, batch: dict[str, Any]) -> Tensor:
+        """Return one reward per sample: a log-prob, or 0/1 if ``success_threshold`` is finite."""
+        inputs: dict[str, Any] = {}
+        for key in TOPREWARD_INPUT_KEYS:
+            batch_key = f"{TOPREWARD_FEATURE_PREFIX}{key}"
+            if batch_key not in batch:
+                raise KeyError(
+                    f"TOPReward batch missing `{batch_key}`. Make sure the "
+                    "TOPRewardEncoderProcessorStep ran before `compute_reward`."
+                )
+            inputs[key] = batch[batch_key]
+
+        device = next(self.model.parameters()).device
+        inputs = {key: value.to(device) if hasattr(value, "to") else value for key, value in inputs.items()}
+        labels = inputs.pop("labels")
+        inputs["logits_to_keep"] = 2
+
+        self.eval()
+        with torch.no_grad():
+            outputs = self.model(**inputs)
+        logits = outputs.logits
+        rewards = -cross_entropy(logits[:, -2, :].float(), labels[:, -1], reduction="none")
+        if np.isfinite(self.config.success_threshold):
+            rewards = (rewards > self.config.success_threshold).float()
+        return rewards.to(self.config.device or "cpu")
+
+    def _save_pretrained(self, save_directory: Path) -> None:
+        """Save ``config.json`` only."""
+        self.config._save_pretrained(save_directory)
+
+    @classmethod
+    def from_pretrained(
+        cls: builtins.type[T],
+        pretrained_name_or_path: str | Path,
+        *,
+        config: RewardModelConfig | None = None,
+        force_download: bool = False,
+        resume_download: bool | None = None,
+        proxies: dict | None = None,
+        token: str | bool | None = None,
+        cache_dir: str | Path | None = None,
+        local_files_only: bool = False,
+        revision: str | None = None,
+        strict: bool = False,  # noqa: ARG003 — accepted for API parity; unused (no safetensors to load)
+        **kwargs: Any,
+    ) -> T:
+        """Load a TOPReward configuration and instantiate the wrapped VLM."""
+        if config is None:
+            config = RewardModelConfig.from_pretrained(
+                pretrained_name_or_path=pretrained_name_or_path,
+                force_download=force_download,
+                resume_download=resume_download,
+                proxies=proxies,
+                token=token,
+                cache_dir=cache_dir,
+                local_files_only=local_files_only,
+                revision=revision,
+                **kwargs,
+            )
+        if not isinstance(config, TOPRewardConfig):
+            raise TypeError(
+                f"Expected a TOPRewardConfig, got {type(config).__name__}. Make sure "
+                f"`pretrained_name_or_path={pretrained_name_or_path!r}` points at a "
+                "TOPReward checkpoint."
+            )
+
+        model_id = str(pretrained_name_or_path)
+        if not os.path.isdir(model_id):
+            try:
+                hf_hub_download(
+                    repo_id=model_id,
+                    filename=CONFIG_NAME,
+                    revision=revision,
+                    cache_dir=cache_dir,
+                    force_download=force_download,
+                    proxies=proxies,
+                    resume_download=resume_download,
+                    token=token,
+                    local_files_only=local_files_only,
+                )
+            except HfHubHTTPError as e:
+                raise FileNotFoundError(
+                    f"{CONFIG_NAME} not found on the HuggingFace Hub in {model_id}"
+                ) from e
+
+        instance = cls(config, **kwargs)
+        instance.to(config.device)
+        instance.eval()
+        return instance
+
+    def push_model_to_hub(self, cfg: TrainPipelineConfig):
+        """Push the TOPReward ``config.json``, training config, and model card to the Hub."""
+        api = HfApi()
+        repo_id = api.create_repo(
+            repo_id=self.config.repo_id, private=self.config.private, exist_ok=True
+        ).repo_id
+
+        with TemporaryDirectory(ignore_cleanup_errors=True) as tmp:
+            saved_path = Path(tmp) / repo_id
+            saved_path.mkdir(parents=True, exist_ok=True)
+
+            self.config._save_pretrained(saved_path)
+
+            card = self.generate_model_card(
+                cfg.dataset.repo_id, self.config.type, self.config.license, self.config.tags
+            )
+            card.save(str(saved_path / "README.md"))
+
+            cfg.save_pretrained(saved_path)
+
+            commit_info = api.upload_folder(
+                repo_id=repo_id,
+                repo_type="model",
+                folder_path=saved_path,
+                commit_message="Upload TOPReward config and readme",
+                allow_patterns=["*.json", "*.yaml", "*.md"],
+                ignore_patterns=["*.tmp", "*.log", "*.safetensors"],
+            )
+
+            logger.info(f"Model pushed to {commit_info.repo_url.url}")
@@ -0,0 +1,305 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""TOPReward pre/post processing pipeline."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import TYPE_CHECKING, Any
+
+import torch
+from torch import Tensor
+
+from lerobot.configs import PipelineFeatureType, PolicyFeature
+from lerobot.processor import (
+    AddBatchDimensionProcessorStep,
+    DeviceProcessorStep,
+    PolicyAction,
+    PolicyProcessorPipeline,
+    ProcessorStep,
+    ProcessorStepRegistry,
+    policy_action_to_transition,
+)
+from lerobot.rewards.topreward.configuration_topreward import (
+    DEFAULT_PROMPT_PREFIX,
+    DEFAULT_PROMPT_SUFFIX_TEMPLATE,
+    TOPRewardConfig,
+)
+from lerobot.types import EnvTransition, TransitionKey
+from lerobot.utils.constants import (
+    OBS_IMAGES,
+    OBS_PREFIX,
+    POLICY_POSTPROCESSOR_DEFAULT_NAME,
+    POLICY_PREPROCESSOR_DEFAULT_NAME,
+)
+from lerobot.utils.import_utils import _transformers_available, require_package
+
+if TYPE_CHECKING or _transformers_available:
+    from transformers import AutoProcessor
+else:
+    AutoProcessor = None
+
+TOPREWARD_FEATURE_PREFIX = f"{OBS_PREFIX}topreward."
+
+_TRUE_ANSWER = "True"
+
+TOPREWARD_VLM_INPUT_KEYS = (
+    "input_ids",
+    "attention_mask",
+    "pixel_values_videos",
+    "video_grid_thw",
+    "mm_token_type_ids",
+)
+TOPREWARD_INPUT_KEYS = TOPREWARD_VLM_INPUT_KEYS + ("labels",)
+
+
+def _prepare_video_batch(video: Tensor, *, max_frames: int | None) -> Tensor:
+    """Return videos as ``(B, T, C, H, W)`` uint8 tensors for Qwen3-VL."""
+    if video.ndim == 4:
+        video = video.unsqueeze(1)
+    elif video.ndim != 5:
+        raise ValueError(
+            f"Expected TOPReward frames with shape (B,C,H,W) or (B,T,C,H,W); got {tuple(video.shape)}"
+        )
+
+    if max_frames is not None:
+        video = video[:, -max_frames:]
+    if video.shape[-1] in (1, 3):
+        video = video.permute(0, 1, 4, 2, 3)
+    elif video.shape[2] not in (1, 3):
+        raise ValueError(f"Expected channel dim of size 1 or 3, got shape {tuple(video.shape)}")
+
+    if video.is_floating_point():
+            video = video * 255.0  # float frames are assumed to be normalised to [0, 1]
+
+    return video.clamp(0, 255).to(torch.uint8).contiguous()
+
+
+def _expand_tasks(task: Any, *, batch_size: int, default: str | None) -> list[str]:
+    if task is None:
+        task = default
+    if task is None:
+        raise KeyError("TOPReward expected a task description in complementary data")
+    if isinstance(task, str):
+        return [task] * batch_size
+    if isinstance(task, tuple):
+        task = list(task)
+    if not (isinstance(task, list) and all(isinstance(item, str) for item in task)):
+        raise TypeError(f"TOPReward task must be a string or list of strings, got {type(task)}")
+    if len(task) == 1 and batch_size > 1:
+        return task * batch_size
+    if len(task) != batch_size:
+        raise ValueError(f"Expected {batch_size} tasks, got {len(task)}")
+    return task
+
+
+@dataclass
+@ProcessorStepRegistry.register(name="topreward_encoder")
+class TOPRewardEncoderProcessorStep(ProcessorStep):
+    """Encode raw frames + task into Qwen-VL tensors for the TOPReward model.
+
+    Loads a :class:`~transformers.AutoProcessor` matching ``vlm_name`` and
+    builds the full chat prompt including the instruction suffix. The
+    resulting ``input_ids``, ``attention_mask``, vision tensors, and
+    ``labels`` are written under the ``observation.topreward.*`` namespace
+    so the model can score without re-tokenising.
+
+    At call time the step reads:
+
+    - ``observation[image_key]``: ``(B, T, C, H, W)`` or ``(B, C, H, W)`` frames.
+    - ``complementary_data[task_key]``: a string or list of strings.
+
+    and writes ``observation[f"{TOPREWARD_FEATURE_PREFIX}<name>"]`` for the
+    Qwen-VL tensors plus ``labels``.
+    """
+
+    vlm_name: str = "Qwen/Qwen3-VL-8B-Instruct"
+    image_key: str = OBS_IMAGES + ".top"
+    task_key: str = "task"
+    default_task: str | None = None
+    max_frames: int | None = 16
+    fps: float = 2.0
+    prompt_prefix: str = DEFAULT_PROMPT_PREFIX
+    prompt_suffix_template: str = DEFAULT_PROMPT_SUFFIX_TEMPLATE
+    add_chat_template: bool = False
+    max_length: int = 32768
+
+    _processor: Any = field(default=None, init=False, repr=False)
+
+    def __post_init__(self) -> None:
+        require_package("transformers", extra="topreward")
+        self._processor = AutoProcessor.from_pretrained(self.vlm_name, trust_remote_code=True)
+
+    def __call__(self, transition: EnvTransition) -> EnvTransition:
+        observation = transition.get(TransitionKey.OBSERVATION)
+        complementary = transition.get(TransitionKey.COMPLEMENTARY_DATA) or {}
+        if observation is None or self.image_key not in observation:
+            raise KeyError(f"TOPReward expected image key {self.image_key!r} in observation")
+
+        frames = observation[self.image_key]
+        videos = frames.detach().cpu() if isinstance(frames, Tensor) else torch.as_tensor(frames)
+        videos = _prepare_video_batch(videos, max_frames=self.max_frames)
+
+        batch_size = videos.shape[0]
+        tasks = _expand_tasks(
+            complementary.get(self.task_key, self.default_task),
+            batch_size=batch_size,
+            default=self.default_task,
+        )
+
+        encoded = self._encode_batch(videos, tasks, batch_size)
+
+        new_observation = dict(observation)
+        for key, value in encoded.items():
+            new_observation[f"{TOPREWARD_FEATURE_PREFIX}{key}"] = value
+
+        new_transition = transition.copy()
+        new_transition[TransitionKey.OBSERVATION] = new_observation
+        return new_transition
+
+    def _encode_batch(self, videos: Tensor, tasks: list[str], batch_size: int) -> dict[str, Any]:
+        """Tokenise a batch of (frames, task) pairs into Qwen-VL tensors.
+
+        The loop only builds per-sample chat strings. Tokenisation, padding,
+        video preprocessing, and label construction are batched.
+        """
+
+        texts: list[str] = []
+        video_metadata = [
+            {
+                "total_num_frames": int(videos.shape[1]),
+                "fps": float(self.fps),
+                "frames_indices": list(range(int(videos.shape[1]))),
+            }
+            for _ in range(batch_size)
+        ]
+        eos_token = self._processor.tokenizer.eos_token
+
+        for i in range(batch_size):
+            instruction_suffix = self.prompt_suffix_template.format(instruction=tasks[i])
+            if self.add_chat_template:
+                suffix_for_template = instruction_suffix.removesuffix(_TRUE_ANSWER).rstrip()
+                templated_messages = [
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "video", "video": videos[i], "fps": self.fps},
+                            {"type": "text", "text": f"{self.prompt_prefix}{suffix_for_template}"},
+                        ],
+                    }
+                ]
+                prompt_chat = self._processor.apply_chat_template(
+                    templated_messages, tokenize=False, add_generation_prompt=True
+                )
+                full_text = f"{prompt_chat}{_TRUE_ANSWER}"
+            else:
+                user_messages = [
+                    {
+                        "role": "user",
+                        "content": [
+                            {"type": "video", "video": videos[i], "fps": self.fps},
+                            {"type": "text", "text": self.prompt_prefix},
+                        ],
+                    }
+                ]
+                prompt_chat = self._processor.apply_chat_template(
+                    user_messages, tokenize=False, add_generation_prompt=False
+                )
+                if eos_token is not None:
+                    prompt_chat = prompt_chat.split(eos_token)[0]
+                full_text = f"{prompt_chat}{instruction_suffix}"
+
+            texts.append(full_text)
+
+        result = self._processor(
+            text=texts,
+            videos=videos,
+            video_metadata=video_metadata,
+            do_sample_frames=False,
+            padding=True,
+            padding_side="left",
+            return_tensors="pt",
+        )
+        input_ids = result["input_ids"]
+
+        if input_ids.shape[-1] > self.max_length:
+            raise ValueError(
+                f"TOPReward input length {input_ids.shape[-1]} exceeds max_length "
+                f"{self.max_length}; lower `max_frames` or raise `max_length`."
+            )
+
+        labels = torch.full_like(input_ids, -100)
+        labels[:, -1] = input_ids[:, -1]
+        result["labels"] = labels
+        return result
+
+    def transform_features(
+        self, features: dict[PipelineFeatureType, dict[str, PolicyFeature]]
+    ) -> dict[PipelineFeatureType, dict[str, PolicyFeature]]:
+        return features
+
+    def get_config(self) -> dict[str, Any]:
+        return {
+            "vlm_name": self.vlm_name,
+            "image_key": self.image_key,
+            "task_key": self.task_key,
+            "default_task": self.default_task,
+            "max_frames": self.max_frames,
+            "fps": self.fps,
+            "prompt_prefix": self.prompt_prefix,
+            "prompt_suffix_template": self.prompt_suffix_template,
+            "add_chat_template": self.add_chat_template,
+            "max_length": self.max_length,
+        }
+
+
+def make_topreward_pre_post_processors(
+    config: TOPRewardConfig,
+    dataset_stats: dict[str, dict[str, Any]] | None = None,
+) -> tuple[
+    PolicyProcessorPipeline[dict[str, Any], dict[str, Any]],
+    PolicyProcessorPipeline[PolicyAction, PolicyAction],
+]:
+    """Pipeline that pre-encodes frames + task into Qwen-VL tensors.
+
+    The preprocessor adds a batch dimension if needed, runs TOPReward's
+    encoder (which tokenises the full prompt and emits ``labels``), and
+    moves everything to the configured device. The postprocessor is
+    the identity since TOPReward outputs a single reward tensor.
+    """
+    preprocessor = PolicyProcessorPipeline[dict[str, Any], dict[str, Any]](
+        steps=[
+            AddBatchDimensionProcessorStep(),
+            TOPRewardEncoderProcessorStep(
+                vlm_name=config.vlm_name,
+                image_key=config.image_key,
+                task_key=config.task_key,
+                default_task=config.default_task,
+                max_frames=config.max_frames,
+                fps=config.fps,
+                prompt_prefix=config.prompt_prefix,
+                prompt_suffix_template=config.prompt_suffix_template,
+                add_chat_template=config.add_chat_template,
+                max_length=config.max_input_length,
+            ),
+            DeviceProcessorStep(device=config.device or "cpu"),
+        ],
+        name=POLICY_PREPROCESSOR_DEFAULT_NAME,
+    )
+    postprocessor = PolicyProcessorPipeline(
+        name=POLICY_POSTPROCESSOR_DEFAULT_NAME,
+        to_transition=policy_action_to_transition,
+    )
+    return preprocessor, postprocessor
@@ -13,6 +13,8 @@
 A reward classifier is a lightweight neural network that scores observations or trajectories for task success, providing a learned reward signal or offline evaluation when explicit rewards are unavailable.
 {% elif model_name == "sarm" %}
 A Success-Aware Reward Model (SARM) predicts a dense reward signal from observations, typically used downstream for reinforcement learning or human-in-the-loop fine-tuning when task success is not directly observable.
+{% elif model_name == "topreward" %}
+TOPReward is a **zero-shot** reward model that extracts token log-probabilities from an off-the-shelf vision-language model (default Qwen3-VL) as a reward signal. Given a video trajectory and a task instruction, it returns the VLM's log-likelihood of the instruction being true, with no fine-tuning required.
 {% else %}
 _Reward model type not recognized — please update this template._
 {% endif %}
@@ -0,0 +1,296 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for the TOPReward reward model."""
+
+from __future__ import annotations
+
+from types import SimpleNamespace
+
+import pytest
+import torch
+
+from lerobot.configs.rewards import RewardModelConfig
+from lerobot.rewards.factory import get_reward_model_class, make_reward_model_config
+from lerobot.rewards.topreward import TOPRewardConfig
+from lerobot.rewards.topreward.processor_topreward import TOPREWARD_FEATURE_PREFIX, TOPREWARD_INPUT_KEYS
+from tests.utils import skip_if_package_missing
+
+
+class _FakeQwenModel(torch.nn.Module):
+    """Stand-in for ``Qwen3VLForConditionalGeneration``.
+
+    Returns a ``SimpleNamespace`` with ``logits`` of a controlled shape so
+    the log-prob extraction path in ``compute_reward`` can be exercised
+    without downloading real VLM weights.
+    """
+
+    def __init__(self) -> None:
+        super().__init__()
+        self._param = torch.nn.Parameter(torch.zeros(1))
+        self._reward_value: float = -1.5
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):  # noqa: ARG003
+        return cls()
+
+    def forward(  # noqa: ARG002
+        self, input_ids, attention_mask=None, labels=None, logits_to_keep=0, **kwargs
+    ):
+        batch_size, seq_len = input_ids.shape
+        vocab_size = 1000
+        logits = torch.zeros(batch_size, seq_len, vocab_size)
+        # Put a large logit on the target token so the model returns a
+        # predictable, near-zero log-prob reward.
+        # The only unmasked label is the last token.
+        # After the causal-LM shift (logits[:, :-1], labels[:, 1:]) the scored
+        # position is logits[:, -2, :] predicting labels[:, -1].
+        # With the other logits at 0, log_softmax at the target token is close to 0.
+        for i in range(batch_size):
+            target_idx = int(input_ids[i, -1].item())
+            logits[i, -2, target_idx] = self._reward_value * -10  # high logit -> high log-prob
+        if logits_to_keep:
+            logits = logits[:, -logits_to_keep:, :]
+        return SimpleNamespace(logits=logits)
+
+
+def _patch_build(monkeypatch) -> None:
+    """Stub out the Qwen3-VL model class so TOPReward construction is cheap and offline."""
+    from lerobot.rewards.topreward import modeling_topreward
+
+    monkeypatch.setattr(modeling_topreward, "Qwen3VLForConditionalGeneration", _FakeQwenModel)
+
+
+def _make_batch(
+    input_ids: torch.Tensor,
+    attention_mask: torch.Tensor | None = None,
+    labels: torch.Tensor | None = None,
+    *,
+    omit: str | None = None,
+) -> dict[str, torch.Tensor]:
+    """Build a ``compute_reward``-ready batch using TOPReward's namespaced keys."""
+    batch_size, seq_len = input_ids.shape
+    if attention_mask is None:
+        attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
+    batch: dict[str, torch.Tensor] = {}
+    if labels is not None:
+        batch[f"{TOPREWARD_FEATURE_PREFIX}labels"] = labels
+    batch.update(
+        {
+            f"{TOPREWARD_FEATURE_PREFIX}input_ids": input_ids,
+            f"{TOPREWARD_FEATURE_PREFIX}attention_mask": attention_mask,
+            f"{TOPREWARD_FEATURE_PREFIX}pixel_values_videos": torch.zeros(
+                batch_size, 1536, dtype=torch.float32
+            ),
+            f"{TOPREWARD_FEATURE_PREFIX}video_grid_thw": torch.ones(batch_size, 3, dtype=torch.long),
+            f"{TOPREWARD_FEATURE_PREFIX}mm_token_type_ids": torch.zeros_like(input_ids),
+        }
+    )
+    if omit is not None:
+        batch.pop(f"{TOPREWARD_FEATURE_PREFIX}{omit}", None)
+    return batch
+
+
+def _terminal_labels(input_ids: torch.Tensor) -> torch.Tensor:
+    labels = torch.full_like(input_ids, -100)
+    labels[:, -1] = input_ids[:, -1]
+    return labels
+
+
+# ---------------------------------------------------------------------------
+# Registry + factory
+# ---------------------------------------------------------------------------
+
+
+def test_topreward_config_registered():
+    assert "topreward" in RewardModelConfig.get_known_choices()
+    assert RewardModelConfig.get_choice_class("topreward") is TOPRewardConfig
+    assert isinstance(make_reward_model_config("topreward", device="cpu"), TOPRewardConfig)
+
+
+def test_topreward_factory_returns_in_tree_class():
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    assert get_reward_model_class("topreward") is TOPRewardModel
+
+
+# ---------------------------------------------------------------------------
+# Config validation
+# ---------------------------------------------------------------------------
+
+
+def test_topreward_config_rejects_zero_max_frames():
+    with pytest.raises(ValueError, match="max_frames must be >= 1"):
+        TOPRewardConfig(device="cpu", max_frames=0)
+
+
+def test_topreward_config_rejects_non_positive_fps():
+    with pytest.raises(ValueError, match="fps must be > 0"):
+        TOPRewardConfig(device="cpu", fps=0.0)
+
+
+def test_topreward_config_rejects_suffix_without_instruction_placeholder():
+    with pytest.raises(ValueError, match=r"\{instruction\}"):
+        TOPRewardConfig(device="cpu", prompt_suffix_template="no placeholder here")
+
+
+# ---------------------------------------------------------------------------
+# compute_reward
+# ---------------------------------------------------------------------------
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_returns_one_scalar_per_sample(monkeypatch):
+    """``compute_reward`` must return a ``(B,)`` float32 tensor with one
+    log-prob reward per sample, consuming pre-encoded Qwen-VL tensors."""
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    input_ids = torch.randint(0, 100, (2, 10))
+    attention_mask = torch.ones(2, 10, dtype=torch.long)
+    labels = _terminal_labels(input_ids)
+
+    batch = _make_batch(input_ids, attention_mask, labels)
+    rewards = model.compute_reward(batch)
+
+    assert rewards.shape == (2,)
+    assert rewards.dtype == torch.float32
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_applies_success_threshold(monkeypatch):
+    """When ``success_threshold`` is finite, the model returns binary success."""
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu", success_threshold=0.0)
+    model = TOPRewardModel(cfg)
+
+    input_ids = torch.randint(0, 100, (2, 10))
+    attention_mask = torch.ones(2, 10, dtype=torch.long)
+    labels = _terminal_labels(input_ids)
+
+    batch = _make_batch(input_ids, attention_mask, labels)
+    rewards = model.compute_reward(batch)
+
+    assert rewards.shape == (2,)
+    assert set(rewards.tolist()).issubset({0.0, 1.0})
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_errors_when_inputs_missing(monkeypatch):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    with pytest.raises(KeyError, match=r"observation\.topreward\.input_ids"):
+        model.compute_reward(_make_batch(torch.randint(0, 100, (1, 10)), omit="input_ids"))
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_errors_when_labels_missing(monkeypatch):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    input_ids = torch.randint(0, 100, (1, 10))
+    with pytest.raises(KeyError, match=r"observation\.topreward\.labels"):
+        model.compute_reward(_make_batch(input_ids, labels=None))
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_compute_reward_requires_all_encoder_keys(monkeypatch):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    input_ids = torch.randint(0, 100, (1, 10))
+    labels = _terminal_labels(input_ids)
+    required_encoder_keys = set(TOPREWARD_INPUT_KEYS) - {"input_ids", "labels"}
+
+    for key in required_encoder_keys:
+        with pytest.raises(KeyError, match=rf"observation\.topreward\.{key}"):
+            model.compute_reward(_make_batch(input_ids, labels=labels, omit=key))
+
+
+# ---------------------------------------------------------------------------
+# Save / load — config-only checkpoint
+# ---------------------------------------------------------------------------
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_save_pretrained_writes_only_config_json(monkeypatch, tmp_path):
+    from huggingface_hub.constants import CONFIG_NAME, SAFETENSORS_SINGLE_FILE
+
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(
+        device="cpu",
+        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+        fps=4.0,
+        image_key="observation.images.front",
+    )
+    model = TOPRewardModel(cfg)
+    model.save_pretrained(str(tmp_path))
+
+    assert (tmp_path / CONFIG_NAME).exists()
+    assert not (tmp_path / SAFETENSORS_SINGLE_FILE).exists()
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_from_pretrained_local_dir_roundtrips_config(monkeypatch, tmp_path):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(
+        device="cpu",
+        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+        fps=4.0,
+        image_key="observation.images.front",
+        add_chat_template=True,
+        success_threshold=-1.5,
+    )
+    TOPRewardModel(cfg).save_pretrained(str(tmp_path))
+
+    reloaded = TOPRewardModel.from_pretrained(str(tmp_path))
+
+    assert isinstance(reloaded.config, TOPRewardConfig)
+    assert reloaded.config.vlm_name == "Qwen/Qwen3-VL-8B-Instruct"
+    assert reloaded.config.fps == 4.0
+    assert reloaded.config.image_key == "observation.images.front"
+    assert reloaded.config.add_chat_template is True
+    assert reloaded.config.success_threshold == -1.5
+
+
+@skip_if_package_missing("transformers")
+def test_topreward_is_not_trainable(monkeypatch):
+    from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel
+
+    _patch_build(monkeypatch)
+    cfg = TOPRewardConfig(device="cpu")
+    model = TOPRewardModel(cfg)
+
+    assert model.is_trainable is False
+    with pytest.raises(NotImplementedError, match="not trainable"):
+        model.forward({"x": torch.zeros(1)})
@@ -0,0 +1,80 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""End-to-end TOPReward smoke test with the real Qwen3-VL model."""
+
+import os
+
+import pytest
+import torch
+
+pytest.importorskip("transformers")
+
+from lerobot.rewards.topreward.configuration_topreward import TOPRewardConfig  # noqa: E402
+from lerobot.rewards.topreward.modeling_topreward import TOPRewardModel  # noqa: E402
+from lerobot.rewards.topreward.processor_topreward import (  # noqa: E402
+    TOPREWARD_FEATURE_PREFIX,
+    TOPREWARD_INPUT_KEYS,
+    make_topreward_pre_post_processors,
+)
+from tests.utils import require_cuda  # noqa: E402
+
+pytestmark = pytest.mark.skipif(
+    os.environ.get("CI") == "true" or os.environ.get("GITHUB_ACTIONS") == "true",
+    reason="This test requires downloading and loading Qwen3-VL and is not meant for CI",
+)
+
+
+def _make_dummy_topreward_batch(image_key: str, task_key: str) -> dict[str, object]:
+    num_frames = 4
+    image_size = 64
+    frames = torch.zeros(1, num_frames, 3, image_size, image_size, dtype=torch.uint8)
+    for frame_idx in range(num_frames):
+        frames[0, frame_idx, 0].fill_(min(frame_idx * 48, 255))
+        frames[0, frame_idx, 1].fill_(96)
+        frames[0, frame_idx, 2].fill_(192)
+
+    return {
+        image_key: frames,
+        task_key: ["pick up the red cube"],
+    }
+
+
+@require_cuda
+def test_topreward_full_qwen3vl_preprocessor_to_compute_reward():
+    cfg = TOPRewardConfig(
+        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+        device="cuda",
+        max_frames=4,
+        fps=2.0,
+        max_input_length=4096,
+    )
+
+    preprocessor, _ = make_topreward_pre_post_processors(cfg)
+    encoded_batch = preprocessor(_make_dummy_topreward_batch(cfg.image_key, cfg.task_key))
+    for key in TOPREWARD_INPUT_KEYS:
+        assert f"{TOPREWARD_FEATURE_PREFIX}{key}" in encoded_batch
+
+    model = TOPRewardModel(cfg)
+    try:
+        model.to(cfg.device)
+        model.eval()
+        rewards = model.compute_reward(encoded_batch)
+    finally:
+        del model
+        torch.cuda.empty_cache()
+
+    assert rewards.shape == (1,)
+    assert rewards.dtype == torch.float32
+    assert torch.isfinite(rewards).all()
@@ -0,0 +1,246 @@
+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Tests for TOPReward's pre-processing helpers and encoder step."""
+
+from __future__ import annotations
+
+import pytest
+import torch
+
+from lerobot.configs import FeatureType, PipelineFeatureType, PolicyFeature
+from lerobot.rewards.topreward.processor_topreward import (
+    TOPREWARD_FEATURE_PREFIX,
+    TOPREWARD_INPUT_KEYS,
+    _expand_tasks,
+    _prepare_video_batch,
+)
+from lerobot.types import TransitionKey
+from tests.utils import skip_if_package_missing
+
+# ---------------------------------------------------------------------------
+# _prepare_video_batch — raw image/video batch -> (B, T, C, H, W) uint8
+# ---------------------------------------------------------------------------
+
+
+def test_prepare_video_batch_batched_chw_float_is_converted_to_uint8():
+    video = torch.rand(2, 4, 3, 8, 8)
+    tensor = _prepare_video_batch(video, max_frames=None)
+
+    assert tensor.shape == (2, 4, 3, 8, 8)
+    assert tensor.dtype == torch.uint8
+    assert tensor.min() >= 0 and tensor.max() <= 255
+
+
+def test_prepare_video_batch_batched_thwc_uint8_is_permuted_to_channel_first():
+    video = torch.randint(0, 256, (2, 3, 8, 8, 3), dtype=torch.uint8)
+    tensor = _prepare_video_batch(video, max_frames=None)
+
+    assert tensor.shape == (2, 3, 3, 8, 8)
+    assert tensor.dtype == torch.uint8
+
+
+def test_prepare_video_batch_max_frames_tail_crops_recent_frames():
+    video = torch.zeros(1, 10, 3, 4, 4)
+    for t in range(10):
+        video[:, t] = t / 9.0
+
+    tensor = _prepare_video_batch(video, max_frames=3)
+
+    assert tensor.shape == (1, 3, 3, 4, 4)
+    assert int(tensor[0, 0, 0, 0, 0]) == int(7 / 9 * 255)
+    assert int(tensor[0, -1, 0, 0, 0]) == 255
+
+
+def test_prepare_video_batch_rejects_3d_input():
+    with pytest.raises(ValueError, match="Expected TOPReward frames"):
+        _prepare_video_batch(torch.zeros(4, 8, 8), max_frames=None)
+
+
+def test_prepare_video_batch_floats_above_one_are_rescaled_and_clipped():
+    video = torch.full((1, 1, 3, 2, 2), 5.0)
+    tensor = _prepare_video_batch(video, max_frames=None)
+
+    assert tensor.shape == (1, 1, 3, 2, 2)
+    assert int(tensor.max()) == 255
+
+
+def test_prepare_video_batch_clips_very_large_floats_to_uint8_max():
+    video = torch.full((1, 1, 3, 2, 2), 300.0)
+    tensor = _prepare_video_batch(video, max_frames=None)
+
+    assert int(tensor.max()) == 255
+
+
+# ---------------------------------------------------------------------------
+# _expand_tasks — string / list / tuple broadcasting to batch size
+# ---------------------------------------------------------------------------
+
+
+def test_expand_tasks_string_is_broadcast_to_batch_size():
+    assert _expand_tasks("pick up", batch_size=3, default=None) == ["pick up", "pick up", "pick up"]
+
+
+def test_expand_tasks_list_of_matching_size_passes_through():
+    assert _expand_tasks(["a", "b", "c"], batch_size=3, default=None) == ["a", "b", "c"]
+
+
+def test_expand_tasks_tuple_is_normalised_to_list():
+    assert _expand_tasks(("a", "b"), batch_size=2, default=None) == ["a", "b"]
+
+
+def test_expand_tasks_single_element_list_is_broadcast():
+    assert _expand_tasks(["only one"], batch_size=3, default=None) == ["only one"] * 3
+
+
+def test_expand_tasks_size_mismatch_raises():
+    with pytest.raises(ValueError, match="Expected 3 tasks"):
+        _expand_tasks(["a", "b"], batch_size=3, default=None)
+
+
+def test_expand_tasks_missing_uses_default():
+    assert _expand_tasks(None, batch_size=2, default="fallback") == ["fallback", "fallback"]
+
+
+def test_expand_tasks_missing_without_default_raises():
+    with pytest.raises(KeyError, match="task description"):
+        _expand_tasks(None, batch_size=1, default=None)
+
+
+def test_expand_tasks_wrong_type_raises():
+    with pytest.raises(TypeError, match="must be a string or list"):
+        _expand_tasks(42, batch_size=1, default=None)
+
+
+# ---------------------------------------------------------------------------
+# Encoder step — stubbed AutoProcessor
+# ---------------------------------------------------------------------------
+
+
+def _skip_if_topreward_extras_missing(func):
+    """Skip ``func`` when the ``transformers`` package is not installed."""
+    return skip_if_package_missing("transformers")(func)
+
+
+class _FakeTokenizer:
+    eos_token = "<|endoftext|>"
+    pad_token = "<|endoftext|>"
+
+    def __call__(self, *args, **kwargs):
+        return {"input_ids": torch.zeros(1, 10, dtype=torch.long)}
+
+
+class _FakeAutoProcessor:
+    def __init__(self) -> None:
+        self.tokenizer = _FakeTokenizer()
+
+    @classmethod
+    def from_pretrained(cls, *args, **kwargs):  # noqa: ARG003
+        return cls()
+
+    def apply_chat_template(self, messages, **kwargs):  # noqa: ARG002
+        return "fake_prompt_text"
+
+    def __call__(self, text=None, images=None, videos=None, **kwargs):  # noqa: ARG002
+        seq_len = 10
+        batch_size = len(text) if isinstance(text, list) else 1
+        return {
+            "input_ids": torch.randint(0, 100, (batch_size, seq_len)),
+            "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long),
+            "pixel_values_videos": torch.zeros(batch_size, 1536, dtype=torch.float32),
+            "video_grid_thw": torch.ones(batch_size, 3, dtype=torch.long),
+            "mm_token_type_ids": torch.zeros(batch_size, seq_len, dtype=torch.long),
+        }
+
+
+def _build_step(monkeypatch, **overrides):
+    from lerobot.rewards.topreward import processor_topreward
+
+    monkeypatch.setattr(processor_topreward, "AutoProcessor", _FakeAutoProcessor)
+    return processor_topreward.TOPRewardEncoderProcessorStep(**overrides)
+
+
+def _make_transition(observation: dict, complementary: dict | None = None) -> dict:
+    transition: dict = {TransitionKey.OBSERVATION: observation}
+    if complementary is not None:
+        transition[TransitionKey.COMPLEMENTARY_DATA] = complementary
+    return transition
+
+
+@_skip_if_topreward_extras_missing
+def test_encoder_step_emits_input_ids_and_labels(monkeypatch):
+    """The processor must emit Qwen-VL tensors including ``input_ids`` and
+    ``labels`` under the ``observation.topreward.*`` namespace."""
+    step = _build_step(monkeypatch)
+
+    frames_batch = torch.zeros(2, 4, 3, 8, 8)
+    out = step(
+        _make_transition(
+            observation={"observation.images.top": frames_batch},
+            complementary={"task": ["pick", "place"]},
+        )
+    )
+
+    obs_out = out[TransitionKey.OBSERVATION]
+    for key in TOPREWARD_INPUT_KEYS:
+        assert f"{TOPREWARD_FEATURE_PREFIX}{key}" in obs_out
+
+    input_ids = obs_out[f"{TOPREWARD_FEATURE_PREFIX}input_ids"]
+    labels = obs_out[f"{TOPREWARD_FEATURE_PREFIX}labels"]
+    assert labels.dtype == torch.long
+    assert labels.shape == (2, 10)
+    assert labels[:, :-1].eq(-100).all()
+    assert labels[:, -1].equal(input_ids[:, -1])
+
+
+@_skip_if_topreward_extras_missing
+def test_encoder_step_get_config_roundtrips_user_fields(monkeypatch):
+    step = _build_step(
+        monkeypatch,
+        vlm_name="Qwen/Qwen3-VL-8B-Instruct",
+        image_key="observation.images.cam_top",
+        task_key="task",
+        default_task="do the thing",
+        max_frames=8,
+        fps=4.0,
+        add_chat_template=True,
+        max_length=2048,
+    )
+
+    cfg = step.get_config()
+    assert cfg["vlm_name"] == "Qwen/Qwen3-VL-8B-Instruct"
+    assert cfg["image_key"] == "observation.images.cam_top"
+    assert cfg["default_task"] == "do the thing"
+    assert cfg["max_frames"] == 8
+    assert cfg["fps"] == 4.0
+    assert cfg["add_chat_template"] is True
+    assert cfg["max_length"] == 2048
+
+
+@_skip_if_topreward_extras_missing
+def test_encoder_step_transform_features_is_identity(monkeypatch):
+    step = _build_step(monkeypatch)
+    features = {
+        PipelineFeatureType.OBSERVATION: {
+            "observation.images.top": PolicyFeature(shape=(3, 224, 224), type=FeatureType.VISUAL),
+        }
+    }
+    assert step.transform_features(features) == features
+
+
+@_skip_if_topreward_extras_missing
+def test_encoder_step_rejects_missing_image_key(monkeypatch):
+    step = _build_step(monkeypatch, image_key="observation.images.top")
+    with pytest.raises(KeyError, match="image key"):
+        step(_make_transition(observation={}, complementary={"task": "pick"}))
@@ -3009,6 +3009,9 @@ test = [
    { name = "pytest-cov" },
    { name = "pytest-timeout" },
 ]
+topreward = [
+    { name = "transformers" },
+]
 training = [
    { name = "accelerate" },
    { name = "av" },
@@ -3167,6 +3170,7 @@ requires-dist = [
    { name = "lerobot", extras = ["scipy-dep"], marker = "extra == 'wallx'" },
    { name = "lerobot", extras = ["smolvla"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["test"], marker = "extra == 'all'" },
+    { name = "lerobot", extras = ["topreward"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["training"], marker = "extra == 'all'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'eo1'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'groot'" },
@@ -3177,6 +3181,7 @@ requires-dist = [
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'pi'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'sarm'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'smolvla'" },
+    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'topreward'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'wallx'" },
    { name = "lerobot", extras = ["transformers-dep"], marker = "extra == 'xvla'" },
    { name = "lerobot", extras = ["video-benchmark"], marker = "extra == 'all'" },
@@ -3244,7 +3249,7 @@ requires-dist = [
    { name = "transformers", marker = "extra == 'transformers-dep'", specifier = ">=5.4.0,<5.6.0" },
    { name = "wandb", marker = "extra == 'training'", specifier = ">=0.24.0,<0.25.0" },
 ]
-provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "smolvla", "multi-task-dit", "groot", "sarm", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]
+provides-extras = ["dataset", "training", "hardware", "viz", "core-scripts", "evaluation", "dataset-viz", "av-dep", "pygame-dep", "placo-dep", "transformers-dep", "grpcio-dep", "can-dep", "peft-dep", "scipy-dep", "diffusers-dep", "qwen-vl-utils-dep", "matplotlib-dep", "pyserial-dep", "deepdiff-dep", "pynput-dep", "pyzmq-dep", "motorbridge-dep", "motorbridge-smart-servo-dep", "feetech", "dynamixel", "damiao", "robstride", "openarms", "gamepad", "hopejr", "lekiwi", "unitree-g1", "reachy2", "rebot", "kinematics", "intelrealsense", "phone", "diffusion", "wallx", "pi", "smolvla", "multi-task-dit", "groot", "sarm", "topreward", "xvla", "eo1", "hilserl", "async", "peft", "dev", "notebook", "test", "video-benchmark", "aloha", "pusht", "libero", "metaworld", "all"]

 [[package]]
 name = "librt"
Author	SHA1	Message	Date
Haoming Song	3b5b94dbd6	optmize topreward input processing (#3660)	2026-05-25 16:07:45 +02:00
Cole	616663cd9f	fix(rewards/topreward): fix pyproject extra typo and simplify processor (#3653) Add lerobot[topreward] extra to all in pyproject.toml, drop the redundant labels arg in scoring, and collapse the dead-branch shape check in the encoder processor.	2026-05-23 00:27:09 +02:00
Khalil Meftah	5cfca59ec7	fix(rewards/topreward): add missing input keys mm_token_type_ids	2026-05-21 11:05:02 +02:00
Khalil Meftah	f6ecb7b955	refactor(rewards): clean up TOPReward processor/model	2026-05-20 17:39:21 +02:00
Khalil Meftah	70ad322676	feat(rewards): add TOPReward reward model	2026-05-19 18:00:18 +02:00