TOPReward:
Token Probabilities as Hidden Zero-Shot Rewards for Robotics

Shirui Chen1,2, Cole Harrison3, Ying-Chun Lee1, Angela Jin Yang1, Jason Ren1,2,4,
Lillian J. Ratliff1, Jiafei Duan1,2*, Dieter Fox1,2*, Ranjay Krishna1,2*

University of Washington1   |   Allen Institute for AI2   |   Amazon3   |   University of North Carolina4

*Co-advised

TOPReward Overview

TOPReward enables effective zero-shot estimation of task progress across diverse and challenging real-world manipulation tasks. Moreover, we introduce our in-house robotics reward benchmark, ManiRewardBench.

Highlights

Abstract

Vision-Language-Action (VLA) models have advanced in pretraining, but their RL performance is limited by low sample efficiency and sparse real-world rewards. Generalizable process reward models are needed, yet existing temporal value functions often fail to transfer across domains.

We introduce TOPReward, a zero-shot progress estimator for robotic manipulation that reads task progress directly from the token probabilities of video Vision-Language Models (VLMs). Evaluated on Open X-Embodiment (39 datasets, 780 episodes) and our new ManiRewardBench (113 tasks, 497 episodes across Franka, YAM, and SO-100/101), TOPReward achieves 0.945 mean VOC on ManiRewardBench using Qwen3-VL and outperforms GVL on open-source VLMs. It also enables success detection and reward-aligned behavior cloning.

130+ Zero-Shot Real-World Tasks
0.945 Mean VOC (Qwen3-VL)
3+ Robot Platforms Evaluated
0 Task-Specific Training Required

Method Overview

TOPReward Example
Qualitative example of "Fold the Towel": Instruction-conditioned progress estimation on a real trajectory. The curve shows TOPReward's predicted completion value over time, with annotated values at selected frames corresponding to semantic sub-tasks.
1. Prompted Video–Language Inference

We ask the VLM to judge whether the observed trajectory completes the instruction, then score the log-likelihood of an affirmative answer (e.g., the token True).

2. Token-Probability Reward Extraction

We compute the log-likelihood of the answer token "True" and use it as the reward, which avoids relying on the model's instruction-following or numeric generation abilities.

3. Progress Estimation from Trajectory Prefixes

We score trajectory prefixes of increasing length and align the extracted token probabilities across time to produce a dense temporal reward function.
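
The extraction and prefix steps can be illustrated with a minimal sketch. It is not the authors' implementation: `token_log_prob`, `prefix_rewards`, and `toy_score` are hypothetical names, the evenly spaced prefix schedule is an assumption, and `toy_score` stands in for a real VLM forward pass with a two-token vocabulary.

```python
import math
from typing import Callable, List, Sequence


def token_log_prob(logits: Sequence[float], token_id: int) -> float:
    """Log-probability of one token: its logit minus log-sum-exp over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[token_id] - log_z


def prefix_rewards(
    frames: Sequence,
    score_prefix: Callable[[Sequence], float],
    num_prefixes: int,
) -> List[float]:
    """Score prefixes frames[:k] of increasing length; returns one reward per prefix.

    Assumption: prefix end points are evenly spaced and the last prefix is the
    full trajectory. The source does not specify the sampling schedule.
    """
    n = len(frames)
    ends = [(n * (i + 1) + num_prefixes - 1) // num_prefixes for i in range(num_prefixes)]
    return [score_prefix(frames[:k]) for k in ends]


def toy_score(prefix: Sequence) -> float:
    """Hypothetical stand-in for a VLM call, with vocabulary [False, True].

    The logit of "True" grows with prefix length to mimic a task that is
    closer to completion as more of the trajectory is observed.
    """
    logits = [0.0, float(len(prefix)) - 4.0]
    return token_log_prob(logits, token_id=1)


# Eight dummy frames, scored at prefix lengths 2, 4, 6, and 8.
rewards = prefix_rewards(list(range(8)), toy_score, num_prefixes=4)
```

In this toy setting the rewards are log-probabilities (never positive) and rise as the prefix grows; with a real model, `toy_score` would be replaced by a call that returns the log-likelihood of the "True" token given the video prefix and the instruction.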

Quantitative Results

Large-Scale Progress Estimation: Logit-Based Rewards Are More Reliable

Results on the Open X-Embodiment dataset. We report Mean VOC over 39 datasets with 20 episodes per dataset. Higher is better.
Method | Molmo-2-8B | Qwen3-VL-8B | Gemini-2.5-Pro
GVL | -0.016 | 0.194 | 0.541
TOPReward | 0.417 | 0.857 | 0.433
Results on the ManiRewardBench. We report Mean VOC over 113 tasks, 497 episodes. Higher is better.
Dataset | Molmo-2 GVL | Molmo-2 TOPReward | Qwen3-VL-8B GVL | Qwen3-VL-8B TOPReward | Gemini* GVL | Gemini* TOPReward
LeRobot | -0.001 | 0.595 | 0.332 | 0.954 | 0.620 | 0.578
Franka | 0.000 | 0.662 | 0.242 | 0.942 | 0.695 | 0.448
Bimanual YAM | 0.007 | 0.565 | 0.164 | 0.947 | 0.566 | 0.546
Single-arm YAM | -0.017 | 0.642 | 0.544 | 0.945 | 0.752 | 0.488
Progress Trace Comparison
Progress traces on ManiRewardBench. Example progress traces predicted by TOPReward (orange) compared to ground-truth completion (dashed) from ManiRewardBench. We also overlay Gemini-GVL (blue) on the same episodes when available.

TOPReward generates smooth, steadily increasing progress signals that closely match ground-truth completion across tasks. In contrast, Gemini-GVL produces noisier, non-monotonic predictions. TOPReward also captures multi-step structure, with plateaus at subtask completion and sharper gains during active manipulation.

Success Detection: Likelihood Beats Rank Correlation

VOC (Value-Order Correlation) measures rank consistency, not task completion, so failed trajectories that plateau early can still score high. In contrast, TOPReward estimates instruction-satisfaction likelihood, better separating success from failure.
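
The plateau effect can be checked numerically. The sketch below assumes VOC is the Spearman rank correlation between predicted values and chronological frame order, which is how "rank consistency" is read here; `average_ranks` and `value_order_correlation` are illustrative names and the two traces are made-up values, not benchmark data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def value_order_correlation(values: Sequence[float]) -> float:
    """Pearson correlation between value ranks and time ranks (Spearman's rho)."""
    n = len(values)
    if n < 2:
        return 0.0
    rv = average_ranks(values)
    rt = [float(t + 1) for t in range(n)]
    mv, mt = sum(rv) / n, sum(rt) / n
    cov = sum((a - mv) * (b - mt) for a, b in zip(rv, rt))
    sv = sum((a - mv) ** 2 for a in rv) ** 0.5
    st = sum((b - mt) ** 2 for b in rt) ** 0.5
    if sv == 0.0 or st == 0.0:
        return 0.0  # constant predictions carry no ordering information
    return cov / (sv * st)


# Illustrative traces: one that reaches completion, one that stalls at 30%.
success_trace = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
stalled_trace = [0.0, 0.1, 0.2, 0.3, 0.3, 0.3]

voc_success = value_order_correlation(success_trace)
voc_stalled = value_order_correlation(stalled_trace)
```

Under these assumptions the stalled trace still scores above 0.9 even though it never gets past 30% progress, which is why a rank-based score alone is a weak success detector.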

We evaluate success detection on the ManiRewardBench failure split (23 tasks) as a binary classification problem, reporting ROC-AUC. TOPReward uses the average log-likelihood of the last three sampled frames, while GVL uses VOC scores.
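
A minimal sketch of this evaluation protocol follows, using only the standard library. The function names are hypothetical, the episode values are invented for illustration, and ROC-AUC is computed through its pairwise-ranking definition rather than any particular library routine.

```python
from typing import List, Sequence


def final_score(log_likelihoods: Sequence[float], k: int = 3) -> float:
    """Average log-likelihood over the last k sampled frames of one episode."""
    tail = list(log_likelihoods[-k:])
    return sum(tail) / len(tail)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random success outranks a random failure (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one success and one failure")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented per-frame log-likelihoods for four episodes (1 = success, 0 = failure).
episodes: List[List[float]] = [
    [-3.0, -2.0, -1.0, -0.5, -0.3],
    [-3.2, -2.5, -1.5, -0.9, -0.6],
    [-3.1, -2.8, -2.6, -2.7, -2.5],
    [-3.0, -2.0, -1.8, -1.9, -2.0],
]
labels = [1, 1, 0, 0]

scores = [final_score(ep) for ep in episodes]
auc = roc_auc(scores, labels)
```

Averaging the last few frames rather than taking only the final one smooths out single-frame noise at the end of an episode; the window of three comes from the text, while everything else here is an assumption.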

Success detection results.
Method | Qwen3-VL-8B | Gemini-2.5-Pro
GVL | 0.519 | 0.823
TOPReward (Ours) | 0.654 | 0.826

Real-World Deployment

We deploy TOPReward on a real single-arm SO-100 robot to compute advantage weights for imitation learning. Using only 50 noisy demonstrations per task, we apply advantage-weighted regression (AWR) with advantages derived from TOPReward (TOP-AWR). Across all 6 tasks, TOP-AWR achieves more successes than standard behavior cloning (BC).
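
As a rough illustration of how progress estimates could become advantage weights, the sketch below uses the standard AWR weighting exp(A / beta). The advantage definition (progress gained over a short horizon), the temperature `beta`, the weight clipping, and all function names are assumptions for illustration; the page does not state how TOP-AWR computes them.

```python
import math
from typing import List, Sequence


def advantages(progress: Sequence[float], horizon: int = 1) -> List[float]:
    """Assumed advantage: progress gained over the next `horizon` steps.

    Steps near the end compare against the final frame, so the last step gets 0.
    """
    n = len(progress)
    return [progress[min(t + horizon, n - 1)] - progress[t] for t in range(n)]


def awr_weights(adv: Sequence[float], beta: float = 0.5, max_weight: float = 20.0) -> List[float]:
    """Exponentiated advantages, clipped so a few steps cannot dominate the loss."""
    return [min(math.exp(a / beta), max_weight) for a in adv]


def weighted_bc_loss(per_step_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-step behavior cloning losses."""
    return sum(w * l for w, l in zip(weights, per_step_losses)) / sum(weights)


# Invented progress trace for a noisy demonstration: it stalls, regresses, then recovers.
progress = [0.0, 0.2, 0.2, 0.1, 0.5]
adv = advantages(progress)
weights = awr_weights(adv)
loss = weighted_bc_loss([2.0, 2.0, 2.0, 2.0, 2.0], weights)
```

With weights like these, steps that move the task forward count more in the imitation loss and steps that undo progress count less, which is one plausible way noisy demonstrations could be filtered without discarding them.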

Real-World Experiments. We report number of successes (out of 10 trials) for advantage-weighted behavior cloning on single-arm SO-100 tasks.
Task | Pretrained | BC | TOP-AWR (Ours)
Place toy car in box | 1 | 2 | 3
Stack red cube on green cube | 1.3 | 1 | 2.3
Put pen into cup | 1.7 | 5.7 | 6.3
Place doll in box | 0 | 7 | 10
Pick up cube | 4 | 7 | 10
Put cube in cup | 4 | 6 | 9

LeRobot Rollouts

Pretrained - Place toy car in box
Behavior Cloning (BC) - Place toy car in box
TOP-AWR (Ours) - Place toy car in box
Pretrained - Put cube in cup
Behavior Cloning (BC) - Put cube in cup
TOP-AWR (Ours) - Put cube in cup
Pretrained - Place doll in box
Behavior Cloning (BC) - Place doll in box
TOP-AWR (Ours) - Place doll in box
Pretrained - Put pen into cup
Behavior Cloning (BC) - Put pen into cup
TOP-AWR (Ours) - Put pen into cup
Pretrained - Pick up cube
Behavior Cloning (BC) - Pick up cube
TOP-AWR (Ours) - Pick up cube

Conclusion

We introduced TOPReward, a zero-shot progress estimator for robotic manipulation that interprets pretrained video VLM token likelihoods as temporal value functions. By querying the model's belief about task completion rather than relying on numerical output, TOPReward avoids the known limitations of VLMs in instruction following and numeric reasoning.

Experiments show that TOPReward outperforms GVL on open-source VLMs across diverse benchmarks and robot platforms, while enabling success detection and improving behavior cloning in real-world manipulation tasks.

Limitations and Future Work

BibTeX

@misc{chen2026topreward,
  title        = {TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
  author       = {Shirui Chen and Cole Harrison and Ying-Chun Lee and Angela Jin Yang and Jason Ren and Lillian J. Ratliff and Jiafei Duan and Dieter Fox and Ranjay Krishna},
  year         = {2026},
  howpublished = {\url{https://topreward.github.io/webpage/}},
  note         = {Project page}
}