TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

Shirui Chen^1,2, Cole Harrison^3, Ying-Chun Lee^1, Angela Jin Yang^1, Jason Ren^1,2,4,
Lillian J. Ratliff^1, Jiafei Duan^1,2,*, Dieter Fox^1,2,*, Ranjay Krishna^1,2,*

University of Washington^1 | Allen Institute for AI^2 | Amazon^3 | University of North Carolina^4

^*Co-advised


TOPReward enables effective zero-shot estimation of task progress across diverse and challenging real-world manipulation tasks. Moreover, we introduce our in-house robotics reward benchmark, ManiRewardBench.

Highlights

Zero-shot progress estimation: Use VLM token log-likelihoods as a dense temporal reward (no calibration).
Strong benchmark performance: With Qwen3-VL-8B, achieves mean VOC scores of 0.857 on Open X-Embodiment and 0.947 on ManiRewardBench, outperforming GVL on the same model.
Useful downstream: Enables success detection and improves BC via advantage weighting.
Open-source model compatibility: Works with Qwen3-VL-8B and other video VLMs, with improvements expected as video understanding advances.

Abstract

Vision-Language-Action (VLA) models have advanced in pretraining, but their RL performance is limited by low sample efficiency and sparse real-world rewards. Generalizable process reward models are needed, yet existing temporal value functions often fail to transfer across domains.

We introduce TOPReward, a zero-shot progress estimator for robotic manipulation that reads task progress directly from the token probabilities of video Vision-Language Models (VLMs). Evaluated on Open X-Embodiment (39 datasets, 780 episodes) and our new ManiRewardBench (113 tasks, 497 episodes across Franka, YAM, and SO-100/101), TOPReward achieves 0.945 mean VOC on ManiRewardBench using Qwen3-VL and outperforms GVL on open-source VLMs. It also enables success detection and reward-aligned behavior cloning.

130+

Zero-Shot Real-World Tasks

0.945

Mean VOC (Qwen3-VL)

Robot Platforms Evaluated

Task-Specific Training Required

Method Overview

**Qualitative example of "Fold the Towel":** Instruction-conditioned progress estimation on a real trajectory. The curve shows TOPReward's predicted completion value over time, with annotated values at selected frames corresponding to semantic sub-tasks.

Prompted Video–Language Inference

We ask the VLM to judge whether the observed trajectory completes the instruction, and we score the log-likelihood of an affirmative answer (e.g., the token "True").

Token-Probability Reward Extraction

Compute the log-likelihood of the answer token "True", avoiding reliance on the model’s instruction-following or numeric generation abilities.
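As a rough illustration of this step, the sketch below turns a vector of next-token logits into the log-likelihood of one chosen token with a numerically stable log-softmax. The function name `token_logprob`, the toy three-token vocabulary, and the token index are assumptions made for the example; obtaining real logits from a VLM and finding the ID of the "True" token depend on the model and tokenizer and are not shown.

```python
import numpy as np


def token_logprob(logits, token_id):
    """Log-probability of `token_id` under softmax(logits).

    `logits` is assumed to be the VLM's next-token logits at the answer
    position; how they are obtained is model-specific and not shown here.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(log_probs[token_id])


# Toy vocabulary of three tokens; index 0 plays the role of "True" (hypothetical).
example_logits = [2.0, 0.5, -1.0]
true_logprob = token_logprob(example_logits, token_id=0)
print(true_logprob)
```

Because only one token's probability is read out, the score does not depend on the model producing a well-formed number or following a formatting instruction.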

Progress Estimation from Trajectory Prefixes

Querying the VLM on successively longer prefixes of the trajectory yields one token log-likelihood per prefix; ordered in time, these scores form a dense temporal reward function.
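The prefix-based procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: `score_prefix` is a hypothetical callable standing in for the VLM query that returns the log-likelihood of the affirmative token, the number of prefixes is an arbitrary choice, and the per-episode min-max normalization follows the one mentioned under Limitations.

```python
import numpy as np


def progress_curve(frames, instruction, score_prefix, num_prefixes=8):
    """Score growing prefixes of one episode and min-max normalize the result.

    `score_prefix(prefix_frames, instruction)` is a hypothetical stand-in for
    the VLM call that returns the log-likelihood of the affirmative token.
    """
    # Evenly spaced prefix end points, always including the full trajectory.
    ends = np.unique(np.linspace(1, len(frames), num_prefixes).round().astype(int))
    raw = np.array([score_prefix(frames[:end], instruction) for end in ends])

    # Per-episode min-max normalization to [0, 1].
    span = raw.max() - raw.min()
    normalized = np.zeros_like(raw) if span == 0 else (raw - raw.min()) / span
    return ends, normalized


# Dummy scorer for illustration only: longer prefixes receive higher scores.
def dummy_scorer(prefix_frames, instruction):
    return -1.0 / len(prefix_frames)


frames = list(range(16))  # placeholder for 16 video frames
ends, progress = progress_curve(frames, "fold the towel", dummy_scorer)
print(ends, progress)
```

Since the normalization is computed per episode, the resulting curve describes relative progress within that episode and is not directly comparable across trajectories.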

Quantitative Results

Large-Scale Progress Estimation: Logit-Based Rewards Are More Reliable

**Results on the Open X-Embodiment dataset.** We report Mean VOC over 39 datasets with 20 episodes per dataset. Higher is better.
Method	Molmo-2-8B	Qwen3-VL-8B	Gemini-2.5-Pro
GVL	-0.016	0.194	0.541
TOPReward	0.417	0.857	0.433

**Results on the ManiRewardBench.** We report Mean VOC over 113 tasks, 497 episodes. Higher is better.
Dataset	Molmo-2 GVL	Molmo-2 TOPReward	Qwen3-VL-8B GVL	Qwen3-VL-8B TOPReward	Gemini^* GVL	Gemini^* TOPReward
LeRobot	-0.001	0.595	0.332	0.954	0.620	0.578
Franka	0.000	0.662	0.242	0.942	0.695	0.448
Bimanual YAM	0.007	0.565	0.164	0.947	0.566	0.546
Single-arm YAM	-0.017	0.642	0.544	0.945	0.752	0.488

**Progress traces on ManiRewardBench.** Example progress traces predicted by TOPReward (orange) compared to ground-truth completion (dashed) from ManiRewardBench. We also overlay Gemini-GVL (blue) on the same episodes when available.

TOPReward generates smooth, steadily increasing progress signals that closely match ground-truth completion across tasks. In contrast, Gemini-GVL produces noisier, non-monotonic predictions. TOPReward also captures multi-step structure, with plateaus at subtask completion and sharper gains during active manipulation.

Success Detection: Likelihood Beats Rank Correlation

VOC (Value-Order Correlation) measures rank consistency, not task completion, so failed trajectories that plateau early can still score high. In contrast, TOPReward estimates instruction-satisfaction likelihood, better separating success from failure.
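To make the distinction concrete, here is a small sketch that treats VOC as the Spearman rank correlation between predicted values and chronological frame order. That definition is an assumption based on the description of VOC as a rank-consistency measure; the helper names and the example trajectory are illustrative only.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        mask = x == value
        ranks[mask] = ranks[mask].mean()
    return ranks


def voc(values):
    """Spearman correlation between predicted values and frame order (assumed VOC)."""
    values = np.asarray(values, dtype=float)
    time_ranks = np.arange(1, len(values) + 1, dtype=float)
    value_ranks = average_ranks(values)
    if value_ranks.std() == 0:
        return 0.0  # a constant prediction carries no ordering information
    return float(np.corrcoef(time_ranks, value_ranks)[0, 1])


# A hypothetical failed episode: predictions rise steadily but stall near 0.3.
# The ordering is perfect, so VOC is 1 up to floating-point error.
failed_but_monotone = [0.00, 0.10, 0.20, 0.25, 0.28, 0.30]
print(voc(failed_but_monotone))
```

The score depends only on the ordering of the predictions, so an episode that never approaches completion can still obtain a high VOC.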

We evaluate success detection on the ManiRewardBench failure split (23 tasks) as a binary classification problem, reporting ROC-AUC. TOPReward uses the average log-likelihood of the last three sampled frames, while GVL uses VOC scores.

**Success detection results.**
Method	Qwen3-VL-8B	Gemini-2.5-Pro
GVL	0.519	0.823
TOPReward (Ours)	0.654	0.826
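The evaluation protocol described above can be sketched as follows: each episode is scored by the mean log-likelihood over its last three sampled frames, and ROC-AUC is computed over those scores. The log-likelihood values below are made up for illustration and are not taken from the benchmark; the pairwise AUC helper is a plain NumPy stand-in for a library routine.

```python
import numpy as np


def success_score(logprobs, k=3):
    """Mean log-likelihood over the last k sampled frames of one episode."""
    return float(np.mean(np.asarray(logprobs, dtype=float)[-k:]))


def roc_auc(scores, labels):
    """Probability that a random success outscores a random failure (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


# Hypothetical per-frame log-likelihoods for four episodes (illustrative values).
episodes = [
    [-3.0, -2.0, -0.5, -0.2, -0.1],  # success
    [-2.5, -1.0, -0.4, -0.3, -0.2],  # success
    [-3.0, -2.5, -2.2, -2.0, -2.1],  # failure
    [-3.0, -2.8, -2.9, -2.7, -2.6],  # failure
]
labels = [True, True, False, False]

scores = [success_score(ep) for ep in episodes]
print(roc_auc(scores, labels))
```

Averaging the final frames rather than using only the last one is a simple way to reduce sensitivity to a single noisy prediction.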

Real-World Deployment

We deploy TOPReward on a real single-arm SO-100 robot to compute advantage weights for imitation learning. Using only 50 noisy demonstrations per task, we combine advantage-weighted regression (AWR) with TOPReward rewards (TOP-AWR). Across all 6 tasks, TOP-AWR achieves more successes than standard behavior cloning (BC).

**Real-World Experiments.** We report number of successes (out of 10 trials) for advantage-weighted behavior cloning on single-arm SO-100 tasks.
Task	Pretrained	BC	TOP-AWR (Ours)
Place toy car in box	1	2	3
Stack red cube on green cube	1.3	1	2.3
Put pen into cup	1.7	5.7	6.3
Place doll in box	0	7	10
Pick up cube	4	7	10
Put cube in cup	4	6	9
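For readers unfamiliar with advantage weighting, the sketch below shows one generic way a progress signal could be turned into per-step weights for a behavior-cloning loss. It is not the TOP-AWR recipe: the advantage definition (centered one-step progress gain), the temperature `beta`, and the weight clipping are all assumptions chosen for illustration, following the standard exponential weighting used in advantage-weighted regression.

```python
import numpy as np


def awr_weights(progress, beta=0.5, max_weight=20.0):
    """Exponential advantage weights from a per-step progress curve.

    Assumption: the advantage of a step is its progress gain minus the
    episode's mean gain. `beta` and `max_weight` are illustrative values.
    """
    progress = np.asarray(progress, dtype=float)
    gains = np.diff(progress)            # progress made by each transition
    advantages = gains - gains.mean()    # centered within the episode
    weights = np.exp(advantages / beta)  # standard AWR-style weighting
    return np.minimum(weights, max_weight)


def weighted_bc_loss(per_step_loss, weights):
    """Weight-normalized average of per-step imitation losses."""
    per_step_loss = np.asarray(per_step_loss, dtype=float)
    return float(np.sum(weights * per_step_loss) / np.sum(weights))


# Hypothetical normalized progress for a four-frame demonstration.
progress = [0.0, 0.1, 0.5, 0.6]
weights = awr_weights(progress)
print(weights)
print(weighted_bc_loss([1.0, 2.0, 1.0], weights))
```

Under this weighting, transitions that advance the task more than average count more in the imitation loss, which is one way noisy demonstrations can be down-weighted without discarding them.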

LeRobot Rollouts

Pretrained - Place toy car in box ✗

Behavior Cloning (BC) - Place toy car in box ✗

TOP-AWR (Ours) - Place toy car in box ✓

Pretrained - Put cube in cup ✗

Behavior Cloning (BC) - Put cube in cup ✗

TOP-AWR (Ours) - Put cube in cup ✓

Pretrained - Place doll in box ✗

Behavior Cloning (BC) - Place doll in box ✗

TOP-AWR (Ours) - Place doll in box ✓

Pretrained - Put pen into cup ✗

Behavior Cloning (BC) - Put pen into cup ✗

TOP-AWR (Ours) - Put pen into cup ✓

Pretrained - Pick up cube ✗

Behavior Cloning (BC) - Pick up cube ✗

TOP-AWR (Ours) - Pick up cube ✓

Conclusion

We introduced TOPReward, a zero-shot progress estimator for robotic manipulation that interprets pretrained video VLM token likelihoods as temporal value functions. By querying the model's belief about task completion rather than relying on numerical output, TOPReward avoids the known limitations of VLMs in instruction following and numeric reasoning.

Experiments show that, with open-source VLMs, TOPReward outperforms GVL across diverse benchmarks and robot platforms, while enabling success detection and enhancing behavior cloning in real-world manipulation tasks.

Limitations and Future Work

Visual perception limits: Tasks requiring fine-grained spatial reasoning may yield noisy progress signals if the VLM cannot distinguish intermediate states.
Normalization constraints: Per-episode min-max normalization limits absolute progress comparison across trajectories without calibration.
Model-dependent performance: Performance relies on the underlying VLM's video understanding; stronger models should directly improve TOPReward.

BibTeX

@misc{chen2026topreward,
  title        = {TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
  author       = {Shirui Chen and Cole Harrison and Ying-Chun Lee and Angela Jin Yang and Jason Ren and Lillian J. Ratliff and Jiafei Duan and Dieter Fox and Ranjay Krishna},
  year         = {2026},
  howpublished = {\url{https://topreward.github.io/webpage/}},
  note         = {Project page}
}