Vision-Language-Action (VLA) models have advanced in pretraining, but their reinforcement learning (RL) performance is limited by low sample efficiency and sparse real-world rewards. Generalizable process reward models are needed, yet existing temporal value functions often fail to transfer across domains.
We introduce TOPReward, a zero-shot progress estimator for robotic manipulation that reads task progress directly from the token probabilities of video Vision-Language Models (VLMs). Evaluated on Open X-Embodiment (39 datasets, 780 episodes) and our new ManiRewardBench (113 tasks, 497 episodes across Franka, YAM, and SO-100/101), TOPReward achieves 0.945 mean VOC on ManiRewardBench using Qwen3-VL and outperforms GVL on open-source VLMs. It also enables success detection and reward-aligned behavior cloning.
We ask the VLM to judge whether the observed trajectory completes the instruction and score the log-likelihood of an affirmative answer (e.g., the token "True").
We compute the log-likelihood of the answer token "True" directly, which avoids relying on the model’s instruction-following or numeric generation abilities.
The extracted token probabilities are aligned across time to produce a dense temporal reward function.
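The scoring steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `score_prefix` is a hypothetical callable standing in for a real VLM query on the first `t` frames, the two-token vocabulary is a toy, and min-max normalization is only one assumed way to align scores across time.

```python
import math
from typing import Callable, List, Sequence


def log_prob_of_token(logits: Sequence[float], token_id: int) -> float:
    """Log-softmax of a single token, computed stably with log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[token_id] - log_z


def progress_curve(
    num_frames: int,
    score_prefix: Callable[[int], float],
    stride: int = 1,
) -> List[float]:
    """Score growing video prefixes, then rescale the scores to [0, 1].

    Assumption: min-max scaling is used here as a simple alignment step.
    """
    prefix_lengths = range(stride, num_frames + 1, stride)
    raw = [score_prefix(t) for t in prefix_lengths]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0 for _ in raw]
    return [(r - lo) / (hi - lo) for r in raw]


TRUE_ID = 1  # hypothetical id of the "True" token in a two-token toy vocabulary


def toy_scorer(num_frames_seen: int) -> float:
    # Stand-in for a real VLM call: the logit for "True" grows with the prefix.
    logits = [0.0, float(num_frames_seen) - 3.0]
    return log_prob_of_token(logits, TRUE_ID)


curve = progress_curve(num_frames=6, score_prefix=toy_scorer)
print([round(v, 3) for v in curve])
```

In a real setup, `score_prefix` would run the VLM on the instruction plus the video prefix and read the logits at the answer position; nothing is generated or parsed as text.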
| Method | Molmo-2-8B | Qwen3-VL-8B | Gemini-2.5-Pro |
|---|---|---|---|
| GVL | -0.016 | 0.194 | 0.541 |
| TOPReward | 0.417 | 0.857 | 0.433 |
| Dataset | GVL (Molmo-2) | TOPReward (Molmo-2) | GVL (Qwen3-VL-8B) | TOPReward (Qwen3-VL-8B) | GVL (Gemini*) | TOPReward (Gemini*) |
|---|---|---|---|---|---|---|
| LeRobot | -0.001 | 0.595 | 0.332 | 0.954 | 0.620 | 0.578 |
| Franka | 0.000 | 0.662 | 0.242 | 0.942 | 0.695 | 0.448 |
| Bimanual YAM | 0.007 | 0.565 | 0.164 | 0.947 | 0.566 | 0.546 |
| Single-arm YAM | -0.017 | 0.642 | 0.544 | 0.945 | 0.752 | 0.488 |
TOPReward generates smooth, steadily increasing progress signals that closely match ground-truth completion across tasks. In contrast, Gemini-GVL produces noisier, non-monotonic predictions. TOPReward also captures multi-step structure, with plateaus at subtask completion and sharper gains during active manipulation.
VOC (Value-Order Correlation) measures rank consistency, not task completion, so failed trajectories that plateau early can still score high. In contrast, TOPReward estimates instruction-satisfaction likelihood, better separating success from failure.
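To make the limitation concrete, the sketch below computes VOC under the assumption that it is a Spearman-style rank correlation between predicted values and chronological frame order, as in GVL. The helper names and the two toy trajectories are illustrative only.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def value_order_correlation(values: Sequence[float]) -> float:
    """Rank correlation between predicted values and frame order."""
    time_ranks = [float(t) for t in range(1, len(values) + 1)]
    return pearson(average_ranks(values), time_ranks)


# Made-up value curves: one reaches completion, one stalls at a low value.
success = [0.1, 0.3, 0.5, 0.7, 0.9]
stalled_failure = [0.05, 0.15, 0.20, 0.20, 0.20]

print(value_order_correlation(success))
print(value_order_correlation(stalled_failure))
```

Both curves score high because VOC only looks at ordering: the stalled trajectory never approaches completion, yet its values are still mostly non-decreasing in time.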
We evaluate success detection on the ManiRewardBench failure split (23 tasks) as a binary classification problem, reporting ROC-AUC. TOPReward uses the average log-likelihood of the last three sampled frames, while GVL uses VOC scores.
| Method | Qwen3-VL-8B | Gemini-2.5-Pro |
|---|---|---|
| GVL | 0.519 | 0.823 |
| TOPReward (Ours) | 0.654 | 0.826 |
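The success-detection protocol described above can be sketched as follows. The episode data is made up for illustration, and ROC-AUC is computed directly from its pairwise-ranking definition rather than with any particular library.

```python
from typing import List, Sequence


def success_score(log_likelihoods: Sequence[float], k: int = 3) -> float:
    """Mean log-likelihood over the last k sampled frames."""
    tail = log_likelihoods[-k:]
    return sum(tail) / len(tail)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random success outranks a random failure.

    Ties between a success and a failure count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical per-frame log-likelihoods of "True" with success labels.
episodes = [
    ([-4.0, -2.5, -1.0, -0.3, -0.2], 1),
    ([-4.2, -3.0, -1.5, -0.6, -0.4], 1),
    ([-4.1, -3.5, -3.2, -3.0, -3.1], 0),
    ([-3.9, -2.0, -0.9, -0.7, -0.6], 0),  # a failure that looks like a success
]

scores: List[float] = [success_score(ll) for ll, _ in episodes]
labels: List[int] = [y for _, y in episodes]
auc = roc_auc(scores, labels)
print(auc)
```

Because the score is a likelihood of instruction satisfaction rather than a rank statistic, a failed episode is penalized whenever its final frames still look incomplete to the VLM.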
We deploy TOPReward on a real single-arm SO-100 robot to compute advantage weights for imitation learning. Using only 50 noisy demonstrations per task, we apply advantage-weighted regression (AWR) with TOPReward-derived advantages (TOP-AWR). Across all 6 tasks, TOP-AWR consistently improves the number of successes over standard behavior cloning (BC).
| Task | Pretrained | BC | TOP-AWR (Ours) |
|---|---|---|---|
| Place toy car in box | 1 | 2 | 3 |
| Stack red cube on green cube | 1.3 | 1 | 2.3 |
| Put pen into cup | 1.7 | 5.7 | 6.3 |
| Place doll in box | 0 | 7 | 10 |
| Pick up cube | 4 | 7 | 10 |
| Put cube in cup | 4 | 6 | 9 |
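For readers unfamiliar with AWR, the sketch below shows how a progress curve could be turned into per-step weights for a behavior-cloning loss. The advantage definition (progress gained over a short horizon), the temperature `beta`, and the clipping value `max_weight` are assumptions chosen for illustration; the source does not specify how TOP-AWR computes its advantages.

```python
import math
from typing import List, Sequence


def progress_advantages(progress: Sequence[float], horizon: int = 1) -> List[float]:
    """Assumed advantage: progress gained over the next `horizon` steps."""
    last = len(progress) - 1
    return [progress[min(t + horizon, last)] - progress[t] for t in range(len(progress))]


def awr_weights(
    advantages: Sequence[float],
    beta: float = 0.5,
    max_weight: float = 20.0,
) -> List[float]:
    """Standard AWR weighting exp(A / beta), clipped for numerical stability."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def weighted_bc_loss(per_step_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-step imitation losses."""
    return sum(w * l for w, l in zip(weights, per_step_losses)) / sum(weights)


# Made-up progress curve for one noisy demonstration (note the dip at step 3).
progress = [0.0, 0.2, 0.2, 0.1, 0.5, 1.0]
advantages = progress_advantages(progress)
weights = awr_weights(advantages)

# Placeholder per-step losses, e.g. negative log-likelihood of demo actions.
losses = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print([round(w, 3) for w in weights])
print(weighted_bc_loss(losses, weights))
```

Steps that lead to a loss of progress receive weights below 1, so noisy segments of a demonstration contribute less to the policy update than segments that advance the task.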
We introduced TOPReward, a zero-shot progress estimator for robotic manipulation that interprets pretrained video VLM token likelihoods as temporal value functions. By querying the model's belief about task completion rather than relying on numerical output, TOPReward avoids the known limitations of VLMs in instruction following and numeric reasoning.
Experiments show that TOPReward outperforms GVL across benchmarks and robot platforms when paired with open-source VLMs, while enabling success detection and improving behavior cloning in real-world manipulation tasks.
@misc{chen2026topreward,
title = {TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics},
author = {Shirui Chen and Cole Harrison and Ying-Chun Lee and Angela Jin Yang and Jason Ren and Lillian J. Ratliff and Jiafei Duan and Dieter Fox and Ranjay Krishna},
year = {2026},
howpublished = {\url{https://topreward.github.io/webpage/}},
note = {Project page}
}