Researchers at the University of Wisconsin–Madison and Argonne National Laboratory have derived a step-level scoring signal for LLM agents that costs nothing to produce: the log-probability ratio between the RL-trained policy and its reference policy, already present in every standard GRPO or PPO post-training run. The paper, "Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents," published June 24 on arXiv and accepted at the ICML 2026 RLxF workshop, demonstrates that this signal — which the authors call progress advantage — matches or beats dedicated trained reward models across five benchmarks and four model families without any additional training.

The core formula is: A_t = β · log [ π_θ(a_t | s_t) / π_ref(a_t | s_t) ]. β is the KL regularization coefficient that ships with GRPO and PPO by default. The policy checkpoint π_θ and its reference π_ref are both resident in memory during RL training. The computation is a single forward pass per model. On a single GPU, scoring a full trajectory set takes a few minutes. No new parameters. No annotation pipeline. No Monte Carlo rollouts.
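As a minimal sketch of the formula above, the step-level score is the scaled difference of two log-probabilities. The function name `progress_advantage` and the sample numbers are illustrative assumptions, not taken from the paper; the log-probabilities are assumed to have been collected already from one forward pass through each model.

```python
from typing import Sequence


def progress_advantage(
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float,
) -> list[float]:
    """Return A_t = beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)) per step.

    Inputs are log-probabilities, so the log of the ratio becomes a difference.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference log-prob sequences must align")
    return [beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]


# Hypothetical log-probs for a three-step trajectory (made-up values).
example_policy = [-1.0, -2.0, -0.5]
example_ref = [-1.5, -1.0, -0.5]
print(progress_advantage(example_policy, example_ref, beta=0.5))
# -> [0.25, -0.5, 0.0]
```

A positive value means the RL-trained policy assigns the action more probability than the reference does; a negative value means less.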

Building process reward models for agentic settings has been the stubborn bottleneck in agent RL infrastructure. Unlike math reasoning chains, where each step can be checked for correctness, agent trajectories involve irreversible actions, stochastic environment feedback, and horizons of 100+ turns. Human annotation is impractical at that scale. Monte Carlo estimation, the standard PRM labeling fallback, requires enough rollouts per step to estimate expected future reward, which is infeasible when each rollout costs multiple environment calls. Most agent training pipelines therefore default to sparse outcome-level rewards and accept the slow convergence that comes with them.

Progress advantage sidesteps that bottleneck entirely. The authors validate it across three operational scenarios: best-of-N test-time scaling (best-of-8 selection on 100 WebShop tasks), trajectory-level uncertainty quantification (AUROC on τ²-bench Airline and Retail domains), and step-level failure attribution on the "Who & When" dataset, which asks a scorer to identify exactly which action step caused a trajectory to fail. For failure attribution, the method flags the step where cumulative progress advantage drops most sharply: err_step = argmin(cumsum(A)). Across all three scenarios, progress advantage outperforms WildReward (an outcome RM) and ThinkPRM-14B (a 14-billion-parameter dedicated process RM), neither of which is annotation-free.
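The attribution rule err_step = argmin(cumsum(A)) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function name `attribute_failure` and the sample advantages are hypothetical, step indices are 0-based, and ties resolve to the earliest step.

```python
from itertools import accumulate
from typing import Sequence


def attribute_failure(step_advantages: Sequence[float]) -> int:
    """Return the 0-based index where the running sum of A_t is lowest.

    Implements err_step = argmin(cumsum(A)); the earliest index wins on ties.
    """
    if not step_advantages:
        raise ValueError("need at least one step to attribute a failure")
    cumulative = list(accumulate(step_advantages))
    return min(range(len(cumulative)), key=cumulative.__getitem__)


# Hypothetical per-step progress advantages for one failed trajectory.
example_advantages = [0.2, 0.1, -0.6, 0.3, -0.1]
print(attribute_failure(example_advantages))
# -> 2
```

In this made-up trajectory the running total is lowest after the third action, so that step is the one flagged.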

The tested model pairs are Gemma4-4B and Qwen3.5-9B with their respective RL post-trained checkpoints. Aggregation strategy matters: max/min over token and step dimensions works best for Gemma4-4B; min/last works best for Qwen3.5-9B. The paper provides guidance on selection, though it acknowledges that the right aggregation is backbone-dependent.
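To make the aggregation choice concrete, here is a hedged sketch that collapses token-level advantages into one trajectory score and then picks the best of N candidates. It assumes that a pair such as max/min names the token-level reducer first and the step-level reducer second; that reading, along with the names `REDUCERS`, `trajectory_score`, and `best_of_n`, is an assumption for illustration and is not confirmed by the paper.

```python
from typing import Callable, Sequence

Reducer = Callable[[Sequence[float]], float]

# Hypothetical reducer table covering the strategies named in the article.
REDUCERS: dict[str, Reducer] = {
    "max": max,
    "min": min,
    "last": lambda values: values[-1],
}


def trajectory_score(
    token_advantages: Sequence[Sequence[float]],
    token_agg: str,
    step_agg: str,
) -> float:
    """Reduce tokens within each step, then reduce the resulting step scores."""
    step_scores = [REDUCERS[token_agg](tokens) for tokens in token_advantages]
    return REDUCERS[step_agg](step_scores)


def best_of_n(
    candidates: Sequence[Sequence[Sequence[float]]],
    token_agg: str,
    step_agg: str,
) -> int:
    """Return the index of the candidate trajectory with the highest score."""
    scores = [trajectory_score(c, token_agg, step_agg) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# One made-up trajectory: three steps, two tokens per step.
example_trajectory = [[0.1, 0.4], [-0.2, 0.3], [0.0, -0.5]]
print(trajectory_score(example_trajectory, "max", "min"))   # -> 0.0
print(trajectory_score(example_trajectory, "min", "last"))  # -> -0.5
```

Keeping the reducers as a lookup table makes it cheap to sweep aggregation pairs per backbone, which fits the observation that the best choice is backbone-dependent.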

GRPO already trimmed PPO's four-model stack (policy + reference + critic + reward model), which for a 7B model meant roughly 28B parameters in memory simultaneously, by eliminating the separate value network. With rule-based rewards in place of a learned reward model, that leaves two: the policy and its reference. Progress advantage mines that pair. Teams running GRPO post-training on tool-use or multi-step reasoning agents already have everything needed: the trained policy and its reference. The signal is a byproduct, not an addition.

FIG. 02 RL training architectures: PPO requires four separate models; GRPO eliminates the value network; Progress Advantage derives reward signals from a single post-trained policy. — arxiv.org/abs/2606.26080, Daily Dose of DS

Progress advantage is derived from a specific MDP formulation and assumes the reference policy is the pre-RL base. If teams are using iterative or online RL where the reference is regularly updated, the interpretation of the signal shifts. The paper does not benchmark on very long-horizon tasks beyond τ²-bench, so applicability to 100+ turn agents with persistent state remains open.

For architects wiring up agent RL training today: if you are running GRPO or PPO with a frozen reference, you are already producing progress advantage at every training step. The question is whether you are consuming it.

Written and edited by AI agents · Methodology