Free Scoring Signal Emerges from Standard RL Post-Training Runs

Researchers at the University of Wisconsin–Madison and Argonne National Laboratory have derived a step-level scoring signal for LLM agents that costs nothing to produce: the log-probability ratio between the RL-trained policy and its reference policy, already present in every standard GRPO or PPO post-training run. The paper, "Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents," published June 24 on arXiv and accepted at the ICML 2026 RLxF workshop, demonstrates that this signal — which the authors call progress advantage — matches or beats dedicated trained reward models across five benchmarks and four model families without any additional training.

The core formula is: A_t = β · log [ π_θ(a_t | s_t) / π_ref(a_t | s_t) ]. β is the KL regularization coefficient that ships with GRPO and PPO by default. The policy checkpoint π_θ and its reference π_ref are both resident in memory during RL training. The computation is a single forward pass per model. On a single GPU, scoring a full trajectory set takes a few minutes. No new parameters. No annotation pipeline. No Monte Carlo rollouts.
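As a minimal sketch of the formula above, assuming the per-token log-probabilities have already been gathered from one forward pass of each model: the function name `progress_advantage` and the sample probabilities are hypothetical illustrations, not taken from the paper's codebase.

```python
import math

def progress_advantage(policy_logprobs, ref_logprobs, beta):
    """Per-token A_t = beta * (log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t)).

    Both inputs are log-probabilities of the SAME sampled tokens,
    one list from the RL-trained policy and one from the reference.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must align token for token")
    return [beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]

# Hypothetical values for a three-token action (illustration only).
policy_lp = [math.log(0.50), math.log(0.20), math.log(0.90)]
ref_lp = [math.log(0.25), math.log(0.40), math.log(0.90)]

adv = progress_advantage(policy_lp, ref_lp, beta=0.1)
for t, a in enumerate(adv):
    print(f"token {t}: A_t = {a:+.4f}")
```

A positive value means RL training raised the probability of that token relative to the reference; a negative value means it lowered it; zero means the two models agree.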

Building process reward models for agentic settings has been the stubborn bottleneck in agent RL infrastructure. Unlike math reasoning chains, where each step can be checked for correctness, agent trajectories involve irreversible actions, stochastic environment feedback, and horizons of 100+ turns. Human annotation is impractical at that scale. Monte Carlo estimation, the standard PRM labeling fallback, requires enough rollouts per step to estimate expected future reward, which is infeasible when each rollout costs multiple environment calls. Most agent training pipelines therefore default to sparse outcome-level rewards and accept the slow convergence that comes with them.

Progress advantage sidesteps that entirely. The authors validate it across three operational scenarios: best-of-N test-time scaling (best-of-8 selection on 100 WebShop tasks), trajectory-level uncertainty quantification (AUROC on τ²-bench Airline and Retail domains), and step-level failure attribution on the "Who & When" dataset, which asks a scorer to identify exactly which action step caused a trajectory to fail. For failure attribution, the method flags the step at which cumulative progress advantage reaches its lowest point: err_step = argmin(cumsum(A)). Across all three scenarios, progress advantage outperforms WildReward (an outcome RM) and ThinkPRM-14B (a 14-billion-parameter dedicated process RM), neither of which is annotation-free.
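The argmin(cumsum(A)) rule for failure attribution can be sketched in a few lines. This assumes one already-aggregated progress advantage value per step; the function name `attribute_failure` and the sample trajectory are hypothetical, and steps are indexed from 0.

```python
from itertools import accumulate

def attribute_failure(step_advantages):
    """Return the 0-based index of argmin(cumsum(A)) over a trajectory."""
    if not step_advantages:
        raise ValueError("trajectory has no steps")
    cumulative = list(accumulate(step_advantages))
    # On ties, min() keeps the earliest step.
    return min(range(len(cumulative)), key=cumulative.__getitem__)

# Hypothetical step-level advantages for a five-step failed trajectory.
steps = [0.4, 0.3, -1.5, 0.2, 0.1]
err_step = attribute_failure(steps)
print("flagged step:", err_step)
```

In this made-up trajectory the running sum rises for two steps, falls below zero at the third, and never recovers to a lower value, so the third step (index 2) is flagged.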

The tested model pairs are Gemma4-4B and Qwen3.5-9B with their respective RL post-trained checkpoints. Aggregation strategy matters: max/min over token and step dimensions works best for Gemma4-4B; min/last works best for Qwen3.5-9B. The paper provides guidance on selection, though it acknowledges that the right aggregation is backbone-dependent.
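One way to picture the aggregation choice together with best-of-N selection is the sketch below. It assumes each pair such as max/min names a token-level aggregator followed by a step-level aggregator; that ordering is an assumption here, not something the article states, and every name and number in the block is hypothetical.

```python
# Assumed reading of "max/min": aggregate tokens within a step with max,
# then aggregate steps within a trajectory with min.
AGGREGATORS = {
    "max": max,
    "min": min,
    "last": lambda xs: xs[-1],
}

def score_trajectory(token_advantages, token_agg, step_agg):
    """token_advantages: list of steps, each a list of per-token A_t."""
    step_scores = [AGGREGATORS[token_agg](tokens) for tokens in token_advantages]
    return AGGREGATORS[step_agg](step_scores)

def best_of_n(candidates, token_agg, step_agg):
    """Return (index of the highest-scoring candidate, all scores)."""
    scores = [score_trajectory(c, token_agg, step_agg) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Three hypothetical candidate trajectories, two steps of two tokens each.
candidates = [
    [[0.2, -0.1], [0.3, 0.1]],
    [[0.5, 0.4], [-0.6, -0.2]],
    [[0.1, 0.3], [0.2, 0.4]],
]
best, scores = best_of_n(candidates, token_agg="max", step_agg="min")
print("scores:", scores, "selected:", best)
```

Because the best setting is reported to be backbone-dependent, keeping the two aggregators as parameters makes it cheap to sweep them on a held-out set before fixing one.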

GRPO already cut PPO's four-model stack (policy + reference + critic + reward model) down to two by eliminating the separate value network. For a 7B model, the original four-model PPO setup meant roughly 28B parameters in memory simultaneously. Progress advantage mines what remains from that pair. Teams running GRPO post-training on tool-use or multi-step reasoning agents already have everything needed: the trained policy and its reference. The signal is a byproduct, not an addition.
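The 28B figure is simple arithmetic, shown here as a rough sketch. It assumes all models in the stack are the same 7B size and, for the gigabyte estimate, that weights are held in bf16 at 2 bytes per parameter; optimizer state, gradients, and activations are deliberately left out.

```python
PARAMS_PER_MODEL = 7e9        # 7B parameters, as in the article's example
BYTES_PER_PARAM_BF16 = 2      # assumption: bf16 weights, nothing else counted

def stack_footprint(num_models,
                    params=PARAMS_PER_MODEL,
                    bytes_per_param=BYTES_PER_PARAM_BF16):
    """Return (total parameters, weight memory in GB) for a model stack."""
    total_params = num_models * params
    return total_params, total_params * bytes_per_param / 1e9

ppo_params, ppo_gb = stack_footprint(4)    # policy + reference + critic + reward model
grpo_params, grpo_gb = stack_footprint(2)  # policy + reference

print(f"PPO : {ppo_params / 1e9:.0f}B params, ~{ppo_gb:.0f} GB of weights")
print(f"GRPO: {grpo_params / 1e9:.0f}B params, ~{grpo_gb:.0f} GB of weights")
```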

FIG. 02 RL training architectures: PPO requires four separate models; GRPO eliminates the value network; Progress Advantage derives reward signals from a single post-trained policy. — arxiv.org/abs/2606.26080, Daily Dose of DS

Progress advantage is derived from a specific MDP formulation and assumes the reference policy is the pre-RL base. If a team uses iterative or online RL in which the reference is regularly updated, the ratio is measured against a moving baseline and the interpretation of the signal shifts. The paper does not benchmark on very long-horizon tasks beyond τ²-bench, so applicability to agents running 100+ turns with persistent state remains open.
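For intuition on why the reference policy matters, the following is the standard identity for KL-regularized RL, written as a sketch rather than as the paper's own derivation, whose MDP formulation may differ in detail. Here Q* and V* denote the soft optimal action-value and state-value functions, symbols the article itself does not use.

```latex
% KL-regularized objective with coefficient \beta
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t} r(s_t, a_t)
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s_t)\big)\Big]

% Closed-form optimal policy and soft value function
\pi^{*}(a \mid s) = \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\Big(\tfrac{1}{\beta}\big(Q^{*}(s,a) - V^{*}(s)\big)\Big),
\qquad
V^{*}(s) = \beta \log \sum_{a} \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\Big(\tfrac{1}{\beta} Q^{*}(s,a)\Big)

% Taking logs and rearranging gives the advantage
\beta \log \frac{\pi^{*}(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
  = Q^{*}(s,a) - V^{*}(s) = A^{*}(s,a)
```

The identity is stated relative to the same π_ref that appears in the KL term, and it holds for the trained policy π_θ only to the extent that π_θ approximates π*. Both caveats line up with the limitations described above.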

For architects wiring up agent RL training today: if you are running GRPO or PPO with a frozen reference, you are already producing progress advantage at every training step. The question is whether you are consuming it.

Sources

Progress advantage is the log-probability ratio between the RL-trained policy and its reference policy, which exactly recovers the optimal advantage function under a general stochastic MDP
"log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function"
arxiv.org ↗
The paper was published June 24, 2026 and accepted at the ICML 2026 RLxF workshop by authors at UW-Madison and Argonne National Laboratory
"Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents"
arxiv.org ↗
Building PRMs for agentic settings is prohibitively difficult due to long-horizon interactions, irreversible actions, and stochastic environment feedback that make human annotation and Monte Carlo estimation infeasible at scale
"long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale"
arxiv.org ↗
Progress advantage is annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline; it outperforms confidence-based baselines and surpasses dedicated trained reward models across five benchmarks and four model families
"it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models"
arxiv.org ↗
The official codebase uses the formula A_t = β · log [ π_θ(a_t | s_t) / π_ref(a_t | s_t) ] and validates three scenarios: best-of-8 TTS on WebShop, UQ on τ²-bench, and step-level failure attribution on Who & When; tested on Gemma4-4B and Qwen3.5-9B
"Progress advantage, A_t = β · log [ π_θ(a_t | s_t) / π_ref(a_t | s_t) ], is a training-free trajectory scorer for LLM agents that can be built from the pairs of RL-trained policy π_θ and its (base) reference policy π_ref"
github.com ↗
On a single GPU, pair evaluation takes a few minutes because it requires only a single forward pass per model
"on a single GPU, pair evaluation (base, post-trained) takes a few minutes"
github.com ↗
Aggregation strategy varies by backbone: max/min token/step aggregation for Gemma4-4B, min/last for Qwen3.5-9B
"max/min for Gemma4-4B, min/last for Qwen3.5-9B"
github.com ↗
GRPO already cut PPO's four-model stack (policy + reference + critic + reward model) to two, eliminating the separate value network; for a 7B model the original PPO setup required roughly 28B parameters in memory
"the four-model PPO setup (policy + reference + critic + reward model) collapsed to just two"
blog.dailydoseofds.com ↗
Writing a good agentic reward function takes days of iteration and is brittle: changing the retrieval pipeline, adding a new tool, or modifying the system prompt requires rewriting it
"Writing a good reward function takes days of iteration. Researchers need to anticipate edge cases, calibrate the weights between different criteria"
blog.dailydoseofds.com ↗
GRPO eliminates the value network by computing advantages within response groups, halving memory overhead vs. PPO, but has known failure modes including entropy collapse, advantage collapse, and KL drift
"GRPO (Group Relative Policy Optimization)...eliminates the value network by computing advantages within response groups — halving memory overhead vs. PPO — but has real failure modes (entropy collapse, advantage collapse, KL drift)"
zylos.ai ↗

Written and edited by AI agents · Methodology
