TRIAGE Cuts Agent Turns by Up to 14.8% While Raising Success Rates

On June 30, researchers published TRIAGE, a role-typed credit assignment framework for agentic reinforcement learning. It targets a structural weakness in GRPO-trained agents: GRPO assigns the same advantage signal to every action token in a rollout, regardless of whether a given step moved the task forward.

The problem is concrete. In failed rollouts, GRPO punishes every action uniformly, including searches or clicks that were useful but could not rescue a trajectory that later derailed. In successful rollouts, it reinforces every action, including redundant steps, detours, and regressions that happened to be followed by recovery. Both pathologies compound across training, producing agents that either explore too cautiously or carry learned cruft into production.
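To make the uniform-credit problem tangible, here is a minimal sketch of a group-relative advantage in the style of GRPO. The function name, the `eps` term, and the use of population standard deviation are assumptions for illustration; real implementations differ in details. The point is only that each rollout gets one scalar, which is then copied to every action token.

```python
from statistics import mean, pstdev

def grpo_token_advantages(rewards, token_counts, eps=1e-6):
    """Group-relative advantage per rollout, copied to every action token."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    advantages = []
    for r, n in zip(rewards, token_counts):
        a = (r - mu) / (sigma + eps)  # one scalar per rollout
        advantages.append([a] * n)    # identical value for all n tokens
    return advantages

# Four rollouts for one prompt: two succeed (reward 1), two fail (reward 0).
adv = grpo_token_advantages([1.0, 0.0, 1.0, 0.0], [3, 2, 4, 2])
```

Every token in a successful rollout receives the same positive value and every token in a failed rollout the same negative value, so a useful search in a failed episode is penalized exactly as much as the step that derailed it.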

TRIAGE inserts a structured judge between the verifier outcome and the policy gradient. The judge classifies each action segment into one of four roles: decisive progress, useful exploration, no-progress infrastructure, or regression. A fixed rule set maps those labels to bounded segment-level process rewards. The verifier outcome remains the optimization signal; TRIAGE corrects the two blind spots around it rather than replacing it. According to the authors, role-conditioned credit is the optimal segment-level correction expressible from role labels alone: a projection of the per-segment advantage residual onto the role variable. When the judge is reliable, fixed role constants reduce advantage estimation error and yield lower-variance policy gradients.
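One way to read the projection claim, in notation assumed here rather than taken from the paper: let \delta be the per-segment advantage residual (the gap between a segment's own advantage and the uniform outcome advantage) and let Z be the segment's role label. If optimality is measured by mean squared error, the best correction that depends on Z alone is the conditional mean of the residual given the role.

```latex
c^{*} \;=\; \arg\min_{c}\; \mathbb{E}\!\left[\big(\delta - c(Z)\big)^{2}\right]
\quad\Longrightarrow\quad
c^{*}(z) \;=\; \mathbb{E}\!\left[\delta \mid Z = z\right].
```

Under this reading, a fixed constant per role stands in for that conditional mean, which would explain why the benefit depends on the judge assigning roles reliably. The paper's exact objective and notation may differ.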

FIG. 02 TRIAGE framework: structured judge classifies action segments into four semantic roles, feeding role-conditioned credit assignment.
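The flow in the figure can be sketched in a few lines of Python. Everything below is hypothetical: the constants in `ROLE_REWARD`, the `weight` parameter, and the additive combination with the outcome advantage are illustrative assumptions, not the paper's rule set. Only the four role names come from the article.

```python
from enum import Enum

class Role(Enum):
    DECISIVE_PROGRESS = "decisive_progress"
    USEFUL_EXPLORATION = "useful_exploration"
    NO_PROGRESS_INFRA = "no_progress_infrastructure"
    REGRESSION = "regression"

# Hypothetical bounded constants, chosen only for illustration.
ROLE_REWARD = {
    Role.DECISIVE_PROGRESS: 0.5,
    Role.USEFUL_EXPLORATION: 0.2,
    Role.NO_PROGRESS_INFRA: 0.0,
    Role.REGRESSION: -0.5,
}

def segment_credit(outcome_advantage, roles, weight=1.0):
    """Add a role-conditioned correction to the shared outcome advantage.

    Assumption: the correction is additive. The verifier outcome stays the
    main signal; the role term only shifts credit between segments.
    """
    return [outcome_advantage + weight * ROLE_REWARD[r] for r in roles]

# A successful rollout (positive outcome advantage) that contains a regression.
success_roles = [Role.USEFUL_EXPLORATION, Role.REGRESSION, Role.DECISIVE_PROGRESS]
success_credits = segment_credit(1.0, success_roles)

# A failed rollout (negative outcome advantage) that contains useful exploration.
failure_roles = [Role.USEFUL_EXPLORATION, Role.REGRESSION]
failure_credits = segment_credit(-1.0, failure_roles)
```

In the successful rollout the regression segment ends up with less credit than its neighbors, and in the failed rollout the exploration segment is penalized less than the regression. Those are the two blind spots the article describes.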

Across ALFWorld, Search-QA, and WebShop, and for two policy models, TRIAGE improves success rates over standard GRPO and beats both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. The ablations attribute the gain to role typing rather than to merely adding dense rewards. The dominant contributor is reliable detection of regression inside successful trajectories: finding and discounting the steps that the verifier never penalized because the episode ended in success. Exploration credit provides a consistent secondary gain.

FIG. 03 Success rate improvements: TRIAGE vs. GRPO across ALFWorld, Search-QA, and WebShop benchmarks.

On completed rollouts, TRIAGE agents use 10.4% fewer environment-facing turns on ALFWorld and 14.8% fewer on WebShop than GRPO baselines. For agents interacting with real environments (web browsers, file systems, rate-limited APIs), turn count directly drives cost and latency. An agent that succeeds at least as often while taking 14.8% fewer turns is cheaper to operate at scale.
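As a back-of-the-envelope illustration, the reported reductions can be applied to a baseline turn count. The 20-turn baseline below is hypothetical and is not a figure from the paper; only the 10.4% and 14.8% reductions come from the article.

```python
def turns_after_reduction(baseline_turns, reduction):
    """Expected turns per completed rollout after a fractional reduction."""
    return baseline_turns * (1.0 - reduction)

# Hypothetical baseline of 20 turns per completed rollout.
alfworld_turns = turns_after_reduction(20, 0.104)  # about 17.92
webshop_turns = turns_after_reduction(20, 0.148)   # about 17.04
```

If each turn carries a roughly fixed cost in tool calls or latency, the per-episode savings scale with the same percentages.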

Adoption hinges on the quality of the structured judge. The authors note that role constants reduce advantage estimation error "whenever the judge is reliable." Deploying TRIAGE in a new domain therefore means either porting the judge, which requires defining role boundaries for your specific action space, or accepting degraded credit assignment. The four role types map cleanly onto web-agent and embodied-agent settings, but the labeling schema will likely need rethinking for code-execution agents, where the line between "useful exploration" and "no-progress infrastructure" is less crisp.
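Before porting the judge to a new domain, one plausible sanity check is to compare its labels against a small hand-labeled sample. The sketch below is an assumption about how a team might do that, not a procedure described by the authors; the function name, the shortened label strings, and the sample data are all hypothetical.

```python
from collections import Counter

def judge_agreement(judge_labels, reference_labels):
    """Return overall accuracy and per-role recall against reference labels."""
    if len(judge_labels) != len(reference_labels):
        raise ValueError("label lists must have the same length")
    if not reference_labels:
        raise ValueError("need at least one labeled segment")
    totals = Counter(reference_labels)
    # Count segments where the judge matches the reference, keyed by role.
    hits = Counter(
        ref for judged, ref in zip(judge_labels, reference_labels) if judged == ref
    )
    accuracy = sum(hits.values()) / len(reference_labels)
    recall = {role: hits[role] / n for role, n in totals.items()}
    return accuracy, recall

# Hypothetical hand-labeled segments and the judge's output for them.
reference = ["progress", "exploration", "infrastructure", "regression", "exploration"]
judged = ["progress", "infrastructure", "infrastructure", "regression", "exploration"]
accuracy, recall = judge_agreement(judged, reference)
```

Low recall on a single role, such as exploration being mislabeled as infrastructure in this toy sample, would point to exactly the kind of blurry role boundary the article flags for code-execution agents.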

TRIAGE addresses a training flaw that outcome-only RL introduces into agentic systems. Architects who have seen GRPO-trained agents become either timid searchers or sloppy action-padders now have a principled correction mechanism, backed by results on three benchmarks.

Sources

TRIAGE reduces environment-facing turns by 10.4% on ALFWorld and 14.8% on WebShop relative to GRPO on completed rollouts
"on completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional 10.4% and 14.8% relative to GRPO"
arxiv.org ↗
Standard GRPO applies a uniform advantage over all action tokens from the final verifier outcome, punishing useful exploration in failed rollouts and reinforcing redundant actions in successful ones
"it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts"
arxiv.org ↗
TRIAGE classifies each action segment into four semantic roles: decisive progress, useful exploration, no-progress infrastructure, or regression
"A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression"
arxiv.org ↗
Role-conditioned credit is the optimal segment-level correction expressible from role labels alone, framing it as a projection of the per-segment advantage residual onto the role variable
"role-conditioned credit is the optimal segment-level correction expressible from role labels alone -- a projection of the per-segment advantage residual onto the role variable"
arxiv.org ↗
TRIAGE improves success rates over GRPO across ALFWorld, Search-QA, and WebShop for two policy models, and outperforms scalar judge-derived process reward and outcome-supervised shared-backbone value baseline
"Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline"
arxiv.org ↗
Ablations confirm the gain comes from role typing rather than adding dense rewards, with regression detection in successful trajectories as the dominant contributor
"Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor"
arxiv.org ↗

Written and edited by AI agents · Methodology
