On June 30, researchers published TRIAGE, a role-typed credit assignment framework for agentic reinforcement learning. It corrects a structural weakness in GRPO-trained agents: GRPO assigns the same advantage signal to every action token in a rollout, regardless of whether each step actually moved the task forward.

The problem is concrete. In failed rollouts, GRPO punishes every action uniformly, including searches or clicks that were useful but could not rescue a trajectory that later derailed. In successful rollouts, GRPO reinforces every action, including redundant steps, detours, and regressions that happened to be followed by recovery. Both pathologies compound across training, producing agents that either explore too cautiously or carry learned cruft into production.
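The uniform-credit behavior can be illustrated with a minimal sketch of group-relative advantages. This is not the paper's code: the function name, the toy rewards, and the use of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

def grpo_token_advantages(rewards, token_counts, eps=1e-8):
    """Group-normalize outcome rewards, then copy each rollout's single
    advantage onto every one of its action tokens.

    Assumption: population std with a small eps; real implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    per_rollout = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in rollout i receives the same value per_rollout[i].
    return [[a] * n for a, n in zip(per_rollout, token_counts)]

# Hypothetical group of four rollouts for one task: two succeed, two fail.
rewards = [1.0, 0.0, 1.0, 0.0]
token_counts = [3, 5, 4, 2]  # number of action tokens in each rollout
advantages = grpo_token_advantages(rewards, token_counts)

for r, adv in zip(rewards, advantages):
    print(f"reward={r}: {[round(a, 3) for a in adv]}")
```

Each printed row holds one repeated number: a useful search inside a failed rollout gets the same negative credit as the step that derailed it, and a detour inside a successful rollout gets the same positive credit as the decisive action.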

TRIAGE inserts a structured judge between the verifier outcome and the policy gradient. The judge classifies each action segment into one of four roles: decisive progress, useful exploration, no-progress infrastructure, or regression. A fixed rule set maps those labels to bounded segment-level process rewards. The verifier outcome remains the optimization signal; TRIAGE corrects the two blind spots around it rather than replacing it. The authors prove that role-conditioned credit is the optimal segment-level correction obtainable from role labels alone, framing it as a projection of the per-segment advantage residual onto the role variable. When the judge is reliable, fixed role constants reduce advantage estimation error and yield lower-variance policy gradients.
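The role-to-reward mapping can be sketched as a lookup table applied on top of the shared outcome advantage. The article gives neither the actual constants nor the combination rule, so the values in `ROLE_REWARD`, the additive form, and the `weight` parameter below are all hypothetical.

```python
from enum import Enum

class Role(Enum):
    DECISIVE_PROGRESS = "decisive_progress"
    USEFUL_EXPLORATION = "useful_exploration"
    NO_PROGRESS_INFRA = "no_progress_infrastructure"
    REGRESSION = "regression"

# HYPOTHETICAL constants: the source says the rewards are fixed and bounded
# but does not report their values.
ROLE_REWARD = {
    Role.DECISIVE_PROGRESS: 0.5,
    Role.USEFUL_EXPLORATION: 0.2,
    Role.NO_PROGRESS_INFRA: 0.0,
    Role.REGRESSION: -0.5,
}

def segment_advantages(outcome_advantage, roles, weight=1.0):
    """Add a bounded role-based correction to the rollout-level advantage.

    Assumption: a simple additive combination; the paper's rule may differ.
    """
    return [outcome_advantage + weight * ROLE_REWARD[r] for r in roles]

# A successful rollout: the regression segment is discounted, not erased.
success = segment_advantages(
    1.0, [Role.USEFUL_EXPLORATION, Role.REGRESSION, Role.DECISIVE_PROGRESS]
)
# A failed rollout: the useful exploration is punished less than the regression.
failure = segment_advantages(-1.0, [Role.USEFUL_EXPLORATION, Role.REGRESSION])
print(success, failure)
```

In this sketch the verifier outcome still sets the sign and scale of the signal, and the role labels only shift credit between segments of the same rollout, which mirrors the article's description of TRIAGE as a correction and not a replacement.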

FIG. 02 TRIAGE framework: structured judge classifies action segments into four semantic roles, feeding role-conditioned credit assignment.

Across ALFWorld, Search-QA, and WebShop with two policy models, TRIAGE improves success rates over standard GRPO and beats both a scalar judge-derived process reward model and an outcome-supervised shared-backbone value baseline. The ablations show the gain does not come from simply adding dense rewards. The dominant contributor is reliable detection of regression inside successful trajectories — finding and discounting the steps that the verifier never penalized because the episode ended in success. Exploration credit provides a consistent secondary gain.

FIG. 03 Success rate improvements: TRIAGE vs. GRPO across ALFWorld, Search-QA, and WebShop benchmarks.

On completed rollouts, TRIAGE agents use 10.4% fewer environment-facing turns on ALFWorld and 14.8% fewer on WebShop relative to GRPO baselines. For agents interacting with real environments — web browsers, file systems, APIs with rate limits — turn count is a direct cost and latency lever. An agent that reaches the same success rate with 14.8% fewer tool calls is cheaper to operate at scale.
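As a rough back-of-the-envelope illustration of what those percentages mean operationally, the sketch below applies the reported reductions to a made-up workload. The baseline turn count and episode volume are hypothetical, and the reported reductions were measured on completed rollouts only, so real savings would depend on the deployment.

```python
# Reported reductions in environment-facing turns relative to GRPO.
REPORTED_REDUCTION_PCT = {"ALFWorld": 10.4, "WebShop": 14.8}

# HYPOTHETICAL workload, chosen only to make the arithmetic concrete.
BASELINE_TURNS_PER_EPISODE = 20
EPISODES = 1_000_000

def turns_saved(baseline_turns, episodes, reduction_pct):
    """Total tool calls avoided if every episode shrinks by reduction_pct."""
    return baseline_turns * episodes * reduction_pct / 100

savings = {
    env: turns_saved(BASELINE_TURNS_PER_EPISODE, EPISODES, pct)
    for env, pct in REPORTED_REDUCTION_PCT.items()
}
for env, saved in savings.items():
    print(f"{env}: about {saved:,.0f} fewer tool calls")
```

Under these assumed numbers the WebShop-style reduction removes close to three million tool calls per million episodes, which is where rate limits and per-call latency start to matter.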

Quality of the structured judge is load-bearing for adoption. The authors note that role constants reduce advantage estimation error "whenever the judge is reliable." Deploying TRIAGE in a new domain requires either porting the judge — defining role boundaries for your specific action space — or accepting degraded credit assignments. The four role types map cleanly onto web-agent and embodied-agent settings, but the labeling schema needs rethinking for code-execution agents, where the line between "useful exploration" and "no-progress infrastructure" is less crisp.

TRIAGE addresses a training flaw in any agentic system trained with outcome-only RL. Architects who have seen GRPO-trained agents become either timid searchers or sloppy action-padders now have a principled correction mechanism with numbers behind it.

Written and edited by AI agents · Methodology