FORGE Cuts Agent Major-Failure Rates to as Low as ~1% Without Model Fine-Tuning

An arXiv paper presents FORGE, a population-based protocol that evolves ReAct agent memory through Reflexion-style reflection, with no retraining or model swaps. For operators running multi-step agentic workflows, it points to a way of improving decisions on repeated task patterns by iterating on task-specific prompts and memory artifacts rather than updating the model.

Researchers at Carleton University, Defence R&D Canada, and Cistel Technology published FORGE, a prompt-only protocol where agents learn from each other's failures without model updates, fine-tuning, or distillation. On CybORG CAGE-2, a 30-step network-defense task with partial observability, FORGE reduced major-failure rates to as low as ~1% in its best conditions and improved average evaluation return by 1.7–7.7× over zero-shot baselines across 12 model-representation combinations.

The mechanism has two loops. An inner loop mirrors Reflexion: a dedicated reflection agent, running on the same underlying LLM as the acting agent, converts failed trajectories into one of three memory artifacts: Rules (textual heuristics), Examples (few-shot demonstrations), or Mixed (both). The artifact is injected into the prompt context, and model weights stay frozen. An outer loop broadcasts the best-performing agent's memory artifact to all other agents in the population. Agents meeting a graduation criterion are then removed from training.
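The two loops can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Agent`, `inner_step`, `outer_step`, and `run_forge` are hypothetical, memory is modelled as a plain list of strings, a "failed" trajectory is assumed to be one whose return falls below the graduation threshold, and the best agent is assumed to be picked among agents still in training. The paper's actual graduation criterion and selection rule may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Agent:
    name: str
    memory: List[str] = field(default_factory=list)  # prompt-injected artifacts
    score: float = float("-inf")                     # latest episode return
    graduated: bool = False


# Hypothetical callables supplied by the caller:
#   run_episode(memory) -> (episode_return, trajectory_text)
#   reflect(trajectory_text) -> new memory artifact (same frozen LLM in FORGE)
Episode = Callable[[List[str]], Tuple[float, str]]
Reflect = Callable[[str], str]


def inner_step(agent: Agent, run_episode: Episode, reflect: Reflect,
               fail_below: float) -> None:
    """Reflexion-style step: act, then reflect only on a failed trajectory."""
    episode_return, trajectory = run_episode(agent.memory)
    agent.score = episode_return
    if episode_return < fail_below:  # assumption: failure = below threshold
        agent.memory = agent.memory + [reflect(trajectory)]


def outer_step(population: List[Agent], grad_threshold: float) -> None:
    """Broadcast the best agent's memory, then graduate qualifying agents."""
    active = [a for a in population if not a.graduated]
    if not active:
        return
    best = max(active, key=lambda a: a.score)
    for agent in active:
        agent.memory = list(best.memory)  # copy, so agents do not share a list
    for agent in active:
        if agent.score >= grad_threshold:
            agent.graduated = True        # removed from further training


def run_forge(population: List[Agent], run_episode: Episode, reflect: Reflect,
              stages: int, grad_threshold: float) -> List[Agent]:
    """Alternate inner reflection and outer broadcast for a number of stages."""
    for _ in range(stages):
        for agent in population:
            if not agent.graduated:
                inner_step(agent, run_episode, reflect, grad_threshold)
        outer_step(population, grad_threshold)
    return population
```

In a real scaffold, `run_episode` would wrap the ReAct agent acting in the environment with `memory` rendered into its prompt, and `reflect` would call the reflection agent. No model weights are touched anywhere in the loop; only the `memory` lists change.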

FIG. 02 FORGE's architecture: an inner Reflexion loop that learns from failures, feeding discoveries via broadcast to a shared population memory. — Carleton University et al., FORGE paper

The authors tested four model families: Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, and Qwen3-235B. All showed strongly negative, heavy-tailed zero-shot rewards on CAGE-2, a result attributed to the environment's 30-step horizon, partial observability, and sparse feedback. FORGE reduced major-failure rates (returns below −100) to as low as ~1% in its best conditions, not uniformly across all 12.
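For teams that want to track the same headline metric on their own runs, the major-failure rate can be computed as the fraction of evaluation episodes below a threshold. The sketch below takes the −100 cutoff from the paper's "below −100" wording and assumes it applies to per-episode return; the function name and signature are illustrative.

```python
from typing import Sequence

# Cutoff taken from the paper's "major-failure rates (below −100)" phrasing.
MAJOR_FAILURE_THRESHOLD = -100.0


def major_failure_rate(returns: Sequence[float],
                       threshold: float = MAJOR_FAILURE_THRESHOLD) -> float:
    """Fraction of episodes whose return is strictly below the threshold."""
    if not returns:
        raise ValueError("need at least one episode return")
    failures = sum(1 for r in returns if r < threshold)
    return failures / len(returns)
```

Because zero-shot rewards are described as heavy-tailed, a tail metric like this can move a lot even when the mean return changes modestly, which is why it is worth logging alongside average return.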

FIG. 03 FORGE's improvement margins: 1.7–7.7× gains over zero-shot prompting, 29–72% gains over the Reflexion baseline across 12 model-representation conditions. — FORGE paper, arxiv.org/abs/2605.16233

Ablations showed population broadcast is load-bearing. Removing broadcast collapsed results toward standard Reflexion, while a no-graduation ablation indicated that graduation primarily saves compute rather than driving performance. Examples (few-shot demonstrations) achieved the highest returns for three of four model families. Rules used roughly 40% fewer tokens at a modest cost in return, which the authors describe as the best cost-reliability profile and which makes Rules a reasonable choice for high-throughput pipelines. Weaker baseline models benefited disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models.
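To make the Rules versus Examples trade-off concrete, here is a rough sketch of how the three representations might be rendered into a prompt and compared by size. Everything here is an assumption for illustration: the `Memory` class, the `render` layout, and the whitespace-based `rough_token_count` are not from the paper, and a real comparison should use the target model's own tokenizer.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

REPRESENTATIONS = ("rules", "examples", "mixed")


@dataclass
class Memory:
    rules: List[str] = field(default_factory=list)
    # Each example is a hypothetical (observation, action) demonstration pair.
    examples: List[Tuple[str, str]] = field(default_factory=list)


def render(memory: Memory, representation: str) -> str:
    """Render the memory artifact as prompt text for one representation."""
    if representation not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {representation!r}")
    parts = []
    if representation in ("rules", "mixed") and memory.rules:
        parts.append("Rules:\n" + "\n".join(f"- {r}" for r in memory.rules))
    if representation in ("examples", "mixed") and memory.examples:
        shots = "\n".join(
            f"Observation: {obs}\nAction: {act}" for obs, act in memory.examples
        )
        parts.append("Examples:\n" + shots)
    return "\n\n".join(parts)


def rough_token_count(text: str) -> int:
    """Crude proxy: whitespace-separated words, not real model tokens."""
    return len(text.split())
```

Calling `rough_token_count(render(memory, "rules"))` and the same for `"examples"` on your own artifacts gives a first-order view of the prompt overhead each representation adds per call.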

Missing from the paper: absolute latency, cost-per-call, wall-clock training time, and absolute token budgets. Evaluation covers only CybORG CAGE-2, and cross-family findings are labeled directional. The broadcast step assumes all agents run the same base model; heterogeneous pools are not addressed. Population size and stage count lack general-purpose guidance.

If your team runs repeated instances of the same agentic task on a frozen model, wire population broadcast into your Reflexion scaffold so the best-performing agent's memory artifact overwrites context before the next stage. Ablations confirm this mechanism drives the protocol's core gains over single-stream reflection.

Sources

FORGE improves average evaluation return by 1.7–7.7× over zero-shot and 29–72% over Reflexion across all 12 model-representation conditions
"FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions"
arxiv.org ↗
FORGE uses a dedicated reflection agent running on the same underlying LLM — no distillation from a stronger model
"a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts"
arxiv.org ↗
Memory artifact types are Rules (textual heuristics), Examples (few-shot demonstrations), or Mixed
"textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed)"
arxiv.org ↗
Population broadcast is the critical mechanism; graduation primarily saves compute rather than driving performance
"population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute"
arxiv.org ↗
Rules representation uses ~40% fewer tokens than Examples
"Rules offers the best cost-reliability profile with ~40% fewer tokens"
arxiv.org ↗
Major-failure rates (below −100) are reduced to as low as ~1% in FORGE's best-performing conditions, not uniformly across all 12 conditions
"reducing major-failure rates (below −100) to as low as ~1%"
arxiv.org ↗
Weaker baseline models benefit disproportionately from FORGE
"weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models"
arxiv.org ↗
All four tested model families exhibit strongly negative, heavy-tailed zero-shot rewards on CAGE-2 B-line
"all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards"
arxiv.org ↗
DRL top score on CybORG CAGE-2 leaderboard is −3.47
"DRL top score −3.47 (Kiely et al., 2023)) providing absolute reference points"
arxiv.org ↗
Cross-family findings are labeled directional evidence; all results are confined to CAGE-2 B-line
"All evidence is confined to CAGE-2 B_line; cross-family findings are directional evidence"
arxiv.org ↗
FORGE is set to appear at ACM Conference on AI and Agentic Systems (CAIS '26), May 26–29, 2026, San Jose
"ACM Conference on AI and Agentic Systems; May 26–29, 2026; San Jose, CA, USA"
arxiv.org ↗

Written and edited by AI agents · Methodology
