Researchers at Carleton University, Defence R&D Canada, and Cistel Technology published FORGE, a prompt-only protocol in which agents learn from each other's failures without model updates, fine-tuning, or distillation. On CybORG CAGE-2, a 30-step network-defense task with partial observability, FORGE cut major-failure events to as low as ~1% in its best conditions and improved on zero-shot baselines by 1.7–7.7× across 12 model-representation combinations.

The mechanism has two loops. The inner loop mirrors Reflexion: a reflection agent converts failed trajectories into a memory artifact and injects it into the prompt context. The artifact takes one of three forms: Rules (textual heuristics), Examples (few-shot demonstrations), or both. Model weights stay frozen. The outer loop broadcasts the best-performing agent's memory artifact to all other agents in the population. Agents that meet a graduation criterion are then removed from the training population.
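The two loops described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent` class, `forge_stage`, `train`, and the toy `toy_run_episode` and `toy_reflect` functions are all hypothetical names, the memory artifact is modeled as a plain string, and the graduation criterion is assumed to be a simple score threshold. In a real scaffold, `run_episode` would call the frozen model on the task and `reflect` would call a reflection agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical signatures: an episode maps a memory string to (score, trajectory);
# a reflector maps (memory, failed trajectory) to an updated memory string.
RunEpisode = Callable[[str], Tuple[float, str]]
Reflect = Callable[[str, str], str]


@dataclass
class Agent:
    name: str
    memory: str = ""                    # prompt-side artifact (Rules, Examples, or both)
    best_memory: str = ""               # memory that produced this agent's best score
    best_score: float = float("-inf")
    graduated: bool = False


def forge_stage(agents: List[Agent], run_episode: RunEpisode, reflect: Reflect,
                graduate_at: float, inner_steps: int = 1, broadcast: bool = True) -> None:
    """Run one stage: inner reflection loop, optional broadcast, then graduation."""
    active = [a for a in agents if not a.graduated]
    if not active:
        return
    # Inner loop (Reflexion-style): act, and on failure rewrite the prompt memory.
    for agent in active:
        for _ in range(inner_steps):
            score, trajectory = run_episode(agent.memory)
            if score > agent.best_score:
                agent.best_score, agent.best_memory = score, agent.memory
            if score >= graduate_at:
                break
            agent.memory = reflect(agent.memory, trajectory)
    # Outer loop: copy the leader's best memory into every other active agent.
    if broadcast:
        leader = max(active, key=lambda a: a.best_score)
        for agent in active:
            if agent is not leader:
                agent.memory = leader.best_memory
    # Graduation (assumed here to be a score threshold).
    for agent in active:
        if agent.best_score >= graduate_at:
            agent.graduated = True


def train(agents: List[Agent], run_episode: RunEpisode, reflect: Reflect,
          graduate_at: float, max_stages: int = 10, inner_steps: int = 1,
          broadcast: bool = True) -> int:
    """Return the number of stages used before every agent graduated."""
    for stage in range(1, max_stages + 1):
        forge_stage(agents, run_episode, reflect, graduate_at, inner_steps, broadcast)
        if all(a.graduated for a in agents):
            return stage
    return max_stages


def toy_run_episode(memory: str) -> Tuple[float, str]:
    # Toy stand-in for the environment: each stored rule adds one point.
    n_rules = len(memory.splitlines())
    return float(n_rules) - 3.0, f"episode ended with {n_rules} rules in memory"


def toy_reflect(memory: str, trajectory: str) -> str:
    # Toy stand-in for the reflection agent: append one new rule per failure.
    n_rules = len(memory.splitlines())
    return (memory + "\n" if memory else "") + f"rule {n_rules + 1}"


if __name__ == "__main__":
    for use_broadcast in (True, False):
        population = [Agent("a"), Agent("b", memory="rule 1\nrule 2")]
        stages = train(population, toy_run_episode, toy_reflect,
                       graduate_at=0.0, broadcast=use_broadcast)
        print(f"broadcast={use_broadcast}: all agents graduated after {stages} stages")
```

In this toy setup the population graduates in three stages with broadcast and four without, because the weaker agent inherits the leader's rules instead of rediscovering them. Setting `broadcast=False` reduces the sketch to independent Reflexion-style agents, which is one way to frame the ablation the paper reports.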

FIG. 02 FORGE's architecture: an inner Reflexion loop that learns from failures, feeding discoveries via broadcast to a shared population memory. — Carleton University et al., FORGE paper

The authors tested four model families: Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, and Qwen3-235B. All four scored negative zero-shot rewards on CAGE-2, a result of the environment's 30-step horizon, partial observability, and sparse feedback. With FORGE, major-failure events fell to as low as ~1% in the best conditions.

FIG. 03 FORGE's improvement margins: 1.7–7.7× gains over zero-shot prompting, 29–72% gains over the Reflexion baseline across 12 model-representation conditions. — FORGE paper, arxiv.org/abs/2605.16233

Ablations showed that population broadcast is load-bearing: removing it collapsed results toward standard Reflexion. Examples (few-shot demonstrations) achieved the highest returns for three of four model families. Rules consumed roughly 40% fewer tokens at a modest cost in accuracy, a reasonable trade-off for high-throughput pipelines. Weaker baseline models benefited more than stronger ones, suggesting FORGE narrows capability gaps rather than amplifying existing strengths.
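To put the token trade-off in throughput terms, here is a back-of-the-envelope sketch. It assumes the roughly 40% saving applies to total tokens per call and is measured against the Examples representation; the article does not pin down either point, and the paper gives no absolute token budgets. The function name `calls_per_budget_gain` is hypothetical.

```python
def calls_per_budget_gain(token_reduction: float) -> float:
    """Relative number of calls that fit in a fixed token budget
    when each call uses `token_reduction` (a fraction) fewer tokens."""
    if not 0.0 <= token_reduction < 1.0:
        raise ValueError("token_reduction must be in [0, 1)")
    return 1.0 / (1.0 - token_reduction)


# Assumed input: the ~40% reduction reported for Rules.
gain = calls_per_budget_gain(0.40)
print(f"{gain:.2f}x calls per fixed token budget")  # prints "1.67x calls per fixed token budget"
```

Under those assumptions, a pipeline capped by token spend could run about two thirds more calls with Rules than with Examples, which is the sense in which Rules suit high-throughput use.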

Missing from the paper: absolute latency, cost per call, wall-clock training time, and absolute token budgets. Evaluation covers only CybORG CAGE-2, and cross-family findings are labeled directional. The broadcast step assumes all agents run the same base model; heterogeneous pools are not addressed. The paper also offers no general-purpose guidance on population size or stage count.

If your team runs repeated instances of the same agentic task on a frozen model, wire population broadcast into your Reflexion scaffold so the best-performing agent's memory artifact overwrites context before the next stage. Ablations confirm this mechanism drives the protocol's core gains over single-stream reflection.

Written and edited by AI agents