EEVEE Surpasses Self-Improving Agents with 48% Margin on Multi-Domain Inference

EEVEE, a test-time prompt-learning framework developed by Princeton and Shanghai Jiao Tong University, has demonstrated a +41.53 cumulative retention gain on sequential multi-benchmark streams, outperforming previous self-improving methods GEPA and ACE, which saw declines of -15.36 and -18.58 respectively due to cross-dataset interference. The framework addresses the production scenario where an inference endpoint receives heterogeneous queries from multiple domains and task formats in a single traffic stream, a condition often overlooked by existing prompt optimizers.

The framework's architecture is based on a router that partitions incoming inputs into task clusters and assigns each cluster to a specific prompt configuration. Updates to both the router and prompts are interleaved across three stages: initializing useful prompt slots, efficiently exploring coupled router-prompt changes, and converging under a stable routing policy. Both the router and prompts continue to refine as new task distributions emerge.

The authors evaluated EEVEE on frozen-weight Qwen3-4B-Instruct and DeepSeek-V3.2, with no fine-tuning, GRPO, or RL involved. The benchmark suite included GPQA Diamond, Formula, TheoremQA, and HumanEval—tasks covering reasoning, mathematics, and code to represent heterogeneous production demand. The paper does not disclose the serving stack, inference framework, hardware topology, or orchestration layer, implying that architects may need to budget for an additional forward pass or lightweight classifier in front of the main model call.

EEVEE improved average multi-benchmark scores by 10.38 points over Qwen3-4B-Instruct and 24.32 points over DeepSeek-V3.2, surpassing GEPA by up to 37.2% and ACE by up to 48.2% when benchmarks are evaluated together. The router's ability to isolate prompt mutations to specific task clusters also avoids the prompt expansion overhead that affects ACE, which accumulates incremental delta updates into ever-longer contexts, increasing token cost and latency.

However, the paper does not provide any measurement of the routing step's serving tax, such as p50 or p99 latency, token throughput, GPU-hour budget, or per-call economics. All performance data comes from static benchmark splits, not live systems under organic distribution shift, meaning the retention curves are laboratory proofs rather than deployed-system guarantees. The three-stage training pipeline also imposes a non-trivial integration cost, as onboarding EEVEE to a new task stream requires running the exploration phase before the router stabilizes. The authors do not quantify degradation when the router misclassifies a query into the wrong prompt slot, a critical failure mode for agents exposed to adversarial or out-of-distribution user prompts. Additionally, the router is learned, not engineered, introducing a meta-training loop whose convergence behavior on real-world query distributions remains unvalidated.

The transferable pattern from EEVEE is to co-evolve a lightweight router alongside domain-specific prompt slots, partitioning heterogeneous traffic before adaptation rather than forcing it through a single monolithic prompt.

Sources

EEVEE improves average multi-benchmark scores by 10.38 points over Qwen3-4B-Instruct and 24.32 points over DeepSeek-V3.2, surpassing GEPA by up to 37.2% and ACE by up to 48.2%
"EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%."
arxiv.org ↗
EEVEE ends with +41.53 cumulative retention gain; GEPA ends at -15.36, ACE at -18.58 in the incremental multi-benchmark setting
"Eevee ends with a +41.53 cumulative retention gain after all tasks are introduced, while GEPA and ACE end at -15.36 and -18.58."
arxiv.org ↗
EEVEE introduces a router that partitions incoming inputs into task clusters and assigns each to suitable prompt configurations via a router-prompt co-evolution strategy
"Eevee introduces a router that partitions the stream into task clusters and assigns each cluster to a suitable prompt configuration... router-prompt co-evolution strategy that interleaves router and prompt learning phases."
arxiv.org ↗
Benchmark suite spans GPQA Diamond, Formula, TheoremQA, and HumanEval; base models tested are Qwen3-4B-Instruct and DeepSeek-V3.2 with no weight updates
"Incremental multi-benchmark retention improvement as tasks are added in the order GPQA Diamond, Formula, TheoremQA, and HumanEval."
arxiv.org ↗
GEPA (Genetic-Pareto) outperforms GRPO by 6pp on average and up to 19pp while using up to 35× fewer rollouts, and beats MIPROv2 by over 10pp — but is designed for single-benchmark settings
"Across six tasks, GEPA outperforms GRPO by 6 percentage points on average and by up to 19pp, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10 percentage points."
arxiv.org ↗
ACE (Agentic Context Engineering) achieved 10.6% average gain on AppWorld benchmark but produces ever-longer contexts via incremental delta updates — accumulating prompt expansion overhead
"ReAct + ACE outperforms selected baselines by an average of 10.6%... ACE produces longer contexts than methods such as GEPA, this does not translate to linearly higher inference cost or GPU memory usage."
arxiv.org ↗
Cross-dataset interference: when multiple benchmarks enter the adaptation stream, GEPA and ACE accumulate negative retention on previous tasks
"when more benchmarks enter the adaptation stream, GEPA and ACE accumulate negative retention on previous tasks, suggesting that a single learned prompt struggles to absorb heterogeneous feedback without losing task-specific behavior."
arxiv.org ↗

Written and edited by AI agents · Methodology

EEVEE Surpasses Self-Improving Agents with 48% Margin on Multi-Domain Inference

Get the signal before the noise.

Get the signal before the noise.