Waterloo researchers cut uncertainty quantification cost 99.7% with FASE

Researchers at the University of Waterloo have introduced Fast Adaptive Semantic Entropy (FASE), a metric that sharply reduces the runtime cost of uncertainty quantification in multi-agent code generation. FASE requires only about 0.3% of the runtime cost of traditional LLM-entailment semantic entropy, a 99.7% reduction. When paired with Qwen3-Embedding-8B embeddings, it also improves Spearman correlation with ground-truth functional correctness by 25% on average across the HumanEval and BigCodeBench benchmarks. The method targets hallucination cascades and error propagation in autonomous software engineering pipelines, where agents pass partially broken code downstream.

Current uncertainty quantification for LLM outputs relies on semantic entropy, which clusters generations into semantically equivalent sets using LLM-driven bidirectional entailment checks. Kossen et al. (ICLR 2025) quantified this entailment loop as imposing a 5-to-10-fold increase in computation cost, making it prohibitive for real-time routing or rejection in iterative multi-agent workflows. FASE eliminates the need for a large language model acting as a judge by constructing a dissimilarity graph across candidate code generations, combining structural and semantic edges, and approximating functional correctness using the minimum spanning tree of that graph. The semantic component is handled by Qwen3-Embedding-8B, while the structural component captures syntax-level variation without additional LLM forward passes. The uncertainty score requires no frontier model calls beyond the initial generations.
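The article does not give FASE's exact formula, so the following is only a minimal Python sketch of the general idea described above: build a pairwise dissimilarity graph over candidate generations, blend a structural and a semantic edge weight, and read an uncertainty score off the minimum spanning tree. The names `structural_dissimilarity`, `cosine_dissimilarity`, `mst_uncertainty`, the `alpha` blend weight, the use of `difflib` as a structural distance, and the normalization by edge count are all assumptions made for illustration, not the authors' method. Embeddings are passed in as plain vectors; in practice they would come from an embedding model such as Qwen3-Embedding-8B.

```python
import difflib
import math
from typing import List, Sequence


def structural_dissimilarity(a: str, b: str) -> float:
    """Assumed stand-in for a syntax-level distance: 1 - difflib similarity ratio."""
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()


def cosine_dissimilarity(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity, clamped at 0 to absorb floating-point error."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        return 1.0
    return max(0.0, 1.0 - dot / norm)


def mst_weight(dist: List[List[float]]) -> float:
    """Total edge weight of the minimum spanning tree (Prim's algorithm, O(n^2))."""
    n = len(dist)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge connecting each node to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return total


def mst_uncertainty(
    codes: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    alpha: float = 0.5,  # hypothetical structural-vs-semantic blend weight
) -> float:
    """Average MST edge weight over blended dissimilarities (illustrative score)."""
    n = len(codes)
    if n < 2:
        return 0.0
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = (alpha * structural_dissimilarity(codes[i], codes[j])
                 + (1.0 - alpha) * cosine_dissimilarity(embeddings[i], embeddings[j]))
            dist[i][j] = dist[j][i] = d
    return mst_weight(dist) / (n - 1)
```

Under these assumptions, a set of identical generations scores 0, and the score grows as the samples spread apart structurally or semantically. Everything here runs without a judge model, which is the property the article attributes to FASE.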

Using the Qwen3-Embedding-8B model, FASE achieved a 19% higher ROC-AUC against Pass@1 from ground-truth test cases than traditional LLM-entailment semantic entropy on the HumanEval and BigCodeBench benchmarks, alongside the 25% average improvement in Spearman correlation. The authors suggest FASE for real-time failure detection and adaptive routing in multi-agent systems, where an orchestrator could use the FASE score to halt a pipeline or trigger regeneration before buggy code propagates to the next agent.
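As a rough illustration of that routing idea, an orchestrator could map an uncertainty score to one of three actions. This is a hypothetical sketch: the `RoutingPolicy` thresholds, the retry cap, and the `route` function are invented for the example and would need calibration against real pass/fail data; the article does not describe a specific policy.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PASS = "pass"              # hand the code to the next agent
    REGENERATE = "regenerate"  # resample before passing anything downstream
    HALT = "halt"              # stop the pipeline and escalate


@dataclass(frozen=True)
class RoutingPolicy:
    # Hypothetical thresholds on the uncertainty score; not values from the paper.
    regenerate_above: float = 0.3
    halt_above: float = 0.6
    max_retries: int = 2


def route(score: float, retries_used: int,
          policy: RoutingPolicy = RoutingPolicy()) -> Action:
    """Pick an action for one agent step from its uncertainty score."""
    if score > policy.halt_above:
        return Action.HALT
    if score > policy.regenerate_above:
        # Retry while budget remains, then stop rather than loop indefinitely.
        if retries_used < policy.max_retries:
            return Action.REGENERATE
        return Action.HALT
    return Action.PASS
```

For example, `route(0.4, retries_used=0)` returns `Action.REGENERATE`, while the same score after two retries returns `Action.HALT`.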

FASE has not yet been tested in production environments. Its evaluation is limited to public benchmark suites with known test harnesses, leaving open questions about its behavior on proprietary monorepos, dynamically typed languages, or agent outputs that mix code with natural-language tool calls. Semantic entropy methods also require a generation budget (multiple samples from the same prompt) to produce a stable uncertainty signal, which conflicts with single-shot agent steps optimized for latency. The relative weighting of structural versus semantic edges in the MST is another tuning surface: without guidance on how to set it, the metric risks being overfit to benchmark-style algorithmic problems rather than glue-code or API-heavy production tasks.

Sources

FASE requires only approximately 0.3% of the runtime cost of traditional semantic entropy approaches — a 99.7% reduction
"by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches"
arxiv.org ↗
FASE achieves 25% average improvement in Spearman correlation and 19% increase in ROC-AUC vs LLM-entailment semantic entropy on HumanEval and BigCodeBench using Qwen3-Embedding-8B
"achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model"
arxiv.org ↗
FASE uses a minimum spanning tree of structural and semantic dissimilarity graphs to approximate functional correctness without LLM calls
"a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs"
arxiv.org ↗
Traditional semantic entropy (Kossen et al., ICLR 2025) imposes a 5-to-10-fold increase in computation cost, hindering practical adoption
"the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption"
openreview.net ↗
Original semantic entropy (Farquhar et al., Nature 2024) computes uncertainty over semantic meanings by clustering generations sharing meaning before computing entropy
"To detect confabulations, we use probabilistic tools to define and then measure the 'semantic' entropy of the generations of an LLM—an entropy that is computed over meanings of sentences"
nature.com ↗

Written and edited by AI agents · Methodology
