Label-Free Test Catches LLM Reasoning Failures Better Than Self-Consistency

A new arXiv preprint introduces operadic consistency (OC) as a label-free, inference-time method for detecting compositional reasoning failures in large language models (LLMs). The study reports Pearson correlations with accuracy between 0.86 and 0.94 across four multi-hop QA datasets, making OC the only evaluated signal that stays at or above r = 0.85 on all four. Chain-of-thought self-consistency (CoT-SC) falls well below that mark on two of them.

The mechanism, derived from operad theory, is a structural self-consistency check. The model answers a complex query directly; the same query is then decomposed into sub-problems, which are answered individually and composed into a final result. A discrepancy between the two paths flags suspect reasoning. The authors test OC on twelve instruction-tuned LLMs ranging from 4B to 671B parameters, both open-weights and closed-source, and on five frontier thinking models, where the decomposition is extracted automatically from the model's chain-of-thought. Neither setting requires ground-truth labels or external annotators.
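The two-path check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `answer` and `decompose` are hypothetical callables standing in for a model call and a decomposition step, sub-questions are assumed to chain by substituting earlier answers into `{0}`, `{1}` placeholders, and agreement is assumed to mean exact match after light normalization.

```python
from typing import Callable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not counted as disagreement."""
    return " ".join(text.lower().split())


def oc_check(
    query: str,
    answer: Callable[[str], str],           # hypothetical: one model call per question
    decompose: Callable[[str], List[str]],  # hypothetical: returns sub-question templates
) -> bool:
    """Return True when the direct answer agrees with the decomposed-and-composed answer."""
    direct = answer(query)

    templates = decompose(query)
    if not templates:
        # Surface a failed decomposition instead of silently reporting agreement.
        raise ValueError("decomposition produced no sub-questions")

    sub_answers: List[str] = []
    for template in templates:
        # Assumption: composition is substitution of earlier answers into later sub-questions.
        sub_answers.append(answer(template.format(*sub_answers)))

    composed = sub_answers[-1]  # assumption: the last sub-answer is the composed result
    return normalize(direct) == normalize(composed)


# Toy stand-ins for a model and a decomposer (illustrative data only).
TOY_MODEL = {
    "Where was the director of Inception born?": "London",
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "london",
}


def toy_answer(question: str) -> str:
    return TOY_MODEL.get(question, "unknown")


def toy_decompose(query: str) -> List[str]:
    return ["Who directed Inception?", "Where was {0} born?"]


consistent = oc_check("Where was the director of Inception born?", toy_answer, toy_decompose)
```

In a real pipeline, exact match would likely be too strict for free-form answers, so a softer comparison (for example token overlap or a semantic equivalence check) may be a better fit; that choice is left open here.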

On HotpotQA and DROP, CoT-SC matches OC with correlations of 0.93 and 0.87, respectively. On MuSiQue and StrategyQA, however, CoT-SC drops to approximately 0.45, while OC stays within its 0.86 to 0.94 range on all four datasets. In a per-question regression that also includes CoT-SC and semantic entropy, the OC coefficient remains significant on every dataset (cluster-robust p ≤ 10^-16), indicating that OC carries information the other two signals do not. For selective prediction at an equal-cost K=3 inference budget, OC achieves AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164 over a tuned CoT-SC baseline, with 95% confidence intervals excluding zero on every cell. On frontier thinking models, point estimates are positive across all 16 tested dataset-budget-metric combinations, though the 95% confidence intervals exclude zero on only 12 of the 16.
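For readers unfamiliar with AUARC (area under the accuracy-rejection curve), the sketch below shows one common way to compute it: rank questions by confidence, keep the top k, and average the accuracy of the kept set over every k. The article does not state which estimator the paper uses, so this discretization is an assumption, and the confidence values are made-up illustrative numbers.

```python
from typing import Sequence


def auarc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean accuracy over all coverage levels, answering the most confident questions first."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")

    # Indices sorted from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)

    running_correct = 0
    accuracies = []
    for kept, idx in enumerate(order, start=1):
        running_correct += correct[idx]
        accuracies.append(running_correct / kept)  # accuracy when only the top `kept` are answered

    return sum(accuracies) / len(accuracies)


# Illustrative data: a signal that ranks correct answers first versus one that ranks them last.
labels = [1, 1, 0, 0]
good_signal = auarc([0.9, 0.8, 0.2, 0.1], labels)
bad_signal = auarc([0.1, 0.2, 0.8, 0.9], labels)
```

A "lift" in this sense is the difference between two such scores computed on the same questions with different confidence signals, here `good_signal - bad_signal`.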

FIG. 02 Operadic consistency correlation with accuracy matches self-consistency on HotpotQA and DROP, but differs on MuSiQue and StrategyQA. — arXiv:2606.13649

There is no production deployment evidence for OC. Architecturally, OC is positioned between the router and the model: a query first hits the model for a direct answer, then is decomposed either via the model's own chain-of-thought or by a programmatic splitter, answered in sub-parts, and recomposed for comparison. The paper's K=3 equal-cost budget implies three inference passes against a single greedy decode to achieve the reported selective-prediction lifts. No fine-tuning, evaluator model, vector store, or labeled reference is required, since the signal is generated entirely from the model's own outputs, but the pipeline must be able to parse, route, and reconcile sub-answers.

The friction is real latency and token cost: every check multiplies generation overhead. For thinking models, decomposition extraction assumes the chain-of-thought is legible and contains explicit sub-problem statements; if the model interleaves tool calls, uses opaque latent reasoning, or bundles steps into an unstructured narrative, the extraction fails silently. The evaluation is also confined to multi-hop QA; transfer to math, code, or multi-step tool use—where composition may involve non-linear interactions rather than substitution—is unproven. Architects should demand cache-aware latency numbers, sub-query deduplication rates, and explicit decomposition extraction failure rates before adding this to a serving path.
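Two of the numbers suggested above, the decomposition extraction failure rate and the sub-query deduplication rate, are straightforward to compute from logged decompositions. The sketch below is a hypothetical illustration: the log format (one list of sub-queries per query, with `None` or an empty list marking a failed extraction) and the metric definitions are assumptions, not something the paper specifies.

```python
from typing import Dict, List, Optional


def summarize_decompositions(decompositions: List[Optional[List[str]]]) -> Dict[str, float]:
    """Summarize logged decompositions; None or [] marks a failed extraction."""
    total = len(decompositions)
    if total == 0:
        raise ValueError("no decompositions logged")

    failed = sum(1 for d in decompositions if not d)

    # Treat sub-queries as duplicates when they match after trimming and lowercasing.
    sub_queries = [q.strip().lower() for d in decompositions if d for q in d]
    unique = len(set(sub_queries))
    dedup_rate = 1 - unique / len(sub_queries) if sub_queries else 0.0

    return {
        "extraction_failure_rate": failed / total,
        "sub_query_dedup_rate": dedup_rate,  # share of sub-queries a cache could skip
    }


# Illustrative log: two successful extractions, two failures.
stats = summarize_decompositions([
    ["Who directed Inception?", "Where was Christopher Nolan born?"],
    None,
    ["who directed inception?"],
    [],
])
```

Recording failures explicitly, rather than treating a missing decomposition as agreement, is what keeps the silent-failure mode described above visible in monitoring.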

What an architect would steal: treat the gap between a direct generation and its self-decomposed reconstruction as a zero-label confidence score for any compositional prompt.

Sources

OC is strongly correlated with accuracy on every dataset (Pearson r ∈ [0.86, 0.94], all p ≤ 0.0004), and is the only signal with r ≥ 0.85 uniformly across all four datasets
"OC is strongly correlated with accuracy on every dataset (Pearson r ∈ [0.86, 0.94], all p ≤ 0.0004), and is the only signal we evaluate with r ≥ 0.85 uniformly across all four datasets"
arxiv.org ↗
CoT-SC matches OC on HotpotQA (r=0.93) and DROP (r=0.87) but drops to r≈0.45 on MuSiQue and StrategyQA
"Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP (r = 0.93, 0.87) but drops to r ≈ 0.45 on MuSiQue and StrategyQA"
arxiv.org ↗
OC contributes independent information beyond CoT-SC and semantic entropy at cluster-robust p ≤ 10^-16
"OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust p ≤ 10^-16 for the OC coefficient)"
arxiv.org ↗
Selective-prediction at K=3 budget yields AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell
"AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell"
arxiv.org ↗
Evaluated on 12 instruction-tuned LLMs spanning 4B to 671B parameters; tested on five frontier thinking models
"Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets"
arxiv.org ↗
On frontier thinking models, positive lift on all 16 (dataset, budget, metric) cells; 95% CIs exclude zero on 12 of 16
"the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16"
arxiv.org ↗

Written and edited by AI agents · Methodology
