A new paper from National Taiwan University quantifies a hidden cost of running reasoning-mode LLMs under constrained inference: the "coupling tax," a performance penalty that emerges when chain-of-thought traces and final answers compete for the same fixed output-token budget. The finding undercuts an assumption behind most enterprise o1-style deployments: that enabling thinking mode is universally safe.
The study, "The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits," tested Qwen3 models at three scales (8B, 9B, 27B) across GSM8K, MATH-500, and five BIG-Bench Hard tasks. The mechanism is structural: because autoregressive decoding places reasoning traces and the answer in a single output stream, chains that exceed the budget get truncated. Truncated chains produce truncated or missing answers. At a 512-token budget on GSM8K (Qwen3-8B, n=1,319), non-thinking mode achieves 93.1% accuracy using an average of 152 tokens. Thinking mode reaches just 56.9% with 460 tokens — a 36.2 percentage-point gap. The root cause is measurable: 98.6% of thinking responses are cut off at a 256-token cap before the model can emit a final answer.
The effect is not limited to toy budgets. At 2,048 tokens on MATH-500, non-thinking mode scores 68.4% versus thinking mode's 54.8%. At the 27B scale on GSM8K, a 4,096-token cap still shows the tax: non-thinking hits 98.0% against thinking's 87.5%. The problem also amplifies with model size: at a 512-token budget, the accuracy gap between thinking and non-thinking modes grows 2.1× from the 8B model to the 9B and 27B models, because longer average chains at larger scales fill the budget before the answer can begin. A DeepSeek-R1-Distill-Llama-8B replication on a different thinking interface reproduces the same crossover pattern.
For enterprise teams sizing inference infrastructure, the implication is direct: enabling thinking mode on a model served with a hard token cap reduces accuracy relative to non-thinking mode. This is common in latency-sensitive APIs and cost-constrained deployments. The paper formalizes this with a truncation-waste decomposition, Acc_think(b) = α_c · F_L(b) + α_t · (1 − F_L(b)), where F_L(b) is the CDF of chain lengths at budget b, and α_c and α_t are the accuracies of completed and truncated chains, respectively. Operators can predict the crossover budget from chain-length histograms and benchmark statistics before deployment, a concrete input for capacity planning and SLA design.
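The decomposition is easy to apply to logged traces. A sketch under the paper's model: the chain-length samples stand in for a real histogram, and the α values and non-thinking accuracy are illustrative constants you would measure offline.

```python
# Sketch of the truncation-waste decomposition: predict thinking-mode accuracy
# at budget b as Acc_think(b) = a_c * F_L(b) + a_t * (1 - F_L(b)), where F_L is
# the empirical CDF of chain lengths. All numbers below are illustrative.
import numpy as np

chain_lengths = np.random.lognormal(6.0, 0.6, size=2000)  # stand-in for logged traces
A_COMPLETE = 0.95    # a_c: accuracy when the chain fits the budget
A_TRUNCATED = 0.10   # a_t: accuracy when the chain is cut off
ACC_NONTHINK = 0.93  # measured non-thinking accuracy at the same budget

def acc_think(budget: float) -> float:
    f_l = np.mean(chain_lengths <= budget)  # empirical CDF F_L(b)
    return A_COMPLETE * f_l + A_TRUNCATED * (1 - f_l)

# Crossover budget: smallest cap at which thinking mode stops losing.
budgets = range(128, 8192, 64)
crossover = next((b for b in budgets if acc_think(b) >= ACC_NONTHINK), None)
print(f"predicted crossover budget: {crossover} tokens")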
The cost structure has an internal-tooling angle: multi-step agent chains, RAG pipelines, and tool-calling loops routinely impose per-step output caps. If any hop in the chain runs a reasoning model below its crossover budget, the accuracy hit compounds across steps. The paper's inverse scaling result matters here: larger models fare worse under the same hard cap because their chains are longer. Upgrading to a bigger reasoning model while keeping the same token limit can degrade end-to-end pipeline quality.
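If hop failures were independent, per-step accuracy would multiply across the chain. A toy calculation using the paper's 512-token GSM8K figures for Qwen3-8B (the three-hop pipeline and the independence assumption are illustrative, not from the paper):

```python
# Toy illustration of compounding: under an (assumed) independence model, a
# pipeline's end-to-end accuracy is the product of per-step accuracies.
# Per-step numbers are the paper's 512-token GSM8K figures for Qwen3-8B.
HOPS = 3
for mode, per_step in [("non-thinking", 0.931), ("thinking", 0.569)]:
    end_to_end = per_step ** HOPS
    print(f"{mode:>12}: {per_step:.1%} per hop -> {end_to_end:.1%} over {HOPS} hops")
# non-thinking: 93.1% per hop -> 80.7% over 3 hops
#     thinking: 56.9% per hop -> 18.4% over 3 hops
```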
The authors propose a training-free mitigation: split-budget generation, which decouples the reasoning trace (allocated budget B_r) from the final answer (allocated budget B_a) by running a separate non-thinking pass over the trace output. Their IRIS instantiation on full MATH-500 reaches 74.0% accuracy, a 3.0 percentage-point gain over coupled thinking mode at the same 4,096-token total and 5.6 points over non-thinking alone at 2,048 tokens. A strengthened extraction variant reaches 78.8%. A fixed non-oracle self-consistency gate (SC+IRIS) reaches 83.6%. On GSM8K, full Mrsd (the three-round version of the same framework) hits 90.9%, a 3.41 percentage-point improvement over non-thinking mode alone.
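The split-budget idea can be approximated on any stack that exposes a thinking toggle. Below is a minimal two-pass sketch of the idea, not the paper's IRIS implementation: the budget split, toggle mechanism, and extraction prompt are all assumptions.

```python
# Minimal sketch of split-budget generation (not the paper's IRIS code):
# pass 1 spends B_r tokens on a (possibly truncated) reasoning trace; pass 2
# runs a non-thinking extraction over that trace with its own B_a budget, so
# the final answer can never be crowded out of the window.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "Qwen/Qwen3-8B"   # hypothetical deployment name
B_R, B_A = 3840, 256      # illustrative split of a 4,096-token total

def split_budget_answer(question: str) -> str:
    # Pass 1: thinking mode, capped at B_r; the trace may be cut off mid-chain.
    trace = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        max_tokens=B_R,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    ).choices[0].message.content or ""

    # Pass 2: non-thinking extraction over the trace, with its own budget.
    extraction_prompt = (
        f"Question: {question}\n\nPartial reasoning:\n{trace}\n\n"
        "State the final answer concisely."
    )
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": extraction_prompt}],
        max_tokens=B_A,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    ).choices[0].message.content
```

The design point is the decoupling itself: the extraction pass always has room to answer, even when the trace hits its cap.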
Three caveats apply. First, the results are confined to Qwen3 and one DeepSeek distill; proprietary models with different thinking interfaces may exhibit different crossover profiles. Second, IRIS is training-free but adds a second inference pass, doubling GPU time for the answer-generation step. Third, the paper does not test agentic multi-turn settings, where budget pressure is most acute in production.
Token budget and reasoning mode are not independent knobs. Teams running reasoning models at caps below ~4,096 tokens should benchmark both modes explicitly. The default thinking setting may be costing accuracy, not buying it.