A new paper from National Taiwan University quantifies a hidden cost of running reasoning-mode LLMs under constrained inference: the "coupling tax," a performance penalty that emerges when chain-of-thought traces and final answers compete for the same fixed output-token budget. The finding undercuts an assumption behind most enterprise o1-style deployments: that enabling thinking mode is universally safe.
The study, "The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits," tested Qwen3 models at three scales (8B, 9B, 27B) across GSM8K, MATH-500, and five BIG-Bench Hard tasks. The mechanism is structural: because autoregressive decoding places reasoning traces and the answer in a single output stream, chains that exceed the budget get truncated. Truncated chains produce truncated or missing answers. At a 512-token budget on GSM8K (Qwen3-8B, n=1,319), non-thinking mode achieves 93.1% accuracy using an average of 152 tokens. Thinking mode reaches just 56.9% with 460 tokens — a 36.2 percentage-point gap. The root cause is measurable: 98.6% of thinking responses are cut off at a 256-token cap before the model can emit a final answer.
The effect is not limited to toy budgets. At 2,048 tokens on MATH-500, non-thinking mode scores 68.4% versus thinking mode's 54.8%. At the 27B scale on GSM8K, a 4,096-token cap still shows the tax: non-thinking hits 98.0% against thinking's 87.5%. The problem also amplifies with model size: at a 512-token budget, the accuracy gap between thinking and non-thinking modes grows 2.1× from the 8B model to the 9B and 27B models, because longer average chains at larger scales fill the budget before the answer can begin. A DeepSeek-R1-Distill-Llama-8B replication on a different thinking interface reproduces the same crossover pattern.
For enterprise teams sizing inference infrastructure, the implication is direct: enabling thinking mode on a model served with a hard token cap reduces accuracy relative to non-thinking mode. This is common in latency-sensitive APIs and cost-constrained deployments. The paper formalizes this with a truncation-waste decomposition, Acc_think(b) = α_c · F_L(b) + α_t · (1 − F_L(b)), where F_L(b) is the CDF of chain lengths at budget b, and α_c and α_t are the accuracies of completed and truncated chains, respectively. Operators can predict the crossover budget from chain-length histograms and benchmark statistics before deployment, a concrete input for capacity planning and SLA design.
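The decomposition is easy to apply to logged traces. A sketch under the paper's model: the chain-length samples stand in for a real histogram, and the α values and non-thinking accuracy are illustrative constants you would measure offline.

```python
# Sketch of the truncation-waste decomposition: predict thinking-mode accuracy
# at budget b as Acc_think(b) = a_c * F_L(b) + a_t * (1 - F_L(b)), where F_L is
# the empirical CDF of chain lengths. All numbers below are illustrative.
import numpy as np

chain_lengths = np.random.lognormal(6.0, 0.6, size=2000)  # stand-in for logged traces
A_COMPLETE = 0.95    # a_c: accuracy when the chain fits the budget
A_TRUNCATED = 0.10   # a_t: accuracy when the chain is cut off
ACC_NONTHINK = 0.93  # measured non-thinking accuracy at the same budget

def acc_think(budget: float) -> float:
    f_l = np.mean(chain_lengths <= budget)  # empirical CDF F_L(b)
    return A_COMPLETE * f_l + A_TRUNCATED * (1 - f_l)

# Crossover budget: smallest cap at which thinking mode stops losing.
budgets = range(128, 8192, 64)
crossover = next((b for b in budgets if acc_think(b) >= ACC_NONTHINK), None)
print(f"predicted crossover budget: {crossover} tokens")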
The cost structure has an internal-tooling angle: multi-step agent chains, RAG pipelines, and tool-calling loops routinely impose per-step output caps. If any hop in the chain runs a reasoning model below its crossover budget, the accuracy hit compounds across steps. The paper's inverse scaling result matters here: larger models fare worse under the same hard cap because their chains are longer. Upgrading to a bigger reasoning model while keeping the same token limit can degrade end-to-end pipeline quality.
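If hop failures were independent, per-step accuracy would multiply across the chain. A toy calculation using the paper's 512-token GSM8K figures for Qwen3-8B (the three-hop pipeline and the independence assumption are illustrative, not from the paper):

```python
# Toy illustration of compounding: under an (assumed) independence model, a
# pipeline's end-to-end accuracy is the product of per-step accuracies.
# Per-step numbers are the paper's 512-token GSM8K figures for Qwen3-8B.
HOPS = 3
for mode, per_step in [("non-thinking", 0.931), ("thinking", 0.569)]:
    end_to_end = per_step ** HOPS
    print(f"{mode:>12}: {per_step:.1%} per hop -> {end_to_end:.1%} over {HOPS} hops")
# non-thinking: 93.1% per hop -> 80.7% over 3 hops
#     thinking: 56.9% per hop -> 18.4% over 3 hops
```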
The authors propose a training-free mitigation: split-budget generation, which decouples the reasoning trace (allocated budget B_r) from the final answer (allocated budget B_a) by running a separate non-thinking pass over the trace output. Their IRIS instantiation on full MATH-500 reaches 74.0% accuracy, a 3.0 percentage-point gain over coupled thinking mode at the same 4,096-token total and 5.6 points over non-thinking alone at 2,048 tokens. A strengthened extraction variant reaches 78.8%. A fixed non-oracle self-consistency gate (SC+IRIS) reaches 83.6%. On GSM8K, full Mrsd (the three-round version of the same framework) hits 90.9%, a 3.41 percentage-point improvement over non-thinking mode alone.
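The split-budget idea can be approximated on any stack that exposes a thinking toggle. Below is a minimal two-pass sketch of the idea, not the paper's IRIS implementation: the budget split, toggle mechanism, and extraction prompt are all assumptions.

```python
# Minimal sketch of split-budget generation (not the paper's IRIS code):
# pass 1 spends B_r tokens on a (possibly truncated) reasoning trace; pass 2
# runs a non-thinking extraction over that trace with its own B_a budget, so
# the final answer can never be crowded out of the window.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "Qwen/Qwen3-8B"   # hypothetical deployment name
B_R, B_A = 3840, 256      # illustrative split of a 4,096-token total

def split_budget_answer(question: str) -> str:
    # Pass 1: thinking mode, capped at B_r; the trace may be cut off mid-chain.
    trace = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        max_tokens=B_R,
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    ).choices[0].message.content or ""

    # Pass 2: non-thinking extraction over the trace, with its own budget.
    extraction_prompt = (
        f"Question: {question}\n\nPartial reasoning:\n{trace}\n\n"
        "State the final answer concisely."
    )
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": extraction_prompt}],
        max_tokens=B_A,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    ).choices[0].message.content
```

The design point is the decoupling itself: the extraction pass always has room to answer, even when the trace hits its cap.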
Three caveats apply. First, the results are confined to Qwen3 and one DeepSeek distill; proprietary models with different thinking interfaces may exhibit different crossover profiles. Second, IRIS is training-free but adds a second inference pass, doubling GPU time for the answer-generation step. Third, the paper does not test agentic multi-turn settings, where budget pressure is most acute in production.
Token budget and reasoning mode are not independent knobs. Teams running reasoning models at caps below ~4,096 tokens should benchmark both modes explicitly. The default thinking setting may be costing accuracy, not buying it.