IBM researchers have published Abstract Chain-of-Thought (ACoT), a post-training method that replaces natural-language reasoning chains with short sequences of discrete latent tokens. The technique cuts reasoning tokens by up to 11.6× while matching standard chain-of-thought accuracy on mathematical reasoning, instruction-following, and multi-hop benchmarks.
The core problem ACoT addresses is output-token cost. Standard CoT forces a model to narrate every reasoning step in natural language before producing an answer — useful for debugging, expensive at scale. Prior latent-reasoning approaches used continuous vector representations to compress reasoning, but those methods consistently underperformed verbal CoT on complex tasks. ACoT takes a different path: discrete "abstract" tokens drawn from a reserved vocabulary that the model never encountered during pretraining.
Training proceeds in two stages. A policy-iteration-style warm-up alternates between supervised fine-tuning, which masks and bottlenecks a full verbal CoT to force compression, and self-distillation, in which the model learns to generate abstract tokens from the prompt via constrained decoding against a learned codebook. Once warm-up stabilizes, the team applies reinforcement learning, warm-started from that checkpoint and constrained to the abstract token vocabulary, to optimize end-task reward. No changes are made to the underlying model architecture; all modifications stay within the post-training regime.
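The constrained-decoding step can be illustrated with a minimal sketch: at each generation step, every logit outside the reserved abstract vocabulary is masked to negative infinity, so only abstract tokens can be emitted. The function name, the toy vocabulary, and the greedy selection below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def constrained_step(logits: np.ndarray, abstract_ids: np.ndarray) -> int:
    """Pick the next token, restricted to the reserved abstract vocabulary.

    Logits outside `abstract_ids` are masked to -inf so ordinary
    natural-language tokens can never be selected.
    """
    masked = np.full_like(logits, -np.inf)
    masked[abstract_ids] = logits[abstract_ids]
    return int(np.argmax(masked))  # greedy here; sampling works the same way

# Toy 10-token vocabulary; ids 7-9 stand in for the reserved abstract tokens.
logits = np.array([3.0, 1.0, 0.5, 2.0, 0.1, 0.2, 0.3, 1.5, 2.5, 0.4])
abstract_ids = np.array([7, 8, 9])
print(constrained_step(logits, abstract_ids))  # → 8, the best abstract token
```

Note that token 0 has the highest raw logit but is ineligible; the mask is what forces the model into the abstract codebook.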
The result generalizes across model families. The authors report comparable performance on mathematical reasoning, instruction-following, and multi-hop reasoning benchmarks against both verbal CoT baselines and continuous latent-reasoning approaches — while cutting generation length by up to 11.6×. For enterprise workloads billed per output token, that compression ratio maps directly to cost reduction: a pipeline consuming 100M reasoning tokens per day could drop below 9M without retraining the base model or routing traffic to a smaller, weaker architecture.
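The arithmetic behind that figure is simple to verify; the snippet below just restates the article's hypothetical 100M-token pipeline at the reported best-case 11.6x compression.

```python
daily_tokens = 100_000_000   # hypothetical reasoning-token volume per day
compression = 11.6           # best-case reduction reported for ACoT

compressed = daily_tokens / compression
print(f"{compressed:,.0f} compressed tokens/day")  # about 8.6M, under the 9M mark
```

At per-output-token billing, the cost reduction is the same ratio, since the base model and serving stack are unchanged.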
The efficiency gains come with a tradeoff enterprise architects must track. Abstract tokens carry no semantic content legible to humans or external monitoring tools. Teams relying on CoT scratchpads for auditability, compliance logging, or rationale extraction will need separate mechanisms. ACoT optimizes throughput, not transparency.
The abstract token vocabulary develops a power-law frequency distribution across training phases, mirroring the statistical structure of natural language. The authors treat this as evidence the model is learning a genuine compressed reasoning language, not collapsing to a degenerate encoding. Degenerate codebooks fail on out-of-distribution inputs; the power-law distribution suggests robust, structured reuse — a meaningful signal for production reliability.
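One way to sanity-check such a distribution on a decoded trace is a rank-frequency fit: Zipf-like reuse shows up as a log-log slope near -1, while a degenerate codebook (a few tokens dominating, the rest unused) does not. The sketch below is a generic diagnostic run on synthetic data, not the authors' analysis.

```python
from collections import Counter
import numpy as np

def rank_frequency_slope(tokens) -> float:
    """Fit log(freq) ~ a + b * log(rank); b near -1 suggests Zipf-like reuse."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    b, a = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return float(b)

# Synthetic trace: token i appears roughly 1000/i times, a textbook Zipf profile.
trace = [i for i in range(1, 51) for _ in range(1000 // i)]
print(round(rank_frequency_slope(trace), 2))  # slope near -1 for this trace
```

A real check would run this over abstract-token traces sampled across training phases and watch whether the slope stays stable on out-of-distribution inputs.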
Enterprise adoption requires fine-tuning infrastructure but no changes to serving hardware or base-model weights. ACoT applies in principle to any sufficiently capable open-weights model an organization already operates. The constrained-decoding requirement adds implementation complexity — inference engines must enforce vocabulary restrictions at generation time — but leaves the serving stack otherwise intact. For teams running large-scale reasoning workloads on open or proprietary models, the technique is a credible addition to the inference-optimization stack.
Cutting reasoning token counts by an order of magnitude without modifying the base model reprices inference economics — and IBM has released the recipe.
Written and edited by AI agents