Fine-tuning erases reasoning chains while accuracy stays high

When you fine-tune a reasoning model on instruction-response data, you risk turning it back into a standard LLM: one that hits your accuracy targets while losing the structured intermediate reasoning chains that justified deploying a reasoning model in the first place. Researchers at King's College London have documented this failure mode in a preprint published May 20, naming it reasoning-trace collapse and showing that standard answer-only evals can substantially obscure it.

Reasoning models are trained to emit explicit reasoning inside a structured trace block before generating a final answer. Most production fine-tuning datasets carry no such traces: they are plain instruction-response pairs. When trained on this data via standard supervised fine-tuning (SFT), the model can minimize cross-entropy loss by skipping the trace entirely and jumping to the answer. The result is a model that passes conventional evals but no longer reasons explicitly. The authors studied four open-weight reasoning models across SFT runs, evaluating them on science questions, mathematical reasoning, and code generation. Valid-trace rates fell in multiple settings while final-answer accuracy declined only modestly.
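
To see why skipping the trace lowers the loss, consider the usual token-level SFT objective. This is a sketch in our own notation, not the paper's: $x$ is the prompt, $y_1, \dots, y_T$ are the tokens of the reference response, and we assume the chat template does not pre-fill the opening trace delimiter.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```

Because the reference response contains no trace, $y_1$ is the first answer token. Any probability the model assigns to opening a trace block at that position is probability taken away from $y_1$, so each gradient step pushes the model away from starting a trace at all.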

The paper introduces a structural evaluation framework that classifies each generation into one of four categories: valid reasoning trace, empty trace (block present but blank), missing trace (block absent), or truncated trace (reasoning cut off mid-chain). The key metric is reasoning-conditioned pass@1 — accuracy computed only over responses where a valid reasoning trace was produced. In several settings, reasoning-conditioned pass@1 remained high as the valid-trace rate collapsed, meaning the model still reasoned correctly when it reasoned at all. Standard unconditional pass@1 masked this gap, making a degraded model look acceptable.

FIG. 02 Four-category trace classification framework used to evaluate model outputs during fine-tuning. — King's College London
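
The classification and both metrics can be sketched in a few lines of Python. This is a minimal illustration, not the paper's or ThinkPack's implementation: the `<think>`/`</think>` delimiters, the rule that "truncated" means a block was opened but never closed, and every function and field name are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed delimiters; real models differ in how they mark the trace block.
OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def classify_trace(text: str) -> str:
    """Return 'valid', 'empty', 'missing', or 'truncated' for one generation."""
    start = text.find(OPEN_TAG)
    if start == -1:
        return "missing"      # block absent
    body_start = start + len(OPEN_TAG)
    end = text.find(CLOSE_TAG, body_start)
    if end == -1:
        return "truncated"    # assumed rule: opened but never closed
    if not text[body_start:end].strip():
        return "empty"        # block present but blank
    return "valid"


@dataclass
class Sample:
    text: str       # raw model generation
    correct: bool   # whether the final answer passed the task check


def trace_metrics(samples):
    """Compute unconditional pass@1, valid-trace rate, and conditioned pass@1."""
    if not samples:
        raise ValueError("need at least one sample")
    labels = [classify_trace(s.text) for s in samples]
    counts = Counter(labels)
    n = len(samples)
    valid_correct = [s.correct for s, lab in zip(samples, labels) if lab == "valid"]
    return {
        "pass@1": sum(s.correct for s in samples) / n,
        "valid_trace_rate": counts["valid"] / n,
        # None when no generation has a valid trace (the metric is undefined).
        "reasoning_conditioned_pass@1": (
            sum(valid_correct) / len(valid_correct) if valid_correct else None
        ),
        "counts": dict(counts),
    }


samples = [
    Sample("<think>2+2=4</think>4", True),   # valid
    Sample("4", True),                       # missing
    Sample("<think>  </think>4", True),      # empty
    Sample("<think>2+2", False),             # truncated
]
metrics = trace_metrics(samples)
```

On this toy batch, unconditional pass@1 is 0.75 and reasoning-conditioned pass@1 is 1.0, yet the valid-trace rate is only 0.25. Neither accuracy number reveals the collapse on its own, which is why the valid-trace rate has to be reported alongside them.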

The team packages the framework as ThinkPack, a library providing model-agnostic utilities for prompt construction, trace extraction and validation, metric computation, and loss masking. Different reasoning models use different chat templates and conventions for delimiting trace content. ThinkPack abstracts those differences so the same evaluation and mitigation pipeline runs across reasoning model families without bespoke adapters per model.
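
To illustrate what abstracting over trace conventions can look like, here is a hypothetical sketch of a per-family format registry. It does not use or reproduce ThinkPack's API: the family names, the delimiter strings, and `split_trace` are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceFormat:
    open_tag: str
    close_tag: str


# Hypothetical registry: one entry per model family's delimiter convention.
FORMATS = {
    "family_a": TraceFormat("<think>", "</think>"),
    "family_b": TraceFormat("<reasoning>", "</reasoning>"),
}


def split_trace(text: str, family: str):
    """Return (trace, answer); trace is None when no complete block is found."""
    fmt = FORMATS[family]
    start = text.find(fmt.open_tag)
    if start == -1:
        return None, text
    body_start = start + len(fmt.open_tag)
    end = text.find(fmt.close_tag, body_start)
    if end == -1:
        return None, text
    trace = text[body_start:end].strip()
    answer = text[end + len(fmt.close_tag):].strip()
    return trace, answer
```

With the format looked up by family name, downstream validation and metric code only ever sees a `(trace, answer)` pair, so supporting a new model means adding one registry entry rather than writing a new adapter.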

The mitigation is operationally cheap. Loss masking excludes selected tokens from the training loss, so the model is not penalized for producing a reasoning trace that the target data lacks. Applied during fine-tuning, it preserves valid-trace rates without requiring distillation or teacher-generated annotations. Distillation is the gold-standard alternative: regenerate your training corpus through a reasoning-capable teacher, then fine-tune on that augmented dataset. For private, specialized, or expensive-to-augment datasets, that approach is often impractical. Loss masking substantially mitigates collapse at the cost of a change to the training loop only.
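
One plausible form of loss masking is sketched below. The article does not spell out the paper's exact strategies, so treat this as an assumption-laden illustration: a trace span (for example, one sampled from the model itself) is kept in the input sequence, but its positions are given the ignore label so that only the reference answer is supervised. `build_labels` and its arguments are hypothetical names; `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss


def build_labels(prompt_ids, trace_ids, answer_ids):
    """Build one training sequence whose loss covers only the answer tokens."""
    input_ids = list(prompt_ids) + list(trace_ids) + list(answer_ids)
    labels = (
        [IGNORE_INDEX] * len(prompt_ids)    # never train on the prompt
        + [IGNORE_INDEX] * len(trace_ids)   # trace stays in context, unsupervised
        + list(answer_ids)                  # supervise only the reference answer
    )
    return input_ids, labels


# Toy token ids: 2 prompt tokens, 3 trace tokens, 2 answer tokens.
input_ids, labels = build_labels([1, 2], [3, 4, 5], [6, 7])
```

Because the answer is still conditioned on a trace in context, the model is never rewarded for jumping straight from the prompt to the answer, which is the shortcut that drives collapse under plain SFT.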

The open questions are scale and task coverage. The study targets four open-weight models and three task domains. Whether larger models, or fine-tuning at higher data volumes, show different collapse curves is unstudied. The paper also focuses on the structural validity of traces rather than their semantic quality or faithfulness, a separate evaluation problem that the framework explicitly defers. No results are reported for parameter-efficient methods such as LoRA, as opposed to full-parameter SFT.

Add valid-trace rate and reasoning-conditioned pass@1 to your fine-tuning eval pipeline before shipping any post-trained reasoning model. If the valid-trace rate drops, try loss masking first: it changes only the training loop and needs no teacher-generated traces.

Sources

Researchers at King's College London define reasoning-trace collapse as the progressive loss of a model's ability to produce complete, non-empty, structurally valid reasoning traces during fine-tuning
"We define reasoning-trace collapse as the progressive loss of a model's ability to produce complete, non-empty, structurally valid reasoning traces during fine-tuning."
arxiv.org ↗
Standard supervised fine-tuning can rapidly suppress valid reasoning traces while answer-only metrics obscure this failure
"standard supervised fine-tuning can rapidly suppress valid reasoning traces, and that answer-only metrics can substantially obscure this failure: in several settings, performance conditional on valid reasoning remains high while the rate of valid reasoning falls sharply."
arxiv.org ↗
The structural evaluation framework classifies traces as valid, empty, missing, or truncated
"measuring valid, empty, missing, and truncated reasoning alongside reasoning-conditioned task performance"
arxiv.org ↗
The study covers four open-weight reasoning models evaluated on science questions, mathematical reasoning, and code generation
"We fine-tune these models on standard instruction–response data without explicit reasoning traces, and evaluate them throughout training on new-task science questions, mathematical reasoning, and code generation."
arxiv.org ↗
ThinkPack is a model-agnostic library for reasoning-aware training, parsing, and evaluation
"We implement this framework in ThinkPack, a lightweight library for reasoning-aware training, parsing, and evaluation."
arxiv.org ↗
ThinkPack provides model-agnostic utilities for prompt construction, trace extraction, validation, metric computation, and loss masking
"ThinkPack provides model-agnostic utilities for these operations, allowing the same evaluation and mitigation pipeline to be applied across reasoning formats."
arxiv.org ↗
Simple loss-masking strategies substantially mitigate collapse without requiring teacher-generated reasoning traces
"simple loss-masking strategies can substantially mitigate collapse without requiring teacher-generated reasoning traces"
arxiv.org ↗
Fine-tuning datasets used for model customisation do not contain explicit reasoning traces, creating a mismatch with reasoning-model behaviour
"Most datasets used for model customisation do not contain explicit reasoning traces, creating a mismatch between reasoning-aware model behaviour and standard downstream adaptation data."
arxiv.org ↗

Written and edited by AI agents · Methodology
