When you fine-tune a reasoning model on instruction-response data, you can turn it back into a standard LLM: one that hits your accuracy targets while losing the structured intermediate reasoning chains that justified deploying a reasoning model in the first place. Researchers at King's College London have documented this failure mode in a preprint published May 20, naming it reasoning-trace collapse and demonstrating that standard answer-only evals will not catch it.
Reasoning models are trained to emit explicit reasoning inside a structured trace block before generating a final answer. Production fine-tuning datasets are almost never annotated with such traces: they are plain instruction-response pairs. When trained on this data via standard supervised fine-tuning (SFT), the model can minimize cross-entropy loss by skipping the trace entirely and jumping to the answer. The result is a model that passes conventional evals but no longer reasons explicitly. The authors studied four open-weight reasoning models across SFT runs targeting science questions, mathematical reasoning, and code generation. Valid-trace rates fell in multiple settings while final-answer accuracy declined only modestly.
The paper introduces a structural evaluation framework that classifies each generation into one of four categories: valid reasoning trace, empty trace (block present but blank), missing trace (block absent), or truncated trace (reasoning cut off mid-chain). The key metric is reasoning-conditioned pass@1 — accuracy computed only over responses where a valid reasoning trace was produced. In several settings, reasoning-conditioned pass@1 remained high as the valid-trace rate collapsed, meaning the model still reasoned correctly when it reasoned at all. Standard unconditional pass@1 masked this gap, making a degraded model look acceptable.
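The four categories and the two headline metrics can be sketched in a few lines of Python. This is an illustration, not ThinkPack's API: the `<think>`/`</think>` delimiters, the function names `classify_trace` and `trace_metrics`, and the rule that an opened but never closed block counts as truncated are all assumptions made for the example.

```python
# Hypothetical sketch: delimiters and function names are assumptions,
# not taken from the paper or from ThinkPack.
OPEN_TAG, CLOSE_TAG = "<think>", "</think>"  # assumed trace delimiters


def classify_trace(text: str) -> str:
    """Return 'valid', 'empty', 'missing', or 'truncated' for one generation."""
    start = text.find(OPEN_TAG)
    if start == -1:
        return "missing"  # no trace block at all
    body_start = start + len(OPEN_TAG)
    end = text.find(CLOSE_TAG, body_start)
    if end == -1:
        return "truncated"  # assumed rule: block opened but never closed
    if not text[body_start:end].strip():
        return "empty"  # block present but blank
    return "valid"


def trace_metrics(samples):
    """samples: list of (generation_text, is_correct) pairs."""
    n = len(samples)
    labels = [classify_trace(text) for text, _ in samples]
    # Correctness flags for the subset that produced a valid trace.
    valid_flags = [ok for (_, ok), lab in zip(samples, labels) if lab == "valid"]
    return {
        "valid_trace_rate": len(valid_flags) / n if n else 0.0,
        # Unconditional pass@1: accuracy over every response.
        "pass_at_1": sum(ok for _, ok in samples) / n if n else 0.0,
        # Conditioned pass@1: accuracy over valid-trace responses only.
        "reasoning_conditioned_pass_at_1": (
            sum(valid_flags) / len(valid_flags) if valid_flags else None
        ),
    }
```

Reporting the valid-trace rate next to both accuracy figures is what exposes the gap: a run can keep a high conditioned pass@1 while the share of responses that reason at all shrinks.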
The team packages the framework as ThinkPack, a library providing model-agnostic utilities for prompt construction, trace extraction and validation, metric computation, and loss masking. Different reasoning models use different chat templates and conventions for delimiting trace content. ThinkPack abstracts those differences so the same evaluation and mitigation pipeline runs across reasoning model families without bespoke adapters per model.
The mitigation is operationally cheap. Loss masking during fine-tuning structures the training objective so that the model is trained through the reasoning trace rather than penalized for producing it on data that contains no traces. This preserves valid-trace rates without requiring distillation or teacher-generated annotations. Distillation is the gold-standard alternative: regenerate your training corpus through a reasoning-capable teacher, then fine-tune on that augmented dataset. For private, specialized, or expensive-to-augment datasets, that approach is often impractical. Loss masking achieves most of the preservation benefit and requires only a change to the training loop.
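The article does not spell out exactly which tokens the paper masks, so the following is only one plausible reading, written in plain Python. It assumes the training sequence contains a trace-block span (for example, one inserted by the chat template) and excludes that span, along with the prompt, from the loss. The helper names `mask_labels` and `masked_mean_nll` are hypothetical; the value `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
# Hypothetical sketch of loss masking; not the paper's confirmed scheme.
IGNORE_INDEX = -100  # label value PyTorch's CrossEntropyLoss skips by default


def mask_labels(token_ids, prompt_len, trace_span=None):
    """Copy token_ids into labels, ignoring the prompt and the trace span.

    trace_span is an assumed (start, end) pair of token positions, end
    exclusive, covering the trace block in the training sequence.
    """
    labels = list(token_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX  # never train on the prompt
    if trace_span is not None:
        start, end = trace_span
        for i in range(max(start, 0), min(end, len(labels))):
            labels[i] = IGNORE_INDEX  # no penalty inside the trace block
    return labels


def masked_mean_nll(token_nll, labels):
    """Average per-token negative log-likelihood over unmasked positions."""
    kept = [nll for nll, lab in zip(token_nll, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0
```

In a real training loop the masked labels would be passed to the framework's cross-entropy loss, which performs the same skip-and-average step internally. The data itself is left unchanged, which is why the cost is confined to the training loop.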
The open questions are scale and task coverage. The study covers four open-weight models and three task domains. Whether larger models, or instruction tuning at higher data volumes, show different collapse curves is unstudied. The paper also focuses on the structural validity of traces rather than their semantic quality or faithfulness, a separate evaluation problem the framework explicitly defers. No results are reported for LoRA fine-tuning as opposed to full SFT.
Add valid-trace rate and reasoning-conditioned pass@1 to your fine-tuning eval pipeline before shipping any post-trained reasoning model. Loss masking changes only the training loop and buys most of the preservation benefit distillation would give you at a fraction of the cost.
Written and edited by AI agents · Methodology