FPCG steers reasoning models at test time without retraining

Kortukov et al. have introduced Future Probe Controlled Generation (FPCG), a method for guiding large reasoning models at test time without retraining. FPCG trains lightweight activation probes on intermediate chain-of-thought hidden states to forecast how a reasoning path will turn out; these probes predict the most likely future behavior with 64% to 91% accuracy. At each step, the method samples multiple candidate next sentences and keeps the one with the lowest predicted future misbehavior score. Because it selects among sampled text rather than editing activations, FPCG steers with almost no output quality degradation and works in several evaluations where activation steering fails.

FPCG's central idea is the distinction between detection features and prediction features. Prior steering work implicitly intervened on internal features that detect behavior in already generated text, and those features are poor predictors of future behavioral outcomes. FPCG instead trains probes to read the residual stream at intermediate reasoning steps and predict the likelihood of future behaviors such as confabulation or logical failure. At inference time, FPCG generates N candidate continuations for a reasoning step, runs the lightweight probe against each candidate's hidden states, and commits to the continuation with the lowest predicted failure probability. No weight updates or model retraining are required.
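The selection loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_candidates` and `probe_score` are hypothetical callables standing in for the model's sentence sampler and the trained probe, and the probe here takes text rather than hidden states so that the sketch stays self-contained.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
#   sample_candidates(prefix, n) -> list of n candidate next sentences
#   probe_score(prefix, candidate) -> predicted probability of future misbehavior
Sampler = Callable[[str, int], List[str]]
Probe = Callable[[str, str], float]


def select_next_sentence(prefix: str, sample_candidates: Sampler,
                         probe_score: Probe, n_candidates: int = 4) -> str:
    """Sample candidates and return the one with the lowest probe score."""
    candidates = sample_candidates(prefix, n_candidates)
    if not candidates:
        raise ValueError("sampler returned no candidates")
    return min(candidates, key=lambda c: probe_score(prefix, c))


def controlled_generation(prompt: str, sample_candidates: Sampler,
                          probe_score: Probe, n_candidates: int = 4,
                          max_steps: int = 8,
                          stop_marker: str = "</think>") -> str:
    """Extend the trace one selected sentence at a time.

    The model weights are never touched; only the choice among
    sampled sentences is controlled.
    """
    trace = prompt
    for _ in range(max_steps):
        best = select_next_sentence(trace, sample_candidates,
                                    probe_score, n_candidates)
        trace += best
        if stop_marker in best:  # hypothetical end-of-reasoning marker
            break
    return trace
```

In a real system the probe would score each candidate's hidden states, and the stop condition would depend on the model's own end-of-reasoning token.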

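FPCG builds on earlier probing and steering results and responds to their limits. Linear probes trained on residual stream activations at the last token before chain-of-thought predict the model's final answer with 0.9 AUC on most tasks, which is evidence that instruction-tuned models often determine their answer before generating CoT. Steering activations along that probe direction flips the model's answer in over 50% of cases, but when steering induces incorrect answers it produces failure modes such as non-entailment and confabulation. The CREST paper reported that intervening on cognitive heads improves accuracy by up to 17.5% and reduces token usage by 37.6%, though direct interventions of this kind risk fragility. FPCG avoids pushing on activations directly and instead uses the probe as a discriminator in a sampling loop.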
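To make the probe and the AUC figure above concrete, here is a minimal sketch of a logistic linear probe and a pairwise AUC computation. The function names, weights, and inputs are illustrative assumptions; none of the cited papers specify this code.

```python
import math
from typing import Sequence


def probe_probability(hidden: Sequence[float], weights: Sequence[float],
                      bias: float) -> float:
    """Logistic linear probe: sigmoid(w . h + b) over one activation vector."""
    if len(hidden) != len(weights):
        raise ValueError("hidden and weights must have the same length")
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.9 therefore means that, for a randomly chosen positive and negative example, the probe ranks the positive one higher about 90% of the time.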

FPCG incurs inference-time overhead: it generates and scores multiple candidate sentences per reasoning step, so added latency grows with the length of the reasoning trace. Probes must be trained on intermediate activations from the target model class (o1-class or R1-class systems running extended chain-of-thought) and cannot be transferred blindly across architectures. The activation steering field guide notes that vector steering fails for complex reasoning, because multi-step sequential computation spans many layers and cannot be reliably redirected by a single direction at one layer. FPCG sidesteps this by operating at the text level, but it does not close underlying capability gaps: if a model cannot solve a math problem, no sampling strategy around probe scores will produce the correct derivation. Reasoning behaviors are also stochastic, so prediction probes trained on one task distribution may degrade when reasoning topology changes. Zhuang et al. found that 93.3% of 541 keyword-detected CoT boundaries are behaviorally unstable under re-generation from the same prefix.
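The overhead described above can be estimated with simple arithmetic. The sketch below assumes one probe call per candidate and exactly one committed sentence per reasoning step; both are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def fpcg_overhead(num_steps: int, n_candidates: int) -> dict:
    """Count sampling and probing work for a trace of num_steps sentences.

    Assumes one probe call per candidate and one kept sentence per step.
    """
    if num_steps < 1 or n_candidates < 1:
        raise ValueError("num_steps and n_candidates must be at least 1")
    sampled = num_steps * n_candidates
    return {
        "sampled_sentences": sampled,
        "probe_calls": sampled,
        "discarded_sentences": sampled - num_steps,
    }
```

Under these assumptions, a 10-step trace with 4 candidates per step samples 40 sentences and discards 30 of them, so the extra work grows linearly in both the trace length and N.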

Sources

FPCG probes predict the most likely future behavior with 64%–91% accuracy; achieves steering with almost no output quality degradation and succeeds where activation steering fails
"These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features... This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails."
arxiv.org ↗
Prior activation steering relies on detection features that reflect already-generated text and are poor predictors of future behavioral outcomes
"We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target."
arxiv.org ↗
Instruction-tuned models often determine their answer before generating CoT; linear probes on pre-CoT residual activations predict final answer with 0.9 AUC on most tasks
"We provide mechanistic evidence that instruction-tuned models often determine their answer before generating CoT. Training linear probes on residual stream activations at the last token before CoT, we can predict the model's final answer with 0.9 AUC on most tasks."
arxiv.org ↗
Steering activations along the probe direction flips model answers in over 50% of cases; failure modes include non-entailment and confabulation
"We find that these directions are not only predictive, but also causal: steering activations along the probe direction flips model answers in over 50% of cases, significantly exceeding orthogonal baselines. When steering induces incorrect answers, we observe two distinct failure modes: non-entailment and confabulation."
arxiv.org ↗
93.3% of keyword-detected CoT boundaries are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix
"We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix."
arxiv.org ↗
CREST cognitive-head intervention improves accuracy by up to 17.5% while reducing token usage by 37.6% across diverse reasoning benchmarks
"Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning."
arxiv.org ↗
Vector steering genuinely fails for complex reasoning because multi-step computation cannot be bent reliably by a single direction at one layer
"Complex reasoning. Steering doesn't help with multi-step logic. If the model can't solve a math problem, adding a 'be smarter' vector doesn't help. This makes sense—reasoning involves sequential computation across many layers, not a single direction at one layer."
subhadipmitra.com ↗

Written and edited by AI agents · Methodology
