Linear Probes Predict Reasoning-Model Behavior at 64–91% Accuracy

Researchers have shown that steering large reasoning models by predicting their future behavior from intermediate hidden states can maintain control while largely avoiding the output-quality degradation associated with activation steering. Future Probe Controlled Generation (FPCG), introduced in an arXiv paper by Kortukov et al. from Fraunhofer HHI, Northeastern, and KAIST, does this with linear probes that predict the most likely behavioral outcome with 64% to 91% accuracy. No production evidence is available yet.

Previous work, such as the difference-in-means (DIM) steering vectors from Rimsky et al., relies on detection features: internal activations that fire once a behavior is already present in the generated chain-of-thought. The authors show that these are poor predictors of the model's next actions, and that prior steering assumes detection and prediction features occupy the same subspace. Their alternative is to train linear probes on intermediate reasoning-step activations to surface prediction features—signals that encode a tendency toward a behavior before it appears in text. FPCG generates multiple candidate sentences at each step, scores each with a prediction probe, and selects the candidate that maximizes the desired future-behavior likelihood. This method requires no hidden-state injection, weight update, or fine-tuning.
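The candidate-selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_candidates`, `embed`, the probe weights `w` and bias `b`, and the candidate count `k` are all hypothetical stand-ins. In a real setup the candidates would be sampled from the reasoning model and the features read from its intermediate hidden states.

```python
import numpy as np

def probe_score(h, w, b):
    """Logistic linear probe: estimated probability of the target future behavior."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

def select_step(prefix, sample_candidates, embed, w, b, k=4):
    """Return the candidate sentence whose features maximize the probe score.

    sample_candidates(prefix, k) -> list of up to k candidate sentences (hypothetical)
    embed(text) -> 1-D feature vector standing in for a hidden state (hypothetical)
    """
    candidates = sample_candidates(prefix, k)
    scores = [probe_score(embed(prefix + c), w, b) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

def fpcg_like_generate(prompt, sample_candidates, embed, w, b, steps=3, k=4):
    """Build a reasoning trace one sentence at a time using probe-guided selection."""
    text = prompt
    for _ in range(steps):
        sentence, _ = select_step(text, sample_candidates, embed, w, b, k)
        text += sentence
    return text

# Toy stand-ins so the sketch runs without a model (assumptions, not real APIs).
def toy_candidates(prefix, k):
    return [" hedge.", " commit.", " stall."][:k]

def toy_embed(text):
    # Two hand-made features: counts of "commit" and "hedge" in the text so far.
    return np.array([text.count("commit"), text.count("hedge")], dtype=float)

w = np.array([1.0, -1.0])  # toy probe that favors "commit" and penalizes "hedge"
b = 0.0
out = fpcg_like_generate("Q:", toy_candidates, toy_embed, w, b, steps=2, k=3)
print(out)
```

Note that the model itself is never modified here: steering happens only through which sampled sentence is kept, which matches the article's point that no hidden-state injection or fine-tuning is involved.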

The researchers tested FPCG on DeepSeek-R1-Distill-Llama-8B, Qwen3-14B, and gpt-oss-20b. On DeepSeek-R1-Distill-Llama-8B and Qwen3-14B, FPCG outperformed DIM steering on output quality while still hitting the steering target. On gpt-oss-20b, FPCG achieved control on two datasets where activation steering did not work; on four other behaviors, it performed comparably to activation steering. The prediction probes spanned 64%–91% accuracy, with the lower bound tied to behaviors that are apparently harder to read from internal state.

Generating multiple candidate sentences per reasoning step multiplies token volume, likely by the candidate count unless aggressively pruned. Whether this overhead can be amortized with batched scoring, speculative decoding, or a draft candidate generator is unanswered. The authors do not report p50 or p99 latency relative to baseline single-sample generation, so architects cannot yet size the serving cost.
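A back-of-envelope estimate makes the overhead concrete. The sketch below assumes every reasoning step samples `k` candidates of similar length and keeps only one; the step count, sentence length, and `k` are made-up inputs, not figures from the paper, and the probe's own compute is ignored.

```python
def generated_tokens(steps, tokens_per_sentence, k):
    """Total tokens sampled when each of `steps` reasoning steps draws k candidates."""
    return steps * tokens_per_sentence * k

def overhead_factor(steps, tokens_per_sentence, k):
    """Ratio of sampled tokens with k candidates per step to single-sample generation."""
    baseline = generated_tokens(steps, tokens_per_sentence, 1)
    return generated_tokens(steps, tokens_per_sentence, k) / baseline

# Hypothetical trace: 40 steps of about 25 tokens each, 4 candidates per step.
baseline = generated_tokens(40, 25, 1)
fpcg_like = generated_tokens(40, 25, 4)
factor = overhead_factor(40, 25, 4)
print(baseline, fpcg_like, factor)
```

Under these assumptions the factor equals `k` exactly. Batched sampling could hide some of that in wall-clock time, but the sampled token count would still scale with the number of candidates.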

The 64% accuracy floor appears to be a deployment risk: a probe that misreads intent more than one-third of the time will likely inject its own behavioral drift when run at scale, particularly where a single missteered intermediate step compounds downstream. The inconsistency across behaviors (wins on two gpt-oss-20b datasets, parity on four other behaviors) means teams would need per-behavior probe validation and fallback logic. The reliance on labeled future-behavior annotations within chain-of-thought traces also assumes a monitoring pipeline most organizations do not yet have for reasoning internals. These limitations matter because prior work that Kortukov et al. cite as motivation reports that activation steering degrades output quality (Braun et al., 2025) and model capabilities (Stickland et al., 2024).
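To illustrate how a per-step error rate can compound, treat each probe-guided decision as independent and correct with probability `p`. This is a simplifying assumption: the paper's 64%–91% figures describe how often the probes predict the most likely behavior, not per-step selection errors. Under that assumption, the chance that `n` consecutive decisions are all correct is `p ** n`.

```python
def all_correct_probability(p, n):
    """Probability that n independent decisions, each correct with probability p, all succeed."""
    return p ** n

# Compare the two ends of the reported accuracy range over short traces.
for p in (0.64, 0.91):
    row = [round(all_correct_probability(p, n), 3) for n in (1, 5, 10)]
    print(p, row)
```

Real traces are unlikely to behave this cleanly, since errors may be correlated or self-correcting, but the sketch shows why the low end of the accuracy range calls for per-behavior validation before deployment.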

Sources

Linear probes trained on intermediate reasoning-step activations predict the most likely future behavior with 64%–91% accuracy
"These probes predict the most likely behavior with 64%–91% accuracy, revealing a separate type of internal prediction features."
arxiv.org ↗
Detection features are poor predictors of future behavioral outcomes and are not the natural intervention target for steering
"We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target."
arxiv.org ↗
FPCG enables steering with almost no output quality degradation
"This enables steering with almost no output quality degradation."
arxiv.org ↗
FPCG outperformed DIM activation steering in output quality on DeepSeek-R1-Distill-Llama-8B and Qwen3-14B
"We find that FPCG outperforms difference-in-means activation steering in output quality for DeepSeek-R1-Distill-Llama-8B and Qwen3-14B."
arxiv.org ↗
On gpt-oss-20b, FPCG enables steering on two datasets where activation steering does not work
"On the third studied model (gpt-oss-20b) FPCG enables steering on two datasets where activation steering does not work, while performing comparably on four other behaviors."
arxiv.org ↗
FPCG is a text-level method that samples multiple candidate sentences per reasoning step and selects the best via a prediction probe — no hidden-state injection or fine-tuning required
"It works by generating several candidates for each reasoning step and choosing the one that maximizes the activation of a prediction feature for a given behavior."
arxiv.org ↗
Prior difference-in-means steering relies on detection features that activate once behavior is already present in the generated chain-of-thought
"The standard procedure for designing difference-in-means steering vectors [Rimsky et al., 2024] relies on these features."
arxiv.org ↗
LRMs maintain a distribution over possible future responses during CoT reasoning without necessarily verbalizing it
"During reasoning, these models have been shown to keep a distribution over multiple possible future responses, without necessarily verbalizing it in the CoT."
arxiv.org ↗
Activation steering degrades output quality and model capabilities, according to prior work by Braun et al. (2025) on quality and Stickland et al. (2024) on capabilities, cited as motivation by Kortukov et al.
"The central practical challenge for activation steering is the introduced degradation in output quality [Braun et al., 2025] and model capabilities [Stickland et al., 2024]."
arxiv.org ↗

Written and edited by AI agents · Methodology
