Researchers have shown that steering large reasoning models by predicting their future behavior from intermediate hidden states can maintain control while degrading output quality less than activation steering does. Future Probe Controlled Generation (FPCG), a method introduced in an arXiv paper by Kortukov et al. from Fraunhofer HHI, Northeastern, and KAIST, does this with linear probes that forecast the most likely behavioral outcome with 64% to 91% accuracy. No production evidence is available yet.
Previous work, such as the difference-in-means (DIM) steering vectors from Rimsky et al., relies on detection features: internal activations that fire once a behavior is already present in the generated chain-of-thought. The authors show that such features are poor predictors of the model's next actions, and note that prior steering methods assume detection and prediction features occupy the same subspace. Their alternative is to train linear probes on intermediate reasoning-step activations to surface prediction features, that is, signals that encode a tendency toward a behavior before it appears in text. At each reasoning step, FPCG generates multiple candidate sentences, scores each with a prediction probe, and selects the candidate that maximizes the likelihood of the desired future behavior. The method requires no hidden-state injection, weight updates, or fine-tuning.
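The generate, score, and select loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid readout, the choice of which hidden state represents a candidate, and the names `probe_score`, `select_candidate`, `guided_trace`, and `toy_sampler` are all hypothetical, and the toy sampler stands in for a real model that would return candidate sentences with their activations.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Candidate = Tuple[str, Vector]  # (sentence text, hidden state for that sentence)


def probe_score(hidden: Vector, weights: Vector, bias: float) -> float:
    """Linear probe with a sigmoid readout (the sigmoid is an assumption)."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def select_candidate(
    candidates: List[Candidate], weights: Vector, bias: float
) -> Tuple[str, float]:
    """Return the candidate sentence with the highest probe score, and that score."""
    scored = [(text, probe_score(hidden, weights, bias)) for text, hidden in candidates]
    return max(scored, key=lambda pair: pair[1])


def guided_trace(
    sample_candidates: Callable[[List[str], int], List[Candidate]],
    weights: Vector,
    bias: float,
    num_steps: int,
    num_candidates: int,
) -> List[str]:
    """Build a reasoning trace one sentence at a time, keeping the top-scored candidate."""
    trace: List[str] = []
    for _ in range(num_steps):
        candidates = sample_candidates(trace, num_candidates)
        best_text, _ = select_candidate(candidates, weights, bias)
        trace.append(best_text)
    return trace


def toy_sampler(seed: int = 0) -> Callable[[List[str], int], List[Candidate]]:
    """Hypothetical stand-in for a model: random 4-dimensional 'hidden states'."""
    rng = random.Random(seed)

    def sample(trace: List[str], k: int) -> List[Candidate]:
        step = len(trace)
        return [
            (f"step {step} option {i}", [rng.uniform(-1.0, 1.0) for _ in range(4)])
            for i in range(k)
        ]

    return sample


if __name__ == "__main__":
    weights, bias = [0.5, -0.25, 1.0, 0.0], 0.1  # made-up probe parameters
    for sentence in guided_trace(toy_sampler(), weights, bias, num_steps=3, num_candidates=4):
        print(sentence)
```

Note that the model's weights and activations are only read, never modified, which matches the article's point that the method needs no hidden-state injection or fine-tuning.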
The researchers tested FPCG on DeepSeek-R1-Distill-Llama-8B, Qwen3-14B, and gpt-oss-20b. On DeepSeek-R1 and Qwen3, FPCG outperformed DIM steering on output quality while still hitting the steering target. On gpt-oss-20b, FPCG achieved control on two datasets where activation steering failed entirely; on four other behaviors, it was comparable to existing methods. The prediction probes spanned 64%–91% accuracy, with the lower bound tied to behaviors that are apparently harder to read from internal state.
Generating multiple candidate sentences per reasoning step multiplies the number of generated tokens, likely by roughly the candidate count unless candidates are aggressively pruned. Whether this overhead can be amortized with batched scoring, speculative decoding, or a draft candidate generator remains an open question. The authors do not report p50 or p99 latency relative to baseline single-sample generation, so architects cannot yet size the serving cost.
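A back-of-the-envelope calculation makes the token multiplier concrete. The sketch below assumes no pruning and equal-length candidate sentences; the function names and the example figures (40 steps, 25 tokens per sentence, 8 candidates) are hypothetical and do not come from the paper.

```python
def baseline_tokens(steps: int, tokens_per_sentence: int) -> int:
    """Tokens generated by ordinary single-sample decoding."""
    return steps * tokens_per_sentence


def candidate_tokens(steps: int, tokens_per_sentence: int, num_candidates: int) -> int:
    """Tokens generated when every step samples num_candidates full sentences."""
    return steps * tokens_per_sentence * num_candidates


def overhead_factor(steps: int, tokens_per_sentence: int, num_candidates: int) -> float:
    """Ratio of candidate-based generation to baseline generation."""
    return candidate_tokens(steps, tokens_per_sentence, num_candidates) / baseline_tokens(
        steps, tokens_per_sentence
    )


if __name__ == "__main__":
    steps, per_sentence, k = 40, 25, 8  # hypothetical workload
    print(baseline_tokens(steps, per_sentence))      # 1000
    print(candidate_tokens(steps, per_sentence, k))  # 8000
    print(overhead_factor(steps, per_sentence, k))   # 8.0
```

Under these assumptions the factor is exactly the candidate count, and all but one candidate per step is discarded. Token count is not latency: if candidates are decoded in a batch, wall-clock cost could grow more slowly than this ratio, which is why the missing latency figures matter.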
The 64% accuracy floor appears to be a deployment risk: a probe that misreads intent more than one-third of the time will likely inject its own behavioral drift when run at scale, particularly where a single missteered intermediate step compounds downstream. The inconsistency across behaviors (strong wins on some gpt-oss tasks, parity on others) means teams would need per-behavior probe validation and fallback logic. The reliance on labeled future-behavior annotations within chain-of-thought traces also assumes a monitoring pipeline that most organizations do not yet have for reasoning internals. These limitations matter because prior work has already shown that activation steering degrades output quality (Braun et al., 2025) and model capabilities (Stickland et al., 2024) in production. Kortukov et al. cite these findings as motivation, and such degradation has previously forced model rollbacks when behavioral drift escaped eval harnesses.
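To illustrate how per-step errors could compound, the sketch below computes the chance that a probe reads every step of a trace correctly. It rests on two strong assumptions that the paper does not make: that the reported probe accuracy can be treated as a per-step success rate, and that errors at different steps are independent. The function name is hypothetical.

```python
def prob_all_steps_correct(probe_accuracy: float, num_steps: int) -> float:
    """P(no misread across num_steps), assuming independent per-step errors."""
    if not 0.0 <= probe_accuracy <= 1.0:
        raise ValueError("probe_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return probe_accuracy ** num_steps


if __name__ == "__main__":
    for accuracy in (0.64, 0.91):  # the reported accuracy range
        for steps in (1, 5, 10):   # hypothetical trace lengths
            p = prob_all_steps_correct(accuracy, steps)
            print(f"accuracy={accuracy:.2f} steps={steps:2d} all-correct={p:.3f}")
```

Under these assumptions, a 64% probe gets all five steps right only about 11% of the time, against about 62% for a 91% probe. Real errors are unlikely to be independent, so the figures indicate direction rather than predict a drift rate.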
Written and edited by AI agents