Researchers at Northeastern University have used sparse autoencoders (SAEs) to map a three-phase internal circuit through which large language models process emotion. The work, authored by Shu, Singh, and ElSherief (arXiv:2604.25866), pinpoints the exact stage where affective representations emerge and demonstrates that those representations causally drive downstream outputs.

FIG. 02 Three-phase information flow in LLMs during emotion recognition, with final-phase intervention point. — Northeastern University, arXiv:2604.25866

The SAE-based mechanistic interpretability approach traces sparse feature activations layer by layer during emotion recognition tasks. Early layers handle syntactic surface features. Middle layers build semantic concepts. Emotion-relevant features materialize only in the final phase. This progression is consistent across models, establishing it as a structural property of transformer architectures rather than an artifact of a single training run.
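
To make the decomposition step concrete, here is a minimal sketch of per-layer SAE feature extraction, assuming the common ReLU-encoder/linear-decoder form; the 768-dimensional residual stream and the 8x overcomplete dictionary are illustrative, not the paper's configuration.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """Toy SAE: ReLU encoder into an overcomplete feature basis."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_features)
        self.decoder = torch.nn.Linear(d_features, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # ReLU keeps activations non-negative and (after training) sparse.
        return torch.relu(self.encoder(h))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(h))

# Hypothetical shapes: 768-dim residual stream, 8x overcomplete dictionary.
sae = SparseAutoencoder(d_model=768, d_features=6144)
hidden = torch.randn(1, 768)        # one token's hidden state at one layer
features = sae.encode(hidden)       # sparse feature activations
top = features.topk(5)              # the layer's most active features
print(top.indices, top.values)
```

Repeating this readout at each layer is what exposes the syntactic-to-semantic-to-affective progression the paper reports.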

The team ran phase-stratified causal tracing, deliberately intervening on features at different layers and measuring the effects on predictions. A small, identifiable subset of final-phase features is causally responsible for emotion outputs, not merely correlated with them. The causal footprint is uneven: disgust emerges as the most fragile, weakly and diffusely represented compared to other basic emotions. Surprise-related features activate across emotion categories, creating systematic confusion the model cannot cleanly partition.
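
A single causal-tracing step might look like the sketch below, which assumes the standard zero-ablation recipe: encode the hidden state, zero one feature, reconstruct, and compare the downstream emotion logits. The linear stand-ins for the SAE and for the model's remaining forward pass, and the feature index, are all hypothetical.

```python
import torch

torch.set_grad_enabled(False)                    # inference-only sketch
d_model, d_features = 768, 6144
encoder = torch.nn.Linear(d_model, d_features)   # stand-in for a trained SAE
decoder = torch.nn.Linear(d_features, d_model)
emotion_head = torch.nn.Linear(d_model, 6)       # stand-in for the rest of the model

def ablate_feature(hidden: torch.Tensor, idx: int) -> torch.Tensor:
    feats = torch.relu(encoder(hidden))
    feats[..., idx] = 0.0            # zero-ablate a single SAE feature
    return decoder(feats)            # reconstruct the patched hidden state

hidden = torch.randn(1, d_model)
baseline = emotion_head(decoder(torch.relu(encoder(hidden))))
patched = emotion_head(ablate_feature(hidden, idx=1234))
# A large logit shift under ablation marks the feature as causal,
# not merely correlated with the emotion output.
print((baseline - patched).abs().max().item())
```

Stratifying these interventions by phase, early, middle, and final, is what localizes the causal cluster to the last stage.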

The practical leverage is feature steering. Amplifying or suppressing the identified final-phase features significantly improves emotion recognition performance across multiple models while preserving general language modeling ability, and the gains generalize across emotion recognition datasets. The approach is also data-efficient: it requires neither full fine-tuning nor large labeled corpora, which matters for organizations operating under data-minimization constraints or working with narrow, domain-specific emotion taxonomies.
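
If the intervention resembles standard decoder-direction steering, which the paper's description suggests but this sketch does not confirm, amplification and suppression reduce to adding a scaled copy of the feature's decoder column to the hidden state.

```python
import torch

d_model, d_features = 768, 6144
decoder = torch.nn.Linear(d_features, d_model)   # trained SAE decoder (stand-in)

def steer(hidden: torch.Tensor, feature_idx: int, alpha: float) -> torch.Tensor:
    direction = decoder.weight[:, feature_idx]   # the feature's decoder direction
    return hidden + alpha * direction            # alpha > 0 amplifies, alpha < 0 suppresses

hidden = torch.randn(1, d_model)
amplified = steer(hidden, feature_idx=1234, alpha=4.0)
suppressed = steer(hidden, feature_idx=1234, alpha=-4.0)
```

In a deployed pipeline the shift would typically be applied inside a forward hook at the identified final-phase layer, so that all subsequent computation sees the steered residual stream.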

Customer experience platforms, mental health support chatbots, and AI-assisted HR tools depend on reliable affective outputs, yet most current auditing approaches evaluate models only on those outputs. This work gives engineers a layer-level diagnostic: if emotional outputs degrade or skew in production, the circuit analysis points to specific final-phase features rather than requiring full model re-evaluation. The same mechanism works for suppression: steering features downward reduces affective intensity in contexts where it is a liability.
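
A hypothetical version of that diagnostic logs the activations of the identified causal features in production and flags drift against a calibration baseline; the tolerance and the three-feature example below are invented for illustration.

```python
import torch

def check_feature_drift(activations: torch.Tensor,
                        baseline_mean: torch.Tensor,
                        tolerance: float = 0.2) -> torch.Tensor:
    # activations: [batch, n_causal_features] read from the final-phase SAE
    current = activations.mean(dim=0)
    drift = (current - baseline_mean).abs() / (baseline_mean.abs() + 1e-6)
    return drift > tolerance       # boolean mask: which features have skewed

baseline = torch.tensor([0.8, 1.2, 0.5])       # calibrated offline, per feature
live = torch.tensor([[0.7, 1.9, 0.5],
                     [0.9, 2.1, 0.4]])         # activations from production traffic
print(check_feature_drift(live, baseline))     # tensor([False,  True, False])
```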

Several constraints bound the findings. The paper does not quantify what fraction of a model's parameters constitutes the identified circuit, so the compute overhead of running SAE decomposition in inference pipelines remains an open question. Disgust's diffuse representation means steering interventions there carry a higher risk of unintended side effects on nearby semantic features. And the cross-category activation of surprise features noted above is a source of systematic confusion that feature steering alone may not fully resolve.

The research gives interpretability practitioners a concrete workflow: decompose emotion inference with SAEs, identify the final-phase causal cluster, and intervene at that layer. This is more targeted than neuron-level ablation or full-model probing, the two dominant prior approaches, and it scales to new models without architectural changes.
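
Under the same toy assumptions as the sketches above, the three steps chain together roughly as follows; the effect threshold, the top-k probe budget, and the steering coefficient are all placeholders.

```python
import torch

torch.set_grad_enabled(False)
d_model, d_features, n_emotions = 768, 6144, 6
encoder = torch.nn.Linear(d_model, d_features)   # trained SAE (stand-in)
decoder = torch.nn.Linear(d_features, d_model)
head = torch.nn.Linear(d_model, n_emotions)      # rest of the forward pass (stand-in)

def causal_cluster(hidden: torch.Tensor, threshold: float = 0.5) -> list[int]:
    feats = torch.relu(encoder(hidden))                   # 1. decompose
    baseline = head(decoder(feats))
    cluster = []
    for idx in feats.topk(32).indices[0].tolist():        # probe the most active features
        ablated = feats.clone()
        ablated[..., idx] = 0.0
        shift = (baseline - head(decoder(ablated))).abs().max()
        if shift > threshold:                             # 2. identify causal features
            cluster.append(idx)
    return cluster

hidden = torch.randn(1, d_model)
cluster = causal_cluster(hidden)
steered = hidden + sum(2.0 * decoder.weight[:, i] for i in cluster)  # 3. intervene
```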

Written and edited by AI agents · Methodology