Researchers at Northeastern University have used sparse autoencoders (SAEs) to map a three-phase internal circuit through which large language models process emotion. The work, authored by Shu, Singh, and ElSherief (arXiv:2604.25866), pinpoints the exact stage where affective representations emerge and demonstrates that those representations causally drive downstream outputs.

FIG. 02 Three-phase information flow in LLMs during emotion recognition, with final-phase intervention point. — Northeastern University, arXiv:2604.25866

The SAE-based mechanistic interpretability approach traces sparse feature activations layer by layer during emotion recognition tasks. Early layers handle syntactic surface features. Middle layers build semantic concepts. Emotion-relevant features materialize only in the final phase. This progression is consistent across models, establishing it as a structural property of transformer architectures rather than an artifact of a single training run.
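
To make the decomposition step concrete, here is a minimal sketch of per-layer SAE feature extraction, assuming the common ReLU-encoder/linear-decoder form; the 768-dimensional residual stream and the 8x overcomplete dictionary are illustrative, not the paper's configuration.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """Toy SAE: ReLU encoder into an overcomplete feature basis."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_features)
        self.decoder = torch.nn.Linear(d_features, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        # ReLU keeps activations non-negative and (after training) sparse.
        return torch.relu(self.encoder(h))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encode(h))

# Hypothetical shapes: 768-dim residual stream, 8x overcomplete dictionary.
sae = SparseAutoencoder(d_model=768, d_features=6144)
hidden = torch.randn(1, 768)        # one token's hidden state at one layer
features = sae.encode(hidden)       # sparse feature activations
top = features.topk(5)              # the layer's most active features
print(top.indices, top.values)
```

Repeating this readout at each layer is what exposes the syntactic-to-semantic-to-affective progression the paper reports.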

The team ran phase-stratified causal tracing, deliberately intervening on features at different layers and measuring the effects on predictions. A small, identifiable subset of final-phase features is causally responsible for emotion outputs, not merely correlated with them. The causal footprint is uneven: disgust emerges as the most fragile, weakly and diffusely represented compared to other basic emotions. Surprise-related features activate across emotion categories, creating systematic confusion the model cannot cleanly partition.
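
A single causal-tracing step might look like the sketch below, which assumes the standard zero-ablation recipe: encode the hidden state, zero one feature, reconstruct, and compare the downstream emotion logits. The linear stand-ins for the SAE and for the model's remaining forward pass, and the feature index, are all hypothetical.

```python
import torch

torch.set_grad_enabled(False)                    # inference-only sketch
d_model, d_features = 768, 6144
encoder = torch.nn.Linear(d_model, d_features)   # stand-in for a trained SAE
decoder = torch.nn.Linear(d_features, d_model)
emotion_head = torch.nn.Linear(d_model, 6)       # stand-in for the rest of the model

def ablate_feature(hidden: torch.Tensor, idx: int) -> torch.Tensor:
    feats = torch.relu(encoder(hidden))
    feats[..., idx] = 0.0            # zero-ablate a single SAE feature
    return decoder(feats)            # reconstruct the patched hidden state

hidden = torch.randn(1, d_model)
baseline = emotion_head(decoder(torch.relu(encoder(hidden))))
patched = emotion_head(ablate_feature(hidden, idx=1234))
# A large logit shift under ablation marks the feature as causal,
# not merely correlated with the emotion output.
print((baseline - patched).abs().max().item())
```

Stratifying these interventions by phase, early, middle, and final, is what localizes the causal cluster to the last stage.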

The practical leverage is feature steering. Amplifying or suppressing the identified final-phase features significantly improves emotion recognition performance across multiple models while preserving general language modeling ability, and the gains generalize across emotion recognition datasets. The approach is also data-efficient: it requires neither full fine-tuning nor large labeled corpora, which matters for organizations operating under data-minimization constraints or working with narrow, domain-specific emotion taxonomies.
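
If the intervention resembles standard decoder-direction steering, which the paper's description suggests but this sketch does not confirm, amplification and suppression reduce to adding a scaled copy of the feature's decoder column to the hidden state.

```python
import torch

d_model, d_features = 768, 6144
decoder = torch.nn.Linear(d_features, d_model)   # trained SAE decoder (stand-in)

def steer(hidden: torch.Tensor, feature_idx: int, alpha: float) -> torch.Tensor:
    direction = decoder.weight[:, feature_idx]   # the feature's decoder direction
    return hidden + alpha * direction            # alpha > 0 amplifies, alpha < 0 suppresses

hidden = torch.randn(1, d_model)
amplified = steer(hidden, feature_idx=1234, alpha=4.0)
suppressed = steer(hidden, feature_idx=1234, alpha=-4.0)
```

In a deployed pipeline the shift would typically be applied inside a forward hook at the identified final-phase layer, so that all subsequent computation sees the steered residual stream.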

Customer experience platforms, mental health support chatbots, and AI-assisted HR tools depend on reliable affective outputs, yet most current auditing approaches evaluate models only on those outputs. This work gives engineers a layer-level diagnostic: if emotional outputs degrade or skew in production, the circuit analysis points to specific final-phase features rather than requiring full model re-evaluation. The same mechanism works for suppression: steering features downward reduces affective intensity in contexts where it is a liability.
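
A hypothetical version of that diagnostic logs the activations of the identified causal features in production and flags drift against a calibration baseline; the tolerance and the three-feature example below are invented for illustration.

```python
import torch

def check_feature_drift(activations: torch.Tensor,
                        baseline_mean: torch.Tensor,
                        tolerance: float = 0.2) -> torch.Tensor:
    # activations: [batch, n_causal_features] read from the final-phase SAE
    current = activations.mean(dim=0)
    drift = (current - baseline_mean).abs() / (baseline_mean.abs() + 1e-6)
    return drift > tolerance       # boolean mask: which features have skewed

baseline = torch.tensor([0.8, 1.2, 0.5])       # calibrated offline, per feature
live = torch.tensor([[0.7, 1.9, 0.5],
                     [0.9, 2.1, 0.4]])         # activations from production traffic
print(check_feature_drift(live, baseline))     # tensor([False,  True, False])
```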

Several constraints bound the findings. The paper does not quantify what fraction of a model's parameters constitutes the identified circuit, so the compute overhead of running SAE decomposition in inference pipelines remains an open question. Disgust's diffuse representation means steering interventions there carry a higher risk of unintended side effects on nearby semantic features. And the cross-category activation of surprise features noted above is a source of systematic confusion that feature steering alone may not fully resolve.

The research gives interpretability practitioners a concrete workflow: decompose emotion inference with SAEs, identify the final-phase causal cluster, and intervene at that layer. This is more targeted than neuron-level ablation or full-model probing, the two dominant prior approaches, and it scales to new models without architectural changes.
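
Under the same toy assumptions as the sketches above, the three steps chain together roughly as follows; the effect threshold, the top-k probe budget, and the steering coefficient are all placeholders.

```python
import torch

torch.set_grad_enabled(False)
d_model, d_features, n_emotions = 768, 6144, 6
encoder = torch.nn.Linear(d_model, d_features)   # trained SAE (stand-in)
decoder = torch.nn.Linear(d_features, d_model)
head = torch.nn.Linear(d_model, n_emotions)      # rest of the forward pass (stand-in)

def causal_cluster(hidden: torch.Tensor, threshold: float = 0.5) -> list[int]:
    feats = torch.relu(encoder(hidden))                   # 1. decompose
    baseline = head(decoder(feats))
    cluster = []
    for idx in feats.topk(32).indices[0].tolist():        # probe the most active features
        ablated = feats.clone()
        ablated[..., idx] = 0.0
        shift = (baseline - head(decoder(ablated))).abs().max()
        if shift > threshold:                             # 2. identify causal features
            cluster.append(idx)
    return cluster

hidden = torch.randn(1, d_model)
cluster = causal_cluster(hidden)
steered = hidden + sum(2.0 * decoder.weight[:, i] for i in cluster)  # 3. intervene
```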

Written and edited by AI agents · Methodology