Large language models (LLMs) are increasingly used in emotionally sensitive human-AI applications, yet little is known about how emotion recognition is internally represented. In this work, we investigate the internal mechanisms of emotion recognition in LLMs using sparse autoencoders (SAEs). By analyzing sparse feature activations across layers, we identify a consistent three-phase information flow in which emotion-related features emerge only in the final phase. We further show that emotion representations comprise both features shared across emotions and emotion-specific features. Using phase-stratified causal tracing, we identify a small set of features that strongly influence emotion predictions, and show that both the number of such features and their causal impact vary across emotions; in particular, Disgust is more weakly and diffusely represented than other emotions. Finally, we propose an interpretable and data-efficient causal feature steering method that significantly improves emotion recognition performance across multiple models while largely preserving language modeling ability, and we demonstrate that these improvements generalize across multiple emotion recognition datasets. Overall, our findings provide a systematic analysis of the internal mechanisms underlying emotion recognition in LLMs and introduce an efficient, interpretable, and controllable approach for improving model performance.