Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison

Small language models (SLMs) in the 100M-10B parameter range increasingly power production systems, yet whether they possess the internal emotion representations recently discovered in frontier models remains unknown. We present the first comparative analysis of emotion vector extraction methods for SLMs, evaluating 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods (generation-based and comprehension-based). Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007; Cohen's d = -107.5), with the advantage modulated by instruction tuning and architecture. Emotion representations localize at middle transformer layers (~50% depth), following a U-shaped curve that is architecture-invariant from 124M to 3B parameters. We validate these findings against representational anisotropy baselines across 4 models and confirm causal behavioral effects through steering experiments, independently verified by an external emotion classifier (92% success rate, 37/40 scenarios). Steering reveals three regimes -- surgical (coherent text transformation), repetitive collapse, and explosive (text degradation) -- quantified by perplexity ratios and separated by model architecture rather than scale. We document cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress, raising safety concerns for multilingual deployment. This work provides methodological guidelines for emotion research on open-weight models and contributes to the Model Medicine series by bridging external behavioral profiling with internal representational analysis.
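
The abstract names vector extraction, steering, and perplexity ratios without giving formulas, so the following is only a minimal sketch of one common way such a pipeline could look. It assumes a difference-of-means extraction (mean activation on emotion-labelled text minus mean activation on neutral text) and additive steering at a single layer; neither is confirmed here as the authors' method. All function names, the synthetic activations, and the steering strength `alpha` are hypothetical and chosen for illustration.

```python
import numpy as np

def emotion_vector(emotion_acts, neutral_acts):
    """Assumed extraction: difference of mean activations, scaled to unit length.

    Both inputs have shape (n_samples, hidden_dim) and stand in for
    hidden states collected at one (e.g. middle) transformer layer.
    """
    v = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, vector, alpha):
    """Assumed steering: add alpha * vector to every token position."""
    return hidden + alpha * vector

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return float(np.exp(-np.mean(token_log_probs)))

def perplexity_ratio(steered_log_probs, baseline_log_probs):
    """Ratio > 1 means the steered text is less predictable than the baseline."""
    return perplexity(steered_log_probs) / perplexity(baseline_log_probs)

# Synthetic stand-in for real activations: "joy" samples are shifted
# along one hidden dimension relative to neutral samples.
rng = np.random.default_rng(0)
hidden_dim = 16
true_direction = np.zeros(hidden_dim)
true_direction[0] = 1.0

neutral_acts = rng.normal(size=(200, hidden_dim))
joy_acts = rng.normal(size=(200, hidden_dim)) + 3.0 * true_direction

joy_vec = emotion_vector(joy_acts, neutral_acts)

# Steer five token positions and measure the shift along the vector.
hidden = rng.normal(size=(5, hidden_dim))
steered = steer(hidden, joy_vec, alpha=4.0)
shift = steered @ joy_vec - hidden @ joy_vec
```

Because `joy_vec` has unit length, the projection of each steered position onto it moves by exactly `alpha`. With a real model the activations would be captured at a chosen layer, and a perplexity ratio well above 1 would flag the degraded regimes the abstract describes.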
