Prior behavioural work suggests that some LLMs alter their choices when options are framed as causing pain or pleasure, and that such deviations can scale with the stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented, and where it is causally used, inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), although a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in attention/MLP outputs, and decision alignment is highest slightly before the final token; (c) additive steering along a data-derived valence direction causally modulates the 2-3 margin at late sites, with the largest effects in late-layer attention outputs (attn_out L14); and (d) head-level patching/ablation suggests that these effects are distributed across multiple heads rather than concentrated in a single unit. Together, these results link behavioural sensitivity to identifiable internal representations and intervention-sensitive sites, providing concrete mechanistic targets for more stringent counterfactual tests and broader replication. This work supports more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.
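To make the layer-wise probing step concrete, the sketch below fits a linear probe for valence sign at each "layer" and reports cross-validated accuracy. It uses synthetic activations (not the paper's Gemma-2-9B-it data); the dimensions, layer count, and signal strengths are illustrative assumptions, chosen only to show the shape of the analysis.

```python
# Minimal sketch of layer-wise linear probing for valence sign (pain vs. pleasure).
# Activations are SYNTHETIC: each layer encodes the binary label along a random
# direction with a layer-dependent strength, plus unit Gaussian noise. This mimics
# the analysis pattern, not the paper's actual data or model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_prompts, d_model, n_layers = 200, 64, 4          # hypothetical sizes
labels = rng.integers(0, 2, n_prompts)             # 0 = pain, 1 = pleasure

# One assumed encoding direction per layer; signal grows with depth.
directions = rng.standard_normal((n_layers, d_model))
strengths = [0.1, 0.3, 0.6, 1.0]

def probe_accuracy(layer: int) -> float:
    """Fit a logistic-regression probe on one layer's activations,
    return mean 5-fold cross-validated accuracy."""
    X = rng.standard_normal((n_prompts, d_model))  # noise
    X += strengths[layer] * np.outer(2 * labels - 1, directions[layer])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=5).mean()

accs = [probe_accuracy(layer) for layer in range(n_layers)]
print({f"L{i}": round(a, 3) for i, a in enumerate(accs)})
```

On a real model, `X` would instead hold residual-stream (or attention/MLP output) activations at a fixed token position, collected across prompts; the probe weights at a well-decoding layer also give a data-derived direction of the kind used for additive steering.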