We find that correct-to-incorrect sycophancy signals are most linearly separable within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability also appears in the residual stream and MLPs, steering with these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified "truthful" directions reveals limited overlap, suggesting that factual accuracy and resistance to user deference arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations.
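To make the probing setup concrete, the sketch below trains a per-head linear probe on cached attention-head activations and extracts a unit steering direction from each probe's weights. This is a minimal illustration, not the paper's actual pipeline: the array shapes, the synthetic stand-in data, the probe hyperparameters, and the top-k head selection are all assumptions.

```python
# Minimal sketch of per-head linear probing for sycophancy signals.
# Assumption: attention-head outputs have been cached offline as an array of
# shape [n_samples, n_layers, n_heads, d_head], with binary labels marking
# correct-to-incorrect sycophantic flips. Random data stands in for real caches.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_layers, n_heads, d_head = 512, 24, 16, 64  # illustrative sizes
acts = rng.normal(size=(n_samples, n_layers, n_heads, d_head)).astype(np.float32)
labels = rng.integers(0, 2, size=n_samples)

val_acc = np.zeros((n_layers, n_heads))
directions = np.zeros((n_layers, n_heads, d_head), dtype=np.float32)

for layer in range(n_layers):
    for head in range(n_heads):
        X = acts[:, layer, head, :]
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, labels, test_size=0.25, random_state=0, stratify=labels
        )
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        val_acc[layer, head] = probe.score(X_va, y_va)
        # The normalized probe weight vector serves as a candidate
        # steering direction for this head.
        w = probe.coef_[0]
        directions[layer, head] = w / np.linalg.norm(w)

# Keep a sparse subset of the most linearly separable heads; at inference,
# one would add (or subtract) alpha * direction to each selected head's output.
k = 8
flat = np.argsort(val_acc, axis=None)[::-1][:k]
top_heads = [np.unravel_index(i, val_acc.shape) for i in flat]
print("top heads (layer, head):", top_heads)
```

Under this setup, the overlap comparison mentioned above reduces to a cosine similarity between each selected head's `directions[layer, head]` vector and a previously published "truthful" direction for the same head.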