Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable safety controls. To understand these side effects, we introduce a diagnostic framework for analyzing interacting behaviors in LLMs. We model behaviors as low-rank subspaces in activation space and study how an intervention on one behavior influences the others. Across multiple instruction-tuned models (7B-70B) and across refusal, jailbreak, and sycophancy settings, we find that different behaviors share internal representations and that intervening on one behavior alters others in asymmetric ways. Some behaviors act as upstream control points whose interventions propagate broadly to other behaviors, while others remain more isolated. We relate these effects to two geometric quantities: (i) the overlap between behavior subspaces, measured as the average squared cosine of the principal angles, and (ii) the angle between each behavior subspace and the decision subspace, which captures the model's final decision (e.g., refuse vs. comply). Empirically, intervention effects on other behaviors tend to be larger for behavior pairs with higher subspace overlap and for source behaviors whose subspaces lie closer (at a smaller angle) to the decision subspace. These findings highlight a challenge for targeted behavior control: behaviors are difficult to modify independently, because interventions can propagate through shared representations and asymmetric interactions.
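The two geometric quantities named above can be illustrated with a minimal NumPy sketch. It assumes each subspace is given as a matrix whose columns are linearly independent spanning vectors in activation space; the function names are hypothetical, and reading the "angle" to the decision subspace as the smallest principal angle is an assumption of this sketch, not something the abstract specifies.

```python
import numpy as np


def orthonormal_basis(vectors: np.ndarray) -> np.ndarray:
    """Orthonormal basis (as columns) for the column span of `vectors`.

    Assumes `vectors` has shape (d, k) with d >= k and full column rank.
    """
    q, _ = np.linalg.qr(vectors)  # reduced QR: q has shape (d, k)
    return q


def principal_cosines(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosines of the principal angles between span(a) and span(b)."""
    qa, qb = orthonormal_basis(a), orthonormal_basis(b)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    s = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against round-off slightly above 1


def subspace_overlap(a: np.ndarray, b: np.ndarray) -> float:
    """Average squared cosine of the principal angles, in [0, 1]."""
    return float(np.mean(principal_cosines(a, b) ** 2))


def smallest_principal_angle(a: np.ndarray, b: np.ndarray) -> float:
    """Smallest principal angle in radians (assumed notion of 'angle')."""
    return float(np.arccos(principal_cosines(a, b).max()))


if __name__ == "__main__":
    e = np.eye(4)
    behavior_a = e[:, [0, 1]]  # toy subspace spanned by e1, e2
    behavior_b = e[:, [0, 2]]  # toy subspace spanned by e1, e3
    # The two toy subspaces share one of two directions.
    print(subspace_overlap(behavior_a, behavior_b))
    print(smallest_principal_angle(behavior_a, behavior_b))
```

In this toy case the subspaces share exactly one of their two directions, so the overlap is one half and the smallest principal angle is zero. With real activations, the basis matrices would come from whatever subspace-extraction procedure is used, which this sketch does not attempt to reproduce.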