Complex social behaviors, such as empathy and strategic politeness, are widely assumed to resist the directional decomposition that makes activation steering effective for coarse attributes like sentiment or toxicity. We present STAR (Steering via Attribution and Representation), which tests this assumption by using attribution patching to identify the layer--token positions where each behavioral trait causally originates, then injecting contrastive activation vectors at precisely those locations. Evaluated on emotional dialogue and negotiation in both single- and multi-turn settings, localized injection consistently outperforms global steering and instruction priming; human evaluation confirms that the gains reflect genuine improvements in perceived quality rather than surface-level lexical change. Our results suggest that complex interpersonal behaviors are encoded as localized, approximately linear directions in LLM activation space, and that behavioral alignment is fundamentally a localization problem.
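The two-stage procedure described in the abstract (localize, then inject) can be illustrated with a minimal sketch on toy arrays. The abstract does not give formulas, so everything here is an assumption: the contrastive vector is taken as a difference of mean activations, attribution patching is approximated by the usual first-order estimate (activation difference times the gradient of a metric), and the function names `contrastive_vectors`, `attribution_scores`, `top_k_sites`, and `inject` are hypothetical. NumPy arrays of shape `(layers, tokens, hidden)` stand in for real model activations.

```python
import numpy as np


def contrastive_vectors(pos_acts, neg_acts):
    """Per-layer steering vectors as a difference of means (assumed form).

    pos_acts, neg_acts: arrays of shape (n_examples, layers, hidden) holding
    activations for trait-positive and trait-negative prompts.
    Returns an array of shape (layers, hidden).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def attribution_scores(clean, corrupt, grad):
    """First-order attribution patching estimate for each (layer, token) site.

    clean, corrupt, grad: arrays of shape (layers, tokens, hidden), where grad
    is the gradient of a behavioral metric with respect to the activations.
    Returns an array of shape (layers, tokens).
    """
    return np.sum((clean - corrupt) * grad, axis=-1)


def top_k_sites(scores, k):
    """Return the k (layer, token) pairs with the highest attribution score."""
    flat = np.argsort(scores, axis=None)[::-1][:k]
    return [tuple(int(i) for i in np.unravel_index(f, scores.shape)) for f in flat]


def inject(acts, vectors, sites, alpha=1.0):
    """Add alpha * vectors[layer] only at the selected (layer, token) sites."""
    out = acts.copy()
    for layer, tok in sites:
        out[layer, tok] += alpha * vectors[layer]
    return out


# Toy demonstration with random stand-in activations (not real model data).
rng = np.random.default_rng(0)
layers, tokens, hidden = 4, 6, 8
clean = rng.normal(size=(layers, tokens, hidden))
corrupt = rng.normal(size=(layers, tokens, hidden))
grad = rng.normal(size=(layers, tokens, hidden))
vectors = contrastive_vectors(
    rng.normal(loc=1.0, size=(16, layers, hidden)),
    rng.normal(loc=0.0, size=(16, layers, hidden)),
)
sites = top_k_sites(attribution_scores(clean, corrupt, grad), k=3)
steered = inject(corrupt, vectors, sites, alpha=2.0)
```

Global steering, the baseline named in the abstract, would correspond in this sketch to passing every `(layer, token)` pair as `sites`; the localized variant changes only the positions ranked highest by attribution. In a real model the same addition would typically be applied through forward hooks on the chosen layers rather than on stored arrays.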