Where should we intervene in a language model (LM) to localize and control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure that selects model components (e.g., attention heads) from contrastive long-form responses in order to steer such diffuse concepts (e.g., talking in verse vs. talking in prose). In GCM, we first construct a dataset of contrasting behavioral inputs and long-form responses. We then quantify how strongly each model component mediates the concept and select the strongest mediators for steering. We evaluate GCM on three behaviors (refusal, sycophancy, and style transfer) across three language models. GCM successfully localizes concepts expressed in long-form responses and outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results show that GCM is an effective approach for localizing concepts from, and controlling, the long-form responses of LMs.
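The two selection steps described in the abstract (score each component as a mediator, then keep the strongest) can be illustrated with a toy sketch. This is not the paper's estimator: the abstract does not specify how mediation is quantified, so the sketch assumes a generic one-head-at-a-time activation-patching estimate of the indirect effect. The names `indirect_effects`, `select_top_heads`, `score_fn`, and the linear `readout` are all hypothetical, and the "heads" here are scalar activations rather than real attention heads.

```python
import numpy as np

def indirect_effects(base_acts, contrast_acts, score_fn):
    """Toy indirect effect per head (assumed estimator, not the paper's).

    For each head, copy its activation from the contrastive run into the
    base run and record how much the score changes.
    """
    baseline = score_fn(base_acts)
    effects = np.zeros(base_acts.shape[0])
    for h in range(base_acts.shape[0]):
        patched = base_acts.copy()
        patched[h] = contrast_acts[h]  # patch a single head
        effects[h] = score_fn(patched) - baseline
    return effects

def select_top_heads(effects, k):
    """Return indices of the k heads with the largest effect."""
    order = np.argsort(-effects, kind="stable")
    return [int(i) for i in order[:k]]

# Hypothetical setup: 5 scalar "heads" and a linear readout standing in
# for the score of the contrastive long-form response.
readout = np.array([0.0, 2.0, 0.1, -1.0, 3.0])

def score_fn(acts):
    return float(readout @ acts)

base_acts = np.zeros(5)      # activations on the original input
contrast_acts = np.ones(5)   # activations on the contrasting input

effects = indirect_effects(base_acts, contrast_acts, score_fn)
top_heads = select_top_heads(effects, k=2)
print(top_heads)
```

In this toy setting the readout is linear, so each head's effect equals its readout weight and the two selected heads are those with the largest weights. In a real LM the score would come from forward passes over long-form responses, and the patching would target actual attention-head outputs.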