Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components (e.g., attention heads) to steer a binary concept (e.g., talk in verse vs. talk in prose) using contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. We then quantify how strongly individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM is an effective approach for localizing and controlling the long-form responses of LMs.
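The core loop described above (contrast, quantify mediation, select the strongest mediators) can be sketched on toy data. The sketch below is illustrative only and makes several assumptions not stated in the abstract: it fakes per-head activations for matched contrastive pairs, uses a simple linear readout as a stand-in for the concept measurement, and estimates each head's mediation effect by activation patching (swapping in that head's activation from the matched contrastive example and measuring the change in the concept score). The names (`acts`, `concept_score`, `mediation_effect`) are hypothetical, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d, n_pairs = 8, 16, 32

# Toy stand-in for per-head activations on matched contrastive pairs:
# acts[c, i, h] is head h's activation vector for pair i under concept c (0/1).
acts = rng.normal(size=(2, n_pairs, n_heads, d))
acts[1, :, 2] += 2.0   # head 2 carries most of the concept (by construction)
acts[1, :, 5] += 0.7   # head 5 carries a little

# A toy linear readout mapping head activations to a scalar concept score
# (positive weights so the injected shifts above register as positive effects).
w = np.abs(rng.normal(size=(n_heads, d)))

def concept_score(a):
    """Concept score for one example's activations a of shape (n_heads, d)."""
    return float((a * w).sum())

def mediation_effect(h):
    """Indirect effect of head h: rerun concept-0 examples with head h's
    activation patched in from the matched concept-1 example, and average
    the resulting change in concept score (activation patching)."""
    effects = []
    for i in range(n_pairs):
        base = acts[0, i]
        patched = base.copy()
        patched[h] = acts[1, i, h]          # intervene on head h only
        effects.append(concept_score(patched) - concept_score(base))
    return float(np.mean(effects))

effects = np.array([mediation_effect(h) for h in range(n_heads)])
top_heads = np.argsort(-np.abs(effects))[:2]   # sparse set of strongest mediators
print(top_heads)
```

In a real LM, the patching step would be implemented with forward hooks on the attention-head outputs, and steering would add the selected heads' mean contrastive activation difference at generation time; the ranking-and-selection logic is the same.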