We introduce Contrastive Activation Addition (CAA), a method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user's prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA's effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective both on its own and on top of traditional methods like finetuning and system prompt design, and minimally reduces model capabilities. Moreover, we gain deeper insight into CAA's mechanisms by employing various activation space interpretation methods. CAA both accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
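The mean-difference construction and inference-time addition described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, array shapes, and the toy random activations are all hypothetical, and in practice the activations would come from a chosen residual-stream layer of the model.

```python
import numpy as np

def compute_steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Average the per-pair difference in residual-stream activations
    between positive and negative behavior examples.
    Both inputs have hypothetical shape (n_pairs, hidden_dim)."""
    return (pos_acts - neg_acts).mean(axis=0)

def apply_caa(resid: np.ndarray, steering_vector: np.ndarray,
              coeff: float, prompt_len: int) -> np.ndarray:
    """Add the scaled steering vector at every token position after the
    user's prompt. A positive coeff amplifies the target behavior;
    a negative coeff suppresses it. `resid` has shape (seq_len, hidden_dim)."""
    steered = resid.copy()
    steered[prompt_len:] += coeff * steering_vector
    return steered

# Toy demonstration with random "activations" (illustrative only).
rng = np.random.default_rng(0)
pos = rng.normal(size=(8, 4))      # 8 positive examples, hidden dim 4
neg = rng.normal(size=(8, 4))      # 8 matched negative examples
v = compute_steering_vector(pos, neg)

resid = rng.normal(size=(10, 4))   # 10 token positions in the sequence
out = apply_caa(resid, v, coeff=2.0, prompt_len=3)
```

Note that positions within the prompt (here the first 3 tokens) are left untouched; only positions after the prompt receive the steering addition, matching the intervention described in the abstract.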