Reliably controlling the behavior of large language models (LLMs) is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback (RLHF), prompt engineering, and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. In particular, we bias the forward pass with an added 'steering vector' that is implicitly specified through natural language. Unlike prior work that learned these steering vectors (Subramani, Suresh, and Peters 2022; Hernandez, Li, and Andreas 2023), our Activation Addition (ActAdd) method computes them by taking the difference between the activations produced by a pair of prompts. We demonstrate ActAdd on GPT-2, using OpenWebText and ConceptNet. Our inference-time approach yields control over high-level properties of the output while preserving off-target model performance. It requires far less compute and implementation effort than finetuning or RLHF, allows users to provide natural language specifications, and its overhead scales naturally with model size.
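The core idea (a steering vector taken as the activation difference between a pair of prompts, then added during the forward pass) can be illustrated with a minimal sketch. This is not the authors' implementation and does not use GPT-2: the toy residual model, the names `forward`, `steering_vector`, `inject_layer`, and `coeff`, the scaling coefficient itself, and the choice to align the vector with the first token positions are all assumptions made for illustration.

```python
import math
import random

D_MODEL = 8    # hypothetical residual-stream width
N_LAYERS = 4   # hypothetical number of layers
VOCAB = 16     # hypothetical vocabulary size

rng = random.Random(0)
# Toy parameters: an embedding table and one square weight matrix per layer.
EMBED = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
WEIGHTS = [
    [[rng.gauss(0, 0.3) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
    for _ in range(N_LAYERS)
]


def layer_update(weight, h):
    """Toy residual block: each position adds tanh(W @ causal mean of positions so far)."""
    out, running = [], [0.0] * D_MODEL
    for pos, vec in enumerate(h):
        running = [r + x for r, x in zip(running, vec)]
        context = [r / (pos + 1) for r in running]
        delta = [math.tanh(sum(w * c for w, c in zip(row, context))) for row in weight]
        out.append([x + d for x, d in zip(vec, delta)])
    return out


def forward(tokens, inject_layer=None, steering=None):
    """Run the toy model. If `steering` is given, add it to the residual stream
    at the input of layer `inject_layer`, aligned to the first positions.
    Returns (final activations, unsteered input activations per layer)."""
    h = [list(EMBED[t]) for t in tokens]
    cache = {}
    for i, weight in enumerate(WEIGHTS):
        cache[i] = [list(v) for v in h]  # recorded before any injection
        if steering is not None and i == inject_layer:
            for pos, vec in enumerate(steering[: len(h)]):
                h[pos] = [x + s for x, s in zip(h[pos], vec)]
        h = layer_update(weight, h)
    return h, cache


def steering_vector(tokens_plus, tokens_minus, layer, coeff):
    """coeff * (activations of the positive prompt - activations of the negative prompt)."""
    if len(tokens_plus) != len(tokens_minus):
        raise ValueError("pad the two prompts to the same number of tokens")
    _, cache_plus = forward(tokens_plus)
    _, cache_minus = forward(tokens_minus)
    return [
        [coeff * (p - m) for p, m in zip(vp, vm)]
        for vp, vm in zip(cache_plus[layer], cache_minus[layer])
    ]


# Token ids below are stand-ins for a tokenized prompt pair and a user prompt.
plus, minus = [3, 7], [3, 9]
vec = steering_vector(plus, minus, layer=2, coeff=4.0)
base, _ = forward([1, 2, 5, 6])
steered, _ = forward([1, 2, 5, 6], inject_layer=2, steering=vec)
```

Because the two prompts share their first token and the toy model is causal, the steering vector is zero at position 0 and nonzero at position 1; the injected difference then reaches later positions through the causal mixing. No parameters are trained or updated at any point: the only extra cost in this sketch is two additional forward passes to compute the vector.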