Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering, and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. In particular, we bias the forward pass by adding a 'steering vector' that is implicitly specified through natural language. Unlike past work, which learned these steering vectors, our Activation Addition (ActAdd) method computes them by taking the difference between the activations produced by a pair of prompts. We demonstrate ActAdd with GPT-2 on OpenWebText and ConceptNet. Our inference-time approach yields control over high-level properties of the output while preserving off-target model performance. It requires far less compute and implementation effort than finetuning, lets users provide natural-language specifications, and incurs overhead that scales naturally with model size.
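To make the mechanism concrete, the following is a minimal sketch of the activation-difference idea on a toy residual network written in plain Python. It is not the GPT-2 setup used in the work. The toy model, the `<pad>` token, the choice of injection layer, the scaling coefficient `coeff`, and the front-aligned injection positions are all illustrative assumptions rather than details stated in this abstract.

```python
import math
import random

D_MODEL, N_LAYERS = 8, 4   # toy sizes, chosen only for illustration
PAD = "<pad>"              # hypothetical padding token

def embed(token):
    # Deterministic toy embedding: seed a private RNG with the token string.
    rng = random.Random(token)
    return [rng.gauss(0, 1) for _ in range(D_MODEL)]

def make_weights(seed=0):
    # One random D_MODEL x D_MODEL matrix per layer (stand-in for a trained model).
    rng = random.Random(seed)
    return [[[rng.gauss(0, 0.3) for _ in range(D_MODEL)]
             for _ in range(D_MODEL)] for _ in range(N_LAYERS)]

def block(W, stream):
    # Toy residual block: each position is updated from the causal mean of
    # the positions up to and including it (a crude stand-in for attention).
    out, running = [], [0.0] * D_MODEL
    for i, x in enumerate(stream):
        running = [r + xi for r, xi in zip(running, x)]
        ctx = [r / (i + 1) for r in running]
        update = [math.tanh(sum(w * c for w, c in zip(row, ctx))) for row in W]
        out.append([xi + u for xi, u in zip(x, update)])
    return out

def forward(weights, tokens, steer=None, layer=None):
    """Run the toy model. Returns (final stream, per-layer input streams).

    If `steer` is given, it is added to the residual stream entering
    `layer`, aligned to the first len(steer) positions.
    """
    stream = [embed(t) for t in tokens]
    cache = []
    for l, W in enumerate(weights):
        if steer is not None and l == layer:
            stream = [
                [x + s for x, s in zip(pos, steer[i])] if i < len(steer) else pos
                for i, pos in enumerate(stream)
            ]
        cache.append([list(p) for p in stream])  # stream entering layer l
        stream = block(W, stream)
    return stream, cache

def steering_vector(weights, prompt_pos, prompt_neg, layer, coeff):
    """Scaled difference of the activations of two prompts at `layer`."""
    pos, neg = prompt_pos.split(), prompt_neg.split()
    n = max(len(pos), len(neg))
    pos += [PAD] * (n - len(pos))   # pad so both prompts have equal length
    neg += [PAD] * (n - len(neg))
    _, cache_pos = forward(weights, pos)
    _, cache_neg = forward(weights, neg)
    return [[coeff * (a - b) for a, b in zip(hp, hn)]
            for hp, hn in zip(cache_pos[layer], cache_neg[layer])]

if __name__ == "__main__":
    weights = make_weights(seed=0)
    steer = steering_vector(weights, "Love", "Hate", layer=2, coeff=5.0)
    tokens = "I went to the park".split()
    base, _ = forward(weights, tokens)
    steered, _ = forward(weights, tokens, steer=steer, layer=2)
    shift = math.dist(base[-1], steered[-1])
    print(f"L2 shift at the final position: {shift:.3f}")
```

No parameters are trained in this sketch: the steering vector comes from two extra forward passes and is then added during an otherwise ordinary forward pass. In the toy model the effect shows up only as a shift in the final activations. With a real language model, the same steps would act on the residual stream of a transformer layer, and the shifted activations would then change the sampled text.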