Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering, and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them by taking the activation differences that result from pairs of prompts. We demonstrate ActAdd on GPT-2 on OpenWebText and ConceptNet, and replicate the effect on Llama-13B and GPT-J-6B. Our approach yields inference-time control over high-level properties of output and preserves performance on off-target topics. The method requires far less compute and implementation effort than finetuning and RLHF, allows for natural language specification by users, and its overhead scales naturally with model size.
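To make the mechanism concrete, the following is a minimal sketch of the idea on a toy model, not the implementation used in our experiments. It computes a steering vector as the scaled difference between the activations of two prompts at one layer, then adds that vector to the activations of a user prompt at the same layer during the forward pass. Everything in it is an assumption made for illustration: the toy residual network (with a causal mean standing in for attention), the names `forward` and `steering_vector`, the choice of layer and coefficient, the requirement that both prompts have the same number of tokens, and the choice to add the vector at the first token positions.

```python
import math
import random

D_MODEL = 8    # hypothetical residual-stream width
N_LAYERS = 4   # hypothetical number of toy layers
rng = random.Random(0)

# Toy stand-in for a language model: one random weight matrix per layer.
LAYERS = [[[rng.gauss(0.0, 0.3) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
          for _ in range(N_LAYERS)]
EMBED = {}     # token -> embedding, filled lazily


def embed(token):
    """Return a fixed random embedding for a token (toy assumption)."""
    if token not in EMBED:
        EMBED[token] = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    return list(EMBED[token])


def layer_step(hs, w):
    """Residual update; a causal mean over earlier positions mimics attention."""
    out = []
    for pos, h in enumerate(hs):
        ctx = [sum(col) / (pos + 1) for col in zip(*hs[:pos + 1])]
        update = [math.tanh(sum(a * b for a, b in zip(row, ctx))) for row in w]
        out.append([x + u for x, u in zip(h, update)])
    return out


def forward(tokens, inject_layer=None, steering=None):
    """Run the toy model; return final activations and the input to each layer."""
    hs = [embed(t) for t in tokens]
    cache = []
    for layer, w in enumerate(LAYERS):
        if steering is not None and layer == inject_layer:
            # Add the steering vector at the first token positions.
            for pos, vec in enumerate(steering[:len(hs)]):
                hs[pos] = [x + v for x, v in zip(hs[pos], vec)]
        cache.append([list(h) for h in hs])
        hs = layer_step(hs, w)
    return hs, cache


def steering_vector(prompt_plus, prompt_minus, layer, coeff):
    """Scaled activation difference between two equal-length prompts at `layer`."""
    assert len(prompt_plus) == len(prompt_minus), "sketch assumes equal lengths"
    _, cache_plus = forward(prompt_plus)
    _, cache_minus = forward(prompt_minus)
    return [[coeff * (p - m) for p, m in zip(hp, hm)]
            for hp, hm in zip(cache_plus[layer], cache_minus[layer])]


LAYER, COEFF = 1, 4.0  # illustrative values, not tuned
vec = steering_vector(["love"], ["hate"], LAYER, COEFF)
prompt = ["i", "think", "dogs", "are"]
plain, plain_cache = forward(prompt)
steered, steered_cache = forward(prompt, inject_layer=LAYER, steering=vec)
```

No weights are updated and no optimization is run: in this sketch the only extra cost is two forward passes to build `vec`, plus one vector addition per steered forward pass.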