Reliably controlling the behavior of large language models is a pressing open problem. Existing methods include supervised finetuning, reinforcement learning from human feedback (RLHF), prompt engineering, and guided decoding. We instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. We bias the forward pass with a 'steering vector' implicitly specified through natural language. Past work learned these steering vectors; our Activation Addition (ActAdd) method instead computes them by taking the activation differences that result from pairs of prompts. We demonstrate ActAdd on a range of LLMs (LLaMA-3, OPT, GPT-2, and GPT-J), obtaining SOTA on detoxification and negative-to-positive sentiment control. Our approach yields inference-time control over high-level properties of output, such as topic and sentiment, while preserving performance on off-target tasks. ActAdd takes far less compute and implementation effort than finetuning or RLHF, lets users exert control through natural language, and its computational overhead (as a fraction of inference time) appears stable or improving with increasing model size.
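The core idea stated above (a steering vector obtained as the activation difference of a prompt pair, then added to the forward pass) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy model, the integer token ids, the helper names `forward` and `steering_vector`, the injection `layer`, the scaling coefficient `coeff`, and the choice to add the vector at the leading positions are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_LAYERS = 50, 16, 4

# Toy stand-in for an LLM (an assumption, not a real architecture):
# an embedding table plus residual blocks with crude causal mixing.
EMBED = rng.normal(size=(VOCAB, D_MODEL))
WEIGHTS = [rng.normal(scale=0.3, size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]


def causal_mean(h):
    """Average each position with all earlier positions (a crude attention stand-in)."""
    counts = np.arange(1, len(h) + 1)[:, None]
    return np.cumsum(h, axis=0) / counts


def forward(tokens, steer=None, layer=None):
    """Run the toy model, optionally adding `steer` to the input of `layer`.

    Returns the final activations and the residual-stream input to each layer.
    """
    h = EMBED[np.asarray(tokens)]  # shape (seq_len, D_MODEL)
    cache = []
    for i, w in enumerate(WEIGHTS):
        if steer is not None and i == layer:
            n = min(len(steer), len(h))
            h = h.copy()
            h[:n] += steer[:n]  # bias the forward pass at the leading positions
        cache.append(h)
        h = h + np.tanh(causal_mean(h) @ w)  # residual update
    return h, cache


def steering_vector(pos_tokens, neg_tokens, layer, coeff=1.0):
    """Scaled activation difference of a prompt pair at `layer`.

    Assumes both prompts have the same number of tokens.
    """
    assert len(pos_tokens) == len(neg_tokens)
    _, pos_cache = forward(pos_tokens)
    _, neg_cache = forward(neg_tokens)
    return coeff * (pos_cache[layer] - neg_cache[layer])


# Hypothetical token ids standing in for a contrasting prompt pair and a user prompt.
vec = steering_vector([7, 1], [8, 1], layer=2, coeff=4.0)
prompt = [3, 14, 15, 9, 26, 5]
plain, _ = forward(prompt)
steered, _ = forward(prompt, steer=vec, layer=2)
print("change at final position:", np.linalg.norm(steered[-1] - plain[-1]))
```

No parameters are trained here: the vector comes from two extra forward passes, and steering is a single addition during inference. With a real model, the same pattern would presumably be implemented by reading and modifying intermediate activations through whatever hook mechanism the framework provides.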