Recent work in activation steering has demonstrated the potential to better control the outputs of Large Language Models (LLMs), but it requires finding steering vectors. This is difficult because engineers do not typically know how features are represented in these models. We address this issue by applying mean-centring to steering vectors. We find that taking the average of the activations associated with a target dataset, and then subtracting the mean of all training activations, yields effective steering vectors. We test this method on natural language tasks across a variety of models, steering away from generating toxic text and steering the completion of a story towards a target genre. We also apply mean-centring to extract function vectors, which trigger the execution of a range of natural language tasks more effectively than previous baselines, by a significant margin. This suggests that mean-centring can be used to easily improve the effectiveness of activation steering in a wide range of contexts.
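The mean-centring step described in the abstract can be illustrated with a minimal sketch on toy data. The function names (`mean_vector`, `mean_centred_steering_vector`, `steer`), the 3-dimensional activations, and the additive injection with a scaling coefficient are assumptions made for illustration; the abstract states only that the steering vector is the mean of the target activations minus the mean of all training activations.

```python
from statistics import fmean


def mean_vector(activations):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    if not activations:
        raise ValueError("need at least one activation vector")
    return [fmean(column) for column in zip(*activations)]


def mean_centred_steering_vector(target_activations, training_activations):
    """Mean of the target activations minus the mean of all training activations."""
    target_mean = mean_vector(target_activations)
    training_mean = mean_vector(training_activations)
    return [t - b for t, b in zip(target_mean, training_mean)]


def steer(activation, steering_vector, coefficient=1.0):
    """Hypothetical injection step: add the scaled steering vector to an activation.

    How and where the vector is applied is an assumption, not taken from the abstract.
    """
    return [a + coefficient * s for a, s in zip(activation, steering_vector)]


# Toy activations standing in for activations collected from a model layer.
training = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0], [2.0, 6.0, 2.0]]
target = [[2.0, 6.0, 2.0], [4.0, 6.0, 2.0]]

vector = mean_centred_steering_vector(target, training)
steered = steer([0.0, 0.0, 0.0], vector, coefficient=2.0)
```

In this toy case the target mean is `[3.0, 6.0, 2.0]` and the training mean is `[2.0, 3.0, 2.0]`, so the third component, which the target data shares with the training data on average, cancels out of the steering vector.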