Large language models (LLMs) often exhibit undesirable behaviours, such as generating untruthful or biased content. Editing their internal representations has been shown to mitigate such behaviours effectively, complementing existing alignment methods. We propose a novel inference-time editing method, spectral editing of activations (SEA), which projects the input representations into directions with maximal covariance with positive demonstrations (e.g., truthful) while minimising covariance with negative demonstrations (e.g., hallucinated). We also extend the method to non-linear editing using feature functions. We run extensive experiments on benchmarks concerning truthfulness and bias with six open-source LLMs of different sizes and model families. The results demonstrate the superiority of SEA in effectiveness, generalisation to similar tasks, and inference and data efficiency. We also show that editing with SEA has only a limited negative impact on other model capabilities.
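To make the linear variant concrete, the sketch below shows one way such spectral projections could be computed with NumPy. The function name, the top-k truncation rule, and the composition of the two projections are illustrative assumptions on my part; the text above specifies only that the editing directions maximise covariance with positive demonstrations and minimise it with negative ones.

```python
import numpy as np

def sea_projections(H_neutral, H_pos, H_neg, k_pos=32, k_neg=32):
    """Minimal sketch of linear SEA-style editing projections.

    Assumes paired (n, d) activation matrices collected at one layer from
    neutral prompts and from positive (e.g., truthful) and negative
    (e.g., hallucinated) demonstrations. The top-k cutoff is a hypothetical
    truncation rule, not necessarily the paper's exact criterion.
    """
    # Cross-covariance of neutral activations with each demonstration set.
    U_pos, _, _ = np.linalg.svd(H_neutral.T @ H_pos)  # (d, d) left singular vectors
    U_neg, _, _ = np.linalg.svd(H_neutral.T @ H_neg)

    d = H_neutral.shape[1]
    # Keep the span of the top positive-covariance directions ...
    P_pos = U_pos[:, :k_pos] @ U_pos[:, :k_pos].T
    # ... and project away the span of the top negative-covariance directions.
    P_neg = np.eye(d) - U_neg[:, :k_neg] @ U_neg[:, :k_neg].T
    return P_pos, P_neg

# At inference time, a hidden state h would then be edited by composing
# the two projections, e.g. h_edit = P_neg @ (P_pos @ h).
```

Since both returned matrices are fixed once the demonstrations are processed, the edit at inference time reduces to two matrix-vector products per edited layer, which is consistent with the inference efficiency claimed above.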