Activation Editing, which directly edits the internal representations of large language models (LLMs) to alter their behaviors and achieve desired properties, has emerged as a promising area of research. Existing works primarily treat LLMs' activations as points in space and modify them by adding steering vectors. However, this approach struggles to deliver larger performance gains while keeping activation magnitudes consistent. To overcome these issues, we propose a novel editing method that views activations in terms of their directions and magnitudes. Our method, named Householder Pseudo-Rotation (HPR), mimics a rotation transformation, thus preserving activation norms and yielding improved performance on various safety benchmarks.
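The norm-preserving idea can be sketched with a standard Householder reflection: reflecting an activation across the hyperplane bisecting its direction and a target direction changes the direction while leaving the norm untouched. This is only an illustrative sketch under assumed names (`householder_rotate`, a hand-picked target direction `d`); the actual HPR method's construction of edit directions differs.

```python
import numpy as np

def householder_rotate(a, d):
    """Reflect activation `a` so its direction matches that of `d`,
    keeping ||a|| unchanged (illustrative sketch, not the full HPR method)."""
    a_hat = a / np.linalg.norm(a)
    d_hat = d / np.linalg.norm(d)
    v = a_hat - d_hat
    v_norm = np.linalg.norm(v)
    if v_norm < 1e-12:            # already aligned with the target
        return a.copy()
    v /= v_norm
    # Householder reflection H = I - 2 v v^T, applied to a without
    # materializing the matrix
    return a - 2.0 * v * (v @ a)

a = np.array([3.0, 4.0, 0.0])     # activation with norm 5
d = np.array([0.0, 0.0, 1.0])     # hypothetical desired direction
a_new = householder_rotate(a, d)  # direction of d, norm still 5
```

Because a reflection is an orthogonal transformation, `np.linalg.norm(a_new)` equals `np.linalg.norm(a)` exactly, which is the consistency property that additive steering vectors cannot guarantee.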