The sheer scale of data required to train modern large language models (LLMs) poses significant risks, as models are likely to gain knowledge of sensitive topics such as bio-security, as well as the ability to replicate copyrighted works. Methods designed to remove such knowledge must do so from all prompt directions, in a multilingual capacity, and without degrading general model performance. To this end, we introduce the targeted angular reversal (TARS) method of knowledge removal from LLMs. The TARS method first leverages the LLM in combination with a detailed prompt to aggregate information about a selected concept in the internal representation space of the LLM. It then refines this approximate concept vector to trigger the concept token with high probability, by perturbing the approximate concept vector with noise and transforming it into token scores with the language model head. The feedforward weight vectors in the LLM that operate directly on the internal representation space, and that have the highest cosine similarity with this targeting vector, are then replaced by a reversed targeting vector, thus limiting the ability of the concept to propagate through the model. The modularity of the TARS method allows for the sequential removal of concepts from Llama 3.1 8B, such as the famous literary detective Sherlock Holmes and the planet Saturn. We demonstrate that the probability of triggering target concepts can be reduced to 0.00 with as few as one TARS edit, whilst simultaneously removing the knowledge bi-directionally. Moreover, knowledge is shown to be removed across all languages despite only being targeted in English. Importantly, TARS has minimal impact on general model capabilities: after removing 5 diverse concepts in a modular fashion, the KL divergence in the next-token probabilities of the LLM on large corpora of Wikipedia text remains minimal (median of 0.002).
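The core weight-editing step described above can be sketched as follows. This is a minimal, hypothetical simplification using numpy, not the paper's implementation: it assumes the feedforward down-projection matrix `W_down` (whose columns write directly into the residual stream) and a unit-scale `target_vec` in the model's representation space; the function name and `top_k` parameter are illustrative.

```python
import numpy as np

def tars_edit(W_down, target_vec, top_k=1):
    """Sketch of a TARS-style edit: find the feedforward weight vectors
    (columns of the down-projection, which act on the residual stream)
    most aligned with the targeting vector, and overwrite them with the
    reversed targeting vector.

    W_down: array of shape (d_model, d_ff)
    target_vec: array of shape (d_model,)
    """
    # Cosine similarity between each weight vector and the targeting vector.
    v = target_vec / np.linalg.norm(target_vec)
    cols = W_down / np.linalg.norm(W_down, axis=0, keepdims=True)
    sims = cols.T @ v

    # Indices of the top_k most-aligned weight vectors.
    idx = np.argsort(sims)[-top_k:]

    # Replace those weight vectors with the reversed targeting vector,
    # so activating them pushes the residual stream away from the concept.
    W_edited = W_down.copy()
    W_edited[:, idx] = -target_vec[:, None]
    return W_edited, idx
```

In a real model this selection would run across the down-projection matrices of all layers, and the targeting vector would come from the noise-refinement procedure the abstract describes rather than being supplied directly.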