Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving fine-tuning or auxiliary models usually require extensive memory and computational resources, rendering them less practical for deployment in large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by altering their internal representations in the activation space at lower resource and time cost. Specifically, we leverage self-induced steering pairs to identify detoxification vectors through arithmetic operations in the activation space. During inference, detoxification is achieved by blending the detoxification vectors with the original representations. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on popular detoxification metrics, while also maintaining satisfactory generation quality and diversity. Furthermore, we extend our method to multiple LLMs, demonstrating its practicality and scalability. We open-source our method at https://github.com/LizLizLi/DeStein. Warning: Some example model outputs contain highly offensive or disturbing text.
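As a rough illustration of the two steps described above (deriving a detoxification vector from steering pairs, then blending it into the representations at inference), the following toy sketch may help. It is not the DeStein implementation: the abstract does not specify the exact arithmetic, so the mean-difference formula, the additive blend, the strength parameter `alpha`, and the function names `detox_vector` and `blend` are all assumptions made for illustration, and the activations are small hand-written arrays rather than real model hidden states.

```python
import numpy as np


def detox_vector(nontoxic_acts, toxic_acts):
    """Average the per-pair activation differences (assumed arithmetic).

    Each row is the activation of one side of a steering pair; row i of
    `nontoxic_acts` is paired with row i of `toxic_acts`.
    """
    nontoxic_acts = np.asarray(nontoxic_acts, dtype=float)
    toxic_acts = np.asarray(toxic_acts, dtype=float)
    return (nontoxic_acts - toxic_acts).mean(axis=0)


def blend(hidden, vector, alpha=0.5):
    """Shift a hidden state along the vector (assumed additive blend).

    `alpha` is a hypothetical steering strength, not a value from the paper.
    """
    return np.asarray(hidden, dtype=float) + alpha * vector


# Toy 3-dimensional activations for two steering pairs.
nontoxic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
toxic = [[0.0, 1.0, 2.0], [2.0, 1.0, 2.0]]

v = detox_vector(nontoxic, toxic)
steered = blend([0.0, 0.0, 0.0], v, alpha=0.5)
```

In a real model the same shift would be applied to hidden states inside the forward pass, which is why this kind of approach needs no fine-tuning or auxiliary model.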