Despite the remarkable achievements of language models (LMs) across a broad spectrum of tasks, their propensity for generating toxic outputs remains a prevalent concern. Current solutions involving fine-tuning or auxiliary models usually require extensive memory and computational resources, making them less practical to deploy with large language models (LLMs). In this paper, we propose DeStein, a novel method that detoxifies LMs by altering their internal representations in the activation space, at lower resource and time cost. Specifically, we leverage self-induced steering pairs to identify detoxification vectors through arithmetic operations in the activation space. During inference, detoxification is achieved by blending the detoxification vectors with the original representations. Empirical results demonstrate that our method significantly outperforms previous state-of-the-art approaches on popular detoxification metrics, while also maintaining satisfactory generation quality and diversity. Furthermore, we extend our method to multiple LLMs, demonstrating its practicality and scalability. Warning: some example model outputs contain highly offensive or disturbing text.
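To make the activation-space arithmetic concrete, the following is a minimal sketch and not the DeStein implementation. It assumes that a detoxification vector can be approximated by the mean difference between the activations of the non-toxic and toxic halves of each steering pair, and that blending is a simple scaled addition. The function names `detox_vector` and `steer`, the strength parameter `alpha`, and the toy two-dimensional activations are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def detox_vector(
    toxic: Sequence[Sequence[float]],
    nontoxic: Sequence[Sequence[float]],
) -> Vector:
    """Average (non-toxic minus toxic) over paired activations.

    Assumption: each row of `toxic` is paired with the same row of `nontoxic`,
    and both come from the same layer of the same model.
    """
    if len(toxic) != len(nontoxic) or not toxic:
        raise ValueError("need the same non-zero number of toxic and non-toxic activations")
    dim = len(toxic[0])
    total = [0.0] * dim
    for t, n in zip(toxic, nontoxic):
        for i in range(dim):
            total[i] += n[i] - t[i]
    return [s / len(toxic) for s in total]


def steer(hidden: Sequence[float], vector: Sequence[float], alpha: float = 0.5) -> Vector:
    """Blend the steering vector into one hidden state (assumed additive form)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations standing in for real hidden states (hypothetical values).
toxic_acts = [[1.0, 0.0], [3.0, 2.0]]
nontoxic_acts = [[0.0, 1.0], [1.0, 4.0]]

v = detox_vector(toxic_acts, nontoxic_acts)
steered = steer([2.0, 2.0], v, alpha=0.5)
print(v, steered)
```

In this sketch `alpha` trades detoxification strength against fidelity to the original representation; setting it to zero leaves the hidden state unchanged.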