Large language models (LLMs) are now ubiquitous in user-facing applications, yet they still generate undesirable toxic outputs, including profanity, vulgarity, and derogatory remarks. Although numerous detoxification methods exist, most apply broad, surface-level fixes and can therefore easily be circumvented by jailbreak attacks. In this paper we leverage sparse autoencoders (SAEs) to identify toxicity-related directions in the model's residual stream and perform targeted activation steering using the corresponding decoder vectors. We introduce three tiers of steering aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing trade-offs between toxicity reduction and language fluency. At stronger steering strengths, these causal interventions surpass competitive baselines in reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2 Small depending on the steering tier. Crucially, standard NLP benchmark scores remain stable under steering, indicating that the model's knowledge and general abilities are preserved. We further show that feature splitting in wider SAEs hampers safety interventions, underscoring the importance of disentangled feature learning. Our findings highlight both the promise and the current limitations of SAE-based causal interventions for LLM detoxification, and suggest practical guidelines for safer language-model deployment.
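To make the intervention concrete, the following is a minimal sketch of decoder-vector activation steering on GPT-2 Small, assuming access to a trained SAE whose decoder matrix supplies a toxicity-related direction. The names `W_dec`, `TOXIC_IDX`, `STEERING_COEFF`, and `LAYER` are illustrative stand-ins, not the paper's actual artifacts or hyperparameters; a real run would load the SAE weights and the identified feature index rather than the random placeholder used here.

```python
# Sketch: suppress a toxicity-related SAE feature by adding a negatively scaled
# copy of its decoder vector to the residual stream during generation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

# Hypothetical SAE decoder: in practice this is loaded from a trained SAE, and
# TOXIC_IDX is the feature identified as toxicity-related.
d_model = model.config.n_embd
W_dec = torch.randn(16384, d_model)          # placeholder for the SAE decoder matrix
TOXIC_IDX = 123                              # placeholder feature index
toxic_dir = W_dec[TOXIC_IDX] / W_dec[TOXIC_IDX].norm()

STEERING_COEFF = -8.0                        # negative coefficient steers away from the feature
LAYER = 6                                    # residual-stream layer chosen for the intervention

def steer_hook(module, inputs, output):
    # GPT2Block returns a tuple whose first element is the hidden state.
    hidden = output[0] + STEERING_COEFF * toxic_dir.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
try:
    ids = tokenizer("You are such a", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()                          # restore the unsteered model
```

The steering coefficient plays the role of the "aggressiveness" tier: larger magnitudes push generations further from the toxic direction but, as the abstract notes, can cost fluency.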