Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence and fluency of the generated text. In this work, we present a targeted subspace-intervention strategy that identifies and suppresses hidden toxic patterns in the model's underlying representations while preserving its overall ability to generate safe, fluent content. On the RealToxicityPrompts benchmark, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our approach reduces the toxicity of state-of-the-art detoxification systems by 8-20% while maintaining comparable fluency. Through extensive quantitative and qualitative analyses, we show that our approach achieves effective toxicity reduction without impairing generative performance, consistently outperforming existing baselines.
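To make the idea of a subspace intervention on model representations concrete, the following is a minimal illustrative sketch, not the paper's implementation: it assumes a toxic subspace has already been identified (e.g., as an orthonormal basis estimated from contrastive toxic vs. non-toxic activations) and simply projects that subspace out of a layer's hidden states. The function names, the hook mechanism, and the choice of a single intervention layer are assumptions made for illustration.

```python
import torch

def remove_toxic_subspace(hidden_states: torch.Tensor,
                          toxic_basis: torch.Tensor) -> torch.Tensor:
    """Project hidden states onto the orthogonal complement of a toxic subspace.

    hidden_states: (batch, seq_len, d_model) activations from one layer.
    toxic_basis:   (k, d_model) orthonormal vectors spanning an assumed,
                   pre-identified toxic subspace.
    """
    # Coefficients of each hidden state along the toxic basis directions.
    coeffs = hidden_states @ toxic_basis.T            # (batch, seq_len, k)
    # Component of the hidden states lying inside the toxic subspace.
    toxic_component = coeffs @ toxic_basis            # (batch, seq_len, d_model)
    # Suppress that component; the rest of the representation is left intact.
    return hidden_states - toxic_component


def make_detox_hook(toxic_basis: torch.Tensor):
    """Build a forward hook that applies the projection to a layer's output."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):  # many transformer layers return tuples
            return (remove_toxic_subspace(output[0], toxic_basis),) + output[1:]
        return remove_toxic_subspace(output, toxic_basis)
    return hook

# Example usage (names are placeholders): attach the intervention to one layer
# of a Hugging Face-style causal LM, then generate as usual.
#   handle = model.transformer.h[layer_idx].register_forward_hook(
#       make_detox_hook(toxic_basis))
#   ...generate...
#   handle.remove()
```

Because the intervention is a single low-rank projection applied during the forward pass, its extra cost per token is negligible relative to the transformer layers themselves, which is consistent with the minimal-inference-overhead claim above.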