Training large language models (LLMs) has produced impressive results on natural language processing (NLP) tasks. However, these models occasionally produce toxic content such as insults, threats, and profanity in response to certain prompts, which constrains their practical utility. Various finetuning-based and decoding-based approaches have been used to mitigate toxicity, but they typically incur additional costs such as high-quality training data or auxiliary models. In this paper, we propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without these additional costs. Specifically, FGDILP contrasts, at the instance level, the contextualized representation in attention space obtained from a positive prefix-prepended prompt against those obtained from multiple negative prefix-prepended prompts. This contrast yields fine-grained subtoxicity vectors, which are fused to collaboratively correct the normal generation process when the model is given a raw prompt. We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels. Our method surpasses prompt-based baselines in detoxification, although at a slight cost to generation fluency and diversity.
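The contrast-and-fuse idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state how the subtoxicity vectors are fused, which layer's representation is used, or how strongly the correction is applied. The sketch assumes one difference vector per negative prefix, a weighted average as the fusion step, and a scalar strength `alpha`. All function and parameter names (`subtoxicity_vectors`, `fuse`, `detoxify`, `alpha`) are hypothetical, and the representations are plain arrays standing in for attention-space activations.

```python
import numpy as np

def subtoxicity_vectors(pos_repr, neg_reprs):
    """One vector per negative prefix: negative-prompt representation
    minus the positive-prompt representation (assumed contrast rule)."""
    return [neg - pos_repr for neg in neg_reprs]

def fuse(vectors, weights=None):
    """Fuse subtoxicity vectors with a weighted average.
    Uniform weights are an assumption; the abstract names no fusion rule."""
    stacked = np.stack(vectors)                      # shape (k, d)
    if weights is None:
        weights = np.full(len(stacked), 1.0 / len(stacked))
    return np.tensordot(weights, stacked, axes=1)    # shape (d,)

def detoxify(raw_repr, fused, alpha=1.0):
    """Shift the raw-prompt representation away from the fused
    toxicity direction; alpha is a hypothetical strength parameter."""
    return raw_repr - alpha * fused

# Toy 2-dimensional stand-ins for attention-space representations.
pos = np.array([0.0, 0.0])                           # positive-prefix prompt
negs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two negative-prefix prompts
raw = np.array([1.0, 1.0])                           # raw prompt

fused = fuse(subtoxicity_vectors(pos, negs))
corrected = detoxify(raw, fused, alpha=1.0)
```

In a real model these arrays would be per-layer, per-token activations computed for each prompt instance, which is why the vectors are rebuilt for every input rather than learned once.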