Training large language models (LLMs) has produced impressive results on natural language processing (NLP) tasks. However, these models occasionally produce toxic content such as insults, threats, and profanity in response to certain prompts, which constrains their practical utility. Various finetuning-based and decoding-based approaches have been used to mitigate toxicity, but they typically incur additional costs such as high-quality training data or auxiliary models. In this paper, we propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost. Specifically, FGDILP contrasts the contextualized representation in attention space obtained from a positive prefix-prepended prompt against those obtained from multiple negative prefix-prepended prompts at the instance level. This contrast yields fine-grained subtoxicity vectors, which are fused to correct the normal generation process when the model is given a raw prompt, enabling collaborative detoxification. We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels. Our method surpasses prompt-based baselines in detoxification, although at a slight cost to generation fluency and diversity.
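The contrast-and-fuse idea described above can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify the fusion rule or how the correction is applied, so the element-wise mean in `fuse`, the subtraction in `detoxify`, the scaling factor `alpha`, and all function names are assumptions made purely for illustration. Plain Python lists stand in for attention-space representations.

```python
from typing import List, Sequence

Vector = List[float]


def subtoxicity_vectors(
    positive: Sequence[float], negatives: Sequence[Sequence[float]]
) -> List[Vector]:
    """Return one difference vector per negative prefix (negative minus positive)."""
    return [[n - p for n, p in zip(neg, positive)] for neg in negatives]


def fuse(vectors: Sequence[Sequence[float]]) -> Vector:
    """Assumed fusion rule: element-wise mean of the subtoxicity vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def detoxify(raw: Sequence[float], fused: Sequence[float], alpha: float = 1.0) -> Vector:
    """Assumed correction: shift the raw representation away from the fused direction."""
    return [r - alpha * f for r, f in zip(raw, fused)]


# Toy 3-dimensional representations (hypothetical values).
positive = [0.0, 1.0, 0.0]                      # from the positive-prefix prompt
negatives = [[1.0, 1.0, 0.0], [0.0, 1.0, 2.0]]  # one per negative-prefix prompt
raw = [0.5, 1.0, 1.0]                           # from the raw prompt

vectors = subtoxicity_vectors(positive, negatives)
fused = fuse(vectors)
corrected = detoxify(raw, fused, alpha=1.0)
```

In this toy setup each negative prefix contributes its own direction, and the fused vector removes both components from the raw representation at once, which is the sense in which the detoxification is collaborative.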