Warning: this paper contains model outputs exhibiting offensiveness and biases. Recently, pre-trained language models (PLMs) have prospered in various natural language generation (NLG) tasks due to their ability to generate fairly fluent text. Nevertheless, these models have been observed to capture and reproduce harmful content from their training corpora, typically toxic language and social biases, raising severe moral issues. Prior work on ethical NLG tackles detoxifying and debiasing separately, which is problematic because we find that debiased models still exhibit toxicity, while detoxified ones even exacerbate social biases. To address this challenge, we propose UDDIA, the first unified framework for detoxifying and debiasing, which jointly formalizes these two problems as rectifying the output space. We theoretically interpret our framework as learning a text distribution that mixes weighted attributes. In addition, UDDIA adaptively optimizes only a few parameters during decoding, based on a parameter-efficient tuning schema and without any training data. This leads to minimal loss in generation quality and improved rectification performance at an acceptable computational cost. Experimental results demonstrate that, compared to several strong baselines, UDDIA achieves debiasing and detoxifying simultaneously and better balances efficiency and effectiveness, taking a further step towards practical ethical NLG.
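The idea of rectifying the output space with weighted attributes can be illustrated with a toy sketch. The abstract does not give UDDIA's formulation, so the code below is not the authors' method. It assumes a generic reweighting in which each next-token probability is multiplied by per-attribute acceptability scores raised to a weight, then renormalized. The function name `rectify`, the attribute names, and all numbers are hypothetical.

```python
import math


def rectify(base_probs, attribute_scores, weights):
    """Reweight a next-token distribution by weighted attribute scores.

    Assumed toy rule (not taken from the paper):
        p'(x) is proportional to p(x) * prod_a score_a(x) ** w_a

    base_probs: token -> probability under the language model (must be > 0).
    attribute_scores: attribute -> token -> acceptability score in (0, 1].
    weights: attribute -> non-negative weight; 0 disables that attribute.
    """
    log_scores = {}
    for token, p in base_probs.items():
        log_p = math.log(p)
        for attr, w in weights.items():
            log_p += w * math.log(attribute_scores[attr][token])
        log_scores[token] = log_p

    # Normalize with the log-sum-exp trick for numerical stability.
    m = max(log_scores.values())
    z = sum(math.exp(v - m) for v in log_scores.values())
    return {t: math.exp(v - m) / z for t, v in log_scores.items()}


# Hypothetical three-token vocabulary and made-up scores.
base = {"kind": 0.5, "rude": 0.3, "he": 0.2}
scores = {
    "non_toxic": {"kind": 0.9, "rude": 0.1, "he": 0.9},
    "unbiased": {"kind": 0.9, "rude": 0.9, "he": 0.5},
}
rectified = rectify(base, scores, {"non_toxic": 1.0, "unbiased": 1.0})
```

With both weights set to 1.0, the tokens penalized by either attribute lose probability mass relative to the base distribution. Setting every weight to 0 recovers the base distribution unchanged. Handling both attributes in one reweighting step is only meant to show why a joint formulation differs from applying detoxifying or debiasing alone.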