Pretrained large language models have become indispensable for solving a wide range of natural language processing (NLP) tasks. However, deploying them safely in real-world applications is challenging because they can generate toxic content. To address this challenge, we propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising utility: (1) MEDA, which adds the raw toxicity score as metadata to each pretraining sample, and (2) INST, which adds instructions to those samples indicating their toxicity. Our results show that the best-performing strategy (INST) reduces toxicity probability by up to 61% while preserving accuracy on five benchmark NLP tasks and improving AUC scores on four bias detection tasks by 1.3%. We also demonstrate the generalizability of our techniques by scaling both the number of training samples and the number of model parameters.
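To make the two strategies concrete, the following is a minimal sketch of how a pretraining sample might be augmented under each one. The abstract does not specify the exact formats, so the metadata tag (`toxicity: ...`), the instruction wording, the 0.5 threshold, the choice to prepend rather than append, and the names `Sample`, `augment_meda`, and `augment_inst` are all illustrative assumptions. The toxicity score is assumed to come from an external classifier and to lie in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """A pretraining sample paired with a toxicity score (hypothetical container)."""
    text: str
    toxicity: float  # assumed to be in [0, 1], produced by an external classifier


def augment_meda(sample: Sample) -> str:
    """MEDA-style augmentation: attach the raw toxicity score as metadata.

    The 'toxicity: <score>' prefix format is an assumption for illustration.
    """
    return f"toxicity: {sample.toxicity:.4f}\n{sample.text}"


def augment_inst(sample: Sample, threshold: float = 0.5) -> str:
    """INST-style augmentation: attach an instruction describing the toxicity.

    The instruction wording and the threshold are assumptions for illustration.
    """
    label = "toxic" if sample.toxicity >= threshold else "non-toxic"
    return f"This is a {label} post. Post: {sample.text}"


if __name__ == "__main__":
    example = Sample(text="Have a nice day.", toxicity=0.0123)
    print(augment_meda(example))
    print(augment_inst(example))
```

Under these assumptions, a model trained on such samples could be steered at inference time by conditioning on a low score (MEDA) or on the non-toxic instruction (INST). Whether the original method uses this exact mechanism is not stated in the abstract.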