Recent generative large language models (LLMs) show remarkable performance in non-English languages, but when prompted in those languages they tend to express higher levels of harmful social biases and toxicity. Prior work has shown that finetuning on specialized datasets can mitigate this behavior, and that doing so in English can transfer to other languages. In this work, we investigate the impact of different finetuning methods on the model's bias and toxicity, as well as on its ability to produce fluent and diverse text. Our results show that finetuning on curated non-harmful text is more effective for mitigating bias, while finetuning on direct preference optimization (DPO) datasets is more effective for mitigating toxicity. The mitigation achieved by applying these methods in English also transfers to non-English languages. We find evidence that the extent of this transfer can be predicted by the amount of data in a given language present in the model's pretraining corpus. However, this transfer of bias and toxicity mitigation often comes at the expense of decreased language generation ability in non-English languages, highlighting the importance of developing language-specific bias and toxicity mitigation methods.
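For context, a sketch of the DPO objective referenced above, in its standard formulation from Rafailov et al. (2023); the abstract does not specify the training details, so we assume the conventional setup. Given a preference dataset $\mathcal{D}$ of prompts $x$ with preferred and dispreferred completions $y_w$ and $y_l$ (e.g., a non-toxic versus a toxic response), DPO finetunes the model $\pi_\theta$ against a frozen reference model $\pi_{\text{ref}}$ by minimizing

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right],$$

where $\sigma$ is the logistic function and $\beta$ controls how far $\pi_\theta$ may drift from $\pi_{\text{ref}}$. This objective increases the relative likelihood of the preferred completions without training an explicit reward model.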