Recent generative large language models (LLMs) show remarkable performance in non-English languages, but when prompted in those languages they tend to express more harmful social biases and higher levels of toxicity. Prior work has shown that finetuning on specialized datasets can mitigate this behavior, and that doing so in English can transfer to other languages. In this work, we investigate the impact of different finetuning methods not only on the model's bias and toxicity, but also on its ability to produce fluent and diverse text. Our results show that finetuning on curated non-harmful text is more effective for mitigating bias, while finetuning on direct preference optimization (DPO) datasets is more effective for mitigating toxicity. The mitigation achieved by applying these methods in English also transfers to non-English languages. We find evidence that the extent of this transfer can be predicted by the amount of data in a given language present in the model's pretraining corpus. However, this transfer of bias and toxicity mitigation often comes at the expense of decreased language generation ability in non-English languages, highlighting the importance of developing language-specific bias and toxicity mitigation methods.