Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools such as causal intervention and activation analysis, we identify the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO.