Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning for detoxifying LLMs. Unlike previous studies showing limited cross-lingual generalization on other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generation. For example, after training, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools such as causal intervention and activation analysis, we identify a dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
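For reference, DPO optimizes the policy directly on preference pairs without a separate reward model. A standard statement of its loss (following the original DPO formulation; in the detoxification setting described above, $y_w$ would be a non-toxic continuation and $y_l$ a toxic one, with $\mathcal{D}$ the English-only preference data):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here $\pi_\theta$ is the model being tuned, $\pi_{\mathrm{ref}}$ is a frozen reference copy of the initial model, $\sigma$ is the logistic function, and $\beta$ is a hyperparameter controlling how far the policy may deviate from the reference.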