As large language models (LLMs) reach communities worldwide, their safety in multilingual settings becomes a central concern for alignment research. This paper examines how the safety challenges faced by LLMs vary across languages and discusses approaches to mitigating them. By comparing how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages, we observe that (1) LLMs tend to generate unsafe responses much more often when a malicious prompt is written in a lower-resource language, and (2) LLMs tend to generate irrelevant responses to malicious prompts more often in lower-resource languages. To identify the source of this discrepancy, we study the effect of instruction tuning with reinforcement learning from human feedback (RLHF) or supervised fine-tuning (SFT) on the HH-RLHF dataset. Surprisingly, while training on high-resource languages improves model alignment, training on lower-resource languages yields minimal improvement. This suggests that the bottleneck in cross-lingual alignment is rooted in the pretraining stage. Our findings highlight the challenges in cross-lingual LLM safety, and we hope they inform future research in this direction.
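The comparison described in the abstract, the share of unsafe and irrelevant responses to the same malicious prompts in higher- vs. lower-resource languages, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the label names, the `rates_by_group` helper, and the toy annotations are hypothetical and are not the paper's evaluation protocol or results.

```python
from collections import Counter, defaultdict

# Hypothetical labels an annotator might assign to a model response
# to a malicious prompt (assumed for this sketch, not from the paper).
HARMLESS, HARMFUL, IRRELEVANT = "harmless", "harmful", "irrelevant"


def rates_by_group(records):
    """Compute per-group harmful and irrelevant response rates.

    records: iterable of (resource_group, label) pairs.
    Returns {group: {"harmful_rate": float, "irrelevant_rate": float}}.
    """
    counts = defaultdict(Counter)
    for group, label in records:
        counts[group][label] += 1

    result = {}
    for group, label_counts in counts.items():
        total = sum(label_counts.values())
        result[group] = {
            "harmful_rate": label_counts[HARMFUL] / total,
            "irrelevant_rate": label_counts[IRRELEVANT] / total,
        }
    return result


# Made-up annotations purely for illustration; NOT results from the paper.
toy = [
    ("high", HARMLESS), ("high", HARMLESS), ("high", HARMFUL), ("high", HARMLESS),
    ("low", HARMFUL), ("low", IRRELEVANT), ("low", HARMFUL), ("low", HARMLESS),
]
rates = rates_by_group(toy)
```

Keeping the two rates separate mirrors the two observations in the abstract: a response can fail by being unsafe or by being irrelevant, and the two failure modes are reported independently.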