As safety remains a crucial concern throughout the development lifecycle of Large Language Models (LLMs), researchers and industrial practitioners have increasingly focused on safeguarding LLMs and aligning their behaviors with human preferences and ethical standards. LLMs, trained on extensive multilingual corpora, exhibit powerful generalization abilities across diverse languages and domains. However, current safety alignment practices predominantly focus on single-language scenarios, leaving their effectiveness in complex multilingual contexts, especially mixed-language formats, largely unexplored. In this study, we introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various state-of-the-art LLMs (e.g., GPT-4o, GPT-3.5, Llama3) under sophisticated, multilingual conditions. We further investigate language patterns, such as language availability, morphology, and language family, that could impact the effectiveness of Multilingual Blending in compromising the safeguards of LLMs. Our experimental results show that, without meticulously crafted prompt templates, Multilingual Blending significantly amplifies the harm of malicious queries, leading to dramatically increased bypass rates against LLM safety alignment (67.23% on GPT-3.5 and 40.34% on GPT-4o), far exceeding those of single-language baselines. Moreover, the performance of Multilingual Blending varies notably with intrinsic linguistic properties: languages with differing morphology and from diverse language families are more prone to evading safety alignment. These findings underscore the necessity of evaluating LLMs and developing corresponding safety alignment strategies in complex, multilingual contexts that match their strong cross-language generalization capabilities.
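The core idea of a mixed-language query can be illustrated with a minimal sketch. The following is a toy construction, not the paper's actual pipeline: each word of a query is rendered in a different language in round-robin fashion, using a small hand-written word-level dictionary (hypothetical toy data; a real system would use a translation model or lexicon).

```python
# Toy sketch of mixed-language query construction in the spirit of
# Multilingual Blending. TOY_DICT is hand-written illustration data,
# not part of the paper's method.

TOY_DICT = {
    "how": {"es": "cómo", "de": "wie", "fr": "comment"},
    "to": {"es": "a", "de": "zu", "fr": "à"},
    "make": {"es": "hacer", "de": "machen", "fr": "faire"},
    "a": {"es": "una", "de": "ein", "fr": "une"},
    "cake": {"es": "torta", "de": "Kuchen", "fr": "gâteau"},
}

def blend(query: str, languages=("es", "de", "fr")) -> str:
    """Rotate through `languages`, rendering one word per language.

    Words missing from the toy dictionary are left in English.
    """
    blended = []
    for i, word in enumerate(query.lower().split()):
        lang = languages[i % len(languages)]
        blended.append(TOY_DICT.get(word, {}).get(lang, word))
    return " ".join(blended)

print(blend("how to make a cake"))  # → cómo zu faire una Kuchen
```

The resulting query mixes several languages within a single sentence, which is the kind of input format the study uses to probe whether safety alignment trained predominantly on single-language data still holds.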