As safety remains a crucial concern throughout the development lifecycle of Large Language Models (LLMs), researchers and industrial practitioners have increasingly focused on safeguarding LLMs and aligning their behaviors with human preferences and ethical standards. LLMs, trained on extensive multilingual corpora, exhibit powerful generalization abilities across diverse languages and domains. However, current safety alignment practices predominantly focus on single-language scenarios, leaving their effectiveness in complex multilingual contexts, especially mixed-language formats, largely unexplored. In this study, we introduce Multilingual Blending, a mixed-language query-response scheme designed to evaluate the safety alignment of various state-of-the-art LLMs (e.g., GPT-4o, GPT-3.5, Llama3) under sophisticated, multilingual conditions. We further investigate language patterns, such as language availability, morphology, and language family, that could impact the effectiveness of Multilingual Blending in compromising the safeguards of LLMs. Our experimental results show that, without meticulously crafted prompt templates, Multilingual Blending significantly amplifies the harm of malicious queries, leading to dramatically increased bypass rates against LLM safety alignment (67.23% on GPT-3.5 and 40.34% on GPT-4o), far exceeding those of single-language baselines. Moreover, the performance of Multilingual Blending varies notably with intrinsic linguistic properties: languages with differing morphology and from diverse language families are more prone to evading safety alignment. These findings underscore the necessity of evaluating LLMs and developing corresponding safety alignment strategies in complex, multilingual contexts that match their strong cross-language generalization capabilities.
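The core idea of a mixed-language query can be illustrated with a minimal sketch. The following is a toy construction, not the paper's actual pipeline: each word of a query is rendered in a different language in round-robin fashion, using a small hand-written word-level dictionary (hypothetical toy data; a real system would use a translation model or lexicon).

```python
# Toy sketch of mixed-language query construction in the spirit of
# Multilingual Blending. TOY_DICT is hand-written illustration data,
# not part of the paper's method.

TOY_DICT = {
    "how": {"es": "cómo", "de": "wie", "fr": "comment"},
    "to": {"es": "a", "de": "zu", "fr": "à"},
    "make": {"es": "hacer", "de": "machen", "fr": "faire"},
    "a": {"es": "una", "de": "ein", "fr": "une"},
    "cake": {"es": "torta", "de": "Kuchen", "fr": "gâteau"},
}

def blend(query: str, languages=("es", "de", "fr")) -> str:
    """Rotate through `languages`, rendering one word per language.

    Words missing from the toy dictionary are left in English.
    """
    blended = []
    for i, word in enumerate(query.lower().split()):
        lang = languages[i % len(languages)]
        blended.append(TOY_DICT.get(word, {}).get(lang, word))
    return " ".join(blended)

print(blend("how to make a cake"))  # → cómo zu faire una Kuchen
```

The resulting query mixes several languages within a single sentence, which is the kind of input format the study uses to probe whether safety alignment trained predominantly on single-language data still holds.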