Large Language Models (LLMs) have become increasingly popular for their advanced text generation capabilities across various domains. However, like any software, they face security challenges, including 'jailbreak' attacks that manipulate LLMs into producing prohibited content. A particularly underexplored variant is the Multilingual Jailbreak attack, in which malicious questions are translated into various languages to evade safety filters. Comprehensive empirical studies of this specific threat are currently lacking. To fill this gap, we conducted an extensive empirical study of Multilingual Jailbreak attacks. We developed a novel semantic-preserving algorithm to create a multilingual jailbreak dataset and conducted an exhaustive evaluation of widely used open-source and commercial LLMs, including GPT-4 and LLaMa. Additionally, we performed interpretability analysis to uncover patterns in Multilingual Jailbreak attacks and implemented a fine-tuning mitigation method. Our findings reveal that this mitigation strategy significantly enhances model defense, reducing the attack success rate by 96.2%. This study provides valuable insights into understanding and mitigating Multilingual Jailbreak attacks.