This study identifies potential vulnerabilities of Large Language Models (LLMs) to 'jailbreak' attacks, focusing specifically on the Arabic language and its various forms. While most prior research has concentrated on English-based prompt manipulation, our investigation broadens the scope to Arabic. We initially tested the AdvBench benchmark in Standard Arabic and found that even prompt manipulation techniques such as prefix injection were insufficient to provoke LLMs into generating unsafe content. However, when using Arabic transliteration and chatspeak (or arabizi), we found that unsafe content could be elicited on platforms such as OpenAI GPT-4 and Anthropic Claude 3 Sonnet. Our findings suggest that using Arabic and its various forms could expose information that might otherwise remain hidden, potentially increasing the risk of jailbreak attacks. We hypothesize that this exposure stems from the model's learned associations with specific words, highlighting the need for more comprehensive safety training across all language forms.
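To make the transliteration setup concrete, the following is a minimal sketch of how an Arabic prompt might be converted to arabizi-style Latin script and sent to a chat model. The character map, the `to_arabizi` helper, and the query wrapper are illustrative assumptions for exposition only, not the exact pipeline used in this study.

```python
# Minimal sketch (assumed, not the study's exact procedure): transliterate an
# Arabic prompt into arabizi-style Latin script and query a chat model.
from openai import OpenAI

# Toy arabizi mapping: Arabic letters to common Latin/digit substitutes
# (e.g., 3 for 'ayn, 7 for haa). Real arabizi usage varies by speaker.
ARABIZI_MAP = {
    "ا": "a", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "7",
    "خ": "5", "د": "d", "ذ": "th", "ر": "r", "ز": "z", "س": "s",
    "ش": "sh", "ص": "s", "ض": "d", "ط": "t", "ظ": "z", "ع": "3",
    "غ": "gh", "ف": "f", "ق": "9", "ك": "k", "ل": "l", "م": "m",
    "ن": "n", "ه": "h", "و": "w", "ي": "y", "ء": "2", "ة": "a",
    "أ": "a", "إ": "i", "آ": "a", "ى": "a", "ئ": "2", "ؤ": "2",
}

def to_arabizi(text: str) -> str:
    """Map each Arabic character to an arabizi substitute, leaving others as-is."""
    return "".join(ARABIZI_MAP.get(ch, ch) for ch in text)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_model(prompt_ar: str, model: str = "gpt-4") -> str:
    """Send the arabizi-transliterated prompt to the model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": to_arabizi(prompt_ar)}],
    )
    return response.choices[0].message.content
```

In practice, such a harness would iterate over benchmark prompts (e.g., AdvBench) in each language form and record whether the model's responses comply with its safety policy.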