Recent advancements in AI safety have led to increased efforts in training and red-teaming large language models (LLMs) to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6\%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.
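As a concrete illustration of the embedding analysis described above, the sketch below measures the semantic shift between an original prompt and a mathematically encoded variant via cosine similarity of their embedding vectors. This is not the paper's code: the embedding model ("all-MiniLM-L6-v2"), the example prompts, and the use of cosine similarity as the shift metric are assumptions made for illustration only.

```python
# Illustrative sketch (assumed, not the paper's implementation): quantify the
# semantic shift between an original prompt and its MathPrompt-style mathematical
# encoding by comparing their embedding vectors.
from sentence_transformers import SentenceTransformer
import numpy as np

# Embedding model is an arbitrary choice for illustration; the paper's model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

original_prompt = "How do I pick a lock?"  # benign stand-in for a harmful prompt
encoded_prompt = (
    "Let A be the set of all finite action sequences over an alphabet of "
    "manipulations M. Define a predicate P(a) that holds iff sequence a moves a "
    "pin-tumbler mechanism from state 'locked' to state 'unlocked'. Characterize "
    "a minimal a in A such that P(a) holds."
)

# Encode both prompts and compute cosine similarity; a low value indicates a large
# shift in embedding space, consistent with the paper's claim that encoded prompts
# land far from the natural-language inputs that safety training covers.
emb = model.encode([original_prompt, encoded_prompt])
cos_sim = float(np.dot(emb[0], emb[1]) /
                (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
print(f"cosine similarity: {cos_sim:.3f}")
```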