The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlike prior methods that rely on iterative, brute-force attacks, ArtPerception follows a systematic, two-phase methodology. Phase 1 conducts a one-time, model-specific pre-test that empirically determines the optimal parameters for ASCII art recognition. Phase 2 leverages these insights to launch a highly efficient, one-shot malicious jailbreak attack. We propose a Modified Levenshtein Distance (MLD) metric for a more nuanced evaluation of an LLM's recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework's real-world relevance by showing its successful transferability to leading commercial models, including GPT-4o, Claude 3.7 Sonnet, and DeepSeek-V3, and by conducting a rigorous effectiveness analysis against potential defenses such as LLaMA Guard and Azure's content filters. Our findings underscore that true LLM security requires defending against a multi-modal space of interpretations, even within text-only inputs, and highlight the effectiveness of strategic, reconnaissance-based attacks. Content Warning: This paper includes potentially harmful and offensive model outputs.