ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test

The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlike prior methods that rely on iterative, brute-force attacks, ArtPerception introduces a systematic, two-phase methodology. Phase 1 conducts a one-time, model-specific pre-test to empirically determine the optimal parameters for ASCII art recognition. Phase 2 leverages these insights to launch a highly efficient, one-shot malicious jailbreak attack. We propose a Modified Levenshtein Distance (MLD) metric for a more nuanced evaluation of an LLM's recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework's real-world relevance by showing its successful transferability to leading commercial models, including GPT-4o, Claude Sonnet 3.7, and DeepSeek-V3, and by conducting a rigorous effectiveness analysis against potential defenses such as LLaMA Guard and Azure's content filters. Our findings underscore that true LLM security requires defending against a multi-modal space of interpretations, even within text-only inputs, and highlight the effectiveness of strategic, reconnaissance-based attacks. Content Warning: This paper includes potentially harmful and offensive model outputs.
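The abstract names a Modified Levenshtein Distance (MLD) for scoring how well a model recognizes ASCII art, but does not define it. As a point of reference only, the sketch below shows the classic, unmodified Levenshtein distance between an expected string and a model's transcription. The `recognition_score` helper and its length normalization are hypothetical illustrations and should not be read as the paper's metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: fewest insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def recognition_score(expected: str, predicted: str) -> float:
    """Hypothetical normalized score in [0, 1]; 1.0 means an exact match.

    This normalization is an assumption made for illustration; it is not
    the MLD defined in the paper.
    """
    longest = max(len(expected), len(predicted))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, predicted) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(recognition_score("HELLO", "HELLA"))
```

A plain exact-match check would score a transcription that is off by one character the same as one that is entirely wrong; an edit-distance score separates the two, which is presumably the kind of nuance the abstract refers to.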

