We introduce a novel family of adversarial attacks that exploit the inability of language models to interpret ASCII art. To evaluate these attacks, we propose the ToxASCII benchmark and develop two custom ASCII art fonts: one leveraging special tokens and another using text-filled letter shapes. Our attacks achieve a perfect 1.0 Attack Success Rate across ten models, including OpenAI's o1-preview and LLaMA 3.1. Warning: this paper contains examples of toxic language used for research purposes.