Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability, we introduce a class of Human-Perceptible Adversarial Attacks (HPAA), in which harmful expressions are embedded into otherwise benign text through visually salient typographic manipulations. Our key insight is that typographic features, including spacing, visual emphasis, and spatial arrangement, can be strategically combined to preserve human recognition of harmful content while substantially reducing machine detectability. Operating in black-box settings with only a small query budget, our attack automatically generates evasive content without requiring model access or gradient information. We evaluate the attack across multiple datasets and ten deployed moderation systems, including commercial APIs and state-of-the-art open-source guardrails. Results reveal a striking gap between human and machine perception: with only three detector queries, generated attacks achieve over 86\% human recognition while maintaining detection rates below 1\% across the evaluated systems. We further conduct ablation studies to identify the typographic factors driving successful evasion, analyze why current moderation architectures fail to capture these signals, and discuss practical defenses. Our findings expose a fundamental blind spot in today's LLM-based moderation ecosystem and highlight the need for moderation systems that reason about content in a manner more consistent with human perceptual understanding.