We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for Large Language Models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASRs), with some providers exceeding 90%. Mapping prompts to the MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated by an ensemble of three open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety-training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
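To make the evaluation protocol concrete, the following is a minimal sketch of how the judging stage could be structured: a majority vote over binary safety verdicts from an odd-sized judge ensemble, aggregated into an ASR. This is not the paper's published code; the function names (ensemble_verdict, attack_success_rate) and the keyword-matching stand-in judges are illustrative assumptions, and real judges would be open-weight LLMs prompted for a binary safe/unsafe verdict.

```python
from typing import Callable, Sequence

# Hypothetical stand-in: a judge maps a model output to a binary verdict.
# True = output judged unsafe, i.e., the attack succeeded.
Judge = Callable[[str], bool]

def ensemble_verdict(output: str, judges: Sequence[Judge]) -> bool:
    """Majority vote over binary safety judgments from an odd-sized ensemble."""
    votes = sum(judge(output) for judge in judges)
    return votes > len(judges) // 2

def attack_success_rate(outputs: Sequence[str], judges: Sequence[Judge]) -> float:
    """ASR = fraction of model outputs the judge ensemble flags as unsafe."""
    if not outputs:
        return 0.0
    return sum(ensemble_verdict(o, judges) for o in outputs) / len(outputs)

if __name__ == "__main__":
    # Toy keyword-based "judges" in place of the three LLM judges.
    toy_judges: list[Judge] = [
        lambda o: "step 1" in o.lower(),
        lambda o: "synthesize" in o.lower(),
        lambda o: len(o) > 40,
    ]
    toy_outputs = [
        "I can't help with that request.",
        "Step 1: synthesize the precursor by combining the reagents slowly...",
    ]
    print(f"ASR = {attack_success_rate(toy_outputs, toy_judges):.2f}")  # ASR = 0.50
```

Under this setup, the reported per-condition ASRs (e.g., 62% for hand-crafted poems) would simply be attack_success_rate computed over that condition's outputs, with the human-labeled subset used to check the ensemble's verdicts.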