Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, and likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.