While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality-diversity problem and use MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search toward areas where the model's behavior diverges most from its intended alignment. Across three LLMs (Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini), we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals markedly different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness, with Alignment Deviation reaching a ceiling of 0.50. Our approach produces interpretable, global maps of each model's safety landscape that existing attack methods such as GCG, PAIR, and TAP do not provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
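To make the quality-diversity framing concrete, the following is a minimal sketch of a generic MAP-Elites loop. It is not the authors' implementation. The two-dimensional behavior descriptor, the 10x10 grid, the Gaussian mutation, and the function `toy_alignment_deviation` are all hypothetical stand-ins chosen for illustration; in the setting the abstract describes, the evaluation step would presumably query an LLM and score its response, which is not modeled here.

```python
import math
import random

GRID_SIZE = 10  # cells per behavior dimension (illustrative assumption)


def toy_alignment_deviation(solution):
    """Hypothetical stand-in for a quality metric in [0, 1].

    Returns (quality, descriptor). Here the descriptor is simply the
    2D solution itself, and quality peaks at the point (0.7, 0.3).
    """
    dist = math.hypot(solution[0] - 0.7, solution[1] - 0.3)
    quality = max(0.0, 1.0 - 2.0 * dist)
    return quality, solution


def random_solution(rng):
    # Sample a point uniformly from the unit square.
    return (rng.random(), rng.random())


def mutate(parent, rng):
    # Gaussian perturbation, clamped back into [0, 1] per coordinate.
    return tuple(min(1.0, max(0.0, x + rng.gauss(0.0, 0.1))) for x in parent)


def to_cell(descriptor, grid_size):
    # Map a descriptor in [0, 1]^2 to a discrete grid cell.
    return tuple(min(int(d * grid_size), grid_size - 1) for d in descriptor)


def map_elites(evaluate, iterations=2000, n_init=50, grid_size=GRID_SIZE, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (quality, solution): one elite per niche
    for i in range(iterations):
        if i < n_init or not archive:
            candidate = random_solution(rng)
        else:
            # Pick a random existing elite and perturb it.
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)
        quality, descriptor = evaluate(candidate)
        cell = to_cell(descriptor, grid_size)
        # Keep the candidate only if its niche is empty or it beats the elite.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, candidate)
    return archive


archive = map_elites(toy_alignment_deviation)
coverage = len(archive) / GRID_SIZE**2
mean_quality = sum(q for q, _ in archive.values()) / len(archive)
print(f"coverage={coverage:.2f}, mean quality={mean_quality:.2f}")
```

In this sketch, coverage is the fraction of grid cells that hold an elite, and the mean quality over filled cells plays the role of the per-model mean reported in the abstract. Because the archive keeps the best solution per cell rather than a single global optimum, the result is a map over the whole behavior space instead of one isolated failure.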