Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight coding model (Devstral-small-2), we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi-turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model-specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red-queen.
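To make the archive idea concrete, the following is a minimal sketch of a MAP-Elites loop over the three behavioral dimensions named above (strategy type, encoding method, prompt length). It is an illustration under stated assumptions, not the released implementation: the `Attack` record, the label sets, the length-bin thresholds, the mutation operator, and `toy_fitness` are all hypothetical stand-ins. In particular, `toy_fitness` is an arbitrary placeholder and does not query any model or reproduce the paper's scoring.

```python
import random
from dataclasses import dataclass

# Hypothetical label sets for two behavioral dimensions (assumed for this sketch).
STRATEGIES = ("direct", "hypothetical", "multi_turn")
ENCODINGS = ("none", "rot13", "leetspeak")


@dataclass(frozen=True)
class Attack:
    """Semantic-level description of an attack (hypothetical structure)."""
    strategy: str
    encoding: str
    length: int  # prompt length in arbitrary units


def descriptor(attack):
    """Map an attack to its archive cell: (strategy, encoding, length bin)."""
    # Bin thresholds are assumptions chosen only for illustration.
    if attack.length < 50:
        length_bin = "short"
    elif attack.length < 200:
        length_bin = "medium"
    else:
        length_bin = "long"
    return (attack.strategy, attack.encoding, length_bin)


def mutate(attack, rng):
    """Change one field of the attack at random."""
    field = rng.choice(("strategy", "encoding", "length"))
    if field == "strategy":
        return Attack(rng.choice(STRATEGIES), attack.encoding, attack.length)
    if field == "encoding":
        return Attack(attack.strategy, rng.choice(ENCODINGS), attack.length)
    new_length = max(1, attack.length + rng.randint(-40, 40))
    return Attack(attack.strategy, attack.encoding, new_length)


def map_elites(fitness, iterations=500, n_init=20, seed=0):
    """Keep the best-scoring attack found so far in each descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (score, Attack)
    for i in range(iterations):
        if i < n_init or not archive:
            # Bootstrap phase: sample random attacks.
            candidate = Attack(
                rng.choice(STRATEGIES), rng.choice(ENCODINGS), rng.randint(1, 400)
            )
        else:
            # Pick a random elite from the archive and mutate it.
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)
        score = fitness(candidate)
        cell = descriptor(candidate)
        # Replace the cell's occupant only if the candidate scores higher.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)
    return archive


def toy_fitness(attack):
    """Placeholder score; a real system would score a target model's response."""
    score = 0.2
    if attack.encoding == "rot13":
        score += 0.3
    if attack.strategy != "direct":
        score += 0.2
    return score


if __name__ == "__main__":
    result = map_elites(toy_fitness)
    for cell, (score, attack) in sorted(result.items()):
        print(cell, round(score, 2), attack.length)
```

Because each cell keeps its own elite, the archive retains weaker but behaviorally distinct attacks instead of converging on a single high-scoring pattern, which is the property the abstract contrasts with mode collapse.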