Evolutionary prompt search is a practical black-box approach for red teaming large language models (LLMs), but existing methods often collapse onto a small family of high-performing prompts, limiting coverage of distinct failure modes. We present ToxSearch-S, a speciated quality-diversity (QD) extension of ToxSearch that maintains multiple high-toxicity prompt niches in parallel rather than optimizing a single best prompt. ToxSearch-S performs unsupervised prompt speciation using three components: capacity-limited species with exemplar leaders, a reserve pool for outliers and emerging niches, and species-aware parent selection that trades off within-niche exploitation against cross-niche exploration. Compared with the baseline, ToxSearch-S reaches higher peak toxicity ($\approx 0.73$ vs.\ $\approx 0.47$) and a heavier extreme tail (top-10 median $0.66$ vs.\ $0.45$), while maintaining comparable performance on moderately toxic prompts. Speciation also yields broader semantic coverage under a topic-as-species analysis (higher effective topic diversity $N_1$ and larger unique topic coverage $K$). Finally, the resulting species are well separated in embedding space (mean separation ratio $\approx 1.93$) and exhibit distinct toxicity distributions, indicating that speciation partitions the adversarial space into behaviorally differentiated niches rather than superficial lexical variants. These results suggest that the approach uncovers a wider range of attack strategies.
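To make the speciation idea concrete, the following is a minimal sketch of leader-based assignment with a capacity limit and a reserve pool. It is not the ToxSearch-S implementation: the cosine-distance criterion, the `threshold` and `capacity` values, the rule that the top-scoring member becomes the leader, and the names `new_species` and `assign` are all assumptions made for illustration. Species-aware parent selection is not covered here.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (1.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def new_species(embedding, score):
    """Create a species seeded with a single prompt, which is also its leader."""
    item = {"embedding": embedding, "score": score}
    return {"leader": item, "members": [item]}


def assign(species, reserve, embedding, score, threshold=0.3, capacity=3):
    """Place one scored prompt into the nearest species, or into the reserve pool.

    threshold and capacity are hypothetical values, not taken from the paper.
    """
    item = {"embedding": embedding, "score": score}
    best, best_d = None, threshold
    for sp in species:
        d = cosine_distance(embedding, sp["leader"]["embedding"])
        if d <= best_d:
            best, best_d = sp, d
    if best is None:
        # No leader is close enough: treat as an outlier or emerging niche.
        reserve.append(item)
        return
    best["members"].append(item)
    # Capacity limit: keep only the highest-scoring members.
    best["members"].sort(key=lambda m: m["score"], reverse=True)
    del best["members"][capacity:]
    # Exemplar leader (assumed rule): the top-scoring member.
    best["leader"] = best["members"][0]
```

In this sketch a prompt far from every leader waits in the reserve pool instead of being forced into an existing niche, which is one simple way to keep room for new niches to emerge.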
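The topic-coverage metrics can be illustrated with a short sketch. It assumes that $N_1$ denotes the Hill number of order 1 (the exponential of the Shannon entropy of the topic distribution) and that $K$ is the number of distinct topics observed. The abstract does not define either quantity, so both readings, and the function name `topic_diversity`, are assumptions.

```python
import math
from collections import Counter


def topic_diversity(topic_labels):
    """Return (N_1, K) for a list of per-prompt topic labels.

    N_1 is taken to be exp(Shannon entropy) of the topic distribution and
    K the number of distinct topics; both definitions are assumed here.
    """
    counts = Counter(topic_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0
    # Shannon entropy of the empirical topic distribution (natural log).
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy), len(counts)
```

Under these definitions $N_1 \le K$, with equality only when prompts are spread evenly across topics. A population that has collapsed onto one dominant topic therefore has $N_1$ close to 1 even if $K$ is larger.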