Evolutionary prompt search is a practical black-box approach for red teaming large language models; however, existing methods often collapse onto a small family of high-performing prompts, limiting coverage of distinct failure modes. We present \textit{ToxSearch-S}, a speciated quality-diversity extension of \textit{ToxSearch} that maintains multiple high-toxicity prompt niches in parallel rather than optimizing a single best prompt. \textit{ToxSearch-S} introduces unsupervised prompt speciation through a search procedure that maintains capacity-limited species with exemplar leaders, a reserve pool for emerging niches, and species-aware parent selection that trades off within-niche exploitation against cross-niche exploration. Preliminary results show \textit{ToxSearch-S} reaching higher peak toxicity ($\approx 0.73$ vs.\ $\approx 0.47$) with a heavier tail (top-10 median $0.66$ vs.\ $0.45$) than the baseline. Speciation also yields broader semantic coverage under a topics-as-species analysis (higher effective topic diversity and larger unique topic coverage). Finally, the resulting species are well separated in embedding space (mean separation ratio $\approx 1.93$) and exhibit distinct toxicity distributions, indicating that speciation partitions the adversarial space into behaviorally differentiated niches rather than superficial lexical variants.
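
To make the three ingredients named above concrete (capacity-limited species with leaders, a reserve pool, and species-aware parent selection), the following is a minimal illustrative sketch, not the \textit{ToxSearch-S} implementation. Everything in it is an assumption chosen for illustration: the names `Prompt`, `Species`, `assign`, and `select_parents`, the cosine-distance threshold `radius`, the rule that the leader is the highest-toxicity member, and the `explore_prob` parameter that controls how often parents come from two different species.

```python
from __future__ import annotations

import math
import random
from dataclasses import dataclass, field


@dataclass
class Prompt:
    text: str
    embedding: tuple[float, ...]  # assumed non-zero vector
    toxicity: float               # score in [0, 1] from some external evaluator


@dataclass
class Species:
    leader: Prompt
    members: list[Prompt] = field(default_factory=list)  # includes the leader


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def assign(prompt, species, reserve, radius=0.2, capacity=3):
    """Add `prompt` to the nearest species whose leader lies within `radius`;
    otherwise park it in `reserve`. Returns the species joined, or None."""
    nearest = min(
        species,
        key=lambda s: cosine_distance(prompt.embedding, s.leader.embedding),
        default=None,
    )
    if nearest is None or cosine_distance(
        prompt.embedding, nearest.leader.embedding
    ) > radius:
        reserve.append(prompt)  # candidate seed for an emerging niche
        return None
    nearest.members.append(prompt)
    # Capacity limit (assumed policy): keep only the most toxic members.
    nearest.members.sort(key=lambda p: p.toxicity, reverse=True)
    del nearest.members[capacity:]
    nearest.leader = nearest.members[0]  # assumed: leader = best member
    return nearest


def select_parents(species, rng, explore_prob=0.3):
    """With probability `explore_prob`, draw parents from two different
    species (exploration); otherwise draw both from one (exploitation)."""
    if len(species) >= 2 and rng.random() < explore_prob:
        first, second = rng.sample(species, 2)
        return rng.choice(first.members), rng.choice(second.members)
    chosen = rng.choice(species)
    return rng.choice(chosen.members), rng.choice(chosen.members)
```

In this sketch, raising `explore_prob` shifts the search toward cross-niche recombination, while a smaller `capacity` forces stronger within-species competition. How the reserve pool is later promoted into new species is left out, since the abstract does not specify it.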