Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits a 68.7% +/- 5.7% attack success rate (ASR; mean +/- std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/- 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/- 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44-46% but to Gemma 4 at only 14-18%, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near-100% ASR across all generations, though a second-judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self-hosted judge; code and artifacts are available at https://github.com/bassrehab/red-queen.
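For readers unfamiliar with MAP-Elites, the following is a minimal, generic sketch of the quality-diversity loop, not the implementation used in this work. It keeps the best-scoring solution found in each cell of a discretized descriptor space. In a red-teaming setting one would presumably take solutions to be prompts, the fitness to be a judge score, and the descriptor to be some categorization of the attack; those mappings are assumptions here. The toy problem below (points in the unit square, a 5x5 grid) and all function names are hypothetical stand-ins.

```python
import random


def map_elites(fitness, descriptor, mutate, random_solution,
               n_init=20, n_iters=500, seed=0):
    """Generic MAP-Elites loop: keep the best solution found per descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness score, solution)

    def try_insert(solution):
        cell = descriptor(solution)
        score = fitness(solution)
        # A candidate only competes with the current elite of its own cell.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, solution)

    # Seed the archive with random solutions.
    for _ in range(n_init):
        try_insert(random_solution(rng))

    # Repeatedly pick a random elite, perturb it, and try to place the child.
    for _ in range(n_iters):
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


# Toy stand-in problem (hypothetical, for illustration only):
# solutions are points in [0, 1]^2, the descriptor is a 5x5 grid cell,
# and fitness peaks at the centre of the square.
def toy_random(rng):
    return (rng.random(), rng.random())


def toy_mutate(sol, rng):
    return tuple(min(1.0, max(0.0, x + rng.gauss(0.0, 0.1))) for x in sol)


def toy_descriptor(sol):
    return tuple(min(4, int(x * 5)) for x in sol)


def toy_fitness(sol):
    return -((sol[0] - 0.5) ** 2 + (sol[1] - 0.5) ** 2)


archive = map_elites(toy_fitness, toy_descriptor, toy_mutate, toy_random)
coverage = len(archive) / 25  # fraction of grid cells holding an elite
best = max(score for score, _ in archive.values())
print(f"coverage={coverage:.2f}, best fitness={best:.4f}")
```

The resulting archive is a set of diverse elites rather than a single optimum, which is what makes it possible to replay a whole archive against a different model, as in the cross-generation transfer comparison described above.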