Adversarial behavior plays a central role in aligning large language models with human values. However, existing alignment methods rely largely on static adversarial supervision, which fundamentally limits robustness, particularly for multimodal models, whose attack surface is larger. In this work, we move beyond static adversarial supervision and introduce co-evolutionary alignment with evolving attacks, instantiated by CEMMA (Co-Evolutionary Multi-Modal Alignment), an automated and adaptive framework for multimodal safety alignment. CEMMA pairs two components. The Evolutionary Attacker decomposes adversarial prompts into method templates and harmful intents; by applying genetic operators, including mutation, crossover, and differential evolution, it enables simple seed attacks to inherit the structural efficacy of sophisticated jailbreaks. The Adaptive Defender is iteratively updated on the synthesized hard negatives, forming a closed loop that adapts alignment to evolving attacks. Experiments show that the Evolutionary Attacker substantially increases the attack success rate (ASR) of red-teaming jailbreaks, while the Adaptive Defender improves robustness and generalization across benchmarks with higher data efficiency, does not induce excessive refusal of benign requests, and remains compatible with inference-time defenses such as AdaShield.
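The closed loop described above can be illustrated with a toy sketch. It is not the CEMMA implementation: it assumes, purely for illustration, that a method template can be stood in for by a small vector of abstract numeric features, that a harmful intent is an opaque placeholder ID the operators never modify, and that attack success can be proxied by distance from the templates the defender has already seen. All names (`Candidate`, `fitness`, `co_evolve`, `DIM`, the threshold) are hypothetical.

```python
import random
from dataclasses import dataclass

DIM = 4  # hypothetical number of abstract template features


@dataclass(frozen=True)
class Candidate:
    template: tuple  # toy stand-in for a method template, values in [0, 1]
    intent: str      # opaque placeholder ID; never altered by the operators


def clip(x):
    return max(0.0, min(1.0, x))


def mutate(c, rng, sigma=0.1):
    # Mutation: small Gaussian perturbation of each template feature.
    return Candidate(tuple(clip(g + rng.gauss(0, sigma)) for g in c.template), c.intent)


def crossover(a, b, rng):
    # Crossover: the child keeps a's intent but inherits part of b's template.
    return Candidate(tuple(rng.choice(p) for p in zip(a.template, b.template)), a.intent)


def differential(a, b, c, f=0.5):
    # Differential evolution step: a + f * (b - c), applied per feature.
    genes = (clip(x + f * (y - z)) for x, y, z in zip(a.template, b.template, c.template))
    return Candidate(tuple(genes), a.intent)


def fitness(c, patched):
    # Toy proxy for attack success: distance to the nearest template
    # the defender has already been updated on (1.0 if it has seen none).
    if not patched:
        return 1.0
    return min(sum((x - y) ** 2 for x, y in zip(c.template, p)) ** 0.5 for p in patched)


def evolve(pop, patched, rng):
    # One attacker generation: create children, keep the fittest len(pop).
    children = []
    for c in pop:
        a, b = rng.sample(pop, 2)
        children += [mutate(c, rng), crossover(c, a, rng), differential(c, a, b)]
    ranked = sorted(pop + children, key=lambda x: fitness(x, patched), reverse=True)
    return ranked[:len(pop)]


def co_evolve(seeds, rounds=3, threshold=0.2, seed=0):
    # Alternate attacker evolution with defender updates on hard negatives.
    rng = random.Random(seed)
    pop, patched = list(seeds), []
    for _ in range(rounds):
        pop = evolve(pop, patched, rng)
        hard = [c for c in pop if fitness(c, patched) > threshold]
        patched.extend(c.template for c in hard)  # stand-in for fine-tuning
    return pop, patched
```

In this sketch the "defender update" is just memorizing hard-negative templates; in the setting the abstract describes, that step would be a model update on the synthesized examples, and fitness would come from evaluating the target model rather than from a distance measure.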