EvoDefense: Co-Evolving Black-Box Defense with Large Language Models

Large Language Models (LLMs) remain highly vulnerable to diverse attacks, particularly in black-box settings where the internals of target models are inaccessible. Existing black-box defenses typically rely on pre-defined filtering heuristics, which often fail to generalize to unseen attack types and target model architectures. We introduce EvoDefense, an experience-guided co-evolving black-box defense paradigm. EvoDefense employs a guard LLM to detect malicious queries and an experience memory module to accumulate defense knowledge from previous interactions. At the core of EvoDefense is a continuous attack-defense evolution loop, where an attack generator and the guard model iteratively refine their attack strategies and defense policies through experience-guided optimization. This design enables EvoDefense to generalize across unseen attacks and target models without retraining. Experiments on HarmBench, AdvBench, and AlpacaEval show that EvoDefense achieves consistently strong defense performance across seven popular models and five representative LLM attacks, while preserving competitive general capabilities. On HarmBench, EvoDefense reduces the attack success rate (ASR) of AutoDAN-turbo on Gemini-3-flash and LLaMA-3-8B-Instruct from 29.4% and 43.4% to 8.4% and 6.2%, respectively.
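
The attack-defense evolution loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the guard LLM is replaced by a simple substring matcher, the attack generator by a random choice over prompt wrappers, and every name here (`ExperienceMemory`, `guard`, `attacker`, `co_evolve`, the wrapper strings) is a hypothetical stand-in chosen for illustration. It only shows the control flow in which a missed attack becomes stored defense experience and a blocked attack pushes the attacker to try a new strategy.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ExperienceMemory:
    """Hypothetical store of attack patterns learned from past interactions."""
    patterns: set = field(default_factory=set)

    def add(self, pattern: str) -> None:
        self.patterns.add(pattern)


def guard(query: str, memory: ExperienceMemory) -> bool:
    """Stand-in for the guard LLM: flag a query that matches a stored pattern."""
    return any(p in query for p in memory.patterns)


def attacker(request: str, wrappers: list, rng: random.Random) -> tuple:
    """Stand-in for the attack generator: wrap the request in a chosen strategy."""
    wrapper = rng.choice(wrappers)
    return wrapper, f"{wrapper} {request}"


def co_evolve(rounds: int, seed: int = 0) -> tuple:
    """Run a toy attack-defense loop; return the memory and per-round outcomes."""
    rng = random.Random(seed)
    memory = ExperienceMemory()
    wrappers = ["roleplay:", "hypothetically:", "translate:"]  # assumed strategies
    history = []  # True where the guard blocked the attack
    for _ in range(rounds):
        wrapper, query = attacker("PLACEHOLDER_REQUEST", wrappers, rng)
        blocked = guard(query, memory)
        history.append(blocked)
        if blocked:
            # Attack side adapts: add a new, unseen strategy to its pool.
            wrappers.append(f"variant{len(wrappers)}:")
        else:
            # Defense side adapts: record the strategy that slipped through.
            memory.add(wrapper)
    return memory, history


if __name__ == "__main__":
    memory, history = co_evolve(rounds=20)
    print(f"blocked {sum(history)} of {len(history)} attacks")
    print(f"stored patterns: {sorted(memory.patterns)}")
```

In this sketch the guard is updated only through its memory, which mirrors the abstract's claim that the defense adapts without retraining; a real system would replace both stand-ins with LLM calls and a richer notion of experience.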

