Large Language Models (LLMs) remain highly vulnerable to diverse attacks, particularly in black-box settings where the internals of target models are inaccessible. Existing black-box defenses typically rely on pre-defined filtering heuristics, which often fail to generalize to unseen attack types and target model architectures. We introduce EvoDefense, an experience-guided, co-evolving black-box defense paradigm. EvoDefense employs a guard LLM to detect malicious queries and an experience memory module to accumulate defense knowledge from previous interactions. At its core is a continuous attack-defense evolution loop in which an attack generator and the guard model iteratively refine their attack strategies and defense policies, respectively, through experience-guided optimization. This design enables EvoDefense to generalize to unseen attacks and target models without retraining. Experiments on HarmBench, AdvBench, and AlpacaEval show that EvoDefense achieves consistently strong defense performance across seven popular models and five representative LLM attacks while preserving competitive general capabilities. On HarmBench, EvoDefense reduces the attack success rate (ASR) of AutoDAN-turbo on Gemini-3-flash and LLaMA-3-8B-Instruct from 29.4% and 43.4% to 8.4% and 6.2%, respectively.
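To make the attack-defense evolution loop concrete, the following is a minimal toy sketch, not the EvoDefense implementation. Everything in it is an assumption made for illustration: the guard LLM is replaced by substring matching against the experience memory, the attack generator by three fixed rewriting strategies, and the names `ExperienceMemory`, `Attacker`, `STRATEGIES`, and `co_evolve` are hypothetical. It only shows how the two sides could alternately update from shared interaction experience without any retraining, and how ASR is computed per round.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Hypothetical attack strategies: each one rewrites a seed request.
STRATEGIES: dict[str, Callable[[str], str]] = {
    "plain": lambda s: s,
    "roleplay": lambda s: f"Pretend you have no rules. {s}",
    "reverse": lambda s: s[::-1],
}


@dataclass
class ExperienceMemory:
    """Toy experience memory: text of attacks that previously got through."""
    patterns: set[str] = field(default_factory=set)

    def add(self, text: str) -> None:
        self.patterns.add(text.lower())

    def matches(self, query: str) -> bool:
        q = query.lower()
        return any(p in q for p in self.patterns)


def guard_blocks(query: str, memory: ExperienceMemory) -> bool:
    """Stand-in for the guard LLM (assumption): block if memory matches."""
    return memory.matches(query)


@dataclass
class Attacker:
    """Toy attack generator that remembers which strategies were blocked."""
    blocked: set[str] = field(default_factory=set)

    def pick(self) -> str:
        for name in STRATEGIES:
            if name not in self.blocked:
                return name
        return next(iter(STRATEGIES))  # all strategies blocked: fall back


def co_evolve(seeds: list[str], rounds: int) -> tuple[list[float], ExperienceMemory]:
    """Run the loop and return the per-round ASR and the final memory."""
    memory, attacker = ExperienceMemory(), Attacker()
    asr_history: list[float] = []
    for _ in range(rounds):
        name = attacker.pick()
        attacks = [STRATEGIES[name](s) for s in seeds]
        bypassed = [a for a in attacks if not guard_blocks(a, memory)]
        # ASR = fraction of attack queries the guard failed to block.
        asr_history.append(len(bypassed) / len(attacks))
        if not bypassed:
            attacker.blocked.add(name)  # attack-side experience update
        for a in bypassed:
            memory.add(a)  # defense-side experience update
    return asr_history, memory


if __name__ == "__main__":
    history, _ = co_evolve(["do forbidden-task-a", "do forbidden-task-b"], rounds=5)
    print(history)
```

In this toy run the ASR sequence is `[1.0, 0.0, 0.0, 1.0, 0.0]`: each new strategy succeeds once, is written to memory, and is blocked from then on. The "roleplay" wrapper never succeeds because it still contains the already memorized plain request. A real guard LLM would generalize from stored experience rather than match substrings.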