As Large Language Models (LLMs) are increasingly deployed in complex applications, their vulnerability to adversarial attacks raises urgent safety concerns, especially those evolving over multi-round interactions. Existing defenses are largely reactive and struggle to adapt as adversaries refine strategies across rounds. In this work, we propose CoopGuard , a stateful multi-round LLM defense framework based on cooperative agents that maintains and updates an internal defense state to counter evolving attacks. It employs three specialized agents (Deferring Agent, Tempting Agent, and Forensic Agent) for complementary round-level strategies, coordinated by System Agent, which conditions decisions on the evolving defense state (interaction history) and orchestrates agents over time. To evaluate evolving threats, we introduce the EMRA benchmark with 5,200 adversarial samples across 8 attack types, simulating progressively LLM multi-round attacks. Experiments show that CoopGuard reduces attack success rate by 78.9% over state-of-the-art defenses, while improving deceptive rate by 186% and reducing attack efficiency by 167.9%, offering a more comprehensive assessment of multi-round defense. These results demonstrate that CoopGuard provides robust protection for LLMs in multi-round adversarial scenarios.
翻译:随着大语言模型在复杂应用中的广泛部署,其面临对抗攻击(尤其是多轮交互中动态演化的攻击)的脆弱性引发了迫切的安全关切。现有防御手段多为被动响应式,难以适应攻击方跨轮次调整策略的演化特性。本文提出CoopGuard——一种基于协作智能体的带状态多轮大语言模型防御框架,通过维护并持续更新内部防御状态以应对演化型攻击。该框架部署三个专用智能体(延迟智能体、诱饵智能体与取证智能体)实施互补的轮级策略,并由系统智能体根据演化防御状态(交互历史)协调决策与智能体编排时序。为评估演化威胁,我们构建了包含5,200个对抗样本(覆盖8种攻击类型)的EMRA基准测试集,模拟渐进式多轮大语言模型攻击。实验表明,相较于现有最优防御方案,CoopGuard将攻击成功率降低78.9%,同时使欺骗率提升186%、攻击效率降低167.9%,实现了对多轮防御能力的更全面评估。这些结果证明CoopGuard能为多轮对抗场景中的大语言模型提供稳健防护。