Sponge attacks increasingly threaten LLM systems by inducing excessive computation and DoS. Existing defenses either rely on statistical filters that fail on semantically meaningful attacks or use static LLM-based detectors that struggle to adapt as attack strategies evolve. We introduce SHIELD, a multi-agent, auto-healing defense framework centered on a three-stage Defense Agent that integrates semantic similarity retrieval, pattern matching, and LLM-based reasoning. Two auxiliary agents, a Knowledge Updating Agent and a Prompt Optimization Agent, form a closed self-healing loop, when an attack bypasses detection, the system updates an evolving knowledgebase, and refines defense instructions. Extensive experiments show that SHIELD consistently outperforms perplexity-based and standalone LLM defenses, achieving high F1 scores across both non-semantic and semantic sponge attacks, demonstrating the effectiveness of agentic self-healing against evolving resource-exhaustion threats.
翻译:海绵攻击通过诱导过度计算和拒绝服务,日益威胁大语言模型系统。现有防御方案要么依赖统计过滤器(对语义攻击失效),要么采用静态的大语言模型检测器(难以适应攻击策略的演变)。我们提出了SHIELD,一个多智能体、自愈的防御框架,其核心是一个三阶段防御智能体,集成了语义相似性检索、模式匹配和大语言模型推理。两个辅助智能体——知识更新智能体和提示优化智能体——构成一个闭环自愈循环:当攻击绕过检测时,系统会更新动态知识库并优化防御指令。大量实验表明,SHIELD在基于困惑度的防御和独立大语言模型防御方案中均表现更优,在非语义和语义海绵攻击上均取得较高的F1分数,证明了智能体自愈机制对持续演变的资源耗尽威胁的有效性。