While Large Language Models (LLMs) have achieved remarkable performance, they remain vulnerable to jailbreak. The integration of Large Language Models (LLMs) with external tools via protocols such as the Model Context Protocol (MCP) introduces critical security vulnerabilities, including prompt injection, data exfiltration, and other threats. To counter these challenges, we propose MCP-GUARD, a robust, layered defense architecture designed for LLM-tool interactions. MCP-GUARD employs a three-stage detection pipeline that balances efficiency with accuracy: it progresses from lightweight static scanning for overt threats and a deep neural detector for semantic attacks, to our fine-tuned E5-based model which achieves 96.01\% accuracy in identifying adversarial prompts. Finally, an LLM arbitrator synthesizes these signals to deliver the final decision. To enable rigorous training and evaluation, we introduce MCP-ATTACKBENCH, a comprehensive benchmark comprising 70,448 samples augmented by GPT-4. This benchmark simulates diverse real-world attack vectors that circumvent conventional defenses in the MCP paradigm, thereby laying a solid foundation for future research on securing LLM-tool ecosystems.
翻译:尽管大型语言模型(LLM)已取得显著性能,其仍易受越狱攻击。通过模型上下文协议(MCP)等协议将大型语言模型与外部工具集成,会引入关键安全漏洞,包括提示注入、数据窃取等威胁。为应对这些挑战,我们提出MCP-GUARD——一种专为LLM-工具交互设计的鲁棒分层防御架构。MCP-GUARD采用兼顾效率与准确性的三阶段检测流程:从针对显性威胁的轻量级静态扫描,到针对语义攻击的深度神经检测器,最终通过我们微调的E5模型实现对抗性提示识别(准确率达96.01%)。最终由LLM仲裁器综合这些信号作出最终决策。为支持严格训练与评估,我们构建了MCP-ATTACKBENCH综合基准数据集,包含经GPT-4增强的70,448个样本。该基准模拟了MCP范式中规避传统防御的多样化现实攻击向量,从而为未来保护LLM-工具生态系统的研究奠定坚实基础。