While Large Language Models (LLMs) have achieved remarkable performance, they remain vulnerable to jailbreak. The integration of Large Language Models (LLMs) with external tools via protocols such as the Model Context Protocol (MCP) introduces critical security vulnerabilities, including prompt injection, data exfiltration, and other threats. To counter these challenges, we propose MCP-GUARD, a robust, layered defense architecture designed for LLM-tool interactions. MCP-GUARD employs a three-stage detection pipeline that balances efficiency with accuracy: it progresses from lightweight static scanning for overt threats and a deep neural detector for semantic attacks, to our fine-tuned E5-based model which achieves 96.01\% accuracy in identifying adversarial prompts. Finally, an LLM arbitrator synthesizes these signals to deliver the final decision. To enable rigorous training and evaluation, we introduce MCP-ATTACKBENCH, a comprehensive benchmark comprising 70,448 samples augmented by GPT-4. This benchmark simulates diverse real-world attack vectors that circumvent conventional defenses in the MCP paradigm, thereby laying a solid foundation for future research on securing LLM-tool ecosystems.
翻译:尽管大型语言模型(LLM)已取得显著性能,其仍易受越狱攻击。大型语言模型(LLM)通过模型上下文协议(MCP)等协议与外部工具集成,引入了关键安全漏洞,包括提示注入、数据窃取及其他威胁。为应对这些挑战,我们提出了MCP-GUARD——一个专为LLM-工具交互设计的鲁棒分层防御架构。MCP-GUARD采用三阶段检测流水线,在效率与准确性间取得平衡:从针对显性威胁的轻量级静态扫描,到针对语义攻击的深度神经检测器,最终通过我们微调的基于E5的模型(其识别对抗性提示的准确率达96.01%)完成检测。最终,LLM仲裁器综合这些信号作出最终决策。为支持严格的训练与评估,我们构建了MCP-ATTACKBENCH综合基准数据集,该数据集包含经GPT-4增强的70,448个样本,模拟了MCP范式中规避传统防御的多样化现实攻击向量,从而为未来保护LLM-工具生态系统的研究奠定了坚实基础。