Large Language Models (LLMs) demonstrate strong capabilities in solving complex tasks when integrated with external tools. The Model Context Protocol (MCP) has become a standard interface for enabling such tool-based interactions. However, these interactions introduce substantial security concerns, particularly when the MCP server is compromised or untrustworthy. While prior benchmarks primarily focus on prompt injection attacks or analyze the vulnerabilities of LLM-MCP interaction trajectories, limited attention has been given to the underlying system logs associated with malicious MCP servers. To address this gap, we present the first synthetic benchmark for evaluating LLMs' ability to identify security risks from system logs. We define nine categories of MCP server risks and generate 1,800 synthetic system logs using ten state-of-the-art LLMs. These logs are embedded in the return values of 243 curated MCP servers, yielding a dataset of 2,421 chat histories for training and 471 queries for evaluation. Our pilot experiments reveal that smaller models often fail to detect risky system logs, leading to high false-negative rates. While models trained with supervised fine-tuning (SFT) tend to over-flag benign logs, resulting in elevated false-positive rates, Reinforcement Learning with Verifiable Reward (RLVR) offers a better precision-recall balance. In particular, after training with Group Relative Policy Optimization (GRPO), Llama3.1-8B-Instruct achieves 83% accuracy, surpassing the best-performing large remote model by 9 percentage points. Fine-grained, per-category analysis further underscores the effectiveness of reinforcement learning in enhancing LLM safety within the MCP framework. Code and data are available at https://github.com/PorUna-byte/MCP-RiskCue.