The Model Context Protocol (MCP) is increasingly adopted to standardize the interaction between LLM agents and external tools. However, this trend introduces a new threat: Tool Poisoning Attacks (TPA), where tool metadata is poisoned to induce the agent to perform unauthorized operations. Existing defenses, which primarily focus on behavior-level analysis, are fundamentally ineffective against TPA: poisoned tools need not be executed, so they leave no behavioral trace to monitor. We therefore propose MindGuard, a decision-level guardrail for LLM agents that provides provenance tracking of tool-call decisions, policy-agnostic detection, and poisoning-source attribution against TPA. While fully explaining LLM decisions remains challenging, our empirical findings uncover a strong correlation between LLM attention mechanisms and tool invocation decisions. We therefore adopt attention as an empirical signal for decision tracking and formalize it as the Decision Dependence Graph (DDG), which models the LLM's reasoning process as a weighted, directed graph whose vertices represent logical concepts and whose edges quantify attention-based dependencies. We further design robust DDG construction and graph-based anomaly analysis mechanisms that efficiently detect TPA and attribute its source. Extensive experiments on real-world datasets demonstrate that MindGuard achieves 94\%-99\% average precision in detecting poisoned invocations and 95\%-100\% attribution accuracy, with processing times under one second and no additional token cost. Moreover, the DDG can be viewed as an adaptation of the classical Program Dependence Graph (PDG), providing a solid foundation for applying traditional security policies at the decision level.
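To make the DDG intuition concrete, the following is a minimal toy sketch, not the paper's implementation: vertex names, edge weights, and the `attribute_poisoning` helper with its ratio threshold are all hypothetical. It models the DDG as a weighted directed graph and flags a tool vertex when the call decision depends far more heavily on an unrequested tool's metadata than on the tool the user actually asked for.

```python
# Hypothetical attention-derived edge weights into the "decision" vertex.
# Vertex and tool names are illustrative, not from the paper.
ddg_edges = {
    ("user_query", "decision"): 0.30,
    ("tool:weather.describe", "decision"): 0.15,  # the tool the user requested
    ("tool:mail.describe", "decision"): 0.45,     # an unrelated tool dominating
    ("system_prompt", "decision"): 0.10,
}

def attribute_poisoning(edges, requested_tool, threshold=2.0):
    """Flag tool vertices whose dependency weight on the decision exceeds
    `threshold` times the requested tool's weight (toy anomaly rule)."""
    base = edges.get((requested_tool, "decision"), 1e-9)
    suspects = []
    for (src, dst), weight in edges.items():
        if dst == "decision" and src.startswith("tool:") and src != requested_tool:
            if weight > threshold * base:
                suspects.append(src)
    return suspects

print(attribute_poisoning(ddg_edges, "tool:weather.describe"))
# -> ['tool:mail.describe']
```

Here the unrequested `tool:mail.describe` carries three times the decision-dependency weight of the requested tool, so the toy rule attributes the anomalous invocation to it; the paper's actual detection and attribution mechanisms operate on full attention-derived graphs rather than this single ratio test.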