When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this assumption directly. We expose a frozen GNN to a ReAct-style LLM agent as an explicit tool and measure, on node classification over a text-attributed graph (ogbn-arxiv, replicated on WikiCS), whether the agent uses the tool or merely obeys it. We find that the agent does not exercise judgment: its predictions agree with the raw GNN's 97.6-99.2% of the time (5 seeds). The agent collapses into a GNN parrot that adopts the tool's output wholesale and bypasses its own reasoning. A sweep over backbone capability (Qwen2.5, 0.5B-7B) shows that the deference is not a weak-model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B). Crucially, the cost of deference does not shrink as capability grows; it widens where alternatives emerge. A per-node oracle over the available actions beats the parrot by 0.09-0.18 at 3B and 0.12-0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent's alternatives improve. At 7B a simple neighbour-label tool overtakes the GNN at high homophily (0.81 vs 0.71), yet the agent still defers. A simple selective-invocation gate recovers about half of that high-homophily gap (0.71 to 0.83) but yields no net global gain, and held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom: reliable selective invocation looks limited by available information, not merely by router design. Our results are a cautionary measurement: evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and selective invocation must be designed in rather than expected to emerge from scale.
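The quantities named in the abstract (agent-tool agreement, parrot accuracy, and per-node oracle headroom) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the action names, and the toy predictions are hypothetical and not taken from the paper, and the oracle is read as "a node counts as correct if any available action predicts its label", which may differ from the authors' exact definition.

```python
from typing import Dict, Sequence


def agreement_rate(agent_preds: Sequence[int], tool_preds: Sequence[int]) -> float:
    """Fraction of nodes where the agent's final label equals the raw tool label."""
    if len(agent_preds) != len(tool_preds):
        raise ValueError("prediction sequences must have the same length")
    return sum(a == t for a, t in zip(agent_preds, tool_preds)) / len(agent_preds)


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Plain classification accuracy."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def oracle_accuracy(action_preds: Dict[str, Sequence[int]], labels: Sequence[int]) -> float:
    """Per-node oracle (assumed definition): correct if ANY action is correct."""
    hits = sum(
        any(preds[i] == y for preds in action_preds.values())
        for i, y in enumerate(labels)
    )
    return hits / len(labels)


# Toy, made-up data: 8 nodes, 3 classes. Not results from the paper.
labels = [0, 1, 2, 1, 0, 2, 1, 0]
action_preds = {
    "gnn_tool":      [0, 1, 2, 0, 0, 1, 1, 2],  # frozen GNN output
    "neighbor_tool": [0, 1, 1, 1, 0, 2, 0, 2],  # hypothetical neighbour-label tool
    "llm_only":      [1, 1, 2, 1, 2, 2, 1, 2],  # hypothetical text-only prediction
}
agent_preds = [0, 1, 2, 0, 0, 1, 1, 0]  # agent copies the GNN on 7 of 8 nodes

agreement = agreement_rate(agent_preds, action_preds["gnn_tool"])
parrot_acc = accuracy(action_preds["gnn_tool"], labels)  # a pure parrot scores as the GNN
oracle_acc = oracle_accuracy(action_preds, labels)
headroom = oracle_acc - parrot_acc

print(f"agreement={agreement:.3f} parrot={parrot_acc:.3f} "
      f"oracle={oracle_acc:.3f} headroom={headroom:.3f}")
```

On this toy input the agreement is 0.875, the parrot accuracy 0.625, and the oracle accuracy 0.875, so the headroom is 0.25. In this reading, a selective-invocation gate would be judged by how much of that headroom it recovers.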
