Large language models (LLMs) are increasingly deployed in agentic systems, where a fundamental task is mapping user intents to relevant external tools. Errors in tool selection can have severe outcomes, such as unauthorized data access, even without modifying the agent's underlying model. Existing evaluations measure performance on curated, benign benchmarks. However, a pipeline's behavior in deployment depends on the tool pool the agent actually encounters, which in open registries is shaped by third parties. We introduce LLMCert-T, the first statistical framework that returns \textbf{high-confidence upper bounds on the probability that a tool-selection pipeline satisfies a declared safety specification under a realistic tool distribution}. LLMCert-T models tool-selection evaluation as a Bernoulli estimation problem, drawing inserted-tool sequences from a distribution that the safety specification fixes. To evaluate robustness against realistic deployment conditions, we instantiate this distribution as a stochastic process that generates inserted-tool sequences round by round, conditioning each round on the agent's selection in the previous round. LLMCert-T aggregates the per-trial outcomes into a one-sided Clopper-Pearson upper bound on the probability that the specification is satisfied. By returning this bound as a certificate with statistical guarantees over the inserted-tool sequence distribution, LLMCert-T makes safety claims intuitive, actionable, and comparable across models, retrievers, mitigations, and registry policies. Across popular BFCL and OpenAPI tool pools, LLMCert-T shows that current LLM agents remain fragile under Distractor Selection and Top-N Saturation specifications: their certified correctness upper bounds drop to approximately 20\%, far below their clean-pool lower bounds.
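The aggregation step described above, turning per-trial pass/fail outcomes into a one-sided Clopper-Pearson upper bound, can be sketched as follows. This is a minimal illustration using only the Python standard library, not the LLMCert-T implementation: the function names, the `trial_fn` callable, the default `alpha=0.05`, and the toy trial in the usage note are all assumptions introduced here. The bound is found by bisection on the exact binomial CDF, using the standard characterization that the upper limit is the value of p at which P(X <= k) equals alpha.

```python
import math
from typing import Callable


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed in log space for stability."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


def clopper_pearson_upper(successes: int, trials: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) Clopper-Pearson upper bound on a Bernoulli parameter."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    if successes == trials:
        return 1.0  # every trial succeeded, so the bound is trivial
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; the CDF is decreasing in p
        mid = (lo + hi) / 2.0
        if binom_cdf(successes, trials, mid) > alpha:
            lo = mid  # CDF still above alpha: the bound lies to the right
        else:
            hi = mid
    return hi  # return the conservative end of the final bracket


def certify(trial_fn: Callable[[], bool], n_trials: int, alpha: float = 0.05) -> float:
    """Run n_trials independent trials and bound P(specification satisfied).

    trial_fn is a hypothetical stand-in for: sample an inserted-tool sequence,
    run the tool-selection pipeline, and report whether the specification held.
    """
    successes = sum(1 for _ in range(n_trials) if trial_fn())
    return clopper_pearson_upper(successes, n_trials, alpha)
```

As a usage example, `certify(lambda: random.Random().random() < 0.15, 200)` would bound the success probability of a toy trial that passes about 15% of the time. With zero successes in 10 trials and `alpha=0.05`, the bound has the closed form 1 - 0.05^(1/10), roughly 0.259, which gives a quick sanity check on the bisection.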