Large language models deployed as agents increasingly interact with external systems through tool calls: actions with real-world consequences that text outputs alone do not carry. Safety evaluations, however, overwhelmingly measure text-level refusal behavior, leaving a critical question unanswered: does alignment that suppresses harmful text also suppress harmful actions? We introduce the GAP benchmark, a systematic evaluation framework that measures divergence between text-level and tool-call-level safety in LLM agents. We test six frontier models across six regulated domains (pharmaceutical, financial, educational, employment, legal, and infrastructure), seven jailbreak scenarios per domain, three system prompt conditions (neutral, safety-reinforced, and tool-encouraging), and two prompt variants, producing 17,420 analysis-ready datapoints. Our central finding is that text safety does not transfer to tool-call safety. Across all six models, we observe instances where the model's text output refuses a harmful request while its tool calls simultaneously execute the forbidden action, a divergence we formalize as the GAP metric. Even under safety-reinforced system prompts, 219 such cases persist across the six models. System prompt wording exerts substantial influence on tool-call behavior: tool-call-safe (TC-safe) rates span 21 percentage points for the most robust model and 57 for the most prompt-sensitive one, with 16 of 18 pairwise ablation comparisons remaining significant after Bonferroni correction. Runtime governance contracts reduce information leakage in all six models but produce no detectable deterrent effect on forbidden tool-call attempts themselves. These results demonstrate that text-only safety evaluations are insufficient for assessing agent behavior and that tool-call safety requires dedicated measurement and mitigation.
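The divergence described above can be sketched as a classifier over paired text-level and tool-call-level judgments. The schema and function names below are hypothetical illustrations of the idea, not the benchmark's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One model response to a harmful request (hypothetical schema)."""
    text_refused: bool        # did the text output refuse the request?
    tool_call_harmful: bool   # did a tool call execute the forbidden action?

def gap_rate(interactions):
    """Fraction of responses whose text refuses while a tool call
    simultaneously executes the forbidden action (the divergence the
    abstract formalizes as the GAP metric)."""
    if not interactions:
        return 0.0
    diverging = sum(
        1 for i in interactions if i.text_refused and i.tool_call_harmful
    )
    return diverging / len(interactions)

# Example: 2 of 4 responses refuse in text yet act via tools.
sample = [
    Interaction(True, True),    # refuses in text, acts via tool -> GAP case
    Interaction(True, False),   # consistent refusal
    Interaction(False, True),   # consistent compliance (unsafe, but no gap)
    Interaction(True, True),    # another GAP case
]
print(gap_rate(sample))  # 0.5
```

The key design point is that the two judgments are scored independently: a response counts toward the GAP only when the text channel and the action channel disagree, which is exactly the failure mode that text-only evaluations cannot see.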