AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira, accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure this threat: most cover only a handful of integrations and replay the same attack payload across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce AGENTREDBENCH, a dynamic, LLM-driven redteaming benchmark of 215 subtle underspecified-authorization scenarios (attacks at the boundary of what the user's request authorizes), spanning 24 enterprise integrations in nine functional families and five attack types. Across an eight-model panel (Anthropic, OpenAI, Google), the no-guard attack success rate (ASR) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). We release the codebase, integration schemas, and the AGENTREDGUARD model openly; to keep the scenario set out of training corpora and preserve the meaning of headline ASR over time, the canonical scenarios are evaluated through a maintainer-mediated channel with immutable versioning. AGENTREDGUARD, released alongside the benchmark, is a guard trained on an integration-diverse corpus of adversarial tool-response content. It cuts panel ASR from 69.9% to 2.4% at a 0.37% false-positive rate, outperforming every open-source baseline with non-trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross-integration and cross-attack-type holdouts both confirm that the gain transfers beyond the training subset.
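To make the two headline metrics concrete, the following is a minimal sketch of how an attack success rate and a guard false-positive rate could be computed from per-episode records. The `Episode` type, its field names, and the rule that a flagged tool response fully blocks the attack are assumptions made for illustration, not the benchmark's actual data model, and the episodes are toy data unrelated to the reported results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    """One agent run (hypothetical record layout, for illustration only)."""
    adversarial: bool       # tool response carried an injected payload
    guard_flagged: bool     # guard flagged the tool response
    attack_succeeded: bool  # unguarded agent carried out the attacker's goal


def attack_success_rate(episodes: Sequence[Episode], guarded: bool) -> float:
    """Fraction of adversarial episodes in which the attack succeeds.

    Assumption: when the guard is active, a flagged response is blocked,
    so the attack cannot succeed in that episode.
    """
    attacks = [e for e in episodes if e.adversarial]
    if not attacks:
        raise ValueError("no adversarial episodes")
    hits = sum(
        1 for e in attacks
        if e.attack_succeeded and not (guarded and e.guard_flagged)
    )
    return hits / len(attacks)


def false_positive_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of benign episodes that the guard flags."""
    benign = [e for e in episodes if not e.adversarial]
    if not benign:
        raise ValueError("no benign episodes")
    return sum(1 for e in benign if e.guard_flagged) / len(benign)


# Toy data: four adversarial and four benign episodes.
EPISODES = [
    Episode(adversarial=True, guard_flagged=True, attack_succeeded=True),
    Episode(adversarial=True, guard_flagged=True, attack_succeeded=True),
    Episode(adversarial=True, guard_flagged=False, attack_succeeded=True),
    Episode(adversarial=True, guard_flagged=True, attack_succeeded=False),
    Episode(adversarial=False, guard_flagged=False, attack_succeeded=False),
    Episode(adversarial=False, guard_flagged=False, attack_succeeded=False),
    Episode(adversarial=False, guard_flagged=False, attack_succeeded=False),
    Episode(adversarial=False, guard_flagged=True, attack_succeeded=False),
]

if __name__ == "__main__":
    print("no-guard ASR:", attack_success_rate(EPISODES, guarded=False))
    print("guarded ASR:", attack_success_rate(EPISODES, guarded=True))
    print("guard FPR:", false_positive_rate(EPISODES))
```

Reporting the guarded ASR together with the false-positive rate matters because either number alone can be gamed: a guard that flags every tool response drives ASR to zero while also blocking all benign traffic.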
