AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira, accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure this threat: most cover only a handful of integrations and replay the same attack payload across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce AGENTREDBENCH, a dynamic, LLM-driven redteaming benchmark of 215 subtle underspecified-authorization scenarios (attacks at the boundary of what the user's request authorizes) spanning 24 enterprise integrations in nine functional families and five attack types. Across an eight-model panel (Anthropic, OpenAI, Google), no-guard attack success rate (ASR) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). To keep the scenario set out of training corpora and preserve the meaning of headline ASR over time, we release the codebase, integration schemas, and AGENTREDGUARD model openly, while the canonical scenarios are evaluated through a maintainer-mediated channel with immutable versioning. AGENTREDGUARD, released alongside the benchmark, is a guard trained on an integration-diverse corpus of adversarial tool-response content. It cuts panel ASR from 69.9% to 2.4% at a 0.37% false-positive rate, outperforming every open-source baseline with non-trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross-integration and cross-attack-type holdouts both confirm that the gain transfers beyond the training subset.
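
For readers unfamiliar with the two headline metrics, the sketch below shows one plausible way to compute them from per-run outcomes. It is an illustration only: the `Run` record, its field names, and the exact definitions (ASR as the fraction of attack runs in which the agent carries out the injected action, false-positive rate as the fraction of benign runs the guard flags) are assumptions made here, not the benchmark's actual schema or scoring code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Run:
    """One evaluated agent run (hypothetical record, not the benchmark's schema)."""
    is_attack: bool         # tool response carried an injected payload
    guard_flagged: bool     # guard flagged or blocked the tool response
    attack_succeeded: bool  # agent carried out the injected action


def attack_success_rate(runs: Iterable[Run]) -> float:
    """Fraction of attack runs in which the attack succeeded (assumed ASR definition)."""
    attacks = [r for r in runs if r.is_attack]
    if not attacks:
        raise ValueError("no attack runs to score")
    return sum(r.attack_succeeded for r in attacks) / len(attacks)


def false_positive_rate(runs: Iterable[Run]) -> float:
    """Fraction of benign runs that the guard flagged (assumed FPR definition)."""
    benign = [r for r in runs if not r.is_attack]
    if not benign:
        raise ValueError("no benign runs to score")
    return sum(r.guard_flagged for r in benign) / len(benign)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before
```

Applied to the panel figures quoted above, `relative_reduction(69.9, 2.4)` comes to roughly 0.966, that is, a relative ASR reduction of about 96.6%.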
