Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

LLM agents have begun to find real security vulnerabilities that human auditors and automated fuzzers missed for decades, in source-available targets where the analyst can build and instrument the code. In practice the work is split among several agents wired together by a harness: the program that fixes which roles exist, how they pass information, which tools each may call, and how retries are coordinated. With the language model held fixed, changing only the harness can still change success rates several-fold on public agent benchmarks, yet most harnesses are written by hand. Recent harness optimizers each search only a narrow slice of the design space and rely on coarse pass/fail feedback, which gives no diagnostic signal about why a trial failed. AgentFlow addresses both limitations. It pairs a typed graph DSL, whose search space jointly covers agent roles, prompts, tools, communication topology, and coordination protocol, with a feedback-driven outer loop that reads runtime signals from the target program itself to diagnose which part of the harness caused a failure and rewrite that part accordingly. We evaluate AgentFlow on TerminalBench-2 with Claude Opus 4.6 and on Google Chrome with Kimi K2.5. AgentFlow reaches 84.3% on TerminalBench-2, the highest score in the public leaderboard snapshot we evaluate against, and discovers ten previously unknown (zero-day) vulnerabilities in Google Chrome, including two Critical sandbox-escape vulnerabilities (CVE-2026-5280 and CVE-2026-6297).
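To make the idea of a harness as a typed graph concrete, the following is a minimal sketch in Python. It is not AgentFlow's DSL: the names `Agent`, `Edge`, `Harness`, `type_errors`, the message-type strings, and the tool names are all hypothetical and chosen for illustration only. It assumes that each agent declares the message type it consumes and the type it produces, so that an ill-formed communication topology can be rejected before any trial is run.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """One role in the harness (hypothetical schema, for illustration)."""
    role: str
    prompt: str
    tools: frozenset[str]
    consumes: str  # message type this agent accepts
    produces: str  # message type this agent emits


@dataclass(frozen=True)
class Edge:
    """A directed communication channel between two agents, by name."""
    src: str
    dst: str


@dataclass
class Harness:
    agents: dict[str, Agent]
    edges: list[Edge]
    max_retries: int = 2  # stand-in for a coordination-protocol setting

    def type_errors(self) -> list[str]:
        """Return one message per edge that is dangling or type-mismatched."""
        errors: list[str] = []
        for edge in self.edges:
            missing = [n for n in (edge.src, edge.dst) if n not in self.agents]
            if missing:
                errors.extend(f"unknown agent: {n}" for n in missing)
                continue
            out_t = self.agents[edge.src].produces
            in_t = self.agents[edge.dst].consumes
            if out_t != in_t:
                errors.append(
                    f"{edge.src} -> {edge.dst}: produces {out_t}, expects {in_t}"
                )
        return errors


# Three illustrative roles; prompts and tool names are placeholders.
agents = {
    "scout": Agent("scout", "Propose suspicious code paths.",
                   frozenset({"grep"}), "Target", "Hypothesis"),
    "exploiter": Agent("exploiter", "Build an input that triggers the bug.",
                       frozenset({"build", "run"}), "Hypothesis", "CrashReport"),
    "triager": Agent("triager", "Classify the crash.",
                     frozenset({"symbolize"}), "CrashReport", "Finding"),
}

ok = Harness(agents, [Edge("scout", "exploiter"), Edge("exploiter", "triager")])
bad = Harness(agents, [Edge("scout", "triager")])  # skips the exploiter

print(ok.type_errors())
print(bad.type_errors())
```

In this sketch a search procedure could mutate roles, prompts, tools, or edges and discard any candidate for which `type_errors()` is non-empty. Whether AgentFlow's type system works this way is an assumption, not something the abstract states.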
