Testing Agentic Workflows with Structural Coverage Criteria

Multi-agent systems increasingly expose explicit workflow structure: agents, tools, tool-access rules, restrictions, and delegation paths. Existing evaluations rely largely on end-to-end task success, benchmark scores, final-response quality, or prompt-level checks, which provide limited evidence that this declared coordination structure has actually been exercised. This makes it difficult to assess test-suite adequacy or detect structural regressions in tool access, restrictions, and inter-agent delegation. We address this gap with a structural testing approach for multi-agent workflow specifications. The approach represents each workflow as a typed coordination graph, derives coverage obligations over reachable agents, allowed tool edges, restricted tool edges, and delegation edges, and uses coverage-driven generation with DSPy-based scenario realization to produce executable tests. The graph fixes what must be covered; DSPy realizes those obligations as natural-language scenarios whose witnesses are checked at runtime. We implement the approach for OpenAI Agents SDK-style workflows and evaluate it on ten SDK-derived benchmarks comprising 49 reachable agents, 47 tools, and 403 structural obligations. Generated scenarios witness 54/75 allowed-tool obligations and 36/48 delegation obligations within a bounded refinement budget. The adversarial restricted-tool criterion elicits 23/248 restricted-call violations, separating workflows whose restrictions hold under probing from workflows with concrete misrouting failures. These results show that structural coverage provides a useful adequacy layer for multi-agent workflow testing: it does not replace semantic or end-to-end evaluation, but reveals whether declared agents, tool-access rules, restrictions, and delegation paths have been exercised.
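
As a rough illustration of the graph-based step described above, the following minimal sketch derives coverage obligations from a small workflow and tallies which ones a test run witnessed. It is not the authors' implementation: the `Workflow` and `Obligation` classes, the agent and tool names, and the choice to treat agents reachable from the entry agent via delegation edges as the ones in scope are all assumptions made for illustration. The sketch also omits scenario generation with DSPy and execution against the OpenAI Agents SDK.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Obligation:
    """One structural element that a test suite should exercise (hypothetical type)."""
    kind: str          # "agent", "allowed_tool", "restricted_tool", or "delegation"
    source: str        # agent that owns the obligation
    target: str = ""   # tool name or delegate agent; empty for agent obligations


@dataclass
class Workflow:
    """Hypothetical typed coordination graph: agents are nodes, edges are typed."""
    entry: str
    allowed: dict[str, set[str]] = field(default_factory=dict)     # agent -> tools it may call
    restricted: dict[str, set[str]] = field(default_factory=dict)  # agent -> tools it must not call
    delegates: dict[str, set[str]] = field(default_factory=dict)   # agent -> agents it may hand off to


def reachable_agents(wf: Workflow) -> set[str]:
    """Breadth-first search over delegation edges, starting at the entry agent."""
    seen = {wf.entry}
    queue = deque([wf.entry])
    while queue:
        agent = queue.popleft()
        for nxt in wf.delegates.get(agent, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def derive_obligations(wf: Workflow) -> set[Obligation]:
    """Emit one obligation per reachable agent and per typed edge leaving it."""
    obligations: set[Obligation] = set()
    for agent in reachable_agents(wf):
        obligations.add(Obligation("agent", agent))
        for tool in wf.allowed.get(agent, ()):
            obligations.add(Obligation("allowed_tool", agent, tool))
        for tool in wf.restricted.get(agent, ()):
            obligations.add(Obligation("restricted_tool", agent, tool))
        for delegate in wf.delegates.get(agent, ()):
            obligations.add(Obligation("delegation", agent, delegate))
    return obligations


def coverage(obligations: set[Obligation], witnessed: set[Obligation]) -> dict[str, tuple[int, int]]:
    """Return (witnessed, total) per obligation kind."""
    report: dict[str, tuple[int, int]] = {}
    for ob in obligations:
        hit, total = report.get(ob.kind, (0, 0))
        report[ob.kind] = (hit + (ob in witnessed), total + 1)
    return report


# Illustrative workflow; "legacy" is declared but unreachable from the entry agent.
wf = Workflow(
    entry="triage",
    allowed={"billing": {"lookup_invoice", "issue_refund"},
             "support": {"search_docs"},
             "legacy": {"old_tool"}},
    restricted={"support": {"issue_refund"}},
    delegates={"triage": {"billing", "support"}},
)
obligations = derive_obligations(wf)

# Witnesses that a runtime trace might yield (made-up data, not results from the paper).
witnessed = {
    Obligation("agent", "triage"),
    Obligation("agent", "billing"),
    Obligation("allowed_tool", "billing", "lookup_invoice"),
    Obligation("delegation", "triage", "billing"),
}
report = coverage(obligations, witnessed)
for kind in sorted(report):
    hit, total = report[kind]
    print(f"{kind}: {hit}/{total}")
```

In this sketch a witnessed `restricted_tool` obligation would mean a violation was elicited rather than a requirement met, so that count is read in the opposite direction from the other three, much as the abstract reports 23/248 restricted-call violations separately from the allowed-tool and delegation witnesses.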
