MASTOR: A Multi-Agent Approach to Semantic Test Oracle Generation for RESTful APIs

Existing automated RESTful API testing approaches commonly rely on simple checks (e.g., HTTP status codes, schema conformance), which are insufficient for detecting semantic faults, business logic violations, and state-dependent inconsistencies. To address this limitation, we propose MASTOR, a Multi-Agent approach for generating Semantic Test Oracles for RESTful APIs from their implementation source code. MASTOR consists of two phases: source analysis and oracle generation. In the first phase, a source extraction agent constructs a source context for each endpoint operation by analyzing the transitive import closure of the relevant source files. In the second phase, two parallel oracle-generation paths run over the collected contexts: a single-operation path that produces status and field oracles for each operation, and a multi-operation path that generates behavioral consistency oracles for operation sequences by leveraging cross-operation semantic associations. Both paths apply a challenger-agent review, in which a dedicated reviewer identifies weaknesses and issues improvement hints to guide targeted regeneration, followed by oracle normalization, which filters out structurally invalid oracles. We evaluated MASTOR on a benchmark of 13 open-source RESTful API projects (296 operations, 251,303 lines of code) from the WFD and PRAB datasets. MASTOR generated 10,022 oracles and achieved an average mutation score of 75.4%. These oracles were translated into executable assertions via ToJUnit and ToPostmanAssertify, and into human-readable descriptions via ToReadable. In a baseline comparison on 50 selected operations, MASTOR outperformed Direct Prompting by 30.1 percentage points (69.9% vs. 39.8%) and SATORI by 49.4 percentage points (69.9% vs. 20.5%).
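The abstract does not spell out how the transitive import closure is computed. As a rough illustration only, the sketch below shows one way such a closure could be collected, assuming Python sources held in an in-memory dictionary. The names `direct_imports`, `import_closure`, and `PROJECT` are hypothetical and are not part of MASTOR, whose source extraction agent may work quite differently.

```python
import ast
from collections import deque


def direct_imports(source: str) -> set[str]:
    """Return the absolute module names imported by one source string."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module)
    return names


def import_closure(entry: str, sources: dict[str, str]) -> set[str]:
    """Breadth-first walk over imports, restricted to modules in `sources`.

    Modules outside `sources` (standard library, third-party) are ignored,
    so the result contains only project files reachable from `entry`.
    """
    seen = {entry}
    queue = deque([entry])
    while queue:
        module = queue.popleft()
        for dep in direct_imports(sources[module]):
            if dep in sources and dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


# Hypothetical toy project: module name -> source text (an assumption for illustration).
PROJECT = {
    "controller": "import service\nimport json\n",
    "service": "from repository import find_user\n",
    "repository": "import sqlite3\n",
    "unrelated": "import os\n",
}

if __name__ == "__main__":
    print(sorted(import_closure("controller", PROJECT)))
```

Starting from the hypothetical `controller` module, the walk reaches `service` and then `repository`, while `unrelated` stays out of the context because nothing on the path imports it.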
