Existing automated RESTful API testing approaches commonly rely on simple checks (e.g., HTTP status codes, schema conformance), which are insufficient for detecting semantic faults, business logic violations, and state-dependent inconsistencies. To address this, we propose MASTOR, a Multi-Agent approach for generating Semantic Test Oracles for RESTful APIs based on implementation source code. MASTOR consists of two phases: source analysis and oracle generation. The former employs a source extraction agent that constructs a source context for each endpoint operation by analyzing the transitive import closure of the relevant source files. The latter runs two parallel oracle-generation paths over the collected contexts: a single-operation path that produces status and field oracles for each operation, and a multi-operation path that generates behavioral consistency oracles for operation sequences by leveraging cross-operation semantic associations. Both paths apply a challenger-agent review, in which a dedicated reviewer identifies weaknesses and issues improvement hints to guide targeted regeneration, followed by oracle normalization, which filters out structurally invalid oracles. We evaluated MASTOR on a benchmark of 13 open-source RESTful API projects (296 operations, 251,303 lines of code) from the WFD and PRAB datasets. MASTOR generated 10,022 oracles and achieved an average mutation score of 75.4%. These oracles were translated into executable assertions via ToJUnit and ToPostmanAssertify, and into human-readable descriptions via ToReadable. In a baseline comparison on 50 selected operations, MASTOR outperformed Direct Prompting by 30.1 percentage points (69.9% vs. 39.8%) and SATORI by 49.4 percentage points (69.9% vs. 20.5%).
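To make the source-analysis step more concrete, the following is a minimal sketch of how a transitive import closure could be computed for one endpoint operation. It is an illustration under stated assumptions, not MASTOR's actual implementation: the function name `transitive_import_closure`, the `IMPORTS` map, and every file name are hypothetical, and a real extraction agent would have to derive the import graph by parsing the project's source files.

```python
from collections import deque


def transitive_import_closure(entry_file, imports):
    """Return every file reachable from entry_file by following imports.

    `imports` maps a file name to the list of files it imports directly.
    Files missing from the map are treated as having no imports.
    """
    seen = {entry_file}
    queue = deque([entry_file])
    while queue:
        current = queue.popleft()
        for dependency in imports.get(current, ()):
            # The `seen` set prevents revisiting files, so import cycles terminate.
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)
    return seen


# Hypothetical import graph for a small project (all names are assumptions).
IMPORTS = {
    "OrderController.java": ["OrderService.java", "OrderDto.java"],
    "OrderService.java": ["OrderRepository.java", "Order.java"],
    "OrderRepository.java": ["Order.java"],
    "UserController.java": ["UserService.java"],
}

# Source context candidates for an operation handled by OrderController.java.
closure = transitive_import_closure("OrderController.java", IMPORTS)
print(sorted(closure))
```

In this sketch the closure for the order endpoint includes the service, repository, entity, and DTO files but excludes the unrelated user files, which reflects the idea of restricting each operation's source context to code that can actually influence its behavior.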