Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from natural language (NL) requirements to validate functional behaviour, which makes these traditional metrics weak proxies for whether the generated tests validate the intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running system under test (SUT). In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, whether valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements. This drop sometimes negates the benefit of refinement and indicates that incorporating actual SUT behaviour is unnecessary when requirement detail is high.