Assessing the effectiveness of REST API tests in black-box settings is challenging because source code coverage metrics are unavailable and the systems under test often use polyglot tech stacks. We propose three metrics that capture average, minimum, and maximum log coverage to account for the variation in test generation results and runtime behavior across multiple runs. Using log coverage, we empirically evaluate three REST API test generation strategies on the Light-OAuth2 authorization microservice system: evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust load tests. On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests, whereas EvoMaster and GPT-5.2-Codex tests uncover 26.1% and 38.6% fewer, respectively. Next, we analyze combined log coverage to assess complementarity between strategies. Combining human-written tests with Claude Opus 4.6 tests increases total observed log coverage by 78.4% relative to the human-written tests alone and by 38.9% relative to the Claude tests alone. Combining the Locust tests with EvoMaster tests yields increases of 30.7% and 76.9%, and combining them with GPT-5.2-Codex tests yields 26.1% and 105.6%. These results indicate that the generation strategies exercise largely distinct runtime behaviors. In future work, we plan to extend the study to multiple systems.
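The following is a minimal sketch of how the per-run and combined log coverage figures could be computed. It is not taken from the study. It assumes that the log coverage of a run is the number of unique log templates observed in that run, and that combined coverage is the union of the templates observed by two strategies. The function names (`coverage_summary`, `combined_gain`) and the template IDs are hypothetical.

```python
from statistics import mean


def coverage_summary(runs):
    """Summarize log coverage over several runs.

    `runs` is a list of sets, one per run, each holding the IDs of the
    unique log templates observed in that run (an assumed representation).
    """
    sizes = [len(templates) for templates in runs]
    return {"avg": mean(sizes), "min": min(sizes), "max": max(sizes)}


def combined_gain(a, b):
    """Relative increase in unique templates when pooling `a` and `b`.

    Returns a pair: the gain relative to `a` alone and relative to `b` alone.
    """
    union = a | b
    return (len(union) - len(a)) / len(a), (len(union) - len(b)) / len(b)


# Hypothetical template IDs, for illustration only.
human_runs = [{"T1", "T2", "T3"}, {"T1", "T2", "T3", "T4"}]
llm_runs = [{"T3", "T4", "T5", "T6", "T7"}, {"T3", "T5", "T6"}]

# Pool all runs of a strategy into one set of observed templates.
human_all = set().union(*human_runs)
llm_all = set().union(*llm_runs)

summary = coverage_summary(human_runs)
gain_vs_human, gain_vs_llm = combined_gain(human_all, llm_all)
print(summary)
print(f"{gain_vs_human:.1%} over human-written, {gain_vs_llm:.1%} over LLM")
```

Under this reading, the two percentages reported for each pairing share the same numerator (the size of the union) and differ only in the baseline, which is why the strategy with lower standalone coverage shows the larger relative increase.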