Benchmarks are among the main drivers of progress in software engineering research, especially in software testing and debugging. However, current benchmarks in this field are poorly suited to some research tasks: they rely on weak system oracles such as crash detection, come with only a few unit tests, require further elaborate research effort, or cannot verify the outcome of system tests. Our Tests4Py benchmark addresses these issues. It is derived from the popular BugsInPy benchmark and includes 30 bugs from 5 real-world Python applications. Each subject in Tests4Py comes with an oracle that verifies the functional correctness of system inputs. In addition, it supports the generation of both system tests and unit tests, which allows qualitative studies of essential aspects of test sets as well as extensive evaluations. These capabilities make Tests4Py a next-generation benchmark for research in test generation, debugging, and automatic program repair.