Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.