Bug reproduction is critical in the software debugging and repair process, yet the majority of bugs in open-source and industrial settings lack executable tests to reproduce them at the time they are reported, making diagnosis and resolution more difficult and time-consuming. To address this challenge, we introduce AssertFlip, a novel technique for automatically generating Bug Reproducible Tests (BRTs) using large language models (LLMs). Unlike existing methods that attempt direct generation of failing tests, AssertFlip first generates tests that pass on the buggy behaviour and then inverts these tests so they fail when the bug is present. We hypothesize that LLMs are better at writing passing tests than tests that are meant to crash or fail. Our results show that AssertFlip outperforms all known techniques on the leaderboard of SWT-Bench, a benchmark curated for BRTs. Specifically, AssertFlip achieves a fail-to-pass success rate of 43.6% on the SWT-Bench-Verified subset.