Towards More Realistic Evaluation for Neural Test Oracle Generation

Effective unit tests help guard and improve software quality, but writing and maintaining them takes substantial time and effort. A unit test consists of a test prefix and a test oracle. Synthesizing test oracles, especially functional oracles, is a well-known challenge. Recent studies have proposed using neural models to generate test oracles, i.e., neural test oracle generation (NTOG), and have reported promising results. However, after a systematic inspection, we find that existing evaluation methods for NTOG rely on several inappropriate settings that can give a misleading picture of how well existing NTOG approaches perform. We summarize them as 1) generating test prefixes from bug-fixed program versions, 2) evaluating with an unrealistic metric, and 3) lacking a straightforward baseline. In this paper, we first investigate how these settings affect the evaluation and understanding of NTOG approaches. We find that 1) unrealistically generating test prefixes from bug-fixed program versions inflates the number of bugs found by the state-of-the-art NTOG approach TOGA by 61.8%, 2) FPR (False Positive Rate) is not a realistic evaluation metric, and the Precision of TOGA is only 0.38%, and 3) a straightforward baseline, NoException, which simply expects that no exception is raised, can find 61% of the bugs found by TOGA with twice the Precision. Furthermore, we add a ranking step to existing evaluation methods and propose an evaluation metric named Found@K to better measure the cost-effectiveness of NTOG approaches. We propose a novel unsupervised ranking method to instantiate this ranking step, which significantly improves the cost-effectiveness of TOGA. Finally, we propose a more realistic evaluation method for NTOG, TEval+, and summarize seven rules of thumb to help move NTOG approaches toward practical use.
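
To make the contrast between FPR and Precision concrete, the following minimal Python sketch computes both from confusion-matrix counts and adds one possible reading of a Found@K-style metric. The counts are invented for illustration and are not results from the paper. The `TestOutcome` record, its `score` field, and the interpretation of Found@K as "distinct bugs exposed by the K highest-ranked failing tests" are assumptions, since the abstract does not define the metric formally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TestOutcome:
    """Hypothetical record for one generated test run on the buggy version."""
    fails: bool              # True if the test raises an alarm (fails)
    bug_id: Optional[str]    # bug exposed by the failure, None for a false alarm
    score: float = 0.0       # hypothetical ranking score, higher is inspected first


def precision(tp: int, fp: int) -> float:
    """Fraction of failing tests that expose a real bug."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of non-bug-revealing tests that nevertheless fail."""
    return fp / (fp + tn) if (fp + tn) else 0.0


def found_at_k(outcomes: List[TestOutcome], k: int) -> int:
    """Assumed reading of Found@K: distinct bugs among the top-k failing tests."""
    failing = [o for o in outcomes if o.fails]
    ranked = sorted(failing, key=lambda o: o.score, reverse=True)
    return len({o.bug_id for o in ranked[:k] if o.bug_id is not None})


# Invented counts: many passing tests keep FPR low even though
# almost every failing test is a false alarm.
tp, fp, tn = 4, 996, 99_000
print(f"FPR       = {false_positive_rate(fp, tn):.4f}")
print(f"Precision = {precision(tp, fp):.4f}")

outcomes = [
    TestOutcome(True, "bug-1", 0.9),
    TestOutcome(True, None, 0.8),
    TestOutcome(True, "bug-1", 0.7),
    TestOutcome(True, "bug-2", 0.6),
    TestOutcome(False, None, 0.95),
]
print("Found@2 =", found_at_k(outcomes, 2))
print("Found@4 =", found_at_k(outcomes, 4))
```

With these made-up counts, FPR is about 1% while Precision is 0.4%: a developer inspecting failing tests would still find that nearly all of them are false alarms. That gap is the kind of effect that motivates reporting Precision and a ranking-aware metric alongside FPR.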
