An "adequate" test suite should find all inconsistencies between a system's requirements or specifications and its implementation. Practitioners frequently use code coverage to approximate adequacy, while academics argue that mutation score may better approximate true (oracular) adequacy. High code coverage is increasingly attainable even on large systems via automatic test generation, including fuzzing. Given all of these options for measuring and improving a testing effort, how should a QA engineer spend their time? We propose a new framework for reasoning about the extent, limits, and nature of a given testing effort, based on an idea we call the oracle gap: the difference between source code coverage and mutation score for a given software element. We conduct (1) a large-scale observational study of the oracle gap across popular Maven projects, (2) a study that varies testing and oracle quality across several of those projects, and (3) a small-scale observational study of highly critical, well-tested code across comparable blockchain projects. We show that the oracle gap surfaces important information about the extent and quality of a test effort beyond what either adequacy metric provides alone. In particular, it gives practitioners a way to identify source files in which important code is likely tested by a weak oracle.
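As a rough illustration of the definition above, the oracle gap for a file can be sketched as its coverage minus its mutation score. The sketch below assumes line coverage and a killed-over-total mutation score, both computed per file; the class `FileMetrics`, the function `rank_by_oracle_gap`, and the choice of denominators are hypothetical and not taken from the paper, which may normalize either metric differently.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileMetrics:
    """Per-file test metrics (hypothetical structure, for illustration only)."""
    path: str
    covered_lines: int
    total_lines: int
    killed_mutants: int
    total_mutants: int

    @property
    def coverage(self) -> float:
        # Fraction of lines executed by the test suite (assumed: line coverage).
        return self.covered_lines / self.total_lines if self.total_lines else 0.0

    @property
    def mutation_score(self) -> float:
        # Fraction of generated mutants that the test suite detects (kills).
        return self.killed_mutants / self.total_mutants if self.total_mutants else 0.0

    @property
    def oracle_gap(self) -> float:
        # Oracle gap as described in the text: coverage minus mutation score.
        return self.coverage - self.mutation_score


def rank_by_oracle_gap(files: Iterable[FileMetrics]) -> List[FileMetrics]:
    """Return files ordered from largest to smallest oracle gap."""
    return sorted(files, key=lambda f: f.oracle_gap, reverse=True)
```

Under these assumptions, a file with high coverage but a low mutation score sorts to the top of the ranking, which matches the abstract's suggestion that a large gap flags code that is executed by tests but weakly checked by their oracle.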