Beyond Coverage and Kill Scores: Empirically Measuring Test Suite Behavioural Gaps

Traditional test adequacy metrics measure a system's implementation, not whether it adheres to its expected behaviour. While developers rely heavily on code coverage and mutation testing to assess test suite quality, these metrics are fundamentally implementation-centric and cannot detect gaps between what the code is expected to do and what it actually does. Until now there has been no reliable way to detect these discrepancies. In this paper we introduce an automated proof-of-concept approach to investigate them. The approach extracts expected method-level behaviours from natural language documentation and source code, maps them to existing test cases, and identifies gaps between expected and validated behaviours. We evaluate the approach across ten popular open-source Java libraries comprising 8,922 methods, extracting 20,729 behaviours with 93.1% precision. Our empirical analysis conservatively estimates that 17.5% of detected expected behaviours remain entirely untested, which we term the test suite's behavioural gap. To determine whether these gaps are merely an artifact of human-driven testing, we evaluate state-of-the-art automated test generators (EVOSUITE / ASTER), finding that they similarly fail to validate at least 20.6% / 27.1% of detected expected behaviours. We further demonstrate that behavioural gaps are not predicted by traditional structural metrics: the majority of untested behaviours occur in methods that already have high line coverage, and over half persist in methods with a high mutation kill score. These results suggest that behavioural coverage acts as an independent dimension of test suite adequacy that can complement traditional structural metrics.
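As a rough illustration of the final step described above (identifying expected behaviours that no test validates), the following sketch computes a behavioural gap over a toy data set. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Behaviour` class, the `behavioural_gap` function, the mapping format, and all sample data are hypothetical, and the extraction and test-mapping steps are assumed to have already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behaviour:
    """One expected method-level behaviour (hypothetical representation)."""
    method: str
    description: str


def behavioural_gap(expected, validated_by):
    """Return (untested behaviours, gap ratio).

    expected:     iterable of Behaviour objects extracted for a library.
    validated_by: dict mapping a Behaviour to the names of tests mapped to it.
    A behaviour counts as untested if it has no entry or an empty test list.
    """
    expected = list(expected)
    untested = [b for b in expected if not validated_by.get(b)]
    ratio = len(untested) / len(expected) if expected else 0.0
    return untested, ratio


# Toy data, invented purely for illustration.
behaviours = [
    Behaviour("Stack.pop", "returns the most recently pushed element"),
    Behaviour("Stack.pop", "throws EmptyStackException when the stack is empty"),
    Behaviour("Stack.push", "increases size by one"),
    Behaviour("Stack.peek", "does not remove the returned element"),
]
test_mapping = {
    behaviours[0]: ["testPopReturnsLast"],
    behaviours[2]: ["testPushGrows", "testPushThenPop"],
    behaviours[3]: [],  # mapped, but no test validates it
}

untested, gap = behavioural_gap(behaviours, test_mapping)
for b in untested:
    print(f"UNTESTED {b.method}: {b.description}")
print(f"behavioural gap: {gap:.1%}")
```

In this toy example `Stack.pop` would typically show full line coverage from `testPopReturnsLast` alone, yet its documented exception behaviour is never validated, which is the kind of gap that structural metrics would not reveal.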
