Automated test-generation research overwhelmingly assumes the correctness of focal methods, yet practitioners routinely face non-regression scenarios in which the focal method may be defective. A baseline evaluation of EVOSUITE and two leading Large Language Model (LLM)-based generators, CHATTESTER and CHATUNITEST, on defective focal methods reveals that, despite achieving up to 83% branch coverage, none of the generated tests exposes a defect, owing to a lack of awareness of developer intent. To address this problem, we first construct two new benchmarks, Defects4J-Desc and QuixBugs-Desc, in which each focal method is paired with a Natural Language Description (NLD) that captures its intended functionality. We then propose DISTINCT, a description-guided branch-consistency analysis framework that turns LLMs into fault-aware test generators. DISTINCT comprises three iterative components: (1) a Generator that derives initial tests from the NLD and the focal method, (2) a Validator that iteratively repairs uncompilable tests using compiler diagnostics, and (3) an Analyzer that iteratively aligns test behavior with NLD semantics via branch-level analysis. Extensive experiments confirm the effectiveness of our approach. Compared with state-of-the-art approaches, DISTINCT achieves average improvements of 14.64% in Compilation Success Rate (CSR), 6.66% in Passing Rate (PR), and, most notably, 95.22% in Defect Detection Rate (DDR) across both benchmarks. In terms of code coverage, DISTINCT improves Statement Coverage (SC) by an average of 3.77% and Branch Coverage (BC) by 5.36%. These results establish a new baseline for non-regressive test generation and highlight how description-driven reasoning enables LLMs to move beyond coverage chasing toward effective defect detection.
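The Generator → Validator → Analyzer loop described above can be sketched as follows. This is a minimal, hypothetical illustration of the control flow only; every function name below is an illustrative stand-in (not the authors' actual API), and the stubs simulate the LLM, the compiler, and the branch-consistency analysis so the loop can be exercised end to end.

```python
# Hypothetical sketch of the three-component iterative loop the abstract
# describes. All names are illustrative stand-ins; the stubs simulate an
# LLM, a compiler, and the branch-level analysis.

def generate_tests(focal_method, nld):
    # (1) Generator: draft initial tests from the focal method and its NLD.
    return {"code": f"test for {focal_method}", "compiles": False, "aligned": False}

def compile_diagnostics(tests):
    # Stub compiler: emits a diagnostic until the tests compile.
    return None if tests["compiles"] else "error: missing import"

def repair(tests, diagnostics):
    # (2) Validator: feed compiler diagnostics back to the LLM for a fix.
    return {**tests, "compiles": True}

def branch_mismatches(tests, nld):
    # (3) Analyzer: compare test behavior against NLD semantics branch by branch.
    return None if tests["aligned"] else "else-branch contradicts NLD"

def align(tests, mismatches):
    # Analyzer feedback loop: revise tests so each branch matches the NLD.
    return {**tests, "aligned": True}

def distinct_loop(focal_method, nld, max_rounds=3):
    tests = generate_tests(focal_method, nld)
    for _ in range(max_rounds):
        diag = compile_diagnostics(tests)
        if diag:                      # uncompilable: repair first
            tests = repair(tests, diag)
            continue
        mism = branch_mismatches(tests, nld)
        if mism:                      # compilable but semantically off
            tests = align(tests, mism)
        else:                         # compilable and NLD-consistent
            break
    return tests
```

The key design point the abstract emphasizes is that alignment is checked against the NLD rather than against the (possibly defective) focal method's observed behavior, which is what lets the resulting tests expose defects instead of merely locking in current behavior.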