An LLM-based Readability Measurement for Unit Tests' Context-aware Inputs

Automated test generation techniques usually produce unit tests with higher code coverage than manually written tests. However, the readability of automated tests is also crucial for code comprehension and maintenance. The readability of unit tests involves many aspects; in this paper, we focus on test inputs. The central limitation of existing studies on input readability is that they consider the test code alone and ignore the source code under test. As a result, they either overlook the fact that different source code imposes different readability requirements or require manual effort to write readable inputs. We observe, however, that the source code specifies the contexts that test inputs must satisfy. Based on this observation, we introduce the \underline{C}ontext \underline{C}onsistency \underline{C}riterion (a.k.a. C3), a readability measurement tool that leverages Large Language Models to extract the readability contexts of primitive-type (including string-type) parameters from the source code and checks whether test inputs are consistent with those contexts. We also propose EvoSuiteC3, which leverages the contexts extracted by C3 to help EvoSuite generate readable test inputs. We evaluate C3's performance on $409$ \java{} classes and compare the readability of manual and automated tests under C3's measurement. The results are twofold. First, the Precision, Recall, and F1-Score of C3's mined readability contexts are \precision{}, \recall{}, and \fone{}, respectively. Second, under C3's measurement, the string-type input readability scores of EvoSuiteC3, ChatUniTest (an LLM-based test generation tool), manual tests, and two traditional tools (EvoSuite and Randoop) are $90\%$, $83\%$, $68\%$, $8\%$, and $8\%$, respectively, showing the traditional tools' inability to generate readable string-type inputs.
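To make the idea of context consistency concrete, the following is a minimal sketch and not C3's actual implementation. It assumes that a readability context for a string parameter can be approximated by a regular expression, and that a readability score is the fraction of test inputs consistent with that context; the abstract states neither detail. The class name, method names, and the email pattern are all hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

public class ContextConsistencySketch {

    // Hypothetical stand-in for an extracted readability context:
    // a string parameter named "email" is expected to look like an email address.
    static final Pattern EMAIL_CONTEXT =
            Pattern.compile("[\\w.]+@[\\w.]+\\.[a-z]{2,}");

    // A test input is treated as consistent if it fully matches the context pattern.
    static boolean isConsistent(Pattern context, String input) {
        return context.matcher(input).matches();
    }

    // Assumed scoring rule: fraction of inputs that are consistent with the context.
    static double consistencyRate(Pattern context, List<String> inputs) {
        if (inputs.isEmpty()) {
            return 0.0;
        }
        long consistent = inputs.stream()
                .filter(input -> isConsistent(context, input))
                .count();
        return (double) consistent / inputs.size();
    }

    public static void main(String[] args) {
        // One human-looking input and one random-looking input.
        List<String> inputs = List.of("alice@example.com", "6#Xq!");
        System.out.println(consistencyRate(EMAIL_CONTEXT, inputs));
    }
}
```

Under these assumptions, an input such as `"alice@example.com"` counts as consistent with the `email` context, while a random-looking string such as `"6#Xq!"` does not.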
