Large Language Models (LLMs) have demonstrated significant potential in automating software testing, specifically in generating unit test cases. However, validating LLM-generated test cases remains a challenge, particularly when the ground truth is unavailable. This paper introduces VALTEST, a novel framework that automatically validates test cases generated by LLMs by leveraging token probabilities. We evaluate VALTEST on nine test suites generated from three datasets (HumanEval, MBPP, and LeetCode) across three LLMs (GPT-4o, GPT-3.5-turbo, and Llama 3.1 8B). By extracting statistical features from token probabilities, we train a machine learning model to predict test case validity. VALTEST increases the validity rate of test cases by 6.2% to 24%, depending on the dataset and LLM. Our results suggest that token probabilities are reliable indicators for distinguishing valid from invalid test cases, providing a robust solution for improving the correctness of LLM-generated test cases in software testing. In addition, we found that replacing the invalid test cases identified by VALTEST, using Chain-of-Thought prompting, yields a more effective test suite while maintaining high validity rates.
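The core idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the specific summary statistics, the toy data, and the choice of logistic regression are assumptions made for demonstration.

```python
# Hypothetical sketch: summarize a generated test case's per-token
# probabilities into statistical features, then train a classifier
# that predicts whether the test case is valid. The feature set and
# classifier here are illustrative, not VALTEST's exact design.
import numpy as np
from sklearn.linear_model import LogisticRegression

def token_prob_features(token_probs):
    """Summary statistics over one test case's token probabilities."""
    p = np.asarray(token_probs, dtype=float)
    return [p.mean(), p.min(), p.max(), p.std(), np.log(p).sum()]

# Toy training data: each row holds the token probabilities of one
# generated test case; labels mark validity (1 = valid, 0 = invalid).
train_probs = [
    [0.99, 0.97, 0.95, 0.98],  # confident generation
    [0.40, 0.55, 0.30, 0.60],  # uncertain generation
    [0.95, 0.90, 0.97, 0.93],
    [0.50, 0.45, 0.35, 0.52],
]
labels = [1, 0, 1, 0]

X = np.array([token_prob_features(p) for p in train_probs])
clf = LogisticRegression().fit(X, labels)

# Score a new test case from its token probabilities.
new_case = token_prob_features([0.96, 0.94, 0.98, 0.92])
print(clf.predict([new_case])[0])
```

In practice, the token probabilities would come from the LLM's log-probability output during test-case generation; test cases the classifier flags as invalid can then be filtered out or regenerated.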