Large Language Models (LLMs) have demonstrated significant potential in automating software testing, specifically in generating unit test cases. However, validating LLM-generated test cases remains a challenge, particularly when the ground truth is unavailable. This paper introduces VALTEST, a novel framework that automatically validates test cases generated by LLMs by leveraging token probabilities. We evaluate VALTEST on nine test suites generated from three datasets (HumanEval, MBPP, and LeetCode) across three LLMs (GPT-4o, GPT-3.5-turbo, and Llama 3.1 8B). By extracting statistical features from token probabilities, we train a machine learning model to predict test case validity. VALTEST increases the validity rate of test cases by 6.2% to 24%, depending on the dataset and LLM. Our results suggest that token probabilities are reliable indicators for distinguishing valid from invalid test cases, providing a robust solution for improving the correctness of LLM-generated test cases in software testing. In addition, we found that replacing the invalid test cases identified by VALTEST, using Chain-of-Thought prompting, yields a more effective test suite while maintaining high validity rates.
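The core idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the specific summary statistics, the toy data, and the choice of logistic regression are assumptions made for demonstration.

```python
# Hypothetical sketch: summarize a generated test case's per-token
# probabilities into statistical features, then train a classifier
# that predicts whether the test case is valid. The feature set and
# classifier here are illustrative, not VALTEST's exact design.
import numpy as np
from sklearn.linear_model import LogisticRegression

def token_prob_features(token_probs):
    """Summary statistics over one test case's token probabilities."""
    p = np.asarray(token_probs, dtype=float)
    return [p.mean(), p.min(), p.max(), p.std(), np.log(p).sum()]

# Toy training data: each row holds the token probabilities of one
# generated test case; labels mark validity (1 = valid, 0 = invalid).
train_probs = [
    [0.99, 0.97, 0.95, 0.98],  # confident generation
    [0.40, 0.55, 0.30, 0.60],  # uncertain generation
    [0.95, 0.90, 0.97, 0.93],
    [0.50, 0.45, 0.35, 0.52],
]
labels = [1, 0, 1, 0]

X = np.array([token_prob_features(p) for p in train_probs])
clf = LogisticRegression().fit(X, labels)

# Score a new test case from its token probabilities.
new_case = token_prob_features([0.96, 0.94, 0.98, 0.92])
print(clf.predict([new_case])[0])
```

In practice, the token probabilities would come from the LLM's log-probability output during test-case generation; test cases the classifier flags as invalid can then be filtered out or regenerated.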