Recent large language models (LLMs) for code, such as CodeX and CodeT5+, show tremendous promise in achieving code intelligence. Their ability to synthesize code that completes a program for a predefined task has been intensively tested and verified on benchmark datasets including HumanEval and MBPP. Yet, given their broad scope of applications in software engineering, evaluation of these LLMs from perspectives beyond program synthesis is also anticipated. In this paper, we explore the ability of LLMs to test programs/code. Through thorough analyses of recent LLMs for code in program testing, we show a series of intriguing properties of these models and demonstrate how their program testing ability can be improved. Following recent work that uses generated test cases to enhance program synthesis, we further leverage our findings to improve the quality of the synthesized programs and show 11.77% and 4.22% higher code pass rates on HumanEval+ compared with the GPT-3.5-turbo baseline and the recent state-of-the-art, respectively.