The remarkable capability of large language models (LLMs) to generate high-quality code has drawn increasing attention in the software testing community. However, existing code LLMs often generate inaccurate or incomplete tests because they were trained on code snippets collected without distinguishing test code from other code. In this paper, we present UniTSyn, a large-scale dataset for improving LLMs on unit test synthesis. Associating tests with the functions they test is crucial for LLMs to infer the expected behavior and the logic paths to be verified. By leveraging the Language Server Protocol, UniTSyn collects focal-test pairs without per-project execution setups or per-language heuristics that tend to be fragile and difficult to scale. It contains 2.7 million focal-test pairs across five mainstream programming languages. The details of UniTSyn can be found in Table 1. Our experiments demonstrate that an autoregressive model built on UniTSyn learns unit test representations more effectively, resulting in improved generation accuracy and code coverage across all evaluated programming languages. Code and data will be publicly available.
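To make the Language Server Protocol step concrete, the following is a minimal sketch of how a go-to-definition query for a call site inside a test could be framed. It is not the UniTSyn implementation: the helper names `build_definition_request` and `parse_message`, the file URI, and the cursor position are hypothetical. Only the `textDocument/definition` method, the JSON-RPC 2.0 envelope, and the `Content-Length` framing come from the protocol itself.

```python
import json


def build_definition_request(request_id, file_uri, line, character):
    """Frame an LSP textDocument/definition request (hypothetical helper).

    Positions are zero-based, as the protocol requires.
    """
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": file_uri},
            "position": {"line": line, "character": character},
        },
    }
    body = json.dumps(payload).encode("utf-8")
    # The base protocol prefixes each message with its byte length.
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def parse_message(raw):
    """Decode one framed message; assumes the complete message is in `raw`."""
    header_block, _, rest = raw.partition(b"\r\n\r\n")
    length = None
    for header_line in header_block.split(b"\r\n"):
        name, _, value = header_line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


# Assumed example: the cursor sits on a function call inside a test file.
message = build_definition_request(1, "file:///repo/tests/test_math.py", 4, 11)
request = parse_message(message)
```

A language server that supports this request answers with the location of the definition of the symbol at that position, which is what would let a call inside a test be traced back to the function it exercises without executing the project.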