The advent of large language models (LLMs) such as GitHub Copilot has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning, and as LLMs grow larger and more capable, fine-tuning them for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which update only a subset of model parameters, offer a promising solution by reducing the computational cost of tuning LLMs while maintaining their performance. Existing studies have applied PEFT and LLMs to various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. Their application to unit test generation, however, remains underexplored: the state of the art is limited to LLMs with full fine-tuning. This paper investigates both full fine-tuning and several PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes, using well-established benchmark datasets to evaluate their effectiveness for unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most efficient in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.
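To make the parameter-efficiency claim concrete, the following is a minimal sketch of why LoRA trains far fewer parameters than full fine-tuning. It assumes the standard LoRA formulation (a frozen weight matrix W plus a trainable low-rank update B·A of rank r); the hidden size and rank below are illustrative values, not figures from this paper.

```python
# Sketch: trainable-parameter counts for full fine-tuning vs. LoRA on a
# single d x d weight matrix. LoRA freezes W and learns a low-rank update
# B @ A, where A is r x d and B is d x r, so only 2*d*r parameters are
# trained instead of d*d. Dimensions here are illustrative assumptions.

def full_finetune_params(d: int) -> int:
    """Full fine-tuning: every entry of the d x d matrix is trainable."""
    return d * d

def lora_params(d: int, r: int) -> int:
    """LoRA: only A (r x d) and B (d x r) are trainable; W stays frozen."""
    return 2 * d * r

d, r = 4096, 8  # hypothetical hidden size and LoRA rank
full = full_finetune_params(d)
lora = lora_params(d, r)
print(f"full: {full:,}  lora: {lora:,}  fraction trained: {lora / full:.4%}")
# For d=4096, r=8: LoRA trains under 0.4% of the parameters of this layer.
```

The same arithmetic extends across all adapted layers of a model, which is why PEFT methods can cut tuning cost so sharply while, as the results above indicate, remaining competitive with full fine-tuning.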