Deep learning (DL)-based code completion tools have transformed software development by enabling advanced code generation. These tools leverage models trained on vast amounts of code from numerous repositories, capturing general coding patterns. However, the impact of fine-tuning these models for specific organizations or developers to boost their performance on the code of such subjects remains unexplored. In this work, we fill this gap with solid empirical evidence. More specifically, we consider 136 developers from two organizations (Apache and Spring), two model architectures (T5 and Code Llama), and three model sizes (60M, 750M, and 7B trainable parameters). The T5 models (60M, 750M) were pre-trained and fine-tuned on over 2,000 open-source projects, excluding the subject organizations' data, and compared against versions further fine-tuned on organization- and developer-specific datasets. For the Code Llama model (7B), we compared the publicly available pre-trained checkpoint with the same model fine-tuned via parameter-efficient fine-tuning on organization- and developer-specific datasets. Our results show that both organization-specific and developer-specific additional fine-tuning boost prediction capabilities, with the former being particularly effective. This finding generalizes across (i) the two subject organizations (i.e., Apache and Spring) and (ii) models of completely different scale (from 60M to 7B trainable parameters). Finally, we show that DL models fine-tuned on an organization-specific dataset achieve the same completion performance as pre-trained code models that are $\sim$10$\times$ larger and used out of the box, with consequent savings in deployment and inference costs (e.g., smaller GPUs are needed).
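To make the parameter-efficient fine-tuning step concrete, the following is a minimal sketch of LoRA-based adaptation of the public Code Llama 7B checkpoint, assuming the Hugging Face transformers and peft libraries. The checkpoint name matches the public release; the LoRA rank, alpha, and target modules are illustrative assumptions, not the configuration reported in this study.

```python
# Minimal LoRA sketch: freeze the 7B base weights and train only small
# low-rank adapter matrices. Hyperparameters below are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "codellama/CodeLlama-7b-hf"  # publicly available pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed value)
    lora_alpha=32,                        # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 7B is trainable
```

The adapted model can then be trained as usual on an organization- or developer-specific dataset; because only the adapters receive gradients, the memory and compute cost of fine-tuning the 7B model is drastically reduced.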