Modern Natural Language Processing (NLP) techniques have been shown to benefit software engineering tasks such as vulnerability detection and type inference. However, training deep NLP models requires significant computational resources. This paper explores techniques that aim to make the best use of the resources and the information available in these models. We propose a generic approach, EarlyBIRD, to build composite representations of code from the early layers of a pre-trained transformer model. We empirically investigate the viability of this approach on the CodeBERT model by comparing the performance of 12 strategies for creating composite representations with the standard practice of using only the last encoder layer. Our evaluation on four datasets shows that several early-layer combinations yield better performance on defect detection, and some combinations improve multi-class classification. More specifically, on Devign we obtain an average improvement of 2 points in detection accuracy with only 3 of CodeBERT's 12 layers, together with a 3.3x speed-up in fine-tuning. These findings show that early layers can be used to obtain better results with the same resources, as well as to reduce resource usage during fine-tuning and inference.
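To make the idea of a composite representation concrete, the following is a minimal sketch in plain Python. It assumes the per-layer hidden states of an encoder are already available as nested lists (layers, then tokens, then vector components). The function names `composite_from_early_layers` and `last_layer_baseline`, the toy data, and the choice of averaging the first-token vectors are all illustrative assumptions; the abstract does not specify the 12 strategies, so this should not be read as one of them.

```python
from typing import List

Vector = List[float]
LayerOutput = List[Vector]  # one vector per input token


def composite_from_early_layers(hidden_states: List[LayerOutput],
                                num_layers: int) -> Vector:
    """Average the first-token vector over the first `num_layers` layers.

    Hypothetical combination strategy, chosen only for illustration.
    """
    if not 1 <= num_layers <= len(hidden_states):
        raise ValueError("num_layers must be between 1 and the number of layers")
    # Keep only the early layers; later layers are never touched,
    # which is where a saving in computation could come from.
    first_tokens = [layer[0] for layer in hidden_states[:num_layers]]
    dim = len(first_tokens[0])
    return [sum(vec[i] for vec in first_tokens) / num_layers for i in range(dim)]


def last_layer_baseline(hidden_states: List[LayerOutput]) -> Vector:
    """Standard practice: use the first-token vector of the last layer only."""
    return list(hidden_states[-1][0])


# Toy stand-in for a 12-layer encoder: 2 tokens, 2-dimensional vectors.
toy_states = [[[float(n), 2.0 * n], [0.0, 0.0]] for n in range(1, 13)]

early_repr = composite_from_early_layers(toy_states, num_layers=3)
baseline_repr = last_layer_baseline(toy_states)
print(early_repr)     # [2.0, 4.0]
print(baseline_repr)  # [12.0, 24.0]
```

In an actual setup, either vector would be passed to a classification head for fine-tuning; the sketch only illustrates how a representation built from 3 of 12 layers differs from the last-layer baseline.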