The use of modern Natural Language Processing (NLP) techniques has been shown to be beneficial for software engineering tasks, such as vulnerability detection and type inference. However, training deep NLP models requires significant computational resources. This paper explores techniques that aim to make the best use of the resources and information available in these models. We propose a generic approach, EarlyBIRD, to build composite representations of code from the early layers of a pre-trained transformer model. We empirically investigate the viability of this approach on the CodeBERT model by comparing the performance of 12 strategies for creating composite representations with the standard practice of using only the last encoder layer. Our evaluation on four datasets shows that several early-layer combinations yield better performance on defect detection, and some combinations improve multi-class classification. More specifically, we obtain an average improvement of +2 in detection accuracy on Devign with only 3 out of 12 layers of CodeBERT, together with a 3.3x speed-up of fine-tuning. These findings show that early layers can be used to obtain better results with the same resources, as well as to reduce resource usage during fine-tuning and inference.
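To make the idea of a composite representation concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the per-layer hidden states of an encoder are already available as nested lists indexed `[layer][token][dimension]`, with index 0 taken to be the first encoder layer and token 0 the classification token. The helper names (`cls_from_layer`, `max_pool_cls`, `weighted_sum_cls`) and the toy data are hypothetical, and the three combinations shown are illustrative; they are not claimed to match any of the 12 strategies evaluated in the paper.

```python
from typing import List

Vector = List[float]
# Assumed layout: hidden_states[layer][token] is one hidden vector.
HiddenStates = List[List[Vector]]


def cls_from_layer(hidden_states: HiddenStates, layer: int) -> Vector:
    """Use the first-token vector of a single (possibly early) layer."""
    return list(hidden_states[layer][0])


def max_pool_cls(hidden_states: HiddenStates, num_layers: int) -> Vector:
    """Element-wise max over the first-token vectors of the first num_layers layers."""
    cls_vectors = [hidden_states[i][0] for i in range(num_layers)]
    return [max(values) for values in zip(*cls_vectors)]


def weighted_sum_cls(hidden_states: HiddenStates, weights: List[float]) -> Vector:
    """Weighted sum of the first-token vectors of the first len(weights) layers."""
    cls_vectors = [hidden_states[i][0] for i in range(len(weights))]
    return [
        sum(w * v for w, v in zip(weights, values))
        for values in zip(*cls_vectors)
    ]


# Toy stand-in for a 12-layer encoder: 2 tokens, hidden size 3.
# The first-token vector of layer l is [l, 0, -l].
toy = [
    [[float(l + t), float(l * t), float(-l)] for t in range(2)]
    for l in range(12)
]

single = cls_from_layer(toy, 2)                    # layer index 2 only
pooled = max_pool_cls(toy, 3)                      # first 3 of 12 layers
mixed = weighted_sum_cls(toy, [0.5, 0.25, 0.25])   # fixed example weights
```

In a real setting the vectors would come from the pre-trained model, and any weights would presumably be learned during fine-tuning rather than fixed as here. Because each function reads only the first few layers, the later layers would not need to be computed at all, which is one plausible reading of where the reported resource savings come from.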