Machine Learning (ML) for software engineering (SE) has gained prominence due to its ability to significantly enhance the performance of various SE applications. This progress is largely attributed to the development of generalizable source code representations that effectively capture the syntactic and semantic characteristics of code. In recent years, pre-trained transformer-based models, inspired by natural language processing (NLP), have shown remarkable success in SE tasks. However, source code contains structural and semantic properties embedded within its grammar, which can be extracted from structured code-views like the Abstract Syntax Tree (AST), Data-Flow Graph (DFG), and Control-Flow Graph (CFG). These code-views can complement NLP techniques, further improving SE tasks. Unfortunately, there are no flexible frameworks to effectively infuse arbitrary code-views into existing transformer-based models. Therefore, in this work, we propose CodeSAM, a novel scalable framework that infuses multiple code-views into transformer-based models by creating self-attention masks. We use CodeSAM to fine-tune a small language model (SLM) like CodeBERT on the downstream SE tasks of semantic code search, code clone detection, and program classification. Experimental results show that fine-tuning with individual code-views or a combination of code-views improves downstream performance on all three tasks compared to SLMs like GraphCodeBERT and CodeBERT. We believe these results indicate that techniques like CodeSAM can help create compact yet performant code SLMs suited to resource-constrained settings.
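To make the core idea concrete, the following is a minimal illustrative sketch (not the paper's exact algorithm) of how a self-attention mask can be derived from code-view edges: each token is allowed to attend to itself and to tokens it is connected to in a code-view (e.g., AST parent-child or DFG def-use links), and multiple code-views are combined by merging their edge sets. The token indices and edge lists below are hypothetical.

```python
import numpy as np

def build_sam(num_tokens, edges):
    """Build a boolean self-attention mask from code-view edges.

    edges: iterable of (i, j) token-index pairs taken from a
    code-view such as an AST or DFG (illustrative assumption).
    """
    mask = np.eye(num_tokens, dtype=bool)  # every token attends to itself
    for i, j in edges:
        mask[i, j] = True
        mask[j, i] = True  # treat the relation as symmetric
    return mask

# Hypothetical example: 4 tokens, AST edges (0,1) and (1,2),
# plus a DFG edge (0,3); code-views are merged by unioning edges.
ast_edges = [(0, 1), (1, 2)]
dfg_edges = [(0, 3)]
mask = build_sam(4, ast_edges + dfg_edges)
```

In a transformer, a mask like this would be applied to the attention logits (e.g., setting disallowed positions to a large negative value before the softmax), restricting each token's attention to structurally related tokens.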