While most existing pre-trained models of code learn source code features such as code tokens and abstract syntax trees, some works instead learn from compiler intermediate representations (IRs). Existing IR-based models typically use IR features such as instructions, control and data flow graphs (CDFGs), and call graphs. However, these methods conflate variable nodes and instruction nodes in a CDFG and fail to distinguish different types of flows, and the neural networks they use fail to capture long-distance dependencies and suffer from over-smoothing and over-squashing. To address these weaknesses, we propose FAIR, a Flow type-Aware pre-trained model for IR that employs (1) a novel input representation of IR programs; (2) a Graph Transformer to address the over-smoothing, over-squashing, and long-distance dependency problems; and (3) five pre-training tasks that we specifically design to enable FAIR to learn the semantics of IR tokens, flow type information, and the overall representation of IR. Experimental results show that FAIR achieves state-of-the-art results on four code-related downstream tasks.