Transformer-based code models perform impressively on many software engineering tasks. However, their effectiveness degrades when symbols are missing or uninformative, because without symbols the model may not learn to attend to the right correlations and contexts. We propose a new method for pre-training general code models when symbols are lacking. We observe that in such cases programs degenerate to something written in a very primitive language. We therefore propose using program analysis to extract contexts a priori (instead of relying on symbols and masked language modeling, as vanilla models do). We then leverage a novel attention masking method that allows the model to attend only to these contexts, e.g., bi-directional program dependence transitive closures and token co-occurrences. Meanwhile, the inherent self-attention mechanism learns which of the allowed attentions are more important than others. To realize the idea, we enhance the vanilla tokenization and model architecture of a BERT model, construct and utilize attention masks, and introduce a new pre-training algorithm. We pre-train this BERT-like model from scratch on a dataset of 26 million stripped binary functions with explicit program dependence information extracted by our tool. We apply the model to three downstream tasks: binary similarity, type inference, and malware family classification. Our pre-trained model improves the SOTA in these tasks from 53% to 64%, 49% to 60%, and 74% to 94%, respectively. It also substantially outperforms other general pre-training techniques for code understanding models.
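To make the attention-masking idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each token has an integer index and that program dependences arrive as directed `(source, target)` index pairs. The function names `dependence_closure`, `build_attention_mask`, and `masked_attention` are hypothetical, and the token co-occurrence context mentioned in the abstract is not modeled here.

```python
import numpy as np

def dependence_closure(num_tokens, edges):
    """Boolean reachability matrix over directed dependence edges (Warshall's algorithm)."""
    reach = np.zeros((num_tokens, num_tokens), dtype=bool)
    for src, dst in edges:
        reach[src, dst] = True
    for k in range(num_tokens):
        # i reaches j if i reaches k and k reaches j
        reach |= reach[:, k][:, None] & reach[k, :][None, :]
    return reach

def build_attention_mask(num_tokens, edges):
    """Allow attention along the transitive closure in both directions, plus self-attention."""
    reach = dependence_closure(num_tokens, edges)
    allowed = reach | reach.T          # bi-directional: follow dependences forward and backward
    np.fill_diagonal(allowed, True)    # assumption: every token may attend to itself
    return allowed

def masked_attention(scores, allowed):
    """Row-wise softmax over raw attention scores, with disallowed pairs forced to zero weight."""
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical example: token 0 feeds token 1, token 1 feeds token 2, token 3 is unrelated.
allowed = build_attention_mask(4, [(0, 1), (1, 2)])
weights = masked_attention(np.zeros((4, 4)), allowed)
```

In this sketch the mask only decides which token pairs may interact. The softmax over the remaining scores is what would let a trained model weight some allowed contexts more heavily than others, which mirrors the division of labor the abstract describes.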