CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking

Transformer-based code models have impressive performance in many software engineering tasks. However, their effectiveness degrades when symbols are missing or uninformative. The reason is that, without the help of symbols, the model may not learn to pay attention to the right correlations and contexts. We propose a new method to pre-train general code models when symbols are lacking. We observe that in such cases, programs degenerate to something written in a very primitive language. We hence propose to use program analysis to extract contexts a priori (instead of relying on symbols and masked language modeling as in vanilla models). We then leverage a novel attention masking method that allows the model to attend only to these contexts, e.g., bi-directional program dependence transitive closures and token co-occurrences. Meanwhile, the inherent self-attention mechanism is used to learn which of the allowed attentions are more important than others. To realize the idea, we enhance the vanilla tokenization and model architecture of a BERT model, construct and utilize attention masks, and introduce a new pre-training algorithm. We pre-train this BERT-like model from scratch, using a dataset of 26 million stripped binary functions with explicit program dependence information extracted by our tool. We apply the model to three downstream tasks: binary similarity, type inference, and malware family classification. Our pre-trained model improves the state of the art (SOTA) in these tasks from 53% to 64%, 49% to 60%, and 74% to 94%, respectively. It also substantially outperforms other general pre-training techniques for code understanding models.
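The attention masking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it works at the instruction level only, ignores token co-occurrence, and the names `transitive_closure`, `dependence_attention_mask`, and `masked_softmax`, as well as the four-instruction example, are hypothetical. It assumes dependence edges are already available as (definition, use) index pairs, computes their transitive closure, makes it bi-directional, and then restricts a softmax row to the allowed positions.

```python
import math
from itertools import product


def transitive_closure(n, edges):
    """Warshall's algorithm: reach[i][j] is True if j is reachable from i."""
    reach = [[False] * n for _ in range(n)]
    for src, dst in edges:
        reach[src][dst] = True
    # product() varies its first element slowest, so k is the outer loop.
    for k, i, j in product(range(n), repeat=3):
        if reach[i][k] and reach[k][j]:
            reach[i][j] = True
    return reach


def dependence_attention_mask(n, edges):
    """Allow i to attend to j if i == j or they are related in either direction."""
    reach = transitive_closure(n, edges)
    return [[i == j or reach[i][j] or reach[j][i] for j in range(n)]
            for i in range(n)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get zero weight."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical stripped snippet (illustrative only):
#   0: mov rax, [rdi]
#   1: add rax, 1
#   2: mov rbx, [rsi]
#   3: mov [rdx], rax
# Assumed dependence edges (definition -> use): 0 -> 1, 1 -> 3.
edges = [(0, 1), (1, 3)]
mask = dependence_attention_mask(4, edges)

# Uniform raw scores for instruction 0; the mask removes instruction 2.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], mask[0])
```

In this sketch, instruction 0 may attend to instruction 3 through the transitive chain 0 -> 1 -> 3, while the unrelated instruction 2 receives zero weight. Among the positions that remain allowed, the relative weights still come from the raw scores, which in a trained model would be learned by self-attention.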


