Pre-Training Representations of Binary Code Using Contrastive Learning

Compiled software is delivered as executable binary code. Developers write source code to express the software semantics, but the compiler converts it to a binary format that the CPU can directly execute. Therefore, binary code analysis is critical to applications in reverse engineering and computer security tasks where source code is not available. However, unlike source code and natural language, which contain rich semantic information, binary code is typically difficult for human engineers to understand and analyze. While existing work uses AI models to assist source code analysis, few studies have considered binary code. In this paper, we propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning. Specifically, we present three components in COMBO: (1) a primary contrastive learning method for cold-start pre-training, (2) a simplex interpolation method to incorporate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to provide binary code embeddings. Finally, we evaluate the effectiveness of the pre-trained representations produced by COMBO using three indicative downstream tasks relating to binary code: algorithmic functionality classification, binary code similarity, and vulnerability detection. Our experimental results show that COMBO facilitates representation learning of binary code, as visualized by distribution analysis, and improves performance on all three downstream tasks by 5.45% on average compared to state-of-the-art large-scale language representation models. To the best of our knowledge, COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning and unifies multiple tasks for binary code analysis.

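The abstract names simplex interpolation and contrastive learning as components but does not give their formulation. As a rough illustration only, the sketch below shows one generic way such pieces could fit together: a convex combination of three modality embeddings (source code, comment, binary code) with weights drawn from the probability simplex, scored with an InfoNCE-style loss. The function names, the weight-sampling scheme, the cosine similarity, and the choice of InfoNCE are all assumptions made for illustration and are not taken from COMBO.

```python
import math
import random


def simplex_weights(rng, k=3):
    """Draw k non-negative weights that sum to 1 (assumed sampling scheme)."""
    # Normalised Exp(1) samples are uniformly distributed on the simplex.
    draws = [-math.log(1.0 - rng.random()) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def interpolate(vectors, weights):
    """Convex combination of equally sized embedding vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """InfoNCE loss: cross-entropy of the positive among all candidates."""
    logits = [cosine(anchor, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[positive_index]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy embeddings of one function in three modalities.
    source, comment, binary = [1.0, 0.2, 0.0], [0.9, 0.1, 0.1], [0.8, 0.3, 0.0]
    unrelated_binary = [0.0, 0.1, 1.0]  # hypothetical negative example

    mixed = interpolate([source, comment, binary], simplex_weights(rng))
    loss = info_nce(mixed, [binary, unrelated_binary], positive_index=0)
    print(f"contrastive loss for the interpolated anchor: {loss:.4f}")
```

In this toy setup the interpolated anchor stays inside the triangle spanned by the three modality embeddings, so the loss is small when the matching binary embedding is the positive and large when the unrelated one is.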