Pre-Training Representations of Binary Code Using Contrastive Learning

Binary code analysis and comprehension are critical to reverse engineering and computer security tasks in which source code is unavailable. Unfortunately, unlike source code, binary code lacks semantic information and is harder for human engineers to understand and analyze. In this paper, we present ContraBin, a contrastive learning technique that integrates source code and comment information with binaries to create an embedding that aids binary analysis and comprehension tasks. Specifically, ContraBin has three components: (1) a primary contrastive learning method for initial pre-training, (2) a simplex interpolation method to integrate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to train a binary code embedding. We further analyze the impact of human-written and synthetic comments on binary code comprehension tasks and find a significant performance disparity. While synthetic comments provide substantial benefits, human-written comments introduce noise and can even reduce performance relative to using no comments. These findings call for rethinking the role of comment types in binary code analysis. We evaluate the effectiveness of ContraBin on four indicative downstream tasks related to binary code: algorithmic functionality classification, function name recovery, code summarization, and reverse engineering. The results show that ContraBin considerably improves performance on all four tasks, measured by accuracy, mean average precision, and BLEU scores as appropriate. ContraBin is the first language representation model to incorporate source code, binary code, and comments into contrastive code representation learning and is intended to contribute to the field of binary code analysis. The dataset used in this study is available for further research.
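
To make the first two components more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that embeddings are fixed-length vectors, that "simplex interpolation" can be illustrated by a convex combination of a source-code embedding and a comment embedding, and that the contrastive objective is the standard InfoNCE loss; the abstract specifies none of these details. The function names, the weights, the temperature, and the toy data are all hypothetical.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def simplex_interpolate(src, cmt, weights):
    """Convex combination of two embeddings (assumed form of the interpolation).

    The weights must be non-negative and sum to 1, i.e. lie on the 1-simplex.
    """
    w_src, w_cmt = weights
    assert w_src >= 0 and w_cmt >= 0 and abs(w_src + w_cmt - 1.0) < 1e-9
    return [w_src * s + w_cmt * c for s, c in zip(src, cmt)]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] is paired with positives[i]; every other row of `positives`
    acts as a negative for anchors[i].
    """
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy data (hypothetical): source and comment embeddings are noisy copies of
# the binary embeddings, standing in for the outputs of real encoders.
random.seed(0)
dim, batch = 8, 4
binary = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
source = [[b + 0.1 * random.gauss(0, 1) for b in row] for row in binary]
comment = [[b + 0.1 * random.gauss(0, 1) for b in row] for row in binary]

# Mix source and comment views, then contrast the mix against the binary view.
mixed = [simplex_interpolate(s, c, (0.7, 0.3)) for s, c in zip(source, comment)]
loss_aligned = info_nce(binary, mixed)
loss_shuffled = info_nce(binary, mixed[1:] + mixed[:1])  # deliberately mispaired
```

In this sketch, correctly paired views give a lower loss than mispaired ones, which is the signal a contrastive pre-training step would exploit to pull the binary embedding toward its matching source-and-comment representation.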
