Binary code similarity detection (BCSD) is widely used in binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models outperform traditional feature-based approaches. In this paper, we propose UniASM, a novel transformer-based binary code embedding model that learns representations of binary functions. We design two new training tasks that make the spatial distribution of the generated vectors more uniform, so that the vectors can be used directly in BCSD without any fine-tuning. In addition, we present a new tokenization approach for binary functions that increases the semantic information carried by each token and mitigates the out-of-vocabulary (OOV) problem. Through ablation experiments, we conduct an in-depth analysis of the factors affecting model performance and report several new findings. The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approach on the evaluation dataset: the average Recall@1 scores in the cross-compiler, cross-optimization-level, and cross-obfuscation settings are 0.77, 0.72, and 0.72, respectively. Moreover, in the real-world task of known vulnerability search, UniASM outperforms all current baselines.
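To make the Recall@1 metric concrete, the following is a minimal sketch of how such a score could be computed from function embeddings. It is not the authors' evaluation code. The function name `recall_at_k`, the use of cosine similarity, and the convention that query row i matches pool row i are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def recall_at_k(queries: np.ndarray, pool: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose true match is among the k nearest pool vectors.

    Assumption (hypothetical setup): row i of `queries` and row i of `pool`
    embed the same source function compiled under two different settings.
    """
    # L2-normalize rows so that the dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sims = q @ p.T  # shape: (num_queries, pool_size)

    # Indices of the k most similar pool vectors for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]

    # A query is a hit if its own row index appears in its top-k list.
    truth = np.arange(len(queries))[:, None]
    return float(np.mean(np.any(topk == truth, axis=1)))


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, chosen by hand for illustration only.
    pool = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
    queries = np.array([[0.9, 0.1, 0.0],   # closest to pool row 0 (hit)
                        [0.1, 0.8, 0.1],   # closest to pool row 1 (hit)
                        [0.7, 0.0, 0.3]])  # closest to pool row 0 (miss)
    print(recall_at_k(queries, pool, k=1))
    print(recall_at_k(queries, pool, k=2))
```

In this toy setup two of the three queries rank their true match first, so Recall@1 is 2/3, and all three have it within the top two. Because the embeddings are compared directly, no task-specific fine-tuning step appears in the sketch, which mirrors the usage the abstract describes.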