Binary Code Embedding (BCE) has important applications in reverse engineering tasks such as binary code similarity detection, type recovery, control-flow recovery, and data-flow analysis. Recent studies have shown that Transformer models can capture the semantics of binary code well enough to support downstream tasks. However, existing models overlook prior knowledge of assembly language. In this paper, we propose kTrans, a novel Transformer-based approach that generates knowledge-aware binary code embeddings. By feeding explicit knowledge to the Transformer as additional inputs and fusing implicit knowledge through a novel pre-training task, kTrans offers a new perspective on incorporating domain knowledge into a Transformer framework. We inspect the generated embeddings with outlier detection and visualization, and apply kTrans to three downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR), and Indirect Call Recognition (ICR). Evaluation results show that kTrans generates high-quality binary code embeddings and outperforms state-of-the-art (SOTA) approaches on these tasks by 5.2%, 6.8%, and 12.6%, respectively. kTrans is publicly available at: https://github.com/Learner0x5a/kTrans-release