Binary code search plays a crucial role in applications such as software reuse detection. Existing models typically rely either on internal code semantics alone or on a combination of function call graphs (CGs) and internal code semantics, and both approaches have limitations. Models based on internal code semantics consider only the semantics within a function and ignore inter-function semantics, which makes it difficult to handle situations such as function inlining. Combining CGs with internal code semantics is still insufficient for complex real-world scenarios. To address these limitations, we propose BinEnhance, a novel framework that leverages inter-function semantics to enhance the expression of internal code semantics for binary code search. Specifically, BinEnhance constructs an External Environment Semantic Graph (EESG), which establishes a stable and analogous external environment for homologous functions by using different inter-function semantic relations (e.g., call, location, data-co-use). After constructing the EESG, we initialize its nodes with the embeddings generated by existing internal code semantic models. Finally, we design a Semantic Enhancement Model (SEM) that uses Relational Graph Convolutional Networks (RGCNs) and a residual block to learn valuable external semantics on the EESG and generate the enhanced semantic embedding. In addition, BinEnhance uses data feature similarity to refine the cosine similarity of semantic embeddings. We conduct experiments on six different tasks (e.g., under the function inlining scenario), and the results demonstrate the performance and robustness of BinEnhance. Applying BinEnhance to HermesSim, Asm2vec, TREX, Gemini, and Asteria on two public datasets improves Mean Average Precision (MAP) from 53.6% to 69.7%. Moreover, the efficiency increases fourfold.
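To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It shows one relational graph convolution step with a residual connection over a toy graph, followed by a similarity score that blends cosine similarity with a data feature similarity. The function names, the mean normalization per relation, the ReLU activation, the use of Jaccard overlap for data features, and the blending weight `alpha` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math

# Relation types taken from the abstract's examples; the identifiers are hypothetical.
RELATIONS = ("call", "location", "data_co_use")


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def rgcn_residual_layer(h, edges, w_rel, w_self):
    """One relational graph convolution step plus a residual connection.

    h      : dict node -> embedding (list of floats), e.g. from an internal-semantics model
    edges  : list of (src, dst, relation) triples
    w_rel  : dict relation -> square weight matrix (assumed, one per relation)
    w_self : square weight matrix for the node's own embedding (assumed)
    """
    out = {}
    for node, vec in h.items():
        msg = matvec(w_self, vec)
        for rel, weight in w_rel.items():
            neighbours = [s for s, d, r in edges if d == node and r == rel]
            for src in neighbours:
                contrib = matvec(weight, h[src])
                # Mean-normalize messages within each relation (an assumption).
                msg = [m + c / len(neighbours) for m, c in zip(msg, contrib)]
        activated = [max(0.0, m) for m in msg]  # ReLU (an assumption)
        # Residual connection: keep the original internal embedding in the output.
        out[node] = [x + a for x, a in zip(vec, activated)]
    return out


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(a, b):
    """Jaccard overlap of two collections of data features (e.g. string constants)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def refined_similarity(emb_a, emb_b, data_a, data_b, alpha=0.8):
    """Blend embedding similarity with data feature similarity.

    The weighted sum and the default alpha are illustrative assumptions.
    """
    return alpha * cosine(emb_a, emb_b) + (1 - alpha) * jaccard(data_a, data_b)
```

In this sketch, a function node that receives a `call` edge from a neighbour ends up with an embedding that mixes its own vector with the neighbour's, while the residual term keeps the original internal semantics from being washed out. A learned model would replace the fixed matrices with trained parameters.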