VEXIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity

S. VenkataKeerthy, Soumya Banerjee, Sayan Dey, Yashas Andaluri, Raghul PS, Subrahmanyam Kalyanasundaram, Fernando Magno Quintão Pereira, Ramakrishna Upadrasta

Binary similarity involves determining whether two binary programs exhibit similar functionality, often originating from the same source code. In this work, we propose VexIR2Vec, an approach for binary similarity using VEX-IR, an architecture-neutral Intermediate Representation (IR). We extract embeddings from sequences of basic blocks, termed peepholes, derived by random walks on the control-flow graph. The peepholes are normalized using transformations inspired by compiler optimizations. Using these transformations, the VEX-IR Normalization Engine mitigates architectural and compiler-induced variations in binaries while exposing semantic similarities. We then learn a vocabulary of representations at the entity level of the IR in an unsupervised manner, using knowledge graph embedding techniques. This vocabulary is used to derive function embeddings for similarity assessment using VexNet, a feed-forward Siamese network designed to position similar functions closely and separate dissimilar ones in an n-dimensional space. This approach is amenable to both diffing and searching tasks, and is robust to Out-Of-Vocabulary (OOV) issues. We evaluate VexIR2Vec on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled with 12 compilers targeting x86 and ARM architectures. In diffing experiments, VexIR2Vec outperforms the nearest baselines by $40\%$, $18\%$, $21\%$, and $60\%$ in cross-optimization, cross-compilation, cross-architecture, and obfuscation settings, respectively. In the searching experiment, VexIR2Vec achieves a mean average precision of $0.76$, outperforming the nearest baseline by $46\%$. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open-source tools. VexIR2Vec is $3.1$-$3.5\times$ faster than the closest baselines and orders of magnitude faster than other tools.
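The peephole idea described in the abstract (short sequences of basic blocks obtained by random walks on the control-flow graph) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_peepholes`, the parameters `num_walks`, `max_len`, and `seed`, the uniform choice of successor, and the toy CFG are all assumptions made for illustration. The abstract does not specify the walk policy or the peephole length.

```python
import random


def extract_peepholes(cfg, num_walks=2, max_len=3, seed=0):
    """Collect peepholes (lists of basic-block ids) via random walks.

    cfg maps each basic-block id to the list of its successor ids.
    num_walks, max_len, and seed are hypothetical knobs, not values
    taken from the paper.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    peepholes = []
    for start in sorted(cfg):  # start walks from every basic block
        for _ in range(num_walks):
            walk = [start]
            current = start
            # Follow a uniformly chosen successor until the walk is
            # long enough or reaches a block with no successors.
            while len(walk) < max_len and cfg.get(current):
                current = rng.choice(cfg[current])
                walk.append(current)
            peepholes.append(walk)
    return peepholes


# Toy CFG of an if/else diamond (assumed example, not from the paper).
toy_cfg = {
    "entry": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}

peepholes = extract_peepholes(toy_cfg)
for walk in peepholes:
    print(" -> ".join(walk))
```

In a pipeline like the one the abstract outlines, each block in a walk would be replaced by its normalized VEX-IR statements before embedding. The sketch stops at block identifiers, since the normalization rules are not given here.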
