We propose VEXIR2Vec, a code embedding framework for finding similar functions in binaries. Our representations rely on VEX IR, the intermediate representation used by binary analysis tools such as Valgrind and angr. The proposed embeddings encode both syntactic and semantic information to represent a function, and are both application and architecture independent. We also propose POV, a custom Peephole Optimization engine that normalizes the VEX IR for effective similarity analysis. POV implements several optimizations, including copy/constant propagation, constant folding, common subexpression elimination, and load-store elimination. We evaluate our framework on two experiments, diffing and searching, involving binaries that target different architectures and are compiled with different compilers and compiler versions, optimization sequences, and obfuscations. We report results on several standard projects and on real-world vulnerabilities. Our results show that VEXIR2Vec achieves higher precision and recall than state-of-the-art approaches. Our framework is highly scalable and is built as a multi-threaded, parallel library using only open-source tools. VEXIR2Vec achieves about $3.2 \times$ speedup over the closest competitor, and orders-of-magnitude speedup over other tools.
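To illustrate the kind of normalization that peephole passes such as constant propagation, copy propagation, and constant folding perform, the following is a minimal sketch on a toy three-address IR. It is not VEX IR and not the POV implementation: the statement format `(dest, op, args)`, the op names (`const`, `copy`, `get`, `put`, `add`, `mul`), and the function `normalize` are all hypothetical, and the sketch assumes every temporary is assigned exactly once and is not used outside the block.

```python
from typing import Dict, List, Tuple, Union

Operand = Union[int, str]                     # integer constant or a name
Stmt = Tuple[str, str, Tuple[Operand, ...]]   # (dest, op, args), a toy format

# Hypothetical pure ops that can be folded when all arguments are constants.
FOLDABLE = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def normalize(block: List[Stmt]) -> List[Stmt]:
    """Apply constant/copy propagation and constant folding to one block.

    Assumes single-assignment temporaries that are dead after the block,
    so definitions that become known values can be dropped.
    """
    env: Dict[str, Operand] = {}   # temp -> known constant or aliased name
    out: List[Stmt] = []
    for dest, op, args in block:
        # Propagation: substitute operands whose value or alias is known.
        args = tuple(env.get(a, a) if isinstance(a, str) else a for a in args)
        if op in ("const", "copy"):
            env[dest] = args[0]    # remember the value, emit nothing
            continue
        if op in FOLDABLE and all(isinstance(a, int) for a in args):
            env[dest] = FOLDABLE[op](*args)   # constant folding
            continue
        out.append((dest, op, args))
    return out

block: List[Stmt] = [
    ("t0", "const", (4,)),
    ("t1", "const", (8,)),
    ("t2", "add", ("t0", "t1")),
    ("t3", "copy", ("t2",)),
    ("t4", "get", ("rbx",)),        # read a register: value unknown
    ("t5", "add", ("t4", "t3")),
    ("rax", "put", ("t5",)),        # write a register: always kept
]

normalized = normalize(block)
for stmt in normalized:
    print(stmt)
```

In this sketch the seven statements shrink to three, with the folded constant `12` appearing directly as an operand. Two blocks that differ only in how such constants and copies are materialized would reduce to the same sequence, which is the intuition behind normalizing before comparing functions.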