SemDiff: Binary Similarity Detection by Diffing Key-Semantics Graphs

Binary similarity detection is a critical technique that has been applied in many real-world scenarios where source code is not available, e.g., bug search, malware analysis, and code plagiarism detection. Existing approaches are ineffective at detecting similar binaries when different compiler optimizations, compilers, source code versions, or obfuscation are applied. We observe that none of these cases changes a binary's key code behaviors, although they significantly modify its syntax and structure. Based on this observation, we extract a set of key instructions from a binary to capture its key code behaviors. By measuring the similarity between two binaries' key instructions, we can overcome this limitation of existing approaches. Specifically, we translate each extracted key instruction into a self-defined key expression and generate a key-semantics graph based on the binary's control flow. Each node in the key-semantics graph denotes a key instruction, and the node attribute is its key expression. To quantify the similarity between two given key-semantics graphs, we first serialize each graph into a sequence of key expressions by topological sort. Then, we tokenize and concatenate the key expressions to generate token lists. Finally, we calculate a locality-sensitive hash value for each token list and quantify the similarity between the hash values. We implement a prototype, called SemDiff, consisting of two modules: graph generation and graph diffing. The first module generates a pair of key-semantics graphs and the second module diffs the graphs. Our evaluation results show that, overall, SemDiff outperforms state-of-the-art tools when detecting the similarity of binaries generated with different optimization levels, compilers, and obfuscations. SemDiff is also effective for library version search and for finding similar vulnerabilities in firmware.
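
The graph-diffing steps described above (serialize by topological sort, tokenize and concatenate, hash, compare) can be illustrated with a minimal Python sketch. It is not SemDiff's implementation, and several details are assumptions made for illustration only: the key-expression syntax in the demo is hypothetical, the graph is assumed to be acyclic (for example, with loop back edges already removed), ties in the topological order are broken by node id, and SimHash stands in for the locality-sensitive hash, since the abstract does not name a specific scheme.

```python
import hashlib
import heapq
import re

# Assumed token grammar: identifiers, decimal numbers, single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def serialize(succs, exprs):
    """Return key expressions in topological order (Kahn's algorithm).

    succs maps a node id to its successor ids; exprs maps a node id to its
    key expression. Ties are broken by the smallest node id (an assumption).
    """
    indeg = {n: 0 for n in exprs}
    for src in succs:
        for dst in succs[src]:
            indeg[dst] += 1
    ready = [n for n, d in indeg.items() if d == 0]
    heapq.heapify(ready)
    out = []
    while ready:
        n = heapq.heappop(ready)
        out.append(exprs[n])
        for dst in succs.get(n, ()):
            indeg[dst] -= 1
            if indeg[dst] == 0:
                heapq.heappush(ready, dst)
    if len(out) != len(exprs):
        raise ValueError("graph has a cycle; topological sort is undefined")
    return out


def tokenize(expr_seq):
    """Tokenize each key expression and concatenate into one token list."""
    tokens = []
    for expr in expr_seq:
        tokens.extend(TOKEN_RE.findall(expr))
    return tokens


def simhash(tokens):
    """64-bit SimHash of a token list (stand-in for the paper's LSH)."""
    weights = [0] * 64
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if weights[i] > 0)


def similarity(hash_a, hash_b):
    """Fraction of matching bits between two 64-bit hashes, in [0, 1]."""
    return 1.0 - bin(hash_a ^ hash_b).count("1") / 64


if __name__ == "__main__":
    # Hypothetical key-semantics graphs for two variants of one function.
    succs = {0: [1, 2], 1: [3], 2: [3], 3: []}
    exprs_a = {0: "call(malloc, arg0)", 1: "mem[rax] = 0",
               2: "cmp(rax, 0)", 3: "ret(rax)"}
    exprs_b = dict(exprs_a)
    exprs_b[1] = "mem[rax + 8] = 0"
    hash_a = simhash(tokenize(serialize(succs, exprs_a)))
    hash_b = simhash(tokenize(serialize(succs, exprs_b)))
    print(similarity(hash_a, hash_b))
```

Because this sketch hashes individual tokens, the result is insensitive to token order; hashing token n-grams instead would make the score reflect the serialized order more strongly, at the cost of being less tolerant of small reorderings.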
