Binary code similarity detection is an important problem with applications in malware analysis, vulnerability research, and license violation detection. This paper proposes a novel graph neural network architecture combined with a new graph data representation called call graphlets. A call graphlet encodes the neighborhood around each function in a binary executable, capturing local and global context through a series of statistical features. A specialized graph neural network model operates on this representation and is trained with deep metric learning to map each graphlet to a feature vector that encodes semantic binary code similarity. The approach is evaluated on five distinct datasets covering different architectures, compiler toolchains, and optimization levels. Experimental results show that the combination of call graphlets and the proposed architecture achieves comparable or state-of-the-art performance relative to baseline techniques across cross-architecture, mono-architecture, and zero-shot tasks. The approach also performs well on an out-of-domain function inlining task. The work provides a general and effective graph neural network-based solution for binary code similarity detection.
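To make the idea of a call graphlet more concrete, the following is a minimal sketch of extracting a function's neighborhood from a call graph. It rests on assumptions that the abstract does not confirm: the neighborhood is taken to be the one-hop callers and callees of the target function, and the per-node statistical features (`n_instructions`, `n_calls`) are hypothetical placeholders. The helper name `build_call_graphlet` and the toy call graph are illustrative only and are not taken from the paper.

```python
from collections import defaultdict


def build_call_graphlet(call_edges, target, node_features):
    """Return an assumed one-hop call graphlet centered on `target`.

    call_edges    : iterable of (caller, callee) pairs for the whole binary
    target        : name of the function the graphlet is centered on
    node_features : dict mapping function name -> dict of statistical features
                    (the feature names used here are hypothetical)
    """
    callees = defaultdict(set)
    callers = defaultdict(set)
    for src, dst in call_edges:
        callees[src].add(dst)
        callers[dst].add(src)

    # Assumption: the neighborhood is the target plus its direct callers/callees.
    nodes = {target} | callees[target] | callers[target]

    # Keep every call edge whose endpoints both fall inside the neighborhood.
    edges = sorted({(s, d) for s, d in call_edges if s in nodes and d in nodes})

    return {
        "center": target,
        "nodes": sorted(nodes),
        "edges": edges,
        "features": {n: node_features.get(n, {}) for n in sorted(nodes)},
    }


# Toy call graph (illustrative data, not from the paper's datasets).
call_edges = [
    ("main", "parse"),
    ("main", "run"),
    ("run", "parse"),
    ("parse", "read_byte"),
    ("run", "log"),
]
node_features = {
    "main": {"n_instructions": 40, "n_calls": 2},
    "parse": {"n_instructions": 120, "n_calls": 1},
    "run": {"n_instructions": 75, "n_calls": 2},
    "read_byte": {"n_instructions": 12, "n_calls": 0},
    "log": {"n_instructions": 20, "n_calls": 0},
}

graphlet = build_call_graphlet(call_edges, "parse", node_features)
```

In a pipeline of the kind the abstract describes, a structure like `graphlet` would be the per-function input to the graph neural network, which is trained so that graphlets of semantically similar functions map to nearby feature vectors.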