CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection

Binary code similarity detection (BCSD) is a fundamental technique for many applications. Many BCSD solutions have been proposed recently, most of them embedding-based, but they show limited accuracy and efficiency, especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overhead. Specifically, CEBin first uses a refined embedding-based approach to extract features of the target code, which efficiently narrows the set of candidate similar code and boosts performance. It then applies a comparison-based approach that performs pairwise comparison on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin provides an effective and efficient solution for detecting similar code (including vulnerable code) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in the real world, we construct a large-scale vulnerability benchmark, offering the first precise evaluation scheme for assessing BCSD methods on the 1-day vulnerability detection task. CEBin can identify the similar function among millions of candidate functions in just a few seconds and achieves an impressive recall rate of $85.46\%$ on this more practical but challenging task, which is several orders of magnitude faster and $4.07\times$ better than the best SOTA baseline. Our code is available at https://github.com/Hustcw/CEBin.
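The two-stage idea described above (cheap embedding-based retrieval to shortlist candidates, then a costlier pairwise comparison on the shortlist only) can be illustrated with a minimal sketch. This is not CEBin's implementation: the function names, the toy two-dimensional vectors, the cosine-similarity retrieval, and the Jaccard score used as a stand-in for a learned comparison model are all assumptions made purely for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical candidate record: (function name, embedding, instruction text).
Candidate = Tuple[str, Vector, str]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: str, b: str) -> float:
    """Toy pairwise score: token-set overlap. Stand-in for a learned comparison model."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def two_stage_search(
    query_vec: Vector,
    query_code: str,
    pool: List[Candidate],
    pairwise_score: Callable[[str, str], float],
    k: int,
) -> List[Tuple[str, float]]:
    # Stage 1 (embedding-based): rank the whole pool by a cheap vector similarity
    # and keep only the top-k candidates.
    shortlist = sorted(pool, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:k]
    # Stage 2 (comparison-based): run the more expensive pairwise scorer on the
    # shortlist only, then re-rank by that score.
    rescored = [(name, pairwise_score(query_code, code)) for name, _, code in shortlist]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy data: f_a is slightly closer in embedding space, f_b matches the query exactly.
pool = [
    ("f_a", [1.0, 0.0], "push mov add ret"),
    ("f_b", [0.9, 0.1], "push mov sub ret"),
    ("f_c", [0.0, 1.0], "call jmp nop ret"),
]
result = two_stage_search([1.0, 0.05], "push mov sub ret", pool, jaccard, k=2)
print(result)
```

In this toy run, stage 1 discards `f_c` and ranks `f_a` slightly ahead of `f_b`, while the pairwise comparison in stage 2 moves `f_b` to the top. The cost argument is that the expensive scorer runs on only `k` candidates rather than on the full pool.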
