Binary similarity analysis determines whether two binary executables originate from the same source program. Existing techniques leverage static and dynamic program features and may employ advanced deep learning techniques. Although they have demonstrated great potential, the community believes that a more effective representation of program semantics can further improve similarity analysis. In this paper, we propose a new method, PEM, to represent binary program semantics. It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries. More importantly, it ensures that the collected samples are comparable across binaries, addressing the substantial variations in input specifications. Our evaluation on 9 real-world projects with 35k functions and a comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% under common settings, outperforming the baselines by 10-20%.