We consider the problem of program clone search: given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to the target program. Potential applications include reverse engineering, program clustering, malware lineage and software theft detection. Recent years have witnessed a surge in code similarity techniques, yet most of them focus on function-level similarity and function clone search, whereas we are interested in program-level similarity and program clone search. Our study shows that prior similarity approaches are either too slow to handle large program repositories, not precise enough, or not robust against the slight variations introduced by compilers, source code versions or light obfuscations. We propose a novel spectral analysis method for program-level similarity and program clone search called Programs Spectral Similarity (PSS). In a nutshell, PSS extracts spectral features from each program only once, which suits large repositories and makes it a good fit for program clone search. We have compared the different approaches on extensive benchmarks, showing that PSS reaches a sweet spot in terms of precision, speed and robustness.