Program similarity has become an increasingly popular area of research, with security applications such as plagiarism detection, author identification, and malware analysis. However, program similarity research faces a few unique dataset quality problems when evaluating the effectiveness of novel approaches. First, few high-quality datasets for binary program similarity exist and see wide use in this domain. Second, there are many disparate definitions of what makes one program similar to another, and there is often a large semantic gap between the labels provided by a dataset and any useful notion of behavioral or semantic similarity. In this paper, we present HELIX, a framework for generating large, synthetic program similarity datasets. We also introduce Blind HELIX, a tool built on top of HELIX that automatically extracts HELIX components from library code using program slicing. We evaluate HELIX and Blind HELIX by comparing the performance of program similarity tools on a HELIX dataset to their performance on a hand-crafted dataset built from multiple, disparate notions of program similarity. Using Blind HELIX, we show that HELIX can generate realistic and useful datasets of virtually infinite size for program similarity research, with ground truth labels that embody practical notions of program similarity. Finally, we discuss the results and reason about relative tool ranking.