Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while large corpuses of malicious binaries exist, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine-learning-based pipelines for binary analysis rely on either costly commercial corpuses (e.g., VirusTotal) or open-source binaries (e.g., coreutils) that are available only in limited quantities. To address these issues, we present Assemblage: an extensible, cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible: users can publish "recipes" for their datasets and extract a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpuses of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage can be downloaded from https://assemblage-dataset.net.