Assemblage: Automatic Binary Dataset Construction for Machine Learning

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while large corpora of malicious binaries exist, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine-learning-based pipelines for binary analysis rely on either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) that are available only in limited quantities. To address these issues, we present Assemblage: an extensible, cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpora suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpora of high-quality Windows PE binaries in training modern learning-based binary analyses. The Assemblage code is open-sourced under the MIT license, and the dataset is available at https://assemblage-dataset.net.
