In this paper, we provide practical tools to improve the scientific soundness of firmware corpora beyond the state of the art. We identify binary analysis challenges that significantly impact corpus creation and use them to derive a framework of key corpus requirements that support the scientific goals of replicability and representativeness. We apply the framework to 44 top-tier papers and collect 704 data points, which show that there is currently no common ground on corpus creation. We find that, even in otherwise excellent work, incomplete documentation and inflated corpus sizes obscure representativeness and hinder replicability. Our results show that the strict framework provides useful and practical guidelines that can identify minuscule stepping stones in corpus creation with significant impact on soundness. Finally, we show that it is possible to meet all requirements by providing a new corpus called LFwC. It is designed for large-scale static analyses of Linux-based firmware and consists of 10,913 high-quality images covering 2,365 network appliances. We share rich metadata and scripts for replicability with the community. We verify unpacking, perform deduplication, identify contents, and provide bug ground truth. We also identify ISAs and Linux kernels. All samples can be unpacked with the open-source tool FACT.