Deep learning research for binary analysis faces a critical infrastructure gap. Existing datasets target single platforms, require specialized tooling, or provide only hand-engineered features that are incompatible with modern neural architectures; no single dataset supports accessible research and pedagogy on realistic use cases. To address this gap, we introduce Binary-30K, the first heterogeneous binary dataset designed for sequence-based models such as transformers. Binary-30K covers Windows, Linux, macOS, and Android across more than 15 CPU architectures. With 29,793 binaries, approximately 26.93% of which are malware, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding. The dataset provides pre-computed byte-level BPE tokenization alongside comprehensive structural metadata, supporting both sequence modeling and structure-aware approaches. Platform-first stratified sampling ensures representative coverage across operating systems and architectures, and distribution via Hugging Face with official train/validation/test splits enables reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/mjbommar/binary-30k as an accessible resource for researchers, practitioners, and students.
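The abstract does not spell out how platform-first stratified sampling is carried out. As a rough illustration of the general idea only, the sketch below groups samples by platform, then by architecture, and splits each group proportionally into train/validation/test. The function name `platform_first_split`, the record keys `platform` and `arch`, and the 80/10/10 ratios are all assumptions made for this example, not details of the Binary-30K pipeline.

```python
import random
from collections import defaultdict


def platform_first_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split records into train/validation/test within each (platform, arch) stratum.

    `records` is a list of dicts with the hypothetical keys "platform" and "arch".
    `ratios` gives the (train, validation, test) fractions; only the last two
    are used directly, and the remainder of each stratum goes to train.
    """
    rng = random.Random(seed)

    # Build strata keyed by platform first, then architecture.
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["platform"], rec["arch"])].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    # Iterate in sorted order so the result is deterministic for a given seed.
    for key in sorted(strata):
        group = strata[key]
        rng.shuffle(group)
        n = len(group)
        n_val = int(n * ratios[1])
        n_test = int(n * ratios[2])
        n_train = n - n_val - n_test
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])
    return splits
```

Because each stratum is split separately, every platform/architecture combination with enough samples appears in all three splits, which is the property the abstract describes as representative coverage. Very small strata (fewer than ten samples under these ratios) would land entirely in the training split; a real pipeline would need an explicit policy for them.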