Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection

Deep learning research for binary analysis faces a critical infrastructure gap. Existing datasets target single platforms, require specialized tooling, or provide only hand-engineered features that are incompatible with modern neural architectures; no single dataset supports accessible research and pedagogy on realistic use cases. To address this, we introduce Binary-30K, the first heterogeneous binary dataset designed for sequence-based models such as transformers. Binary-30K covers Windows, Linux, macOS, and Android across more than 15 CPU architectures. With 29,793 binaries, approximately 26.93% of them malware, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding. The dataset provides pre-computed byte-level BPE tokenization alongside comprehensive structural metadata, supporting both sequence modeling and structure-aware approaches. Platform-first stratified sampling ensures representative coverage across operating systems and architectures, and distribution via Hugging Face with official train/validation/test splits enables reproducible benchmarking. The dataset is publicly available at https://huggingface.co/datasets/mjbommar/binary-30k as an accessible resource for researchers, practitioners, and students.
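The abstract does not spell out how the platform-first stratified sampling or the official splits were produced. As a rough illustration of the general idea only, the sketch below groups records by platform and splits each group in the same proportions, so every platform appears in train, validation, and test. The function name `stratified_split`, the `platform` field, the toy records, and the 80/10/10 proportions are all assumptions made for this example, not details taken from the dataset.

```python
import random
from collections import defaultdict


def stratified_split(records, key, percents=(80, 10, 10), seed=0):
    """Split records into train/validation/test, stratified on `key`.

    Hypothetical helper for illustration; not the dataset authors' procedure.
    `percents` holds integer percentages for train and validation; whatever
    remains in each stratum goes to test.
    """
    rng = random.Random(seed)

    # Group records by stratum (here: the platform label).
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    # Iterate strata in sorted order so the result is reproducible.
    for _, group in sorted(strata.items()):
        rng.shuffle(group)
        n = len(group)
        n_train = n * percents[0] // 100
        n_val = n * percents[1] // 100
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        splits["test"].extend(group[n_train + n_val:])
    return splits


# Toy records with made-up IDs and platform labels (assumed schema).
records = (
    [{"id": f"w{i}", "platform": "windows"} for i in range(50)]
    + [{"id": f"l{i}", "platform": "linux"} for i in range(30)]
    + [{"id": f"m{i}", "platform": "macos"} for i in range(10)]
    + [{"id": f"a{i}", "platform": "android"} for i in range(10)]
)

splits = stratified_split(records, key="platform")
for name, rows in splits.items():
    platforms = sorted({r["platform"] for r in rows})
    print(name, len(rows), platforms)
```

Splitting within each platform, rather than over the pooled records, keeps small platforms from vanishing from the validation or test split by chance. A real pipeline would likely stratify on architecture and malware label as well.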

