Progress in graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often reduce programs to single-level graphs and therefore fail to model the semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce \dataset, the largest public hierarchical graph dataset for malware analysis, comprising over \textbf{200M} Control Flow Graphs (CFGs) nested within \textbf{595K} Function Call Graphs (FCGs). This two-level representation preserves the structural semantics needed to build detectors that are robust to code obfuscation and malware evolution. We demonstrate HiGraph's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at https://higraph.org.