Quantifying the Generalization Gap: A New Benchmark for Out-of-Distribution Graph-Based Android Malware Classification

Graph-based Android malware classifiers achieve over 94% accuracy on standard benchmarks, yet they exhibit a significant generalization gap under distribution shift, with performance degrading by up to 45% on unseen malware variants from known families. This work systematically investigates this critical yet overlooked obstacle to real-world deployment by introducing a benchmarking suite that simulates two prevalent scenarios: MalNet-Tiny-Common for covariate shift and MalNet-Tiny-Distinct for domain shift. We further identify an inherent limitation of existing benchmarks: their inputs are structure-only function call graphs, which fail to capture the latent semantic patterns necessary for robust generalization. To verify this, we construct a semantic enrichment framework that augments the original topology with function-level attributes, including lightweight metadata and LLM-based code embeddings. This expanded feature set is intended to give future research richer behavioral information for developing more sophisticated detection techniques. Empirical evaluations confirm the effectiveness of this data-centric methodology: it yields better classification performance under distribution shift than model-based approaches, and it consistently improves robustness further when the two are combined. We release our precomputed datasets, along with an extensible implementation of our full pipeline, to lay the groundwork for building malware detection systems that remain resilient in evolving threat environments.
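
The semantic enrichment step can be pictured as attaching a feature vector to every node of a structure-only function call graph. The following is a minimal sketch of that idea, not the released pipeline: the function names `enrich_call_graph` and `toy_embedding`, the two-value metadata layout, and the hash-based stand-in for an LLM code embedding are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def toy_embedding(name, dim=4):
    """Deterministic placeholder for an LLM code embedding (NOT a real model)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def enrich_call_graph(edges, metadata, embed, meta_dim=2):
    """Attach per-function feature vectors to a structure-only call graph.

    edges:    iterable of (caller, callee) function-name pairs
    metadata: dict of function name -> list of meta_dim numbers (hypothetical layout)
    embed:    callable mapping a function name to a list of floats
    Returns a dict of function name -> [in_degree, out_degree, *metadata, *embedding].
    """
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    nodes = set()
    for caller, callee in edges:
        out_degree[caller] += 1
        in_degree[callee] += 1
        nodes.update((caller, callee))

    features = {}
    for node in sorted(nodes):
        structural = [float(in_degree[node]), float(out_degree[node])]
        # Functions without metadata fall back to a zero vector of the same width.
        meta = [float(v) for v in metadata.get(node, [0.0] * meta_dim)]
        if len(meta) != meta_dim:
            raise ValueError(f"metadata for {node!r} must have {meta_dim} values")
        features[node] = structural + meta + list(embed(node))
    return features


# Toy call graph: three functions, three call edges (illustrative names only).
edges = [
    ("main", "send_sms"),
    ("main", "read_contacts"),
    ("read_contacts", "send_sms"),
]
# Hypothetical metadata layout: [instruction_count, is_external_api].
metadata = {"main": [12, 0], "send_sms": [3, 1]}
features = enrich_call_graph(edges, metadata, toy_embedding)
```

In this sketch the topology is left untouched and only node attributes are added, so any graph classifier that accepts node features could consume the result; a real embedding model would replace `toy_embedding`.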
