Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines, so even a 0.1\% change can cause an overwhelming number of false positives. However, academic research is often restricted to public datasets on the order of ten thousand samples, which are too small to detect improvements that may be relevant to industry. Working within these constraints, we devise an approach to generate a benchmark of configurable difficulty from a pool of available samples. We do this by leveraging malware family information from tools like AVClass to construct training/test splits that have different generalization rates, as measured by a secondary model. Our experiments demonstrate that a less accurate secondary model with disparate features is effective at producing benchmarks for a more sophisticated target model under evaluation. We also ablate against alternative designs to show the need for our approach.
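As a rough illustration of how family labels can drive split difficulty, the following minimal sketch holds out whole malware families from training. It is not the procedure evaluated in this work: the function names, the `unseen_fraction` knob, and the greedy family selection are all assumptions made for the example, and the secondary model is omitted, with the share of test samples from unseen families standing in as a crude difficulty proxy.

```python
import random
from collections import defaultdict


def split_by_family(families, unseen_fraction, test_fraction=0.3, seed=0):
    """Split sample indices into train/test, holding out whole families.

    families: one family label per sample (e.g. labels derived with AVClass).
    unseen_fraction: hypothetical difficulty knob; the target share of test
        samples whose family never appears in training.
    """
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for idx, fam in enumerate(families):
        by_family[fam].append(idx)

    n_test = round(test_fraction * len(families))
    n_unseen = round(unseen_fraction * n_test)

    # Greedily hold out whole families while they fit in the unseen budget.
    order = sorted(by_family)
    rng.shuffle(order)
    test, held_out = [], set()
    for fam in order:
        if len(test) + len(by_family[fam]) <= n_unseen:
            test.extend(by_family[fam])
            held_out.add(fam)

    # Fill the remaining test slots from families that also stay in training.
    # Note: a small family could still end up entirely in the test set here,
    # so the achieved unseen share is approximate; measure it afterwards.
    rest = [i for i, fam in enumerate(families) if fam not in held_out]
    rng.shuffle(rest)
    n_seen = max(0, n_test - len(test))
    test.extend(rest[:n_seen])
    train = rest[n_seen:]
    return sorted(train), sorted(test)


def unseen_share(families, train, test):
    """Fraction of test samples whose family is absent from training."""
    seen = {families[i] for i in train}
    return sum(families[i] not in seen for i in test) / len(test)


# Toy pool: 10 made-up families with 10 samples each.
toy_families = [f"family_{i % 10}" for i in range(100)]
easy_train, easy_test = split_by_family(toy_families, unseen_fraction=0.0)
hard_train, hard_test = split_by_family(toy_families, unseen_fraction=1.0)
```

Sweeping `unseen_fraction` between 0 and 1 yields a family of candidate splits. In the setting described above, each candidate would then be scored by training a secondary model on its training portion and measuring how well it generalizes to the test portion, rather than by the proxy used here.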