Malware remains a predominant operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite ongoing work on Machine Learning (ML) detection approaches, public datasets still lack feature compatibility, which limits generalization under distribution shift as well as transferability across datasets. This study evaluates the suitability of different data preprocessing approaches for malware detection in Portable Executable (PE) files with ML models. The preprocessing pipeline unifies the datasets under the EMBERv2 (2,381-dimensional) feature representation and trains paired models under two training setups: EMBER + BODMAS and EMBER + BODMAS + ERMDS. Models from both setups are tested against TRITIUM, INFERNO and SOREL-20M; ERMDS additionally serves as a test set for the EMBER + BODMAS setup.