The performance figures of modern drift-adaptive malware classifiers appear promising, but do they translate to genuine operational reliability? The standard evaluation paradigm focuses primarily on baseline performance metrics, neglecting confidence-error alignment and operational stability. While prior work established the importance of temporal evaluation and introduced selective classification to malware classification tasks, we take a complementary direction: investigating whether malware classifiers maintain reliable and stable confidence estimates under distribution shift, and exploring the tension between scientific advancement and practical impact when they do not. We propose Aurora, a framework for evaluating malware classifiers based on their confidence quality and operational resilience. Aurora subjects a given model's confidence profile to verification to assess the reliability of its estimates. Unreliable confidence estimates erode operational trust, waste valuable annotation budgets on non-informative samples in active learning, and leave error-prone instances undetected in selective classification. Aurora is further complemented by a set of metrics designed to go beyond point-in-time performance, striving toward a more holistic assessment of operational stability across temporal evaluation periods. The fragility we observe in SOTA frameworks across datasets of varying drift severity suggests it may be time to revisit the underlying assumptions.
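To make "confidence-error alignment" concrete, the following is a minimal sketch of one standard way to quantify it: expected calibration error (ECE), which bins predictions by confidence and measures the gap between average confidence and empirical accuracy in each bin. The function name and the equal-width binning scheme are illustrative assumptions, not Aurora's actual verification procedure.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Illustrative ECE: weighted average of |accuracy - confidence| per bin.

    confidences: predicted confidence per sample, in (0, 1].
    correct: 1 if the prediction was correct, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between how confident the model was and how often it was right.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece
```

A well-calibrated model that reports 0.9 confidence and is right 90% of the time yields an ECE near zero, whereas an overconfident model reporting 0.99 while being right only half the time yields a large gap; under distribution shift, a rising ECE is one signal that a classifier's confidence estimates are degrading even if point-in-time accuracy looks acceptable.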