Machine learning models have demonstrated remarkable performance on finite datasets, yet whether scores on fixed benchmarks sufficiently indicate a model's performance in the real world remains an open question. An ideal robust model would likely behave similarly to the oracle (e.g., human users), so a good evaluation protocol is arguably one that compares the model's behavior against the oracle's. In this paper, we introduce a new robustness measurement that directly measures an image classification model's performance relative to a surrogate oracle (i.e., a foundation model). We also design a simple method that carries out this evaluation beyond the scope of existing benchmarks. Our method extends image datasets with new samples that are perturbed enough to be distinct from those in the original sets, yet remain within the same image-label structure that the original test image represents, as constrained by a foundation model pretrained on a large number of samples. As a result, our method offers a new way to evaluate models' robustness that is free of the limitations of fixed benchmarks and constrained perturbations, although it is bounded by the power of the oracle. In addition to the evaluation results, we use the generated data to understand the behaviors of the models and of our new evaluation strategies.
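The abstract describes the evaluation only at a high level, so the following is a minimal sketch of one plausible reading, not the paper's actual procedure. It assumes that a sample is perturbed step by step for as long as the surrogate oracle still assigns it the original label, and that robustness is then scored as the model's accuracy on the samples the oracle still recognizes. The function names, the `max_steps` budget, and the exact scoring rule are all hypothetical; `model`, `oracle`, and `perturb` are placeholder callables standing in for a classifier, a foundation model, and a perturbation operator.

```python
from typing import Callable, Hashable, Sequence, TypeVar

X = TypeVar("X")  # stand-in for an image


def perturb_within_oracle(
    x: X,
    label: Hashable,
    perturb: Callable[[X], X],
    oracle: Callable[[X], Hashable],
    max_steps: int = 10,
) -> X:
    """Perturb `x` repeatedly and return the last version the oracle
    still assigns to `label` (assumed stopping rule, not the paper's)."""
    current = x
    for _ in range(max_steps):
        candidate = perturb(current)
        if oracle(candidate) != label:
            break  # the oracle no longer sees the original label
        current = candidate
    return current


def oracle_relative_accuracy(
    model: Callable[[X], Hashable],
    oracle: Callable[[X], Hashable],
    samples: Sequence[X],
    labels: Sequence[Hashable],
) -> float:
    """Accuracy of `model` restricted to samples on which the oracle
    agrees with the given label (assumed scoring rule)."""
    kept = [(x, y) for x, y in zip(samples, labels) if oracle(x) == y]
    if not kept:
        raise ValueError("the oracle agrees with none of the labels")
    correct = sum(1 for x, y in kept if model(x) == y)
    return correct / len(kept)
```

Restricting the score to oracle-validated samples reflects the abstract's caveat that the evaluation is scoped by the power of the oracle: a sample the oracle itself misreads says nothing about the model under test.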