Machine learning models achieve remarkable performance on finite datasets, yet whether scores on fixed benchmarks sufficiently indicate a model's performance in the real world remains an open question. An ideal robust model would presumably behave like the oracle (e.g., human users), so a good evaluation protocol would compare the model's behavior against the oracle's. In this paper, we introduce a new robustness measurement that directly measures an image classification model's performance relative to a surrogate oracle (i.e., a foundation model). We also design a simple method that carries the evaluation beyond the scope of the benchmarks. Our method extends image datasets with new samples that are perturbed enough to be distinct from those in the original sets, yet remain bounded within the same image-label structure that the original test image represents, as constrained by a foundation model pretrained on a large number of samples. As a result, our method offers a new way to evaluate models' robustness, free of the limitations of fixed benchmarks or constrained perturbations, although scoped by the power of the oracle. In addition to the evaluation results, we use the generated data to understand the behaviors of the models and of our new evaluation strategies.
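The evaluation idea can be illustrated with a toy sketch. It is not the paper's implementation: the abstract does not specify how samples are generated, so a simple random walk stands in for the perturbation step, a hand-written rule stands in for the foundation model, and the samples are short feature vectors rather than images. All names (`perturb_within_oracle`, `oracle_bounded_accuracy`, `toy_oracle`, `shortcut_model`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = List[float]
Classifier = Callable[[Sample], int]


def perturb_within_oracle(
    x: Sample,
    label: int,
    oracle: Classifier,
    steps: int = 20,
    step_size: float = 0.5,
    seed: int = 0,
) -> Sample:
    """Random-walk x away from the original sample, accepting a step only
    if the oracle still assigns the original label to the result."""
    rng = random.Random(seed)
    current = list(x)
    for _ in range(steps):
        candidate = [v + rng.uniform(-step_size, step_size) for v in current]
        if oracle(candidate) == label:
            current = candidate  # still inside the oracle's label region
    return current


def oracle_bounded_accuracy(
    model: Classifier,
    oracle: Classifier,
    data: Sequence[Tuple[Sample, int]],
    steps: int = 20,
    step_size: float = 0.5,
) -> float:
    """Accuracy of the model on perturbed samples whose labels the oracle
    preserves. Samples the oracle already mislabels are skipped, since the
    measurement is scoped by what the oracle can judge."""
    kept = [(x, y) for x, y in data if oracle(x) == y]
    if not kept:
        return 0.0
    correct = 0
    for i, (x, y) in enumerate(kept):
        x_new = perturb_within_oracle(x, y, oracle, steps, step_size, seed=i)
        correct += int(model(x_new) == y)
    return correct / len(kept)


def toy_oracle(x: Sample) -> int:
    """Stand-in for the foundation model: class 1 if the features sum > 0."""
    return int(sum(x) > 0)


def shortcut_model(x: Sample) -> int:
    """Model under test: relies only on the first feature."""
    return int(x[0] > 0)


toy_data = [([1.0, 1.0], 1), ([-1.0, -1.0], 0), ([2.0, 0.5], 1), ([-0.5, -2.0], 0)]
score = oracle_bounded_accuracy(shortcut_model, toy_oracle, toy_data)
print(f"oracle-bounded accuracy of shortcut_model: {score:.2f}")
```

In this sketch a model that agrees with the oracle everywhere scores 1.0 by construction, which mirrors the caveat that the measurement is scoped by the power of the oracle.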