Machine learning models have demonstrated remarkable performance on finite datasets, yet whether scores on fixed benchmarks sufficiently indicate a model's performance in the real world remains an open question. An ideally robust model would likely behave similarly to an oracle (e.g., human users), so a good evaluation protocol is arguably one that compares the model's behavior against that oracle. In this paper, we introduce a new robustness measurement that directly measures an image classification model's performance relative to a surrogate oracle (i.e., a foundation model). We also design a simple method that carries the evaluation beyond the scope of existing benchmarks. Our method extends image datasets with new samples that are perturbed enough to be distinct from those in the original sets, yet remain within the same image-label structure that the original test image represents, as constrained by a foundation model pretrained on a large number of samples. As a result, our method offers a new way to evaluate models' robustness that is free of the limitations of fixed benchmarks or constrained perturbations, although it is bounded by the power of the oracle. In addition to the evaluation results, we use the generated data to understand the behaviors of the models and of our new evaluation strategies.
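As a rough illustration of the idea of scoring a classifier only on perturbed samples that the surrogate oracle still assigns to the original label, the following minimal sketch may help. It is not the paper's implementation: the function name `oracle_relative_accuracy`, the callable interfaces for `model` and `oracle`, and the choice to discard samples the oracle rejects are all assumptions made for illustration, and the perturbation step itself is left out.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical stand-ins: a "sample" is any sequence of floats (e.g. a
# flattened image), and a classifier maps a sample to an integer label.
Sample = Sequence[float]
Classifier = Callable[[Sample], int]


def oracle_relative_accuracy(
    model: Classifier,
    oracle: Classifier,
    perturbed: Iterable[Tuple[Sample, int]],
) -> float:
    """Accuracy of `model` on perturbed samples the oracle still accepts.

    `perturbed` yields (perturbed_sample, original_label) pairs. A pair is
    kept only if the oracle still predicts the original label, which is the
    assumed reading of "bounded within the same image-label structure".
    """
    kept = 0
    correct = 0
    for sample, label in perturbed:
        if oracle(sample) != label:
            continue  # oracle no longer agrees with the label: discard
        kept += 1
        if model(sample) == label:
            correct += 1
    if kept == 0:
        raise ValueError("the oracle rejected every perturbed sample")
    return correct / kept
```

Under these assumptions the score is only as trustworthy as the oracle: a sample that the oracle mislabels is either wrongly discarded or wrongly kept, which mirrors the abstract's caveat that the evaluation is scoped by the oracle's power.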