``Effective robustness'' measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set, such as ImageNet, to evaluate ID accuracy. This becomes problematic when comparing models trained on different data distributions, e.g., models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new evaluation metric for comparing the effective robustness of models trained on different data distributions. To do this, we control for accuracy on multiple ID test sets that cover the training distributions of all the evaluated models. Our new metric provides a better estimate of effective robustness and helps explain the surprising effective robustness gains that zero-shot CLIP-like models exhibit when only one ID dataset is considered: these gains diminish under our evaluation.
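To make the idea of controlling for several ID test sets concrete, the following is a minimal sketch rather than the paper's actual procedure. It assumes the baseline is a linear fit in logit space from the ID accuracies to the OOD accuracy, and that effective robustness is the gap between a model's observed OOD accuracy and that baseline prediction. The function names (`fit_baseline`, `effective_robustness`) and all accuracy values are hypothetical and synthetic, chosen only for illustration.

```python
import numpy as np


def logit(p):
    # Clip to avoid infinities at accuracies of exactly 0 or 1.
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


def fit_baseline(id_acc, ood_acc):
    """Least-squares fit of logit(OOD acc) on logit(ID accs) plus an intercept.

    id_acc:  array of shape (n_models, n_id_sets)
    ood_acc: array of shape (n_models,)
    Returns the weight vector, with the intercept as its last entry.
    """
    X = np.column_stack([logit(id_acc), np.ones(len(ood_acc))])
    w, *_ = np.linalg.lstsq(X, logit(ood_acc), rcond=None)
    return w


def effective_robustness(w, id_acc, ood_acc):
    """Observed OOD accuracy minus the accuracy predicted by the baseline.

    A 1-D id_acc is treated as a single model evaluated on several ID sets.
    """
    id_acc = np.atleast_2d(id_acc)
    X = np.column_stack([logit(id_acc), np.ones(id_acc.shape[0])])
    return np.asarray(ood_acc, dtype=float) - sigmoid(X @ w)


# Synthetic illustration (made-up numbers, not results from the paper):
# the OOD accuracy of 20 models depends on accuracy on BOTH ID test sets.
rng = np.random.default_rng(0)
id_acc = rng.uniform(0.3, 0.9, size=(20, 2))
true_w = np.array([0.6, 0.3, -0.5])  # assumed generating weights + intercept
ood_acc = sigmoid(logit(id_acc) @ true_w[:2] + true_w[2])

# Controlling for both ID test sets: residuals vanish on this synthetic data.
w_multi = fit_baseline(id_acc, ood_acc)
er_multi = effective_robustness(w_multi, id_acc, ood_acc)

# Controlling for only the first ID test set: spurious residuals appear.
w_single = fit_baseline(id_acc[:, :1], ood_acc)
er_single = effective_robustness(w_single, id_acc[:, :1], ood_acc)

print("max |ER|, two ID sets:", np.abs(er_multi).max())
print("max |ER|, one ID set: ", np.abs(er_single).max())
```

In this toy setup, models look more or less "effectively robust" under the single-ID baseline only because their accuracy on the second ID test set is ignored; once both ID accuracies are controlled for, the apparent gains disappear, which mirrors the qualitative claim in the abstract.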