Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles

When a deep learning model is deployed in the wild, it can encounter test data drawn from distributions different from the training distribution and suffer a drop in performance. For safe deployment, it is essential to estimate the accuracy of the pre-trained model on the test data. However, labels for the test inputs are usually not immediately available in practice, and obtaining them can be expensive. This observation leads to two challenging tasks: (1) unsupervised accuracy estimation, which aims to estimate the accuracy of a pre-trained classifier on a set of unlabeled test inputs; and (2) error detection, which aims to identify misclassified test inputs. In this paper, we propose a principled and practically effective framework that addresses both tasks simultaneously. The framework iteratively learns an ensemble of models to identify misclassified data points and performs self-training to improve the ensemble using the identified points. Theoretical analysis demonstrates that the framework enjoys provable guarantees for both accuracy estimation and error detection under mild conditions readily satisfied by practical deep learning models. Along with the framework, we propose and evaluate two instantiations, which achieve state-of-the-art results on 59 tasks. For example, on iWildCam, one instantiation reduces the estimation error for unsupervised accuracy estimation by at least 70% and improves the F1 score for error detection by at least 4.7% compared to existing methods.
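
As a rough illustration of the iterative procedure described above, the following Python sketch trains a small ensemble, flags test points where the ensemble disagrees with the pre-trained model, and then retrains the ensemble with those flagged points pseudo-labeled by the ensemble. It is not the paper's implementation. The nearest-centroid classifier, the bootstrap resampling used for ensemble diversity, the majority-vote disagreement rule, the synthetic data, and all function names (`fit_centroids`, `predict_centroids`, `self_training_ensemble`) are assumptions made purely for illustration.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Toy classifier (assumption): one mean vector per class.
    # An empty class falls back to the global mean to avoid NaNs.
    return np.stack([
        X[y == c].mean(axis=0) if np.any(y == c) else X.mean(axis=0)
        for c in range(n_classes)
    ])

def predict_centroids(centroids, X):
    # Assign each row of X to the nearest class centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def self_training_ensemble(f_pred, X_train, y_train, X_test, n_classes,
                           n_models=5, n_rounds=3, seed=0):
    rng = np.random.default_rng(seed)
    n_test = len(X_test)
    flagged = np.zeros(n_test, dtype=bool)  # test points suspected to be misclassified
    pseudo = f_pred.copy()                  # pseudo-labels used for flagged points
    for _round in range(n_rounds):
        # Training pool: labeled data plus currently flagged test points.
        X_pool = np.concatenate([X_train, X_test[flagged]])
        y_pool = np.concatenate([y_train, pseudo[flagged]])
        votes = np.zeros((n_test, n_classes), dtype=int)
        for _member in range(n_models):
            # Bootstrap resampling gives each ensemble member a different view.
            idx = rng.integers(0, len(X_pool), size=len(X_pool))
            member = fit_centroids(X_pool[idx], y_pool[idx], n_classes)
            votes[np.arange(n_test), predict_centroids(member, X_test)] += 1
        ensemble_pred = votes.argmax(axis=1)
        # Error detection: flag points where the ensemble disagrees with f.
        flagged = ensemble_pred != f_pred
        # Self-training: flagged points take the ensemble's label next round.
        pseudo = ensemble_pred
    # Accuracy estimate: fraction of test points on which the ensemble agrees with f.
    est_acc = 1.0 - flagged.mean()
    return est_acc, flagged

# Synthetic two-class data with a shifted test distribution (illustrative only).
data_rng = np.random.default_rng(0)
X_train = np.concatenate([data_rng.normal(-1, 1, (100, 2)),
                          data_rng.normal(1, 1, (100, 2))])
y_train = np.repeat([0, 1], 100)
X_test = np.concatenate([data_rng.normal(-1, 1, (100, 2)),
                         data_rng.normal(1, 1, (100, 2))]) + 0.8

f = fit_centroids(X_train, y_train, 2)  # stands in for the pre-trained model
f_pred = predict_centroids(f, X_test)
est_acc, flagged = self_training_ensemble(f_pred, X_train, y_train, X_test, n_classes=2)
print(f"estimated accuracy: {est_acc:.3f}, flagged points: {int(flagged.sum())}")
```

The sketch returns both quantities from a single disagreement signal: the flagged set serves as the error-detection output, and one minus its relative size serves as the accuracy estimate. A practical instantiation would replace the toy classifier with deep networks and would likely need a stronger source of ensemble diversity than bootstrap resampling.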
