Deep neural networks have achieved impressive performance on many computer vision benchmarks in recent years. However, can we be confident that impressive performance on benchmarks will translate to strong performance in real-world environments? Many real-world environments are safety-critical, and even slight model failures can be catastrophic. Therefore, it is crucial to test models rigorously before deployment. We argue, through both statistical theory and empirical evidence, that selecting representative image datasets for testing a model is likely infeasible in many domains. Furthermore, performance statistics calculated with non-representative image datasets are highly unreliable. As a consequence, we cannot guarantee that models which perform well on withheld test images will also perform well in the real world. Creating ever-larger datasets will not help, and bias-aware datasets cannot solve this problem either. Ultimately, there is little statistical foundation for evaluating models using withheld test sets. We recommend that future evaluation methodologies focus on assessing a model's decision-making process, rather than metrics such as accuracy.
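The claim that performance statistics computed on non-representative test sets are unreliable can be illustrated with a minimal simulation. The sketch below is not from the paper; it assumes a hypothetical classifier whose accuracy differs between a "common" and a "rare" image subgroup, and shows how a test set that over-samples the common subgroup inflates the accuracy estimate relative to the deployment distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: 98% accurate on "common" images, 60% on "rare" ones.
acc = {"common": 0.98, "rare": 0.60}

# Assumed deployment distribution: 70% common, 30% rare.
deploy_mix = {"common": 0.70, "rare": 0.30}
true_acc = sum(deploy_mix[g] * acc[g] for g in acc)  # expected real-world accuracy

# A curated (non-representative) test set over-samples common images.
test_mix = {"common": 0.95, "rare": 0.05}
n = 10_000
correct = 0
for g, p in test_mix.items():
    k = int(n * p)
    # Simulate per-image correctness as Bernoulli trials within each subgroup.
    correct += rng.binomial(k, acc[g])
estimated_acc = correct / n

print(f"deployment accuracy: {true_acc:.3f}")
print(f"test-set estimate:   {estimated_acc:.3f}")
```

Under these assumed mixtures, the withheld-test-set estimate sits several points above the deployment accuracy, and no amount of additional sampling from the same biased test distribution closes the gap.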