Proper confidence calibration of deep neural networks is essential for reliable predictions in safety-critical tasks. Miscalibration can lead to model over-confidence and/or under-confidence; i.e., the model's confidence in its prediction can be greater or less than the model's accuracy. Recent studies have highlighted the over-confidence issue by introducing calibration techniques and demonstrated success on various tasks. However, miscalibration through under-confidence has yet to receive much attention. In this paper, we highlight the need to pay attention to the under-confidence issue. We first introduce a novel metric, a miscalibration score, to identify the overall and class-wise calibration status, including being over- or under-confident. Our proposed metric reveals the pitfalls of existing calibration techniques, which often over-calibrate the model and worsen under-confident predictions. Then we utilize the class-wise miscalibration score as a proxy to design a calibration technique that can tackle both over- and under-confidence. We report extensive experiments showing that our proposed methods substantially outperform existing calibration techniques. We also validate our proposed calibration technique on an automatic failure detection task with a risk-coverage curve, reporting that our methods improve failure detection as well as the trustworthiness of the model. The code is available at \url{https://github.com/AoShuang92/miscalibration_TS}.
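The abstract describes miscalibration as confidence being greater or less than accuracy, but does not give the formula for the miscalibration score. As an illustration only, the sketch below computes a simple signed gap (mean confidence minus accuracy), overall and per predicted class. The function names `signed_calibration_gap` and `classwise_gaps`, and the choice to group by predicted class, are assumptions made for this example and are not taken from the paper or its repository.

```python
import numpy as np


def signed_calibration_gap(confidences, predictions, labels):
    """Mean confidence minus accuracy (an assumed, simplified signed gap).

    A positive value suggests over-confidence, a negative value under-confidence.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(predictions) == np.asarray(labels)
    return float(confidences.mean() - correct.mean())


def classwise_gaps(confidences, predictions, labels):
    """Signed gap per predicted class (the grouping choice is hypothetical)."""
    confidences = np.asarray(confidences, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    gaps = {}
    for c in np.unique(predictions):
        mask = predictions == c  # samples the model assigned to class c
        gaps[int(c)] = signed_calibration_gap(
            confidences[mask], predictions[mask], labels[mask]
        )
    return gaps


# Toy data: class 0 is predicted with high confidence but is often wrong,
# class 1 is predicted with low confidence but is always right.
conf = [0.9, 0.9, 0.6, 0.6]
pred = [0, 0, 1, 1]
true = [0, 1, 1, 1]

overall = signed_calibration_gap(conf, pred, true)
per_class = classwise_gaps(conf, pred, true)
print(overall, per_class)
```

On this toy data the overall gap is roughly zero, while class 0 has a positive gap (over-confident) and class 1 a negative one (under-confident). This is the kind of situation where an unsigned, aggregate view can hide both sides of miscalibration.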