Calibrating deep neural models plays an important role in building reliable, robust AI systems for safety-critical applications. Recent work has shown that modern neural networks with high predictive capability are often poorly calibrated and produce unreliable predictions. Although deep learning models achieve remarkable performance on various benchmarks, model calibration and reliability remain relatively underexplored. An ideal deep model should not only have high predictive performance but also be well calibrated. Several methods have recently been proposed to calibrate deep models using different mechanisms. In this survey, we review state-of-the-art calibration methods and explain the principles by which they perform model calibration. We first define model calibration and explain the root causes of miscalibration. We then introduce the key metrics used to measure calibration. Next, we summarize calibration methods, which we roughly classify into four categories: post-hoc calibration, regularization methods, uncertainty estimation, and composition methods. We also cover recent advances in calibrating large models, particularly large language models (LLMs). Finally, we discuss open issues, challenges, and potential directions.