Machine-generated probability predictions are essential in modern classification tasks such as image classification. A model is well calibrated when its predicted probabilities correspond to observed event frequencies. Multicategory recalibration methods are needed, yet existing methods are limited in that they (i) compare calibration between two or more models rather than directly assessing the calibration of a single model, (ii) require under-the-hood model access, e.g., to logit-scale predictions within the layers of a neural network, and (iii) produce output that is difficult for human analysts to understand. To overcome (i)-(iii), we propose Multicategory Linear Log Odds (MCLLO) recalibration, which (i) includes a likelihood ratio hypothesis test to assess calibration, (ii) does not require under-the-hood access to models and is thus applicable to a wide range of classification problems, and (iii) can be easily interpreted. We demonstrate the effectiveness of the MCLLO method through simulations and three real-world case studies involving image classification via a convolutional neural network, obesity analysis via random forest, and ecology via regression modeling. We compare MCLLO to four comparator recalibration techniques, using both our hypothesis test and the existing calibration metric Expected Calibration Error, to show that our method works well alone and in concert with other methods.
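To make the notion of calibration and the Expected Calibration Error metric concrete, the following is a minimal sketch of one common form of ECE (top-label confidence with equal-width bins). The function name `expected_calibration_error`, the default of 10 bins, and the toy data are assumptions made for illustration; they are not taken from the paper, which may use a different ECE variant.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE with equal-width confidence bins (illustrative sketch).

    probs  : (n_samples, n_classes) array of predicted class probabilities
    labels : (n_samples,) array of integer class labels
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)

    confidences = probs.max(axis=1)        # probability of the predicted class
    predictions = probs.argmax(axis=1)     # predicted class index
    correct = (predictions == labels).astype(float)

    # Assign each confidence to one of n_bins right-closed bins on [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between observed accuracy and mean confidence in this bin,
            # weighted by the fraction of samples that fall in the bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example (hypothetical data): two confident and two less confident predictions.
toy_probs = [[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]]
toy_labels = [0, 0, 0, 1]
toy_ece = expected_calibration_error(toy_probs, toy_labels)
```

A value near zero indicates that, within each confidence bin, the observed accuracy is close to the mean predicted confidence. ECE is a descriptive summary only; it does not by itself provide the kind of hypothesis test the abstract describes.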