This article proposes an alternative to the Hosmer-Lemeshow (HL) test for evaluating the calibration of probability forecasts for binary events. The approach is based on e-values, a new tool for hypothesis testing. An e-value is a random variable with expected value less than or equal to one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a p-value. Our test uses online isotonic regression to estimate the calibration curve as a "betting strategy" against the null hypothesis. We show that the test has power against essentially all alternatives, which makes it theoretically superior to the HL test and at the same time resolves the well-known instability problem of the latter. A simulation study shows that a feasible version of the proposed eHL test can detect slight miscalibrations in practically relevant sample sizes, but pays for its universal validity and power guarantees with reduced empirical power compared to the HL test in a classical simulation setup. We illustrate our test on recalibrated predictions for credit card defaults during the Taiwan credit card crisis, where the classical HL test delivers equivocal results.
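To make the e-value idea concrete, the following is a minimal sketch of a likelihood-ratio e-value for the null hypothesis that each outcome is Bernoulli with the stated forecast probability. It is an illustration under stated assumptions, not the eHL procedure itself: the function names `likelihood_ratio_e_value` and `e_to_p` are hypothetical, and the alternative probabilities `alt_probs` are simply passed in, whereas the test described above obtains them from online isotonic regression. For the result to be a valid e-value, each alternative probability must be chosen using only data observed before the corresponding outcome.

```python
import math


def likelihood_ratio_e_value(forecasts, outcomes, alt_probs):
    """Product of Bernoulli likelihood ratios (alternative / null).

    forecasts: null probabilities p_t, each strictly between 0 and 1
    outcomes:  binary outcomes y_t in {0, 1}
    alt_probs: alternative probabilities q_t (the "bets"), strictly
               between 0 and 1; assumed to be chosen before y_t is seen
    """
    log_e = 0.0  # accumulate in log space for numerical stability
    for p, y, q in zip(forecasts, outcomes, alt_probs):
        if y == 1:
            log_e += math.log(q) - math.log(p)
        else:
            log_e += math.log(1.0 - q) - math.log(1.0 - p)
    return math.exp(log_e)


def e_to_p(e_value):
    """Convert an e-value to a conservative p-value, capped at 1."""
    if e_value <= 0.0:
        return 1.0
    return min(1.0, 1.0 / e_value)


# Hypothetical toy data: the forecaster always says 0.5, while the
# bettor believes the true probabilities are 0.75 and 0.25.
e = likelihood_ratio_e_value([0.5, 0.5], [1, 0], [0.75, 0.25])
p = e_to_p(e)
print(e, p)
```

Betting exactly on the forecasts themselves (`alt_probs` equal to `forecasts`) always gives an e-value of one, that is, no evidence against calibration. The conversion in `e_to_p` relies on Markov's inequality, which bounds the null probability that an e-value exceeds 1/α by α.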