In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration remain comparatively underexplored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples: given $n$ draws from a distribution $\mathcal{D}$ over (prediction, binary outcome) pairs, the goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We design an algorithm based on approximate linear programming, which solves calibration testing information-theoretically optimally (up to constant factors) in time $O(n^{1.5} \log(n))$. This improves upon state-of-the-art black-box linear program solvers, which require $\Omega(n^\omega)$ time, where $\omega > 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem, and give sample complexity lower bounds for calibration distances other than the one considered in this work. Finally, we present preliminary experiments showing that the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale to moderate sample sizes.
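To make the testing setup concrete, the following is a minimal Python sketch of distinguishing a calibrated sample from a miscalibrated one. It rests on several assumptions that do not come from the abstract: it uses a simple binned calibration error as a stand-in statistic, which is neither the calibration distance of [BGHN23] nor the linear-programming tester described above, and the function names, the `shift` parameter, and the threshold of 0.1 are hypothetical choices for illustration.

```python
import random


def binned_calibration_error(samples, num_bins=10):
    """Empirical binned calibration error of (prediction, outcome) pairs.

    Stand-in statistic for illustration only; it is not the calibration
    distance or the linear-programming algorithm from the abstract.
    """
    n = len(samples)
    # Per bin: [sum of predictions, sum of outcomes, count].
    bins = [[0.0, 0.0, 0] for _ in range(num_bins)]
    for v, y in samples:
        b = min(int(v * num_bins), num_bins - 1)  # v = 1.0 goes in the last bin
        bins[b][0] += v
        bins[b][1] += y
        bins[b][2] += 1
    # Count-weighted average over bins of |mean outcome - mean prediction|.
    return sum(abs(sy - sv) for sv, sy, c in bins if c > 0) / n


def draw_samples(n, shift, rng):
    """Draw n pairs (v, y): v uniform on [0, 1], P(y = 1 | v) = clip(v + shift)."""
    out = []
    for _ in range(n):
        v = rng.random()
        p = min(1.0, max(0.0, v + shift))
        out.append((v, 1 if rng.random() < p else 0))
    return out


def looks_calibrated(samples, threshold=0.1):
    """Hypothetical decision rule: accept if the statistic is below threshold."""
    return binned_calibration_error(samples) <= threshold


rng = random.Random(0)
calibrated = draw_samples(20000, 0.0, rng)  # E[y | v] = v, perfectly calibrated
shifted = draw_samples(20000, 0.3, rng)     # outcomes more likely than predicted
print(looks_calibrated(calibrated), looks_calibrated(shifted))
```

With `shift = 0` the outcomes satisfy $\mathbb{E}[y \mid v] = v$, so the statistic concentrates near zero as $n$ grows, while a positive shift keeps it bounded away from zero. Binned statistics of this kind depend on the choice of bins, which is one motivation for the distance-based formulation the abstract builds on.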