Probabilistic predictions can be evaluated by comparing them with observed label frequencies, that is, through the lens of calibration. Recent work on algorithmic fairness has begun to examine a growing variety of calibration-based objectives under the name of multi-calibration, but the range of objectives considered remains fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints by their input features rather than by their predictions, and we formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that, with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
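The two design axes named in the abstract, how datapoints are grouped and how group errors are agglomerated, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`calibration_score`, `group_by_prediction`, `group_by_feature`), the group error (absolute gap between mean prediction and observed label frequency), the two agglomeration options (size-weighted mean and maximum), and the toy data.

```python
from collections import defaultdict


def calibration_score(preds, labels, groups, aggregate="mean"):
    """Agglomerate per-group calibration errors (hypothetical interface).

    Group error here is |mean prediction - observed label frequency|,
    one common choice; other error definitions are possible.
    """
    members = defaultdict(list)
    for p, y, g in zip(preds, labels, groups):
        members[g].append((p, y))

    n = len(preds)
    errors, weights = [], []
    for pairs in members.values():
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        label_freq = sum(y for _, y in pairs) / len(pairs)
        errors.append(abs(mean_pred - label_freq))
        weights.append(len(pairs) / n)

    if aggregate == "mean":  # size-weighted average of group errors
        return sum(w * e for w, e in zip(weights, errors))
    if aggregate == "max":  # worst-case group error
        return max(errors)
    raise ValueError(f"unknown aggregate: {aggregate!r}")


def group_by_prediction(preds, n_bins=10):
    """Equal-width bins over [0, 1], applied to the predictions."""
    return [min(int(p * n_bins), n_bins - 1) for p in preds]


def group_by_feature(xs, threshold=0.0):
    """Two groups, split on a single scalar input feature (assumed)."""
    return [int(x >= threshold) for x in xs]


# Toy data (made up): the label is fully determined by the sign of x,
# but the predictor always outputs the base rate 0.5.
xs = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]
preds = [0.5, 0.5, 0.5, 0.5]

score_pred = calibration_score(preds, labels, group_by_prediction(preds))
score_feat = calibration_score(preds, labels, group_by_feature(xs))
score_feat_max = calibration_score(
    preds, labels, group_by_feature(xs), aggregate="max"
)

print(score_pred)      # 0.0: all points share one prediction bin
print(score_feat)      # 0.5: each feature group is off by 0.5
print(score_feat_max)  # 0.5
```

In this toy case the constant predictor looks perfectly calibrated when datapoints are grouped by prediction, while grouping by the input feature exposes an error of 0.5 in each group. This is only one example of how the grouping choice can change what a score detects; it is not a substitute for the formal results the abstract refers to.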