Probabilistic predictions can be evaluated by comparing them with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has begun to consider a growing variety of calibration-based objectives under the name of multi-calibration, but the range of objectives considered remains fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and one choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on their predictions, and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to the choice of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that, with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
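To make the two kinds of design choice concrete, the following is a minimal sketch of a calibration score built from a grouping and an agglomeration function. It is an illustration under our own assumptions, not the paper's definitions: the names `calibration_score`, `group_of`, `weighted_mean` and `worst_case` are hypothetical, the group error is taken to be the absolute gap between mean prediction and observed label frequency, and a single grouping function stands in for the three grouping choices mentioned in the abstract.

```python
from collections import defaultdict
from typing import Callable, Hashable, Sequence


def calibration_score(
    predictions: Sequence[float],
    labels: Sequence[int],
    group_of: Callable[[int], Hashable],
    aggregate: Callable[[list, list], float],
) -> float:
    """Group datapoints, compute each group's error, then agglomerate.

    Assumption: the group error is |mean prediction - label frequency|.
    """
    # Grouping choice: map each datapoint index to a group key.
    groups = defaultdict(list)
    for i in range(len(predictions)):
        groups[group_of(i)].append(i)

    errors, sizes = [], []
    for members in groups.values():
        mean_pred = sum(predictions[i] for i in members) / len(members)
        label_freq = sum(labels[i] for i in members) / len(members)
        errors.append(abs(mean_pred - label_freq))
        sizes.append(len(members))

    # Agglomeration choice: combine the group errors into one score.
    return aggregate(errors, sizes)


def weighted_mean(errors, sizes):
    """Size-weighted average of group errors (an ECE-style choice)."""
    return sum(e * n for e, n in zip(errors, sizes)) / sum(sizes)


def worst_case(errors, sizes):
    """Largest group error, ignoring group sizes."""
    return max(errors)


# Toy data (made up for illustration only).
predictions = [0.2, 0.2, 0.8, 0.8]
labels = [0, 1, 1, 1]
features = ["a", "b", "a", "b"]


def by_prediction(i):
    """Group by prediction, using two bins split at 0.5."""
    return int(predictions[i] >= 0.5)


def by_feature(i):
    """Group by an input feature instead of the prediction."""
    return features[i]


pred_mean = calibration_score(predictions, labels, by_prediction, weighted_mean)
pred_worst = calibration_score(predictions, labels, by_prediction, worst_case)
feat_mean = calibration_score(predictions, labels, by_feature, weighted_mean)
feat_worst = calibration_score(predictions, labels, by_feature, worst_case)
```

On this toy data the two groupings receive the same size-weighted score but different worst-case scores, which hints at why the grouping and the agglomeration function are worth treating as separate choices.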