Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by the intersection of multiple sensitive attributes (e.g., race and sex and age). Here, the sample complexity for estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes. To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially and is upper bounded by the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that the Rényi entropy of order 2/3 of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we show that there exists a non-i.i.d. data collection strategy that results in a sample complexity independent of the number of groups.
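The two quantities named above can be illustrated with a minimal sketch. It assumes the textbook definitions: the Rényi entropy of order α, H_α(p) = log(Σ p_i^α) / (1 − α), and CVaR as the weighted average of the largest per-group gaps covering a given tail mass. The abstract does not state the exact form of the test statistic, so the function names `renyi_entropy` and `cvar_of_group_gaps`, the `tail_mass` parameter, and the example gap values are hypothetical choices made for illustration only.

```python
import math


def renyi_entropy(prior, order):
    """Renyi entropy (in nats) of a discrete distribution, for order != 1."""
    if order <= 0 or order == 1:
        raise ValueError("order must be positive and different from 1")
    total = sum(p ** order for p in prior if p > 0)
    return math.log(total) / (1.0 - order)


def cvar_of_group_gaps(gaps, weights, tail_mass):
    """Weighted mean of the largest gaps that cover `tail_mass` of the groups.

    Assumption: this is the standard tail-average form of CVaR for a
    discrete distribution, not necessarily the statistic used in the paper.
    """
    if not 0 < tail_mass <= 1:
        raise ValueError("tail_mass must lie in (0, 1]")
    # Rank groups from the largest performance gap to the smallest.
    ranked = sorted(zip(gaps, weights), key=lambda gw: gw[0], reverse=True)
    remaining = tail_mass
    accumulated = 0.0
    for gap, weight in ranked:
        take = min(weight, remaining)  # the last group may be used only partially
        accumulated += take * gap
        remaining -= take
        if remaining <= 1e-12:
            break
    return accumulated / tail_mass


# Three binary sensitive attributes give 2**3 = 8 intersectional groups.
num_groups = 2 ** 3
uniform_prior = [1.0 / num_groups] * num_groups

# Hypothetical per-group gaps: one group is much worse off than the rest.
gaps = [0.01] * (num_groups - 1) + [0.2]

entropy = renyi_entropy(uniform_prior, 2.0 / 3.0)
worst_case_gap = max(gaps)
cvar_gap = cvar_of_group_gaps(gaps, uniform_prior, tail_mass=0.25)
```

For a uniform prior, the order-2/3 Rényi entropy equals the logarithm of the number of groups, so `entropy` matches `math.log(8)` here. The CVaR value averages the gap over the worst quarter of the groups, so it falls between the mean gap and the worst-case gap; this is the sense in which a small slack on the groups relaxes the worst-case criterion.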