Machine learning (ML) models used in prediction and classification tasks may display performance disparities across population groups determined by sensitive attributes (e.g., race, sex, age). We consider the problem of evaluating the performance of a fixed ML model across population groups defined by multiple sensitive attributes (e.g., race and sex and age). In this setting, the sample complexity of estimating the worst-case performance gap across groups (e.g., the largest difference in error rates) increases exponentially with the number of group-denoting sensitive attributes. To address this issue, we propose an approach to test for performance disparities based on Conditional Value-at-Risk (CVaR). By allowing a small probabilistic slack on the groups over which a model has approximately equal performance, we show that the sample complexity required for discovering performance violations is reduced exponentially and is upper bounded by the square root of the number of groups. As a byproduct of our analysis, when the groups are weighted by a specific prior distribution, we show that the R\'enyi entropy of order $2/3$ of the prior distribution captures the sample complexity of the proposed CVaR test algorithm. Finally, we show that there exists a non-i.i.d. data collection strategy that results in a sample complexity independent of the number of groups.
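To make the two quantities named in the abstract concrete, the following is a minimal Python sketch using the standard textbook definitions, not the paper's test algorithm. It assumes CVaR is taken in its Rockafellar-Uryasev form over per-group performance gaps weighted by a prior, with `alpha` playing the role of the probabilistic slack (the mass of the worst-off groups being averaged), and it assumes the usual R\'enyi entropy $H_a(p) = \frac{1}{1-a}\log\sum_i p_i^a$ with $a = 2/3$. The function names, the example gaps, and the uniform prior are all hypothetical.

```python
import math


def cvar(values, prior, alpha):
    """CVaR of a discrete random variable: the mean of its worst alpha-tail.

    Assumed form (Rockafellar-Uryasev): min over t of t + E[(X - t)_+] / alpha,
    where `alpha` in (0, 1] is the tail mass. For a discrete variable the
    minimum is attained at one of the atoms, so only those are searched.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    total = sum(prior)
    probs = [w / total for w in prior]  # normalise the group weights

    def objective(t):
        excess = sum(p * max(v - t, 0.0) for v, p in zip(values, probs))
        return t + excess / alpha

    return min(objective(t) for t in values)


def renyi_entropy(prior, order=2.0 / 3.0):
    """Renyi entropy (in nats) of a discrete distribution, default order 2/3."""
    if order <= 0.0 or order == 1.0:
        raise ValueError("order must be positive and different from 1")
    total = sum(prior)
    power_sum = sum((w / total) ** order for w in prior if w > 0)
    return math.log(power_sum) / (1.0 - order)


# Hypothetical per-group gaps (e.g., |group error - overall error|), uniform prior.
gaps = [0.01, 0.02, 0.03, 0.30]
prior = [0.25, 0.25, 0.25, 0.25]

worst_case_gap = max(gaps)                  # the quantity that is costly to estimate
tail_gap = cvar(gaps, prior, alpha=0.5)     # average gap over the worst half of the mass
entropy = renyi_entropy(prior)              # equals log(4) for a uniform prior on 4 groups
```

As `alpha` shrinks toward the smallest group weight, this CVaR approaches the worst-case gap, and at `alpha = 1` it is the prior-weighted mean gap. For a uniform prior over $n$ groups the order-$2/3$ entropy is $\log n$, so $\exp(H_{2/3}/2) = \sqrt{n}$, which is at least consistent with the square-root scaling stated above; how exactly the entropy enters the bound is not given in this text.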