Disaggregated evaluation -- estimating the performance of a machine learning model on different subpopulations -- is a core task when assessing the performance and group fairness of AI systems. A key challenge is that evaluation data is scarce, and subpopulations arising from intersections of attributes (e.g., race, sex, age) are often tiny. Today it is common for multiple clients to procure the same AI model from a model developer, and each client faces the task of disaggregated evaluation individually. This gives rise to what we call the multi-task disaggregated evaluation problem, in which multiple clients seek to conduct a disaggregated evaluation of a given model in their own data setting (task). In this work, we develop a disaggregated evaluation method called SureMap that attains high estimation accuracy for both multi-task and single-task disaggregated evaluations of black-box models. SureMap's efficiency gains come from (1) transforming the problem into one of structured simultaneous Gaussian mean estimation and (2) incorporating external data, e.g., from the AI system creator or from their other clients. The method combines maximum a posteriori (MAP) estimation using a well-chosen prior with cross-validation-free tuning via Stein's unbiased risk estimate (SURE). We evaluate SureMap on disaggregated evaluation tasks across multiple domains, observing significant accuracy improvements over several strong competitors.
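To make the MAP-plus-SURE idea concrete, the following is a minimal illustrative sketch, not SureMap itself: per-subgroup performance estimates are modeled as Gaussian observations with known variances, shrunk toward a common center via the MAP rule under a Gaussian prior, with the prior variance `tau2` selected by minimizing Stein's unbiased risk estimate over a grid. The function name `sure_shrinkage`, the pooled-mean shrinkage target, and the grid of candidate `tau2` values are all assumptions made for illustration.

```python
import numpy as np

def sure_shrinkage(y, sigma2, mu=None, taus=None):
    """SURE-tuned MAP shrinkage of noisy group means toward a common center.

    Illustrative sketch (not the paper's method): y_g ~ N(theta_g, sigma2_g)
    with known per-group noise variances sigma2_g.
    """
    y = np.asarray(y, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    if mu is None:
        mu = y.mean()                      # assumed shrinkage target: pooled mean
    if taus is None:
        taus = np.logspace(-4, 2, 200)     # assumed grid of prior variances tau^2

    best_tau, best_sure = None, np.inf
    for tau2 in taus:
        lam = tau2 / (tau2 + sigma2)       # MAP weight under prior N(mu, tau2)
        # Stein's unbiased risk estimate for this linear shrinkage rule:
        # ||theta_hat - y||^2 - sum(sigma2) + 2 * sum(sigma2 * d theta_hat/d y)
        sure = np.sum((1 - lam) ** 2 * (y - mu) ** 2 - sigma2 + 2 * sigma2 * lam)
        if sure < best_sure:
            best_sure, best_tau = sure, tau2

    lam = best_tau / (best_tau + sigma2)
    return lam * y + (1 - lam) * mu        # shrunken subgroup estimates
```

Because SURE is an unbiased estimate of the squared-error risk, minimizing it over `tau2` tunes the amount of shrinkage without holding out data, which matters precisely when subgroup samples are tiny.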