A canonical desideratum for prediction problems is that performance guarantees should hold not just on average over the population, but also for meaningful subpopulations within it. But what constitutes a meaningful subpopulation? In this work, we take the perspective that relevant subpopulations should be defined with respect to the clusters that naturally emerge from the distribution of individuals for whom predictions are being made. In this view, a population is a mixture model whose components constitute the relevant subpopulations. We suggest two formalisms for capturing per-subgroup guarantees: first, attributing each individual to the component from which they were most likely drawn, given their features; and second, attributing each individual to all components in proportion to their relative likelihood of having been drawn from each. Using online calibration as a case study, we study a multi-objective algorithm that provides guarantees for both formalisms by handling all plausible underlying subpopulation structures simultaneously, achieving an $O(T^{1/2})$ rate even when the subpopulations are not well-separated. By comparison, the more natural cluster-then-predict approach, which first recovers the subpopulation structure and then makes predictions, suffers an $O(T^{2/3})$ rate and requires the subpopulations to be separable. Along the way, we prove that providing per-subgroup calibration guarantees for underlying clusters can be easier than learning the clusters: separation between median subgroup features is required for the latter but not the former.