A canonical desideratum for prediction problems is that performance guarantees should hold not just on average over the population, but also for meaningful subpopulations within it. But what constitutes a meaningful subpopulation? In this work, we take the perspective that relevant subpopulations should be defined with respect to the clusters that naturally emerge from the distribution of individuals for whom predictions are being made. In this view, a population is a mixture model whose components constitute the relevant subpopulations. We suggest two formalisms for capturing per-subgroup guarantees: first, attributing each individual to the component from which they were most likely drawn, given their features; and second, attributing each individual to all components in proportion to their relative likelihood of having been drawn from each component. Using online calibration as a case study, we study a \variational algorithm that provides guarantees for each of these formalisms by handling all plausible underlying subpopulation structures simultaneously, and that achieves an $O(T^{1/2})$ rate even when the subpopulations are not well-separated. In comparison, the more natural cluster-then-predict approach, which first recovers the structure of the subpopulations and then makes predictions, suffers from an $O(T^{2/3})$ rate and requires the subpopulations to be separable. Along the way, we prove that providing per-subgroup calibration guarantees for underlying clusters can be easier than learning the clusters: separation between median subgroup features is required for the latter but not the former.
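The two attribution formalisms described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the mixture is one-dimensional Gaussian with known parameters, the function names are hypothetical, and the error measure is a generic unnormalized $\ell_1$ calibration error weighted by component membership, which may differ from the paper's exact definition. Predictions are assumed to lie on a finite grid so that they can be grouped by value.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian (an assumed component family)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior probability that x was drawn from each mixture component."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]


def hard_attribution(resp):
    """First formalism: all membership on the most likely component."""
    best = max(range(len(resp)), key=lambda k: resp[k])
    return [1.0 if k == best else 0.0 for k in range(len(resp))]


def per_component_calibration_error(preds, outcomes, memberships, n_components):
    """Membership-weighted l1 calibration error for each component.

    For component k, residuals m[k] * (y - p) are summed within each
    distinct prediction value p, and the absolute sums are added up.
    """
    errors = []
    for k in range(n_components):
        residual_by_pred = {}
        for p, y, m in zip(preds, outcomes, memberships):
            residual_by_pred[p] = residual_by_pred.get(p, 0.0) + m[k] * (y - p)
        errors.append(sum(abs(r) for r in residual_by_pred.values()))
    return errors


# Hypothetical two-component mixture and a handful of individuals.
weights, means, stds = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
features = [-2.0, -0.3, 0.4, 1.8]
preds = [0.5, 0.5, 0.5, 0.5]
outcomes = [1, 0, 1, 0]

# Second formalism: membership proportional to relative likelihood.
soft = [responsibilities(x, weights, means, stds) for x in features]
# First formalism: membership in the single most likely component.
hard = [hard_attribution(r) for r in soft]

soft_errors = per_component_calibration_error(preds, outcomes, soft, 2)
hard_errors = per_component_calibration_error(preds, outcomes, hard, 2)
```

Under hard attribution each individual contributes to exactly one component's error, while under soft attribution an individual near the boundary between clusters contributes to both, in proportion to its posterior membership. This sketch only evaluates the two notions on fixed predictions; it does not implement the online algorithm studied in the paper.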