Prediction models can perform poorly when deployed to target distributions that differ from the training distribution. To understand these operational failure modes, we develop a method, called DIstribution Shift DEcomposition (DISDE), that attributes a drop in performance to different types of distribution shift. Our approach decomposes the performance drop into terms for 1) an increase in harder but frequently seen examples from training, 2) changes in the relationship between features and outcomes, and 3) poor performance on examples infrequent or unseen during training. These terms are defined either by fixing a distribution on $X$ while varying the conditional distribution of $Y \mid X$ between training and target, or by fixing the conditional distribution of $Y \mid X$ while varying the distribution on $X$. To do this, we define a hypothetical distribution on $X$ consisting of values common to both training and target, over which it is easy to compare $Y \mid X$ and thus predictive performance. We estimate performance on this hypothetical distribution via reweighting methods. Empirically, we show how our method can 1) inform potential modeling improvements across distribution shifts for employment prediction on tabular census data, and 2) help explain why certain domain adaptation methods fail to improve model performance for satellite image classification.
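One way to write three terms consistent with this description is as a telescoping sum. Write $P$ and $Q$ for the training and target distributions, $S_X$ for the shared distribution on $X$, and $\ell$ for the loss. Then $\mathbb{E}_Q[\ell]-\mathbb{E}_P[\ell] = (\mathbb{E}_{S_X \times P_{Y\mid X}}[\ell]-\mathbb{E}_P[\ell]) + (\mathbb{E}_{S_X \times Q_{Y\mid X}}[\ell]-\mathbb{E}_{S_X \times P_{Y\mid X}}[\ell]) + (\mathbb{E}_Q[\ell]-\mathbb{E}_{S_X \times Q_{Y\mid X}}[\ell])$. The Python below is a minimal sketch of that bookkeeping, not the authors' implementation. It assumes a discrete $X$ with known densities `p` and `q` and known per-$x$ expected losses under each conditional. The shared density $s(x) \propto p(x)q(x)/(p(x)+q(x))$ is an illustrative choice assumed here; it is positive only where both training and target are. All function names are hypothetical. In practice the density ratios would have to be estimated from samples, for example with a classifier that distinguishes training inputs from target inputs.

```python
from typing import Dict, Hashable

Dist = Dict[Hashable, float]  # x -> probability, or x -> expected loss given X = x


def shared_density(p: Dist, q: Dist) -> Dist:
    """Hypothetical shared distribution: s(x) proportional to p(x) q(x) / (p(x) + q(x))."""
    raw = {}
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        # Zero mass wherever either training or target assigns zero mass.
        raw[x] = px * qx / (px + qx) if px + qx > 0 else 0.0
    z = sum(raw.values())
    if z == 0:
        raise ValueError("training and target supports do not overlap")
    return {x: v / z for x, v in raw.items()}


def reweighted_mean(base: Dist, target: Dist, cond_loss: Dist) -> float:
    """E_target[cond_loss(X)] computed as E_base[w(X) * cond_loss(X)], w = target / base.

    Valid only if target puts no mass where base has none, which holds for the
    shared density relative to either p or q.
    """
    return sum(
        bx * (target.get(x, 0.0) / bx) * cond_loss[x]
        for x, bx in base.items()
        if bx > 0
    )


def decompose_drop(p: Dist, q: Dist, loss_p: Dist, loss_q: Dist) -> Dict[str, float]:
    """loss_p[x] = E[loss | X = x] under training Y|X; loss_q[x] under target Y|X."""
    s = shared_density(p, q)
    perf_train = sum(p[x] * loss_p[x] for x in p)
    perf_target = sum(q[x] * loss_q[x] for x in q)
    shared_train_yx = reweighted_mean(p, s, loss_p)   # S_X with training Y|X (reweight P)
    shared_target_yx = reweighted_mean(q, s, loss_q)  # S_X with target Y|X (reweight Q)
    return {
        "x_shift_to_shared": shared_train_yx - perf_train,      # term 1
        "y_given_x_shift": shared_target_yx - shared_train_yx,  # term 2
        "x_shift_to_target": perf_target - shared_target_yx,    # term 3
        "total_drop": perf_target - perf_train,
    }


if __name__ == "__main__":
    p = {"a": 0.7, "b": 0.3}
    q = {"a": 0.2, "b": 0.5, "c": 0.3}  # "c" never appears in training
    print(decompose_drop(p, q, {"a": 0.1, "b": 0.4}, {"a": 0.1, "b": 0.6, "c": 0.9}))
```

In this toy example the three terms sum to the total drop of 0.4 (up to floating-point error). The largest share lands in the third term, driven by the value `"c"` that training never covers. Conditional losses are only compared on the shared support, which is why `loss_p` never needs a value for `"c"`.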