Prediction models can perform poorly when deployed to target distributions different from the training distribution. To understand these operational failure modes, we develop a method, called DIstribution Shift DEcomposition (DISDE), to attribute a drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into terms for 1) an increase in harder but frequently seen examples from training, 2) changes in the relationship between features and outcomes, and 3) poor performance on examples infrequent or unseen during training. These terms are defined by fixing a distribution on $X$ while varying the conditional distribution of $Y \mid X$ between training and target, or by fixing the conditional distribution of $Y \mid X$ while varying the distribution on $X$. In order to do this, we define a hypothetical distribution on $X$ consisting of values common in both training and target, over which it is easy to compare $Y \mid X$ and thus predictive performance. We estimate performance on this hypothetical distribution via reweighting methods. Empirically, we show how our method can 1) inform potential modeling improvements across distribution shifts for employment prediction on tabular census data, and 2) help to explain why certain domain adaptation methods fail to improve model performance for satellite image classification.
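As an illustration of the decomposition described above, the following minimal sketch computes the three terms from per-example losses by reweighting. It is not the authors' implementation. The function name `decompose_shift`, its arguments, and the particular choice of shared distribution are assumptions made for this example: we assume a domain classifier, trained on a balanced pool of training and target examples, supplies an estimate $\pi(x)$ of the probability that $x$ came from the target, and we take the shared distribution on $X$ to have density proportional to $pq/(p+q)$, so that training examples are weighted by $\pi(x)$ and target examples by $1-\pi(x)$.

```python
from statistics import fmean


def weighted_mean(values, weights):
    """Self-normalized weighted average of per-example losses."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(v * w for v, w in zip(values, weights)) / total


def decompose_shift(loss_train, loss_target, pi_train, pi_target):
    """Split the train-to-target performance change into three terms.

    loss_train, loss_target: per-example losses of the fixed model.
    pi_train, pi_target: estimated P(example is from target | x) for each
        example (ASSUMPTION: produced by a domain classifier fit on a
        balanced pool of training and target examples).
    """
    perf_train = fmean(loss_train)    # loss under training X, training Y|X
    perf_target = fmean(loss_target)  # loss under target X, target Y|X

    # ASSUMPTION: shared density proportional to p*q/(p+q), which gives
    # weights pi(x) on training data and 1 - pi(x) on target data.
    shared_train = weighted_mean(loss_train, pi_train)
    shared_target = weighted_mean(loss_target, [1.0 - p for p in pi_target])

    return {
        # 1) X shift toward harder examples that are common in training
        "x_shift_train_to_shared": shared_train - perf_train,
        # 2) change in Y | X, compared on the shared X distribution
        "y_given_x_shift": shared_target - shared_train,
        # 3) X shift toward examples rare or unseen in training
        "x_shift_shared_to_target": perf_target - shared_target,
        "total": perf_target - perf_train,
    }


# Toy usage with made-up numbers (not results from the paper).
terms = decompose_shift(
    loss_train=[0.1, 0.3],
    loss_target=[0.5, 0.7],
    pi_train=[0.2, 0.8],
    pi_target=[0.6, 0.9],
)
```

The three terms telescope, so they sum to the total performance change by construction. In practice the quality of the attribution depends on how well $\pi(x)$ is estimated, and the weights would need adjusting if the pooled sample used to fit the domain classifier were not balanced.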