Distribution shifts are ubiquitous in real-world machine learning applications and make it difficult for a model trained on one data distribution to generalize to another. We focus on scenarios where the data distribution varies across multiple segments of the population, and we make only local assumptions about how the training and test (deployment) distributions differ within each segment. For tabular data analysis, we propose a two-stage multiply robust estimation method that improves model performance on each individual segment. The first stage fits a linear combination of base models, each learned from a cluster of training data drawn from multiple segments; the second stage refines the combined model for each segment. The method is designed to be implemented with commonly used off-the-shelf machine learning models. We establish a theoretical generalization bound on the test risk of the method. In extensive experiments on synthetic and real datasets, the proposed method substantially improves on existing alternatives in prediction accuracy and robustness for both regression and classification tasks. We also assess its effectiveness on a user city prediction dataset from a large technology company.
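To make the two-stage structure concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes one-dimensional features, ordinary least-squares lines as base models, a least-squares linear combination per segment, and a constant residual offset as a stand-in for the refinement step. All names (`fit_line`, `fit_weights`, `fit_two_stage`) and the toy data are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Base model (assumed): 1-D ordinary least squares, y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_weights(preds, ys, ridge=1e-8):
    """Least-squares weights for a linear combination of base predictions.

    preds[i][k] is base model k's prediction for sample i. The tiny ridge
    term is an assumption added only to keep the normal equations solvable.
    """
    k = len(preds[0])
    A = [[sum(p[i] * p[j] for p in preds) + (ridge if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(p[i] * y for p, y in zip(preds, ys)) for i in range(k)]
    return solve(A, b)

def fit_two_stage(clusters, segments):
    """clusters: list of (xs, ys); segments: dict name -> (xs, ys)."""
    # Base models: one per cluster of training data.
    base = [fit_line(xs, ys) for xs, ys in clusters]
    models = {}
    for name, (xs, ys) in segments.items():
        # Stage 1: segment-specific linear combination of the base models.
        preds = [[f(x) for f in base] for x in xs]
        w = fit_weights(preds, ys)
        combo = [sum(wk * pk for wk, pk in zip(w, p)) for p in preds]
        # Stage 2 (stand-in refinement): constant offset from mean residual.
        offset = mean(y - c for y, c in zip(ys, combo))
        models[name] = (w, offset)

    def predict(name, x):
        w, offset = models[name]
        return sum(wk * f(x) for wk, f in zip(w, base)) + offset

    return predict

# Toy data (hypothetical): two clusters and one segment.
clusters = [([0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0]),
            ([0.0, 1.0, 2.0, 3.0], [1.0, 0.0, -1.0, -2.0])]
segments = {"s1": ([0.0, 1.0, 2.0, 3.0, 4.0], [3.5, 4.0, 4.5, 5.0, 5.5])}
predict = fit_two_stage(clusters, segments)
```

In practice the base models would be off-the-shelf learners and the refinement step would likely be richer than a constant offset; the sketch only shows how the stages compose.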