The problem of regression extrapolation, or out-of-distribution generalization, arises when predictions are required at test points outside the range of the training data. In such cases, the non-parametric guarantees for regression methods from both statistics and machine learning typically fail. Based on the theory of tail dependence, we propose a novel statistical extrapolation principle. After a suitable, data-adaptive marginal transformation, it assumes a simple relationship between predictors and the response at the boundary of the training predictor samples. This assumption holds for a wide range of models, including non-parametric regression functions with additive noise. Our semi-parametric method, progression, leverages this extrapolation principle and offers guarantees on the approximation error beyond the training data range. We demonstrate how this principle can be effectively integrated with existing approaches, such as random forests and additive models, to improve extrapolation performance on out-of-distribution samples.
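The failure mode motivating this work can be seen in a toy sketch. Below, a simple local-averaging predictor (k-nearest neighbours, standing in for tree- or kernel-based methods; this is an illustration of the extrapolation problem only, not the progression method itself) tracks the signal well inside the training range but saturates at the boundary level once queried beyond it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: predictors in [0, 1], monotone signal with additive noise.
x_train = rng.uniform(0.0, 1.0, 500)
y_train = np.exp(x_train) + 0.1 * rng.normal(size=500)

def knn_predict(x0, k=20):
    """Local-averaging predictor: mean response of the k nearest training points."""
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

# Inside the training range, the local average tracks exp(x) closely.
inside = knn_predict(0.5)   # close to exp(0.5) ~ 1.65

# Beyond the boundary, it can only average the largest training responses,
# so the prediction saturates near exp(1) ~ 2.72 instead of exp(2) ~ 7.39.
far_out = knn_predict(2.0)

print(inside, far_out)
```

Any method whose predictions are convex combinations of training responses (forests, nearest neighbours, kernel smoothers) is bounded by the observed response range in this way, which is why extrapolation requires an additional structural assumption such as the boundary relationship posited above.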