Estimating the generalization error (GE) of machine learning models is fundamental, with resampling methods being the most common approach. However, in non-standard settings, particularly those where observations are not independently and identically distributed, resampling using simple random data divisions may lead to biased GE estimates. This paper strives to present well-grounded guidelines for GE estimation in various such non-standard settings: clustered data, spatial data, unequal sampling probabilities, concept drift, and hierarchically structured outcomes. Our overview combines well-established methodologies with other existing methods that, to our knowledge, have not been frequently considered in these particular settings. A unifying principle among these techniques is that the test data used in each iteration of the resampling procedure should reflect the new observations to which the model will be applied, while the training data should be representative of the entire data set used to obtain the final model. Beyond providing an overview, we address literature gaps by conducting simulation studies. These studies assess the necessity of using GE-estimation methods tailored to the respective setting. Our findings corroborate the concern that standard resampling methods often yield biased GE estimates in non-standard settings, underscoring the importance of tailored GE estimation.
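As a hedged illustration of the unifying principle in the clustered-data setting, the sketch below assumes that the final model will be applied to observations from previously unseen clusters. Under that assumption, each test fold should contain only whole clusters that are absent from the corresponding training fold. The function name `grouped_kfold`, its parameters, and the toy data are hypothetical and chosen for illustration only; this is not the paper's implementation.

```python
import random


def grouped_kfold(cluster_ids, k, seed=0):
    """Yield (train_idx, test_idx) pairs that never split a cluster.

    cluster_ids: list with one cluster label per observation.
    k: number of folds (assumed to be at most the number of clusters).
    """
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    # Assign whole clusters to folds in round-robin order.
    fold_of = {c: i % k for i, c in enumerate(clusters)}
    for fold in range(k):
        test = [i for i, c in enumerate(cluster_ids) if fold_of[c] == fold]
        train = [i for i, c in enumerate(cluster_ids) if fold_of[c] != fold]
        yield train, test


# Toy data: 6 clusters (for example, hospitals) with 4 observations each.
cluster_ids = [c for c in range(6) for _ in range(4)]
splits = list(grouped_kfold(cluster_ids, k=3))

for train, test in splits:
    train_clusters = {cluster_ids[i] for i in train}
    test_clusters = {cluster_ids[i] for i in test}
    # No cluster contributes observations to both sides of a split.
    print(sorted(test_clusters), train_clusters.isdisjoint(test_clusters))
```

A simple random split of the same 24 observations would almost always place observations from one cluster in both the training and the test data, so the test data would no longer resemble observations from new clusters.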