Cross-validation is the standard approach to tuning parameter selection in many non-parametric regression problems. However, its use is less common in change-point regression, perhaps because its prediction error-based criterion may appear to permit small spurious changes and hence to be less well suited to estimating the number and locations of change-points. We show that the problems of cross-validation with squared error loss are in fact more severe: it can lead to systematic under- or over-estimation of the number of change-points, and to highly suboptimal estimation of the mean function, even in simple settings where changes are easily detectable. We propose two simple approaches to remedy these issues, the first involving the use of absolute error rather than squared error loss, and the second involving a modification of the holdout sets used. For the latter, we provide conditions that permit consistent estimation of the number of change-points for a general change-point estimation procedure. We show that these conditions are satisfied for least squares estimation, using new results on its performance when supplied with the incorrect number of change-points. Numerical experiments show that our new approaches are competitive with common change-point methods using classical tuning parameter choices when error distributions are well-specified, and can substantially outperform them in misspecified models. An implementation of our methodology is available in the R package crossvalidationCP on CRAN.
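To make the setting concrete, the following is a minimal Python sketch of ordinary two-fold cross-validation for choosing the number of change-points, with a switch between squared and absolute error loss. It is an illustration under stated assumptions, not the implementation in crossvalidationCP: the odd/even fold split, the exact dynamic-programming least squares fit, and the names `fit_changepoints`, `cv_criterion` and `select_k` are all hypothetical choices made here, and the modified holdout sets proposed above are not reproduced.

```python
import numpy as np

def fit_changepoints(y, k):
    """Least squares fit with exactly k change-points (dynamic programming).
    Returns the sorted indices at which a new segment starts."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Residual sum of squares of y[i:j] around its mean (j > i).
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    # dp[m, j]: minimal cost of splitting y[:j] into m + 1 segments.
    dp = np.full((k + 1, n + 1), np.inf)
    arg = np.zeros((k + 1, n + 1), dtype=int)
    for j in range(1, n + 1):
        dp[0, j] = cost(0, j)
    for m in range(1, k + 1):
        for j in range(m + 1, n + 1):
            for i in range(m, j):
                c = dp[m - 1, i] + cost(i, j)
                if c < dp[m, j]:
                    dp[m, j] = c
                    arg[m, j] = i
    # Backtrack the optimal segment boundaries.
    cps, j = [], n
    for m in range(k, 0, -1):
        j = arg[m, j]
        cps.append(int(j))
    return sorted(cps)

def cv_criterion(y, k, loss="squared"):
    """Two-fold CV criterion (assumed odd/even split) for k change-points."""
    y = np.asarray(y, dtype=float)
    total = 0.0
    for train, test in ((y[0::2], y[1::2]), (y[1::2], y[0::2])):
        cps = fit_changepoints(train, k)
        bounds = [0] + cps + [len(train)]
        fitted = np.empty(len(train))
        for a, b in zip(bounds[:-1], bounds[1:]):
            fitted[a:b] = train[a:b].mean()
        # Predict each held-out point by the fit at its neighbouring training point.
        m = min(len(train), len(test))
        resid = test[:m] - fitted[:m]
        if loss == "squared":
            total += float(np.sum(resid ** 2))
        else:  # absolute error loss
            total += float(np.sum(np.abs(resid)))
    return total

def select_k(y, k_max, loss="squared"):
    """Number of change-points in 0..k_max minimising the CV criterion."""
    scores = [cv_criterion(y, k, loss) for k in range(k_max + 1)]
    return int(np.argmin(scores))
```

Calling `select_k(y, k_max, loss="absolute")` in place of `loss="squared"` swaps only the loss used on the held-out points; the fit on the training fold is still least squares in both cases.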