Cross-validation is a widely used technique for assessing the performance of predictive models on unseen data. Many predictive models, such as kernel-based partial least-squares (PLS) models, require the computation of $\mathbf{X}^{\mathbf{T}}\mathbf{X}$ and $\mathbf{X}^{\mathbf{T}}\mathbf{Y}$ using only training set samples from the input and output matrices, $\mathbf{X}$ and $\mathbf{Y}$, respectively. In this work, we present three algorithms that efficiently compute these matrices for every cross-validation partition. The first applies no column-wise preprocessing. The second centers each column around its training set mean. The third centers each column around its training set mean and scales it by its training set standard deviation. We demonstrate the correctness of the algorithms and their superior computational complexity. They offer significant cross-validation speedup compared with both straightforward cross-validation and previous work on fast cross-validation, all without data leakage. Their suitability for parallelization is highlighted with an open-source Python implementation that combines our algorithms with Improved Kernel PLS.
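To make the idea concrete, the following is a minimal NumPy sketch of the identity that makes such speedups possible. It is not the algorithms presented in this work. The training set products equal the full-data products minus the contribution of the validation rows, and training set centering can be applied afterwards as a rank-one correction. The function name `fold_cross_products` and the `folds` label vector are hypothetical, and the sketch assumes that $\mathbf{X}$ and $\mathbf{Y}$ are two-dimensional arrays with one row per sample. Scaling by training set standard deviations is omitted.

```python
import numpy as np


def fold_cross_products(X, Y, folds, center=False):
    """Yield (fold_id, XtX_train, XtY_train) for each validation fold.

    Hypothetical helper for illustration only. X is (N, K), Y is (N, M),
    and folds is a length-N vector assigning each row to a validation fold.
    """
    folds = np.asarray(folds)
    n = X.shape[0]

    # Computed once on the full data set and reused by every fold.
    XtX_all = X.T @ X
    XtY_all = X.T @ Y
    x_sum_all = X.sum(axis=0, keepdims=True)
    y_sum_all = Y.sum(axis=0, keepdims=True)

    for fold_id in np.unique(folds):
        val = folds == fold_id
        X_val, Y_val = X[val], Y[val]
        n_train = n - X_val.shape[0]

        # Remove the validation rows' contribution from the full products.
        XtX = XtX_all - X_val.T @ X_val
        XtY = XtY_all - X_val.T @ Y_val

        if center:
            # Training set means, obtained from sums without touching
            # the training rows themselves (no leakage from validation rows).
            x_mean = (x_sum_all - X_val.sum(axis=0, keepdims=True)) / n_train
            y_mean = (y_sum_all - Y_val.sum(axis=0, keepdims=True)) / n_train
            # Centering changes the products by a rank-one term.
            XtX = XtX - n_train * (x_mean.T @ x_mean)
            XtY = XtY - n_train * (x_mean.T @ y_mean)

        yield fold_id, XtX, XtY
```

Each fold then only touches its own validation rows, which are typically far fewer than the training rows, and the folds are independent of one another, so the loop body could be distributed across workers.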