Cross-validation is a statistical tool that can be used to improve large covariance matrix estimation. Although its effectiveness is observed in practical applications, the theoretical reasons behind it remain largely intuitive, and formal proofs are currently lacking. To make the problem analytically tractable, we focus on the holdout method, a single iteration of cross-validation, rather than the traditional $k$-fold approach. We derive a closed-form expression for the estimation error when the population matrix follows a white inverse-Wishart distribution, and we observe that the optimal train-test split scales as the square root of the matrix dimension. For general population matrices, we connect the error to the variance of the eigenvalue distribution, although approximations become necessary. Interestingly, in the high-dimensional asymptotic regime, both the holdout and $k$-fold cross-validation methods converge to the optimal estimator when the train-test ratio scales with the square root of the matrix dimension.
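For concreteness, one standard construction of the holdout estimator in this literature is as follows; the notation here ($E_{\mathrm{train}}$, $E_{\mathrm{test}}$, $\xi_i$) is ours and is an assumption, not taken from the paper. The $n$ samples are split into a train set of size $n_{\mathrm{train}}$ and a test set of size $n_{\mathrm{test}} = n - n_{\mathrm{train}}$; the train sample covariance is diagonalized, and each eigenvalue is re-estimated as the out-of-sample variance along the corresponding train eigenvector:
\[
E_{\mathrm{train}} = \sum_{i=1}^{d} \lambda_i \, v_i v_i^\top,
\qquad
\xi_i = v_i^\top E_{\mathrm{test}} \, v_i,
\qquad
\Xi_{\mathrm{holdout}} = \sum_{i=1}^{d} \xi_i \, v_i v_i^\top .
\]
Keeping the train eigenvectors $v_i$ while replacing each in-sample eigenvalue $\lambda_i$ by its test-set counterpart $\xi_i$ is what corrects the overfitting of the sample spectrum; the train-test split discussed in the abstract is then the choice of $n_{\mathrm{train}}$ versus $n_{\mathrm{test}}$ in this construction.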