We derive a formula for optimal hard thresholding of the singular value decomposition in the presence of correlated additive noise. Although the formula nominally involves unobservables, we show how to apply it even when the noise covariance structure is not known a priori or cannot be independently estimated. The proposed method, which we call ScreeNOT, is a mathematically solid alternative to Cattell's ever-popular but vague Scree Plot heuristic from 1966. ScreeNOT has a surprising oracle property: in large finite samples, it typically achieves exactly the lowest possible MSE for matrix recovery on each given problem instance. That is, the specific threshold it selects gives exactly the smallest MSE loss achievable among all possible threshold choices for that noisy dataset and that unknown underlying true low-rank model. The method is computationally efficient and robust against perturbations of the underlying covariance structure. Our results depend on the assumption that the singular values of the noise have a limiting empirical distribution of compact support. This model, which is standard in random matrix theory, is satisfied by many models exhibiting either cross-row or cross-column correlation structure, and also by many situations with inter-element correlation structure. Simulations demonstrate the effectiveness of the method even at moderate matrix sizes. The paper is supplemented by ready-to-use software packages implementing the proposed algorithm: package ScreeNOT in Python (via PyPI) and in R (via CRAN).
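To make the benchmark behind the oracle property concrete, the following is a minimal NumPy sketch of hard thresholding of the SVD, together with a brute-force search for the loss-minimizing threshold on one simulated instance. It is not the ScreeNOT algorithm and does not use the ScreeNOT package: the brute-force search needs the true low-rank matrix, which is available only in simulation. The function names, the matrix sizes, the signal strengths, and the AR(1) column-correlated noise model are all illustrative assumptions, not taken from the paper.

```python
import numpy as np


def hard_threshold_svd(Y, tau):
    """Keep only the singular components of Y whose singular value exceeds tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    keep = s > tau
    return (U[:, keep] * s[keep]) @ Vt[keep, :]


def oracle_threshold(Y, X_true):
    """Brute-force the threshold minimizing ||X_hat - X_true||_F^2.

    Requires X_true, so this is a simulation-only benchmark (hypothetical
    helper, not the ScreeNOT procedure). The loss changes only when tau
    crosses a singular value of Y, so it suffices to compare the
    reconstructions that keep the top r components, r = 0, ..., min(n, p).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    X_hat = np.zeros_like(Y)
    best_r, best_loss = 0, float(np.sum(X_true ** 2))  # r = 0: estimate is zero
    for r in range(1, len(s) + 1):
        X_hat = X_hat + s[r - 1] * np.outer(U[:, r - 1], Vt[r - 1])
        loss = float(np.sum((X_hat - X_true) ** 2))
        if loss < best_loss:
            best_r, best_loss = r, loss
    # Any tau in [s[best_r], s[best_r - 1]) keeps exactly the top best_r components.
    tau = float(s[best_r]) if best_r < len(s) else 0.0
    return tau, best_r, best_loss


def make_example(n=200, p=100, signal=(5.0, 3.0, 0.5), rho=0.5, seed=0):
    """Simulate Y = X + Z with low-rank X and column-correlated noise Z (assumed AR(1))."""
    rng = np.random.default_rng(seed)
    k = len(signal)
    U0, _ = np.linalg.qr(rng.standard_normal((n, k)))
    V0, _ = np.linalg.qr(rng.standard_normal((p, k)))
    X = (U0 * np.asarray(signal)) @ V0.T
    # AR(1) covariance across columns: C[i, j] = rho ** |i - j|.
    idx = np.arange(p)
    C = rho ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(C)
    Z = rng.standard_normal((n, p)) @ L.T / np.sqrt(n)
    return X + Z, X


Y, X_true = make_example()
tau_star, r_star, loss_star = oracle_threshold(Y, X_true)
print(f"oracle threshold ~ {tau_star:.3f}, components kept = {r_star}, loss = {loss_star:.3f}")
```

On any single instance, `loss_star` is by construction the smallest squared error that any hard threshold can reach. The abstract's claim is that ScreeNOT typically matches this value in large finite samples without access to `X_true`.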