The randomized singular value decomposition (R-SVD) is a popular sketching-based algorithm for efficiently computing the partial SVD of a large matrix. When the matrix is low-rank, the R-SVD produces its partial SVD exactly; but when the rank is large, it only yields an approximation. Motivated by applications in data science and principal component analysis (PCA), we analyze the R-SVD under a low-rank signal-plus-noise measurement model; specifically, when its input is a spiked random matrix. The singular values produced by the R-SVD are shown to exhibit a BBP-like phase transition: when the SNR exceeds a certain detectability threshold that depends on the dimension reduction factor, the largest singular value is an outlier; below the threshold, no outlier emerges from the bulk of singular values. We further compute asymptotic formulas for the overlap between the ground truth signal singular vectors and the approximations produced by the R-SVD. Dimensionality reduction has the adverse effect of amplifying the noise in a highly nonlinear manner. Our results demonstrate the statistical advantage -- in both signal detection and estimation -- of the R-SVD over more naive sketched PCA variants; the advantage is especially dramatic when the sketching dimension is small. Our analysis is asymptotically exact, and substantially more fine-grained than existing operator-norm error bounds for the R-SVD, which largely fail to give meaningful error estimates in the moderate SNR regime. It applies to a broad family of sketching matrices previously considered in the literature, including Gaussian i.i.d. sketches, random projections, and the sub-sampled Hadamard transform, among others. Lastly, we derive an optimal shrinker for the singular values and vectors obtained through the R-SVD, which may be useful for applications in matrix denoising.
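As a rough illustration of the setting described above, the following is a minimal sketch of a basic R-SVD with a Gaussian i.i.d. sketch, applied to a rank-one spiked matrix. It is not taken from the paper: the function names `randomized_svd` and `spiked_matrix`, the noise normalization (entries of variance 1/m), the absence of power iterations, and the problem sizes are all assumptions made for the example.

```python
import numpy as np


def randomized_svd(A, k, rng):
    """Basic randomized SVD with a Gaussian sketch of dimension k.

    Assumed variant: no power iterations and no extra oversampling.
    """
    n, m = A.shape
    omega = rng.standard_normal((m, k))      # Gaussian i.i.d. sketching matrix
    Y = A @ omega                            # n x k sketch of the column space of A
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for the sketch, n x k
    B = Q.T @ A                              # k x m compressed matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small, s, Vt                # approximate partial SVD of A


def spiked_matrix(n, m, snr, rng):
    """Rank-one signal plus Gaussian noise (assumed normalization: variance 1/m)."""
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(m)
    v /= np.linalg.norm(v)
    noise = rng.standard_normal((n, m)) / np.sqrt(m)
    return snr * np.outer(u, v) + noise, u, v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, u, v = spiked_matrix(200, 200, snr=20.0, rng=rng)
    U_hat, s_hat, Vt_hat = randomized_svd(A, k=50, rng=rng)
    # Overlap between the true left singular vector and the R-SVD estimate.
    print("top singular value:", s_hat[0])
    print("left overlap:", abs(u @ U_hat[:, 0]))
```

Lowering `snr` or the sketch dimension `k` in this toy setup is one way to observe empirically how the leading singular value and the overlap degrade; the precise thresholds and overlap formulas are those derived in the paper, not something this sketch computes.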