The randomized singular value decomposition (R-SVD) is a popular sketching-based algorithm for efficiently computing the partial SVD of a large matrix. When the matrix is low-rank, the R-SVD produces its partial SVD exactly; but when the rank is large, it only yields an approximation. Motivated by applications in data science and principal component analysis (PCA), we analyze the R-SVD under a low-rank signal plus noise measurement model; specifically, when its input is a spiked random matrix. The singular values produced by the R-SVD are shown to exhibit a BBP-like phase transition: when the SNR exceeds a certain detectability threshold, which depends on the dimension reduction factor, the largest singular value is an outlier; below the threshold, no outlier emerges from the bulk of singular values. We further compute asymptotic formulas for the overlap between the ground truth signal singular vectors and the approximations produced by the R-SVD. Dimensionality reduction has the adverse effect of amplifying the noise in a highly nonlinear manner. Our results demonstrate the statistical advantage -- in both signal detection and estimation -- of the R-SVD over more naive sketched PCA variants; the advantage is especially dramatic when the sketching dimension is small. Our analysis is asymptotically exact, and substantially more fine-grained than existing operator-norm error bounds for the R-SVD, which largely fail to give meaningful error estimates in the moderate SNR regime. It applies to a broad family of sketching matrices previously considered in the literature, including Gaussian i.i.d. sketches, random projections, and the sub-sampled Hadamard transform, among others. Lastly, we derive an optimal singular value shrinker for singular values and vectors obtained through the R-SVD, which may be useful for applications in matrix denoising.
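To make the setting concrete, the following is a minimal sketch of one common form of the R-SVD (a Gaussian i.i.d. sketch followed by a projection onto the sketched range), applied first to an exactly rank-one matrix and then to a rank-one spiked matrix. The function name `rsvd`, the matrix sizes, the SNR value, and the `1/sqrt(m)` noise normalization are illustrative assumptions, not taken from the paper, and the exact variant analyzed there may differ in details.

```python
import numpy as np

def rsvd(A, k, rng):
    """Basic randomized SVD (assumed textbook form) with sketching dimension k."""
    n, m = A.shape
    Omega = rng.standard_normal((m, k))      # Gaussian i.i.d. sketching matrix
    Y = A @ Omega                            # n x k sketch of the column space of A
    Q, _ = np.linalg.qr(Y)                   # orthonormal basis for the sketched range
    B = Q.T @ A                              # k x m projected matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small, s, Vt                # approximate left vectors, values, right vectors

rng = np.random.default_rng(0)
n, m, k = 400, 300, 60                       # hypothetical sizes; k is the sketching dimension

# Ground-truth unit-norm signal singular vectors.
u = rng.standard_normal(n); u /= np.linalg.norm(u)
v = rng.standard_normal(m); v /= np.linalg.norm(v)

# Case 1: exactly low-rank input, where the R-SVD recovers the partial SVD.
A_lowrank = 5.0 * np.outer(u, v)
U_lr, s_lr, Vt_lr = rsvd(A_lowrank, k, rng)

# Case 2: spiked random matrix (rank-one signal plus noise); normalization is an assumption.
snr = 3.0
noise = rng.standard_normal((n, m)) / np.sqrt(m)
A_spiked = snr * np.outer(u, v) + noise
U_sp, s_sp, Vt_sp = rsvd(A_spiked, k, rng)

# Overlaps between the true signal vectors and the leading R-SVD vectors.
overlap_u = abs(U_sp[:, 0] @ u)
overlap_v = abs(Vt_sp[0] @ v)
print("leading singular value (low-rank case):", s_lr[0])
print("leading singular value (spiked case):  ", s_sp[0])
print("overlaps (left, right):", overlap_u, overlap_v)
```

Varying `snr` and `k` in this sketch is one way to observe empirically how the leading singular value and the overlaps change with the sketching dimension; the asymptotic formulas themselves are the subject of the paper and are not reproduced here.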