Randomized algorithms can be used to speed up the analysis of large datasets. In this paper, we develop a unified methodology for statistical inference via randomized sketching or projections in two of the most fundamental problems in multivariate statistical analysis: least squares and PCA. The methodology applies to fixed datasets -- i.e., is data-conditional -- and the only randomness is due to the randomized algorithm. We propose statistical inference methods for a broad range of sketching distributions, such as the subsampled randomized Hadamard transform (SRHT), Sparse Sign Embeddings (SSE) and CountSketch, sketching matrices with i.i.d. entries, and uniform subsampling. To our knowledge, no comparable methods are available for SSE and for SRHT in PCA. Our novel theoretical approach rests on showing the asymptotic normality of certain quadratic forms. As a contribution of broader interest, we show central limit theorems for quadratic forms of the SRHT, relying on a novel proof via a dyadic expansion that leverages the recursive structure of the Hadamard transform. Numerical experiments using both synthetic and empirical datasets support the efficacy of our methods, and in particular suggest that sketching methods can have better computation-estimation tradeoffs than recently proposed optimal subsampling methods.
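To make the data-conditional setting concrete, the following is a minimal sketch of sketch-and-solve least squares with a Gaussian sketching matrix, one instance of the "sketching matrices with i.i.d. entries" named in the abstract. It is not the inference procedure developed in the paper. The function name `sketched_least_squares`, the problem sizes, the seeds, and the synthetic data are all assumptions chosen for illustration. The dataset is generated once and then held fixed, so the variation across repeated draws comes only from the sketch.

```python
import numpy as np


def sketched_least_squares(X, y, m, rng):
    """Sketch-and-solve least squares: minimize ||S y - S X b||_2 over b.

    S is an m x n matrix with i.i.d. N(0, 1/m) entries (an illustrative choice).
    """
    n = X.shape[0]
    S = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta


# A fixed dataset: generated once, then treated as deterministic.
data_rng = np.random.default_rng(0)
n, p, m = 1000, 5, 200  # hypothetical sizes, with p < m < n
X = data_rng.normal(size=(n, p))
y = X @ np.arange(1.0, p + 1) + data_rng.normal(size=n)

# Full-data least squares solution, the target of the sketched estimator.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Repeat the sketch many times; only the sketching matrix changes between draws.
sketch_rng = np.random.default_rng(1)
draws = np.array(
    [sketched_least_squares(X, y, m, sketch_rng) for _ in range(100)]
)

# Empirical spread and centering of the sketched solutions around beta_full.
spread = draws.std(axis=0)
offset = draws.mean(axis=0) - beta_full
print("per-coordinate std over sketches:", spread)
print("mean offset from full solution:  ", offset)
```

The per-coordinate spread printed here is the quantity that inference for sketched estimators needs to characterize. In practice one would estimate it from a single sketch rather than by repeating the sketch, which is the regime the abstract addresses.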