Inference in Randomized Least Squares and PCA via Normality of Quadratic Forms

Randomized algorithms can be used to speed up the analysis of large datasets. In this paper, we develop a unified methodology for statistical inference via randomized sketching or projections in two of the most fundamental problems in multivariate statistical analysis: least squares and PCA. The methodology applies to fixed datasets (i.e., it is data-conditional), and the only randomness is due to the randomized algorithm. We propose statistical inference methods for a broad range of sketching distributions, such as the subsampled randomized Hadamard transform (SRHT), sparse sign embeddings (SSE) and CountSketch, sketching matrices with i.i.d. entries, and uniform subsampling. To our knowledge, no comparable methods are available for SSE and for SRHT in PCA. Our novel theoretical approach rests on showing the asymptotic normality of certain quadratic forms. As a contribution of broader interest, we show central limit theorems for quadratic forms of the SRHT, relying on a novel proof via a dyadic expansion that leverages the recursive structure of the Hadamard transform. Numerical experiments using both synthetic and empirical datasets support the efficacy of our methods, and in particular suggest that sketching methods can have better computation-estimation tradeoffs than recently proposed optimal subsampling methods.
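To make the data-conditional setting concrete, the following is a minimal sketch of sketch-and-solve least squares in Python with NumPy. It is an illustration under stated assumptions, not the paper's inference procedure: the function name `sketch_and_solve`, the synthetic data, the sketch size `m`, and the number of repetitions are all hypothetical choices. It covers only two of the sketching distributions named above (i.i.d. entries, here Gaussian, and uniform subsampling). The dataset is held fixed and only the sketch is redrawn, so the spread of the sketched solutions reflects algorithmic randomness alone.

```python
import numpy as np


def sketch_and_solve(X, y, m, rng, kind="gaussian"):
    """Solve least squares on a sketched copy (S X, S y) of the fixed data.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = X.shape[0]
    if kind == "gaussian":
        # i.i.d. N(0, 1/m) entries, so that E[S^T S] is the identity.
        S = rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))
        SX, Sy = S @ X, S @ y
    elif kind == "uniform":
        # Uniform subsampling: keep m rows chosen without replacement.
        idx = rng.choice(n, size=m, replace=False)
        SX, Sy = X[idx], y[idx]
    else:
        raise ValueError(f"unknown sketch kind: {kind}")
    beta, *_ = np.linalg.lstsq(SX, Sy, rcond=None)
    return beta


rng = np.random.default_rng(0)

# A fixed synthetic dataset (assumed sizes, chosen only for illustration).
n, p, m = 1000, 5, 200
X = rng.normal(size=(n, p))
y = X @ np.arange(1.0, p + 1) + rng.normal(size=n)

# Full-data least squares solution, the target of the sketched estimates.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

# Redraw only the sketch; X and y never change.
draws = np.array([sketch_and_solve(X, y, m, rng) for _ in range(100)])
spread = draws.std(axis=0)  # empirical spread due to the sketch alone

print("full-data solution:   ", np.round(beta_full, 3))
print("mean sketched solution:", np.round(draws.mean(axis=0), 3))
print("std over sketches:     ", np.round(spread, 3))
```

The empirical standard deviation computed here by brute-force repetition is the kind of quantity an inference method would aim to estimate from a single sketch; repeating the sketch many times, as above, would defeat the computational purpose of sketching.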


