Algorithmic Gaussianization through Sketching: Converting Data into Sub-gaussian Random Designs

Algorithmic Gaussianization is a phenomenon that can arise when using randomized sketching or sampling methods to produce smaller representations of large datasets: For certain tasks, these sketched representations have been observed to exhibit many robust performance characteristics that are known to occur when a data sample comes from a sub-gaussian random design, which is a powerful statistical model of data distributions. However, this phenomenon has only been studied for specific tasks and metrics, or by relying on computationally expensive methods. We address this by providing an algorithmic framework for gaussianizing data distributions via averaging, proving that it is possible to efficiently construct data sketches that are nearly indistinguishable (in terms of total variation distance) from sub-gaussian random designs. In particular, relying on a recently introduced sketching technique called Leverage Score Sparsified (LESS) embeddings, we show that one can construct an $n\times d$ sketch of an $N\times d$ matrix $A$, where $n\ll N$, that is nearly indistinguishable from a sub-gaussian design, in time $O(\text{nnz}(A)\log N + nd^2)$, where $\text{nnz}(A)$ is the number of non-zero entries in $A$. As a consequence, strong statistical guarantees and precise asymptotics available for the estimators produced from sub-gaussian designs (e.g., for least squares and Lasso regression, covariance estimation, low-rank approximation, etc.) can be straightforwardly adapted to our sketching framework. We illustrate this with a new approximation guarantee for sketched least squares, among other examples.
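
To make the averaging idea concrete, the following is a minimal NumPy sketch of a LESS-style embedding, written under several assumptions that the abstract does not spell out. Each row of the sketch is a randomly signed, rescaled sum of $s$ rows of $A$ sampled in proportion to their leverage scores. The function names `leverage_scores` and `less_sketch`, the default $s = d$, and the use of Rademacher signs are illustrative choices. Exact leverage scores are computed here via a thin QR factorization, which costs $O(Nd^2)$ and therefore does not achieve the $O(\text{nnz}(A)\log N)$ preprocessing time stated above; a fast approximation would be needed for that.

```python
import numpy as np


def leverage_scores(A):
    """Exact leverage scores of the rows of A (assumes N >= d).

    Illustration only: the thin QR costs O(N d^2), unlike the fast
    approximate schemes needed for the nnz(A) log N running time.
    """
    Q, _ = np.linalg.qr(A)          # Q is N x d with orthonormal columns
    return np.sum(Q ** 2, axis=1)   # l_i = squared norm of the i-th row of Q


def less_sketch(A, n, s=None, rng=None):
    """Return an n x d LESS-style sketch of the N x d matrix A.

    Hypothetical helper: `s` is the number of sampled rows averaged into
    each sketch row (assumed default: s = d).
    """
    rng = np.random.default_rng(rng)
    N, d = A.shape
    s = d if s is None else s

    lev = leverage_scores(A)
    p = lev / lev.sum()             # leverage score sampling distribution

    # For each of the n sketch rows, draw s row indices (with replacement)
    # and independent random signs.
    idx = rng.choice(N, size=(n, s), p=p)
    signs = rng.choice([-1.0, 1.0], size=(n, s))

    # Importance weights chosen so that each sketching vector x satisfies
    # E[x x^T] = I, hence E[(SA)^T (SA)] = A^T A after the 1/sqrt(n) scaling.
    weights = signs / np.sqrt(s * p[idx])

    # Row k of the sketch is sum_j weights[k, j] * A[idx[k, j], :].
    SA = np.einsum("ks,ksd->kd", weights, A[idx])
    return SA / np.sqrt(n)
```

Calling `less_sketch(A, n)` returns an $n\times d$ matrix whose Gram matrix is an unbiased estimate of $A^\top A$. With $s = d$, forming the $n$ rows takes on the order of $nd^2$ arithmetic operations, which is consistent with the second term of the running time quoted above.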
