Matrix sketching is a powerful tool for reducing the size of large data matrices. Yet there are fundamental limitations to this size reduction when we want to recover an accurate estimator for a task such as least squares regression. We show that these limitations can be circumvented in the distributed setting by designing sketching methods that minimize the bias of the estimator, rather than its error. In particular, we give a sparse sketching method running in optimal space and current matrix multiplication time, which recovers a nearly-unbiased least squares estimator using two passes over the data. This leads to new communication-efficient distributed averaging algorithms for least squares and related tasks, which directly improve on several prior approaches. Our key novelty is a new bias analysis for sketched least squares, giving a sharp characterization of its dependence on the sketch sparsity. The techniques include new higher-moment restricted Bai-Silverstein inequalities, which are of independent interest to the non-asymptotic analysis of deterministic equivalents for random matrices that arise from sketching.
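To make the distributed averaging idea concrete, the following is a minimal sketch under stated assumptions, not the construction analyzed in this work. It uses a generic sparse random-sign sketch (each column of the sketching matrix has a fixed number of nonzero entries), solves the sketched least squares problem independently on each of `q` simulated machines, and averages the resulting estimates. The function names, the sketch construction, and the parameters `m`, `nnz_per_col`, and `q` are all hypothetical choices made for illustration.

```python
import numpy as np


def sparse_sketch(m, n, nnz_per_col, rng):
    """Illustrative sparse sign sketch S of shape (m, n).

    Assumption: each column gets `nnz_per_col` entries equal to
    +-1/sqrt(nnz_per_col) in distinct random rows, so every column
    has unit norm. This is a generic construction, not the paper's.
    """
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=nnz_per_col, replace=False)
        signs = rng.choice([-1.0, 1.0], size=nnz_per_col)
        S[rows, j] = signs / np.sqrt(nnz_per_col)
    return S


def sketched_lstsq(A, b, m, nnz_per_col, rng):
    """Solve the sketched problem min_x ||S A x - S b||_2."""
    S = sparse_sketch(m, A.shape[0], nnz_per_col, rng)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x


def averaged_sketched_lstsq(A, b, m, nnz_per_col, q, seed=0):
    """Average q independent sketched estimates (one per simulated machine)."""
    rng = np.random.default_rng(seed)
    estimates = [sketched_lstsq(A, b, m, nnz_per_col, rng) for _ in range(q)]
    return np.mean(estimates, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((1000, 10))
    b = A @ rng.standard_normal(10) + 0.5 * rng.standard_normal(1000)
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
    x_avg = averaged_sketched_lstsq(A, b, m=200, nnz_per_col=4, q=10)
    print("distance to exact solution:", np.linalg.norm(x_avg - x_star))
```

Averaging reduces the variance of the combined estimate but leaves its bias unchanged, which is why the bias of each individual sketched estimator, rather than its error, determines how accurate the average can become as `q` grows.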