Randomized Dimension Reduction with Statistical Guarantees

Large models and massive data sets are essential driving forces behind the unprecedented successes of modern algorithms, especially in scientific computing and machine learning. However, growing dimensionality and model complexity, together with the non-negligible workload of data pre-processing, make these successes costly in both computation and data aggregation. As the slowdown of Moore's Law diminishes the hardware-level reduction in computational cost, fast heuristics for expensive classical routines and efficient algorithms for exploiting limited data are increasingly indispensable for pushing the limits of what algorithms can achieve. This thesis explores several such algorithms for fast execution and efficient data utilization. From the perspective of computational efficiency, we design and analyze fast randomized low-rank decomposition algorithms for large matrices based on "matrix sketching", which can be regarded as a dimension reduction strategy in the data space. These include the randomized pivoting-based interpolative and CUR decompositions discussed in Chapter 2 and the randomized subspace approximations discussed in Chapter 3. From the perspective of sample efficiency, we focus on learning algorithms that incorporate data augmentation in various ways and provably improve generalization and distributional robustness. Specifically, Chapter 4 presents a sample complexity analysis of data augmentation consistency regularization, in which we view sample efficiency through the lens of dimension reduction in the function space. Chapter 5 then introduces an adaptively weighted data augmentation consistency regularization algorithm for distributionally robust optimization, with applications in medical image segmentation.
