This paper presents a selective review of statistical computing methods for massive data analysis. A large body of statistical methods for massive data computation has been developed rapidly over the past decades. In this work, we focus on three categories of statistical computing methods: (1) distributed computing, (2) subsampling methods, and (3) minibatch gradient techniques. The first class of literature concerns distributed computing and focuses on the situation where the dataset is too large to be handled comfortably by a single computer; in this case, a distributed computing system with multiple machines has to be utilized. The second class of literature concerns subsampling methods and addresses the situation where the dataset is small enough to be stored on a single computer but too large to be processed in its memory as a whole. The last class of literature studies minibatch-gradient-related optimization techniques, which have been used extensively for optimizing various deep learning models.