In the era of Big Data, analyzing large, high-dimensional datasets presents significant computational challenges. Although Bayesian statistics is well suited to such complex data structures, Markov chain Monte Carlo (MCMC) methods, which are essential for Bayesian estimation, suffer from high computational cost because of their sequential nature. To make computation faster and more effective, this paper introduces an algorithm that enhances a parallelized MCMC method. We highlight the critical role of the region where the posterior distributions obtained after data partitioning overlap, and propose a method that uses a machine learning classifier to identify and extract the MCMC draws in that region in order to approximate the actual posterior distribution. Our main contribution is a Kullback-Leibler (KL) divergence-based criterion that simplifies hyperparameter tuning when training the classifier and makes the method nearly hyperparameter-free. Simulation studies validate the efficacy of the proposed method.
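To make the idea of extracting draws from the overlapped area concrete, the following is a minimal sketch and not the paper's algorithm. Everything in it is an illustrative assumption: the toy model (normal data with known unit variance and a flat prior), exact sampling as a stand-in for an MCMC run on each data partition, a hand-written logistic regression as the classifier, and the hand-picked threshold `band`. The KL divergence-based criterion described above is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumption): y_i ~ Normal(theta, 1) with a flat prior on theta,
# so the subposterior for a partition is Normal(mean(part), 1 / len(part)).
y = rng.normal(loc=2.0, scale=1.0, size=2000)
parts = np.array_split(y, 2)  # data partitioning into two equal subsets

n_draws = 5000

def subposterior_draws(part):
    # Stand-in for an MCMC run on one partition: exact subposterior draws.
    return rng.normal(part.mean(), 1.0 / np.sqrt(len(part)), size=n_draws)

draws = np.concatenate([subposterior_draws(p) for p in parts])
labels = np.repeat([0.0, 1.0], n_draws)  # which partition produced each draw

def fit_logistic(x, t, steps=2000, lr=0.5):
    # Plain gradient descent on the logistic loss, using a standardized feature.
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * z + b)))
        w -= lr * np.mean((p - t) * z)
        b -= lr * np.mean(p - t)
    return lambda q: 1.0 / (1.0 + np.exp(-(w * (q - mu) / sd + b)))

predict = fit_logistic(draws, labels)

# Draws the classifier cannot confidently assign to either partition are
# treated as lying in the overlapped area (hypothetical selection rule).
band = 0.1
overlap = draws[np.abs(predict(draws) - 0.5) < band]

print("full-data posterior mean:", y.mean())
print("mean of overlap draws:   ", overlap.mean())
print("number of overlap draws: ", overlap.size)
```

In this toy setting the retained draws concentrate around the full-data posterior mean, which illustrates why the overlapped area matters. The spread of the retained draws depends on `band`, so the sketch should not be read as recovering the posterior variance.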