This paper presents a control-variate-based Markov chain Monte Carlo algorithm for efficient sampling from the probability simplex, with a focus on applications in large-scale Bayesian models such as latent Dirichlet allocation. Standard Markov chain Monte Carlo methods, particularly those based on Langevin diffusions, suffer from significant discretization errors near the boundaries of the simplex, errors that are exacerbated in sparse data settings. To address this issue, we propose an improved approach based on the stochastic Cox--Ingersoll--Ross process, whose exact transition densities eliminate discretization error entirely. Our key contribution is the integration of control variates, which significantly reduces the variance of the stochastic gradient estimator in the Cox--Ingersoll--Ross process, thereby enhancing the accuracy and computational efficiency of the algorithm. We provide a theoretical analysis quantifying the variance reduction achieved by the control variates and demonstrate the practical advantages of our method in data subsampling settings. Empirical results on large datasets show that the proposed method outperforms existing approaches in both accuracy and scalability.
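The exact transition of a Cox--Ingersoll--Ross process is a scaled noncentral chi-squared distribution, which is what makes discretization-free simulation possible. A minimal sketch of exact one-step CIR sampling follows; the parameterization `a, b, sigma` and the specific values are illustrative assumptions, not the paper's notation or settings.

```python
import numpy as np

# Exact one-step simulation of a CIR process dX = b*(a - X) dt + sigma*sqrt(X) dW.
# The transition X_{t+dt} | X_t is a scaled noncentral chi-squared distribution,
# so no Euler discretization (and hence no discretization error) is involved.
rng = np.random.default_rng(1)
a, b, sigma, dt = 2.0, 1.0, 0.5, 0.5   # illustrative parameters, not the paper's

def cir_exact_step(x_t, size=1):
    ebt = np.exp(-b * dt)
    c = sigma**2 * (1.0 - ebt) / (4.0 * b)   # scale factor
    df = 4.0 * a * b / sigma**2              # degrees of freedom
    nonc = x_t * ebt / c                     # noncentrality parameter
    return c * rng.noncentral_chisquare(df, nonc, size=size)

samples = cir_exact_step(1.0, size=100_000)
# The exact conditional mean is x_t * exp(-b*dt) + a * (1 - exp(-b*dt)).
exact_mean = 1.0 * np.exp(-b * dt) + a * (1.0 - np.exp(-b * dt))
print(f"empirical mean {samples.mean():.3f} vs exact {exact_mean:.3f}")
```

Because each draw comes from the exact transition law, the chain stays strictly positive, which is the property that avoids the boundary problems Langevin discretizations face on the simplex.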
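The control-variate idea for subsampled gradients can be sketched as follows. This is a toy logistic-likelihood illustration with invented data, not the paper's LDA model: the estimator anchors at a fixed point `theta_hat`, precomputes the full-data gradient there once, and corrects it with a rescaled minibatch difference, which is unbiased and typically has much lower variance when `theta` is close to `theta_hat`.

```python
import numpy as np

# Toy illustration of variance reduction via control variates in a subsampled
# gradient estimator (invented Bernoulli-likelihood example, not the paper's setup).
rng = np.random.default_rng(0)
N, n = 10_000, 100                      # data size and minibatch size
x = rng.normal(1.0, 2.0, size=N)
y = rng.integers(0, 2, size=N).astype(float)
theta, theta_hat = 0.30, 0.25           # current state and a fixed anchor point

def grad_terms(t):
    # Per-observation gradient of a Bernoulli log-likelihood with logit t * x_i.
    p = 1.0 / (1.0 + np.exp(-t * x))
    return x * (y - p)

full_at_hat = grad_terms(theta_hat).sum()   # full-data gradient, computed once

def naive_estimate(idx):
    # Plain subsampling: rescale a minibatch sum of gradient terms.
    return N / n * grad_terms(theta)[idx].sum()

def cv_estimate(idx):
    # Control variate: anchor gradient plus a rescaled minibatch correction.
    diff = grad_terms(theta)[idx] - grad_terms(theta_hat)[idx]
    return full_at_hat + N / n * diff.sum()

batches = [rng.choice(N, n, replace=False) for _ in range(2_000)]
naive = np.array([naive_estimate(idx) for idx in batches])
cv = np.array([cv_estimate(idx) for idx in batches])
print(f"naive std {naive.std():.1f} vs control-variate std {cv.std():.1f}")
```

Both estimators are unbiased for the full-data gradient; the variance of the control-variate version is driven by the spread of the per-observation gradient *differences*, which shrink as `theta` approaches the anchor.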