Dirichlet Process Mixture Models (DPMMs) are widely used for clustering. Their main advantage is that, as Bayesian non-parametric models, they estimate the number of clusters automatically during inference. However, inference becomes considerably slower as the dataset grows. This paper proposes DisCGS, a new distributed Markov Chain Monte Carlo (MCMC) inference method for DPMMs based on sufficient statistics. Our approach uses the collapsed Gibbs sampler and is specifically designed for data distributed across independent and heterogeneous machines, which enables its use in horizontal federated learning. Our method achieves highly promising results and notable scalability. For instance, on a dataset of 100K data points, the centralized algorithm requires approximately 12 hours to complete 100 iterations, while our approach completes the same number of iterations in just 3 minutes, reducing the execution time by a factor of roughly 200 without compromising clustering performance. The source code is publicly available at https://github.com/redakhoufache/DisCGS.
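The abstract does not state which sufficient statistics are exchanged between machines. As a minimal sketch, assuming Gaussian components, each machine could summarize its local points by a count, a sum, and a sum of outer products; these quantities are additive, so they can be merged without sharing raw data. The function names below (`sufficient_stats`, `merge`, `mean_and_scatter`) are hypothetical and are not taken from the DisCGS code base.

```python
import numpy as np

def sufficient_stats(x):
    """Return (count, sum, sum of outer products) for an (n, d) array."""
    x = np.asarray(x, dtype=float)
    return x.shape[0], x.sum(axis=0), x.T @ x

def merge(a, b):
    """Combine two sets of statistics; every component is additive."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def mean_and_scatter(stats):
    """Recover the mean and the scatter matrix from the statistics alone."""
    n, s, ss = stats
    mean = s / n
    scatter = ss - n * np.outer(mean, mean)
    return mean, scatter

# Illustrative data: two workers holding disjoint parts of one cluster.
rng = np.random.default_rng(0)
worker_a = rng.normal(size=(50, 2))
worker_b = rng.normal(size=(70, 2))

# Statistics merged across workers, without pooling the raw points.
merged = merge(sufficient_stats(worker_a), sufficient_stats(worker_b))

# Reference: statistics computed on the pooled data.
pooled = sufficient_stats(np.vstack([worker_a, worker_b]))
```

In this sketch, `merged` and `pooled` agree up to floating-point error, which is the property that would let a coordinating node work with per-machine summaries only.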