In mixture modeling and clustering applications, the number of components or clusters is often unknown. A stick-breaking mixture model, such as the Dirichlet process mixture model, is an appealing construction that assumes infinitely many components while shrinking the weights of most unused components to near zero. However, it is well known that this shrinkage is inadequate: even when the component distribution is correctly specified, spurious weights appear and yield an inconsistent estimate of the number of clusters. In this article, we propose a simple solution: when breaking each mixture weight stick into two pieces, the length of the second piece is multiplied by a quasi-Bernoulli random variable, which takes the value one or a small constant close to zero. This effectively creates a soft truncation and further shrinks the unused weights. Asymptotically, we show that as long as this small constant diminishes to zero at rate $o(1/n^2)$, with $n$ the sample size, the posterior distribution of the number of clusters converges to the true number of clusters. In comparison, we rigorously study Dirichlet process mixture models with a concentration parameter that is either constant or rapidly diminishing to zero, and show that both lead to inconsistency in the number of clusters. Our proposed model is easy to implement, requiring only a small modification of a standard Gibbs sampler for mixture models. In simulations and in a data application to clustering brain networks, the proposed method recovers the ground-truth number of clusters and yields a small number of clusters.
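To make the soft-truncation idea concrete, the following is a minimal sketch of how mixture weights might be drawn from a truncated version of the construction described above. It is an illustration under stated assumptions, not the authors' implementation: the function name, the `Beta(alpha, 1)` law for the remaining stick fraction (borrowed from ordinary Dirichlet process stick-breaking), the mixing probability `p`, the finite truncation level `K`, and the default values are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def quasi_bernoulli_stick_breaking(K, alpha=1.0, p=0.9, eps=1e-6, rng=None):
    """Draw K mixture weights from a truncated quasi-Bernoulli stick-breaking sketch.

    Assumptions (not taken from the abstract): Beta(alpha, 1) remainder
    fractions, mixing probability p, and truncation at K components.
    """
    rng = np.random.default_rng(rng)
    # Fraction of the current stick left over as the "second piece".
    # With the usual v ~ Beta(1, alpha), this is 1 - v ~ Beta(alpha, 1).
    beta = rng.beta(alpha, 1.0, size=K)
    # Quasi-Bernoulli multiplier: 1 with probability p, otherwise eps.
    b = np.where(rng.random(K) < p, 1.0, eps)
    # Second piece after multiplication; eps makes it almost vanish.
    remainder = b * beta
    # Truncation: the last component absorbs whatever stick is left.
    remainder[-1] = 0.0
    # Stick length available before each break.
    available = np.concatenate(([1.0], np.cumprod(remainder[:-1])))
    weights = (1.0 - remainder) * available
    return weights, b

# Example draw: once some b_k equals eps, all later weights are at most eps in total.
w, b = quasi_bernoulli_stick_breaking(K=20, eps=1e-8, rng=0)
print(np.round(w, 4), w.sum())
```

In this sketch the first piece of each break keeps whatever the shrunken second piece gives up, so the weights still sum to one. Setting `p = 1` recovers ordinary truncated stick-breaking, which is one way to compare the two behaviors side by side.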