Clustering in high dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack a probabilistic interpretation and rely on heuristics to estimate the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale, and devising algorithms that can effectively explore the posterior space is an open problem. Building on recent developments in Bayesian distance-based clustering, we propose a hybrid solution that defines a likelihood on pairwise distances between observations. The novelty of the approach lies in including both cohesion and repulsion terms in the likelihood, which allows for cluster identifiability. This implies that clusters are composed of objects which have small "dissimilarities" among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion). We show that this modelling strategy has interesting connections with existing proposals in the literature as well as a decision-theoretic interpretation. The proposed method is computationally efficient and applicable to a wide variety of scenarios. We demonstrate the approach in a simulation study and an application in digital numismatics.
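To make the idea of a likelihood on pairwise distances with cohesion and repulsion terms concrete, the following is a minimal sketch and not the method proposed here. It assumes, purely for illustration, that within-cluster distances follow one Gamma density (the cohesion term) and between-cluster distances follow another (the repulsion term). The function names `gamma_logpdf`, `pairwise_distances` and `partition_log_likelihood`, the Gamma family itself, and all parameter values are hypothetical choices that the abstract does not specify.

```python
import math
import numpy as np


def gamma_logpdf(x, shape, rate):
    """Log density of a Gamma(shape, rate) distribution, evaluated elementwise.

    Assumes x > 0 (distances between distinct observations).
    """
    x = np.asarray(x, dtype=float)
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * np.log(x) - rate * x)


def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def partition_log_likelihood(D, labels, cohesion=(1.0, 5.0), repulsion=(10.0, 2.0)):
    """Log-likelihood of a partition given the distance matrix D.

    Pairs in the same cluster contribute a cohesion term and pairs in
    different clusters contribute a repulsion term. The (shape, rate)
    defaults are illustrative assumptions, not values from the paper.
    """
    labels = np.asarray(labels)
    i, j = np.triu_indices(D.shape[0], k=1)  # each unordered pair once
    d = D[i, j]
    same = labels[i] == labels[j]
    cohesion_term = gamma_logpdf(d[same], *cohesion).sum()
    repulsion_term = gamma_logpdf(d[~same], *repulsion).sum()
    return cohesion_term + repulsion_term


# Two well-separated groups of three points each (toy data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
D = pairwise_distances(X)

ll_true = partition_log_likelihood(D, [0, 0, 0, 1, 1, 1])
ll_mixed = partition_log_likelihood(D, [0, 1, 0, 1, 0, 1])
print(ll_true > ll_mixed)
```

In this toy setting the partition that matches the two groups scores higher than the interleaved one, because small distances are rewarded by the cohesion density and large distances by the repulsion density. A full Bayesian treatment would additionally place a prior on the partition and on the density parameters, which this sketch omits.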