Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, leading to suboptimal clusterings. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model-selection procedures such as AIC, BIC, and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can also be trained on more complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets under high missingness. These results show that Cluster-PFN provides scalable and flexible Bayesian clustering.
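The abstract describes training entirely on synthetic datasets drawn from a finite GMM prior. As a minimal sketch of what such a data-generating process might look like, the snippet below samples one dataset by first drawing the number of clusters, then mixture weights, component parameters, and finally the points themselves; the specific hyperparameters (uniform prior on the cluster count, Dirichlet weights, Gaussian means, gamma-distributed scales) are illustrative assumptions, not the paper's exact prior.

```python
import numpy as np

def sample_gmm_dataset(rng, n_points=200, d=2, max_k=5):
    """Draw one synthetic dataset from a hypothetical finite-GMM prior.

    All distributional choices here are illustrative assumptions,
    not the exact prior used to train Cluster-PFN.
    """
    k = rng.integers(1, max_k + 1)                 # number of clusters K
    weights = rng.dirichlet(np.ones(k))            # mixture weights
    means = rng.normal(0.0, 3.0, size=(k, d))      # component means
    scales = rng.gamma(2.0, 0.5, size=k)           # per-component std devs
    z = rng.choice(k, size=n_points, p=weights)    # latent cluster assignments
    x = means[z] + rng.normal(size=(n_points, d)) * scales[z, None]
    return x, z, k

rng = np.random.default_rng(0)
x, z, k = sample_gmm_dataset(rng)
```

At training time, a PFN-style model would see many such `(x, z, k)` triples and learn to map a dataset `x` directly to an approximate posterior over `k` and `z`.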