Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Moreover, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, degrading downstream results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. It estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC, and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Because Cluster-PFN can be trained on more complex priors that include missing data, it outperforms imputation-based baselines on real-world genomic datasets at high missingness rates. These results show that Cluster-PFN provides scalable and flexible Bayesian clustering.
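The synthetic-training setup described above can be sketched as follows: each training example is one whole dataset drawn from a finite GMM prior. The specific hyperpriors used here (Dirichlet mixture weights, Gaussian means, gamma precisions) are illustrative assumptions, not the paper's exact prior specification.

```python
import numpy as np

def sample_gmm_dataset(n_points=200, dim=2, max_clusters=5, rng=None):
    """Draw one synthetic dataset from a finite-GMM prior.

    The hyperpriors below (Dirichlet weights, Gaussian means,
    gamma precisions) are illustrative assumptions only.
    """
    rng = np.random.default_rng(rng)
    k = rng.integers(1, max_clusters + 1)            # number of clusters
    weights = rng.dirichlet(np.ones(k))              # mixture weights
    means = rng.normal(0.0, 3.0, size=(k, dim))      # cluster means
    stds = 1.0 / np.sqrt(rng.gamma(2.0, 1.0, size=(k, dim)))  # per-dim std
    z = rng.choice(k, size=n_points, p=weights)      # latent assignments
    x = rng.normal(means[z], stds[z])                # observed points
    return x, z, k

x, z, k = sample_gmm_dataset(rng=0)
print(x.shape, z.shape, k)
```

During training, the model would see `x` as input and be supervised with `z` and `k`, which are known exactly because the data are synthetic; missing data can be simulated by masking entries of `x` before they reach the model.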