Nonparametric Bayesian approaches provide a flexible framework for clustering without pre-specifying the number of groups, yet they are well known to overestimate the number of clusters, especially for functional data. We show that a fundamental cause of this phenomenon is misspecification of the error structure: Bayesian functional models conventionally assume that errors are independent across observation points. Using high-dimensional clustering theory, we demonstrate that ignoring the underlying correlation leads to excess clusters regardless of how flexible the prior distribution is. Guided by this theory, we propose modeling the underlying correlation structure with Gaussian processes and present a scalable approximation with principled hyperparameter selection. Numerical experiments illustrate that even simple clustering based on Dirichlet processes performs well once error dependence is properly modeled.
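As a rough illustration of why an independence assumption can be a poor fit for functional data, the sketch below simulates error curves only and is not the paper's model, theory, or approximation. It assumes an Ornstein-Uhlenbeck process (a Gaussian process with an exponential kernel) for the correlated errors, a regular grid of 100 points on [0, 1], and a length scale of 0.2; the function names `ou_errors` and `scaled_sq_norms` are hypothetical. Under independent errors the scaled squared norm of an error curve follows a chi-square distribution with m degrees of freedom, so its variance is 2m. With correlated errors the same quantity spreads out far more, and one plausible intuition is that a model assuming independence may absorb this extra spread by opening additional clusters.

```python
import math
import random
import statistics


def ou_errors(m, length_scale, sigma, rng):
    """Sample one error curve on a regular grid of m points in [0, 1].

    Assumption for illustration: errors follow an Ornstein-Uhlenbeck process,
    i.e. a Gaussian process with kernel sigma^2 * exp(-|s - t| / length_scale).
    On a regular grid this is exactly a stationary AR(1) recursion.
    length_scale = 0 is treated as independent errors.
    """
    dt = 1.0 / (m - 1)
    rho = math.exp(-dt / length_scale) if length_scale > 0 else 0.0
    innovation_sd = sigma * math.sqrt(1.0 - rho ** 2)
    eps = [rng.gauss(0.0, sigma)]  # stationary start
    for _ in range(m - 1):
        eps.append(rho * eps[-1] + rng.gauss(0.0, innovation_sd))
    return eps


def scaled_sq_norms(n_curves, m, length_scale, sigma=1.0, seed=0):
    """Return ||eps||^2 / sigma^2 for n_curves simulated error curves."""
    rng = random.Random(seed)
    return [
        sum(e * e for e in ou_errors(m, length_scale, sigma, rng)) / sigma ** 2
        for _ in range(n_curves)
    ]


# Hypothetical settings: 500 curves, 100 grid points each.
indep = scaled_sq_norms(500, 100, length_scale=0.0)
corr = scaled_sq_norms(500, 100, length_scale=0.2)

var_indep = statistics.variance(indep)  # close to 2 * m under independence
var_corr = statistics.variance(corr)    # inflated by the correlation

print(f"mean (independent): {statistics.mean(indep):.1f}")
print(f"mean (correlated):  {statistics.mean(corr):.1f}")
print(f"variance ratio correlated / independent: {var_corr / var_indep:.1f}")
```

Both settings have the same marginal error variance, so the average squared norm is about the same in each case; only the spread differs. Changing `length_scale` shows how the inflation grows as the errors become smoother.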