Model-based clustering of moderate- or high-dimensional data is notoriously difficult. We propose a model for simultaneous dimensionality reduction and clustering that assumes a mixture model for a set of latent scores, which are then linked to the observations via a Gaussian latent factor model. This approach was recently investigated by Chandra et al. (2023), who use a factor-analytic representation and assume a mixture model for the latent factors. However, its performance can deteriorate in the presence of model misspecification. We show that assuming a repulsive point process prior for the component-specific means of the mixture for the latent scores yields a more robust model that outperforms the standard mixture model for the latent factors in several simulated scenarios. To favor well-separated clusters of data, the repulsive point process must be anisotropic, and its density should be tractable for efficient posterior inference. We address these issues by proposing a general construction for anisotropic determinantal point processes. We illustrate our model in simulations as well as on a plant species co-occurrence dataset.