A Novel Theoretical Analysis for Clustering Heteroscedastic Gaussian Data without Knowledge of the Number of Clusters

This paper addresses the problem of clustering measurement vectors that are heteroscedastic in that they can have different covariance matrices. From the assumption that the measurement vectors within a given cluster are Gaussian distributed with possibly different and unknown covariance matrices around the cluster centroid, we introduce a novel cost function to estimate the centroids. The zeros of the gradient of this cost function turn out to be the fixed points of a certain function. As such, the approach generalizes the methodology employed to derive the existing Mean-Shift algorithm. As its main theoretical result, which is novel compared to Mean-Shift, this paper shows that the only fixed points of the identified function tend to be the cluster centroids if both the number of measurements per cluster and the distances between centroids are large enough. As a second contribution, this paper introduces the Wald kernel for clustering. This kernel is defined as the p-value of the Wald hypothesis test for testing the mean of a Gaussian. As such, the Wald kernel measures the plausibility that a measurement vector belongs to a given cluster, and it scales better with the dimension of the measurement vectors than the usual Gaussian kernel. Finally, the proposed theoretical framework allows us to derive a new clustering algorithm called CENTRE-X that works by estimating the fixed points of the identified function. Like Mean-Shift, CENTRE-X requires no prior knowledge of the number of clusters. It relies on a Wald hypothesis test to significantly reduce the number of fixed points that must be computed compared to the Mean-Shift algorithm, thus resulting in a clear reduction in complexity. Simulation results on synthetic and real data sets show that CENTRE-X has comparable or better performance than the standard clustering algorithms K-means and Mean-Shift, even when the covariance matrices are not perfectly known.
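To make the idea of a p-value kernel and a fixed-point update concrete, here is a minimal Python sketch. It is not the CENTRE-X algorithm and not the function identified in the paper. It makes several assumptions that the abstract does not state: covariance matrices are diagonal and known, the p-value is that of the textbook chi-square test of a Gaussian mean, and the update is a Mean-Shift-style precision-weighted mean. The names `chi2_sf`, `wald_kernel`, and `fixed_point_step` are hypothetical.

```python
import math


def chi2_sf(x, dof):
    """P(X > x) for a chi-square variable X with `dof` degrees of freedom."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-half)             # Q(1, half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))  # Q(1/2, half)
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < dof / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def wald_kernel(y, phi, variances):
    """Assumed kernel: p-value of testing 'mean of y equals phi'.

    `variances` holds the diagonal of the (assumed diagonal) covariance
    of the measurement y, so each measurement may carry its own values.
    """
    d2 = sum((yi - pi) ** 2 / vi for yi, pi, vi in zip(y, phi, variances))
    return chi2_sf(d2, len(y))


def fixed_point_step(phi, data):
    """One Mean-Shift-style update: kernel- and precision-weighted mean.

    `data` is a list of (measurement, variances) pairs. This update rule is
    an illustrative assumption, not the function derived in the paper.
    """
    num = [0.0] * len(phi)
    den = [0.0] * len(phi)
    for y, var in data:
        w = wald_kernel(y, phi, var)
        for i in range(len(phi)):
            num[i] += w * y[i] / var[i]
            den[i] += w / var[i]
    return [n / d if d > 0.0 else p for n, d, p in zip(num, den, phi)]
```

Iterating `fixed_point_step` from a starting measurement until the estimate stops moving gives a candidate centroid. Measurements far from the current estimate receive a p-value close to zero and therefore barely influence the update.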
