This paper addresses the problem of clustering measurement vectors that are heteroscedastic, in the sense that they can have different covariance matrices. Assuming that the measurement vectors within a given cluster are Gaussian distributed around the cluster centroid, with possibly different and unknown covariance matrices, we introduce a novel cost function to estimate the centroids. The zeros of the gradient of this cost function turn out to be the fixed points of a certain function. As such, the approach generalizes the methodology used to derive the existing Mean-Shift algorithm. As its main novel theoretical result compared to Mean-Shift, this paper shows that the only fixed points of the identified function tend to be the cluster centroids if both the number of measurements per cluster and the distances between centroids are large enough. As a second contribution, this paper introduces the Wald kernel for clustering. This kernel is defined as the p-value of the Wald hypothesis test for testing the mean of a Gaussian. As such, the Wald kernel measures the plausibility that a measurement vector belongs to a given cluster, and it scales better with the dimension of the measurement vectors than the usual Gaussian kernel. Finally, the proposed theoretical framework allows us to derive a new clustering algorithm called CENTRE-X, which works by estimating the fixed points of the identified function. Like Mean-Shift, CENTRE-X requires no prior knowledge of the number of clusters. It relies on a Wald hypothesis test to significantly reduce the number of fixed points that must be calculated compared to the Mean-Shift algorithm, resulting in a clear reduction in computational complexity. Simulation results on synthetic and real data sets show that CENTRE-X performs comparably to or better than standard clustering algorithms such as K-means and Mean-Shift, even when the covariance matrices are not perfectly known.
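To make the Wald kernel more concrete, the following is a minimal sketch and not the paper's implementation. It assumes the p-value is computed as the chi-square survival function of the squared Mahalanobis distance between a measurement and a candidate centroid, which is the standard form of the p-value when testing a Gaussian mean with known covariance. It further assumes, purely for brevity, that each covariance matrix is diagonal. The names `chi2_sf`, `wald_kernel`, and `gaussian_kernel` are hypothetical.

```python
import math

def chi2_sf(t, d):
    """P(X > t) for a chi-square variable X with d degrees of freedom.

    Uses the recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    on the regularized upper incomplete gamma function, with x = t / 2.
    """
    if t <= 0:
        return 1.0
    x = t / 2.0
    if d % 2 == 0:
        a, q = 1.0, math.exp(-x)              # Q(1, x) = exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))   # Q(1/2, x) = erfc(sqrt(x))
    while a < d / 2.0:
        # Computed in the log domain to stay stable for large d.
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q

def sq_mahalanobis(y, c, var):
    """Squared Mahalanobis distance, assuming a diagonal covariance (variances in var)."""
    return sum((yi - ci) ** 2 / vi for yi, ci, vi in zip(y, c, var))

def wald_kernel(y, c, var):
    """Assumed form of the Wald kernel: p-value of testing 'mean of y equals c'."""
    return chi2_sf(sq_mahalanobis(y, c, var), len(y))

def gaussian_kernel(y, c, var):
    """Usual Gaussian kernel evaluated on the same normalized distance."""
    return math.exp(-0.5 * sq_mahalanobis(y, c, var))

# A measurement at a "typical" distance from its centroid: one standard
# deviation off in every coordinate, so the squared distance equals d.
for d in (2, 10, 50):
    y, c, var = [1.0] * d, [0.0] * d, [1.0] * d
    print(d, wald_kernel(y, c, var), gaussian_kernel(y, c, var))
```

Under these assumptions, the loop illustrates the dimension-scaling claim: for a measurement at a typical distance from its centroid, the Gaussian kernel value shrinks like exp(-d/2) as the dimension d grows, whereas the p-value stays at a moderate level and therefore remains usable as a plausibility score.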