We develop a novel clustering method for distributional data, where each data point is a probability distribution on the real line. Clustering such data by their modes of variation has been challenging because the space of probability distributions lacks a vector space structure, which prevents the direct application of existing methods for functional data. The proposed method accounts for differences in both the mean and the mode-of-variation structures of clusters, in the spirit of the $k$-centres clustering approach proposed for functional data. Specifically, we equip the space of distributions with the Wasserstein metric and define a geodesic mode of variation of distributional data using geodesic principal component analysis. We then use the geodesic mode of each cluster to predict the cluster membership of each distribution. We establish the validity of the proposed clustering criterion theoretically by studying the probability of correct membership. Through a simulation study and a real data application, we demonstrate that the proposed method can improve cluster quality compared with conventional clustering algorithms.
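As a minimal illustration of the Wasserstein geometry mentioned above (and not the proposed clustering method itself), the sketch below uses the standard fact that, on the real line, the 2-Wasserstein distance between two distributions equals the $L^2$ distance between their quantile functions, and that the Wasserstein barycenter is the pointwise average of quantile functions. The function names, the grid size, and the simulated samples are all assumptions made for this example.

```python
import numpy as np

def quantile_curve(sample, grid):
    """Empirical quantile function of a 1-D sample, evaluated on a probability grid."""
    return np.quantile(sample, grid)

def wasserstein2(q1, q2):
    """Approximate 2-Wasserstein distance between two distributions on the real line.

    q1 and q2 are quantile curves on the same uniform midpoint grid in (0, 1),
    so the mean of squared differences is a midpoint-rule approximation of the
    integral of (F1^{-1}(t) - F2^{-1}(t))^2 over t in (0, 1).
    """
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))

def barycenter(curves):
    """Wasserstein barycenter in 1-D: pointwise average of the quantile curves."""
    return np.mean(curves, axis=0)

# Hypothetical data: one sample and a copy of it shifted by 3.
rng = np.random.default_rng(0)
grid = (np.arange(200) + 0.5) / 200      # uniform midpoint grid on (0, 1)
base = rng.normal(size=500)

q_a = quantile_curve(base, grid)
q_b = quantile_curve(base + 3.0, grid)   # a pure location shift

dist_ab = wasserstein2(q_a, q_b)         # a shift by c gives distance |c|
q_bar = barycenter(np.stack([q_a, q_b]))
dist_to_bar = wasserstein2(q_a, q_bar)   # barycenter lies halfway along the geodesic
```

Because a pure location shift moves every quantile by the same amount, `dist_ab` equals the shift size and the barycenter sits at half that distance from each distribution. Representing each distribution by its quantile curve is what makes such computations possible in this setting despite the lack of a vector space structure.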