This paper presents a clustering technique that reduces susceptibility to data noise: it learns the distributions underlying the data, clusters those distributions, and then assigns each data item to the cluster of its distribution. The method introduces a new distance between distributions, the expectation distance (denoted ED), which goes beyond the state-of-the-art distribution distance from optimal mass transport (denoted $W_2$, for $2$-Wasserstein): the latter essentially depends only on the marginal distributions, whereas the former also uses information about the joint distributions. Using the ED, the paper extends classical $K$-means and $K$-medoids clustering to operate over data distributions rather than raw data, and it also introduces $K$-medoids using $W_2$. Closed-form expressions for the $W_2$ and ED distance measures are presented. Both distance measures are implemented and applied to cluster real-world weather data and stock data, which involves efficiently extracting and using the underlying data distributions: Gaussians for weather data and lognormals for stock data. The results show a striking performance improvement over classical clustering of raw data, with higher accuracy realized for ED. Distribution-based clustering not only offers higher accuracy but also lowers computation time, owing to its reduced time complexity.
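To make the pipeline described above concrete, the following is a minimal sketch and not the paper's implementation. It assumes each data item is a one-dimensional series modelled as a univariate Gaussian, for which the $2$-Wasserstein distance has the well-known closed form $W_2^2 = (\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2$. The ED is not shown because its definition is not given here. All function names (`fit_gaussian`, `w2_gaussian_1d`, `k_medoids`, `cluster_series`) are hypothetical, and the medoid initialisation is deliberately naive.

```python
import numpy as np

def fit_gaussian(samples):
    """Summarise a 1-D sample by its mean and (population) standard deviation."""
    x = np.asarray(samples, dtype=float)
    return x.mean(), x.std()

def w2_gaussian_1d(p, q):
    """Closed-form 2-Wasserstein distance between two univariate Gaussians.

    p and q are (mean, std) pairs; this form holds only in one dimension.
    """
    (m1, s1), (m2, s2) = p, q
    return float(np.hypot(m1 - m2, s1 - s2))

def k_medoids(dist, k, n_iter=100):
    """Plain alternating K-medoids on a precomputed distance matrix."""
    medoids = np.arange(k)  # assumption: naive initialisation with the first k items
    for _ in range(n_iter):
        # Assignment step: each item joins its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # Update step: pick the member with the smallest total distance to the others.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

def cluster_series(series, k):
    """Fit one Gaussian per series, then cluster the fitted distributions."""
    params = [fit_gaussian(s) for s in series]
    n = len(params)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = w2_gaussian_1d(params[i], params[j])
    return k_medoids(dist, k)
```

Calling `cluster_series(series, k)` on a list of numeric series returns the medoid indices and one cluster label per series. Because clustering runs on one fitted distribution per series instead of on every raw observation, the distance matrix has only as many rows as there are series, which illustrates where the reduced computation comes from. A practical version would likely need a better initialisation and, for multivariate data, the matrix form of the Gaussian $W_2$ distance.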