This paper investigates two fundamental descriptors of data, density distribution and mass distribution, in the context of clustering. Density distribution has been the de facto descriptor of data distribution since the introduction of statistics. We show that density distribution has a fundamental limitation, a high-density bias, that holds irrespective of the algorithm used to perform clustering. Existing density-based clustering algorithms have employed different algorithmic means to counter the effect of this bias with some success, but the fundamental limitation of using density distribution remains an obstacle to discovering clusters of arbitrary shapes, sizes and densities. Using mass distribution as a better foundation, we propose a new algorithm, called mass-maximization clustering (MMC), which maximizes the total mass of all clusters. The algorithm can easily be changed to maximize the total density of all clusters instead, so that the fundamental limitation of using density distribution, as opposed to mass distribution, can be examined. The key advantage of MMC over density-maximization clustering is that the maximization is conducted without a bias towards dense clusters.