The k-means algorithm is one of the simplest and most widely used algorithms for cluster analysis. It has been applied successfully in artificial intelligence, market segmentation, fraud detection, data mining, and psychology, to name only a few areas. The k-means algorithm, however, does not always yield the best-quality results: its performance depends heavily on the number of clusters supplied and on the initialization of the cluster centroids, or seeds. In this paper, we analyze the performance of k-means on image data by employing parametric entropies in an entropy-based centroid initialization method, and we propose the best-fitting entropy measures for general image datasets. We use several entropies, including the Taneja, Kapur, Aczel Daroczy, and Sharma Mittal entropies. We observe that, for different datasets, different entropies provide better results than the conventional methods. We have applied our proposed algorithm to the following datasets: Satellite, Toys, Fruits, Cars, Brain MRI, and Covid X-Ray.
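The dependence on centroid initialization noted above can be illustrated with a minimal sketch. This is not the entropy-based initialization studied in the paper; it is a plain Lloyd-style k-means on scalar values, and the function name `kmeans_1d`, the toy intensity values, and both seedings are hypothetical choices made only for illustration.

```python
def kmeans_1d(values, init_centroids, max_iter=100):
    """Lloyd's algorithm on scalar values, starting from the given centroids."""
    centroids = list(init_centroids)
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break  # no centroid moved, so the algorithm has converged
        centroids = updated
    # Sum of squared errors of each value to its nearest final centroid.
    sse = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, sse


# Made-up "pixel intensities" forming three obvious groups.
pixels = [0, 2, 10, 12, 100, 102]

# Seeding A places one centroid in each group.
good_centroids, good_sse = kmeans_1d(pixels, [1, 11, 101])
# Seeding B places two centroids in the brightest group.
bad_centroids, bad_sse = kmeans_1d(pixels, [100, 102, 6])

print(good_centroids, good_sse)
print(bad_centroids, bad_sse)
```

On this toy input, seeding A converges with a sum of squared errors of 6, while seeding B stays in a poorer local optimum with a sum of squared errors of 104, although the data and the number of clusters are identical. An initialization method, entropy-based or otherwise, aims to avoid starting points of the second kind.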