This work proposes a hierarchical clustering algorithm for high-dimensional datasets that uses the cyclic space of reversible finite cellular automata. In cellular automaton (CA) based clustering, two objects that belong to the same cycle are closely related and are considered part of the same cluster. However, if a high-dimensional dataset is clustered using the cycles of a single CA, closely related objects may fall into different cycles. This paper identifies the relationship between objects in two different cycles based on the median of all elements in each cycle, so that they can be grouped in the next stage. Further, to minimize the number of intermediate clusters, which in turn reduces the computational cost, a rule selection strategy is used to find the best rules based on information propagation and cycle structure. The dataset is first encoded using frequency-based encoding such that consecutive data elements maintain a minimum Hamming distance in encoded form; the proposed clustering algorithm then iterates over three stages to cluster the data elements into the number of clusters specified by the user. The algorithm can be applied in various fields, including healthcare, sports, chemical research, and agriculture. When verified on standard benchmark datasets with various performance metrics, our algorithm, which has quadratic time complexity, performs on par with existing algorithms.
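To make the notion of a cycle concrete, the following minimal Python sketch enumerates the cycles of a small reversible CA and computes the median of each cycle. It is an illustration under stated assumptions, not the paper's implementation: it assumes an elementary (three-neighborhood, binary) CA with a periodic boundary, treats each n-bit configuration as an integer, and takes the median over those integer values. The function names `step`, `cycles`, and `cycle_medians` are hypothetical, and the encoding and rule selection stages are not modeled.

```python
from statistics import median


def step(config, rule, n):
    """Apply one step of an elementary CA rule to an n-bit configuration.

    Assumption: periodic boundary; bit i of `config` is cell i, with the
    more significant neighbor treated as the left neighbor.
    """
    nxt = 0
    for i in range(n):
        left = (config >> ((i + 1) % n)) & 1
        center = (config >> i) & 1
        right = (config >> ((i - 1) % n)) & 1
        neighborhood = (left << 2) | (center << 1) | right
        # The rule number's bit at index `neighborhood` is the next state.
        nxt |= ((rule >> neighborhood) & 1) << i
    return nxt


def cycles(rule, n):
    """Return the cycles of the global map, or raise if it is not a bijection."""
    size = 1 << n
    image = [step(c, rule, n) for c in range(size)]
    if len(set(image)) != size:
        raise ValueError(f"rule {rule} is not reversible for n={n}")
    seen = [False] * size
    result = []
    for start in range(size):
        if seen[start]:
            continue
        cycle, c = [], start
        while not seen[c]:
            seen[c] = True
            cycle.append(c)
            c = image[c]
        result.append(cycle)
    return result


def cycle_medians(rule, n):
    """Pair each cycle with the median of its elements (as integers)."""
    return [(cyc, median(cyc)) for cyc in cycles(rule, n)]


if __name__ == "__main__":
    # Rule 170 with a periodic boundary is a cyclic shift, hence reversible.
    for cyc, med in cycle_medians(170, 4):
        print(cyc, med)
```

In this toy setting, configurations in the same cycle would form one initial cluster, and cycles whose medians are close could be candidates for merging in a later stage. How closeness is measured and how merging proceeds are left open here.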