Clustering algorithms are central to data analysis because they organize data into meaningful groups. However, individual clustering methods carry inherent limitations and biases, so no single method works well across diverse datasets. To address these challenges, we introduce a robust clustering framework that integrates the Minimum Description Length (MDL) principle with a genetic optimization algorithm. The framework first uses an ensemble clustering approach to generate an initial clustering solution, which is then refined using MDL-guided evaluation functions and optimized through a genetic algorithm. This integration allows the method to adapt to the dataset's intrinsic properties, minimizing its dependence on the initial clustering input and ensuring a data-driven, robust clustering process. We evaluated the proposed method on thirteen benchmark datasets using four established validation metrics: accuracy, normalized mutual information (NMI), Fisher score, and adjusted Rand index (ARI). Experimental results show that our approach consistently outperforms traditional clustering methods, yielding higher accuracy, improved stability, and reduced bias. The method's adaptability makes it effective across datasets with diverse characteristics, highlighting its potential as a versatile and reliable tool for complex clustering tasks. By combining the MDL principle with genetic optimization, this study advances clustering methodology, addressing key limitations of individual methods and delivering superior performance in varied applications.
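The refinement step described above (an initial clustering scored by an MDL-style function and improved by a genetic algorithm) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the evaluation function or genetic operators of the proposed method: the two-part code length with spherical Gaussian clusters, the variance floor `min_var`, the uniform crossover, the random-reassignment mutation, and the names `description_length` and `refine` are all hypothetical. The initial labels stand in for the output of an ensemble clustering step.

```python
import numpy as np

def description_length(X, labels, min_var=1e-2):
    """Hypothetical two-part code length in bits; smaller is better.

    Assumes each cluster is a spherical Gaussian. min_var is an assumed
    precision floor that keeps tiny clusters from looking free to encode.
    """
    n, d = X.shape
    clusters = np.unique(labels)
    total = 0.0
    for c in clusters:
        pts = X[labels == c]
        m = len(pts)
        # Bits to say which cluster each of the m points belongs to.
        total += -m * np.log2(m / n)
        # Bits to encode the points given the cluster (Gaussian code length
        # at the maximum-likelihood variance, up to a quantization constant).
        var = max(pts.var(axis=0).mean(), min_var)
        total += 0.5 * m * d * np.log2(2 * np.pi * np.e * var)
    # Bits for the model itself: d means plus one variance per cluster.
    total += 0.5 * len(clusters) * (d + 1) * np.log2(n)
    return total

def refine(X, init_labels, k, pop_size=30, generations=100, mut_rate=0.02, seed=0):
    """Tiny elitist genetic algorithm over label vectors (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = len(init_labels)

    def mutate(lab):
        # Reassign a small random subset of points to random clusters.
        child = lab.copy()
        mask = rng.random(n) < mut_rate
        child[mask] = rng.integers(0, k, size=mask.sum())
        return child

    # Seed the population with the initial solution and mutated copies of it,
    # so cluster ids stay aligned across individuals.
    pop = [init_labels.copy()] + [mutate(init_labels) for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda lab: description_length(X, lab))
        elite = pop[: pop_size // 2]          # survivors are kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            i, j = rng.choice(len(elite), size=2, replace=False)
            take = rng.random(n) < 0.5        # uniform crossover mask
            children.append(mutate(np.where(take, elite[i], elite[j])))
        pop = elite + children
    return min(pop, key=lambda lab: description_length(X, lab))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])
    truth = np.repeat([0, 1], 40)
    # Corrupt 30% of the labels to mimic an imperfect initial clustering.
    noisy = np.where(rng.random(80) < 0.3, 1 - truth, truth)
    best = refine(X, noisy, k=2)
    print(description_length(X, noisy), description_length(X, best))
```

Because the best half of the population survives each generation unchanged, the returned labeling never has a larger code length than the initial one. Whether it also scores better on accuracy, NMI, or ARI depends on how well the assumed cluster model fits the data.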