Clustering algorithms are fundamental in data analysis, enabling the organization of data into meaningful groups. However, individual clustering methods often face limitations and biases, making it challenging to develop a universal solution for diverse datasets. To address this, we propose a novel clustering framework that combines the Minimum Description Length (MDL) principle with a genetic optimization algorithm. This approach begins with an ensemble clustering solution as a baseline, which is refined using MDL-based evaluation functions and optimized with a genetic algorithm. By leveraging the MDL principle, the method adapts to the intrinsic properties of datasets, minimizing dependence on input clusters and ensuring a data-driven process. The proposed method was evaluated on thirteen benchmark datasets using four validation metrics: accuracy, normalized mutual information (NMI), Fisher score, and adjusted Rand index (ARI). Results show that the method consistently outperforms traditional clustering algorithms, achieving higher accuracy, greater stability, and reduced biases. Its adaptability makes it a reliable tool for clustering complex and varied datasets. This study demonstrates the potential of combining MDL and genetic optimization to create a robust and versatile clustering framework, advancing the field of data analysis and offering a scalable solution for diverse applications.
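The refinement loop described above can be illustrated with a minimal sketch. The abstract does not specify the MDL evaluation functions or the genetic operators, so everything here is an assumption made for illustration: the two-part code (parameter cost, assignment cost, and a spherical-Gaussian residual cost), the uniform crossover, the point-reassignment mutation, the elitist selection, and all names such as `description_length`, `refine`, `var_floor`, and `mut_rate`. A hand-made label vector with a few errors stands in for the ensemble baseline. This is not the authors' implementation.

```python
import math
import random


def description_length(points, labels, var_floor=1e-3):
    """Assumed two-part code length in bits for a labelled point set.

    Because the data are continuous, the residual term uses densities, so the
    total is a code length only up to a constant discretization term and can
    be negative. Only differences between labelings matter here.
    """
    n, d = len(points), len(points[0])
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    # Model cost: each non-empty cluster stores a mean (d values) and one variance.
    bits = 0.5 * len(clusters) * (d + 1) * math.log2(n)
    for members in clusters.values():
        m = len(members)
        mean = [sum(p[j] for p in members) / m for j in range(d)]
        sq = sum((p[j] - mean[j]) ** 2 for p in members for j in range(d))
        var = max(sq / (m * d), var_floor)  # floor avoids log(0) for tiny clusters
        # Assignment cost: -log2 of the cluster's relative frequency, per point.
        bits += -m * math.log2(m / n)
        # Residual cost: negative log-likelihood under a spherical Gaussian.
        nll_nats = 0.5 * m * d * math.log(2 * math.pi * var) + sq / (2 * var)
        bits += nll_nats / math.log(2)
    return bits


def refine(points, baseline, k, generations=60, pop_size=30, mut_rate=0.05, seed=0):
    """Genetic search over label vectors, seeded with the baseline labeling."""
    rng = random.Random(seed)

    def score(labels):
        return description_length(points, labels)

    def mutate(labels):
        # Reassign each point to a random cluster with probability mut_rate.
        return [rng.randrange(k) if rng.random() < mut_rate else c for c in labels]

    population = [list(baseline)] + [mutate(baseline) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score)
        parents = population[: pop_size // 2]  # elitism: the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover; labels stay aligned because every individual
            # descends from the same baseline labeling.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            children.append(mutate(child))
        population = parents + children
    return min(population, key=score)


# Toy data: two well-separated 2-D blobs of 30 points each.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(30)]
points += [(data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)) for _ in range(30)]

# Stand-in for an ensemble baseline: correct labels with three deliberate errors.
baseline = [0] * 30 + [1] * 30
for i in (3, 11, 40):
    baseline[i] = 1 - baseline[i]

refined = refine(points, baseline, k=2)
print("baseline bits:", round(description_length(points, baseline), 1))
print("refined bits: ", round(description_length(points, refined), 1))
```

Because the baseline is part of the initial population and the best half always survives, the refined labeling can never have a longer description than the baseline under this score. Whether a shorter description also means higher accuracy, NMI, or ARI depends on how well the assumed coding scheme matches the data.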