With the development of big data technology, data analysis has become increasingly important. Traditional clustering algorithms such as K-means are highly sensitive to the initial centroid selection and perform poorly on non-convex datasets. In this paper, we address these problems by proposing a data-driven Bregman divergence parameter optimization clustering algorithm (DBGSA), which uses the Universal Gravitational Algorithm to pull similar points in the dataset closer together. We construct a gravitational coefficient equation with a special property: its influence factor gradually decreases as the iterations progress. Furthermore, we identify cluster centers by minimizing the Bregman divergence generalized power mean information loss, and we build a hyperparameter identification optimization model, which effectively addresses the problems of manual parameter adjustment and uncertainty in the improved dataset. Extensive experiments are conducted on four simulated datasets and six real datasets. The results demonstrate that DBGSA significantly improves the accuracy of various clustering algorithms, by an average of 63.8%, compared with similar approaches such as enhanced clustering algorithms and dataset-improvement methods. Additionally, we set up a three-dimensional grid search to compare the effects of different parameter values within threshold conditions, and found that the parameter set provided by our model is optimal. This finding provides strong evidence of the high accuracy and robustness of the algorithm.
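To make the gravitational preprocessing idea concrete, the following is a minimal sketch and not the DBGSA update itself. It assumes that each point is pulled toward the centroid of its `k` nearest neighbours and that the gravitational coefficient decays as `g0 * exp(-decay * t)`. The function name `gravitational_shrink`, its parameters, and the decay form are all hypothetical, chosen only to illustrate an influence factor that decreases with the iteration count.

```python
import numpy as np


def gravitational_shrink(X, n_iter=10, k=5, g0=0.5, decay=0.5):
    """Pull each point toward the centroid of its k nearest neighbours.

    Hypothetical sketch: the coefficient g0 * exp(-decay * t) is an assumed
    decaying schedule, not the gravitational coefficient equation of DBGSA.
    Requires k < number of points.
    """
    X = np.asarray(X, dtype=float).copy()
    for t in range(n_iter):
        # Assumed influence factor: shrinks as the iteration index t grows.
        g = g0 * np.exp(-decay * t)
        # Pairwise squared Euclidean distances, shape (n, n).
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        # Exclude each point from its own neighbour list.
        np.fill_diagonal(d2, np.inf)
        # Indices of the k nearest neighbours of every point, shape (n, k).
        nn = np.argsort(d2, axis=1)[:, :k]
        # Centroid of each point's neighbours, shape (n, d).
        target = X[nn].mean(axis=1)
        # Move every point a fraction g of the way toward that centroid.
        X = X + g * (target - X)
    return X
```

Under these assumptions, the output can be passed to any standard clustering routine (for example K-means) in place of the raw data. Because the step size shrinks over iterations, early passes compact each group strongly while later passes only make small adjustments.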