Clustering is a widely used technique with a long and rich history in a variety of areas. However, most existing algorithms either do not scale well to large datasets or lack theoretical guarantees of convergence. This paper introduces a provably robust clustering algorithm based on loss minimization that performs well on Gaussian mixture models with outliers. It provides theoretical guarantees that, under certain assumptions, the algorithm attains high accuracy with high probability. The algorithm can also serve as an initialization strategy for $k$-means clustering. Experiments on large-scale real-world datasets demonstrate its effectiveness when the number of clusters is large; $k$-means initialized with the proposed algorithm outperforms many classic clustering methods in both speed and accuracy while scaling well to large datasets such as ImageNet.
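To make the initialization use case concrete, the following is a minimal sketch of how any set of candidate centers can be handed to standard Lloyd-style $k$-means iterations on a Gaussian mixture with outliers. It does not implement the algorithm proposed in the paper: `lloyd_kmeans`, the synthetic data, and the perturbed-means initializer `init_centers` are all hypothetical stand-ins chosen for illustration, and the cluster means, noise scales, and outlier range are assumed values.

```python
import numpy as np


def lloyd_kmeans(X, init_centers, n_iter=50):
    """Plain Lloyd iterations starting from caller-supplied centers."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # Squared distances from every point to every center: shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(centers.shape[0]):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


rng = np.random.default_rng(0)

# Assumed toy setup: three well-separated Gaussian clusters plus uniform outliers.
true_means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
inliers = np.vstack([m + rng.normal(scale=0.5, size=(100, 2)) for m in true_means])
outliers = rng.uniform(-5.0, 15.0, size=(15, 2))
X = np.vstack([inliers, outliers])

# Stand-in for a robust initializer (NOT the paper's method): noisy true means.
init_centers = true_means + rng.normal(scale=1.0, size=true_means.shape)

centers, labels = lloyd_kmeans(X, init_centers)
print(np.round(centers, 2))
```

In this sketch the initializer is only a placeholder; the point is the interface, in which a robust method supplies `init_centers` and the standard iterations refine them.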