In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
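The three-step procedure in the abstract (subsample, evaluate candidates on the subsample, select the best) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`select_on_subsample`, `farthest_first`, `kmeanspp_seeding`), the accuracy measure (fraction of points labeled correctly under the best relabeling of clusters), uniform sampling without replacement, and the toy data. The farthest-first routine is the plain Gonzalez heuristic rather than the smoothed variant the paper analyzes, and the k-means++ routine performs seeding only. The sketch also does not check the conditions under which the size generalization guarantees hold.

```python
import itertools
import math
import random


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def farthest_first(points, k, rng):
    """Plain Gonzalez-style farthest-first traversal (rng is unused)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    return assign(points, centers)


def kmeanspp_seeding(points, k, rng):
    """k-means++ seeding only: sample centers proportionally to squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return assign(points, centers)


def accuracy(pred, truth, k):
    """Fraction of correctly labeled points under the best cluster relabeling.

    Brute-forces all k! relabelings, so it is only practical for small k.
    """
    best = 0
    for perm in itertools.permutations(range(k)):
        best = max(best, sum(perm[p] == t for p, t in zip(pred, truth)))
    return best / len(truth)


def select_on_subsample(points, oracle, candidates, k, fraction, seed=0):
    """Score each candidate on a uniform subsample and return the best one.

    `oracle(i)` returns the ground-truth cluster of point i; it is queried
    only for the sampled indices.
    """
    rng = random.Random(seed)
    m = max(k, round(fraction * len(points)))
    idx = rng.sample(range(len(points)), m)
    sample = [points[i] for i in idx]
    truth = [oracle(i) for i in idx]
    scores = {name: accuracy(algo(sample, k, rng), truth, k)
              for name, algo in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy instance (hypothetical): two well-separated Gaussian blobs, with the
# generating labels standing in for the expensive oracle.
data_rng = random.Random(1)
labels = [i % 2 for i in range(2000)]
points = [(data_rng.gauss(20 * y, 1), data_rng.gauss(20 * y, 1)) for y in labels]
candidates = {"farthest_first": farthest_first,
              "kmeanspp_seeding": kmeanspp_seeding}
best, scores = select_on_subsample(points, labels.__getitem__, candidates,
                                   k=2, fraction=0.05)
print(best, scores)
```

With `fraction=0.05` the oracle is queried for 100 of the 2000 points, which is the source of the savings: the full instance is never labeled, and each candidate runs only on the subsample. Whether the winner on the subsample is also the winner on the full instance is exactly what the size generalization guarantees address, and this sketch makes no such claim on its own.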