In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
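The three-step procedure described above (subsample, evaluate candidates on the subsample, keep the winner) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the oracle is modeled as a function from a point index to its ground-truth label, accuracy is taken to be the fraction of points labeled correctly under the best relabeling of clusters, and the k-centers candidate is the plain farthest-first heuristic rather than the smoothed variant the analysis covers.

```python
import itertools
import math
import random

def farthest_first(points, k):
    """Plain Gonzalez-style farthest-first traversal (not the smoothed variant)."""
    centers = [0]
    while len(centers) < k:
        # Next center: the point farthest from its nearest chosen center.
        nxt = max(range(len(points)),
                  key=lambda i: min(math.dist(points[i], points[c]) for c in centers))
        centers.append(nxt)
    return [min(range(k), key=lambda j: math.dist(p, points[centers[j]]))
            for p in points]

def single_linkage(points, k):
    """Single-linkage via Kruskal: merge closest pairs until k components remain."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

def accuracy(pred, truth, k):
    """Assumed accuracy notion: best agreement over all relabelings of clusters."""
    best = max(sum(perm[p] == t for p, t in zip(pred, truth))
               for perm in itertools.permutations(range(k)))
    return best / len(truth)

def select_on_subsample(points, oracle, algorithms, k, m, seed=0):
    """Steps (1)-(3): subsample m points, score each candidate, return the best."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(points)), m)   # (1) uniform subsample
    sample = [points[i] for i in idx]
    truth = [oracle(i) for i in idx]          # only m oracle queries are spent
    scores = {name: accuracy(alg(sample, k), truth, k)   # (2) evaluate candidates
              for name, alg in algorithms.items()}
    return max(scores, key=scores.get), scores           # (3) pick the winner

if __name__ == "__main__":
    # Toy data (hypothetical): two Gaussian blobs of 200 points each.
    gen = random.Random(1)
    labels = [i % 2 for i in range(400)]
    data = [(gen.gauss(8 * l, 1), gen.gauss(0, 1)) for l in labels]
    candidates = {"farthest_first": farthest_first, "single_linkage": single_linkage}
    # m = 20 is 5% of the data, mirroring the subsample size quoted above.
    print(select_on_subsample(data, lambda i: labels[i], candidates, k=2, m=20))
```

Whether the winner on the subsample is also the winner on the full instance is exactly what the size generalization guarantees address; the sketch only shows the selection mechanics and makes no such guarantee on its own.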