The anticlustering problem is to partition a set of objects into K equal-sized anticlusters such that the sum of distances within anticlusters is maximized. The anticlustering problem is NP-hard. We focus on anticlustering in Euclidean spaces, where the input data is tabular and each object is represented as a D-dimensional feature vector. Distances are measured as squared Euclidean distances between the respective vectors. Applications of Euclidean anticlustering include social studies, particularly in psychology, K-fold cross-validation in which each fold should be a good representative of the entire dataset, the creation of mini-batches for gradient descent in neural network training, and balanced K-cut partitioning. In particular, machine-learning applications involve million-scale datasets and very large values of K, making scalable anticlustering algorithms essential. Existing algorithms are either exact methods that can solve only small instances or heuristic methods, among which the most scalable is the exchange-based heuristic fast_anticlustering. We propose a new algorithm, the Assignment-Based Anticlustering algorithm (ABA), which scales to very large instances. A computational study shows that ABA outperforms fast_anticlustering in both solution quality and running time. Moreover, ABA scales to instances with millions of objects and hundreds of thousands of anticlusters within short running times, beyond what fast_anticlustering can handle. As a balanced K-cut partitioning method for tabular data, ABA is superior to the well-known METIS method in both solution quality and running time. The code of the ABA algorithm is available on GitHub.
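To make the objective concrete, the following is a minimal sketch of how the quantity being maximized could be evaluated for a given partition. It is not the ABA algorithm or fast_anticlustering. The function names (`anticlustering_objective`, `pairwise_objective`, `random_balanced_labels`) are hypothetical, and the sketch assumes NumPy and that K divides the number of objects. It relies on the standard identity that the sum of squared Euclidean distances over all pairs in a group of size n equals n times the sum of squared distances to the group centroid, which avoids forming the full pairwise distance matrix.

```python
import numpy as np


def anticlustering_objective(X, labels, K):
    """Sum of squared Euclidean distances over all pairs sharing an anticluster.

    Uses the identity: sum_{i<j} ||x_i - x_j||^2 = n_k * sum_i ||x_i - mean_k||^2,
    so the cost is linear in the number of objects.
    """
    total = 0.0
    for k in range(K):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += len(members) * np.sum((members - centroid) ** 2)
    return total


def pairwise_objective(X, labels):
    """Reference implementation via the full distance matrix (small inputs only)."""
    diffs = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    same_group = labels[:, None] == labels[None, :]
    # Each unordered pair is counted twice in the symmetric matrix.
    return sq_dists[same_group].sum() / 2.0


def random_balanced_labels(n, K, rng):
    """Random equal-sized assignment; assumes K divides n (illustrative baseline)."""
    labels = np.repeat(np.arange(K), n // K)
    rng.shuffle(labels)
    return labels


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))            # 12 objects, D = 3 features
labels = random_balanced_labels(12, 4, rng)
print(anticlustering_objective(X, labels, 4))
print(pairwise_objective(X, labels))    # should agree up to rounding
```

An anticlustering method would search over balanced label vectors to make this value as large as possible; the random assignment here only serves as a baseline for evaluating the objective.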