Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Clustering is an important tool in data analysis, and K-means is popular for its simplicity and versatility. However, K-means cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work accelerated Kernel K-means by formulating it with sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives tailored to Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of these primitives. Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. At that scale it is also up to $3.6\times$ faster than our 1D algorithm, and it reduces clustering time from over an hour with a single-GPU sliding window implementation to under two seconds. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvements.
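To make the linear algebraic view concrete, the following is a minimal single-node NumPy sketch of the textbook Kernel K-means iteration written as dense matrix products. It is an illustration only, not the distributed 1D or 1.5D algorithms presented in this work. The function names (`rbf_kernel`, `kernel_kmeans`, `kernel_matrix_bytes`), the choice of an RBF kernel, and the 4-byte entry size used in the memory estimate are all assumptions made for the example.

```python
import numpy as np

def kernel_matrix_bytes(n, bytes_per_entry=4):
    # Memory for a dense n x n kernel matrix.
    # Assumption: 4-byte (single-precision) entries; the precision is illustrative.
    return n * n * bytes_per_entry

def rbf_kernel(X, gamma=1.0):
    # Dense RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, k, init_labels, max_iter=100):
    # Kernel K-means on a precomputed n x n kernel matrix K.
    n = K.shape[0]
    labels = np.asarray(init_labels).copy()
    for _ in range(max_iter):
        # V is the k x n cluster-indicator matrix, each row scaled by 1/|cluster|.
        V = np.zeros((k, n))
        V[labels, np.arange(n)] = 1.0
        sizes = V.sum(axis=1)
        V /= np.maximum(sizes, 1.0)[:, None]
        # E[i, c]: mean kernel value between point i and the members of cluster c.
        E = K @ V.T
        # c_norm[c] = (V K V^T)[c, c]: squared norm of centroid c in feature space.
        c_norm = np.sum(V * E.T, axis=1)
        # Squared feature-space distance from every point to every centroid.
        D = np.diag(K)[:, None] - 2.0 * E + c_norm[None, :]
        D[:, sizes == 0] = np.inf  # never assign points to an empty cluster
        new_labels = np.argmin(D, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Under the 4-byte assumption, `kernel_matrix_bytes(80_000)` evaluates to 25.6 GB for the kernel matrix alone, which gives a sense of why a dense kernel matrix quickly becomes a memory concern on a single device. The product `K @ V.T` dominates the cost of each iteration in this sketch.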
