Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, K-means cannot separate clusters that are not linearly separable. Kernel K-means addresses this limitation but requires a large kernel matrix, making it computationally and memory intensive. Prior work has accelerated Kernel K-means by formulating it using sparse linear algebra primitives and implementing it on a single GPU. However, that approach cannot run on datasets with more than approximately 80,000 samples due to limited GPU memory. In this work, we address this issue by presenting a suite of distributed-memory parallel algorithms for large-scale Kernel K-means clustering on multi-GPU systems. Our approach maps the most computationally expensive components of Kernel K-means onto communication-efficient distributed linear algebra primitives tailored specifically to Kernel K-means, enabling highly scalable implementations that efficiently cluster million-scale datasets. Central to our work is the design of partitioning schemes that enable communication-efficient composition of the linear algebra primitives that appear in Kernel K-means. Our 1.5D algorithm consistently achieves the highest performance, enabling Kernel K-means to scale to data one to two orders of magnitude larger than previously practical. On 256 GPUs, it achieves a geometric mean weak scaling efficiency of $79.7\%$ and a geometric mean strong scaling speedup of $4.2\times$. Compared to our 1D algorithm, the 1.5D approach achieves up to a $3.6\times$ speedup on 256 GPUs. Relative to a single-GPU sliding window implementation, it reduces clustering time from over an hour to under two seconds. Our results show that distributed algorithms designed with application-specific linear algebraic formulations can achieve substantial performance improvements.
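To make the role of the kernel matrix concrete, the following is a minimal single-process sketch of Kernel K-means on a precomputed kernel matrix. It is not the distributed algorithm described in the abstract, and it uses dense Python lists rather than sparse GPU primitives. The function names (rbf_kernel_matrix, kernel_kmeans), the choice of an RBF kernel, the gamma value, and the toy data are all assumptions made for illustration. The sketch relies on the standard identity that the squared feature-space distance from point i to the centroid of cluster c equals K[i][i] - 2 * mean_{j in c} K[i][j] + mean_{j,l in c} K[j][l], so only kernel values are needed, never explicit feature vectors.

```python
import math


def rbf_kernel_matrix(points, gamma):
    """Dense n x n kernel matrix with K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [[math.exp(-gamma * sq_dist(p, q)) for q in points] for p in points]


def kernel_kmeans(K, k, init_labels, max_iters=100):
    """Lloyd-style Kernel K-means on a precomputed kernel matrix K (illustrative)."""
    n = len(K)
    labels = list(init_labels)
    for _ in range(max_iters):
        members = [[j for j in range(n) if labels[j] == c] for c in range(k)]

        # E[i][c] = mean of K[i][j] over j in cluster c. In matrix terms this is
        # K times a normalized cluster-indicator matrix.
        E = [[sum(K[i][j] for j in members[c]) / len(members[c]) if members[c] else 0.0
              for c in range(k)] for i in range(n)]

        # norms[c] = mean of K[j][l] over all pairs in cluster c (squared centroid norm).
        # Empty clusters get an infinite norm so that no point is assigned to them.
        norms = [sum(E[i][c] for i in members[c]) / len(members[c]) if members[c] else math.inf
                 for c in range(k)]

        new_labels = []
        for i in range(n):
            dists = [K[i][i] - 2.0 * E[i][c] + norms[c] for c in range(k)]
            new_labels.append(dists.index(min(dists)))

        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


# Toy data (assumed): two well-separated groups of three 2-D points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
K = rbf_kernel_matrix(points, gamma=0.1)
labels = kernel_kmeans(K, k=2, init_labels=[0, 1, 0, 1, 0, 1])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The kernel matrix in this sketch has n x n entries, which is the memory cost the abstract refers to. Computing E is the step that touches the whole kernel matrix in every iteration.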
