Flash-KMeans: Fast and Memory-Efficient Exact K-Means

$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm through the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than by theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck because the full $N \times K$ distance matrix is explicitly materialized in High Bandwidth Memory (HBM). At the same time, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over the best baselines, while outperforming industry-standard libraries such as cuML and FAISS by 33$\times$ and over 200$\times$, respectively. Our code is open-sourced at https://github.com/svg-project/flash-kmeans.
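The two kernel-level ideas can be illustrated with a small CPU sketch in NumPy. This is not the flash-kmeans GPU implementation and does not reflect its API; the function names `assign_tiled` and `update_sorted`, the tile size, and the rule of keeping the old centroid for an empty cluster are all assumptions made for illustration. The first function processes centroids tile by tile and keeps only a running minimum per point, so the full $N \times K$ distance matrix never exists at once. The second sorts points by cluster id so that each centroid becomes a reduction over one contiguous segment instead of a scatter of per-point atomic adds.

```python
import numpy as np


def assign_tiled(x, centroids, tile=256):
    """Nearest-centroid assignment with a running (online) argmin.

    Only an (N, tile) block of squared distances is alive at any time;
    the full N x K matrix is never materialized.
    """
    n, k = x.shape[0], centroids.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    x_sq = (x * x).sum(axis=1)
    rows = np.arange(n)
    for start in range(0, k, tile):
        c = centroids[start:start + tile]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, for this tile only
        d = x_sq[:, None] - 2.0 * (x @ c.T) + (c * c).sum(axis=1)[None, :]
        local_idx = d.argmin(axis=1)
        local_dist = d[rows, local_idx]
        # Strict "<" keeps the earliest centroid on exact ties
        better = local_dist < best_dist
        best_dist[better] = local_dist[better]
        best_idx[better] = start + local_idx[better]
    return best_idx


def update_sorted(x, assign, centroids):
    """Centroid update via sort + contiguous segment reduction.

    Assumption for this sketch: an empty cluster keeps its old centroid.
    """
    k = centroids.shape[0]
    # Inverse mapping: position in cluster-sorted order -> original point index
    order = np.argsort(assign, kind="stable")
    sorted_x = x[order]
    counts = np.bincount(assign, minlength=k)
    starts = np.cumsum(counts) - counts      # segment start of each cluster
    nonempty = np.flatnonzero(counts)
    new_centroids = centroids.copy()
    # One reduction per contiguous segment, no scattered accumulation
    sums = np.add.reduceat(sorted_x, starts[nonempty], axis=0)
    new_centroids[nonempty] = sums / counts[nonempty, None]
    return new_centroids


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.standard_normal((1000, 8))
    cents = rng.standard_normal((37, 8))
    for _ in range(5):                       # a few Lloyd iterations
        labels = assign_tiled(points, cents, tile=16)
        cents = update_sorted(points, labels, cents)
    print(np.bincount(labels, minlength=37))
```

On a GPU the same structure would map to a fused kernel that keeps the running minimum in registers and to a segmented reduction over the sorted points; the sketch only shows the data flow, not the performance characteristics reported above.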
