We present \textbf{Flash-GMM}, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a \textbf{20$\times$} speedup over existing implementations and enables training on datasets more than \textbf{100$\times$} larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for $k$-means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to $1.7\times$ fewer distance computations or, equivalently, yields $+2$ to $+12$ recall@10 at matched computational cost. We release the kernel as an open-source project.
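To illustrate why the full $N \times K$ responsibility matrix never needs to be stored, the following is a minimal NumPy sketch of one EM step that processes the data in chunks and keeps only per-component sufficient statistics. It is not the Flash-GMM Triton kernel: the diagonal-covariance assumption, the function names \texttt{log\_gauss\_diag} and \texttt{em\_step\_streaming}, the \texttt{chunk\_size} parameter, and the variance floor of $10^{-6}$ are all choices made for this illustration only.

```python
import numpy as np


def log_gauss_diag(x, means, variances):
    """Log-density of each row of x under K diagonal Gaussians.

    x: (B, D), means: (K, D), variances: (K, D) -> result: (B, K).
    Diagonal covariance is an assumption of this sketch.
    """
    diff = x[:, None, :] - means[None, :, :]
    mahalanobis = np.sum(diff ** 2 / variances[None, :, :], axis=2)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    return -0.5 * (mahalanobis + log_norm[None, :])


def em_step_streaming(X, weights, means, variances, chunk_size=1024):
    """One EM step that never holds more than a (chunk_size, K) block
    of responsibilities in memory (hypothetical helper)."""
    N, D = X.shape
    K = means.shape[0]
    s0 = np.zeros(K)        # sum of responsibilities per component
    s1 = np.zeros((K, D))   # responsibility-weighted sum of x
    s2 = np.zeros((K, D))   # responsibility-weighted sum of x**2

    for start in range(0, N, chunk_size):
        x = X[start:start + chunk_size]
        # E-step on this chunk only, using a numerically stable softmax.
        log_p = np.log(weights)[None, :] + log_gauss_diag(x, means, variances)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # Fold the chunk into the running statistics, then discard r.
        s0 += r.sum(axis=0)
        s1 += r.T @ x
        s2 += r.T @ (x * x)

    # M-step from the accumulated statistics.
    new_weights = s0 / N
    new_means = s1 / s0[:, None]
    new_variances = s2 / s0[:, None] - new_means ** 2 + 1e-6  # small floor
    return new_weights, new_means, new_variances
```

Because the M-step depends only on the three accumulators, the result is independent of \texttt{chunk\_size} up to floating-point rounding. A fused GPU kernel can apply the same idea at tile granularity, although the actual kernel design may differ from this sketch.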