Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce kernels with limited fusion. We present \textbf{FlashSinkhorn}, an IO-aware EOT solver for the squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, and improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at \url{https://github.com/ot-triton-lab/ot_triton}.
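To make the rewrite concrete, the sketch below is a minimal dense NumPy reference (the function names are ours, not the released API): with $C_{ij} = \lVert x_i - y_j\rVert^2 = \lVert x_i\rVert^2 + \lVert y_j\rVert^2 - 2\,x_i\cdot y_j$, the stabilized update $f_i = -\varepsilon\,\mathrm{LSE}_j[\log b_j + (g_j - C_{ij})/\varepsilon]$ becomes $\lVert x_i\rVert^2$ minus an $\varepsilon$-scaled row-wise LogSumExp over biased scores built from the dot products $x_i\cdot y_j$. The sketch materializes the score matrix for clarity; the point of FlashSinkhorn is that its tiled kernels compute the same reduction without ever storing it.

```python
import numpy as np

def logsumexp(s, axis):
    """Max-stabilized LogSumExp along one axis."""
    m = s.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(s - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def flash_style_sinkhorn(x, y, a, b, eps, n_iter=100):
    """Dense reference for log-domain Sinkhorn with squared Euclidean cost,
    written as LogSumExp reductions over biased dot-product scores.
    x: (n, d), y: (m, d); a, b: positive marginals summing to 1."""
    sq_x = (x ** 2).sum(1)          # ||x_i||^2, per-row bias
    sq_y = (y ** 2).sum(1)          # ||y_j||^2, per-column bias
    dots = x @ y.T                  # attention-like score matrix x_i . y_j
    f = np.zeros_like(sq_x)
    g = np.zeros_like(sq_y)
    for _ in range(n_iter):
        # f_i = ||x_i||^2 - eps * LSE_j[ log b_j + (g_j - ||y_j||^2 + 2 x_i.y_j)/eps ]
        f = sq_x - eps * logsumexp(np.log(b) + (g - sq_y + 2 * dots) / eps, axis=1)
        # symmetric column update for g
        g = sq_y - eps * logsumexp(np.log(a) + (f - sq_x + 2 * dots.T) / eps, axis=1)
    return f, g
```

The returned potentials define the transport plan $P_{ij} = a_i b_j \exp((f_i + g_j - C_{ij})/\varepsilon)$, whose column marginal matches $b$ exactly after the $g$ update and whose row marginal converges to $a$.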