We present HadaCore, a modified Fast Walsh-Hadamard Transform (FWHT) algorithm optimized for the Tensor Cores present in modern GPU hardware. HadaCore follows the recursive structure of the original FWHT algorithm, achieving the same asymptotic runtime complexity while leveraging a hardware-aware work decomposition that benefits from Tensor Core acceleration, reducing both compute and data-exchange bottlenecks. On Nvidia A100 and H100 GPUs, HadaCore achieves speedups of 1.1-1.4x and 1.0-1.3x respectively, with peak gains of 3.5x and 3.6x, when compared to the existing state-of-the-art implementation of the original algorithm. We also show that when using FP16 or BF16, our implementation is numerically accurate, enabling comparable accuracy on MMLU benchmarks when used in an end-to-end Llama3 inference run with quantized (FP8) attention.
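For reference, the butterfly structure of the original FWHT that HadaCore builds on can be sketched as follows. This is a minimal pure-Python illustration of the classic O(n log n) algorithm, not the Tensor Core kernel described above; the function name and unnormalized convention are our own assumptions for the sketch.

```python
def fwht(x):
    """Iterative Fast Walsh-Hadamard Transform (unnormalized).

    Illustrative sketch of the classic algorithm, not HadaCore itself.
    Performs log2(n) butterfly passes over the input, so the length of
    x must be a power of two. Runs in O(n log n) time.
    """
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        # One butterfly pass: combine elements h apart in blocks of 2h.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x
```

Because the Hadamard matrix satisfies H·H = n·I, applying the unnormalized transform twice scales the input by n, which gives a quick correctness check.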