Modern AI relies on huge matrix multiplications (MatMuls), whose computation poses a scalability problem for inference and training. We propose a GPU-native bilinear operator as an alternative to MatMul in neural networks, offering a three-way tradeoff among speed, accuracy, and parameter count. In particular, this operator requires substantially fewer FLOPs to evaluate ($\ll n^3$), yet increases the parameter count compared to MatMul ($\gg n^2$). We call this operator Strassen-Tile (STL). The key idea behind STL is a local learnable change-of-basis, applied on tiles of the weight and activation matrices, followed by an element-wise product between the tiles, implemented simultaneously via MatMul. The key technical question we study is how to optimize the change-of-basis of a given layer, which is a highly non-convex problem. We show that theory-backed initializations (inspired by fast matrix and polynomial multiplication) lead to substantially better accuracy than random SGD initialization. This phenomenon motivates further algorithmic study of STL optimization in DNNs. Our experiments demonstrate that STL can approximate 4x4 MatMul of tiles while reducing FLOPs by a factor of 2.66, and can improve ImageNet-1K accuracy of the SoTA T2T-ViT-7 (4.3M parameters) while lowering FLOPs. Even with non-CUDA-optimized PyTorch code, STL achieves wall-clock speedups in the compute-bound regime. These results, together with its theoretical grounding, suggest STL as a promising building block for scalable and cost-efficient AI.
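The bilinear shape of the operator can be illustrated with a minimal sketch: two tiles are encoded by change-of-basis matrices $U$ and $V$, multiplied element-wise, and decoded by $D$. The matrices below are the textbook Strassen coefficients for $2\times2$ tiles (7 intermediate products), used here only to show the operator's form; STL itself uses larger tiles and *learned* bases, and the variable names are illustrative, not from the paper's code.

```python
import numpy as np

# Encoder for the activation tile A (flattened as [a11, a12, a21, a22]).
# Each row forms one of Strassen's 7 linear combinations M1..M7.
U = np.array([
    [ 1, 0, 0, 1],   # a11 + a22
    [ 0, 0, 1, 1],   # a21 + a22
    [ 1, 0, 0, 0],   # a11
    [ 0, 0, 0, 1],   # a22
    [ 1, 1, 0, 0],   # a11 + a12
    [-1, 0, 1, 0],   # a21 - a11
    [ 0, 1, 0, -1],  # a12 - a22
], dtype=float)

# Encoder for the weight tile B (flattened as [b11, b12, b21, b22]).
V = np.array([
    [ 1, 0, 0, 1],   # b11 + b22
    [ 1, 0, 0, 0],   # b11
    [ 0, 1, 0, -1],  # b12 - b22
    [-1, 0, 1, 0],   # b21 - b11
    [ 0, 0, 0, 1],   # b22
    [ 1, 1, 0, 0],   # b11 + b12
    [ 0, 0, 1, 1],   # b21 + b22
], dtype=float)

# Decoder: recombines the 7 element-wise products into the output tile.
D = np.array([
    [1,  0, 0, 1, -1, 0, 1],  # c11 = M1 + M4 - M5 + M7
    [0,  0, 1, 0,  1, 0, 0],  # c12 = M3 + M5
    [0,  1, 0, 1,  0, 0, 0],  # c21 = M2 + M4
    [1, -1, 1, 0,  0, 1, 0],  # c22 = M1 - M2 + M3 + M6
], dtype=float)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2))

# The bilinear operator: decode(encode(A) * encode(B)).
C_stl = D @ ((U @ A.reshape(-1)) * (V @ B.reshape(-1)))
```

With these particular (fixed) bases, the operator reproduces exact $2\times2$ MatMul with 7 scalar multiplications instead of 8; STL generalizes this by training $U$, $V$, and $D$ per layer, trading exactness for fewer FLOPs on larger tiles.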