Dense linear layers are the dominant computational bottleneck in large neural networks, creating a critical need for more efficient alternatives. Previous efforts focused on a small number of hand-crafted structured matrices and neglected to investigate whether these structures can surpass dense layers in terms of compute-optimal scaling laws when both the model size and training examples are optimally allocated. In this work, we present a unifying framework that enables searching among all linear operators expressible via an Einstein summation. This framework encompasses many previously proposed structures, such as low-rank, Kronecker, Tensor-Train, Block Tensor-Train (BTT), and Monarch, along with many novel structures. To analyze the framework, we develop a taxonomy of all such operators based on their computational and algebraic properties, and show that differences in the compute-optimal scaling laws are mostly governed by a small number of variables that we introduce. In particular, a small $\omega$ (which measures parameter sharing) and a large $\psi$ (which measures the rank) reliably lead to better scaling laws. Guided by the insight that full-rank structures that maximize parameters per unit of compute perform the best, we propose BTT-MoE, a novel Mixture-of-Experts (MoE) architecture obtained by sparsifying computation in the BTT structure. In contrast to standard sparse MoE, which routes among entire feed-forward networks, BTT-MoE learns an MoE in every single linear layer of the model, including the projection matrices in the attention blocks. We find that BTT-MoE provides a substantial compute-efficiency gain over both dense layers and standard MoE.
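To make the einsum framework concrete, the sketch below (a minimal NumPy illustration, not the paper's implementation) shows one of the structures the framework encompasses, a Kronecker-structured layer $W = A \otimes B$, expressed directly as an Einstein summation. The structured form applies the layer without ever materializing the dense matrix, which is the source of the compute savings the abstract refers to; the dimensions and variable names here are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A Kronecker-structured layer W = A kron B applied to an input x.
# Rather than materializing the (m1*m2) x (n1*n2) dense matrix,
# reshape x into an (n1, n2) grid and contract each axis with its factor.
m1, n1, m2, n2 = 3, 4, 5, 6
A = rng.standard_normal((m1, n1))
B = rng.standard_normal((m2, n2))
x = rng.standard_normal(n1 * n2)

# Structured matvec via Einstein summation over both factors at once.
X = x.reshape(n1, n2)
y_structured = np.einsum('ik,jl,kl->ij', A, B, X).reshape(-1)

# Equivalent dense matvec: materializes all (m1*m2)*(n1*n2) entries.
y_dense = np.kron(A, B) @ x

assert np.allclose(y_structured, y_dense)
```

Other structures in the framework (low-rank, Tensor-Train, BTT, Monarch) correspond to different einsum expressions over the reshaped input, differing in how many parameters and FLOPs they spend per output, which is what the $\omega$ and $\psi$ variables summarize.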