State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter count and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFN), which are less studied than attention blocks. We consider three candidate linear layer approximations in the FFN by combining efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We first demonstrate that they can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called \textit{self-guided training}, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. Experiments on the large RefinedWeb dataset show that our methods are both efficient and effective for training and inference. Interestingly, these structured FFNs exhibit steeper scaling curves than the original models. Further applying self-guided training to structured matrices with 32\% of the FFN parameters and a 2.5$\times$ speed-up results in only a 0.4 perplexity increase under the same training FLOPs. Finally, we develop wide and structured networks that surpass current medium-sized and large-sized Transformers in both perplexity and throughput. Our code is available at \url{https://github.com/CLAIRE-Labo/StructuredFFN/tree/main}.
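For reference, the two building blocks mentioned above can be sketched for a dense FFN weight $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ as follows; the rank $r$ and block count $b$ are illustrative placeholders, and the exact combinations we study are defined in the method section rather than here:
% Generic structured parameterizations of a dense FFN weight (illustrative only).
\begin{align*}
\text{low-rank:} &\quad W \approx U V, && U \in \mathbb{R}^{d_{\text{out}} \times r},\; V \in \mathbb{R}^{r \times d_{\text{in}}},\; r \ll \min(d_{\text{in}}, d_{\text{out}}),\\
\text{block-diagonal:} &\quad W \approx \operatorname{diag}(B_1, \dots, B_b), && B_i \in \mathbb{R}^{(d_{\text{out}}/b) \times (d_{\text{in}}/b)}.
\end{align*}
The low-rank form reduces the parameter count from $d_{\text{out}} d_{\text{in}}$ to $r(d_{\text{out}} + d_{\text{in}})$, while the block-diagonal form reduces it by a factor of $b$; both also reduce the matrix-multiply FLOPs proportionally.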