State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called \textit{self-guided training}, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. Finally, we explore the scaling behavior of structured matrices, revealing steeper curves in scaling training FLOPs, along with a favorable scaling trend in the overtraining regime. Specifically, we show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off. Our code is available at \url{https://github.com/CLAIRE-Labo/StructuredFFN/tree/main}.
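A minimal NumPy sketch of the two structured building blocks mentioned above. The dimensions, rank, block count, and initializations here are illustrative choices, not the paper's configuration; it only shows how low-rank and block-diagonal parameterizations reduce parameter count and per-token FLOPs relative to a dense layer.

```python
import numpy as np

d, r, b = 1024, 64, 4  # hidden dim, low rank, number of blocks (illustrative)

# Dense linear layer: d x d parameters, d*d multiply-adds per token.
dense_params = d * d

# Low-rank parameterization: W ~ U @ V with U (d x r) and V (r x d),
# giving 2*d*r parameters and 2*d*r multiply-adds per token.
U = np.random.randn(d, r) / np.sqrt(d)
V = np.random.randn(r, d) / np.sqrt(r)
low_rank_params = U.size + V.size

# Block-diagonal parameterization: b independent (d/b x d/b) blocks,
# giving d*d/b parameters; each block mixes only its own feature group.
blocks = [np.random.randn(d // b, d // b) / np.sqrt(d // b) for _ in range(b)]
block_diag_params = sum(B.size for B in blocks)

def low_rank_apply(x):
    # Project down to rank r, then back up: (x @ U) @ V.
    return (x @ U) @ V

def block_diag_apply(x):
    # Split features into b groups and apply one block per group.
    parts = np.split(x, b, axis=-1)
    return np.concatenate([p @ B for p, B in zip(parts, blocks)], axis=-1)

x = np.random.randn(2, d)
print(dense_params, low_rank_params, block_diag_params)
```

Both structured maps keep the input/output shape of the dense layer while using a fraction of its parameters, which is what enables the compute savings discussed in the abstract.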