State-of-the-art LLMs often rely on scale, incurring high computational costs, which has sparked a research agenda to reduce parameter counts and costs without significantly impacting performance. Our study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. In contrast to previous works, (i) we explore low-rank parametrization at scale, up to 1.3B parameters; (ii) within Transformer language models rather than convolutional architectures; and (iii) when training from scratch. Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., a 2.6$\times$ FFN speed-up with 32\% of the parameters) and effective during training. Interestingly, these structured FFNs exhibit steeper scaling curves than the original models. Motivated by this finding, we develop wide, structured networks that surpass current medium- and large-size Transformers in both perplexity and throughput. Our code is available at https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.
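To make the idea concrete, here is a minimal sketch of a low-rank FFN in PyTorch. It assumes the simplest form of the parametrization discussed above: each dense projection $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ is replaced by two factors of rank $r \ll \min(d_{\text{in}}, d_{\text{out}})$. The class names, dimensions, and the rank value are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """A dense projection W factored as V @ U, with inner rank r.

    Parameter count drops from d_in * d_out to r * (d_in + d_out).
    """

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.u = nn.Linear(d_in, rank, bias=False)   # first factor
        self.v = nn.Linear(rank, d_out, bias=True)   # second factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.v(self.u(x))


class LowRankFFN(nn.Module):
    """A Transformer FFN block with both projections low-rank factored."""

    def __init__(self, d_model: int, d_ff: int, rank: int):
        super().__init__()
        self.fc1 = LowRankLinear(d_model, d_ff, rank)
        self.fc2 = LowRankLinear(d_ff, d_model, rank)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))


def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())
```

With illustrative dimensions d_model=1024, d_ff=4096, and rank 256, this factorization keeps roughly a third of the dense FFN's parameters, in the same regime as the 32\% figure quoted in the abstract (the actual speed-up additionally depends on hardware and kernel efficiency, not just parameter count).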