Improving the effectiveness and efficiency of large language models (LLMs) simultaneously is a critical yet challenging research goal. In this paper, we find that low-rank pre-training, normally regarded as an efficient method that compromises performance, can be scalably effective when the reduced parameters are precisely targeted. Specifically, applying the low-dimensional module only to the attention layer resolves this issue and enhances both effectiveness and efficiency. We refer to this structure as Low-dimensional Projected Attention (LPA) and provide an explanatory analysis. Through extensive experiments at parameter scales of 130M and 370M, scaling up to 3B, we validate the effectiveness and scalability of LPA. Our results show that the LPA model can save up to 12.4% in training time while achieving an approximately 5% improvement in test perplexity (ppl) and on downstream tasks compared with the vanilla Transformer.
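As a rough illustration of the idea above, the following sketch (PyTorch; all module names, rank choices, and hyperparameters are hypothetical and not taken from the paper) replaces only the attention projections with low-rank factorizations, while the rest of the Transformer block (e.g., the feed-forward network) stays dense. It is a minimal sketch of the general structure, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Replace a dense weight W (d_out x d_in) with two factors B @ A,
    where A: (r x d_in), B: (d_out x r) and r << min(d_in, d_out)."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)
        self.B = nn.Linear(rank, d_out, bias=False)

    def forward(self, x):
        return self.B(self.A(x))


class LowDimProjectedAttention(nn.Module):
    """Multi-head self-attention whose Q/K/V/output projections are low-rank;
    only the attention layer is factorized, the FFN would remain dense."""
    def __init__(self, d_model, n_heads, rank):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = LowRankLinear(d_model, d_model, rank)
        self.k_proj = LowRankLinear(d_model, d_model, rank)
        self.v_proj = LowRankLinear(d_model, d_model, rank)
        self.o_proj = LowRankLinear(d_model, d_model, rank)

    def forward(self, x):
        b, t, d = x.shape
        # Project and split into heads: (b, n_heads, t, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.o_proj(out)


# Usage example with illustrative sizes (not the paper's configurations):
layer = LowDimProjectedAttention(d_model=768, n_heads=12, rank=128)
y = layer(torch.randn(2, 64, 768))  # (batch, seq_len, d_model)
```

Because each d_model x d_model projection is factored into d_model x r and r x d_model matrices, the attention layer's parameter count drops from 4·d_model² to 8·r·d_model, which is where the time and parameter savings come from when r is small.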