SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models

From arXiv. Accepted to the Uncertainty in Artificial Intelligence (UAI) 2023 Conference; 13 pages, 4 figures (main paper) + 5 pages (supplementary material).

The pre-training and fine-tuning paradigm has contributed to a number of breakthroughs in Natural Language Processing (NLP). Instead of being trained directly on a downstream task, language models are first pre-trained on large datasets with cross-domain knowledge (e.g., Pile, MassiveText) and then fine-tuned on task-specific data (e.g., natural language generation, text summarization). Scaling the model and dataset size has helped improve the performance of large language models (LLMs), but it has also led to prohibitive computational costs. Pre-training an LLM often requires orders of magnitude more FLOPs than fine-tuning it, yet the model capacity usually remains the same between the two phases. To improve training efficiency in terms of training FLOPs, we propose to decouple the model capacity between the two phases and introduce Sparse Pre-training and Dense Fine-tuning (SPDF). In this work, we show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training (Sparse Pre-training) and then recovering the representational capacity by allowing the zeroed weights to learn (Dense Fine-tuning). We demonstrate that we can induce up to 75% sparsity into a 1.3B-parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs, without a significant loss in accuracy on the downstream tasks relative to the dense baseline. By rigorously evaluating multiple downstream tasks, we also establish a relationship between sparsity, task complexity, and dataset size. Our work presents a promising direction for training large GPT models at a fraction of the training FLOPs using weight sparsity, while retaining the benefits of pre-trained textual representations for downstream tasks.
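The two phases described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random static mask, the toy random gradients, and the helper names `make_mask`, `sgd_step`, and `flop_reduction` are all assumptions made for illustration. The sketch keeps masked weights at zero during "pre-training", then drops the mask so that every weight can learn during "fine-tuning". The `flop_reduction` helper is a simple Amdahl-style estimate, assuming that only a fraction of the total FLOPs sits in layers that can be sparsified; this is one possible reason why 75% sparsity would give less than a 4x reduction, and the paper may account for FLOPs differently.

```python
import numpy as np

def make_mask(shape, sparsity, rng):
    """Random unstructured mask (assumed): 1 = trainable weight, 0 = held at zero."""
    n = int(np.prod(shape))
    n_zero = int(round(sparsity * n))
    flat = np.ones(n, dtype=np.float32)
    flat[rng.choice(n, size=n_zero, replace=False)] = 0.0
    return flat.reshape(shape)

def sgd_step(w, grad, lr, mask=None):
    """One SGD update. With a mask, masked-out weights receive no update."""
    if mask is not None:
        grad = grad * mask
    return w - lr * grad

def flop_reduction(sparsity, sparsifiable_fraction):
    """Amdahl-style estimate: only `sparsifiable_fraction` of FLOPs shrinks."""
    return 1.0 / (1.0 - sparsity * sparsifiable_fraction)

rng = np.random.default_rng(0)
shape = (64, 64)
mask = make_mask(shape, sparsity=0.75, rng=rng)

# Sparse initialization: masked weights start at exactly zero.
w = rng.normal(size=shape).astype(np.float32) * mask

# Phase 1, sparse pre-training: toy random gradients stand in for real ones.
for _ in range(10):
    grad = rng.normal(size=shape).astype(np.float32)
    w = sgd_step(w, grad, lr=0.01, mask=mask)
pretrain_sparsity = float(np.mean(w == 0.0))

# Phase 2, dense fine-tuning: drop the mask so zeroed weights can learn.
for _ in range(10):
    grad = rng.normal(size=shape).astype(np.float32)
    w = sgd_step(w, grad, lr=0.01, mask=None)
finetune_sparsity = float(np.mean(w == 0.0))

print("sparsity after pre-training:", pretrain_sparsity)
print("sparsity after fine-tuning:", finetune_sparsity)
# 0.8 is an illustrative assumption, not a figure taken from the paper.
print("estimated FLOP reduction:", flop_reduction(0.75, 0.8))
```

Under this estimate, a sparsifiable fraction of about 0.8 at 75% sparsity gives a reduction of roughly 2.5x, which matches the order of the reported figure. The 0.8 value is chosen only to show how the arithmetic works.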
