Large Language Models (LLMs) are difficult to fully fine-tune (e.g., with instructions or human feedback) due to their sheer number of parameters. A family of parameter-efficient sparse fine-tuning (SFT) methods has proven promising in terms of performance, but their memory requirements grow in proportion to the size of the LLM. In this work, we scale sparse fine-tuning to state-of-the-art LLMs like LLaMA 2 7B and 13B. At any given time, for a desired density level, we maintain an array of parameter indices and the deltas of these parameters relative to their pretrained values. We iterate among: (a) updating the active deltas, (b) pruning indices (based on the change in magnitude of their deltas), and (c) regrowing indices. For regrowth, we explore two criteria based on either the accumulated gradients of a few candidate parameters or their approximate momenta estimated using the efficient SM3 optimizer. We experiment with instruction-tuning of LLMs on standard dataset mixtures, finding that SFT is often superior to popular parameter-efficient fine-tuning methods like LoRA (low-rank adaptation) in terms of performance and comparable in terms of run time. We additionally show that SFT is compatible with both quantization and efficient optimizers, which facilitates scaling to ever-larger model sizes. We release the code for SFT at https://github.com/AlanAnsell/peft and for the instruction-tuning experiments at https://github.com/ducdauge/sft-llm.
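To make the update, prune, and regrow cycle concrete, the following is a minimal toy sketch in plain Python, not the released implementation. Everything in it is an assumption made for illustration: the function name `sparse_finetune`, its parameters (`init_indices`, `reselect_every`, `n_swap`), the use of plain SGD, and the quadratic toy objective. It prunes the active indices whose deltas changed least since the last reselection and regrows by accumulated gradient magnitude; the SM3-based criterion is not shown. For simplicity it accumulates gradients for every inactive parameter, whereas the abstract describes doing so only for a few candidates.

```python
def sparse_finetune(pretrained, grad_fn, init_indices, steps=60, lr=0.1,
                    reselect_every=20, n_swap=1):
    """Toy prune-and-regrow sparse fine-tuning; returns {index: delta}.

    Hypothetical helper for illustration only. Assumes n_swap is no larger
    than the number of active or inactive parameters.
    """
    n = len(pretrained)
    deltas = {i: 0.0 for i in init_indices}  # active index -> delta from pretrained value
    snapshot = dict(deltas)                  # deltas as of the last reselection
    accum = [0.0] * n                        # accumulated gradients of inactive parameters

    for step in range(1, steps + 1):
        # Current parameters: pretrained value plus sparse delta where active.
        params = [p + deltas.get(i, 0.0) for i, p in enumerate(pretrained)]
        grads = grad_fn(params)

        for i in range(n):
            if i in deltas:
                deltas[i] -= lr * grads[i]   # (a) update active deltas (plain SGD here)
            else:
                accum[i] += grads[i]         # gradient statistics used for regrowth

        if step % reselect_every == 0 and step < steps:
            inactive = [i for i in range(n) if i not in deltas]
            # (b) prune the active indices whose deltas changed least.
            by_change = sorted(deltas, key=lambda i: abs(deltas[i] - snapshot[i]))
            for i in by_change[:n_swap]:
                del deltas[i]
            # (c) regrow the inactive indices with the largest accumulated gradient.
            by_grad = sorted(inactive, key=lambda i: -abs(accum[i]))
            for i in by_grad[:n_swap]:
                deltas[i] = 0.0              # regrown deltas start at zero
            snapshot = dict(deltas)
            accum = [0.0] * n
    return deltas


# Toy objective 0.5 * sum((p - target)^2): only parameters 6 and 7 need to move.
target = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0]
grad_fn = lambda params: [p - t for p, t in zip(params, target)]
result = sparse_finetune([0.0] * 8, grad_fn, init_indices=[0, 1])
print(sorted(result))  # prints [6, 7]
```

In this toy run the density stays fixed at two of eight parameters, and the initially chosen indices 0 and 1, which never need to change, are swapped out for indices 6 and 7 over two reselection rounds. Memory for the trainable state scales with the number of active indices rather than with the full parameter count, which is the property the method relies on.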