AdaGradSelect: An adaptive gradient-guided layer selection method for efficient fine-tuning of SLMs

Large Language Models (LLMs) perform well on many NLP tasks, but full fine-tuning is expensive and memory-intensive. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA reduce this cost by adding small low-rank updates to frozen model weights. However, these methods restrict training to a limited subspace, which can reduce performance. For Small Language Models (SLMs), where efficiency gains matter even more, we introduce AdaGradSelect, an adaptive method that uses gradient information to select which transformer blocks to update. Early observations showed that updating only the transformer blocks with the highest gradient norms can achieve performance close to full fine-tuning. Building on this insight, AdaGradSelect chooses the blocks to train by combining Dirichlet-based sampling, which depends on how frequently each block was updated in the past, with an epsilon-greedy exploration strategy. This lets the method explore different blocks early in training and gradually focus on the most important ones in later epochs. Experiments show that AdaGradSelect trains about 12 percent faster and uses 35 percent less GPU memory while delivering performance very close to full fine-tuning. On the GSM8K dataset, it outperforms LoRA (rank 256) by about 3 percent on average across models such as Qwen2.5-0.5B, LLaMA3.2-1B, and Phi4-mini-3.8B. It also achieves similar accuracy on the MATH dataset. Overall, AdaGradSelect provides a more effective and resource-efficient alternative to traditional fine-tuning methods.
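To make the selection mechanism concrete, the following is a minimal sketch of one plausible reading of the abstract, not the authors' implementation. The abstract does not specify how the Dirichlet sampling and the epsilon-greedy step interact, so everything below is an assumption: on an exploration step (probability epsilon) the blocks with the highest gradient norms are chosen, and otherwise blocks are chosen from a Dirichlet draw whose concentration parameters are one plus each block's past update count. The names `select_blocks`, `sample_dirichlet`, and `top_k`, the value of `k`, and the decay factor are all hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def top_k(scores, k):
    """Return the indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def select_blocks(grad_norms, update_counts, k, epsilon, rng):
    """Choose k block indices to train on this step (assumed scheme).

    Explore (probability epsilon): take the k blocks with the largest
    gradient norms.
    Exploit (otherwise): draw block probabilities from a Dirichlet whose
    concentration is 1 + past update count, then take the k most probable.
    update_counts is modified in place to record the chosen blocks.
    """
    if rng.random() < epsilon:
        chosen = top_k(grad_norms, k)
    else:
        probs = sample_dirichlet([1.0 + c for c in update_counts], rng)
        chosen = top_k(probs, k)
    for i in chosen:
        update_counts[i] += 1
    return sorted(chosen)


if __name__ == "__main__":
    rng = random.Random(0)
    num_blocks, k = 8, 2          # hypothetical sizes
    counts = [0] * num_blocks
    epsilon = 1.0                 # start fully exploratory
    for step in range(20):
        # Synthetic stand-in for per-block gradient norms from a backward pass.
        grad_norms = [rng.random() for _ in range(num_blocks)]
        chosen = select_blocks(grad_norms, counts, k, epsilon, rng)
        epsilon *= 0.9            # hypothetical decay schedule
    print("update counts per block:", counts)
```

In a real training loop, the blocks returned by `select_blocks` would be the only ones whose parameters receive optimizer updates on that step, with the rest kept frozen. Decaying epsilon shifts the choice from gradient-driven exploration toward the frequency-based Dirichlet draw, which matches the abstract's description of exploring early and focusing later.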
