The ever-growing ecosystem of LLMs poses a challenge in selecting the most appropriate pre-trained model to fine-tune amid a sea of options. Given constrained resources, fine-tuning all models and then making a selection is unrealistic. In this work, we formulate this resource-constrained selection task as predicting fine-tuning performance and illustrate its natural connection with the Scaling Law. Unlike pre-training, we find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase". We also explain, both theoretically and empirically, why the existing Scaling Law fails to capture this phase-transition phenomenon. To address this, we introduce the concept of "pre-learned data size" into our Rectified Scaling Law, which overcomes the theoretical limitations and fits experimental results much better. By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption, while other methods may provide negatively correlated selections.
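To make the two-phase behavior concrete, here is a minimal numerical sketch. It assumes a rectified form in which the pre-learned data size shifts the fine-tuning data size inside the power term, i.e. loss ≈ B + A / (D_l + D)^α; the exact functional form and the parameter values (A, B, α, D_l) below are illustrative assumptions, not the paper's fitted constants.

```python
import numpy as np

def rectified_scaling_law(D, A, B, alpha, D_l):
    """Predicted fine-tuning loss after D fine-tuning examples.

    D_l (the assumed "pre-learned data size") shifts the curve:
    for D << D_l the loss is nearly flat (pre-power phase), while
    for D >> D_l it follows the familiar power law (power phase).
    Setting D_l = 0 recovers the standard form B + A / D**alpha.
    """
    return B + A / (D_l + D) ** alpha

# Hypothetical parameters, chosen only to make the phases visible.
A, B, alpha, D_l = 10.0, 1.0, 0.5, 1000.0

# Pre-power phase: D is small relative to D_l, so loss barely moves.
small = rectified_scaling_law(np.array([1.0, 10.0, 100.0]), A, B, alpha, D_l)

# Power phase: D dominates D_l, so loss decays toward the floor B.
large = rectified_scaling_law(np.array([1e5, 1e6]), A, B, alpha, D_l)
```

A plain power law fit to the `small` regime would extrapolate a continuing decline and miss the plateau, which is the failure mode the rectified form is meant to fix.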