Large Language Models (LLMs) based on the pre-training and fine-tuning paradigm have become pivotal in solving natural language processing tasks, consistently achieving state-of-the-art performance. Nevertheless, a theoretical understanding of how model complexity influences fine-tuning performance remains challenging and has not yet been well explored. In this paper, we focus on autoregressive LLMs and propose to model them with Hidden Markov Models (HMMs). Based on this HMM modeling, we investigate the relationship between model complexity and generalization capability on downstream tasks. Specifically, we consider a popular tuning paradigm for downstream tasks, head tuning, in which all pre-trained parameters are frozen and only individual heads are trained atop the pre-trained LLM. Our theoretical analysis reveals that the risk first increases and then decreases with rising model complexity, exhibiting a "double descent" phenomenon. In this case, the initial "descent" is degenerate, meaning that the "sweet spot" where bias and variance are balanced occurs at a model size of zero. Obtaining the conclusion presented in this study confronts several challenges, primarily concerning how to effectively model autoregressive LLMs and downstream tasks, and how to conduct a comprehensive risk analysis for multivariate regression. Our findings are substantiated by experiments on data generated from HMMs, which provide empirical support for and alignment with our theoretical insights.
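To make the setup concrete, the following is a minimal sketch (not the authors' code) of the kind of experiment described above: sequences are sampled from a synthetic HMM, the frozen "pre-trained" representation is taken to be the HMM's forward-filtered state posterior, and only a linear head is fit on top of it. The parameter values, the feature map `forward_features`, and the downstream label construction are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic HMM: K hidden states, V observation symbols (assumed sizes) ---
K, V, T, n_seq = 5, 20, 30, 500
A = rng.dirichlet(np.ones(K), size=K)   # transition matrix, A[i, j] = p(next=j | current=i)
B = rng.dirichlet(np.ones(V), size=K)   # emission matrix, B[i, v] = p(obs=v | state=i)
pi = rng.dirichlet(np.ones(K))          # initial state distribution

def sample_sequence(T):
    """Sample one hidden-state path and observation sequence of length T from the HMM."""
    states, obs = np.empty(T, dtype=int), np.empty(T, dtype=int)
    s = rng.choice(K, p=pi)
    for t in range(T):
        states[t] = s
        obs[t] = rng.choice(V, p=B[s])
        s = rng.choice(K, p=A[s])
    return states, obs

def forward_features(obs):
    """Frozen 'pre-trained' features: the filtering distribution p(state_T | obs_{1:T})."""
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (A.T @ alpha) * B[:, o]
        alpha /= alpha.sum()
    return alpha

# --- Downstream task (hypothetical): a noisy label determined by the final hidden state ---
w_true = rng.normal(size=K)
X, y = [], []
for _ in range(n_seq):
    states, obs = sample_sequence(T)
    X.append(forward_features(obs))
    y.append(w_true[states[-1]] + 0.1 * rng.normal())
X, y = np.array(X), np.array(y)

# --- Head tuning: the HMM ("pre-trained") parameters stay frozen; only a linear head is fit ---
n_train = 400
Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
head, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)   # least-squares head on frozen features
risk = np.mean((Xte @ head - yte) ** 2)            # test risk of the tuned head
print(f"test risk of the linear head: {risk:.4f}")
```

Sweeping the dimensionality of the frozen representation (here, the number of hidden states K used for the features) and re-estimating the test risk at each size would trace out the kind of risk-versus-complexity curve analyzed in the paper.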