This paper proposes a Linear Programming (LP)-based local search framework for fine-tuning pretrained transformer models with explicit control against overfitting. The approach formulates transformer fine-tuning as a bilevel optimization-based regularization problem, in which model parameters and regularization hyperparameters are jointly updated. Information collected during initial warm-up iterations, including validation gradients and training Hessian information, is used to construct a local descent direction by solving an LP that minimizes a scaled directional derivative while preserving training optimality. This validation-aware descent direction enables focused local updates of both parameters and regularization hyperparameters, reducing overfitting without requiring repeated full retraining cycles. The resulting method, termed Linear Programming-based Fine-Tuning (LiFT) for transformers, differs from conventional fine-tuning by systematically identifying task-specific updates rather than relying on heuristic or grid-based hyperparameter selection. Experiments on GPT-2 Small fine-tuned on WikiText-2 demonstrate that LiFT enables effective adaptation through selective tuning of transformer blocks and regularization parameters, yielding consistent improvements in test perplexity across multiple layer configurations and regularization settings, with particularly pronounced gains in overfitting-prone scenarios. Beyond empirical performance, LiFT establishes a principled connection between transformer fine-tuning, bilevel optimization, local search, and regularization theory.
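To make the direction-finding step concrete, the following is a minimal sketch on a toy ridge-regression problem, not the formulation used in the paper. The abstract does not state the exact LP, so everything here is an assumption made for illustration: the helper name `lp_descent_direction`, the box bound `scale` standing in for the scaling of the directional derivative, and the linearized stationarity constraint `H d_theta + B d_lam = 0` standing in for "preserving training optimality". The LP is solved with `scipy.optimize.linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def lp_descent_direction(g_theta, g_lam, H, B, scale=1.0):
    """Hypothetical helper: find a joint direction (d_theta, d_lam) by an LP.

    minimize    g_theta @ d_theta + g_lam @ d_lam   (validation directional derivative)
    subject to  H @ d_theta + B @ d_lam = 0         (training stationarity, first order)
                -scale <= d_i <= scale              (assumed box normalization)
    """
    n, m = g_theta.size, g_lam.size
    c = np.concatenate([g_theta, g_lam])
    A_eq = np.hstack([H, B])
    b_eq = np.zeros(n)
    bounds = [(-scale, scale)] * (n + m)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n], res.x[n:], res.fun


# Toy stand-in for the transformer: ridge regression with one hyperparameter lam.
# L_train(theta, lam) = 0.5 * ||X_tr theta - y_tr||^2 + 0.5 * lam * ||theta||^2
# L_val(theta)        = 0.5 * ||X_val theta - y_val||^2
rng = np.random.default_rng(0)
X_tr, X_val = rng.normal(size=(20, 3)), rng.normal(size=(10, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=20)
y_val = X_val @ w_true + 0.5 * rng.normal(size=10)

lam = 1.0
H = X_tr.T @ X_tr + lam * np.eye(3)          # training Hessian w.r.t. theta
theta = np.linalg.solve(H, X_tr.T @ y_tr)    # training optimum for the current lam
B = theta.reshape(-1, 1)                     # d(grad_theta L_train) / d(lam)
g_theta = X_val.T @ (X_val @ theta - y_val)  # validation gradient w.r.t. theta
g_lam = np.zeros(1)                          # L_val does not depend on lam directly

d_theta, d_lam, slope = lp_descent_direction(g_theta, g_lam, H, B)
print("d_theta:", d_theta, "d_lam:", d_lam, "directional derivative:", slope)
```

Because the zero direction is always feasible, the optimal directional derivative is never positive. A strictly negative value indicates a joint move in parameters and regularization strength that lowers validation loss to first order while keeping the training gradient at zero to first order. In this sketch the Hessian is formed explicitly, which is only practical for a toy model. The abstract says only that "training Hessian information" is collected during warm-up, so how that information is represented at transformer scale is not specified here.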