Current Parameter-Efficient Fine-Tuning (PEFT) methods typically operate under an implicit assumption: once a target module is selected, every token passing through it contributes equally to the downstream task and requires a parameter update. In this paper, we challenge this convention by revealing a pervasive token-level redundancy in the fine-tuning of large models (LMs). We propose TS-PEFT, a theoretical framework built on proximal optimization that acts as a dynamic probe, identifying token-level redundancy during the fine-tuning process. Extensive experiments demonstrate that indiscriminately updating all tokens is not only computationally superfluous but often introduces optimization noise. Surprisingly, by discarding 30%-70% of token updates, TS-PEFT consistently matches or exceeds the performance of dense baselines such as LoRA and DoRA. Our in-depth analysis shows that the learned token-level sparsity is a superior indicator of module importance compared to traditional weight-based criteria, providing a novel data-driven perspective on the intrinsic adaptation mechanism of LMs.
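To make the core idea concrete, the following is a minimal sketch of how a proximal step could induce token-level sparsity in a low-rank update. It applies the proximal operator of an L1 penalty (soft-thresholding) to hypothetical per-token gates so that tokens whose gate falls below the threshold receive exactly zero update; all names and the gating formulation here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def soft_threshold(x, lam):
    # Proximal operator of the L1 norm: shrinks values toward zero and
    # sets entries with magnitude below lam exactly to zero.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def token_selective_update(h, lora_A, lora_B, token_scale, lam=0.1):
    """Hypothetical token-selective LoRA-style update (illustrative only).

    h:           (seq, d_in)  token activations entering the module
    lora_A:      (d_in, r)    low-rank down-projection
    lora_B:      (r, d_out)   low-rank up-projection
    token_scale: (seq,)       learnable per-token gate
    """
    gate = soft_threshold(token_scale, lam)   # sparse per-token gate
    delta = (h @ lora_A) @ lora_B             # (seq, d_out) low-rank update
    return gate[:, None] * delta              # pruned tokens get a zero update

# Toy example: gates with |scale| < lam are pruned entirely.
rng = np.random.default_rng(0)
h = rng.standard_normal((6, 16))
A, B = rng.standard_normal((16, 4)), rng.standard_normal((4, 16))
scale = np.array([0.5, 0.05, -0.3, 0.02, 0.8, -0.07])
out = token_selective_update(h, A, B, scale)
# Fraction of tokens whose update was discarded (3 of 6 here).
sparsity = float((np.abs(out).sum(-1) == 0).mean())
```

Under this toy gating, the proximal step decides per token whether an update is applied at all, which is the sense in which it can "probe" token-level redundancy during training.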