Recently, the scale of transformers has grown rapidly, which introduces considerable challenges for task adaptation in terms of training overhead and inference efficiency. Existing lines of work, namely Parameter-Efficient Fine-Tuning (PEFT) and model compression, have addressed these challenges separately. However, PEFT cannot guarantee the inference efficiency of the original backbone, especially for large-scale models, while model compression requires significant training costs for structure search and re-training. Consequently, a simple combination of the two cannot achieve both training efficiency and inference efficiency with minimal cost. In this paper, we propose a novel Parallel Yielding Re-Activation (PYRA) method for the challenge of training-inference efficient task adaptation. PYRA first generates parallel yielding adaptive weights to comprehensively perceive the data distribution of downstream tasks. A re-activation strategy then modulates the tokens to be merged, producing calibrated token features. Extensive experiments show that PYRA outperforms all competing methods at both low and high compression rates, demonstrating its effectiveness and superiority in maintaining both training efficiency and inference efficiency for large-scale foundation models. Our code is available at https://github.com/THU-MIG/PYRA.
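The abstract does not specify the method's internals, but the core idea — generating adaptive weights in parallel and re-activating token features before merging — can be illustrated with a minimal sketch. The function below is a hypothetical toy implementation, not the paper's actual algorithm: it assumes a rank-1 adaptive weight matrix yielded from two learned vectors (`w_row`, `w_col`, both invented names here) and a simple pairwise averaging merge as a stand-in for similarity-based token merging.

```python
import numpy as np

def pyra_merge_sketch(tokens, w_row, w_col):
    """Hypothetical sketch of PYRA-style token modulation before merging.

    tokens: (N, d) token features; w_row: (N,) and w_col: (d,) are
    "parallel yielding" vectors that form a rank-1 adaptive weight matrix.
    """
    modulation = np.outer(w_row, w_col)        # (N, d) adaptive weights
    reactivated = tokens * (1.0 + modulation)  # re-activate (calibrate) features
    # Merge adjacent token pairs by averaging, halving sequence length.
    # (A stand-in for similarity-based merging used in token-reduction work.)
    return reactivated.reshape(-1, 2, tokens.shape[1]).mean(axis=1)

# Example: 4 tokens of dimension 3 are merged down to 2 tokens.
tokens = np.arange(12, dtype=float).reshape(4, 3)
merged = pyra_merge_sketch(tokens, np.full(4, 0.1), np.full(3, 0.5))
print(merged.shape)  # (2, 3)
```

The design intuition, as described in the abstract, is that modulating tokens with distribution-aware weights before merging yields calibrated features, so the compressed sequence loses less task-relevant information than naive merging.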