Implicit regularization of multi-task learning and finetuning in overparameterized neural networks

In this work, we investigate the inductive biases that result from learning multiple tasks, either simultaneously (multi-task learning, MTL) or sequentially (pretraining and subsequent finetuning, PT+FT). In the simplified setting of two-layer diagonal linear networks trained with gradient descent, we apply prior theoretical results to describe novel implicit regularization penalties associated with MTL and PT+FT, both of which incentivize feature sharing between tasks and sparsity in learned task-specific features. Notably, these results imply that during finetuning, networks operate in a hybrid of the kernel (or "lazy") regime and the feature learning ("rich") regime identified in prior work. Moreover, we show that PT+FT can exhibit a novel "nested feature selection" behavior not captured by either regime, which biases it to extract a sparse subset of the features learned during pretraining. In ReLU networks, we reproduce all of these qualitative behaviors empirically, in particular verifying that analogues of the sparsity biases predicted by the linear theory hold in the nonlinear case. Our findings hold qualitatively for a deep architecture trained on image classification tasks, and our characterization of the nested feature selection regime motivates a modification to PT+FT that we find empirically improves performance. We also observe that PT+FT (but not MTL) is biased to learn features that are correlated with (but distinct from) those needed for the auxiliary task, while MTL is biased toward using identical features for both tasks, which can lead to a tradeoff in performance as a function of the number of finetuning samples. Our results shed light on the impact of auxiliary task learning and suggest ways to leverage it more effectively.
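The PT+FT setting described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the exact setup analyzed in this work: the parameterization `beta = u * v` (a hidden weight `u` shared across tasks and a task-specific readout `v`), the initialization scale `alpha`, the choice to reinitialize the readout to zero before finetuning, the synthetic sparse tasks, and the helper names `train_diagonal_net` and `run_pt_ft_demo`.

```python
import numpy as np


def train_diagonal_net(X, y, u, v, lr=0.05, steps=5000):
    """Full-batch gradient descent for the diagonal linear net f(x) = x @ (u * v)."""
    u, v = u.copy(), v.copy()
    n = len(y)
    for _ in range(steps):
        residual = X @ (u * v) - y
        # Gradient of 0.5 * mean squared error with respect to beta = u * v.
        grad_beta = X.T @ residual / n
        # Chain rule: d(beta)/du = v and d(beta)/dv = u; both updates use the old values.
        u, v = u - lr * grad_beta * v, v - lr * grad_beta * u
    return u, v


def run_pt_ft_demo(seed=0, d=20, n_aux=200, n_main=10, alpha=0.1):
    """Pretrain on an auxiliary task, then finetune on a main task (hypothetical toy data)."""
    rng = np.random.default_rng(seed)

    # Assumed toy tasks: the auxiliary task uses features 0-4; the main task
    # uses a sparse subset of them (features 0 and 1, one with a flipped sign).
    beta_aux = np.zeros(d)
    beta_aux[:5] = 1.0
    beta_main = np.zeros(d)
    beta_main[0], beta_main[1] = 1.0, -1.0

    X_aux = rng.standard_normal((n_aux, d))
    X_main = rng.standard_normal((n_main, d))
    y_aux, y_main = X_aux @ beta_aux, X_main @ beta_main

    # Pretraining: hidden weights start at alpha, the readout starts at zero.
    u_pt, v_aux = train_diagonal_net(X_aux, y_aux, np.full(d, alpha), np.zeros(d))

    # Finetuning: keep the pretrained hidden weights, reinitialize the readout,
    # then train both layers on the (much smaller) main-task dataset.
    u_ft, v_main = train_diagonal_net(X_main, y_main, u_pt, np.zeros(d))

    beta_ft = u_ft * v_main
    return {
        "beta_aux": beta_aux,
        "beta_main": beta_main,
        "beta_pretrained": u_pt * v_aux,
        "beta_ft": beta_ft,
        "finetune_train_mse": float(np.mean((X_main @ beta_ft - y_main) ** 2)),
    }


if __name__ == "__main__":
    out = run_pt_ft_demo()
    print("finetuned coefficients on pretrained features:", out["beta_ft"][:5])
    print("largest |coefficient| on other features:", np.abs(out["beta_ft"][5:]).max())
```

Comparing the finetuned coefficients on the pretrained features with those on the remaining features gives a quick, informal way to inspect how the solution depends on what was learned during pretraining; varying `alpha` or `n_main` in this toy setup is one way to probe that dependence.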
