We develop a theory of transfer learning in infinitely wide neural networks under gradient flow that quantifies when pretraining on a source task improves generalization on a target task. We analyze two settings: (i) fine-tuning, where the downstream predictor is trained on top of source-induced features, and (ii) a jointly rich setting, where both the pretraining and downstream tasks can operate in a feature learning regime but the downstream model is initialized with the features obtained after pretraining. In this setup, the summary statistics of randomly initialized networks after rich pretraining are adaptive kernels that depend on both the source data and the source labels. For (i), we analyze the performance of a readout across different pretraining data regimes. For (ii), the summary statistics after learning the target task are still adaptive kernels, with features drawn from both the source and target tasks. We test our theory on linear and polynomial regression tasks as well as on real datasets. Our theory yields interpretable conclusions about performance, which depend on the amount of data for each task, the alignment between the tasks, and the feature learning strength.
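As a loose, purely illustrative sketch of the two settings named above (not the paper's infinite-width gradient-flow analysis), the following toy script trains a finite-width two-layer ReLU network with full-batch gradient descent on a source task whose teacher partially aligns with the target teacher, then compares (i) a fresh linear readout on the frozen pretrained features with (ii) full fine-tuning initialized from the pretrained weights. All sizes, learning rates, and the 0.8/0.2 task-alignment mixture are arbitrary assumptions made only for this example.

```python
# Hypothetical toy illustration: finite-width stand-in for the two transfer settings.
import numpy as np

rng = np.random.default_rng(0)
d, width, n_src, n_tgt = 20, 512, 200, 50

# Source and target tasks: linear teachers whose weight vectors partially align.
w_src = rng.normal(size=d) / np.sqrt(d)
w_tgt = 0.8 * w_src + 0.2 * rng.normal(size=d) / np.sqrt(d)

X_src = rng.normal(size=(n_src, d)); y_src = X_src @ w_src
X_tgt = rng.normal(size=(n_tgt, d)); y_tgt = X_tgt @ w_tgt
X_test = rng.normal(size=(1000, d)); y_test = X_test @ w_tgt

def init():
    W = rng.normal(size=(d, width)) / np.sqrt(d)   # first-layer weights
    a = rng.normal(size=width) / np.sqrt(width)    # linear readout weights
    return W, a

def forward(W, a, X):
    h = np.maximum(X @ W, 0.0)                     # hidden-layer features
    return h, h @ a

def train(W, a, X, y, steps=2000, lr=0.05, train_W=True):
    """Full-batch gradient descent on mean-squared error."""
    for _ in range(steps):
        h, pred = forward(W, a, X)
        err = (pred - y) / len(y)
        a -= lr * h.T @ err                        # update the readout
        if train_W:                                # optionally adapt the features
            W -= lr * X.T @ ((err[:, None] * a) * (h > 0))
    return W, a

# Pretrain on the source task (features are learned here).
W0, a0 = init()
W_pre, a_pre = train(W0.copy(), a0.copy(), X_src, y_src)

# (i) Fine-tuning: fresh readout trained on frozen source-pretrained features.
_, a_ft = train(W_pre.copy(), np.zeros(width), X_tgt, y_tgt, train_W=False)
# (ii) Jointly rich: all weights keep training from the pretrained initialization.
W_joint, a_joint = train(W_pre.copy(), a_pre.copy(), X_tgt, y_tgt, train_W=True)

for name, (W, a) in [("readout on frozen features", (W_pre, a_ft)),
                     ("full fine-tuning", (W_joint, a_joint))]:
    _, pred = forward(W, a, X_test)
    print(name, "target test MSE:", np.mean((pred - y_test) ** 2))
```

In this sketch, varying n_src, n_tgt, and the mixing weight between w_src and w_tgt gives a rough empirical analogue of the dependence on data budgets and task alignment discussed in the abstract.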