Prior research has found that differences in the early period of neural network training significantly impact performance on in-distribution (ID) tasks. However, neural networks are often sensitive to out-of-distribution (OOD) data, which makes them less reliable in downstream applications, and the impact of the early training period on OOD generalization remains understudied because of its complexity and the lack of effective analytical methodologies. In this work, we investigate the relationship between learning dynamics and OOD generalization during the early period of neural network training. We use the trace of Fisher Information and sharpness as metrics, and gradual unfreezing (i.e., progressively unfreezing parameters during training) as the methodology for investigation. Through a series of empirical experiments, we show that 1) selecting the number of trainable parameters at different times during training, realized here by gradual unfreezing, has a minuscule impact on ID results but greatly affects generalization to OOD data; 2) the absolute values of sharpness and the trace of Fisher Information in the initial period of training are not indicative of OOD generalization, but the relative values could be; 3) the trace of Fisher Information and sharpness may be used as indicators for the removal of interventions during the early period of training for better OOD generalization.
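As a rough illustration of the two ingredients named above, the following minimal sketch shows a gradual unfreezing schedule and the trace of Fisher Information restricted to the currently trainable parameters. It is not the experimental setup of this work: the function names, the top-down unfreezing order, the fixed `unfreeze_every` interval, and the toy logistic regression model (where each weight plays the role of one parameter group) are all assumptions chosen to keep the example self-contained.

```python
import math


def trainable_groups(step, num_groups, unfreeze_every):
    """Return the indices of parameter groups that are trainable at `step`.

    Assumption: groups are ordered from input (0) to output (num_groups - 1),
    training starts with only the last group unfrozen, and one more group is
    unfrozen (moving toward the input) every `unfreeze_every` steps.
    """
    num_unfrozen = min(num_groups, 1 + step // unfreeze_every)
    return list(range(num_groups - num_unfrozen, num_groups))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fisher_trace(weights, inputs, trainable):
    """Trace of Fisher Information for p(y=1|x) = sigmoid(w . x).

    For this toy model, grad_w log p(y|x) = (y - p) * x, so taking the
    expectation over y ~ p(y|x) gives p * (1 - p) * ||x||^2 per example.
    Only coordinates listed in `trainable` contribute, which mimics
    measuring the metric over the unfrozen parameters.
    """
    total = 0.0
    for x in inputs:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        total += p * (1.0 - p) * sum(x[i] ** 2 for i in trainable)
    return total / len(inputs)


if __name__ == "__main__":
    weights = [0.0, 0.0, 0.0]
    inputs = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
    for step in (0, 10, 20):
        active = trainable_groups(step, num_groups=3, unfreeze_every=10)
        print(step, active, fisher_trace(weights, inputs, active))
```

Logging such a metric over the first training steps, with and without the unfreezing schedule, gives one way to compare relative rather than absolute values across runs.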