Typical neural network trainings show substantial variance in test-set performance between repeated runs, which impedes hyperparameter comparison and training reproducibility. We present the following results towards understanding this variation. (1) We demonstrate that standard CIFAR-10 and ImageNet trainings, despite having significant variance on their test sets, have very little variance in their performance on the test distributions from which those test sets are sampled, suggesting that variance is less of a practical issue than previously thought. (2) We present a simplifying statistical assumption which closely approximates the structure of the test-set accuracy distribution. (3) We argue that test-set variance is inevitable in the following two senses. First, we show that variance is largely caused by high sensitivity of the training process to initial conditions, rather than by specific sources of randomness such as the data order and augmentations. Second, we prove that variance is unavoidable given the observation that ensembles of trained networks are well-calibrated. (4) We conduct preliminary studies of distribution shift, fine-tuning, data augmentation, and learning rate through the lens of variance between runs.
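To make the distinction between test-set variance and test-distribution variance concrete, the following is a minimal sketch under a toy model that is an assumption of this example, not the experimental setup behind the results above. It assumes each test example `i` is classified correctly with a fixed probability `p[i]`, independently across examples and runs, and that the `p[i]` are drawn from a Beta distribution chosen purely for illustration. The function names are hypothetical. Under these assumptions the standard deviation of test-set accuracy is `sqrt(sum(p * (1 - p))) / n`, which is nonzero for a finite test set but shrinks toward zero as the test set grows, even though every simulated run has the same underlying per-example probabilities.

```python
import numpy as np


def simulate_accuracy_std(p, num_runs, rng):
    """Empirical std of test-set accuracy over simulated runs.

    Toy assumption: run r gets example i right with probability p[i],
    independently of all other examples and runs.
    """
    # correct[r, i] is True when simulated run r classifies example i correctly.
    correct = rng.random((num_runs, p.size)) < p
    return correct.mean(axis=1).std(ddof=1)


def predicted_accuracy_std(p):
    """Std of test-set accuracy implied by independent Bernoulli examples."""
    return np.sqrt(np.sum(p * (1.0 - p))) / p.size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (200, 2_000, 20_000):
        # Hypothetical per-example correctness probabilities (mean about 0.91).
        p = rng.beta(2.0, 0.2, size=n)
        simulated = simulate_accuracy_std(p, num_runs=200, rng=rng)
        predicted = predicted_accuracy_std(p)
        print(f"n={n:>6}  simulated std={simulated:.5f}  predicted std={predicted:.5f}")
```

In this toy model the spread between runs comes entirely from finite-sample noise on the test set, so it scales as one over the square root of the test-set size. Whether real trainings behave this way is an empirical question that the sketch does not settle.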