In theory, the choice of ReLU'(0) in [0, 1] for a neural network has a negligible influence on both backpropagation and training. Yet in practice, the default 32-bit precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU'(0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN, ImageNet). We observe considerable variations of backpropagation outputs, which occur around half of the time in 32-bit precision. The effect disappears with double precision, while it is systematic at 16 bits. For vanilla SGD training, the choice ReLU'(0) = 0 seems to be the most efficient. In our experiments on ImageNet, the gain in test accuracy over ReLU'(0) = 1 was more than 10 points (two runs). We also show evidence that reconditioning approaches such as batch-norm or ADAM tend to buffer the influence of the value of ReLU'(0). Overall, the message we convey is that algorithmic differentiation of nonsmooth problems potentially hides parameters that could be tuned advantageously.
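To make the role of ReLU'(0) concrete, here is a minimal sketch in plain Python, not the code used in the experiments. It backpropagates through a single ReLU unit, with the derivative at zero exposed as a parameter `s`. The names `relu_grad`, `backprop`, and `s` are hypothetical and chosen only for illustration.

```python
def relu(x):
    """ReLU activation: max(x, 0)."""
    return x if x > 0.0 else 0.0


def relu_grad(x, s):
    """Derivative of ReLU, with the value at x == 0 set to s.

    Assumption: s is a user-chosen value in [0, 1]; away from zero
    the derivative does not depend on s.
    """
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return s


def backprop(w, x, s):
    """Gradient of f(w) = relu(w * x) with respect to w, by the chain rule."""
    z = w * x  # pre-activation
    return relu_grad(z, s) * x


# The pre-activation is exactly zero here, so the choice of s changes the gradient.
grad_s0 = backprop(0.0, 2.0, s=0.0)
grad_s1 = backprop(0.0, 2.0, s=1.0)

# The pre-activation is nonzero here, so both choices agree.
grad_away_s0 = backprop(1.0, 2.0, s=0.0)
grad_away_s1 = backprop(1.0, 2.0, s=1.0)
```

The two choices of `s` differ only when the pre-activation is exactly zero. This matches the observation above that the effect is systematic at 16 bits and disappears in double precision, presumably because exact zeros are more frequent at lower precision.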