In this paper, we investigate the negative effect of activation functions on forward and backward propagation and how to counteract it. First, we examine how activation functions affect the forward and backward propagation of neural networks and derive a general form for gradient variance that extends previous work in this area. We attempt to use mini-batch statistics to dynamically update the normalization factor, so that the normalization property is maintained throughout training rather than only in the state of the network immediately after weight initialization. Second, we propose ANAct, a method that normalizes activation functions to maintain consistent gradient variance across layers, and demonstrate its effectiveness through experiments. We observe that the convergence rate is roughly related to the normalization property. We compare ANAct with several common activation functions on CNNs and residual networks and show that ANAct consistently improves their performance. For instance, normalized Swish achieves 1.4\% higher top-1 accuracy than vanilla Swish on ResNet50 with the Tiny ImageNet dataset and more than 1.2\% higher with CIFAR-100.
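As a rough illustration of normalizing an activation function with a factor re-estimated from each mini-batch, the following is a minimal NumPy sketch. It is not the ANAct formulation itself: the abstract does not state the normalization condition, so the choice of statistic here (scaling so that the batch mean of the squared derivative equals 1) is an assumption, and the names `normalized_activation`, `lam`, and `eps` are hypothetical.

```python
import numpy as np


def swish(z):
    # Swish activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))


def swish_grad(z):
    # Derivative of Swish: sigmoid(z) + z * sigmoid(z) * (1 - sigmoid(z))
    s = 1.0 / (1.0 + np.exp(-z))
    return s + z * s * (1.0 - s)


def normalized_activation(z, f, f_grad, eps=1e-8):
    """Scale f by a factor estimated from the current mini-batch z.

    Assumption (not taken from the paper): the factor is chosen so that
    the batch mean of the squared derivative of the scaled activation is 1.
    """
    lam = 1.0 / np.sqrt(np.mean(f_grad(z) ** 2) + eps)
    return lam * f(z), lam


# Hypothetical mini-batch of pre-activations: 256 samples, 128 units.
rng = np.random.default_rng(0)
z = rng.standard_normal((256, 128))

out, lam = normalized_activation(z, swish, swish_grad)

# Second moment of the derivative of the scaled activation on this batch.
scaled_grad_second_moment = np.mean((lam * swish_grad(z)) ** 2)
```

Because `lam` is recomputed from every mini-batch, the scaling follows the pre-activation distribution as it drifts during training, instead of being fixed once from the distribution assumed at weight initialization.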