Extensive experiments have validated that Batch Normalization (BN) layers benefit convergence and generalization. However, BN requires extra memory and floating-point computation. Moreover, BN becomes inaccurate on micro-batches because it depends on batch statistics. In this paper, we address these problems by simplifying BN regularization while keeping two fundamental effects of BN layers, i.e., data decorrelation and an adaptive learning rate. We propose a novel normalization method, named MimicNorm, to improve convergence and efficiency in network training. MimicNorm consists of only two lightweight operations: a modified weight mean operation (subtracting the mean value from each weight parameter tensor) and a single BN layer before the loss function (the last BN layer). We leverage neural tangent kernel (NTK) theory to prove that our weight mean operation whitens activations and moves the network into the chaotic regime, as a BN layer does, and consequently leads to enhanced convergence. The last BN layer provides auto-tuned learning rates and also improves accuracy. Experimental results show that MimicNorm achieves similar accuracy for various network structures, including ResNets and lightweight networks such as ShuffleNet, while reducing memory consumption by about 20%. The code is publicly available at https://github.com/Kid-key/MimicNorm.
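To make the two operations concrete, the following is a minimal pure-Python sketch rather than the authors' implementation. It assumes the weight mean is taken per output unit over its incoming weights, and it uses a BN layer without learnable scale or shift; the helper names `center_weights`, `linear`, and `batch_norm` are hypothetical. It illustrates one consequence of centering: when each row of weights sums to zero, a constant offset added to every input component has no effect on the output.

```python
import random
from statistics import fmean, pstdev


def center_weights(weight):
    """Subtract each row's mean so every output unit's weights sum to zero.

    Assumption: the mean is taken per output unit over its fan-in.
    """
    centered = []
    for row in weight:
        mean = fmean(row)
        centered.append([w - mean for w in row])
    return centered


def linear(weight, x):
    """Plain matrix-vector product without a bias term."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch dimension (no scale or shift)."""
    columns = list(zip(*batch))
    means = [fmean(c) for c in columns]
    stds = [(pstdev(c) ** 2 + eps) ** 0.5 for c in columns]
    return [[(v - m) / s for v, m, s in zip(sample, means, stds)]
            for sample in batch]


random.seed(0)

# Weight mean operation on a 4 x 8 weight matrix.
weight = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
centered = center_weights(weight)

# A constant offset on the input leaves the centered layer's output unchanged.
x = [random.uniform(0, 1) for _ in range(8)]
shifted = [xi + 5.0 for xi in x]
out_plain = linear(centered, x)
out_shifted = linear(centered, shifted)

# Last BN layer: normalize a batch of 16 four-dimensional logit vectors.
batch = [[random.gauss(3, 2) for _ in range(4)] for _ in range(16)]
logits = batch_norm(batch)
```

Here `out_plain` and `out_shifted` agree up to floating-point error, and each column of `logits` has approximately zero mean and unit variance over the batch.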