Binarization of neural networks is a dominant paradigm in neural networks compression. The pioneering work BinaryConnect uses Straight Through Estimator (STE) to mimic the gradients of the sign function, but it also causes the crucial inconsistency problem. Most of the previous methods design different estimators instead of STE to mitigate it. However, they ignore the fact that when reducing the estimating error, the gradient stability will decrease concomitantly. These highly divergent gradients will harm the model training and increase the risk of gradient vanishing and gradient exploding. To fully take the gradient stability into consideration, we present a new perspective to the BNNs training, regarding it as the equilibrium between the estimating error and the gradient stability. In this view, we firstly design two indicators to quantitatively demonstrate the equilibrium phenomenon. In addition, in order to balance the estimating error and the gradient stability well, we revise the original straight through estimator and propose a power function based estimator, Rectified Straight Through Estimator (ReSTE for short). Comparing to other estimators, ReSTE is rational and capable of flexibly balancing the estimating error with the gradient stability. Extensive experiments on CIFAR-10 and ImageNet datasets show that ReSTE has excellent performance and surpasses the state-of-the-art methods without any auxiliary modules or losses.
翻译:神经网络二值化是神经网络压缩中的主导范式。开创性工作BinaryConnect使用直通估计器来模拟符号函数的梯度,但这导致了关键的不一致性问题。以往大多数方法设计不同的估计器替代直通估计器以缓解该问题。然而,这些方法忽略了在降低估计误差时,梯度稳定性会随之下降的事实。这种高度发散的梯度会损害模型训练,并增加梯度消失与梯度爆炸的风险。为充分考虑梯度稳定性,我们提出训练二值化神经网络的新视角:将其视为估计误差与梯度稳定性之间的均衡。基于此视角,我们首先设计两个指标以定量展示该均衡现象。此外,为有效平衡估计误差与梯度稳定性,我们修正了原始直通估计器,并提出基于幂函数的估计器——整流直通估计器。与其他估计器相比,ReSTE具有合理性,且能灵活地平衡估计误差与梯度稳定性。在CIFAR-10和ImageNet数据集上的大量实验表明,ReSTE在无需任何辅助模块或损失函数的情况下即展现出优异的性能,并超越了当前最先进的方法。