Binary Neural Networks (BNNs) can significantly reduce the inference time of a neural network by replacing its expensive floating-point arithmetic with bitwise operations. Most existing solutions, however, do not fully optimize data flow through the BNN layers, and intermediate conversions from 1 bit to 16 or 32 bits often further hinder efficiency. We propose a novel training scheme that increases data flow and parallelism in the BNN pipeline; specifically, we introduce a clipping block that decreases the data width from 32 bits to 8 bits. Furthermore, we reduce, without losing accuracy, the size of the internal accumulator of a binary layer, which is usually kept at 32 bits to prevent data overflow. Additionally, we provide an optimization of the Batch Normalization layer that both reduces latency and simplifies deployment. Finally, we present an optimized implementation of the Binary Direct Convolution for ARM instruction sets. Our experiments show a consistent improvement in inference speed (up to 1.91x and 2.73x compared with two state-of-the-art BNN frameworks) with no drop in accuracy for at least one full-precision model.
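To make the two central ideas concrete, the following is a minimal sketch, not the implementation described above. It shows how a dot product between two vectors of +1/-1 values can be computed with XNOR and a population count instead of floating-point multiply-accumulate, and how an accumulator value can be saturated to a signed 8-bit range. The function names (`pack_bits`, `binary_dot`, `clip_to_int8`) and the choice of the range [-128, 127] are assumptions made for illustration; the abstract does not specify the clipping bounds or where the clipping block sits in the network.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into one integer; bit i is set when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two n-element +1/-1 vectors given as packed words.

    XNOR marks the positions where the two vectors agree. Each agreement
    contributes +1 and each disagreement contributes -1, so the result is
    matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_word ^ b_word) & mask
    matches = bin(xnor).count("1")  # population count
    return 2 * matches - n


def clip_to_int8(x):
    """Saturate an accumulator value to the signed 8-bit range (assumed bounds)."""
    return max(-128, min(127, x))


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
acc = binary_dot(pack_bits(a), pack_bits(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
print(acc, reference, clip_to_int8(300))
```

The packed result agrees with the element-wise reference product, which is the property that lets a binary layer replace arithmetic with bitwise instructions. Because clipping to 8 bits discards large accumulator values, a network has to be trained with the clipping in place for accuracy to be preserved, which is consistent with the abstract presenting it as part of a training scheme.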