We address the challenges of deploying neural networks on CPUs, focusing on minimizing inference time while maintaining accuracy. Our novel approach uses the dataflow (i.e., computation order) of a neural network to explore data reuse opportunities through heuristic-guided analysis and a code generation framework, which together enable exploration of various Single Instruction, Multiple Data (SIMD) implementations for optimized neural network execution. Our results demonstrate that the dataflow that keeps outputs in SIMD registers while maximizing both input and weight reuse consistently yields the best performance across a wide variety of inference workloads, achieving up to 3x speedup for 8-bit neural networks and up to 4.8x speedup for binary neural networks over today's optimized neural network implementations.
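To make the notion of dataflow as computation order concrete, the following is a minimal sketch, not the implementation evaluated here. It computes the same 1D convolution under two loop orders. The function names, the `lanes` parameter, and the `stores` counter are hypothetical and introduced only for illustration; a plain Python list stands in for a SIMD accumulator register, and the store count is only a rough proxy for output memory traffic, not a performance model.

```python
def conv1d_output_stationary(x, w, lanes=4):
    """Compute y[i] = sum_k w[k] * x[i + k], keeping outputs in accumulators.

    Each block of `lanes` outputs stays in `acc` (a stand-in for one SIMD
    register) until it is complete, then is written back exactly once.
    """
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    stores = 0  # number of output elements written to y
    for base in range(0, n_out, lanes):
        width = min(lanes, n_out - base)
        acc = [0] * width
        for k, wk in enumerate(w):
            # One weight is reused across every lane of the block.
            for lane in range(width):
                acc[lane] += wk * x[base + lane + k]
        y[base:base + width] = acc  # single write-back per block
        stores += width
    return y, stores


def conv1d_weight_stationary(x, w):
    """Same result, but each weight is held fixed while all outputs are updated."""
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    stores = 0
    for k, wk in enumerate(w):
        for i in range(n_out):
            y[i] += wk * x[i + k]  # partial sum written back on every step
            stores += 1
    return y, stores


x = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 0, -1]
y_os, stores_os = conv1d_output_stationary(x, w)
y_ws, stores_ws = conv1d_weight_stationary(x, w)
assert y_os == y_ws  # both orders compute the same convolution
print(stores_os, stores_ws)
```

Both orders produce identical outputs. In this toy setting, the output-stationary order writes each output once, while the weight-stationary order writes every output once per weight, which gives an intuition for why holding outputs in registers can reduce memory traffic.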