On the Efficiency of Convolutional Neural Networks

Since the breakthrough performance of AlexNet in 2012, convolutional neural networks (convnets) have grown into extremely powerful vision models. Deep learning researchers have used convnets to perform vision tasks with accuracy that was unachievable a decade ago. Confronted with the immense computation that convnets use, deep learning researchers also became interested in efficiency. However, the engineers who deployed efficient convnets soon realized that they were slower than the previous generation, despite using fewer operations. Many reverted to older models that ran faster. Hence researchers switched the objective of their search from arithmetic complexity to latency and produced a new wave of models that ran faster. Paradoxically, these models also used more operations. Skepticism grew among researchers and engineers alike about the relevance of arithmetic complexity. Contrary to the prevailing view that latency and arithmetic complexity are irreconcilable, a simple formula relates the two through computational efficiency. This insight enabled us to co-optimize the separate factors that determine latency. We observed that the degenerate conv2d layers that produce the best accuracy-complexity trade-off also use significant memory resources and have low computational efficiency. We devised block-fusion algorithms to implement all the layers of a residual block in a single kernel, thereby creating temporal locality, avoiding communication, and reducing workspace size. Our ConvFirst model with block-fusion kernels has less arithmetic complexity and greater computational efficiency than baseline models and kernels, and ran approximately four times as fast as ConvNeXt. We also created novel tools, including efficiency gap plots and waterline analysis. Our unified approach to convnet efficiency envisions a new era of models and kernels that achieve greater accuracy at lower cost.
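
The relation between latency and arithmetic complexity can be illustrated with a minimal sketch. It assumes that computational efficiency means the fraction of the hardware's peak arithmetic throughput that a kernel actually sustains, so that latency equals operations divided by the product of peak throughput and efficiency. The function name, the peak throughput, and the operation counts below are hypothetical values chosen for illustration; they are not measurements from this work.

```python
def latency_seconds(operations: float, peak_ops_per_second: float, efficiency: float) -> float:
    """Estimate latency from arithmetic complexity and computational efficiency.

    Assumed relation (an illustration, not a quoted formula):
        latency = operations / (peak_ops_per_second * efficiency)
    where efficiency is the fraction of peak throughput actually achieved.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the interval (0, 1]")
    if operations < 0 or peak_ops_per_second <= 0:
        raise ValueError("operations must be >= 0 and peak throughput must be > 0")
    return operations / (peak_ops_per_second * efficiency)


# Hypothetical device peak: 10 trillion operations per second.
PEAK_OPS_PER_SECOND = 10e12

# Hypothetical model A: more operations, but kernels that run at 50% of peak.
latency_a = latency_seconds(4e9, PEAK_OPS_PER_SECOND, 0.50)

# Hypothetical model B: half the operations, but kernels that run at 10% of peak.
latency_b = latency_seconds(2e9, PEAK_OPS_PER_SECOND, 0.10)

print(f"model A: {latency_a * 1e3:.2f} ms")
print(f"model B: {latency_b * 1e3:.2f} ms")
print(f"B is slower than A: {latency_b > latency_a}")
```

Under these assumed numbers, the model with fewer operations is the slower one because its efficiency is lower by a larger factor than its operation count. This is the kind of outcome described above, where reducing arithmetic complexity alone did not reduce latency.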
