This work evaluates state-of-the-art convolution algorithms for CPU-based deep learning inference. While most prior studies focus on GPUs or NPUs, CPU implementations remain comparatively underoptimized. We benchmark direct, GEMM-based, and Winograd convolutions across modern CPUs from ARM, Intel, AMD, Apple, and Nvidia, considering both latency and energy efficiency. Our results highlight the key architectural factors that govern CPU efficiency for convolution operations, providing practical guidance for energy-aware embedded deployment. As the main result of this work, the Nvidia AGX Orin combined with the GEMM algorithm achieves the best trade-off between inference latency and energy consumption.