This work evaluates state-of-the-art convolution algorithms for CPU-based CNN inference. Although most prior studies focus on GPUs or NPUs, CPU implementations remain comparatively under-optimized. Our first contribution is a fair benchmarking framework for embedded CPU inference. We evaluate direct, GEMM-based, and Winograd convolutions across modern CPUs from ARM, Intel, AMD, and NVIDIA, considering both latency and energy efficiency. To the best of our knowledge, this is the first study to present a fair, cross-vendor comparison of CPU energy consumption using a high-resolution socket-level measurement platform. To validate our methodology, we further compare socket-level power measurements with estimates derived from model-specific registers (MSRs), finding that MSRs underestimate the power consumption of convolution inference by 10--30%. Our results show that the ARM® Cortex-A78AE CPU combined with an implicit GEMM convolution implementation offers the best trade-off between latency and power consumption, achieving ResNet50v1.5 inference in 102 ms at an average power of 25.3 W, corresponding to an energy of 2.58 J.
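The reported energy figure follows directly from the measured latency and average power, i.e. energy as the product of power and time:

```latex
E = \bar{P} \cdot t = 25.3\,\text{W} \times 0.102\,\text{s} \approx 2.58\,\text{J}
```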