The deployment of deep learning inference in production environments, where throughput, latency, and hardware efficiency are critical, continues to grow. Although specialized accelerators are increasingly adopted, many inference workloads still run on CPU-only systems, particularly in legacy data centers and cost-sensitive environments. This study investigates the scalability limits of CPU-based inference for convolutional neural networks by benchmarking ResNet models across varying batch sizes on two hardware tiers: a legacy Intel Xeon E5-2403 v2 processor and a modern Intel Xeon 6 "Granite Rapids" platform. Results show that legacy CPUs quickly reach throughput saturation, with limited scaling beyond small batch sizes due to instruction-level and memory constraints. In contrast, the Granite Rapids system leverages Intel Advanced Matrix Extensions (AMX) to achieve substantially higher throughput. However, oversubscription beyond physical core limits introduces execution contention and tail-latency amplification, revealing a performance degradation regime in modern architectures. We introduce GDEV-AI, a reproducible benchmarking framework for analyzing scalability behavior and architectural saturation in CPU-based inference. By establishing a vendor-neutral baseline, this work provides empirical insight into performance bottlenecks and informs capacity planning in heterogeneous data center environments.
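The batch-size sweep described in the abstract can be sketched as a minimal harness. This is not the GDEV-AI framework itself; `run_inference` is a hypothetical CPU-bound stand-in for a real model forward pass (e.g., ResNet through an inference runtime), used only to show how throughput and median latency are derived per batch size.

```python
import time
import statistics

def run_inference(batch, work_per_item=50_000):
    """Stand-in for a model forward pass: a CPU-bound loop whose
    cost scales linearly with the number of items in the batch."""
    acc = 0
    for _ in range(len(batch) * work_per_item):
        acc += 1
    return acc

def benchmark(batch_size, repeats=5):
    """Time repeated batches at one batch size and return
    (throughput in items/s, median per-batch latency in seconds)."""
    batch = [0.0] * batch_size
    latencies = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_inference(batch)
        latencies.append(time.perf_counter() - t0)
    median_lat = statistics.median(latencies)
    return batch_size / median_lat, median_lat

if __name__ == "__main__":
    # Sweep batch sizes; on a saturated CPU, throughput plateaus
    # while per-batch latency keeps growing with batch size.
    for bs in (1, 2, 4, 8, 16, 32):
        thr, lat = benchmark(bs)
        print(f"batch={bs:3d}  throughput={thr:9.1f} items/s  latency={lat * 1000:8.2f} ms")
```

In a real measurement, the dummy loop would be replaced by an actual model call, and tail latency (e.g., p99 rather than the median) would be tracked to expose the oversubscription effects the abstract reports.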