Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving workloads, we find that multi-GPU performance frequently degrades not because GPUs are saturated, but because CPUs fail to keep the GPUs busy. Under limited CPU allocations, systems exhibit symptoms such as delayed kernel launches, stalled communication, and increased tokenization latency, leading to severe GPU underutilization even when ample GPU resources are available. This work presents a systematic analysis of CPU-induced slowdowns in multi-GPU LLM inference. We show that these bottlenecks persist even in serving stacks that employ process-level separation and modern GPU-side optimizations such as CUDA Graphs. Because the marginal cost of additional CPU cores is small relative to GPU instance pricing, our evaluation indicates that increasing the number of CPU cores can substantially improve performance and stability at minimal additional cost. Under moderate serving load, CPU-starved configurations frequently time out, while providing adequate CPU resources restores responsiveness and reduces time-to-first-token (TTFT) latency by 1.36-5.40x across configurations, all without requiring additional GPUs. These results show that CPU provisioning is a crucial factor in configuring multi-GPU LLM inference and that adequate provisioning helps prevent control-side bottlenecks.