The transition from standard generative AI to \emph{reasoning-centric architectures}, exemplified by models capable of extensive Chain-of-Thought~(CoT) processing, marks a fundamental shift in system requirements. Unlike traditional workloads dominated by compute-bound prefill, reasoning workloads generate long chains of reasoning tokens that push inference into a \emph{capacity-bound regime}. This paper presents a comprehensive system characterization, evaluating models ranging from 8B to 671B parameters on GPU clusters. By systematically exploring the interplay between data, tensor, and pipeline parallelism, we identify critical bottlenecks that defy standard scaling heuristics. Our analysis reveals that data parallelism is throughput-efficient for small models but hits a capacity trap on reasoning workloads: KV-cache fragmentation forces early throttling, which results in sub-optimal compute utilization. Tensor parallelism unlocks stranded memory and delivers sublinear gains near the 32B crossover. At frontier scale, dense models (e.g., Llama-405B) are interconnect- and memory-bandwidth-bound and favor high-degree tensor parallelism, while sparse Mixture-of-Experts~(MoE) models (e.g., DeepSeek-R1) are limited by routing and synchronization latency and benefit from hybrid strategies. These insights provide a rigorous decision framework for navigating the reasoning cliff and establish new architectural imperatives for the next generation of inference infrastructure.
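
As a rough illustration of the capacity argument (a back-of-the-envelope sketch, not the measurement methodology of this paper), consider a model with standard multi-head or grouped-query attention. All symbols here are assumptions introduced for illustration: $L$ layers, $n_{\mathrm{kv}}$ key/value heads of dimension $d_{\mathrm{head}}$, $s$ bytes per cached element, weight footprint $W$, and $G$ GPUs with $M$ bytes of memory each. Activations, fragmentation, and runtime overhead are ignored, and $n_{\mathrm{kv}}$ is assumed divisible by $G$ so that the KV cache shards evenly under tensor parallelism.
\begin{align*}
  b_{\mathrm{kv}} &= 2\, L\, n_{\mathrm{kv}}\, d_{\mathrm{head}}\, s
    && \text{(KV bytes per token; the factor 2 covers keys and values)} \\
  T_{\mathrm{DP}} &= \frac{G\,(M - W)}{b_{\mathrm{kv}}}
    && \text{(aggregate cached tokens, $G$ data-parallel replicas)} \\
  T_{\mathrm{TP}} &= \frac{G\,M - W}{b_{\mathrm{kv}}}
    && \text{(aggregate cached tokens, tensor-parallel degree $G$)} \\
  T_{\mathrm{TP}} - T_{\mathrm{DP}} &= \frac{(G - 1)\,W}{b_{\mathrm{kv}}}
    && \text{(capacity recovered by not replicating the weights)}
\end{align*}
Under these assumptions, the capacity gap grows with both the weight footprint and the number of replicas, which is one simple way to read the claim that tensor parallelism unlocks stranded memory on long reasoning traces.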