Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

The transition from standard generative AI to reasoning-centric architectures, exemplified by models capable of extensive Chain-of-Thought (CoT) processing, marks a fundamental paradigm shift in system requirements. Unlike traditional workloads dominated by compute-bound prefill, reasoning workloads generate long chains of reasoning tokens that shift inference into a Capacity-Bound regime. This paper presents a comprehensive system characterization, evaluating models ranging from 8B to 671B parameters on GPU clusters. By systematically exploring the interplay between Data, Tensor, and Pipeline parallelism, we identify critical bottlenecks that defy standard scaling heuristics. Our analysis reveals that data parallelism is throughput-efficient for small models but hits a capacity trap on reasoning workloads, as KV-cache fragmentation forces early throttling and results in sub-optimal compute utilization. Tensor parallelism unlocks stranded memory and delivers sublinear gains near the 32B crossover. At frontier scale, dense models (e.g., Llama-405B) are interconnect- and memory-bandwidth-bound and favor high-degree TP, while sparse Mixture-of-Experts (MoE) models (e.g., DeepSeek-R1) are limited by routing and synchronization latency and benefit from hybrid strategies. These insights provide a rigorous decision framework for navigating the reasoning cliff, establishing new architectural imperatives for the next generation of inference infrastructure.
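As a rough illustration of the capacity argument in the abstract, the back-of-the-envelope sketch below compares how many long reasoning sequences fit in KV-cache memory under data parallelism (every GPU holds a full weight copy) versus tensor parallelism (weights and KV cache are sharded over the pooled memory). All numbers are hypothetical assumptions chosen for illustration, not figures from the paper: a 32B-class model with 64 layers, 8 KV heads, head dimension 128, FP16 values, 60 GiB of weights, 80 GiB GPUs, and 32,768 tokens per sequence. The helper names are invented for this example, and the model ignores activations, fragmentation, and communication cost.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV-cache bytes stored per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def dp_capacity(num_gpus, gpu_mem_bytes, weight_bytes, seq_kv_bytes):
    """Data parallelism: each GPU holds full weights and serves whole sequences."""
    free_per_gpu = max(gpu_mem_bytes - weight_bytes, 0)
    return num_gpus * (free_per_gpu // seq_kv_bytes)

def tp_capacity(num_gpus, gpu_mem_bytes, weight_bytes, seq_kv_bytes):
    """Tensor parallelism (idealized): weights and KV cache are sharded evenly,
    so the group behaves like one device with pooled memory."""
    pooled_free = max(num_gpus * gpu_mem_bytes - weight_bytes, 0)
    return pooled_free // seq_kv_bytes

# Hypothetical configuration (assumptions, not measurements from the paper).
per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
seq_kv = per_token * 32_768            # KV cache for one long reasoning sequence
gpu_mem = 80 * GIB
weights = 60 * GIB

dp_seqs = dp_capacity(4, gpu_mem, weights, seq_kv)
tp_seqs = tp_capacity(4, gpu_mem, weights, seq_kv)

print(f"KV cache per token:    {per_token // 1024} KiB")
print(f"KV cache per sequence: {seq_kv // GIB} GiB")
print(f"DP x4 concurrent sequences: {dp_seqs}")
print(f"TP x4 concurrent sequences: {tp_seqs}")
```

Under these assumptions each sequence needs 8 GiB of KV cache, so four data-parallel replicas hold 8 sequences in total (leaving 4 GiB stranded on each GPU), while a four-way tensor-parallel group holds 32. This only captures the memory side of the trade-off; the sublinear throughput gains the abstract mentions come from communication overheads that this sketch does not model.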

