Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly expanding ecosystem, dense LLMs--those that activate all model parameters for each token generation--form the foundation for advanced expert-based variants. Dense models continue to dominate because of their strong generalization ability, scalability, ease of fine-tuning, and versatility across diverse tasks. In LLM inference systems, performance is mainly characterized by latency, response time, and throughput (i.e., tokens generated per unit of time). Latency and throughput are inherently coupled: optimizing for one often comes at the expense of the other. Moreover, batching strategies and parallelism configurations, which are essential when dense model parameters exceed device memory capacity, can significantly affect both latency and overall system throughput. This paper (i) investigates the workloads of two representative dense LLMs--Llama-3.1-70B and Llama-3.1-405B--focusing in particular on intra-node parallelization schemes, (ii) analyzes how input characteristics, batching, and parallelism strategies influence latency flexibility and the latency-throughput tradeoff, and (iii) identifies key performance bottlenecks that inform design choices for meeting service-level agreements (SLAs) and sustaining inference quality. Our empirical evaluations reveal that Tensor Parallelism (TP) better serves latency objectives, while Pipeline Parallelism (PP) is better suited to throughput-oriented applications. We further show that hybrid configurations, obtained by tuning the TP and PP degrees, offer fine-grained control over the latency-throughput tradeoff.