Understanding and Optimizing Multi-Stage AI Inference Pipelines

Abhimanyu Rajeshkumar Bambhaniya, Hanjiang Wu, Suvinay Subramanian, Sudarshan Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna

From arXiv: Inference System Design for Multi-Stage AI Inference Pipelines. 13 pages, 15 figures, 3 tables.

The rapid evolution of Large Language Models (LLMs) has driven the need for increasingly sophisticated inference pipelines and hardware platforms. Modern LLM serving extends beyond traditional prefill-decode workflows, incorporating multi-stage processes such as Retrieval-Augmented Generation (RAG), key-value (KV) cache retrieval, dynamic model routing, and multi-step reasoning. These stages exhibit diverse computational demands, requiring distributed systems that integrate GPUs, ASICs, CPUs, and memory-centric architectures. However, existing simulators lack the fidelity to model these heterogeneous, multi-engine workflows, limiting their ability to inform architectural decisions. To address this gap, we introduce MIST, a Heterogeneous Multi-stage LLM inference Execution Simulator. MIST models diverse request stages, including RAG, KV retrieval, reasoning, prefill, and decode, across complex hardware hierarchies. Unlike prior frameworks, MIST supports heterogeneous clients executing multiple models concurrently, while incorporating advanced batching strategies and multi-level memory hierarchies. By integrating real hardware traces with analytical modeling, MIST captures critical trade-offs such as memory bandwidth contention, inter-cluster communication latency, and batching efficiency in hybrid CPU-accelerator deployments. Through case studies, we explore the impact of reasoning stages on end-to-end latency, optimal batching strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval. MIST empowers system designers to navigate the evolving landscape of LLM inference, providing actionable insights into optimizing hardware-software co-design for next-generation AI workloads.
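To make the idea of analytically modeling a multi-stage pipeline on heterogeneous hardware concrete, the following is a minimal sketch and not MIST's actual implementation or API. It assumes a simple roofline-style estimate, in which each stage takes the larger of its compute time and its memory time, plus a fixed hand-off latency. The names `Device`, `Stage`, `stage_latency`, and `pipeline_latency`, and every numeric value, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical hardware client (values are illustrative, not measured)."""
    name: str
    peak_flops: float     # sustained compute throughput, FLOP/s
    mem_bandwidth: float  # sustained memory bandwidth, bytes/s


@dataclass(frozen=True)
class Stage:
    """One request stage (e.g. RAG, prefill, decode) mapped to a device."""
    name: str
    flops: float             # compute work per request, FLOP
    bytes_moved: float       # memory traffic per request, bytes
    device: Device
    transfer_s: float = 0.0  # assumed hand-off latency into this stage, seconds


def stage_latency(stage: Stage) -> float:
    """Roofline-style estimate: the stage is bound by compute or by memory."""
    compute_s = stage.flops / stage.device.peak_flops
    memory_s = stage.bytes_moved / stage.device.mem_bandwidth
    return stage.transfer_s + max(compute_s, memory_s)


def pipeline_latency(stages: list[Stage]) -> float:
    """End-to-end latency of one request that runs the stages sequentially."""
    return sum(stage_latency(s) for s in stages)


# Illustrative devices and stages; all numbers are made up for the example.
cpu = Device("cpu", peak_flops=1e12, mem_bandwidth=1e11)
gpu = Device("gpu", peak_flops=1e14, mem_bandwidth=1e12)

example_stages = [
    Stage("rag_retrieval", flops=1e10, bytes_moved=1e10, device=cpu),
    Stage("prefill", flops=1e13, bytes_moved=1e10, device=gpu, transfer_s=0.005),
    Stage("decode", flops=1e11, bytes_moved=2e11, device=gpu),
]

if __name__ == "__main__":
    for s in example_stages:
        print(f"{s.name:>14}: {stage_latency(s) * 1e3:.1f} ms")
    print(f"{'end-to-end':>14}: {pipeline_latency(example_stages) * 1e3:.1f} ms")
```

In this toy setup the retrieval and decode stages are memory-bound while prefill is compute-bound, which is the kind of per-stage diversity the abstract describes. A real simulator would additionally need batching, queuing, contention, and trace-calibrated costs, none of which this sketch models.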
