Modern LLM serving now spans multi-stage pipelines, including RAG retrieval and KV cache reuse, each with distinct compute, memory, and latency demands. Inference engines expose a large configuration space with no systematic methodology for navigating it, and exhaustively benchmarking configurations can exceed $40K in cloud costs. Meanwhile, the hardware landscape is rapidly diversifying across AMD GPUs, TPUs, and custom ASICs, while cross-vendor prefill-decode (PD) disaggregated configurations currently lack unified software stacks for end-to-end evaluation. To address this gap, we present MIST, a Heterogeneous Multi-stage LLM inference Execution Simulator. MIST models diverse request stages, including RAG, KV retrieval, reasoning, prefill, and decode, across complex hardware hierarchies. Unlike prior frameworks, MIST supports heterogeneous clients executing multiple models concurrently, while incorporating advanced batching strategies and multi-level memory hierarchies. By integrating real hardware traces with analytical modeling, MIST captures critical trade-offs such as memory bandwidth contention, inter-cluster communication latency, and batching efficiency in hybrid CPU-accelerator deployments. Through case studies, we explore the impact of reasoning stages on end-to-end latency, optimal batching strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval. MIST enables system designers to navigate the evolving landscape of LLM inference, providing actionable insights into hardware-software co-design for next-generation AI workloads.