Kavier: Exploring Performance, Sustainability, and Efficiency of LLM Ecosystems under Inference through Cache-Aware Discrete-Event Simulation

Large Language Models (LLMs) are widely used across our increasingly digitalized society, but they raise sustainability, performance, and financial concerns, especially as inference workloads grow. To improve the design and operation of LLM ecosystems, we envision simulators and simulation-based digital twins becoming primary decision-making tools. LLM ecosystems combine many heterogeneous components, which makes simulating them non-trivial yet critical. The challenge is exacerbated by the absence of a comprehensive reference architecture of LLM ecosystems: without such a conceptual model, designers and engineers can be misguided at considerable cost, and even the most experienced stakeholders may resort to trial and error when researching, engineering, or maintaining LLM ecosystems. In this work, we make a three-fold contribution to the scientific community. First, we synthesize, propose, and validate a reference architecture (RA) of LLM ecosystems under inference. Second, adhering to the reference architecture, we design Kavier, the first simulation instrument able to predict the performance, sustainability, and efficiency of LLM ecosystems under inference through discrete-event, cache-aware simulation, focusing on Key-Value (KV) Caching and prompt prefix caching policies. Third, through experiments with a Kavier prototype and real-world traces, we (i) measure the accuracy of Kavier and its performance in massive-scale simulations, (ii) compare the performance of different KV-Caching policies, and (iii) analyze the performance, sustainability, and efficiency of LLM ecosystems under various prefix caching policies. Overall, we show that Kavier enables operators, researchers, and engineers to predict the behavior of LLM ecosystems in a time-, performance-, and cost-efficient way.
