We survey the large language model (LLM) serving area to understand the intricate dynamics between cost-efficiency and accuracy, which are magnified by the growing need for longer contextual understanding when deploying models at massive scale. Our findings reveal that works in this space optimize along three distinct but conflicting goals: improving serving context length (C), improving serving accuracy (A), and improving serving performance (P). Drawing inspiration from the CAP theorem in databases, we propose a CAP principle for LLM serving, which suggests that any optimization can improve at most two of these three goals simultaneously. Our survey categorizes existing works within this framework. We find that the definition and continuity of user-perceived measurement metrics are crucial in determining whether a goal has been met, much as with CAP databases deployed in the wild. We regard the CAP principle for LLM serving as a guiding principle rather than a formal theorem, intended to inform designers of the inherent and dynamic trade-offs in serving models. As serving accuracy and performance have been extensively studied, this survey focuses on works that extend serving context length and address the resulting challenges.