We survey the large language model (LLM) serving area to understand the intricate dynamics between cost-efficiency and accuracy, which are magnified by the growing need for longer contextual understanding when deploying models at massive scale. Our findings reveal that works in this space optimize along three distinct but conflicting goals: improving serving context length (C), improving serving accuracy (A), and improving serving performance (P). Drawing inspiration from the CAP theorem in databases, we propose a CAP principle for LLM serving, which suggests that any optimization can improve at most two of these three goals simultaneously. Our survey categorizes existing works within this framework. We find that the definition and continuity of user-perceived measurement metrics are crucial in determining whether a goal has been met, much as with CAP databases deployed in practice. We regard the CAP principle for LLM serving as a guiding principle, rather than a formal theorem, that informs designers of the inherent and dynamic trade-offs in serving models. As serving accuracy and performance have been extensively studied, this survey focuses on works that extend serving context length and address the resulting challenges.