Serving large language models (LLMs) in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (e.g., TTFT, TBT, normalised latency, and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance, which is crucial for real-time applications such as chat and translation. In this paper, we first identify the pitfalls of current performance metrics in evaluating LLM inference systems. We then propose Metron, a comprehensive performance evaluation framework that includes fluidity-index -- a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience. Finally, we evaluate various existing open-source platforms and model-as-a-service offerings using Metron, discussing their strengths and weaknesses. Metron is available at https://github.com/project-metron/metron.
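To make the conventional metrics named above concrete, the sketch below computes them from per-token arrival timestamps using their commonly accepted definitions (TTFT: time to first token; TBT: time between consecutive tokens; TPOT: mean decode time per output token after the first; normalised latency: end-to-end latency divided by output length). This is an illustrative assumption-laden sketch, not code from Metron, and the function name is hypothetical.

```python
# Illustrative sketch (not from the Metron codebase): computing the
# conventional latency metrics from per-token arrival timestamps.

def latency_metrics(request_time, token_times):
    """request_time: when the request arrived; token_times: absolute
    arrival times of each generated token, in order."""
    ttft = token_times[0] - request_time  # Time To First Token
    # TBT: gaps between consecutive generated tokens
    tbts = [b - a for a, b in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_time  # end-to-end latency
    # TPOT: mean time per output token after the first
    tpot = (total - ttft) / max(len(token_times) - 1, 1)
    # Normalised latency: end-to-end latency per output token
    norm_latency = total / len(token_times)
    return {
        "ttft": ttft,
        "tbt_max": max(tbts) if tbts else 0.0,
        "tpot": tpot,
        "normalised_latency": norm_latency,
    }

# Example: first token at 0.5 s, then one token every 0.1 s
print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8]))
```

Note how averages such as TPOT can look healthy even when individual inter-token gaps stall, which is the kind of nuance the abstract argues these aggregate metrics miss.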