Large language models (LLMs) are now at the core of conversational AI services such as real-time translation and chatbots, which provide live user interaction by incrementally streaming text to the user. However, existing LLM serving systems fail to provide a good user experience because their optimization metrics are not always aligned with user experience. In this paper, we first introduce and define the notion of Quality-of-Experience (QoE) for text streaming services by considering each user's end-to-end interaction timeline. Based on this, we propose Andes, a QoE-aware LLM serving system that enhances user experience by ensuring that users receive the first token promptly and subsequent tokens at a smooth, digestible pace, even during surge periods. This is enabled by Andes's preemptive request scheduler, which dynamically prioritizes requests at token granularity based on each request's expected QoE gain and GPU resource usage. Our evaluations demonstrate that, compared to state-of-the-art LLM serving systems, Andes improves average QoE by up to $4.7\times$ with the same GPU resources, or saves up to 61% of GPU resources while maintaining the same high QoE.
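To make the scheduling idea concrete, the following is a minimal sketch of the greedy intuition the abstract describes: at each token-generation step, rank requests by expected QoE gain per unit of GPU resource and serve as many high-value requests as fit within the budget, preempting the rest. This is an illustrative simplification, not Andes's actual algorithm; the names `qoe_gain`, `gpu_cost`, and `schedule_step` are hypothetical.

```python
def schedule_step(requests, gpu_budget):
    """Greedily pick requests to serve this token step.

    Each request is a dict with hypothetical fields:
      id       - request identifier
      qoe_gain - estimated QoE improvement from serving one more token
      gpu_cost - GPU resource consumed by serving that token
    Requests not chosen are effectively preempted for this step.
    """
    # Rank by expected QoE gain per unit GPU cost, highest first.
    ranked = sorted(requests,
                    key=lambda r: r["qoe_gain"] / r["gpu_cost"],
                    reverse=True)
    chosen, used = [], 0.0
    for r in ranked:
        if used + r["gpu_cost"] <= gpu_budget:
            chosen.append(r["id"])
            used += r["gpu_cost"]
    return chosen

# Example: with a budget of 5, the cheap high-gain request wins a slot
# even though a bulkier request arrived with a larger absolute gain.
requests = [
    {"id": 1, "qoe_gain": 10.0, "gpu_cost": 5.0},  # ratio 2.0
    {"id": 2, "qoe_gain": 4.0,  "gpu_cost": 1.0},  # ratio 4.0
    {"id": 3, "qoe_gain": 6.0,  "gpu_cost": 4.0},  # ratio 1.5
]
print(schedule_step(requests, gpu_budget=5.0))  # → [2, 3]
```

In practice such a scheduler would re-estimate each request's QoE gain every step (e.g., whether the user has received the first token yet, and whether the token delivery pace is falling behind a digestible rate), which is what makes the prioritization token-granular and preemptive.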