The advent of large language models (LLMs) has transformed text-based services, enabling capabilities ranging from real-time translation to AI-driven chatbots. However, existing serving systems primarily focus on optimizing server-side aggregate metrics like token generation throughput, ignoring individual user experience with streamed text. As a result, under high and/or bursty load, a significant number of users can receive unfavorable service quality or poor Quality-of-Experience (QoE). In this paper, we first formally define the QoE of text streaming services, where text is delivered incrementally and interactively to users, by considering the end-to-end token delivery process throughout the entire user interaction. We then propose Andes, a QoE-aware serving system that enhances user experience for LLM-enabled text streaming services. At its core, Andes strategically allocates contended GPU resources among multiple requests over time to optimize their QoE. Our evaluations demonstrate that, compared to state-of-the-art LLM serving systems like vLLM, Andes improves average QoE by up to 3.2$\times$ under high request rates, or alternatively attains up to 1.6$\times$ higher request rates while preserving high QoE.