We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale, enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address the challenges of serving at this scale, xLLM adopts a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also employs a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture that provides global KV Cache management and robust fault-tolerance capabilities for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode, and xTensor memory management. xLLM-Engine further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, which collectively deliver substantial gains in throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical time-per-output-token (TPOT) constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. The xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.
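To make the workload-adaptive dynamic PD disaggregation policy more concrete, the following minimal Python sketch illustrates one way such a routing decision could be expressed: a request's prefill and decode phases are either split across separate instance pools or colocated on one instance, depending on prompt length and current decode-side load. All names, thresholds, and the `route_request` function here are hypothetical assumptions for exposition and are not part of the xLLM codebase or its API.

```python
# Illustrative sketch only: a toy workload-adaptive Prefill-Decode (PD)
# routing decision. All types and thresholds are hypothetical and do not
# reflect the actual xLLM-Service implementation.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int            # length of the prefill phase
    expected_decode_tokens: int   # estimated length of the decode phase


@dataclass
class ClusterLoad:
    prefill_queue_depth: int        # pending prefill work on prefill instances
    decode_batch_occupancy: float   # fraction of decode batch slots in use


def route_request(req: Request, load: ClusterLoad,
                  prefill_heavy_threshold: int = 2048,
                  decode_saturation: float = 0.85) -> str:
    """Return a routing mode for one request.

    'disaggregated': run prefill and decode on separate instance pools,
                     so long prefills do not stall in-flight decodes.
    'colocated':     run both phases on the same instance, avoiding the
                     KV Cache transfer cost when the cluster is lightly loaded.
    """
    long_prefill = req.prompt_tokens >= prefill_heavy_threshold
    decode_busy = load.decode_batch_occupancy >= decode_saturation
    if long_prefill or decode_busy:
        return "disaggregated"
    return "colocated"


if __name__ == "__main__":
    load = ClusterLoad(prefill_queue_depth=3, decode_batch_occupancy=0.9)
    req = Request(prompt_tokens=4096, expected_decode_tokens=256)
    print(route_request(req, load))  # -> "disaggregated"
```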