We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale, enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address the challenges of such serving workloads, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also employs a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture that provides global KV Cache management and robust fault tolerance for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode, and xTensor memory management. xLLM-Engine further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, which collectively deliver substantial gains in throughput and inference efficiency. Extensive evaluations demonstrate that xLLM achieves significantly superior performance and resource efficiency. Under identical TPOT constraints, xLLM attains throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with DeepSeek-series models. The xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.