We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale, enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address the challenges of serving at this scale, xLLM adopts a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module also employs a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. Furthermore, it incorporates a distributed architecture that provides global KV Cache management and robust fault-tolerance capabilities for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources. This is achieved through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode, and xTensor memory management. xLLM-Engine further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB, which collectively deliver substantial gains in throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical time-per-output-token (TPOT) constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with Deepseek-series models. The xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.
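To make the workload-adaptive dynamic PD disaggregation policy more concrete, the following minimal Python sketch illustrates one way such a routing decision could be expressed: a request's prefill and decode phases are either split across separate instance pools or colocated on one instance, depending on prompt length and current decode-side load. All names, thresholds, and the `route_request` function here are hypothetical assumptions for exposition and are not part of the xLLM codebase or its API.

```python
# Illustrative sketch only: a toy workload-adaptive Prefill-Decode (PD)
# routing decision. All types and thresholds are hypothetical and do not
# reflect the actual xLLM-Service implementation.
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int            # length of the prefill phase
    expected_decode_tokens: int   # estimated length of the decode phase


@dataclass
class ClusterLoad:
    prefill_queue_depth: int        # pending prefill work on prefill instances
    decode_batch_occupancy: float   # fraction of decode batch slots in use


def route_request(req: Request, load: ClusterLoad,
                  prefill_heavy_threshold: int = 2048,
                  decode_saturation: float = 0.85) -> str:
    """Return a routing mode for one request.

    'disaggregated': run prefill and decode on separate instance pools,
                     so long prefills do not stall in-flight decodes.
    'colocated':     run both phases on the same instance, avoiding the
                     KV Cache transfer cost when the cluster is lightly loaded.
    """
    long_prefill = req.prompt_tokens >= prefill_heavy_threshold
    decode_busy = load.decode_batch_occupancy >= decode_saturation
    if long_prefill or decode_busy:
        return "disaggregated"
    return "colocated"


if __name__ == "__main__":
    load = ClusterLoad(prefill_queue_depth=3, decode_batch_occupancy=0.9)
    req = Request(prompt_tokens=4096, expected_decode_tokens=256)
    print(route_request(req, load))  # -> "disaggregated"
```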