Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. However, existing systems struggle to (i) generalize automatically across diverse hardware architectures and precision formats, often relying on fragmented, hand-tuned kernels, and (ii) fully exploit available memory and compute resources, which often leads to performance bottlenecks. To address these problems, we propose TurboMind, a generalizable and efficient mixed-precision LLM inference engine within LMDeploy. TurboMind is built around two hardware-aware mixed-precision pipelines: a General Matrix Multiply (GEMM) pipeline that optimizes matrix operations through offline weight packing and online acceleration, and an attention pipeline that enables efficient attention computation with different Query, Key, and Value precision combinations. These pipelines are enabled by four key techniques: (i) hardware-aware weight packing and (ii) adaptive head alignment for generalizability, and (iii) instruction-level parallelism and (iv) a KV memory loading pipeline for efficiency. We conduct comprehensive evaluations of LMDeploy powered by TurboMind across sixteen popular LLMs and four representative GPU architectures. Results demonstrate that, in mixed-precision workloads, LMDeploy achieves up to 61% lower serving latency (30% on average) and up to 156% higher throughput (58% on average) than existing mixed-precision frameworks, with consistent improvements across all tested configurations and hardware types. This work is open-sourced and publicly available at https://github.com/InternLM/lmdeploy.
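To make the idea of offline weight packing followed by online mixed-precision computation concrete, the following is a minimal sketch in plain Python. It is not TurboMind's implementation: the symmetric per-row INT4 scheme, the low-nibble-first byte layout, and all function names (`quantize_int4`, `pack_int4`, `unpack_int4`, `matvec_w4`) are assumptions chosen for illustration only. A real engine would pick a hardware-specific layout and fuse dequantization into GPU GEMM kernels.

```python
def quantize_int4(row):
    """Symmetric per-row quantization to signed 4-bit integers in [-8, 7].

    Illustrative scheme only; returns (quantized values, scale).
    """
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in row]
    return q, scale


def pack_int4(q):
    """Offline step: pack pairs of signed 4-bit values into bytes.

    Assumed layout: first value in the low nibble, second in the high nibble.
    An odd-length input is padded with a trailing zero.
    """
    if len(q) % 2:
        q = q + [0]
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed, n):
    """Online step: recover the first n signed 4-bit values from packed bytes."""
    vals = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            vals.append(nib - 16 if nib >= 8 else nib)
    return vals[:n]


def matvec_w4(packed_rows, scales, n, x):
    """Compute y = W @ x where W is stored as packed INT4 rows.

    Weights are dequantized on the fly (integer value times per-row scale),
    while the activations x stay in floating point.
    """
    y = []
    for packed, s in zip(packed_rows, scales):
        q = unpack_int4(packed, n)
        y.append(s * sum(qi * xi for qi, xi in zip(q, x)))
    return y


# Usage: quantize and pack one weight row offline, then multiply online.
weights = [[7.0, -7.0, 0.0, 3.0]]
packed_rows, scales = [], []
for row in weights:
    q, s = quantize_int4(row)
    packed_rows.append(pack_int4(q))
    scales.append(s)
y = matvec_w4(packed_rows, scales, 4, [1.0, 2.0, 3.0, 4.0])
```

In this sketch the four weights occupy two bytes instead of four floating-point values, which is the memory saving that weight-only quantization targets; the cost is the extra unpack-and-scale work at compute time, which is the part an optimized kernel would overlap with other instructions.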