The typical process for developing LLMs involves pre-training a general foundation model on massive data and then fine-tuning it on task-specific data to create specialized experts. Serving these experts is challenging: loading all of them onto devices is impractical, and frequently switching between experts in response to user requests incurs substantial I/O costs, increasing both latency and expense. Previous approaches decompose each expert's weights into the pre-trained model's weights plus residual delta weights, then quantize the delta weights to reduce model size. However, these methods suffer significant quantization error at extremely low bitwidths, and they assume that the appropriate model for each user request is known in advance, which is impractical in deployment. To address these issues, we introduce ME-Switch, a memory-efficient expert switching framework for LLM serving. ME-Switch uses mixed-precision quantization, selectively quantizing the non-salient input channels of the delta weights to extremely low bits while keeping the salient ones intact; this substantially reduces storage demands while maintaining performance. We further develop a routing method that efficiently directs user queries to the most suitable expert by recasting the model selection problem as a domain classification problem. Extensive experiments demonstrate ME-Switch's memory efficiency and routing performance. For example, when serving three models from the Mistral-7B family, ME-Switch reduces the total model size by 1.74× while remaining nearly lossless on instruction-following, mathematical reasoning, and code generation tasks. Moreover, ME-Switch can efficiently serve 16 models from the Mistral-7B family on a single NVIDIA A100 GPU.
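The mixed-precision delta quantization idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the saliency criterion (column L2 norm of the delta), the keep ratio, and the simple per-tensor symmetric quantizer are all assumptions made for the sketch.

```python
import numpy as np

def quantize_delta(base_w, expert_w, keep_ratio=0.01, bits=2):
    """Sketch: quantize non-salient input channels of the delta weights.

    delta = expert_w - base_w. The top `keep_ratio` fraction of input
    channels (columns) by L2 norm is kept in full precision; the rest
    are quantized to `bits` bits with a uniform symmetric quantizer.
    Returns the dequantized delta for inspection.
    """
    delta = expert_w - base_w                      # residual delta weights
    channel_norms = np.linalg.norm(delta, axis=0)  # saliency proxy (assumed)
    n_keep = max(1, int(keep_ratio * delta.shape[1]))
    salient = np.argsort(channel_norms)[-n_keep:]  # indices kept intact

    mask = np.ones(delta.shape[1], dtype=bool)     # True = quantize this column
    mask[salient] = False
    q_levels = 2 ** (bits - 1) - 1                 # e.g. levels {-1, 0, 1} for 2-bit
    scale = np.abs(delta[:, mask]).max() / q_levels
    q = np.round(delta[:, mask] / scale).clip(-q_levels, q_levels)

    deq = delta.copy()
    deq[:, mask] = q * scale                       # low-bit channels, dequantized
    return deq                                     # salient channels untouched
```

At serving time the expert is reconstructed as `base_w + deq`, so only the compressed delta needs to be stored and moved per expert.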
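The routing step treats model selection as domain classification. The toy keyword scorer below only illustrates the interface; the paper's actual router is a learned classifier, and the domain names and vocabularies here are hypothetical.

```python
DOMAIN_KEYWORDS = {  # hypothetical domain vocabularies for illustration
    "math": {"solve", "equation", "integral", "prove"},
    "code": {"function", "python", "compile", "debug"},
    "chat": {"hello", "explain", "summarize", "story"},
}

def route_query(query: str, experts: dict):
    """Pick the expert whose domain best matches the query (toy scorer)."""
    tokens = set(query.lower().split())
    scores = {d: len(tokens & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return experts.get(best, experts["chat"])  # fall back to the generalist
```

Because classification happens before any weights are loaded, only the selected expert's (quantized) delta needs to be fetched per request.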