The typical process for developing LLMs involves pre-training a general foundation model on massive data and then fine-tuning it on task-specific data to create specialized experts. Serving these experts is challenging: loading all of them onto devices is impractical, and frequently switching between experts in response to user requests incurs substantial I/O costs, increasing both latency and expense. Previous approaches decompose each expert's weights into the pre-trained model's weights plus residual delta weights, then quantize the delta weights to reduce model size. However, these methods suffer significant quantization error at extremely low bitwidths, and they assume that the appropriate model for each user request is known in advance, which is impractical in deployment. To address these issues, we introduce ME-Switch, a memory-efficient expert switching framework for LLM serving. ME-Switch uses mixed-precision quantization, selectively quantizing the non-salient input channels of the delta weights to extremely low bits while keeping the salient ones intact; this substantially reduces storage demands while maintaining performance. We further develop a routing method that efficiently directs user queries to the most suitable expert by recasting the model selection problem as a domain classification problem. Extensive experiments demonstrate ME-Switch's memory efficiency and routing performance. For example, when serving three models from the Mistral-7B family, ME-Switch reduces the total model size by 1.74× while remaining nearly lossless on instruction-following, mathematical reasoning, and code generation tasks. Moreover, ME-Switch can efficiently serve 16 models from the Mistral-7B family on a single NVIDIA A100 GPU.
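The mixed-precision delta quantization idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the saliency criterion (column L2 norm of the delta), the keep ratio, and the simple per-tensor symmetric quantizer are all assumptions made for the sketch.

```python
import numpy as np

def quantize_delta(base_w, expert_w, keep_ratio=0.01, bits=2):
    """Sketch: quantize non-salient input channels of the delta weights.

    delta = expert_w - base_w. The top `keep_ratio` fraction of input
    channels (columns) by L2 norm is kept in full precision; the rest
    are quantized to `bits` bits with a uniform symmetric quantizer.
    Returns the dequantized delta for inspection.
    """
    delta = expert_w - base_w                      # residual delta weights
    channel_norms = np.linalg.norm(delta, axis=0)  # saliency proxy (assumed)
    n_keep = max(1, int(keep_ratio * delta.shape[1]))
    salient = np.argsort(channel_norms)[-n_keep:]  # indices kept intact

    mask = np.ones(delta.shape[1], dtype=bool)     # True = quantize this column
    mask[salient] = False
    q_levels = 2 ** (bits - 1) - 1                 # e.g. levels {-1, 0, 1} for 2-bit
    scale = np.abs(delta[:, mask]).max() / q_levels
    q = np.round(delta[:, mask] / scale).clip(-q_levels, q_levels)

    deq = delta.copy()
    deq[:, mask] = q * scale                       # low-bit channels, dequantized
    return deq                                     # salient channels untouched
```

At serving time the expert is reconstructed as `base_w + deq`, so only the compressed delta needs to be stored and moved per expert.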
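The routing step treats model selection as domain classification. The toy keyword scorer below only illustrates the interface; the paper's actual router is a learned classifier, and the domain names and vocabularies here are hypothetical.

```python
DOMAIN_KEYWORDS = {  # hypothetical domain vocabularies for illustration
    "math": {"solve", "equation", "integral", "prove"},
    "code": {"function", "python", "compile", "debug"},
    "chat": {"hello", "explain", "summarize", "story"},
}

def route_query(query: str, experts: dict):
    """Pick the expert whose domain best matches the query (toy scorer)."""
    tokens = set(query.lower().split())
    scores = {d: len(tokens & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return experts.get(best, experts["chat"])  # fall back to the generalist
```

Because classification happens before any weights are loaded, only the selected expert's (quantized) delta needs to be fetched per request.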