CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory

Large language models like GPT-4 are resource-intensive, but recent advancements suggest that smaller, specialized experts can outperform monolithic models on specific tasks. The Collaboration-of-Experts (CoE) approach integrates multiple expert models, improving the accuracy of generated results and offering great potential for precision-critical applications, such as automatic circuit board quality inspection. However, deploying CoE serving systems strains memory capacity because of the large number of experts required, and frequent expert switching across memory and storage tiers can cause significant performance overhead. We propose CoServe, an efficient CoE model serving system for heterogeneous CPU and GPU platforms with limited memory. CoServe reduces unnecessary expert switching by leveraging expert dependency, a key property of CoE inference. To this end, it introduces a dependency-aware request scheduler and dependency-aware expert management. It also introduces an offline profiler that automatically finds the optimal resource allocation on various processors and devices. In real-world intelligent manufacturing workloads, CoServe achieves 4.5$\times$ to 12$\times$ higher throughput compared to state-of-the-art systems.
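
To make the cost of expert switching concrete, here is a toy sketch in Python. It is not CoServe's scheduler or expert manager, whose details the abstract does not give. It assumes a memory pool that holds a fixed number of experts with least-recently-used eviction, and it compares arrival-order execution against an order that groups requests sharing the same expert chain. The function names, expert names, and workload are all hypothetical.

```python
from collections import OrderedDict


def count_expert_loads(requests, capacity):
    """Count expert loads for a pool holding at most `capacity` experts.

    Each request is a tuple of expert names it runs in order (its chain).
    Eviction is plain LRU; this is an illustrative assumption, not
    CoServe's actual expert management policy.
    """
    resident = OrderedDict()  # expert name -> True, oldest first
    loads = 0
    for chain in requests:
        for expert in chain:
            if expert in resident:
                resident.move_to_end(expert)  # mark as most recently used
            else:
                loads += 1  # expert must be brought into memory
                resident[expert] = True
                if len(resident) > capacity:
                    resident.popitem(last=False)  # evict least recently used
    return loads


def group_by_chain(requests):
    """Reorder requests so identical expert chains run back to back."""
    return sorted(requests, key=tuple)


# Hypothetical workload: two request types with disjoint expert chains.
chain_a = ("detector_a", "classifier_a")
chain_b = ("detector_b", "classifier_b")
arrival_order = [chain_a, chain_b, chain_a, chain_b]

print(count_expert_loads(arrival_order, capacity=2))                  # 8
print(count_expert_loads(group_by_chain(arrival_order), capacity=2))  # 4
```

In this toy setting, interleaved requests evict each other's experts on every step, while grouping requests by their expert chain halves the number of loads. A real system would also have to weigh latency and fairness, which this sketch ignores.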

