Large language models like GPT-4 are resource-intensive, but recent advances suggest that smaller, specialized experts can outperform monolithic models on specific tasks. The Collaboration-of-Experts (CoE) approach integrates multiple expert models, improving the accuracy of generated results and offering great potential for precision-critical applications such as automatic circuit board quality inspection. However, deploying CoE serving systems strains memory capacity because of the large number of experts required, and frequent expert switching across different memory and storage tiers can incur significant performance overhead. We propose CoServe, an efficient CoE model serving system on heterogeneous CPU and GPU hardware with limited memory. CoServe reduces unnecessary expert switching by leveraging expert dependency, a key property of CoE inference. It introduces a dependency-aware request scheduler and dependency-aware expert management for efficient inference, along with an offline profiler that automatically finds the optimal resource allocation across various processors and devices. On real-world intelligent manufacturing workloads, CoServe achieves 4.5$\times$ to 12$\times$ higher throughput than state-of-the-art systems.