As a current trend in Artificial Intelligence (AI), large foundation models are increasingly employed as the core of AI services. However, even after training, serving such models at scale remains challenging due to their heavy resource footprints, particularly in GPU memory. While recent works have revealed unique characteristics of systems serving foundation models that distinguish them from traditional distributed computing systems, a fundamental understanding of the underlying system management problems is still lacking. This work addresses this gap by extracting a novel problem of "server chain composition" via block placement and cache allocation for serving chain-structured jobs with large memory footprints, which models a fundamental problem in serving large foundation models through pipeline parallelism. After establishing the NP-hardness of the problem, the focus turns to developing scalable algorithms with guaranteed performance under state-of-the-art load balancing. Applying the proposed solution to a distributed large language model (LLM) serving system yields significant reductions in response time compared to state-of-the-art solutions.