Large language models represent a groundbreaking shift in generative AI. Yet these advances come with a significant challenge: the high cost of model serving. Consumer-grade GPUs offer a more affordable alternative, presenting an opportunity for cost-efficient LLM serving. However, achieving high-efficiency LLM serving on consumer-grade GPUs is non-trivial, mainly due to two challenges: (1) these GPUs are often deployed under constrained network conditions, and (2) their host systems are often heterogeneous. To address these challenges, we present MoLink, a distributed serving system for large language models. It incorporates several key techniques that enable efficient LLM serving on heterogeneous and weakly connected consumer-grade GPUs. Our experiments demonstrate that it achieves throughput improvements of up to 458\% and cost-profit margin improvements of up to 151\%, compared to state-of-the-art systems. MoLink allows users on Windows, Linux, and containerized VMs to seamlessly integrate GPUs over Ethernet or public networks with just a few lines of code. It currently supports 18 mainstream open-source LLM architectures. The source code is publicly available at https://github.com/oldcpple/MoLink.
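As a rough illustration of the claimed few-lines-of-code integration, the sketch below shows what joining a MoLink cluster might look like. The `molink` module, the `join_cluster` and `serve` names, the peer address, and the model identifier are all hypothetical placeholders, not MoLink's documented API; consult the repository for actual usage.

```python
# Hypothetical sketch only: the `molink` module and these function names
# are assumptions for illustration; MoLink's real API may differ.
import molink  # assumed package name

# Join an existing serving cluster over Ethernet or a public network.
# The peer address below is a placeholder.
node = molink.join_cluster(initial_peer="198.51.100.10:8000")

# Host a slice of a supported open-source model on this machine's GPU;
# the model identifier is illustrative.
node.serve(model="meta-llama/Llama-2-7b-hf")
```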