Large language model (LLM) services are mostly centralized, which creates scalability bottlenecks and leaves substantial scattered GPU resources underutilized. Decentralization offers a promising alternative, but existing frameworks focus primarily on cooperation among GPU providers and overlook their inherent competitive dynamics. In doing so, they impose substantial constraints, such as excessive platform-level oversight or rigid requirements to execute all assigned requests using fixed software stacks on fixed hardware configurations. We argue that such assumptions are unrealistic in real-world decentralized environments. To this end, we propose WWW$.$Serve, a decentralized framework for interconnecting LLM services worldwide. It allows participants to flexibly determine their participation policies and resource commitments, and it supports self-organizing request dispatch, enabling the network to allocate requests autonomously without centralized coordination. Empirically, we show that WWW$.$Serve improves global service-level objective (SLO) attainment by up to 1.5x and lowers latency by 27.6%. Its performance approaches, and in some cases surpasses, that of centralized scheduling, while fully preserving the benefits of decentralization. These results highlight WWW$.$Serve as a promising foundation for real-world decentralized LLM serving.