Large language model (LLM) services are mostly centralized, leading to scalability bottlenecks and leaving substantial scattered GPU resources underutilized. While decentralization offers a promising alternative, existing frameworks focus primarily on cooperation among GPU providers, overlooking their inherent competitive dynamics and imposing heavy constraints such as excessive platform-level oversight or rigid requirements to execute all assigned requests with fixed software stacks on fixed hardware configurations. We argue that such assumptions are unrealistic in real-world decentralized environments. To this end, we propose WWW.Serve, a decentralized framework for interconnecting LLM services worldwide. It allows participants to flexibly determine their participation policies and resource commitments, and it supports self-organizing request dispatch, enabling the network to allocate requests autonomously without centralized coordination. Empirically, we show that WWW.Serve improves global SLO (service-level objective) attainment by up to 1.5x and lowers latency by 27.6%. Its performance approaches, and in some cases surpasses, that of centralized scheduling, while fully preserving the benefits of decentralization. These results highlight WWW.Serve as a promising foundation for real-world, decentralized LLM serving.