While significant progress has been made in research and development on open-source and cost-efficient large language models (LLMs), serving scalability remains a critical challenge, particularly for small organizations and individuals seeking to deploy and test their LLM innovations. Inspired by peer-to-peer networks that leverage decentralized overlay nodes to increase throughput and availability, we propose GenTorrent, an LLM serving overlay that harnesses computing resources from decentralized contributors. We identify four key research problems inherent to enabling such a decentralized infrastructure: 1) overlay network organization; 2) LLM communication privacy; 3) overlay forwarding for resource efficiency; and 4) verification of serving quality. This work presents the first systematic study of these fundamental problems in the context of decentralized LLM serving. Evaluation results from a prototype implemented on a set of decentralized nodes demonstrate that GenTorrent achieves a latency reduction of over 50% compared to a baseline design without overlay forwarding. Furthermore, the security features introduce minimal overhead to serving latency and throughput. We believe this work pioneers a new direction for democratizing and scaling future AI serving capabilities.