As LLM deployments scale across more hardware, the probability of a failure somewhere in the system increases significantly, and cloud operators must adopt robust countermeasures to handle these inevitable failures. A common recovery approach is to simply restart the LLM serving instance; however, this is costly in model-as-a-service (MaaS) inference settings, where reloading model weights and recompiling computation graphs introduce significant delays for incoming requests. We propose ReviveMoE, a method for rapid failure recovery in large-scale LLM deployments that avoids restarting the serving instance. ReviveMoE supports both the traditional LLM architecture, which co-locates MoE and attention on the same hardware, and disaggregated architectures, which separate MoE from attention. Integrated into Huawei Cloud's MaaS, ReviveMoE is built on top of Huawei's xDeepServe serving platform and the XCCL communications library.