Optimal Transport Aggregation for Distributed Mixture-of-Experts

Mixture-of-experts (MoE) models provide a flexible statistical framework for modeling heterogeneity and nonlinear relationships. In many modern applications, however, datasets are naturally distributed across multiple machines due to storage, computational, or governance constraints. We consider a distributed model aggregation setting in which local MoE models are trained independently on decentralized datasets and subsequently combined into a global estimator. Aggregating MoE models is challenging because standard averaging produces models that do not preserve the MoE structure and therefore do not yield estimates of the global model parameters. To address this issue, we propose a principled aggregation framework based on optimal transport that constructs a reduced global MoE estimator by minimizing a transportation divergence between the collection of local estimators and the aggregated model. An efficient majorization-minimization (MM) algorithm is derived to solve the resulting optimization problem. The method requires only a single communication step from the local machines to a central server, making it a frugal distributed learning approach that is particularly attractive for large-scale settings where communication costs are a major bottleneck. We further establish statistical guarantees for the aggregated estimator, including consistency under standard assumptions on the local estimators. Experiments on synthetic and real datasets demonstrate that the approach achieves performance comparable to centralized training while significantly reducing computation time. The source code is publicly available on GitHub.
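
To make the reduction idea concrete, here is a minimal sketch of a transport-style aggregation step. It is deliberately simplified and is not the authors' algorithm: each component is assumed to be described by a single scalar parameter, the transport cost is assumed to be the squared distance between parameters (an actual MoE component would carry gating and expert parameters, compared through a divergence), and the names `reduce_components`, `local_models`, and `init` are hypothetical. The pooled components from all machines are assigned to the cheapest reduced component, and each reduced component is then updated as the weighted barycenter of the mass it receives.

```python
def reduce_components(weights, params, init, n_iter=50):
    """Reduce pooled weighted components to len(init) components.

    Simplifying assumptions (not from the paper): scalar parameters and a
    squared-distance transport cost.
    weights: pooled mixing weights, summing to 1
    params:  one scalar parameter per pooled component
    init:    starting parameters for the reduced components
    """
    centers = list(init)
    new_weights = [0.0] * len(centers)
    for _ in range(n_iter):
        # Assignment step: with only the pooled-side marginal constrained,
        # the optimal plan sends each pooled component entirely to its
        # lowest-cost reduced component.
        assign = [
            min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
            for p in params
        ]
        # Update step: each reduced weight is the mass it received, and each
        # reduced parameter is the weighted barycenter of that mass.
        new_centers = []
        new_weights = []
        for k in range(len(centers)):
            mass = sum(w for w, a in zip(weights, assign) if a == k)
            new_weights.append(mass)
            if mass > 0:
                total = sum(
                    w * p for w, p, a in zip(weights, params, assign) if a == k
                )
                new_centers.append(total / mass)
            else:
                # Keep an empty component where it was.
                new_centers.append(centers[k])
        if new_centers == centers:
            break  # the assignment and the barycenters have stabilized
        centers = new_centers
    return new_weights, centers


# Two hypothetical local models with equal sample sizes, two components each:
# (mixing weights, scalar parameters).
local_models = [([0.5, 0.5], [-1.1, 2.0]), ([0.4, 0.6], [-0.9, 2.2])]
share = 1.0 / len(local_models)

# Pool all local components, scaling each weight by its machine's data share.
pooled_weights = [share * w for ws, _ in local_models for w in ws]
pooled_params = [p for _, ps in local_models for p in ps]

# Initialize the reduced model from the first local model.
reduced_weights, reduced_params = reduce_components(
    pooled_weights, pooled_params, init=local_models[0][1]
)
print(reduced_weights, reduced_params)
```

In this toy setting the local machines only need to send their fitted weights and parameters once, and the server does all remaining work. The pooled model has four components, while the reduced model keeps two, which is the sense in which the aggregate preserves the original model structure.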
