Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_g)$ of finite-rate gating, where $R_g := I(X; T)$. Under a standard empirical rate-distortion optimality condition, this yields $\mathbb{E}[R(W)] \le D(R_g) + \delta_m + \sqrt{(2/m)\, I(S; W)}$. The analysis gives capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
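To make the quantities in the bound concrete, the following is a minimal sketch that evaluates the gating rate $R_g = I(X; T)$ for a discrete input and a stochastic gate, and then assembles the right-hand side $D(R_g) + \delta_m + \sqrt{(2/m)\, I(S; W)}$. It is an illustration under stated assumptions, not the letter's simulation code: the function names are hypothetical, $X$ is taken to be a finite-alphabet input with distribution `p_x`, the gate is given as a row-stochastic matrix `p_t_given_x`, information is measured in nats, and the values of $D(R_g)$, $\delta_m$, and $I(S; W)$ are supplied by the caller rather than derived.

```python
import numpy as np


def gating_rate(p_x, p_t_given_x):
    """Mutual information I(X; T) in nats for a discrete input X and gate output T.

    p_x:          shape (n_inputs,), distribution over inputs (assumed finite alphabet).
    p_t_given_x:  shape (n_inputs, n_experts), each row is the gate's p(t | x).
    """
    p_x = np.asarray(p_x, dtype=float)
    p_t_given_x = np.asarray(p_t_given_x, dtype=float)
    joint = p_x[:, None] * p_t_given_x        # p(x, t)
    p_t = joint.sum(axis=0)                   # marginal p(t)
    indep = p_x[:, None] * p_t[None, :]       # p(x) p(t)
    mask = joint > 0                          # skip zero-probability cells (0 log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))


def generalization_term(mi_sw, m):
    """The term sqrt((2/m) * I(S; W)) for m samples; mi_sw is a caller-supplied value."""
    return float(np.sqrt(2.0 * mi_sw / m))


def risk_bound(distortion, delta_m, mi_sw, m):
    """Right-hand side D(R_g) + delta_m + sqrt((2/m) I(S; W)); all inputs are assumed given."""
    return distortion + delta_m + generalization_term(mi_sw, m)


# Four equiprobable inputs routed to four experts.
p_x = np.full(4, 0.25)
deterministic_gate = np.eye(4)           # each input always goes to its own expert
uninformative_gate = np.full((4, 4), 0.25)  # routing ignores the input

rate_det = gating_rate(p_x, deterministic_gate)
rate_unif = gating_rate(p_x, uninformative_gate)
```

In this toy setting the deterministic gate attains the maximal rate $\log 4 \approx 1.386$ nats, while the input-independent gate has rate zero; a finite-rate gate sits between these two extremes.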