Mixture-of-Experts (MoE) architectures decompose prediction tasks into specialized expert sub-networks selected by a gating mechanism. This letter adopts a communication-theoretic view of MoE gating, modeling the gate as a stochastic channel operating under a finite information rate. Within an information-theoretic learning framework, we specialize a mutual-information generalization bound and develop a rate-distortion characterization $D(R_g)$ of finite-rate gating, where $R_g := I(X; T)$, yielding (under a standard empirical rate-distortion optimality condition) $\mathbb{E}[R(W)] \le D(R_g) + \delta_m + \sqrt{(2/m)\, I(S; W)}$. The analysis yields capacity-aware limits for communication-constrained MoE systems, and numerical simulations on synthetic multi-expert models empirically confirm the predicted trade-offs between gating rate, expressivity, and generalization.
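To make the rate-distortion trade-off concrete, the following is a minimal numerical sketch, assuming a toy $K$-ary symmetric gating channel with flip probability $\varepsilon$ and a 0-1 expert-mismatch distortion; the channel model and function names (`symmetric_channel_mi`, `expected_distortion`) are illustrative assumptions for this sketch, not the letter's construction. Sweeping $\varepsilon$ traces out $(R_g, D)$ pairs in which distortion falls as the gating rate $I(X;T)$ grows, mirroring the qualitative shape of $D(R_g)$.

```python
import numpy as np

def symmetric_channel_mi(K: int, eps: float) -> float:
    """I(X;T) in bits for an assumed K-ary symmetric gating channel:
    T = X w.p. 1-eps, otherwise uniform over the other K-1 experts.
    With X uniform, I(X;T) = log2(K) - H(T|X)."""
    if eps == 0.0:
        return np.log2(K)
    p_stay = 1.0 - eps
    p_move = eps / (K - 1)
    # H(T|X) is the entropy of one channel row: (p_stay, p_move, ..., p_move).
    h_cond = -(p_stay * np.log2(p_stay) + (K - 1) * p_move * np.log2(p_move))
    return np.log2(K) - h_cond

def expected_distortion(eps: float) -> float:
    """Assumed 0-1 mismatch distortion: loss 1 whenever the gate routes
    to a non-matching expert, which happens with probability eps."""
    return eps

if __name__ == "__main__":
    K = 8  # number of contexts/experts in the toy model
    print(f"{'eps':>6} {'R_g = I(X;T) [bits]':>20} {'distortion D':>14}")
    for eps in [0.0, 0.05, 0.1, 0.2, 0.4, 0.6, (K - 1) / K]:
        print(f"{eps:6.3f} {symmetric_channel_mi(K, eps):20.4f} "
              f"{expected_distortion(eps):14.3f}")
```

At $\varepsilon = (K-1)/K$ the gate output is independent of the input ($R_g = 0$) and distortion is maximal, while $\varepsilon = 0$ attains the full rate $\log_2 K$ with zero distortion; intermediate values sketch the monotone trade-off the bound formalizes.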