Mixture-of-Experts (MoE) models achieve efficiency through sparse activation, but the role of geometric regularization in expert specialization remains unclear. We apply an orthogonality loss intended to enforce expert diversity and find that it fails on multiple fronts: it does not reduce weight-space overlap (MSO actually increases by up to 114%), activation-space overlap remains high (~0.6) regardless of regularization, and effects on performance are inconsistent, with a marginal improvement on WikiText-103 (-0.9%), a slight degradation on TinyStories (+0.9%), and highly variable results on PTB (std > 1.0). Our analysis across 7 regularization strengths reveals no significant correlation (r = -0.293, p = 0.523) between weight and activation orthogonality. These findings demonstrate that weight-space regularization neither achieves its geometric goal nor reliably improves performance, making it unsuitable for promoting expert diversity in MoE models.
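For concreteness, the sketch below shows one common way such an orthogonality loss is formed over expert weight matrices: penalizing off-diagonal entries of the Gram matrix of flattened, normalized expert weights. This is an illustrative assumption, not the paper's exact formulation; the function name, the normalization choice, and the hypothetical `lambda_ortho` weighting are all ours.

```python
import torch

def orthogonality_loss(expert_weights: list[torch.Tensor]) -> torch.Tensor:
    """Penalize pairwise overlap between flattened expert weight matrices.

    Illustrative sketch only: the paper's exact loss may differ.
    """
    # Flatten each expert's weights into a row vector and L2-normalize.
    flat = torch.stack([w.reshape(-1) for w in expert_weights])  # (E, D)
    flat = torch.nn.functional.normalize(flat, dim=1)
    # Gram matrix of cosine similarities between experts.
    gram = flat @ flat.T                                          # (E, E)
    eye = torch.eye(gram.size(0), device=gram.device)
    # Off-diagonal entries measure expert overlap; zero means orthogonal experts.
    return ((gram - eye) ** 2).sum() / (gram.numel() - gram.size(0))

# Hypothetical usage: add the penalty to the language-modeling objective.
# total_loss = lm_loss + lambda_ortho * orthogonality_loss([e.weight for e in experts])
```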