Geometric Regularization in Mixture-of-Experts: The Disconnect Between Weights and Activations


Mixture-of-Experts (MoE) models achieve efficiency through sparse activation, but the role of geometric regularization in expert specialization remains unclear. We apply orthogonality loss to enforce expert diversity and find it fails on multiple fronts: it does not reduce weight-space overlap (MSO actually increases by up to 114%), activation-space overlap remains high (~0.6) regardless of regularization, and effects on performance are inconsistent -- marginal improvement on WikiText-103 (-0.9%), slight degradation on TinyStories (+0.9%), and highly variable results on PTB (std > 1.0). Our analysis across 7 regularization strengths reveals no significant correlation (r = -0.293, p = 0.523) between weight and activation orthogonality. These findings demonstrate that weight-space regularization neither achieves its geometric goal nor reliably improves performance, making it unsuitable for MoE diversity.
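To make the quantities in the abstract concrete, here is a minimal NumPy sketch of a weight-space overlap, an activation-space overlap, and the Pearson correlation between them. The abstract does not define MSO or the exact form of the orthogonality loss, so everything here is an assumption for illustration: overlap is taken to be the mean squared off-diagonal cosine similarity, and the names `pairwise_overlap`, `orthogonality_loss`, `activation_overlap`, and `pearson_r` are hypothetical, not the authors' code.

```python
import numpy as np


def pairwise_overlap(vectors: np.ndarray) -> float:
    """Mean squared off-diagonal cosine similarity between the rows of `vectors`.

    Assumed stand-in for an overlap metric: 0.0 means mutually orthogonal rows,
    1.0 means all rows are parallel.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero rows
    gram = unit @ unit.T                          # cosine similarity matrix
    n = gram.shape[0]
    off_diag = gram[~np.eye(n, dtype=bool)]       # drop self-similarity terms
    return float(np.mean(off_diag ** 2))


def orthogonality_loss(expert_weights: np.ndarray) -> float:
    """Weight-space penalty on experts of shape (num_experts, d_out, d_in).

    Each expert's weight matrix is flattened to one vector (an assumption).
    """
    flat = expert_weights.reshape(expert_weights.shape[0], -1)
    return pairwise_overlap(flat)


def activation_overlap(expert_outputs: np.ndarray) -> float:
    """Activation-space overlap for outputs of shape (num_experts, num_tokens, d).

    Overlap is computed between expert outputs for the same token, then
    averaged over tokens.
    """
    num_tokens = expert_outputs.shape[1]
    per_token = [pairwise_overlap(expert_outputs[:, t, :]) for t in range(num_tokens)]
    return float(np.mean(per_token))


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(4, 8, 8))   # 4 toy linear experts
    tokens = rng.normal(size=(16, 8))      # 16 toy input tokens
    # Apply every expert to every token: result has shape (experts, tokens, d_out).
    outputs = np.einsum("eij,tj->eti", weights, tokens)
    print("weight overlap:", orthogonality_loss(weights))
    print("activation overlap:", activation_overlap(outputs))
```

In a study like the one described, one would record both overlaps for each regularization strength and pass the two resulting lists to `pearson_r`. With only seven points, as in the abstract, a correlation of moderate magnitude can easily fail to reach significance.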

