Sparsely activated neural networks with conditional computation learn to route their inputs through different "expert" subnetworks, providing a form of modularity that densely activated models lack. Despite their possible benefits, models with learned routing often underperform their parameter-matched densely activated counterparts as well as models that use non-learned heuristic routing strategies. In this paper, we hypothesize that these shortcomings stem from the gradient estimation techniques used to train sparsely activated models that use non-differentiable discrete routing decisions. To address this issue, we introduce Soft Merging of Experts with Adaptive Routing (SMEAR), which avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters. By routing activations through a single merged expert, SMEAR does not incur a significant increase in computational costs and enables standard gradient-based training. We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation. Furthermore, we provide qualitative analysis demonstrating that the experts learned via SMEAR exhibit a significant amount of specialization. All of the code used in our experiments is publicly available.
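To make the merging step concrete, the following is a minimal sketch of the idea described above: a router produces a probability distribution over experts, the experts' parameters are averaged under that distribution, and the input passes through the single merged expert. The abstract does not specify the router or the expert architecture, so the linear-plus-softmax router, the single-matrix linear experts, and all function names here (`router_probs`, `merge_experts`, `smear_layer`) are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def router_probs(router_W, x):
    """Routing distribution over experts.

    Assumption: a linear router followed by softmax; the abstract only
    says the routing is learned and adaptive.
    """
    return softmax(matvec(router_W, x))


def merge_experts(experts, probs):
    """Weighted average of the experts' parameters.

    Assumption: every expert is a single weight matrix of the same shape.
    """
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(p * E[r][c] for p, E in zip(probs, experts)) for c in range(cols)]
        for r in range(rows)
    ]


def smear_layer(x, router_W, experts):
    """Route x through one merged expert instead of picking a discrete one."""
    probs = router_probs(router_W, x)
    merged = merge_experts(experts, probs)
    # Only one expert-sized forward pass is needed, and every step above is
    # differentiable with respect to both the router and the expert weights.
    return matvec(merged, x), probs
```

For the purely linear experts used in this sketch, the merged expert's output coincides with the probability-weighted sum of the individual experts' outputs; with nonlinear experts the two generally differ, and merging parameters still costs only one expert-sized forward pass rather than one per expert.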