Mixture of experts (MoE), introduced over 20 years ago, is the simplest gated modular neural network architecture. Interest in MoE has been renewed because its conditional computation uses only part of the network for each inference, as recently demonstrated in large-scale natural language processing models. MoE is also of potential interest for continual learning, since existing experts may be reused for new tasks and new experts can be introduced. The gate in the MoE architecture learns a task decomposition, and the individual experts learn simpler functions appropriate to that decomposition. In this paper: (1) we show that the original MoE architecture and its training method guarantee neither intuitive task decompositions nor good expert utilization, and that they can fail badly even on simple datasets such as MNIST and FashionMNIST; (2) we introduce a novel gating architecture, similar to attention, that improves performance and yields a lower-entropy task decomposition; and (3) we introduce a novel data-driven regularization that improves expert specialization. We empirically validate our methods on the MNIST, FashionMNIST and CIFAR-100 datasets.
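To make the gate and expert roles described above concrete, the following is a minimal sketch of a classic softmax-gated MoE forward pass. It assumes linear experts with scalar outputs and a linear gate, and the names `moe_forward`, `moe_forward_top1` and `gate_entropy` are hypothetical, chosen for illustration only. It does not implement the attention-like gate or the data-driven regularization proposed in this paper, and the per-sample gate entropy shown is just one possible way to quantify how sharp a decomposition is.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def moe_forward(x, gate_weights, expert_weights):
    """Dense MoE: every expert runs, outputs are mixed by gate probabilities."""
    gate_probs = softmax([dot(w, x) for w in gate_weights])
    expert_outputs = [dot(w, x) for w in expert_weights]
    output = sum(p * y for p, y in zip(gate_probs, expert_outputs))
    return output, gate_probs


def moe_forward_top1(x, gate_weights, expert_weights):
    """Conditional computation: run only the expert the gate ranks highest."""
    gate_probs = softmax([dot(w, x) for w in gate_weights])
    k = max(range(len(gate_probs)), key=gate_probs.__getitem__)
    return dot(expert_weights[k], x), k


def gate_entropy(gate_probs):
    """Shannon entropy (in nats) of one gate distribution; 0 is a hard choice."""
    return -sum(p * math.log(p) for p in gate_probs if p > 0.0)


# Illustrative (assumed) parameters: two experts, two input features.
x = [1.0, 0.0]
expert_weights = [[2.0, 0.0], [4.0, 0.0]]
uniform_gate = [[0.0, 0.0], [0.0, 0.0]]   # gate cannot tell experts apart
sharp_gate = [[10.0, 0.0], [0.0, 0.0]]    # gate strongly prefers expert 0

dense_out, uniform_probs = moe_forward(x, uniform_gate, expert_weights)
_, sharp_probs = moe_forward(x, sharp_gate, expert_weights)
sparse_out, chosen = moe_forward_top1(x, sharp_gate, expert_weights)
```

With the uniform gate both experts contribute equally and the gate entropy is at its maximum for two experts, while the sharp gate routes almost all weight to a single expert. The top-1 variant evaluates only that expert, which is the kind of saving conditional computation refers to.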