Mixtures of Experts (MoE) are known for their ability to learn complex conditional distributions with multiple modes. Despite this potential, they are difficult to train and often perform poorly, which explains their limited popularity. We hypothesize that this under-performance stems from the commonly used maximum likelihood (ML) optimization, which leads to mode averaging and makes training more likely to get stuck in local maxima. We propose a novel curriculum-based approach to learning mixture models in which each component of the MoE selects its own subset of the training data. This allows each component to be optimized independently, yielding a more modular architecture in which components can be added and deleted on the fly and an optimization that is less susceptible to local optima. The curricula can ignore data points from modes not represented by the MoE, which reduces the mode-averaging problem. To achieve good data coverage, we couple the optimization of the curricula with a joint entropy objective and optimize a lower bound of this objective. We evaluate our curriculum-based approach on a variety of multimodal behavior learning tasks and demonstrate its superiority over competing methods for learning MoE models and conditional generative models.
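The mode-averaging effect attributed to maximum likelihood above can be illustrated with a toy example. The sketch below is not the proposed curriculum method: it fits a single Gaussian component by maximum likelihood to hypothetical bimodal data, first on all samples and then on a hand-picked subset. The data, the function name `fit_gaussian_ml`, and the hard threshold used to select the subset are all assumptions made for illustration; the learned curricula described above would replace that hand-picked selection.

```python
import numpy as np


def fit_gaussian_ml(y):
    """Maximum-likelihood mean and standard deviation of a single Gaussian."""
    return float(np.mean(y)), float(np.std(y))


rng = np.random.default_rng(0)

# Hypothetical bimodal targets: two equally likely modes near -1 and +1.
modes = rng.choice([-1.0, 1.0], size=2000)
y = modes + 0.05 * rng.standard_normal(2000)

# One component fit to all data: the ML mean lands between the two modes
# and the variance is inflated to cover both (mode averaging).
mean_all, std_all = fit_gaussian_ml(y)

# The same component fit only to a selected subset (here, the positive mode,
# chosen by a hand-written threshold standing in for a learned curriculum).
mean_sub, std_sub = fit_gaussian_ml(y[y > 0])

print(f"all data:    mean={mean_all:+.2f}, std={std_all:.2f}")
print(f"subset only: mean={mean_sub:+.2f}, std={std_sub:.2f}")
```

When the component is restricted to a subset, its mean sits on one mode with a small spread, whereas the fit on all data places its mean in a region where almost no samples lie.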