The mixture-of-experts (MoE) architecture has proven to be a powerful method for training deep models on diverse tasks across many applications. However, current MoE implementations are task-agnostic: they treat all tokens in the same manner, regardless of which task they come from. In this work, we instead design a novel method that incorporates task information into MoE models at different levels of granularity, using shared dynamic task-based adapters. Our experiments and analysis show the advantages of our approach over dense and canonical MoE models on multi-task multilingual machine translation. With task-specific adapters, our models can additionally generalize to new tasks efficiently.
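To make the idea of task-aware routing with task-based adapters concrete, the following is a minimal NumPy sketch, not the implementation evaluated in this work. It assumes top-1 token-level routing, a learned task embedding added to the router input, and a residual bottleneck adapter selected by task ID; all names, sizes, and design choices here (`task_emb`, `adapter_down`, `task_aware_moe`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
D_MODEL, D_HIDDEN, D_ADAPTER = 8, 16, 4
N_EXPERTS, N_TASKS = 4, 3

# Hypothetical randomly initialised parameters.
task_emb = rng.normal(size=(N_TASKS, D_MODEL)) * 0.1            # one embedding per task
router_w = rng.normal(size=(D_MODEL, N_EXPERTS)) * 0.1          # router projection
expert_w1 = rng.normal(size=(N_EXPERTS, D_MODEL, D_HIDDEN)) * 0.1
expert_w2 = rng.normal(size=(N_EXPERTS, D_HIDDEN, D_MODEL)) * 0.1
adapter_down = rng.normal(size=(N_TASKS, D_MODEL, D_ADAPTER)) * 0.1
adapter_up = rng.normal(size=(N_TASKS, D_ADAPTER, D_MODEL)) * 0.1


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def task_aware_moe(tokens, task_id):
    """Apply a task-conditioned MoE layer to `tokens` of shape (n_tokens, D_MODEL)."""
    # Assumption: the router sees the token plus its task embedding,
    # so tokens from different tasks can be routed differently.
    gate_probs = softmax((tokens + task_emb[task_id]) @ router_w)
    expert_idx = gate_probs.argmax(axis=-1)  # top-1 routing

    out = np.zeros_like(tokens)
    for e in range(N_EXPERTS):
        mask = expert_idx == e
        if not mask.any():
            continue
        hidden = np.maximum(tokens[mask] @ expert_w1[e], 0.0)  # ReLU feed-forward expert
        out[mask] = gate_probs[mask, e][:, None] * (hidden @ expert_w2[e])

    # Assumption: a small residual bottleneck adapter, selected by task ID,
    # is applied after the experts.
    adapted = np.maximum(out @ adapter_down[task_id], 0.0) @ adapter_up[task_id]
    return out + adapted, expert_idx


tokens = rng.normal(size=(5, D_MODEL))
output, chosen_experts = task_aware_moe(tokens, task_id=1)
print(output.shape, chosen_experts)
```

Under these assumptions, supporting a new task would amount to adding one row to `task_emb` and one pair of adapter matrices while leaving the experts untouched, which is one plausible reading of how task-specific adapters could make generalization to new tasks inexpensive.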