Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge: music and speech generation are often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model built on a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamically allocating the number of active experts, together with a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages the original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert on a subset of the balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architectures and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
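The abstract does not give implementation details for the Top-P routing strategy or the null-expert mechanism, but the general idea admits a compact sketch. The snippet below is a minimal, hypothetical NumPy illustration of one plausible reading: per token, the router keeps the smallest set of experts whose cumulative probability reaches a threshold `p` (so the number of active experts varies by token), and any selected index beyond the routed experts is treated as a null expert whose computation is simply skipped. All function names, the split between routed and null experts, and the renormalization choice are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def top_p_route(logits, p=0.7):
    """Per token, select the smallest expert set whose cumulative
    routing probability reaches p (nucleus-style selection), so the
    number of active experts adapts to the router's confidence."""
    probs = softmax(logits)
    order = np.argsort(-probs, axis=-1)               # experts by descending prob
    sorted_p = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_p, axis=-1)
    keep = (cum - sorted_p) < p                       # include the expert that crosses p
    weights = np.where(keep, sorted_p, 0.0)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # renormalize kept mass
    return order, weights

def moe_forward(x, logits, routed_fns, n_routed, p=0.7):
    """Mix the outputs of Top-P-selected routed experts. Indices
    >= n_routed are null experts: they contribute nothing, so a token
    routed mostly to them skips computation (a full layer would also
    add an always-on shared-expert path for domain-agnostic features)."""
    order, weights = top_p_route(logits, p)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(order.shape[1]):
            e, w = order[t, j], weights[t, j]
            if w > 0.0 and e < n_routed:              # skip null experts
                out[t] += w * routed_fns[e](x[t])
    return out
```

A confident router (one dominant logit) activates a single expert, while a uniform router spreads tokens across several, which is the "dynamic capacity" behavior the abstract describes.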