Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge: music and speech generation are often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model built on a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamically allocating the number of active experts, together with a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages the original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert on a subset of the balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architectures and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
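The abstract does not give implementation details for the Top-P routing strategy or the null-expert mechanism, but the general idea admits a compact sketch. The snippet below is a minimal, hypothetical NumPy illustration of one plausible reading: per token, the router keeps the smallest set of experts whose cumulative probability reaches a threshold `p` (so the number of active experts varies by token), and any selected index beyond the routed experts is treated as a null expert whose computation is simply skipped. All function names, the split between routed and null experts, and the renormalization choice are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def top_p_route(logits, p=0.7):
    """Per token, select the smallest expert set whose cumulative
    routing probability reaches p (nucleus-style selection), so the
    number of active experts adapts to the router's confidence."""
    probs = softmax(logits)
    order = np.argsort(-probs, axis=-1)               # experts by descending prob
    sorted_p = np.take_along_axis(probs, order, axis=-1)
    cum = np.cumsum(sorted_p, axis=-1)
    keep = (cum - sorted_p) < p                       # include the expert that crosses p
    weights = np.where(keep, sorted_p, 0.0)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # renormalize kept mass
    return order, weights

def moe_forward(x, logits, routed_fns, n_routed, p=0.7):
    """Mix the outputs of Top-P-selected routed experts. Indices
    >= n_routed are null experts: they contribute nothing, so a token
    routed mostly to them skips computation (a full layer would also
    add an always-on shared-expert path for domain-agnostic features)."""
    order, weights = top_p_route(logits, p)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(order.shape[1]):
            e, w = order[t, j], weights[t, j]
            if w > 0.0 and e < n_routed:              # skip null experts
                out[t] += w * routed_fns[e](x[t])
    return out
```

A confident router (one dominant logit) activates a single expert, while a uniform router spreads tokens across several, which is the "dynamic capacity" behavior the abstract describes.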