We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing in a single model. One set of weights handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition. The architecture rests on two core designs: (1) layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen multimodal large language model (MLLM) into the corresponding multimodal diffusion Transformer (MM-DiT) blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture in which task identity is encoded solely by a channel-wise mask and source audio is provided as VAE-encoded latents through channel concatenation. Training is stabilized by an online, GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M--732M trainable parameters, UNISON achieves results competitive with or exceeding those of task-specialist models across the evaluated domains, while being roughly $4\times$ smaller than comparable unified systems.
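The layer-to-block pairing behind design (1) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `uniform_layer_indices`, the endpoint-inclusive spacing, and the example sizes (32 MLLM layers, 8 MM-DiT blocks) are hypothetical, since the abstract states only that layers are uniformly sampled and matched by depth.

```python
from typing import List


def uniform_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> List[int]:
    """Pick one LLM layer index per DiT block, evenly spaced over the LLM depth.

    Assumption (not from the paper): spacing includes both the first and the
    last LLM layer, and a single block is paired with the last layer.
    """
    if num_llm_layers < 1 or num_dit_blocks < 1:
        raise ValueError("layer and block counts must be positive")
    if num_dit_blocks > num_llm_layers:
        raise ValueError("cannot sample more distinct layers than the LLM has")
    if num_dit_blocks == 1:
        return [num_llm_layers - 1]
    # Integer arithmetic avoids floating-point rounding surprises.
    return [
        (block * (num_llm_layers - 1)) // (num_dit_blocks - 1)
        for block in range(num_dit_blocks)
    ]


if __name__ == "__main__":
    # Hypothetical sizes: a 32-layer frozen MLLM feeding 8 MM-DiT blocks.
    for block, layer in enumerate(uniform_layer_indices(32, 8)):
        print(f"MM-DiT block {block} <- MLLM hidden state from layer {layer}")
```

In a full model, each selected hidden state would pass through its own learned projection before entering the paired block. Because the spacing is monotone, shallow blocks receive shallow LLM features and deep blocks receive deep ones, which is one reading of "depth-matched" conditioning.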