Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model, eliminating reliance on intermediate token representations. To better capture realistic conversational dynamics, we propose transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking strategies. Our approach achieves state-of-the-art performance, outperforming strong baselines such as MoonCast and Sesame in speech quality, speaker consistency, and inference speed. Notably, CoVoMix2 operates without requiring transcriptions for the prompt and supports controllable dialogue generation, including overlapping speech and precise timing control, demonstrating strong generalizability to real-world speech generation scenarios.
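The abstract states that CoVoMix2 uses a flow-matching-based generative model to predict mel-spectrograms directly from transcription-derived conditioning. As background, the sketch below shows the standard conditional flow-matching training objective that such models typically optimize: regress a velocity field along a linear path from noise to data. The names `VelocityNet`, `mel_dim`, `cond_dim`, and the toy MLP architecture are illustrative assumptions for this sketch, not details from the paper.

```python
# Minimal sketch of a conditional flow-matching training step.
# Assumptions: a linear (rectified-flow) probability path and a toy
# velocity network; none of these specifics are taken from CoVoMix2.
import torch
import torch.nn as nn


class VelocityNet(nn.Module):
    """Toy stand-in for a non-autoregressive mel-spectrogram predictor."""

    def __init__(self, mel_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, mel_dim),
        )

    def forward(self, x_t, cond, t):
        # Broadcast the per-utterance time t over all frames, then
        # concatenate noisy mels, conditioning, and time.
        t = t[:, None, :].expand(-1, x_t.size(1), -1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))


def flow_matching_loss(model, mel, cond):
    """Conditional flow matching on the linear path
    x_t = (1 - t) * x_0 + t * x_1, with target velocity x_1 - x_0."""
    x1 = mel                              # data: mel-spectrogram frames
    x0 = torch.randn_like(x1)             # noise endpoint of the path
    t = torch.rand(x1.size(0), 1, device=x1.device)
    x_t = (1 - t[:, :, None]) * x0 + t[:, :, None] * x1
    v_pred = model(x_t, cond, t)
    return ((v_pred - (x1 - x0)) ** 2).mean()


# Usage: batch of 4 utterances, 100 frames, 80-bin mels, 32-dim conditioning.
model = VelocityNet(mel_dim=80, cond_dim=32)
mel = torch.randn(4, 100, 80)
cond = torch.randn(4, 100, 32)
loss = flow_matching_loss(model, mel, cond)
loss.backward()
```

At inference time, a model trained this way generates spectrograms by integrating the learned velocity field from noise to data with an ODE solver in a fixed number of steps, which is what makes fully non-autoregressive generation fast relative to token-by-token decoding.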