Recent advances in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, remains a challenge. In this paper, we introduce CoVoMix: Conversational Voice Mixture Generation, a novel model for zero-shot, human-like, multi-speaker, multi-round dialogue speech generation. CoVoMix first converts dialogue text into multiple streams of discrete tokens, with each stream representing the semantic information of an individual talker. These token streams are then fed into a flow-matching-based acoustic model to generate mixed mel-spectrograms. Finally, the speech waveforms are produced by a HiFi-GAN vocoder. Furthermore, we devise a comprehensive set of metrics for measuring the effectiveness of dialogue modeling and generation. Our experimental results show that CoVoMix generates dialogues that are not only human-like in their naturalness and coherence but also involve multiple talkers engaging in multiple rounds of conversation. This is exemplified by single-channel instances in which one speaker's utterance is seamlessly mixed with another's interjections or laughter, indicating the latter's role as an attentive listener. Audio samples are available at https://aka.ms/covomix.