Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
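As a rough illustration of the two quantities TTSD-eval is said to measure, the sketch below scores a dialogue whose segments have already been cut out by forced alignment and labeled with the speaker tag from the script. It is not the authors' implementation: the function names, the use of precomputed speaker embeddings, and the nearest-reference decision rule based on cosine similarity are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def evaluate_dialogue(
    segments: List[Tuple[str, Sequence[float]]],
    references: Dict[str, Sequence[float]],
) -> Tuple[float, float]:
    """Hypothetical scoring routine (assumed, not from the paper).

    segments:   (scripted speaker tag, embedding of the aligned audio segment)
    references: speaker tag -> embedding of that speaker's reference clip

    Returns (speaker attribution accuracy, mean similarity to the scripted
    speaker's reference).
    """
    if not segments:
        raise ValueError("at least one aligned segment is required")

    correct = 0
    similarity_total = 0.0
    for scripted_speaker, embedding in segments:
        # Compare the segment against every reference voice.
        scores = {name: cosine(embedding, ref) for name, ref in references.items()}
        # Assumed decision rule: the segment is attributed to the closest reference.
        predicted = max(scores, key=scores.get)
        correct += int(predicted == scripted_speaker)
        similarity_total += scores[scripted_speaker]

    count = len(segments)
    return correct / count, similarity_total / count
```

Because the segment boundaries and speaker labels come from the script via alignment, no diarization step appears anywhere in this sketch; a real pipeline would still need an aligner and a speaker embedding model, both of which are outside its scope.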