High-quality speech dialogue datasets are crucial for Speech-LLM development, yet existing acquisition methods face significant limitations. Human recordings are costly and raise privacy concerns, while synthetic approaches often lack conversational authenticity. To address these challenges, we introduce \textsc{SpeechDialogueFactory}, a production-ready framework for efficiently generating natural speech dialogues. Its pipeline comprises four stages: metadata generation, dialogue scripting, paralinguistic-enriched utterance simulation, and natural speech synthesis with voice cloning. The system also provides an interactive UI for detailed sample inspection and a high-throughput batch synthesis mode. Evaluations show that dialogues generated by our system achieve quality comparable to that of human recordings while significantly reducing production costs. We release our work as an open-source toolkit, together with example datasets in English and Chinese, to support researchers and developers working on Speech-LLMs.