Recent efforts in spoken dialogue modeling aim to synthesize spoken dialogue without first transcribing it, thereby preserving the rich non-textual information inherent in speech. However, this approach faces a challenge when speakers talk simultaneously: it requires stereo dialogue data with each speaker recorded on a separate channel, a notably scarce resource. To address this, we developed a pipeline that transforms single-channel dialogue data into pseudo-stereo data, expanding our training set from 2,000 to 17,600 hours and substantially increasing the diversity of available training examples. Including this pseudo-stereo data proved effective in improving the performance of spoken dialogue language models. Additionally, we explored the use of discrete units from different speech foundation models for spoken dialogue generation.
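The abstract does not detail how the single-to-pseudo-stereo conversion works. As a minimal sketch of one plausible approach, the snippet below routes each diarized speech segment of a mono recording into its own channel, producing a two-channel waveform. The function name `to_pseudo_stereo` and the assumption that speaker diarization segments are already available are hypothetical, not the authors' stated method.

```python
import numpy as np

def to_pseudo_stereo(mono, segments, sr=16000):
    """Build a pseudo-stereo signal from a mono dialogue recording.

    mono:     1-D float array of audio samples.
    segments: list of (start_sec, end_sec, speaker_id) tuples, with
              speaker_id in {0, 1}, e.g. from a diarization system
              (hypothetical input; the paper's pipeline may differ).
    Returns an array of shape (num_samples, 2), one channel per speaker.
    """
    stereo = np.zeros((len(mono), 2), dtype=mono.dtype)
    for start, end, spk in segments:
        a, b = int(start * sr), int(end * sr)
        # Copy this speaker's samples into their dedicated channel;
        # overlapping segments naturally land on separate channels.
        stereo[a:b, spk] = mono[a:b]
    return stereo

# Toy example: 2 seconds of audio, speaker 0 then speaker 1.
sr = 16000
mono = np.random.randn(2 * sr).astype(np.float32)
segments = [(0.0, 1.0, 0), (1.0, 2.0, 1)]
stereo = to_pseudo_stereo(mono, segments, sr)
```

Because each speaker occupies a dedicated channel, overlapping speech in the source mono signal would end up separated, which is exactly the property stereo training data provides.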