The proficiency of self-supervised learning (SSL) in speech-related tasks has driven research into using discrete tokens for speech tasks such as recognition and translation; these tokens offer lower storage requirements and great potential for applying natural language processing techniques. However, these studies, mostly focused on a single task, faced challenges such as overfitting and performance degradation in speech recognition, often at the cost of performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve results comparable to systems trained on FBank features in speech recognition tasks and outperform mel-spectrogram features in speech synthesis on both subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available to facilitate research in this direction.