The proficiency of self-supervised learning (SSL) in speech-related tasks has driven research into using discrete tokens for speech tasks such as recognition and translation. Discrete tokens offer lower storage requirements and great potential for applying natural language processing techniques. However, these studies, which mainly focus on a single task, face challenges such as overfitting and performance degradation in speech recognition, and their gains often come at the cost of performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models on speech recognition and synthesis tasks, with the aim of exploring the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve results comparable to systems trained on FBank features in speech recognition, and outperform mel-spectrogram features in speech synthesis on both subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available at https://github.com/k2-fsa/icefall.