We propose VoxtLM, a decoder-only language model that performs four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM combines a text vocabulary with discrete speech tokens derived from self-supervised speech features, and uses special tokens to enable multitask learning. Compared with a single-task model, VoxtLM substantially improves speech synthesis: speech intelligibility improves from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over its single-task counterparts. VoxtLM is trained on publicly available data, and the training recipes and model checkpoints are open-sourced so that the work is fully reproducible.
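To illustrate how a text vocabulary, discrete speech tokens, and special tokens might share one decoder-only input space, the following is a minimal sketch rather than the authors' implementation. The vocabulary sizes, the special-token names, the ID layout, and the per-task sequence formats are all assumptions made for this example; the abstract does not specify them.

```python
from typing import List, Optional

# All sizes and token names below are hypothetical, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000    # e.g. number of subword units
SPEECH_VOCAB_SIZE = 200   # e.g. number of clusters over self-supervised features
SPECIAL_TOKENS = ["<start-text>", "<generate-text>",
                  "<start-speech>", "<generate-speech>"]

# Assumed joint ID layout: [special tokens | text tokens | speech tokens]
SPECIAL_ID = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
TEXT_OFFSET = len(SPECIAL_TOKENS)
SPEECH_OFFSET = TEXT_OFFSET + TEXT_VOCAB_SIZE
JOINT_VOCAB_SIZE = SPEECH_OFFSET + SPEECH_VOCAB_SIZE


def text_ids(tokens: List[int]) -> List[int]:
    """Map text-token indices into the joint vocabulary."""
    return [TEXT_OFFSET + t for t in tokens]


def speech_ids(units: List[int]) -> List[int]:
    """Map discrete speech-unit indices into the joint vocabulary."""
    return [SPEECH_OFFSET + u for u in units]


def format_example(task: str,
                   text: Optional[List[int]] = None,
                   speech: Optional[List[int]] = None) -> List[int]:
    """Build one training sequence; special tokens mark the condition and the target."""
    if task == "asr":        # speech condition, text target
        return ([SPECIAL_ID["<start-speech>"]] + speech_ids(speech)
                + [SPECIAL_ID["<generate-text>"]] + text_ids(text))
    if task == "tts":        # text condition, speech target
        return ([SPECIAL_ID["<start-text>"]] + text_ids(text)
                + [SPECIAL_ID["<generate-speech>"]] + speech_ids(speech))
    if task == "textlm":     # unconditional text generation
        return [SPECIAL_ID["<generate-text>"]] + text_ids(text)
    if task == "speechlm":   # unconditional speech continuation
        return [SPECIAL_ID["<generate-speech>"]] + speech_ids(speech)
    raise ValueError(f"unknown task: {task}")


asr_sequence = format_example("asr", text=[5, 7], speech=[3, 3, 9])
tts_sequence = format_example("tts", text=[5, 7], speech=[3, 3, 9])
```

Because every task is reduced to a single token sequence over one vocabulary, a single next-token prediction objective can cover all four tasks in this sketch; the special tokens are what tell the model which modality to condition on and which to generate.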