We propose VoxtLM, a decoder-only language model that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM combines a text vocabulary with discrete speech tokens derived from self-supervised speech features, and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM shows a significant improvement in speech synthesis: speech intelligibility improves from 28.9 to 5.6 (lower is better) and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over its single-task counterparts. VoxtLM is trained on publicly available data, and the training recipes and model checkpoints will be open-sourced to make the work fully reproducible.
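The idea of a joint text and speech vocabulary with task-marking special tokens can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the VoxtLM implementation: the special-token names, the `<unit_i>` naming for discrete speech units, the task labels, and the helper functions `build_joint_vocab`, `format_example`, and `encode` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical special tokens; the names are assumptions for illustration only.
SPECIAL_TOKENS = [
    "<start-text>",
    "<start-speech>",
    "<generate-text>",
    "<generate-speech>",
]


def build_joint_vocab(text_tokens: List[str], num_speech_units: int) -> Dict[str, int]:
    """Map special tokens, text tokens, and discrete speech units into one ID space."""
    speech_tokens = [f"<unit_{i}>" for i in range(num_speech_units)]
    vocab: Dict[str, int] = {}
    for token in SPECIAL_TOKENS + text_tokens + speech_tokens:
        if token in vocab:
            raise ValueError(f"duplicate token: {token}")
        vocab[token] = len(vocab)
    return vocab


def format_example(task: str, text: List[str], speech_units: List[int]) -> List[str]:
    """Lay out one training sequence; special tokens tell the model which task it is."""
    speech = [f"<unit_{u}>" for u in speech_units]
    if task == "asr":  # speech condition, then text target
        return ["<start-speech>"] + speech + ["<generate-text>"] + text
    if task == "tts":  # text condition, then speech target
        return ["<start-text>"] + text + ["<generate-speech>"] + speech
    if task == "textlm":  # text-only language modeling
        return ["<generate-text>"] + text
    if task == "speechlm":  # speech-only language modeling (continuation)
        return ["<generate-speech>"] + speech
    raise ValueError(f"unknown task: {task}")


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Convert a formatted token sequence into integer IDs for a decoder-only LM."""
    return [vocab[token] for token in tokens]


vocab = build_joint_vocab(["he", "llo"], num_speech_units=3)
asr_ids = encode(format_example("asr", ["he", "llo"], [2, 0]), vocab)
tts_ids = encode(format_example("tts", ["he", "llo"], [2, 0]), vocab)
```

Because all four tasks share a single token sequence format, one decoder-only model can be trained on a mixture of them with an ordinary next-token objective; only the special tokens distinguish the tasks.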