The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often used in text-to-speech tasks for speech compression and portability: they are convenient for joint training with text and offer good compression efficiency. However, we find that discrete speech tokenizers still suffer from information loss. We therefore propose a simple yet effective continuous speech tokenizer and a text-to-speech model built on continuous speech tokens. Our results show that a speech language model based on the continuous speech tokenizer produces speech with better continuity and higher estimated Mean Opinion Scores (MOS). We attribute this improvement to the continuous tokenizer's higher information preservation rate across both low and high frequencies in the frequency domain.
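The frequency-domain information-loss argument can be illustrated with a toy experiment (this is a hedged sketch, not the paper's tokenizer or evaluation: the 16-level scalar quantizer, the test tone frequencies, and the band-limited spectrum error are all illustrative assumptions). Quantizing a signal to a small discrete codebook distorts its spectrum, while a continuous representation preserves it:

```python
import numpy as np

# Toy signal with one low- and one high-frequency component
# (frequencies chosen for illustration only).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 6000 * t)

# Stand-in "discrete tokenizer": snap each sample to the nearest of
# 16 codebook levels (a crude scalar quantizer, for illustration).
levels = np.linspace(signal.min(), signal.max(), 16)
discrete = levels[np.argmin(np.abs(signal[:, None] - levels[None, :]), axis=1)]

# Stand-in "continuous tokenizer": keep the real-valued samples
# (lossless here by construction).
continuous = signal.copy()

def band_error(ref, est, lo, hi):
    """Mean magnitude-spectrum error restricted to the band [lo, hi) Hz."""
    freqs = np.fft.rfftfreq(len(ref), 1 / sr)
    band = (freqs >= lo) & (freqs < hi)
    diff = np.abs(np.abs(np.fft.rfft(ref)) - np.abs(np.fft.rfft(est)))
    return float(np.mean(diff[band]))

print("discrete,   low band: ", band_error(signal, discrete, 0, 1000))
print("discrete,   high band:", band_error(signal, discrete, 4000, 8000))
print("continuous, high band:", band_error(signal, continuous, 4000, 8000))
```

In this sketch the quantized version shows nonzero spectral error in both bands while the continuous version is exact, mirroring (in a much simplified setting) the abstract's claim about information preservation across low and high frequencies.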