Discretizing speech into tokens and generating them with a decoder-only model has become a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten speech token sequences, acoustic byte-pair encoding (BPE) has emerged in SLM: it treats speech tokens derived from self-supervised semantic representations as characters and merges them to further compress the token sequence. However, its benefit for TTS has not been fully investigated, and the proper choice of acoustic BPE settings remains unclear. In this work, we conduct a comprehensive study of various acoustic BPE settings to explore its effectiveness in decoder-only TTS models with semantic speech tokens. Experiments on LibriTTS show that acoustic BPE consistently improves the intelligibility and diversity of synthesized speech, although its effects differ across BPE settings. Hence, acoustic BPE is a favorable tool for decoder-only TTS.
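To make the idea of treating speech tokens as characters concrete, the following is a minimal sketch of BPE merge learning over sequences of integer speech-token IDs. It is an illustration under stated assumptions, not the implementation used in this work: the function names `merge_pair` and `learn_acoustic_bpe`, the toy token IDs, and the stopping rule (stop when no pair occurs at least twice) are all hypothetical.

```python
from collections import Counter


def merge_pair(seq, a, b, new_id):
    """Replace each non-overlapping occurrence of (a, b) in seq with new_id."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_acoustic_bpe(sequences, num_merges):
    """Greedily learn up to num_merges merges over integer token sequences.

    Hypothetical helper: each input sequence stands in for the discrete
    semantic tokens of one utterance. New IDs are allocated above the
    largest ID seen in the input.
    """
    seqs = [list(s) for s in sequences]
    next_id = max(t for s in seqs for t in s) + 1
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across all sequences.
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # assumed stopping rule: nothing left worth merging
            break
        merges.append(((a, b), next_id))
        seqs = [merge_pair(s, a, b, next_id) for s in seqs]
        next_id += 1
    return merges, seqs


# Toy example: one "utterance" of 7 tokens is compressed to 3 tokens.
merges, compressed = learn_acoustic_bpe([[1, 2, 1, 2, 1, 2, 3]], num_merges=2)
print(merges)
print(compressed)
```

In this sketch, the number of merges plays the role of the BPE vocabulary-size setting: more merges give shorter sequences at the cost of a larger token inventory for the decoder-only model to predict.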