Discrete audio tokens derived from self-supervised learning models are widely used in speech generation. However, the current practice of using audio tokens directly makes sequence modeling difficult because the token sequences are long. This approach also leaves the model to establish correlations between tokens on its own, which further complicates modeling. To address these issues, we propose acoustic BPE, which applies byte-pair encoding to encode frequent audio token patterns. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in the token sequence, which alleviates the challenge of modeling token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax-capturing capabilities. In addition, we propose a novel rescore method to select the optimal synthetic speech among multiple candidates generated by a rich-diversity TTS system. Experiments show that rescore selection aligns closely with human preference, which highlights acoustic BPE's potential for other speech generation tasks.
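To make the core idea concrete, the following is a minimal sketch of byte-pair encoding applied to sequences of integer audio token IDs. It is an illustration under stated assumptions, not the authors' implementation: the function names (`train_acoustic_bpe`, `encode`, `decode`), the toy token IDs, and the base vocabulary size of 8 are all hypothetical, and any preprocessing of the token sequences is left out.

```python
from collections import Counter


def count_pairs(seqs):
    """Count adjacent token pairs over all sequences."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_acoustic_bpe(seqs, num_merges, base_vocab_size):
    """Learn up to `num_merges` merges; new IDs start at `base_vocab_size` (assumed layout)."""
    seqs = [list(s) for s in seqs]
    merges = []
    next_id = base_vocab_size
    for _ in range(num_merges):
        counts = count_pairs(seqs)
        if not counts:
            break
        # Most frequent pair; ties follow Counter.most_common ordering.
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # nothing left that repeats
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        merges.append((pair, next_id))
        next_id += 1
    return merges, seqs


def encode(seq, merges):
    """Apply learned merges, in training order, to a new token sequence."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(seq, merges):
    """Expand merged IDs back into the original audio token IDs."""
    table = {new_id: pair for pair, new_id in merges}
    stack, out = list(reversed(seq)), []
    while stack:
        tok = stack.pop()
        if tok in table:
            left, right = table[tok]
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return out


# Toy corpus of audio token IDs drawn from a hypothetical vocabulary {0, ..., 7}.
corpus = [[5, 7, 7, 2, 5, 7, 7, 2], [5, 7, 3, 5, 7, 7, 2]]
merges, compressed = train_acoustic_bpe(corpus, num_merges=3, base_vocab_size=8)
```

In this sketch each merge introduces one new vocabulary entry and shortens every sequence that contains the merged pair, which is the sequence-length reduction the abstract refers to. Because `decode` inverts the merges exactly, the original audio tokens can be recovered from the shorter sequence.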