Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, the current practice of using audio tokens directly makes sequence modeling difficult because the token sequences are long. It also places the burden of establishing correlations between tokens on the model, further complicating the modeling process. To address these issues, we propose acoustic BPE, which uses byte-pair encoding to encode frequent audio token patterns. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in token sequences, which alleviates the challenge of modeling token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax-capturing capability. In addition, we propose a novel rescore method to select the optimal synthetic speech among multiple candidates generated by a rich-diversity TTS system. Experiments show that the rescore selection aligns closely with human preference, highlighting the potential of acoustic BPE for other speech generation tasks.
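To make the idea of encoding frequent audio token patterns concrete, the following is a minimal sketch of byte-pair encoding applied to integer token sequences. It is not the authors' implementation: it assumes the audio tokens are integer IDs (for example, cluster indices of self-supervised features), and the function names `train_acoustic_bpe`, `encode`, and `decode`, as well as the toy corpus, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_pairs(seqs):
    """Count adjacent token pairs over all sequences."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_acoustic_bpe(seqs, base_vocab_size, num_merges):
    """Learn up to `num_merges` merge rules; new IDs start at `base_vocab_size` (assumed unused)."""
    seqs = [list(s) for s in seqs]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(seqs)
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:  # nothing repeats any more, so further merges would not shorten anything
            break
        new_id = base_vocab_size + len(merges)
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges


def encode(seq, merges):
    """Apply the learned merges in training order to shorten a token sequence."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(seq, merges):
    """Expand merged IDs back into the original audio tokens."""
    expand = {new_id: pair for pair, new_id in merges}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in expand:
            left, right = expand[tok]
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return out


# Toy corpus of audio token IDs in the range 0..7 (purely illustrative data).
corpus = [[3, 7, 7, 2, 3, 7, 7, 2], [3, 7, 1]]
merges = train_acoustic_bpe(corpus, base_vocab_size=8, num_merges=2)
encoded = encode(corpus[0], merges)
print(len(corpus[0]), "->", len(encoded))
```

Because every merge is reversible, the shortened sequence can always be expanded back into the original audio tokens before vocoding. The sequence model therefore works on fewer, longer units without any loss of information.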