Current large speech language models are mainly built on semantic tokens, obtained by discretizing self-supervised learned representations, and acoustic tokens produced by a neural codec, following a semantic-modeling and acoustic-synthesis paradigm. However, semantic tokens discard paralinguistic attributes of speakers that are important for natural spoken communication, while prompt-based acoustic synthesis from semantic tokens has limited ability to recover paralinguistic details and suffers from robustness issues, especially when there is a domain gap between the prompt and the target. This paper unifies the two types of tokens and proposes UniCodec, a universal speech token learning method that encapsulates all semantics of speech, including linguistic and paralinguistic information, into a compact and semantically disentangled unified token. Such a unified token not only benefits speech language models in understanding with paralinguistic hints but also helps speech generation produce high-quality output. A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features. Extensive evaluations on multilingual datasets demonstrate its effectiveness in generating natural, expressive, and long-term consistent output, with paralinguistic attributes well preserved across several speech processing tasks.
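The combination of vector quantization and distillation from self-supervised features mentioned above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the shapes, the nearest-neighbor quantizer, and the cosine distillation loss are all assumptions standing in for the actual codec and teacher model.

```python
import numpy as np

# Hypothetical setup: T frames of D-dim features, a K-entry codebook.
# "teacher" stands in for self-supervised learned features; "student"
# stands in for the codec encoder's pre-quantization output.
rng = np.random.default_rng(0)
T, D, K = 50, 16, 8

teacher = rng.normal(size=(T, D))    # SSL teacher features (assumed given)
student = rng.normal(size=(T, D))    # codec encoder output (assumed given)
codebook = rng.normal(size=(K, D))   # codebook entries (fixed in this toy)

def quantize(x, codebook):
    """Nearest-neighbor vector quantization: map each frame to a code."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def cosine_distill_loss(student_q, teacher):
    """1 - mean cosine similarity between quantized student and teacher."""
    s = student_q / np.linalg.norm(student_q, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    return 1.0 - float((s * t).sum(axis=1).mean())

quantized, codes = quantize(student, codebook)
loss = cosine_distill_loss(quantized, teacher)
print(quantized.shape, codes.shape, round(loss, 3))
```

In a real system the codebook and encoder would be trained jointly, with this distillation term encouraging the discrete tokens to carry the semantic content captured by the teacher while the codec's reconstruction objective preserves acoustic and paralinguistic detail.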