Speech tokenization is the task of representing speech signals as a sequence of discrete units. Such representations can later be used for various downstream tasks, including automatic speech recognition and text-to-speech. More relevant to this study, such representations serve as the basis of Speech Language Models. In this work, we tackle the task of speech tokenization in noisy conditions and present NAST: Noise Aware Speech Tokenization for Speech Language Models. NAST is composed of three main components: (i) a predictor; (ii) a residual encoder; and (iii) a decoder. We evaluate the effectiveness of NAST on several spoken language modeling tasks and show that NAST is superior to the evaluated baselines across all setups. Lastly, we analyze NAST and show its disentanglement properties and its robustness to signal variations in the form of noise, reverberation, pitch-shift, and time-stretch. Code and pre-trained models are available at https://github.com/ShovalMessica/NAST.
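To make the notions of "discrete units" and "robustness to signal variations" concrete, the following is a minimal sketch and not the NAST architecture. It assumes a generic nearest-centroid quantizer as a stand-in tokenizer and measures how much a token sequence changes under perturbation using a length-normalized edit distance. The names `tokenize`, `edit_distance`, and `token_mismatch_rate`, as well as the toy codebook and frames, are hypothetical and chosen only for illustration.

```python
def tokenize(frames, codebook):
    """Map each frame (a list of floats) to the index of its nearest codebook vector.

    Assumption: a plain nearest-centroid quantizer, used here only as a
    stand-in for a learned speech tokenizer.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def token_mismatch_rate(clean_tokens, perturbed_tokens):
    """Edit distance between token sequences, normalized by the clean length.

    Lower values mean the tokenizer is less affected by the perturbation.
    """
    if not clean_tokens:
        return 0.0 if not perturbed_tokens else 1.0
    return edit_distance(clean_tokens, perturbed_tokens) / len(clean_tokens)


# Toy data (hypothetical): a 3-entry codebook and 2-dimensional "frames".
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
clean_frames = [[0.1, 0.0], [0.9, 1.2], [3.8, 4.1]]
# The same frames after a perturbation large enough to move one frame
# closer to a different codebook entry.
noisy_frames = [[0.2, -0.1], [2.8, 2.9], [3.9, 4.0]]

clean_tokens = tokenize(clean_frames, codebook)
noisy_tokens = tokenize(noisy_frames, codebook)
mismatch = token_mismatch_rate(clean_tokens, noisy_tokens)
```

In this toy setup, one of the three frames switches units under the perturbation, so a third of the clean sequence is altered. A noise-aware tokenizer aims to keep this kind of mismatch low when the input is corrupted by noise, reverberation, pitch-shift, or time-stretch.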