Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings, without access to any textual supervision. Such speech LMs usually operate over discrete units obtained by quantizing the internal representations of self-supervised models. Although such units show impressive modeling results, their robustness has not been extensively investigated. This work focuses on improving the robustness of discrete input representations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate that current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method for learning robust discrete speech representations for generative spoken language modeling. The proposed approach applies a set of signal transformations to the speech signal and optimizes the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines on both encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translation, and show that the proposed approach outperforms the evaluated baselines.
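To make the idea of measuring robustness concrete, the following is a minimal sketch of one plausible way to compare the discrete units of a clean utterance with those of a transformed copy. It is an illustration under stated assumptions, not the definition used in this work: the function names (`collapse_repeats`, `edit_distance`, `unit_mismatch_rate`), the choice of Levenshtein distance, the collapsing of repeated units, and the normalization by the collapsed clean length are all hypothetical. The unit sequences are assumed to come from some quantizer that is not shown here.

```python
from itertools import groupby
from typing import List, Sequence


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge runs of identical consecutive units, e.g. [3, 3, 7, 7, 7, 3] -> [3, 7, 3]."""
    return [unit for unit, _ in groupby(units)]


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two unit sequences (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, unit_a in enumerate(a, start=1):
        curr = [i]
        for j, unit_b in enumerate(b, start=1):
            cost = 0 if unit_a == unit_b else 1
            curr.append(min(prev[j] + 1,          # delete unit_a
                            curr[j - 1] + 1,      # insert unit_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def unit_mismatch_rate(clean_units: Sequence[int],
                       augmented_units: Sequence[int]) -> float:
    """Hypothetical robustness score: edit distance between the collapsed unit
    sequences, divided by the collapsed clean length (an assumed normalization).
    Lower values mean the units changed less under the transformation."""
    clean = collapse_repeats(clean_units)
    augmented = collapse_repeats(augmented_units)
    if not clean:
        return 0.0 if not augmented else float("inf")
    return edit_distance(clean, augmented) / len(clean)


# Toy unit sequences (made up for illustration, not real model output).
clean = [3, 3, 7, 7, 7, 3]
stretched = [3, 3, 3, 7, 7, 7, 7, 7, 3, 3]  # same units, longer runs
perturbed = [3, 3, 9, 7, 7, 3]              # one spurious unit appears

print(unit_mismatch_rate(clean, stretched))  # 0.0
print(unit_mismatch_rate(clean, perturbed))  # 0.333...
```

In this toy setting, a transformation such as time-stretch that only lengthens runs of the same unit yields a score of zero, while one that changes which units appear yields a positive score. Collapsing repeats is a design choice that makes the comparison insensitive to duration. A metric that should penalize duration changes would skip that step.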