We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, with text as the intermediate modality for dialogue, non-linguistic information that modifies meaning -- such as emotion or non-speech sounds -- is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not account for overlapping speech, interruptions and interjections. Moshi solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while separately modeling its own speech and that of the user as parallel streams. This removes the need for explicit speaker turns and allows the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only does this "Inner Monologue" method significantly improve the linguistic quality of generated speech, we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms (200ms in practice), and is available at https://github.com/kyutai-labs/moshi.
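As a minimal sketch of the token layout described above (hypothetical, not the actual Moshi implementation; the codebook count and frame grouping are assumptions for illustration): at each time step, a time-aligned text token is emitted first, followed by the residual-quantizer audio tokens of the model's own stream and, in parallel, those of the user's stream.

```python
# Hypothetical sketch of the multi-stream, text-prefixed token layout.
# NUM_CODEBOOKS is an assumed number of residual vector quantizer (RVQ)
# levels per audio frame; it is not taken from the paper's configuration.
NUM_CODEBOOKS = 8

def frame_tokens(text_token, moshi_audio, user_audio):
    """Order the tokens modeled jointly at one time step:
    the time-aligned text token ("Inner Monologue") acts as a prefix,
    followed by the RVQ tokens of Moshi's stream and the user's stream."""
    assert len(moshi_audio) == NUM_CODEBOOKS
    assert len(user_audio) == NUM_CODEBOOKS
    return [text_token] + list(moshi_audio) + list(user_audio)

# One time step: a text token plus NUM_CODEBOOKS RVQ tokens per stream,
# so 1 + 8 + 8 = 17 tokens are modeled jointly at this step.
step = frame_tokens("he", [101] * NUM_CODEBOOKS, [7] * NUM_CODEBOOKS)
print(len(step))
```

Because both speech streams appear at every step, there is no turn-taking marker to predict: overlap, interruption and backchannels fall out of the joint sequence model rather than an explicit segmentation.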