Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking, such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade relative to pure-text conversation because of long speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation can mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams may disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves the ability of e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code will be available at https://github.com/dreamtheater123/TurnGuide.
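The idea of segmenting assistant speech into turns and interleaving turn-level text with speech can be illustrated with a minimal sketch. This is not the TurnGuide implementation: the `Word` type, the silence-gap rule in `segment_turns`, the `max_gap` threshold, and the placeholder speech items are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """A time-stamped word from the assistant channel (hypothetical type)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def segment_turns(words: List[Word], max_gap: float = 0.5) -> List[List[Word]]:
    """Group words into turns.

    Assumed rule: a new turn begins whenever the silence since the
    previous word exceeds max_gap seconds.
    """
    turns: List[List[Word]] = []
    for w in words:
        if turns and w.start - turns[-1][-1].end <= max_gap:
            turns[-1].append(w)   # continue the current turn
        else:
            turns.append([w])     # start a new turn
    return turns


def interleave(turns: List[List[Word]]) -> List[Tuple[str, str]]:
    """For each turn, emit its text first, then a placeholder speech item.

    The speech item here is only the turn's time span; a real system
    would emit the audio tokens covering that span instead.
    """
    seq: List[Tuple[str, str]] = []
    for turn in turns:
        seq.append(("text", " ".join(w.text for w in turn)))
        seq.append(("speech", f"{turn[0].start:.2f}-{turn[-1].end:.2f}"))
    return seq
```

In this sketch the text of each turn precedes its speech, so the text can guide the speech that follows while the turn boundaries stay tied to the audio timeline.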