Full-duplex spoken dialogue systems represent a significant advance over traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interaction in full-duplex dialogue systems remains a significant challenge, especially given the dynamics of human conversation, such as interruptions, backchannels, and overlapping speech. In this paper, we introduce OmniFlatten, a novel end-to-end GPT-based model for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex communication, we propose a multi-stage post-training scheme that progressively adapts a text-based large language model (LLM) backbone into a speech-text dialogue LLM capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. Training comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. Throughout all training stages, we standardize the data using a flattening operation, which unifies the training method and model architecture across modalities and tasks. Our approach offers a straightforward modeling technique and a promising research direction for developing efficient, natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten are available at https://omniflatten.github.io/.
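The flattening operation mentioned above can be illustrated with a minimal sketch: parallel text and speech token streams are interleaved, chunk by chunk, into a single flat sequence that a standard decoder-only LLM can model without architectural changes. The chunk sizes and interleaving order below are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of flattening parallel modality streams into one
# token sequence for a decoder-only LLM. Chunk sizes (text_chunk,
# speech_chunk) and the text-before-speech order are assumptions.

def flatten_streams(text_tokens, speech_tokens, text_chunk=2, speech_chunk=4):
    """Interleave text and speech token streams chunk by chunk."""
    flat = []
    ti, si = 0, 0
    while ti < len(text_tokens) or si < len(speech_tokens):
        flat.extend(text_tokens[ti:ti + text_chunk])
        ti += text_chunk
        flat.extend(speech_tokens[si:si + speech_chunk])
        si += speech_chunk
    return flat

# Example: three text tokens interleaved with five speech tokens.
flat = flatten_streams(["t1", "t2", "t3"], ["s1", "s2", "s3", "s4", "s5"])
# → ["t1", "t2", "s1", "s2", "s3", "s4", "t3", "s5"]
```

Because the output is a single ordinary token sequence, the same next-token training objective applies regardless of modality or task, which is the unification the abstract describes.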