Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among the different approaches to full-duplexity, a native solution merges multiple channels at each time step, achieving the lowest latency. However, prevailing designs break textual monologue sentences into word-level chunks to align them with the audio streams, which degrades language modeling ability. To address this issue, we introduce "contiguous monologues", composed of continuous sentences and "waiting" intervals, mimicking human-like cognitive behavior in dialogs. We find a proper training paradigm to be critical for semantically aligning contiguous monologues with audio. To this end, we develop a "dual" training paradigm that alternates the position of the monologue, either leading or trailing the audio, across different training stages. Combining the contiguous-monologue design with the dual training strategy, we develop FLM-Audio, our 7B spoken dialog chatbot with native full-duplexity. Experimental results confirm that FLM-Audio achieves superior response quality and chatting experience while requiring significantly less training data.