Real-time, full-duplex speech interaction is a key feature of next-generation spoken chatbots: it lets the model listen and speak at the same time and handle natural phenomena such as overlap, hesitation, and barge-in. Existing speech language models (SpeechLMs) such as LLaMA-Omni and GLM-4-Voice are still turn-based and rely on an external Voice Activity Detection (VAD) module to mark the end of the user's turn, which fundamentally limits their interactive ability. In this paper, we introduce BayLing-Duplex, a native full-duplex SpeechLM in which a single autoregressive LLM decides when to listen, when to speak, and when to stop, with no auxiliary turn-taking module. The design adds only a few special tokens to the standard vocabulary, so it transfers across LLMs and reuses existing training and serving stacks with no architectural adaptation. Starting from the public GLM-4-Voice checkpoint, and using only 400K full-duplex samples for fine-tuning followed by a lightweight DPO stage, BayLing-Duplex reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, while raising the speech-response score from Moshi's 2.17 to 3.39. BayLing-Duplex also matches or surpasses its turn-based counterpart on Llama Questions, Web Questions, and Alpaca-Eval, showing that simultaneous listen-and-speak modeling does not sacrifice response quality.
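To make the idea of token-driven turn-taking concrete, the following is a minimal sketch and not the BayLing-Duplex implementation. It assumes three hypothetical control tokens (`<listen>`, `<speak>`, `<stop>`; the abstract does not name the actual tokens) and replaces the LLM with a hand-written rule, `toy_policy`, so that the control flow can be run without any model. User audio is represented as one text chunk per time step, with an empty string standing for silence.

```python
from typing import Callable, List, Tuple

# Hypothetical control tokens; the abstract does not name the real ones.
LISTEN, SPEAK, STOP = "<listen>", "<speak>", "<stop>"

# A policy maps (chunks heard so far, currently speaking?) to a control token.
Policy = Callable[[List[str], bool], str]


def toy_policy(heard: List[str], speaking: bool) -> str:
    """Stand-in for the LLM's next-token decision (rule-based, not learned)."""
    latest = heard[-1]
    if speaking:
        # Barge-in: the user talks over the response, so stop speaking.
        return STOP if latest != "" else SPEAK
    # Start responding once the user falls silent after saying something.
    if latest == "" and any(h != "" for h in heard):
        return SPEAK
    return LISTEN


def run_duplex(user_chunks: List[str],
               policy: Policy = toy_policy) -> List[Tuple[str, str]]:
    """Feed one user chunk per step and record (chunk, control token) pairs."""
    heard: List[str] = []
    speaking = False
    trace: List[Tuple[str, str]] = []
    for chunk in user_chunks:
        heard.append(chunk)
        token = policy(heard, speaking)
        # Only SPEAK keeps the speaking state on; LISTEN and STOP turn it off.
        speaking = token == SPEAK
        trace.append((chunk, token))
    return trace


if __name__ == "__main__":
    for chunk, token in run_duplex(["hi", "", "", "wait", ""]):
        print(repr(chunk), token)
```

In this sketch, listening and speaking decisions come from the same per-step call, which is the property the abstract attributes to the single autoregressive LLM; a real system would sample these tokens from the model alongside speech tokens rather than from a fixed rule.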