Spoken conversational systems require more than accurate speech generation to hold human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to context. Current spoken conversational systems, however, rarely allow such customization, limiting their naturalness and usability. In this work, we present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. By keeping the audio encoder frozen and finetuning only the language model, our model requires just 2,000 hours of data, without relying on large-scale pretraining or multi-stage optimization. The model can follow explicit instructions to control speaker voice, conversation topic, conversational behaviour (e.g., backchanneling and interruptions), and dialogue initiation. We propose a single-stage training protocol and systematically analyze design choices. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
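The parameter-efficient setup described above (frozen audio encoder, finetuned language model) can be sketched as follows. This is a minimal, hypothetical illustration: the module names `AudioEncoder`/`LanguageModel` and the tiny `nn.Sequential` stand-ins are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SpeechConversationModel(nn.Module):
    """Illustrative stand-in for a full-duplex speech model:
    a frozen audio encoder feeding a trainable language model."""
    def __init__(self):
        super().__init__()
        # Stand-in audio encoder (the real one would be a pretrained speech encoder).
        self.audio_encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU())
        # Stand-in language model head.
        self.language_model = nn.Sequential(
            nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 32000)
        )

model = SpeechConversationModel()

# Freeze the audio encoder: its weights receive no gradient updates.
for p in model.audio_encoder.parameters():
    p.requires_grad = False

# Only the remaining trainable (language-model) parameters are optimized.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Because the optimizer only ever sees the language-model parameters, both memory use and compute per step drop substantially, which is consistent with the paper's claim of training under academic resource constraints.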