Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly when speakers compete dynamically for the floor and user expectations vary. We propose ModeratorLM, a role-playing voice agent that conditions its turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model that operates in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that performs chain-of-thought reasoning over the conversational context and the assigned role. To train and evaluate these models, we construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show that ModeratorLM improves turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions, compared to non-role-conditioned baselines.