Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide \textsc{Hold} or \textsc{Shift} and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.
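To make the trigger/verifier split concrete, the following is a minimal sketch of the control flow only, not the method evaluated in this work. The abstract does not specify either component, so everything here is an assumption made for illustration: the trigger is replaced by a simple energy-based pause detector, the verifier is an arbitrary callable, and the names `propose_candidates`, `run_pipeline`, `silence_thresh`, and `min_silence_frames` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Decision:
    time_s: float  # candidate end-of-turn time, in seconds
    label: str     # "HOLD" or "SHIFT"


def propose_candidates(
    frame_energy: Sequence[float],
    frame_s: float,
    silence_thresh: float = 0.1,
    min_silence_frames: int = 2,
) -> List[float]:
    """Stage 1 stand-in (assumption): a cheap scan over every frame.

    Proposes the start time of each pause that follows speech and lasts
    at least `min_silence_frames` frames. A learned trigger would go here.
    """
    candidates: List[float] = []
    quiet_run = 0
    seen_speech = False
    for i, energy in enumerate(frame_energy):
        if energy < silence_thresh:
            quiet_run += 1
            if seen_speech and quiet_run == min_silence_frames:
                pause_start = i - min_silence_frames + 1
                candidates.append(pause_start * frame_s)
                seen_speech = False  # one candidate per pause
        else:
            quiet_run = 0
            seen_speech = True
    return candidates


def run_pipeline(
    frame_energy: Sequence[float],
    frame_s: float,
    verifier: Callable[[float], str],
) -> List[Decision]:
    """Stage 2: call the verifier only at the proposed times."""
    times = propose_candidates(frame_energy, frame_s)
    return [Decision(t, verifier(t)) for t in times]


# Toy usage: the verifier is a placeholder rule, not a trained model.
energy = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
decisions = run_pipeline(
    energy, frame_s=0.5, verifier=lambda t: "SHIFT" if t >= 2.0 else "HOLD"
)
```

The point of the structure is cost: the trigger touches every frame and must be cheap, while the verifier, which can be more expensive, is invoked once per candidate rather than once per frame.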