Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches rely either on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, which limits both evaluation and deployment. To address these limitations, we propose \textbf{FastTurn}, a unified framework for low-latency, robust turn detection. To reduce latency without sacrificing accuracy, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set built from real human dialogue that captures authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show that FastTurn achieves higher decision accuracy and lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.