Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, which hinders seamless audio interaction. In this paper, we introduce Phoenix-VAD, an LLM-based model for streaming semantic endpoint detection. Specifically, Phoenix-VAD combines the semantic comprehension capability of the LLM with a sliding-window training strategy to detect semantic endpoints reliably while supporting streaming inference. Experiments on both semantically complete and semantically incomplete speech scenarios indicate that Phoenix-VAD achieves competitive performance. Furthermore, this design allows the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.
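To make the idea of sliding-window streaming detection concrete, the following is a minimal sketch and not the Phoenix-VAD implementation. It assumes audio arrives as a stream of feature frames and that some scoring function maps a fixed-length window of frames to an endpoint probability. The names `stream_endpoints`, `score_window`, `window_size`, `hop`, and `threshold` are hypothetical, and `toy_scorer` is a purely acoustic stand-in for the LLM-based semantic judgment described in the abstract.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence

# One feature vector per frame (hypothetical representation, not from the paper).
Frame = Sequence[float]


def stream_endpoints(
    frames: Iterable[Frame],
    score_window: Callable[[List[Frame]], float],
    window_size: int = 8,
    hop: int = 2,
    threshold: float = 0.5,
) -> Iterator[int]:
    """Yield the index of the newest frame whenever the current window
    is scored at or above `threshold` (i.e. judged to be an endpoint)."""
    # The deque keeps only the most recent `window_size` frames.
    window: Deque[Frame] = deque(maxlen=window_size)
    for index, frame in enumerate(frames):
        window.append(frame)
        # Score only full windows, and only every `hop` frames.
        is_full = len(window) == window_size
        if is_full and (index + 1 - window_size) % hop == 0:
            if score_window(list(window)) >= threshold:
                yield index


def toy_scorer(window: List[Frame]) -> float:
    """Placeholder for the LLM: fraction of low-energy frames in the
    second half of the window. This is NOT a semantic model."""
    tail = window[len(window) // 2:]
    quiet = sum(1 for f in tail if sum(abs(x) for x in f) < 0.1)
    return quiet / len(tail)


# Usage: ten "speech" frames followed by six "silent" frames.
demo_frames = [[1.0]] * 10 + [[0.0]] * 6
detected = list(
    stream_endpoints(demo_frames, toy_scorer, window_size=4, hop=1, threshold=1.0)
)
```

Because the scorer only ever sees the most recent `window_size` frames, the cost per decision stays bounded regardless of how long the stream runs, which is the property that makes this pattern compatible with streaming inference. In a real system, the placeholder scorer would be replaced by a model that judges whether the utterance is semantically complete.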