Low-latency interaction is critical for spoken dialogue, yet cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, which shifts from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of the LLM and TTS pipelines on partial context. We introduce metrics that quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows that our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework yields a 505 ms average latency reduction at the cost of a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.
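The trade-off between latency reduction and computational redundancy can be made concrete with a small accounting sketch. The definitions below are illustrative assumptions and not necessarily the metrics introduced in this work: each turn is assumed to carry a list of speculative launches, the pipeline is assumed to take a fixed time `pipeline_s`, and the reactive baseline is assumed to start the pipeline `reactive_delay_s` after the true endpoint. All names (`Attempt`, `Turn`, `evaluate`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Attempt:
    fire_time: float  # when a speculative LLM+TTS run was launched (s)
    usable: bool      # assumed flag: partial context still matched the final turn


@dataclass
class Turn:
    end_time: float  # true end of the user's turn (s)
    attempts: List[Attempt] = field(default_factory=list)


def evaluate(
    turns: List[Turn], pipeline_s: float, reactive_delay_s: float
) -> Tuple[float, float]:
    """Return (mean latency saved in seconds, extra pipeline runs per turn).

    Assumed definitions, for illustration only:
      - reactive latency = reactive_delay_s + pipeline_s
      - speculative latency = time from the true endpoint until the earliest
        usable speculative run finishes, floored at zero
      - redundancy = pipeline runs beyond the single run a reactive system needs
    """
    if not turns:
        raise ValueError("need at least one turn")

    baseline = reactive_delay_s + pipeline_s
    total_saved = 0.0
    extra_runs = 0

    for turn in turns:
        usable = [a.fire_time for a in turn.attempts if a.usable]
        if usable:
            ready_at = min(usable) + pipeline_s
            # Never count speculation as slower than the reactive fallback.
            realized = min(baseline, max(0.0, ready_at - turn.end_time))
            runs = len(turn.attempts)
        else:
            # Every speculation was discarded: fall back to one reactive run.
            realized = baseline
            runs = len(turn.attempts) + 1
        total_saved += baseline - realized
        extra_runs += runs - 1

    n = len(turns)
    return total_saved / n, extra_runs / n


if __name__ == "__main__":
    demo = [
        Turn(end_time=10.0, attempts=[Attempt(9.0, False), Attempt(9.5, True)]),
        Turn(end_time=5.0),
    ]
    saved, redundancy = evaluate(demo, pipeline_s=0.8, reactive_delay_s=0.5)
    print(f"mean latency saved: {saved:.3f} s, extra runs per turn: {redundancy:.2f}")
```

Under these assumptions, firing earlier increases the latency saved only if the speculative run remains usable; each discarded run adds to the redundancy term without reducing latency.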