Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines are strictly sequential: transcription and reasoning must both finish before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimally committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
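The dual-track idea in mechanism (1) can be illustrated with a minimal sketch, assuming two asynchronous model calls that run concurrently. The function names (`small_model_connective`, `large_model_reasoning`, `dual_track_response`), the simulated delays, and the returned strings are all hypothetical placeholders for illustration; they are not the DDTSR implementation, and the sketch omits streaming ASR, TTS, and the curriculum-learning component.

```python
import asyncio
from typing import List


async def small_model_connective(transcript: str) -> str:
    """Hypothetical stand-in for the auxiliary small model (fast track)."""
    await asyncio.sleep(0.01)  # simulated short inference time (assumption)
    return "Well,"


async def large_model_reasoning(transcript: str) -> str:
    """Hypothetical stand-in for the large model (slow track)."""
    await asyncio.sleep(0.05)  # simulated longer inference time (assumption)
    return "the answer depends on the context."


async def dual_track_response(transcript: str) -> List[str]:
    """Return response segments in the order they would become speakable."""
    spoken: List[str] = []
    # Start the slow track first so it runs while the fast track is awaited.
    large_task = asyncio.create_task(large_model_reasoning(transcript))
    # The connective is ready early and could be handed to TTS right away.
    spoken.append(await small_model_connective(transcript))
    # The knowledge-intensive continuation follows once the large model finishes.
    spoken.append(await large_task)
    return spoken


def run_demo(transcript: str = "What should I cook tonight?") -> List[str]:
    """Synchronous entry point for the sketch."""
    return asyncio.run(dual_track_response(transcript))


if __name__ == "__main__":
    print(" ".join(run_demo()))
```

In this sketch the connective becomes available after the small model's delay rather than after the large model's, which is the latency gain that the parallel tracks are meant to provide. A real system would also have to ensure that the continuation stays coherent with the already spoken connective, which the abstract attributes to mechanism (3).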