Real-time voice agents face a dilemma: end-to-end models often lack deep reasoning, while cascaded pipelines incur high latency by executing ASR, LLM reasoning, and TTS strictly in sequence, unlike human conversation, where listeners often start thinking before the speaker finishes. Because cascaded architectures remain the dominant choice for complex tasks, existing streaming strategies try to reduce this latency through mechanical segmentation (e.g., fixed chunks, VAD-based splitting) or speculative generation, but they often break semantic units or waste computation on predictions that must be rolled back. To address these challenges, we propose LTS-VoiceAgent, a Listen-Think-Speak framework that explicitly separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger that detects semantically meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker (for state maintenance) with a foreground Speaker (for speculative solving). This parallel design enables "thinking while speaking" without blocking responses. We also introduce a Pause-and-Repair benchmark containing natural disfluencies to stress-test streaming robustness. Experiments on VERA, Spoken-MQA, BigBenchAudio, and our benchmark show that LTS-VoiceAgent achieves a stronger accuracy-latency-efficiency trade-off than serial cascaded baselines and existing streaming strategies.
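The Listen-Think-Speak coordination described above can be illustrated with a minimal two-thread sketch. This is not the paper's implementation: the names (`Orchestrator`, `semantic_trigger`, `thinker`, `speaker`) are illustrative, and a simple word-count heuristic stands in for the learned Dynamic Semantic Trigger. It shows only the structural idea, namely that a background Thinker updates shared reasoning state as triggered prefixes arrive, while a foreground Speaker can answer at any time from whatever state exists, so thinking never blocks speaking.

```python
import threading
import queue

def semantic_trigger(prefix: str) -> bool:
    # Stand-in for the Dynamic Semantic Trigger: fire once the accumulated
    # prefix looks like a meaningful unit (here, simply >= 4 words).
    return len(prefix.split()) >= 4

class Orchestrator:
    """Toy Dual-Role Stream Orchestrator: background Thinker, foreground Speaker."""

    def __init__(self):
        self.state = []                  # shared reasoning state built by the Thinker
        self.lock = threading.Lock()
        self.chunks = queue.Queue()      # stream of partial ASR transcripts

    def thinker(self):
        # Background role: accumulate streamed chunks and, whenever the
        # trigger fires on the current prefix, fold it into the shared state.
        prefix = ""
        while True:
            chunk = self.chunks.get()
            if chunk is None:            # end-of-stream sentinel
                break
            prefix += chunk
            if semantic_trigger(prefix):
                with self.lock:
                    self.state.append(f"note({prefix.strip()})")
                prefix = ""

    def speaker(self, question: str) -> str:
        # Foreground role: answer speculatively from whatever state has been
        # built so far, without waiting for the Thinker to finish.
        with self.lock:
            context = "; ".join(self.state)
        return f"answer({question}) given [{context}]"

# Drive the sketch: stream a question in chunks while the Thinker runs.
orch = Orchestrator()
t = threading.Thread(target=orch.thinker)
t.start()
for c in ["what is ", "two plus ", "two in ", "base three"]:
    orch.chunks.put(c)
orch.chunks.put(None)
t.join()
reply = orch.speaker("full question")
```

In a real system the Speaker would itself run concurrently and begin emitting TTS audio while the Thinker keeps updating state; the lock-protected `state` list is the sketch's analogue of that shared, incrementally maintained context.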