Learning When to Think While Listening in Large Audio-Language Models

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides, under partial audio evidence, when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, so that training optimizes the complete wait-think-answer trajectory rather than the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below that of the base model. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.
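
As an illustration of how the six reward terms named above could be combined into one trajectory-level score, the following is a minimal sketch only. The abstract does not state how the terms are weighted or scaled, so the uniform weights, the [0, 1] score range, and the names `RewardTerms`, `UNIFORM_WEIGHTS`, and `trajectory_reward` are all assumptions made for illustration and are not the authors' implementation.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RewardTerms:
    """Scores for one complete wait-think-answer trajectory.

    Field names mirror the six terms listed in the abstract. The [0, 1]
    range for each score is an assumption of this sketch.
    """
    answer_correctness: float
    action_validity: float
    update_timing: float
    latency_synchronization: float
    reasoning_quality: float
    chain_consistency: float


# Hypothetical uniform weights; the source does not give the actual weighting.
UNIFORM_WEIGHTS = {f.name: 1.0 / 6.0 for f in fields(RewardTerms)}


def trajectory_reward(terms, weights=UNIFORM_WEIGHTS):
    """Return the weighted sum of the six terms for one trajectory."""
    total = 0.0
    for f in fields(RewardTerms):
        score = getattr(terms, f.name)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{f.name} must lie in [0, 1], got {score}")
        total += weights[f.name] * score
    return total


if __name__ == "__main__":
    # A trajectory that answers correctly with valid actions but
    # poorly timed reasoning updates.
    example = RewardTerms(
        answer_correctness=1.0,
        action_validity=1.0,
        update_timing=0.0,
        latency_synchronization=0.5,
        reasoning_quality=0.5,
        chain_consistency=1.0,
    )
    print(round(trajectory_reward(example), 3))
```

Under this sketch, a trajectory that reaches the right answer but emits badly timed updates scores lower than one that is correct and well timed, which is the sense in which such a reward targets the whole trajectory and not the final answer alone.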
