During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
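To make the think-while-listening recurrence concrete, the following is a minimal sketch and not the FLAIR architecture itself. It assumes a toy model with random, untrained weights, where each incoming audio frame is combined with the latent output of the previous step. All names (`ToyLatentThinker`, `step`, `listen`, `demo_frames`) and dimensions are hypothetical and chosen only for illustration. The sketch shows two properties stated in the abstract: the latent state is fed back recursively, and each latent depends only on frames that have already arrived.

```python
import math
import random
from typing import List

Vector = List[float]


def make_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    """Random weights standing in for a trained model (assumption: untrained toy)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix: List[Vector], vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyLatentThinker:
    """Hypothetical stand-in for a model that reasons in latent space while listening."""

    def __init__(self, frame_dim: int, latent_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        self.w_frame = make_matrix(latent_dim, frame_dim, rng)
        self.w_latent = make_matrix(latent_dim, latent_dim, rng)

    def step(self, frame: Vector, prev_latent: Vector) -> Vector:
        """One listening step: mix the new frame with the previous latent output."""
        from_frame = matvec(self.w_frame, frame)
        from_latent = matvec(self.w_latent, prev_latent)
        return [math.tanh(a + b) for a, b in zip(from_frame, from_latent)]

    def listen(self, frames: List[Vector]) -> List[Vector]:
        """Consume frames in arrival order; one latent update per frame, no extra passes."""
        latent = [0.0] * self.latent_dim  # initial latent state (assumption: zeros)
        latents = []
        for frame in frames:
            # The latent produced at the previous step is fed back in here.
            latent = self.step(frame, latent)
            latents.append(latent)
        return latents


def demo_frames(count: int, dim: int, seed: int = 1) -> List[Vector]:
    """Synthetic frames standing in for speech features (not real audio)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


if __name__ == "__main__":
    model = ToyLatentThinker(frame_dim=4, latent_dim=3)
    frames = demo_frames(count=5, dim=4)
    latents = model.listen(frames)

    # Causality check: altering the last frame leaves earlier latents unchanged.
    altered = [list(f) for f in frames]
    altered[-1] = [1.0, -1.0, 0.5, 0.0]
    print(model.listen(altered)[:-1] == latents[:-1])
```

Because the loop performs exactly one update per incoming frame, the latent state is ready as soon as the user stops speaking. This is the sense in which a design of this kind adds no generation step between listening and responding. The training objective, including the Evidence Lower Bound formulation, is outside the scope of this sketch.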