During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
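The recursive latent feedback described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the FLAIR implementation: `toy_step` is a hypothetical stand-in for one forward pass of the model, the frame vectors are made-up numbers rather than real speech features, and the tanh mixing rule is chosen purely for illustration. The sketch shows the two properties the abstract names: the latent output of one step becomes an input to the next, and the latent at step t depends only on frames received up to step t.

```python
import math
from typing import List

Vector = List[float]


def toy_step(frame: Vector, prev_latent: Vector) -> Vector:
    """Hypothetical stand-in for one model forward pass.

    Mixes the incoming speech frame with the latent produced at the
    previous step. The real model is not specified here; this rule is
    an assumption made only so the loop is runnable.
    """
    return [math.tanh(0.5 * f + 0.5 * p) for f, p in zip(frame, prev_latent)]


def think_while_listening(frames: List[Vector], dim: int) -> List[Vector]:
    """Run one latent update per incoming frame, in arrival order."""
    latent = [0.0] * dim  # assumed initial latent state
    latents = []
    for frame in frames:
        # Feed the previous step's latent output back in as input.
        latent = toy_step(frame, latent)
        latents.append(latent)
    return latents


# Made-up frame embeddings standing in for streamed user speech.
frames = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
latents = think_while_listening(frames, dim=2)
```

Because each update runs as its frame arrives, the latent state is already available when the user stops speaking, and replacing a later frame leaves all earlier latents unchanged.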