Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.
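To make the gating idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract states only that IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales the user representation before fusion with agent embeddings. Everything else here is an assumption made for illustration: the single linear layer followed by a sigmoid, the additive fusion, and the names `reliability_gate`, `fuse_frame`, `fuse_stream`, `weights`, and `bias`.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reliability_gate(user_frame: Sequence[float],
                     speaker_emb: Sequence[float],
                     weights: Sequence[float],
                     bias: float) -> float:
    """Predict a scalar gate in (0, 1) for one frame.

    Assumption: a single linear layer over the concatenated user-frame
    and target-speaker embeddings; the abstract does not specify the predictor.
    """
    features = list(user_frame) + list(speaker_emb)
    if len(features) != len(weights):
        raise ValueError("weights must match len(user_frame) + len(speaker_emb)")
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)


def fuse_frame(user_frame: Sequence[float],
               agent_frame: Sequence[float],
               speaker_emb: Sequence[float],
               weights: Sequence[float],
               bias: float) -> Vector:
    """Rescale the user frame by its gate, then fuse with the agent frame.

    Assumption: fusion is element-wise addition of equal-sized embeddings.
    """
    if len(user_frame) != len(agent_frame):
        raise ValueError("user and agent frames must have the same size")
    gate = reliability_gate(user_frame, speaker_emb, weights, bias)
    return [gate * u + a for u, a in zip(user_frame, agent_frame)]


def fuse_stream(user_frames: Sequence[Sequence[float]],
                agent_frames: Sequence[Sequence[float]],
                speaker_emb: Sequence[float],
                weights: Sequence[float],
                bias: float) -> List[Vector]:
    """Apply the gate frame by frame; each frame depends only on current inputs."""
    return [fuse_frame(u, a, speaker_emb, weights, bias)
            for u, a in zip(user_frames, agent_frames)]
```

Because each gate in this sketch is computed from the current frame and a fixed speaker embedding only, the loop in `fuse_stream` could run incrementally as frames arrive, which is one way to read "streaming-compatible". A gate near 0 suppresses a frame dominated by an interfering speaker, while a gate near 1 passes the user audio through largely unchanged.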