Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.
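The asynchronous idea in the abstract, starting retrieval as soon as a knowledge-demanding query is detected and letting it finish while the response opening is being spoken, can be illustrated with a minimal sketch. This is not the MoshiRAG implementation: `needs_retrieval`, `retrieve`, `speak`, and `respond` are hypothetical names, the keyword check is a toy stand-in for the model's own decision, and the sleeps merely simulate retrieval latency and speech duration.

```python
import asyncio
from typing import List

# Hypothetical cue words; the real system lets the model decide when to retrieve.
KNOWLEDGE_CUES = ("who", "what", "when", "where", "which", "how many")


def needs_retrieval(query: str) -> bool:
    """Toy stand-in for identifying a knowledge-demanding query."""
    return query.lower().startswith(KNOWLEDGE_CUES)


async def retrieve(query: str, latency_s: float = 0.05) -> str:
    """Hypothetical retriever; the sleep simulates search latency."""
    await asyncio.sleep(latency_s)
    return f"[evidence for: {query}]"


async def speak(text: str, duration_s: float, transcript: List[str]) -> None:
    """Simulate speaking `text` for `duration_s` seconds, then log it."""
    await asyncio.sleep(duration_s)
    transcript.append(text)


async def respond(query: str) -> List[str]:
    transcript: List[str] = []
    if not needs_retrieval(query):
        # Ordinary turn: answer directly, no retrieval.
        await speak("Sure, happy to chat.", 0.01, transcript)
        return transcript

    # Launch retrieval in the background, then speak the response opening.
    retrieval = asyncio.create_task(retrieve(query))
    await speak("Good question, let me think.", 0.08, transcript)

    # The opening took longer than retrieval here, so this await should not
    # add a noticeable pause before the core information is delivered.
    evidence = await retrieval
    await speak(f"Here is what I found: {evidence}", 0.01, transcript)
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(respond("Who wrote Hamlet?")):
        print(line)
```

In this sketch the retriever is an ordinary coroutine passed no model-specific state, which loosely mirrors the plug-and-play claim: swapping in a different `retrieve` would not require changing `respond`.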