Despite broad interest in modeling spoken dialogue agents, most approaches are inherently "half-duplex" -- restricted to turn-based interaction, with responses requiring explicit prompting by the user or implicit tracking of interruption or silence events. Human dialogue, by contrast, is "full-duplex," allowing for rich synchronicity in the form of quick and dynamic turn-taking, overlapping speech, and backchanneling. Technically, the challenge of achieving full-duplex dialogue with LLMs lies in modeling synchrony, as pre-trained LLMs do not have a sense of "time". To bridge this gap, we propose Synchronous LLMs for full-duplex spoken dialogue modeling. We design a novel mechanism to integrate time information into Llama3-8b so that it runs synchronously with the real-world clock. We also introduce a training recipe that uses 212k hours of synthetic spoken dialogue generated from text dialogue data, enabling a model that produces meaningful and natural spoken dialogue with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform state-of-the-art models in dialogue meaningfulness while maintaining naturalness. Finally, we demonstrate the model's ability to participate in full-duplex dialogue by simulating interaction between two agents trained on different datasets, while accounting for Internet-scale latencies of up to 240 ms. Webpage: https://syncllm.cs.washington.edu/.