We present DyaPlex, a streaming, full-duplex speech-and-motion model for dyadic interaction. Human communication is continuous and reciprocal, so the agent must perceive and generate both speech and physical motion simultaneously, in a streaming fashion. DyaPlex builds on the strong priors of a foundational full-duplex speech model and adds a novel motion pathway to achieve fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that keeps the base speech model frozen, preserving its zero-shot conversational reasoning, while building a deeply coupled, streaming motion pathway alongside it. A unified dyadic token interleaving mechanism, together with cross-attention guided by a time-aligned speech-motion RoPE, aligns the autoregressively generated motion with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, DyaPlex captures cross-speaker dependencies and establishes new state-of-the-art performance on both monadic and dyadic human interaction benchmarks.
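To make the two alignment ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "time-aligned" means rotary positions are derived from timestamps on a shared time axis, so that speech and motion tokens covering the same instant receive the same rotary phase, and that "dyadic token interleaving" means alternating the two speakers' tokens step by step. The frame rates (12.5 Hz and 25 Hz) and all function names are hypothetical and chosen only for illustration.

```python
import math


def time_aligned_positions(num_frames, frame_rate_hz, ref_rate_hz):
    """Map frame indices to (possibly fractional) positions on a shared time axis.

    Assumption: positions are expressed in units of the reference stream's frames.
    """
    return [i * ref_rate_hz / frame_rate_hz for i in range(num_frames)]


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary position embedding at a real-valued position."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for k in range(0, dim, 2):
        theta = position * base ** (-k / dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def interleave_dyadic(tokens_a, tokens_b):
    """Interleave two equal-length token streams as a0, b0, a1, b1, ..."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("streams must cover the same number of steps")
    out = []
    for a, b in zip(tokens_a, tokens_b):
        out.extend([a, b])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical rates: speech latents at 12.5 Hz, motion tokens at 25 Hz.
speech_pos = time_aligned_positions(4, frame_rate_hz=12.5, ref_rate_hz=12.5)
motion_pos = time_aligned_positions(8, frame_rate_hz=25.0, ref_rate_hz=12.5)

# A motion query attending to a speech key: the score depends only on the
# time offset between them, not on their raw frame indices.
query = rope_rotate([1.0, 2.0, 3.0, 4.0], motion_pos[3])
key = rope_rotate([0.5, -1.0, 2.0, 0.25], speech_pos[1])
score = dot(query, key)

dyad = interleave_dyadic(["a0", "a1"], ["b0", "b1"])
```

Under these assumptions, motion frame 2 and speech frame 1 share a position, so cross-attention between them sees a zero relative offset even though the two streams run at different rates.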