Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and to provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that synthesizes naturalistic interactive dynamics by querying structural motion guidance. Specifically, we first design an Interactivity Injector that performs video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework distills linguistic intent from audio to govern the precise timing and appropriateness of reactions. To improve lip synchronization and spatial consistency under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG). Finally, we introduce a dedicated evaluation suite with newly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Demo videos are available on our project page: https://interdyad.github.io/.