This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The system follows a two-stage pipeline. In the first stage, it maps audio input to high-fidelity motion sequences encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, large language model (LLM) priors are integrated with text-aligned semantic audio features, leveraging the cross-modal generalization power of LLMs to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, improving the stability and accuracy of the masks and enabling mask guidance during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures and detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.
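To make the prior-guided MoE idea concrete, the following minimal PyTorch sketch shows one plausible way a conditioning prior (a view embedding or a pooled feature of the predicted human mask) can drive soft routing over a small set of expert branches inside a diffusion block. This is an illustrative assumption, not the authors' implementation; all module names, dimensions, and the residual blending choice are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PriorGuidedMoE(nn.Module):
    """Minimal sketch of a prior-guided Mixture-of-Experts layer.

    A prior signal (e.g. a view embedding or a pooled mask feature)
    is mapped to soft routing weights over a few expert MLPs, and the
    expert outputs are blended accordingly. Hypothetical design; not
    the paper's implementation.
    """

    def __init__(self, feat_dim: int, prior_dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim, feat_dim * 2),
                nn.GELU(),
                nn.Linear(feat_dim * 2, feat_dim),
            )
            for _ in range(num_experts)
        )
        # The router consumes the prior (view/mask) embedding rather than
        # the tokens themselves, so routing is driven by the guidance signal.
        self.router = nn.Linear(prior_dim, num_experts)

    def forward(self, tokens: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, feat_dim) diffusion features; prior: (B, prior_dim)
        weights = F.softmax(self.router(prior), dim=-1)          # (B, E)
        expert_out = torch.stack(
            [expert(tokens) for expert in self.experts], dim=1   # (B, E, N, D)
        )
        mixed = torch.einsum("be,bend->bnd", weights, expert_out)
        return tokens + mixed  # residual blend of expert outputs


# Usage sketch: view-guided routing with a 4-expert layer.
if __name__ == "__main__":
    moe = PriorGuidedMoE(feat_dim=64, prior_dim=16)
    x = torch.randn(2, 128, 64)    # batch of token sequences
    view_emb = torch.randn(2, 16)  # hypothetical view embedding
    print(moe(x, view_emb).shape)  # torch.Size([2, 128, 64])
```

Under this reading, the view-guided and mask-guided MoEs would be two instances of the same routing pattern conditioned on different priors; the soft (rather than top-k) blend keeps the sketch simple and differentiable end to end.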