HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily because they rely on implicit modeling of audio-facial motion correlations, an approach that lacks explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation that combines implicit and explicit motion cues. The explicit cues are Action Units (AUs), anatomically defined facial muscle movements, which are used alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
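
The abstract says only that the HMMM "dynamically merges randomly paired implicit/explicit features". It does not specify the pairing scheme or the merge operator. The sketch below is one possible reading, not the authors' implementation. The function names (`merge_pair`, `random_pair_and_merge`), the scalar sigmoid gate, and the toy feature vectors are all assumptions made for illustration. It shuffles the explicit (AU-like) vectors across samples and blends each one with an implicit vector through an input-dependent gate.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def merge_pair(implicit, explicit):
    """Blend one implicit and one explicit feature vector.

    Hypothetical gate (an assumption, not from the paper): a scalar in (0, 1)
    computed from the dot product of the two vectors, so the mixing weight
    depends on the inputs.
    """
    gate = sigmoid(sum(i * e for i, e in zip(implicit, explicit)))
    return [gate * i + (1.0 - gate) * e for i, e in zip(implicit, explicit)]


def random_pair_and_merge(implicit_feats, explicit_feats, rng):
    """Pair each implicit vector with a randomly chosen explicit vector, then merge."""
    order = list(range(len(explicit_feats)))
    rng.shuffle(order)  # random pairing across samples (and so across identities)
    return [
        merge_pair(implicit_feats[k], explicit_feats[j])
        for k, j in enumerate(order)
    ]


# Toy data: three samples, three-dimensional features (illustrative values only).
implicit_feats = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, 0.5, 0.1]]
explicit_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
merged = random_pair_and_merge(implicit_feats, explicit_feats, random.Random(0))
```

Because each merged vector is a convex combination of its two inputs, every output element lies between the corresponding implicit and explicit values. A learned gate or an attention layer would be a natural substitute for the fixed sigmoid gate used here.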

