Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows these attributes to be flexibly recombined at inference time. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
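To make the idea of channel-wise, timestep-aware gating concrete, the following is a minimal sketch and not the paper's implementation. It assumes a sinusoidal timestep embedding, a linear map from that embedding to one gate logit per condition and channel, and softmax normalization across the four condition streams; none of these choices is confirmed by the abstract. The names `timestep_embedding`, `route`, `weights`, and `biases` are hypothetical.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep (assumed form)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(streams, t_emb, weights, biases):
    """Fuse K condition streams with per-channel, per-timestep gates.

    streams: K feature vectors, each a list of C floats
             (e.g. reference image, keypoints, shading mesh, audio).
    weights: weights[k][c] is a list of len(t_emb) floats
             (hypothetical learned parameters).
    biases:  biases[k][c] is a float (hypothetical learned parameters).

    Returns (fused, gates), where gates[k][c] sums to 1 over k for each
    channel c. Softmax normalization is an assumption of this sketch.
    """
    num_conditions = len(streams)
    num_channels = len(streams[0])

    # One logit per (condition, channel), driven by the timestep embedding.
    logits = [
        [
            sum(w * e for w, e in zip(weights[k][c], t_emb)) + biases[k][c]
            for c in range(num_channels)
        ]
        for k in range(num_conditions)
    ]

    # Normalize across conditions independently for every channel.
    per_channel = [
        softmax([logits[k][c] for k in range(num_conditions)])
        for c in range(num_channels)
    ]
    gates = [
        [per_channel[c][k] for c in range(num_channels)]
        for k in range(num_conditions)
    ]

    # Gate-weighted sum of the condition features, channel by channel.
    fused = [
        sum(gates[k][c] * streams[k][c] for k in range(num_conditions))
        for c in range(num_channels)
    ]
    return fused, gates
```

With all weights and biases set to zero, every gate equals 1/4 and the sketch reduces to the fixed-weight averaging that the abstract contrasts against. Nonzero weights let the mix shift per channel as the noise level changes. A fuller router would presumably also condition the gates on the features themselves, which this sketch omits.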