Producing prompt-faithful videos that preserve a user-specified identity remains challenging: models must extrapolate facial dynamics from a sparse reference while balancing identity preservation against motion naturalness. Conditioning on a single image ignores the subject's temporal signature entirely, which leads to pose-locked motion, unnatural warping, and "averaged" faces when viewpoints and expressions change. To address this, we introduce an identity-conditioned variant of a diffusion-transformer video generator that conditions on a short reference video rather than a single portrait. Our key idea is to exploit the dynamics contained in the reference: a short clip reveals subject-specific patterns, e.g., how smiles form, across poses and lighting. From this clip, a Sinkhorn-routed encoder learns compact identity tokens that capture these characteristic dynamics while remaining compatible with the pretrained backbone. Despite adding only lightweight conditioning, the approach consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.
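The sketch below illustrates one way a Sinkhorn-routed encoder could pool per-frame reference features into a small set of identity tokens; it is a minimal illustration under assumed names (`IdentityTokenizer`, `num_slots`, the number of Sinkhorn iterations) and dimensions, not the paper's actual implementation.

```python
# Illustrative sketch (assumptions, not the authors' code): features extracted
# from a short reference clip are softly assigned to a small set of learnable
# identity slots via Sinkhorn normalization, yielding compact identity tokens.
import torch
import torch.nn as nn

def sinkhorn(log_scores: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    """Alternately normalize rows and columns in log space so the
    assignment matrix becomes approximately doubly stochastic."""
    log_p = log_scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # over slots
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # over frames
    return log_p.exp()

class IdentityTokenizer(nn.Module):
    def __init__(self, dim: int = 768, num_slots: int = 16):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * dim ** -0.5)
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, D) frame/patch features from the reference clip
        scores = torch.einsum("bnd,kd->bnk", self.proj(frame_feats), self.slots)
        assign = sinkhorn(scores / frame_feats.shape[-1] ** 0.5)        # (B, N, K)
        assign = assign / assign.sum(dim=1, keepdim=True).clamp_min(1e-6)
        tokens = torch.einsum("bnk,bnd->bkd", assign, frame_feats)      # (B, K, D)
        return tokens  # compact identity tokens to condition the video backbone

# Example usage:
# tokens = IdentityTokenizer()(torch.randn(2, 64, 768))  # -> (2, 16, 768)
```

The balanced assignment produced by the Sinkhorn iterations spreads reference frames across slots, which is one plausible way to keep the token set compact while still covering varied poses and expressions.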