While diffusion models have shown great potential in portrait generation, producing expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, have limited disentanglement capabilities and cannot express personalized details due to their sparse or low-rank representations. Methods built on these signals therefore struggle to accurately preserve subject identity and expressions, hindering the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression from identity, capturing both static, subject-specific global geometry and dynamic, expression-related details. We further introduce an expression transfer module that transfers head pose and expression details across identities in a personalized manner. Using this expressive head representation as a conditioning signal, we train a diffusion transformer (DiT)-based generator to synthesize richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous approaches in identity preservation, expression accuracy, and temporal stability, particularly in capturing the fine-grained details of complex motion.
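The abstract does not specify how the head representation conditions the DiT generator; a common pattern for this kind of control signal is to let the video latent tokens cross-attend to tokens derived from the condition. Below is a minimal PyTorch sketch of that generic pattern, not the paper's implementation: `ConditionedDiTBlock`, the token counts, and all dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class ConditionedDiTBlock(nn.Module):
    """One hypothetical transformer block: self-attention over video latent
    tokens, cross-attention over head-representation tokens, then an MLP."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Self-attention over the (noisy) video latent tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: latents query the head-representation tokens.
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

# Toy forward pass with placeholder shapes (all values hypothetical).
B, T_lat, T_cond, D = 2, 256, 64, 512
block = ConditionedDiTBlock(dim=D)
noisy_latents = torch.randn(B, T_lat, D)  # flattened video latent patches
head_tokens = torch.randn(B, T_cond, D)   # tokens from a head representation
out = block(noisy_latents, head_tokens)
print(out.shape)  # torch.Size([2, 256, 512])
```

Cross-attention lets the conditioning stream have a different length than the latent stream, which suits per-frame control tokens; other injection schemes (e.g., adaptive layer norm on pooled condition features) are equally plausible given only the abstract.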