While diffusion models have shown great potential in portrait generation, generating expressive, coherent, and controllable cinematic portrait videos remains a significant challenge. Existing intermediate signals for portrait generation, such as 2D landmarks and parametric models, offer limited disentanglement and cannot express personalized details because their representations are sparse or low-rank. As a result, methods built on these signals struggle to preserve subject identity and expressions accurately, which hinders the generation of highly expressive portrait videos. To overcome these limitations, we propose a high-fidelity personalized head representation that more effectively disentangles expression and identity. This representation captures both static, subject-specific global geometry and dynamic, expression-related details. Furthermore, we introduce an expression transfer module that transfers head pose and expression details between different identities in a personalized manner. We use this highly expressive head model as the conditioning signal to train a diffusion transformer (DiT)-based generator that synthesizes richly detailed portrait videos. Extensive experiments on self- and cross-reenactment tasks demonstrate that our method outperforms previous methods in identity preservation, expression accuracy, and temporal stability, particularly in capturing fine-grained details of complex motion.