Magic-Me: Identity-Specific Video Customized Diffusion

Creating content for a specific identity (ID) has attracted significant interest in the field of generative models. In text-to-image (T2I) generation, subject-driven content generation has made great progress, with the ID in the generated images now controllable. However, extending it to video generation remains underexplored. In this work, we propose a simple yet effective framework for subject-identity-controllable video generation, termed Video Custom Diffusion (VCD). Given a subject ID specified by a few images, VCD strengthens identity information extraction and injects frame-wise correlation at the initialization stage, producing stable videos in which the identity is largely preserved. To achieve this, we propose three novel components that are essential for high-quality ID preservation: 1) an ID module trained on the identity cropped via prompt-to-segmentation, which disentangles the ID information from background noise for more accurate ID token learning; 2) a text-to-video (T2V) VCD module with a 3D Gaussian Noise Prior for better inter-frame consistency; and 3) video-to-video (V2V) Face VCD and Tiled VCD modules that deblur the face and upscale the video to a higher resolution. Despite its simplicity, extensive experiments verify that VCD generates stable, high-quality videos with better ID preservation than the selected strong baselines. Moreover, owing to the transferability of the ID module, VCD also works well with publicly available fine-tuned text-to-image models, further improving its usability. The code is available at https://github.com/Zhen-Dong/Magic-Me.
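
The abstract names a 3D Gaussian Noise Prior but does not specify its form. As a rough illustration of what "injecting frame-wise correlation at the initialization stage" could look like, the sketch below draws initial noise whose entries are standard normal within each frame but correlated across frames. The exponentially decaying correlation `gamma ** |i - j|`, the AR(1) construction used to obtain it, and the names `correlated_frame_noise`, `correlation`, and `gamma` are all assumptions made for illustration; they are not taken from the VCD code and may differ from the prior actually used there.

```python
import math
import random


def correlated_frame_noise(num_frames, frame_size, gamma, seed=None):
    """Return num_frames noise vectors, each of length frame_size.

    Hypothetical illustration: every entry is marginally N(0, 1), and the
    entries at the same position in frames i and j have correlation
    gamma ** abs(i - j). This is an assumed form, not the VCD prior itself.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    rng = random.Random(seed)
    # Scaling the fresh noise by sqrt(1 - gamma^2) keeps unit variance.
    scale = math.sqrt(1.0 - gamma * gamma)
    frames = [[rng.gauss(0.0, 1.0) for _ in range(frame_size)]]
    for _ in range(1, num_frames):
        prev = frames[-1]
        fresh = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
        # AR(1) step: mix the previous frame's noise with fresh noise.
        frames.append([gamma * p + scale * z for p, z in zip(prev, fresh)])
    return frames


def correlation(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    noise = correlated_frame_noise(num_frames=4, frame_size=10000, gamma=0.8, seed=0)
    for j in range(1, 4):
        print(f"corr(frame 0, frame {j}) = {correlation(noise[0], noise[j]):.3f}")
```

In this sketch, setting `gamma` to 0 recovers independent per-frame noise, while values closer to 1 make neighboring frames start from increasingly similar noise, which is the intuition behind using a correlated prior for inter-frame consistency.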
