This report presents MagicAvatar, a framework for multimodal video generation and animation of human avatars. Unlike most existing methods, which generate avatar-centric videos directly from multimodal inputs (e.g., text prompts), MagicAvatar explicitly disentangles avatar video generation into two stages: (1) multimodal-to-motion and (2) motion-to-video generation. The first stage translates the multimodal inputs into motion/control signals (e.g., human pose, depth, DensePose), while the second stage generates avatar-centric video guided by these motion signals. MagicAvatar also supports avatar animation given only a few images of the target person: the provided human identity is animated according to the motion derived in the first stage. We demonstrate the flexibility of MagicAvatar through various applications, including text-guided and video-guided avatar generation, as well as multimodal avatar animation.
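The two-stage decomposition can be illustrated with a minimal sketch. It shows only the data flow (multimodal input, then an intermediate motion signal, then video) and is not MagicAvatar's implementation. All names here (`MotionSignal`, `AvatarVideo`, `toy_multimodal_to_motion`, `toy_motion_to_video`, `generate_avatar_video`) are hypothetical, and the toy stages return placeholder strings where the real system would presumably use learned generative models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


# Hypothetical containers; the report does not specify concrete data formats.
@dataclass
class MotionSignal:
    kind: str          # e.g. "pose", "depth", or "densepose"
    frames: List[str]  # one placeholder entry per video frame


@dataclass
class AvatarVideo:
    frames: List[str]


# Stage signatures (assumed): stage 1 maps an input to motion,
# stage 2 maps motion plus optional identity images to video.
MotionStage = Callable[[str, int], MotionSignal]
VideoStage = Callable[[MotionSignal, Sequence[str]], AvatarVideo]


def toy_multimodal_to_motion(prompt: str, num_frames: int) -> MotionSignal:
    """Stand-in for stage 1: turn a text prompt into per-frame motion."""
    frames = [f"pose[{t}] for '{prompt}'" for t in range(num_frames)]
    return MotionSignal(kind="pose", frames=frames)


def toy_motion_to_video(motion: MotionSignal,
                        identity_images: Sequence[str]) -> AvatarVideo:
    """Stand-in for stage 2: render frames guided by the motion signal."""
    if identity_images:
        who = f"identity from {len(identity_images)} image(s)"
    else:
        who = "generic avatar"
    return AvatarVideo(frames=[f"{who} rendered with {m}" for m in motion.frames])


def generate_avatar_video(prompt: str,
                          num_frames: int,
                          stage1: MotionStage,
                          stage2: VideoStage,
                          identity_images: Sequence[str] = ()) -> AvatarVideo:
    """Chain the two stages; the motion signal is the only interface between them."""
    motion = stage1(prompt, num_frames)
    return stage2(motion, identity_images)
```

Because the stages communicate only through the motion signal, either one can be swapped independently in this sketch. For example, passing `identity_images=["a.png", "b.png"]` switches the same call from generic generation to animation of a specific identity without touching stage 1.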