Recent work on human animation typically conditions generation on audio, pose, or motion maps, thereby achieving vivid animation quality. However, these methods often face practical challenges due to extra control conditions, cumbersome condition-injection modules, or being limited to driving only the head region. We therefore ask whether striking half-body human animation can be achieved while simplifying unnecessary conditions. To this end, we propose a half-body human animation method, dubbed EchoMimicV2, that leverages a novel Audio-Pose Dynamic Harmonization strategy, comprising Pose Sampling and Audio Diffusion, to enhance half-body detail and facial and gestural expressiveness while reducing condition redundancy. To compensate for the scarcity of half-body data, we use Head Partial Attention to seamlessly incorporate headshot data into our training framework; this module can be omitted during inference, providing a free lunch for animation. Furthermore, we design a Phase-specific Denoising Loss that guides motion, detail, and low-level quality in the respective denoising phases. We also present a novel benchmark for evaluating the effectiveness of half-body human animation. Extensive experiments and analyses demonstrate that EchoMimicV2 surpasses existing methods in both quantitative and qualitative evaluations.