In recent years, generative artificial intelligence has achieved significant advances in image generation, spawning a variety of applications. Video generation, however, still faces considerable challenges in controllability, video length, and richness of detail, which hinder the adoption and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. First, we introduce confidence-aware pose guidance, which ensures high frame quality and temporal smoothness. Second, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, to generate long yet smooth videos, we propose a progressive latent fusion strategy, which lets us produce videos of arbitrary length with acceptable resource consumption. Extensive experiments and user studies show that MimicMotion improves significantly over previous approaches across multiple aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion .
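The abstract does not give the exact formulation of the confidence-based regional loss amplification. As a minimal sketch of the general idea only, the per-pixel reconstruction loss can be reweighted by a pose-confidence map so that distortion-prone, high-confidence regions (e.g. hands) contribute more to training. The function name and the linear weighting scheme below are hypothetical, not the paper's implementation:

```python
import numpy as np

def confidence_weighted_loss(pred, target, confidence, amplification=2.0):
    """Per-pixel MSE where regions with higher pose confidence are
    amplified. `confidence` is in [0, 1] per pixel; weights range
    linearly from 1 (confidence 0) to `amplification` (confidence 1).
    Hypothetical sketch, not MimicMotion's exact loss."""
    weights = 1.0 + (amplification - 1.0) * confidence
    per_pixel = (pred - target) ** 2
    return float(np.mean(weights * per_pixel))

# Usage: with zero confidence everywhere, this reduces to plain MSE;
# with full confidence, every pixel's loss is scaled by `amplification`.
pred = np.array([[1.0, 2.0]])
target = np.array([[0.0, 0.0]])
plain_mse = confidence_weighted_loss(pred, target, np.zeros_like(pred))
```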
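The abstract likewise leaves the progressive latent fusion strategy unspecified. One common way to stitch segment-wise generation into an arbitrary-length video is to generate overlapping latent segments and cross-fade them in the overlap region; the sketch below illustrates that generic idea under assumed shapes (`(frames, dim)` latents) and is not the paper's algorithm:

```python
import numpy as np

def fuse_segments(segments, overlap):
    """Fuse a list of latent segments, each of shape (frames, dim),
    that overlap by `overlap` frames, using a linear cross-fade in
    the overlapping region. Hypothetical sketch of overlap blending,
    not MimicMotion's progressive latent fusion."""
    fused = segments[0]
    for seg in segments[1:]:
        # Weight for the incoming segment ramps from 0 to 1 across the overlap.
        ramp = np.linspace(0.0, 1.0, overlap)[:, None]
        blended = (1.0 - ramp) * fused[-overlap:] + ramp * seg[:overlap]
        fused = np.concatenate([fused[:-overlap], blended, seg[overlap:]], axis=0)
    return fused
```

Blending in latent space rather than pixel space keeps the transition consistent with the diffusion model's own representation, which is one motivation for fusing segments before decoding.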