We introduce Animate124 (Animate-one-image-to-4D), the first work to animate a single in-the-wild image into 3D video through textual motion descriptions, an underexplored problem with significant applications. Our 4D generation uses a 4D grid dynamic Neural Radiance Field (NeRF) model, optimized in three stages with multiple diffusion priors. First, a static model is optimized on the reference image under the guidance of 2D and 3D diffusion priors; this model initializes the dynamic NeRF. Next, a video diffusion model is used to learn the motion specific to the subject. However, the object in the resulting 3D videos tends to drift away from the reference image over time, mainly because the text prompt and the reference image are misaligned in the video diffusion model. In the final stage, a personalized diffusion prior is therefore used to correct this semantic drift. As the first image-text-to-4D generation framework, our method improves significantly over existing baselines in comprehensive quantitative and qualitative evaluations.