Toward unlocking the potential of generative models in immersive 4D experiences, we introduce Virtual Pet, a novel pipeline to model realistic and diverse motions for target animal species within a 3D environment. To circumvent the limited availability of 3D motion data aligned with environmental geometry, we leverage monocular internet videos and extract deformable NeRF representations for the foreground and static NeRF representations for the background. For this, we develop a reconstruction strategy, encompassing species-level shared template learning and per-video fine-tuning. Utilizing the reconstructed data, we then train a conditional 3D motion model to learn the trajectory and articulation of foreground animals in the context of 3D backgrounds. We showcase the efficacy of our pipeline with comprehensive qualitative and quantitative evaluations using cat videos. We also demonstrate versatility across unseen cats and indoor environments, producing temporally coherent 4D outputs for enriched virtual experiences.