We introduce Ponymation, a method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches to motion synthesis, our model requires neither pose annotations nor parametric shape models for training; it is learned purely from a collection of raw video clips obtained from the Internet. We build on MagicPony, a recent method that learns articulated 3D animal shapes purely from collections of single images, and extend it on two fronts. First, instead of training on static images, we augment the framework with a video training pipeline that incorporates temporal regularization, yielding more accurate and temporally consistent reconstructions. Second, we learn a generative model of the underlying articulated 3D motion sequences with a spatio-temporal transformer VAE, supervised only by 2D reconstruction losses and without any explicit pose annotations. At inference time, given a single 2D image of a new animal instance, our model reconstructs an articulated, textured 3D mesh and generates plausible 3D animations by sampling from the learned motion latent space.
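To make the two ingredients named above more concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the temporal regularization can be illustrated by a squared difference between the pose vectors of consecutive frames, and that the motion VAE uses a diagonal Gaussian latent with a standard normal prior, as in a standard VAE. The function names `temporal_smoothness`, `kl_to_standard_normal`, and `sample_latent` are hypothetical and chosen for illustration only; the transformer encoder and decoder and the 2D reconstruction losses are omitted.

```python
import math
import random


def temporal_smoothness(poses):
    """Mean squared difference between consecutive frames' pose vectors.

    `poses` is a list of frames, each a list of per-joint values.
    Assumed form of a temporal regularizer; the source does not specify one.
    """
    if len(poses) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(poses, poses[1:]):
        for a, b in zip(prev, curr):
            total += (b - a) ** 2
            count += 1
    return total / count


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL divergence from N(mu, diag(exp(logvar))) to N(0, I)."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def sample_latent(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, logvar)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Three frames of a hypothetical two-joint pose sequence.
    sequence = [[0.0, 0.1], [0.1, 0.2], [0.3, 0.2]]
    print("smoothness penalty:", temporal_smoothness(sequence))
    # At generation time, a motion code would be drawn from the prior N(0, I)
    # and passed to a decoder (not shown here).
    z = sample_latent([0.0] * 4, [0.0] * 4, rng)
    print("sampled motion code:", z)
```

In this sketch, sampling with `mu = 0` and `logvar = 0` corresponds to drawing a motion code from the prior, which mirrors the inference-time step of sampling from the learned motion latent space.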