Building animatable 3D models is challenging because it requires 3D scans, laborious registration, and manual rigging, all of which are difficult to scale to arbitrary categories. Differentiable rendering has recently provided a pathway to obtaining high-quality 3D models from monocular videos, but existing approaches are limited to rigid categories or single instances. We present RAC, a method that builds category-level 3D models from monocular videos while disentangling variation across instances from motion over time. Three key ideas make this possible: (1) specializing a skeleton to each instance via optimization, (2) a latent space regularization method that encourages shared structure across a category while maintaining instance details, and (3) using 3D background models to disentangle objects from the background. We show that 3D models of humans, cats, and dogs can be learned from 50-100 internet videos.
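As a rough illustration of the trade-off named in idea (2), the sketch below shows one generic way a regularizer on per-instance latent codes can pull instances toward shared category structure while still allowing instance detail. It is not RAC's regularizer: the L2 penalty, the additive "shared plus code" model, the function names, and the toy measurements are all assumptions made for this example.

```python
def category_mean(instances):
    """Per-dimension mean over all instances: a stand-in for shared category structure."""
    n = len(instances)
    return [sum(col) / n for col in zip(*instances)]


def fit_instance_code(observation, shared, lam):
    """Minimize ||observation - (shared + code)||^2 + lam * ||code||^2.

    The minimizer has a closed form per dimension:
        code = (observation - shared) / (1 + lam)
    (assumed L2 penalty; chosen only because it is easy to verify by hand).
    """
    return [(o - s) / (1.0 + lam) for o, s in zip(observation, shared)]


def reconstruct(shared, code):
    """Instance shape = shared category structure + instance-specific offset."""
    return [s + c for s, c in zip(shared, code)]


if __name__ == "__main__":
    # Hypothetical two-dimensional shape descriptors for three instances.
    instances = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
    shared = category_mean(instances)

    # lam = 0 keeps all instance detail; a large lam collapses toward the category mean.
    for lam in (0.0, 1.0, 100.0):
        code = fit_instance_code(instances[1], shared, lam)
        print(lam, reconstruct(shared, code))
```

With `lam = 0` the reconstruction matches the observation, and as `lam` grows the instance is drawn toward the category mean. A method of this kind has to choose a regularization strength between those two extremes.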