Unsupervised learning of object-centric representations in dynamic visual scenes is challenging. Unlike most previous approaches, which learn to decompose 2D images, we present DynaVol, a 3D scene generative model that unifies geometric structures and object-centric learning in a differentiable volume rendering framework. The key idea is object-centric voxelization, which captures the 3D nature of the scene by inferring a probability distribution over objects at each spatial location. These voxel features evolve over time through a canonical-space deformation function and form the basis for global representation learning via slot attention. The voxel features and the global features are complementary, and both are used by a compositional NeRF decoder for volume rendering. DynaVol substantially outperforms existing approaches for unsupervised dynamic scene decomposition. Once trained, its explicitly meaningful voxel features enable capabilities that 2D scene decomposition methods cannot offer: the geometric shapes of objects can be freely edited, and their motion trajectories can be manipulated.
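To make the two central ideas concrete (a per-location probability distribution over objects, and compositional volume rendering), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not DynaVol's implementation: the function names `object_probabilities` and `render_ray` are hypothetical, the softmax is one plausible way to turn per-object scores into probabilities, and the compositing rule (sum the per-object densities, mix colors by density share) is a common choice for compositional radiance fields that may differ from the decoder used in the paper.

```python
import math


def object_probabilities(logits):
    """Turn per-object scores at one voxel into a probability distribution.

    Assumption: a softmax over objects; the source does not specify the form.
    """
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def render_ray(samples, delta):
    """Composite per-object densities and colors along a single ray.

    samples: one entry per ray sample; each entry is a list of
             (density, (r, g, b)) pairs, one pair per object.
    delta:   spacing between consecutive samples (assumed uniform).
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for objects in samples:
        sigma = sum(s for s, _ in objects)  # combined density at this sample
        if sigma > 0.0:
            # Density-weighted mix of the object colors.
            color = [sum(s * c[i] for s, c in objects) / sigma for i in range(3)]
        else:
            color = [0.0, 0.0, 0.0]  # empty space contributes nothing
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel
```

Because each object keeps its own density and color in this scheme, zeroing one object's density removes it from the rendered ray without touching the others, which is the kind of per-object editing the abstract describes.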