Learning object-centric representations from videos without supervision is challenging. Unlike most previous approaches, which focus on decomposing 2D images, we present DynaVol-S, a 3D generative model for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is object-centric voxelization, which captures the 3D nature of the scene by inferring per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively handles challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation by editing geometric shapes or manipulating the motion trajectories of objects.
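To make the compositional rendering idea concrete, the following is a minimal sketch of how per-object occupancy probabilities could weight densities and colors along a single ray in a NeRF-style compositor. The function name, array shapes, and the assumption that occupancies are normalized over object slots are illustrative choices, not details taken from the paper.

```python
import numpy as np

def render_ray(sigma, rgb, occ, delta):
    """Composite K object slots along one ray of S samples.

    sigma: (S, K) per-object densities
    rgb:   (S, K, 3) per-object colors
    occ:   (S, K) per-object occupancy probabilities (assumed normalized over K)
    delta: (S,) distances between adjacent ray samples
    """
    sig_k = occ * sigma                        # occupancy-weighted densities
    sig = sig_k.sum(-1)                        # total density per sample
    # density-weighted mixture color per sample
    w = sig_k / np.clip(sig[:, None], 1e-10, None)
    c = (w[..., None] * rgb).sum(1)            # (S, 3)
    alpha = 1.0 - np.exp(-sig * delta)         # per-sample opacity
    # transmittance: probability the ray survives to each sample
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = T * alpha
    return (weights[:, None] * c).sum(0)       # rendered RGB for this ray
```

Because the per-object occupancies enter the compositor explicitly, editing them (or the per-object densities they gate) is what enables the scene-editing capabilities mentioned above, such as removing an object or altering its trajectory.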