Understanding the compositional dynamics of the world in unsupervised 3D scenarios is challenging. Existing approaches either fail to make effective use of time cues or ignore the multi-view consistency of scene decomposition. In this paper, we propose DynaVol, an inverse neural rendering framework that provides a pilot study for learning time-varying volumetric representations of dynamic scenes with multiple entities (such as objects). It makes two main contributions. First, it maintains a time-dependent 3D grid that dynamically and flexibly binds spatial locations to different entities, thus encouraging the separation of information at the representational level. Second, it jointly learns grid-level local dynamics, object-level global dynamics, and compositional neural radiance fields in an end-to-end architecture, thereby enhancing the spatiotemporal consistency of object-centric scene voxelization. We present a two-stage training scheme for DynaVol and validate its effectiveness on various benchmarks with multiple objects, diverse dynamics, and real-world shapes and textures. Visualizations are available at https://sites.google.com/view/dynavol-visual.
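To make the idea of a compositional radiance field concrete, the following is a minimal sketch of how per-object densities might be combined along a single camera ray. It uses the common formulation in which the total density at a sample is the sum of the per-object densities and the color is their density-weighted mixture. This formulation is an assumption for illustration; the abstract does not specify DynaVol's exact rendering rule, and the names `composite_ray`, `densities`, `colors`, and `delta` are hypothetical.

```python
import math


def composite_ray(densities, colors, delta):
    """Composite one ray through N objects (illustrative sketch only).

    densities[i][n]: non-negative density of object n at sample i.
    colors[i][n]:    (r, g, b) color of object n at sample i.
    delta:           constant spacing between samples along the ray.
    Returns the rendered pixel color and each object's total contribution.
    """
    num_objects = len(densities[0])
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    object_weights = [0.0] * num_objects

    for sigma_n, color_n in zip(densities, colors):
        sigma = sum(sigma_n)     # assumed rule: densities add across objects
        if sigma <= 0.0:
            continue             # empty space contributes nothing
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for n, (s, c) in enumerate(zip(sigma_n, color_n)):
            share = s / sigma    # object n's share of this sample
            object_weights[n] += weight * share
            for k in range(3):
                pixel[k] += weight * share * c[k]
        transmittance *= 1.0 - alpha

    return pixel, object_weights
```

The per-object weights returned here indicate which entity dominates a pixel, which is one simple way an object-centric decomposition could be read out of such a representation across views.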