A long-standing goal in scene understanding is to obtain interpretable and editable representations that can be constructed directly from a raw monocular RGB-D video, without requiring specialized hardware setups or priors. The problem is significantly more challenging in the presence of multiple moving and/or deforming objects. Traditional methods have approached this setting with a mix of simplifications, scene priors, pretrained templates, or known deformation models. The advent of neural representations, especially neural implicit representations and radiance fields, opens the possibility of end-to-end optimization to jointly capture geometry, appearance, and object motion. However, current approaches produce a global scene encoding, assume multiview capture with limited or no motion in the scene, and do not facilitate easy manipulation beyond novel view synthesis. In this work, we introduce a factored neural scene representation that can be learned directly from a monocular RGB-D video to produce object-level neural representations with an explicit encoding of object movement (e.g., rigid trajectory) and/or deformations (e.g., nonrigid movement). We evaluate our method against a set of neural approaches on both synthetic and real data to demonstrate that the representation is efficient, interpretable, and editable (e.g., changing an object's trajectory). Code and data are available at: http://geometry.cs.ucl.ac.uk/projects/2023/factorednerf/