A long-standing goal in scene understanding is to obtain interpretable and editable representations that can be directly constructed from a raw monocular RGB-D video, without requiring specialized hardware setups or priors. The problem is significantly more challenging in the presence of multiple moving and/or deforming objects. Traditional methods have approached this setting with a mix of simplifications, scene priors, pretrained templates, or known deformation models. The advent of neural representations, especially neural implicit representations and radiance fields, opens the possibility of end-to-end optimization to jointly capture geometry, appearance, and object motion. However, current approaches produce a global scene encoding, assume multiview capture with limited or no motion in the scene, and do not facilitate easy manipulation beyond novel view synthesis. In this work, we introduce a factored neural scene representation that can be learned directly from a monocular RGB-D video to produce object-level neural representations with an explicit encoding of object movement (e.g., rigid trajectory) and/or deformations (e.g., nonrigid movement). We evaluate our method against a set of neural approaches on both synthetic and real data to demonstrate that the representation is efficient, interpretable, and editable (e.g., changing an object's trajectory). Code and data are available at http://geometry.cs.ucl.ac.uk/projects/2023/factorednerf .
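To make the idea of a factored, editable representation concrete, the following is a minimal illustrative sketch and not the released implementation. It assumes each object is stored as a field in its own canonical frame together with an explicit per-frame rigid pose, so that editing the trajectory leaves the object's geometry untouched. An analytic sphere signed distance function stands in for a learned neural field, and all names (`rigid_pose`, `world_to_canonical`, `FactoredObject`, `sphere_sdf`) are hypothetical.

```python
import numpy as np

def rigid_pose(yaw, translation):
    """Build a 4x4 object-to-world transform: rotation about z by `yaw`, then translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T

def world_to_canonical(points_world, pose_obj_to_world):
    """Map Nx3 world-space points into the object's canonical frame."""
    R = pose_obj_to_world[:3, :3]
    t = pose_obj_to_world[:3, 3]
    # Inverse rigid transform R^T (p - t), written for row-vector points.
    return (points_world - t) @ R

def sphere_sdf(points, radius=0.5):
    """Stand-in for a learned per-object field (assumption): SDF of a sphere at the origin."""
    return np.linalg.norm(points, axis=-1) - radius

class FactoredObject:
    """One object = a canonical-frame field plus an explicit rigid trajectory."""

    def __init__(self, canonical_field, trajectory):
        self.canonical_field = canonical_field
        self.trajectory = list(trajectory)  # one 4x4 pose per video frame

    def query(self, points_world, frame):
        pose = self.trajectory[frame]
        return self.canonical_field(world_to_canonical(points_world, pose))

# Original motion: the object slides along x while rotating.
obj = FactoredObject(sphere_sdf, [rigid_pose(0.1 * k, [0.2 * k, 0.0, 0.0]) for k in range(5)])
p = np.array([[0.8, 0.0, 0.0]])
d_original = obj.query(p, frame=4)  # p coincides with the object centre at frame 4

# Edit: replace only the trajectory (slide along y instead); the field is unchanged.
obj.trajectory = [rigid_pose(0.1 * k, [0.0, 0.2 * k, 0.0]) for k in range(5)]
d_edited = obj.query(p, frame=4)    # p now lies outside the moved object
```

Because motion is stored separately from geometry and appearance, the edit above touches only the list of poses. A global scene encoding has no such handle, which is the limitation the abstract points to.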