A long-standing goal in scene understanding is to obtain interpretable and editable representations that can be constructed directly from a raw monocular RGB-D video, without requiring specialized hardware setups or priors. The problem is significantly more challenging in the presence of multiple moving and/or deforming objects. Traditional methods have approached this setting with a mix of simplifications, scene priors, pretrained templates, or known deformation models. The advent of neural representations, especially neural implicit representations and radiance fields, opens the possibility of end-to-end optimization to jointly capture geometry, appearance, and object motion. However, current approaches produce a global scene encoding, assume multiview capture with limited or no motion in the scene, and do not facilitate easy manipulation beyond novel view synthesis. In this work, we introduce a factored neural scene representation that can be learned directly from a monocular RGB-D video to produce object-level neural representations with an explicit encoding of object movement (e.g., rigid trajectory) and/or deformations (e.g., nonrigid movement). We evaluate our method against a set of neural approaches on both synthetic and real data to demonstrate that the representation is efficient, interpretable, and editable (e.g., changing an object's trajectory). The project webpage is available at https://yushiangw.github.io/factorednerf/.