We explore the task of embodied view synthesis from monocular videos of deformable scenes. Given a minute-long RGBD video of people interacting with their pets, we render the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor and (2) third-person cameras that follow the actor. Building such a system requires reconstructing the root-body and articulated motion of every actor, as well as a scene representation that supports free-viewpoint synthesis. Longer videos are more likely to capture the scene from diverse viewpoints (which helps reconstruction) but are also more likely to contain larger motions (which complicates reconstruction). To address these challenges, we present Total-Recon, the first method to photorealistically reconstruct deformable scenes from long monocular RGBD videos. Crucially, to scale to long videos, our method hierarchically decomposes the scene into the background and objects, and decomposes each object's motion into carefully initialized root-body motion and local articulations. To quantify such "in-the-wild" reconstruction and view synthesis, we collect ground-truth data from a specialized stereo RGBD capture rig for 11 challenging videos, on which our method significantly outperforms prior methods. Our code, model, and data can be found at https://andrewsonga.github.io/totalrecon.
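To make the notion of an embodied camera concrete, the following is a minimal sketch of how a camera trajectory could be derived from an actor's reconstructed root-body motion. It is an illustration under stated assumptions, not the Total-Recon implementation: we assume each frame provides a 4x4 root-body-to-world transform, and that the camera sits at a fixed rigid offset in the actor's root frame. The function names (`matmul4`, `translation`, `embodied_camera_poses`) and the offset values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # 4x4 homogeneous transform, row-major


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x: float, y: float, z: float) -> Matrix:
    """Homogeneous transform that translates by (x, y, z) without rotating."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def embodied_camera_poses(root_to_world: List[Matrix],
                          cam_in_root: Matrix) -> List[Matrix]:
    """Compose each per-frame root-body pose with a fixed camera offset.

    Returns one camera-to-world transform per frame, so the camera
    moves rigidly with the actor's root body.
    """
    return [matmul4(pose, cam_in_root) for pose in root_to_world]


# Hypothetical root-body trajectory: the actor walks along +x.
root_poses = [translation(0.1 * t, 0.0, 0.0) for t in range(3)]

# Assumed offsets (illustrative values only):
ego_offset = translation(0.0, 0.0, 0.0)            # camera at the root itself
third_person_offset = translation(0.0, 0.5, -1.5)  # above and behind the root

ego_cams = embodied_camera_poses(root_poses, ego_offset)
follow_cams = embodied_camera_poses(root_poses, third_person_offset)
```

In this sketch the egocentric and third-person cameras differ only in the offset applied in the actor's frame; a real system would also need the offset to be expressed relative to the appropriate body part (for example the head) and would account for the actor's rotation, which is carried by the upper-left 3x3 block of each root pose.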