We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that reconstructs neural radiance fields for a dynamic human-object-scene from a single monocular in-the-wild video. Our method enables pausing the video at any frame and rendering all scene details (dynamic humans, objects, and backgrounds) from arbitrary viewpoints. The first challenge in this task is the complex object motion in human-object interactions, which we tackle by introducing new object bones into the conventional human skeleton hierarchy to effectively estimate large object deformations in our dynamic human-object model. The second challenge is that humans interact with different objects at different times, for which we introduce two new learnable object state embeddings that serve as conditions for learning our human-object representation and scene representation, respectively. Extensive experiments show that HOSNeRF outperforms state-of-the-art approaches on two challenging datasets by a large margin of 40% to 50% in terms of LPIPS. The code, data, and compelling examples of 360° free-viewpoint renderings from single videos will be released at https://showlab.github.io/HOSNeRF.
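To make the idea of adding object bones to a human skeleton hierarchy concrete, the following is a minimal sketch and not the HOSNeRF implementation. The `Bone` class, the helper functions, the joint names, the offsets, and the choice of attaching a "backpack" bone to the spine are all hypothetical. For brevity the sketch accumulates translations only and omits the joint rotations a real articulated model would need.

```python
from dataclasses import dataclass


@dataclass
class Bone:
    name: str
    parent: int    # index of the parent bone, -1 for the root
    offset: tuple  # (x, y, z) rest offset from the parent joint


def add_object_bone(skeleton, name, parent_name, offset):
    """Return a new skeleton with one extra bone attached to `parent_name`."""
    names = [b.name for b in skeleton]
    parent = names.index(parent_name)
    return skeleton + [Bone(name, parent, offset)]


def joint_positions(skeleton):
    """Accumulate offsets down the hierarchy (translation only, no rotations).

    Assumes every bone appears after its parent in the list.
    """
    positions = []
    for bone in skeleton:
        base = (0.0, 0.0, 0.0) if bone.parent < 0 else positions[bone.parent]
        positions.append(tuple(b + o for b, o in zip(base, bone.offset)))
    return positions


# Hypothetical four-bone human skeleton (illustrative names and offsets).
human = [
    Bone("pelvis", -1, (0.0, 0.0, 0.0)),
    Bone("spine", 0, (0.0, 0.5, 0.0)),
    Bone("right_shoulder", 1, (0.25, 0.0, 0.0)),
    Bone("right_hand", 2, (0.5, 0.0, 0.0)),
]

# Assumption: the object (a backpack) is attached to the spine.
extended = add_object_bone(human, "backpack", "spine", (0.0, 0.0, -0.25))
backpack_position = joint_positions(extended)[-1]
```

Because the object bone is an ordinary node in the same tree, any motion applied to its parent joint propagates to it, which is the property that lets a single kinematic model describe both the human and the carried object.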