Recovering the underlying dynamic 3D scene representation from a monocular RGB video has long been challenging. Existing works formulate this problem as finding the single most plausible solution by imposing various restrictions such as depth priors and strong geometric constraints, ignoring the fact that infinitely many 3D scene representations can correspond to a single dynamic video. In this paper, we aim to learn all plausible 3D scene configurations that match the input video, instead of inferring just one. To achieve this ambitious goal, we introduce a new framework called OSN. The key to our approach is a simple yet innovative object scale network, together with a joint optimization module, that learns an accurate scale range for every dynamic 3D object. This allows us to sample as many faithful 3D scene configurations as possible. Extensive experiments show that our method surpasses all baselines and achieves superior accuracy in dynamic novel view synthesis on multiple synthetic and real-world datasets. Most notably, our method demonstrates a clear advantage in learning fine-grained 3D scene geometry. Our code and data are available at https://github.com/vLAR-group/OSN
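To make the core idea concrete, the following is a minimal PyTorch sketch of a per-object scale network: a learnable code for each dynamic object is mapped to a valid scale range, from which per-object scales can then be sampled to compose different plausible scene configurations. All names and architectural details here (`ObjectScaleNet`, `code_dim`, the two-layer MLP) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the per-object scale idea described in the abstract.
# Architecture and names are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectScaleNet(nn.Module):
    """Maps a learnable per-object code to a valid scale range [s_min, s_max]."""

    def __init__(self, num_objects: int, code_dim: int = 32):
        super().__init__()
        self.codes = nn.Embedding(num_objects, code_dim)  # one code per dynamic object
        self.mlp = nn.Sequential(
            nn.Linear(code_dim, 64), nn.ReLU(),
            nn.Linear(64, 2),  # outputs raw (lower bound, range width)
        )

    def forward(self, obj_ids: torch.Tensor) -> torch.Tensor:
        out = self.mlp(self.codes(obj_ids))
        s_min = F.softplus(out[..., 0])              # positive lower bound
        width = F.softplus(out[..., 1])              # non-negative range width
        return torch.stack([s_min, s_min + width], dim=-1)  # (..., 2) = (s_min, s_max)

# Sampling one plausible scene configuration: draw a scale for every object
# uniformly from its learned range.
net = ObjectScaleNet(num_objects=3)
ranges = net(torch.arange(3))                        # (3, 2) scale ranges
u = torch.rand(3)                                    # uniform samples in [0, 1]
scales = ranges[:, 0] + u * (ranges[:, 1] - ranges[:, 0])
```

Each draw of `scales` corresponds to one faithful 3D scene configuration; repeating the sampling step enumerates many such configurations, in line with the paper's goal of covering all plausible solutions rather than a single one.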