We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets for deep learning-based 3D vision are limited to either synthetic environments or a narrow selection of real-world scenes, and are therefore insufficient. This insufficiency not only hinders comprehensive benchmarking of existing methods but also limits what can be explored in deep learning-based 3D analysis. To address this gap, we present DL3DV-10K, a large-scale scene dataset featuring 51.2 million frames from 10,510 videos captured at 65 types of point-of-interest (POI) locations, covering both bounded and unbounded scenes with different levels of reflection, transparency, and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K, which revealed valuable insights for future research in NVS. In addition, we obtained encouraging results in a pilot study on learning generalizable NeRF from DL3DV-10K, which underscores the need for a large-scale scene-level dataset to forge a path toward a foundation model for learning 3D representations. Our DL3DV-10K dataset, benchmark results, and models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/.