While recent advances in neural radiance fields enable realistic digitization of large-scale scenes, the image-capturing process remains time-consuming and labor-intensive. Previous works attempt to automate this process using a Next-Best-View (NBV) policy for active 3D reconstruction. However, existing NBV policies rely heavily on hand-crafted criteria, limited action spaces, or per-scene optimized representations. These constraints limit their cross-dataset generalizability. To overcome them, we propose GenNBV, an end-to-end generalizable NBV policy. Our policy adopts a reinforcement learning (RL)-based framework and extends the typical limited action space to 5D free space. This empowers our agent drone to scan from any viewpoint and even interact with geometries unseen during training. To boost cross-dataset generalizability, we also propose a novel multi-source state embedding comprising geometric, semantic, and action representations. We establish a benchmark using the Isaac Gym simulator with the Houses3K and OmniObject3D datasets to evaluate this NBV policy. Experiments demonstrate that our policy achieves 98.26% and 97.12% coverage ratios on unseen building-scale objects from these two datasets, respectively, outperforming prior solutions.