While recent advances in neural radiance fields enable realistic digitization of large-scale scenes, the image-capturing process remains time-consuming and labor-intensive. Previous works attempt to automate this process with Next-Best-View (NBV) policies for active 3D reconstruction. However, existing NBV policies rely heavily on hand-crafted criteria, limited action spaces, or per-scene optimized representations, which restrict their cross-dataset generalizability. To overcome these constraints, we propose GenNBV, an end-to-end generalizable NBV policy. Our policy adopts a reinforcement learning (RL)-based framework and extends the typical constrained action space to 5D free space, empowering our drone agent to scan from any viewpoint and even interact with geometries unseen during training. To boost cross-dataset generalizability, we further propose a novel multi-source state embedding that combines geometric, semantic, and action representations. We establish a benchmark using the Isaac Gym simulator with the Houses3K and OmniObject3D datasets to evaluate this NBV policy. Experiments demonstrate that our policy achieves coverage ratios of 98.26% and 97.12% on unseen building-scale objects from these datasets, respectively, outperforming prior solutions.
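To make the abstract's two key ingredients concrete, the sketch below illustrates (under stated assumptions, not the authors' implementation) what a 5D free-space action and a multi-source state embedding might look like. All names, bounds, and feature shapes here are hypothetical: the action is taken as position plus view angles (x, y, z, pitch, yaw), the embedding as a simple concatenation of geometric, semantic, and action features, and the reward as the marginal gain in coverage ratio.

```python
import numpy as np

# Hypothetical bounds for a 5D free-space action: (x, y, z, pitch, yaw).
# These values are illustrative assumptions, not taken from the paper.
ACTION_LOW = np.array([-5.0, -5.0, 0.5, -np.pi / 2, -np.pi])
ACTION_HIGH = np.array([5.0, 5.0, 5.0, np.pi / 2, np.pi])


def clip_action(action):
    """Project a raw 5D action into the free-space bounds."""
    return np.clip(action, ACTION_LOW, ACTION_HIGH)


def state_embedding(geo_feat, sem_feat, past_actions):
    """Multi-source state embedding: concatenate geometric, semantic,
    and action representations into one flat vector for the RL policy."""
    return np.concatenate(
        [geo_feat.ravel(), sem_feat.ravel(), past_actions.ravel()]
    )


def coverage_reward(prev_coverage, new_coverage):
    """Reward the marginal coverage gained by the newly captured view,
    so the policy is driven toward a high final coverage ratio."""
    return new_coverage - prev_coverage
```

In an RL loop, the policy would map `state_embedding(...)` to a 5D action, the simulator would render the view and update the reconstruction, and `coverage_reward` would supply the training signal; the episode ends once the coverage ratio plateaus or a step budget is exhausted.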