3D scene segmentation based on neural implicit representation has emerged recently with the advantage of training only on 2D supervision. However, existing approaches still requires expensive per-scene optimization that prohibits generalization to novel scenes during inference. To circumvent this problem, we introduce a generalizable 3D segmentation framework based on implicit representation. Specifically, our framework takes in multi-view image features and semantic maps as the inputs instead of only spatial information to avoid overfitting to scene-specific geometric and semantic information. We propose a novel soft voting mechanism to aggregate the 2D semantic information from different views for each 3D point. In addition to the image features, view difference information is also encoded in our framework to predict the voting scores. Intuitively, this allows the semantic information from nearby views to contribute more compared to distant ones. Furthermore, a visibility module is also designed to detect and filter out detrimental information from occluded views. Due to the generalizability of our proposed method, we can synthesize semantic maps or conduct 3D semantic segmentation for novel scenes with solely 2D semantic supervision. Experimental results show that our approach achieves comparable performance with scene-specific approaches. More importantly, our approach can even outperform existing strong supervision-based approaches with only 2D annotations. Our source code is available at: https://github.com/HLinChen/GNeSF.
翻译:基于神经隐式表征的三维场景分割近期兴起,其优势在于仅需二维监督即可训练。然而,现有方法仍需要昂贵的逐场景优化,导致推理阶段无法泛化至新场景。为解决该问题,我们提出一种基于隐式表征的可泛化三维分割框架。具体而言,本框架将多视角图像特征与语义图作为输入而非仅依赖空间信息,从而避免过拟合于特定场景的几何与语义特征。我们提出一种新颖的软投票机制,用于聚合各视角对应的三维点的二维语义信息。除图像特征外,视角差异信息也被编码至框架中以预测投票分数。直观上,该机制可使近邻视角的语义信息贡献度高于远距离视角。此外,我们设计了一个可见性模块,用于检测并过滤来自遮挡视角的有害信息。得益于所提方法的泛化能力,我们可在仅需二维语义监督的条件下,为新场景合成语义图或执行三维语义分割。实验结果表明,本方法性能与特定场景方法相当。更重要的是,在仅使用二维标注的情况下,我们的方法甚至能超越基于强监督的现有方法。源码已开源:https://github.com/HLinChen/GNeSF。