Robots in human-centered environments require accurate scene understanding to perform high-level tasks effectively. This understanding can be achieved through instance-aware semantic mapping, which involves reconstructing elements at the level of individual instances. Neural networks, the de facto solution for scene understanding, still face limitations such as overconfident incorrect predictions with out-of-distribution objects or generating inaccurate masks.Placing excessive reliance on these predictions makes the reconstruction susceptible to errors, reducing the robustness of the resulting maps and hampering robot operation. In this work, we propose Voxeland, a probabilistic framework for incrementally building instance-aware semantic maps. Inspired by the Theory of Evidence, Voxeland treats neural network predictions as subjective opinions regarding map instances at both geometric and semantic levels. These opinions are aggregated over time to form evidences, which are formalized through a probabilistic model. This enables us to quantify uncertainty in the reconstruction process, facilitating the identification of map areas requiring improvement (e.g. reobservation or reclassification). As one strategy to exploit this, we incorporate a Large Vision-Language Model (LVLM) to perform semantic level disambiguation for instances with high uncertainty. Results from the standard benchmarking on the publicly available SceneNN dataset demonstrate that Voxeland outperforms state-of-the-art methods, highlighting the benefits of incorporating and leveraging both instance- and semantic-level uncertainties to enhance reconstruction robustness. This is further validated through qualitative experiments conducted on the real-world ScanNet dataset.
翻译:在以人为本的环境中,机器人需要准确的环境理解以有效执行高级任务。这种理解可通过实例感知语义建图实现,其核心在于在个体实例层面重建场景元素。神经网络作为场景理解的事实解决方案,仍存在诸多局限:例如对分布外物体产生过度自信的错误预测,或生成不准确的掩码。过度依赖这些预测会使重建结果易受误差影响,降低最终地图的鲁棒性并阻碍机器人操作。本研究提出Voxeland——一个用于增量构建实例感知语义地图的概率框架。受证据理论启发,Voxeland将神经网络预测视为对地图实例在几何与语义层面的主观认知。这些认知随时间累积形成证据,并通过概率模型进行形式化表达。该框架使我们能够量化重建过程中的不确定性,从而识别需要改进的地图区域(例如重新观测或重新分类)。作为利用该特性的策略之一,我们引入大型视觉语言模型对高不确定性实例执行语义层面的消歧。在公开数据集SceneNN上的标准基准测试表明,Voxeland性能优于现有先进方法,凸显了融合利用实例级与语义级不确定性对提升重建鲁棒性的优势。通过在真实场景数据集ScanNet上进行的定性实验进一步验证了该结论。