Abstract semantic 3D scene understanding is a problem of critical importance in robotics. Because robots still lack an average human's common-sense knowledge about household objects and locations, we investigate the use of pre-trained language models to impart common sense for scene understanding. We introduce and compare a wide range of scene classification paradigms that leverage either language only (zero-shot, embedding-based, and structured-language) or vision and language (zero-shot and fine-tuned). We find that the best approaches in both categories yield $\sim 70\%$ room classification accuracy, exceeding the performance of pure-vision and graph classifiers. We also find that such methods demonstrate notable generalization and transfer capabilities stemming from their use of language.