Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models

Image annotation is a critical and often the most time-consuming step in training and evaluating object detection and semantic segmentation models. Deploying existing models in new environments often requires detecting semantic classes that are absent from the training data. Furthermore, indoor scenes contain significant viewpoint variations, which trained perception models must handle properly. We propose to leverage recent state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer), all trained on large-scale datasets. We aim to develop a cost-effective labeling approach that produces pseudo-labels for semantic segmentation and object instance detection in indoor environments, with the ultimate goal of facilitating the training of lightweight models for various downstream tasks. We also propose a multi-view labeling fusion stage for the setting where multiple views of a scene are available, which uses those views to identify and rectify single-view inconsistencies. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset, and we evaluate the quality of the labeling process by comparing it with human annotations. We also demonstrate the usefulness of the obtained labels in downstream tasks such as object goal navigation and part discovery. For object goal navigation, we show improved performance with this fusion approach compared to a zero-shot baseline that uses large monolithic vision-language pre-trained models.
