Image annotation is a critical and often the most time-consuming step in training and evaluating object detection and semantic segmentation models. Deploying existing models in new environments often requires detecting novel semantic classes that are absent from the training data. Furthermore, indoor scenes exhibit significant viewpoint variation, which trained perception models must handle properly. We propose to leverage recent state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer), all trained on large-scale datasets. Our aim is a cost-effective labeling approach that produces pseudo-labels for semantic segmentation and object instance detection in indoor environments, with the ultimate goal of facilitating the training of lightweight models for various downstream tasks. We also propose a multi-view labeling fusion stage for the setting where multiple views of a scene are available, which uses those views to identify and rectify single-view inconsistencies. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset, and we evaluate the quality of the labeling process by comparing its output with human annotations. We further demonstrate the usefulness of the obtained labels in downstream tasks such as object goal navigation and part discovery. For object goal navigation, we show improved performance with this fusion approach compared to a zero-shot baseline that uses large monolithic vision-language pre-trained models.
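The abstract does not specify how the multi-view labeling fusion stage is implemented. As a rough illustration of the general idea only, the sketch below fuses single-view pseudo-labels by majority vote and then overwrites views that disagree with the consensus. Everything in it is an assumption made for illustration: the function names `fuse_labels` and `relabel`, the `min_votes` threshold, and the premise that cross-view correspondences (a shared `point_id`, for example obtained from depth and camera poses) are already known.

```python
from collections import Counter, defaultdict


def fuse_labels(observations, min_votes=2):
    """Majority-vote fusion of per-view pseudo-labels (illustrative only).

    observations: iterable of (point_id, view_id, label) tuples, where
    point_id identifies the same scene point or region across views.
    Returns {point_id: consensus_label} for points whose winning label
    was seen in at least `min_votes` views.
    """
    votes = defaultdict(Counter)
    for point_id, _view_id, label in observations:
        votes[point_id][label] += 1

    fused = {}
    for point_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        if count >= min_votes:  # hypothetical confidence threshold
            fused[point_id] = label
    return fused


def relabel(observations, fused):
    """Replace each single-view label with the consensus label, if one exists."""
    return [(p, v, fused.get(p, label)) for p, v, label in observations]


if __name__ == "__main__":
    # Toy data: view v3 disagrees with v1 and v2 about point p1.
    obs = [
        ("p1", "v1", "chair"),
        ("p1", "v2", "chair"),
        ("p1", "v3", "sofa"),
        ("p2", "v1", "table"),
    ]
    consensus = fuse_labels(obs)
    print(consensus)
    print(relabel(obs, consensus))
```

In this toy setting, a point seen in only one view keeps its original label, since there is no second view to confirm or contradict it. A real pipeline would likely also weight votes by detector confidence or viewing distance, which this sketch omits.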