Object instance segmentation is a key challenge for indoor robots navigating cluttered environments with many small objects. Limitations in 3D sensing capabilities often make it difficult to detect every possible object. While deep learning approaches may be effective for this problem, manually annotating 3D data for supervised learning is time-consuming. In this work, we explore zero-shot instance segmentation (ZSIS) from RGB-D data to identify unseen objects in a semantic category-agnostic manner. To enable this study, we introduce a zero-shot split of the Tabletop Objects Dataset (TOD-Z) and present a method that uses annotated objects to learn the "objectness" of pixels and generalizes to unseen object categories in cluttered indoor environments. Our method, SupeRGB-D, groups pixels into small patches based on geometric cues and learns to merge the patches in a deep agglomerative clustering fashion. SupeRGB-D outperforms existing baselines on unseen objects while achieving similar performance on seen objects. We further show competitive results on the real dataset OCID. With its lightweight design (0.4 MB memory requirement), our method is well suited to mobile and robotic applications. Adding DINO features can further increase performance at the cost of a higher memory requirement. The dataset split and code are available at https://github.com/evinpinar/supergb-d.
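To make the patch-merging idea concrete, the following is a minimal sketch of greedy agglomerative merging over a patch adjacency graph. It is not the SupeRGB-D implementation: the learned merge predictor is replaced by a hand-written score (`merge_score`), and the function names, the one-dimensional depth feature, and the threshold value are all assumptions made for illustration.

```python
import math


def merge_score(feat_a, feat_b):
    """Hypothetical stand-in for a learned merge predictor.

    Maps the Euclidean distance between two patch features to (0, 1].
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
    return math.exp(-dist)


def agglomerate(features, sizes, edges, threshold=0.9):
    """Greedily merge adjacent patches while the best score >= threshold.

    features: dict patch_id -> tuple of floats (e.g. mean depth)
    sizes:    dict patch_id -> pixel count
    edges:    iterable of (id_a, id_b) pairs of adjacent patches
    Returns a dict mapping each original patch id to its segment id.
    """
    features = dict(features)
    sizes = dict(sizes)
    members = {p: {p} for p in features}
    neighbors = {p: set() for p in features}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    while True:
        # Find the adjacent pair with the highest merge score.
        best = None
        for a in neighbors:
            for b in neighbors[a]:
                if a < b:
                    score = merge_score(features[a], features[b])
                    if best is None or score > best[0]:
                        best = (score, a, b)
        if best is None or best[0] < threshold:
            break

        _, a, b = best
        # Merge b into a: size-weighted mean feature, union of members.
        total = sizes[a] + sizes[b]
        features[a] = tuple(
            (fa * sizes[a] + fb * sizes[b]) / total
            for fa, fb in zip(features[a], features[b])
        )
        sizes[a] = total
        members[a] |= members.pop(b)
        # Rewire b's neighbors to point at a instead.
        for n in neighbors.pop(b):
            neighbors[n].discard(b)
            if n != a:
                neighbors[n].add(a)
                neighbors[a].add(n)
        del features[b], sizes[b]

    return {p: seg for seg, group in members.items() for p in group}


# Toy example: four patches in a row, described only by mean depth (metres).
features = {0: (0.50,), 1: (0.52,), 2: (0.90,), 3: (0.91,)}
sizes = {0: 100, 1: 100, 2: 100, 3: 100}
edges = [(0, 1), (1, 2), (2, 3)]
labels = agglomerate(features, sizes, edges, threshold=0.9)
```

In this toy example, patches 0 and 1 end up in one segment and patches 2 and 3 in another, because the depth gap between patches 1 and 2 keeps their merge score below the threshold. In a learned variant, `merge_score` would be replaced by a network that predicts whether two patches belong to the same object.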