Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. It matters most in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation that supports language-grounded grasping in cluttered scenes. Our key idea is a hierarchical instance tree built from semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method treats them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences using only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
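The abstract states that each 3D instance carries an open-vocabulary semantic embedding used to pick a target from a natural language instruction. One common way to do this is nearest-neighbor matching by cosine similarity, sketched below. This is an illustrative assumption, not the Clutt3R-Seg implementation: the function names `cosine_similarity` and `select_target`, the toy 3-dimensional vectors, and the choice of cosine similarity as the scoring rule are all hypothetical. In practice the vectors would come from a vision-language encoder.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_target(
    query_embedding: Sequence[float],
    instance_embeddings: Dict[int, Sequence[float]],
) -> Tuple[int, float]:
    """Return (instance_id, score) for the instance most similar to the query."""
    if not instance_embeddings:
        raise ValueError("no instances to select from")
    scores = {
        inst_id: cosine_similarity(query_embedding, emb)
        for inst_id, emb in instance_embeddings.items()
    }
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


# Toy embeddings standing in for encoder outputs (hypothetical values).
instances = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.9, 0.1],
    2: [0.0, 0.2, 0.9],
}
query = [0.0, 1.0, 0.0]  # stand-in for the embedded language instruction
target_id, score = select_target(query, instances)
print(target_id, round(score, 3))
```

Scoring every instance against the query keeps the selection step independent of how the instances were segmented, so the same routine would apply after a scene update as long as instance identifiers are preserved.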