Supervised 3D part segmentation models are tailored for a fixed set of objects and parts, limiting their transferability to open-set, real-world scenarios. Recent works have explored vision-language models (VLMs) as a promising alternative, using multi-view rendering and textual prompting to identify object parts. However, naively applying VLMs in this context introduces several drawbacks, such as the need for meticulous prompt engineering, and fails to leverage the 3D geometric structure of objects. To address these limitations, we propose COPS, a COmprehensive model for Parts Segmentation that blends the semantics extracted from visual concepts and 3D geometry to effectively identify object parts. COPS renders a point cloud from multiple viewpoints, extracts 2D features, projects them back to 3D, and uses a novel geometric-aware feature aggregation procedure to ensure spatial and semantic consistency. Finally, it clusters points into parts and labels them. We demonstrate that COPS is efficient, scalable, and achieves zero-shot state-of-the-art performance across five datasets, covering synthetic and real-world data, texture-less and coloured objects, as well as rigid and non-rigid shapes. The code is available at https://3d-cops.github.io.
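The pipeline described above (multi-view rendering, 2D feature extraction, back-projection to 3D, geometric-aware aggregation, and clustering into parts) can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the sinusoidal "encoder" stands in for a frozen 2D VLM backbone, the k-NN mean stands in for the paper's geometric-aware aggregation, and all function names are assumptions.

```python
import numpy as np

def render_views(points, n_views=4):
    """Rotate the cloud about the y-axis to mimic rendering from multiple viewpoints."""
    views = []
    for i in range(n_views):
        t = 2 * np.pi * i / n_views
        R = np.array([[np.cos(t), 0.0, np.sin(t)],
                      [0.0, 1.0, 0.0],
                      [-np.sin(t), 0.0, np.cos(t)]])
        views.append(points @ R.T)
    return views

def extract_2d_features(view):
    """Stand-in for a frozen 2D VLM encoder: sinusoidal features of image-plane coords."""
    xy = view[:, :2]                          # drop depth -> "image plane"
    freqs = np.array([1.0, 2.0, 4.0])
    args = xy[:, :, None] * freqs             # (N, 2, 3)
    return np.concatenate([np.sin(args), np.cos(args)], axis=2).reshape(len(view), -1)

def back_project(per_view_feats):
    """Lift 2D features to 3D by averaging each point's features over all views."""
    return np.mean(per_view_feats, axis=0)

def geometric_aggregate(points, feats, k=8):
    """Crude geometric-aware aggregation: smooth features over k nearest spatial neighbours."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    return feats[knn].mean(axis=1)

def kmeans(feats, n_parts=2, iters=20, seed=0):
    """Minimal k-means to group points into candidate parts (labelling omitted)."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), n_parts, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((feats[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_parts):
            if np.any(labels == c):
                centers[c] = feats[labels == c].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
# Toy "object": two spatially separated blobs standing in for two parts.
points = np.concatenate([rng.normal(0.0, 0.1, (50, 3)),
                         rng.normal(2.0, 0.1, (50, 3))])
views = render_views(points)
feats = back_project(np.stack([extract_2d_features(v) for v in views]))
feats = geometric_aggregate(points, feats)
labels = kmeans(feats, n_parts=2)
```

In the actual method the per-point features are semantic (from a VLM), so the final clusters can be matched against textual part prompts; here the clustering merely illustrates how aggregated point features are grouped into parts.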