In 3D Referring Expression Segmentation (3D-RES), earlier approaches adopt a two-stage paradigm: they first extract segmentation proposals and then match them with referring expressions. This paradigm has two major drawbacks: the initial proposals are of poor quality, and inference is slow. To address these limitations, we introduce an end-to-end Superpoint-Text Matching Network (3D-STMN) enriched by dependency-driven insights. A key component of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that rely on instance proposals, STM directly matches linguistic cues with their corresponding superpoints, which are clusters of semantically related points. This design lets the model efficiently exploit cross-modal semantic relationships by learning from densely annotated superpoint-text pairs rather than sparser instance-text pairs. To strengthen the role of text in guiding segmentation, we further incorporate a Dependency-Driven Interaction (DDI) module that deepens the network's semantic comprehension of referring expressions. Guided by dependency trees, this module captures the relationships between primary terms and their associated descriptors in an expression, improving both the localization and segmentation capabilities of our model. Comprehensive experiments on the ScanRefer benchmark show that our model not only sets a new performance standard, with an mIoU gain of 11.7 points, but also achieves a 95.7-fold increase in inference speed over traditional methods. The code and models are available at https://github.com/sosppxo/3D-STMN.
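The idea of matching text directly against superpoints, rather than against instance proposals, can be illustrated with a minimal sketch. This is not the 3D-STMN implementation: the function names (`pool_superpoints`, `match_superpoints`), the mean pooling, the dot-product scoring, and the 0.5 threshold are all assumptions chosen for illustration, and the features are toy values standing in for the outputs of learned point and text encoders.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pool_superpoints(point_feats, superpoint_ids):
    """Average per-point features within each superpoint (assumed pooling)."""
    sums, counts = {}, {}
    for feat, sp in zip(point_feats, superpoint_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
            counts[sp] = 0
        sums[sp] = [s + f for s, f in zip(sums[sp], feat)]
        counts[sp] += 1
    return {sp: [s / counts[sp] for s in total] for sp, total in sums.items()}


def match_superpoints(point_feats, superpoint_ids, text_query, threshold=0.5):
    """Score each superpoint against a text query and return a per-point mask.

    Hypothetical sketch: the score is the sigmoid of a dot product between the
    pooled superpoint feature and the text query embedding.
    """
    sp_feats = pool_superpoints(point_feats, superpoint_ids)
    sp_scores = {
        sp: sigmoid(sum(f * q for f, q in zip(feat, text_query)))
        for sp, feat in sp_feats.items()
    }
    # Every point inherits the decision made for its superpoint.
    return [sp_scores[sp] > threshold for sp in superpoint_ids]


# Toy example: four points grouped into two superpoints, 2-D features.
point_feats = [[2.0, 0.0], [1.0, 0.0], [-1.0, 0.5], [-2.0, 0.0]]
superpoint_ids = [0, 0, 1, 1]
text_query = [1.0, 0.0]
mask = match_superpoints(point_feats, superpoint_ids, text_query)
print(mask)  # [True, True, False, False]
```

Because the decision is made once per superpoint and then broadcast to its points, no proposal generation or proposal ranking step is needed in this sketch, which is the structural difference from the two-stage paradigm described above.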