Despite advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which rely primarily on backbone architectures or CLIP-based embedding learning, exhibit inherent limitations in fine-grained spatial localization and grounding. This paper introduces SJTU: Spatial Judgments in multimodal models - Towards Unified segmentation through coordinate detection, a framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification from natural language instructions. The framework integrates segmentation techniques with vision-language models through multimodal spatial inference: it detects normalized bounding-box coordinates and translates them into actionable segmentation outputs, linking multimodal spatial and language representations. The framework achieves strong performance on standard benchmarks, with results on the COCO 2017 dataset for general object detection and the Pascal VOC datasets for semantic segmentation demonstrating its generalization capability.
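The coordinate-to-segmentation pipeline described above hinges on denormalizing the model's predicted box before handing it to a segmentation module. A minimal sketch of that step, assuming the model emits a box as normalized `[x_min, y_min, x_max, y_max]` values in `[0, 1]` (the function name and box format here are illustrative, not the paper's actual API):

```python
def denormalize_box(norm_box, image_width, image_height):
    """Convert a normalized bounding box [x_min, y_min, x_max, y_max]
    in [0, 1] to absolute pixel coordinates for a given image size."""
    x_min, y_min, x_max, y_max = norm_box
    return (
        round(x_min * image_width),
        round(y_min * image_height),
        round(x_max * image_width),
        round(y_max * image_height),
    )

# Example: a box covering the central region of a 640x480 image.
box = denormalize_box([0.25, 0.25, 0.75, 0.75], 640, 480)
print(box)  # (160, 120, 480, 360)
```

The resulting pixel-space box can then serve as a spatial prompt to a promptable segmentation model, which is how a coordinate prediction becomes an actionable segmentation output.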