Despite significant advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, exhibit inherent limitations in fine-grained spatial localization and operational capability. This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection, a framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification from natural language instructions. The framework integrates segmentation techniques with vision-language models through spatial inference in multimodal space: by detecting normalized bounding-box coordinates and transforming them into actionable segmentation outputs, we establish a connection between spatial and language representations in multimodal architectures. Experimental results demonstrate superior performance across benchmark datasets, with IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC. On a single NVIDIA RTX 3090 GPU with 512×512 images, the framework averages 7 seconds of inference per image, demonstrating its effectiveness in both accuracy and practical deployability. The project code is available at https://github.com/jw-chae/SJTU.
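To make the coordinate-detection step concrete, the following is a minimal Python sketch, not the authors' implementation, of how a vision-language model's normalized bounding-box output might be parsed and mapped back to pixel coordinates that a box-promptable segmenter can consume. The response format, function names, and example values are illustrative assumptions; see the repository above for the actual pipeline.

```python
# Minimal sketch (illustrative, not the authors' code): parse a VLM's
# normalized bounding-box reply and denormalize it to pixel coordinates,
# the coordinate-detection step described in the abstract.
import re


def parse_normalized_box(response: str) -> tuple[float, float, float, float]:
    """Extract four [0, 1] coordinates (x0, y0, x1, y1) from model text."""
    nums = [float(m) for m in re.findall(r"\d*\.\d+|\d+", response)]
    if len(nums) < 4:
        raise ValueError(f"expected 4 coordinates, got: {response!r}")
    x0, y0, x1, y1 = nums[:4]
    return x0, y0, x1, y1


def denormalize(box, width: int, height: int) -> tuple[int, int, int, int]:
    """Map normalized [0, 1] coordinates onto a width x height image."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


# Example: a hypothetical model reply for a 512x512 image.
reply = "[0.12, 0.30, 0.58, 0.91]"
pixel_box = denormalize(parse_normalized_box(reply), 512, 512)
print(pixel_box)  # (61, 154, 297, 466) -- usable as a box prompt for a segmenter
```

Keeping the language model's coordinates normalized to [0, 1] makes its output resolution-independent; denormalization against the actual image size happens only at segmentation time.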