We introduce SAM4MLLM, an innovative approach that integrates the Segment Anything Model (SAM) with Multi-Modal Large Language Models (MLLMs) for pixel-aware tasks. Our method enables MLLMs to learn pixel-level location information without requiring excessive modifications to the existing model architecture or adding specialized tokens. We introduce an inquiry-based approach that effectively finds prompt points for SAM to perform segmentation based on the MLLM's output. It combines detailed visual information with the powerful expressive capabilities of large language models in a unified, language-based manner, without additional computational overhead during learning. Experimental results on public benchmarks demonstrate the effectiveness of our approach.
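The following Python sketch illustrates the inquiry-based pipeline described above: the MLLM is asked, in plain language, for point prompts on the referred object, and those points are then passed to SAM, which produces the dense pixel mask. This is a minimal sketch, not the paper's implementation: `query_mllm_for_points` is a hypothetical placeholder for the MLLM inquiry step, and the SAM calls assume the public `segment_anything` API with an assumed checkpoint path.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor


def query_mllm_for_points(image, referring_expression):
    """Hypothetical stand-in for the MLLM inquiry step: the MLLM would be
    prompted with the image and a question about the referred object, and
    would answer with (x, y) coordinates as text, parsed into prompt points.
    Here we return a dummy single foreground point at the image center."""
    h, w = image.shape[:2]
    point_coords = np.array([[w // 2, h // 2]])
    point_labels = np.array([1])  # 1 = foreground point
    return point_coords, point_labels


def segment_with_mllm_points(image, referring_expression,
                             checkpoint="sam_vit_h.pth"):  # assumed path
    # 1) Language-based inquiry: ask the MLLM for prompt points.
    point_coords, point_labels = query_mllm_for_points(image, referring_expression)

    # 2) Prompt SAM with those points; SAM produces the pixel mask, so the
    #    MLLM itself never has to emit dense pixel-level outputs.
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # expects an RGB uint8 HxWx3 array
    masks, scores, _ = predictor.predict(
        point_coords=point_coords,
        point_labels=point_labels,
        multimask_output=False,
    )
    return masks[0]  # boolean HxW mask for the referred object
```

Keeping the interface between the MLLM and SAM purely point-based, as in this sketch, is what lets the method avoid architectural changes or new specialized tokens: the MLLM only ever emits coordinates in ordinary text.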