BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts

Segmentation is a fundamental task in computer vision, and prompt-driven methods have gained prominence because of their flexibility. The Segment Anything Model (SAM) has demonstrated powerful point-prompt segmentation capabilities, while text-based segmentation models offer rich semantic understanding. However, existing approaches rarely explore how to combine these complementary modalities effectively. This paper presents BiPrompt-SAM, a dual-modal prompt segmentation framework that combines the advantages of point and text prompts through an explicit selection mechanism. Specifically, we use SAM's inherent ability to generate multiple mask candidates, obtain a semantic guidance mask from the text prompt, and explicitly select the most suitable candidate based on similarity metrics. This approach can be viewed as a simplified Mixture of Experts (MoE) system in which the point and text modules act as distinct "experts" and the similarity scoring serves as a rudimentary "gating network." We evaluate the method on the Endovis17 medical dataset and on the RefCOCO series of natural image datasets. On Endovis17, BiPrompt-SAM achieves 89.55\% mDice and 81.46\% mIoU, comparable to state-of-the-art specialized medical segmentation models. On the RefCOCO series, it attains 87.1\%, 86.5\%, and 85.8\% IoU, significantly outperforming existing approaches. The experiments show that explicit selection effectively combines the spatial precision of point prompts with the semantic richness of text prompts, and that it is particularly effective in scenes with semantically complex objects, multiple similar objects, or partial occlusions. BiPrompt-SAM provides a simple yet effective implementation and offers a new perspective on multi-modal prompt fusion.
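The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract speaks only of "similarity metrics," so the use of IoU as the score, the comparison of each candidate against the text-derived guidance mask, and all names (`iou`, `dice`, `select_candidate`, the toy masks) are assumptions made for illustration. Masks are plain nested lists of 0/1 values so that the example needs no external libraries.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0


def dice(a: Mask, b: Mask) -> float:
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    inter = total = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            total += (1 if pa else 0) + (1 if pb else 0)
    return 2 * inter / total if total else 0.0


def select_candidate(candidates: Sequence[Mask],
                     text_mask: Mask) -> Tuple[int, List[float]]:
    """Return the index of the candidate most similar to the text mask.

    Assumption: similarity is measured with IoU. The score list plays
    the role of the rudimentary "gating" signal; ties go to the first
    candidate with the maximum score.
    """
    scores = [iou(c, text_mask) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy 4x4 example: a hypothetical guidance mask from the text expert.
text_mask = [[0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0]]

# Hypothetical point-prompt candidates: a part, the object, an over-segmentation.
candidates = [
    [[0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]],
    [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]],
]

best, scores = select_candidate(candidates, text_mask)
print(best, [round(s, 3) for s in scores])
```

In this toy case the second candidate is chosen, because it overlaps the guidance mask far more than the part-level or over-segmented alternatives do. The same `iou` and `dice` helpers, averaged over a dataset against ground-truth masks, correspond to the mIoU and mDice figures reported above.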
