VPUFormer: Visual Prompt Unified Transformer for Interactive Image Segmentation

Integrating diverse visual prompts such as clicks, scribbles, and boxes into interactive image segmentation can make user interaction both easier and more efficient. Most existing studies handle a single type of visual prompt and simply concatenate the prompt with the image as input for segmentation prediction, which yields an inefficient prompt representation and only weak interaction between prompt and image. This paper proposes the Visual Prompt Unified Transformer (VPUFormer), a simple yet effective model that introduces a concise unified prompt representation with deeper interaction to boost segmentation performance. Specifically, we design a Prompt-unified Encoder (PuE) that uses Gaussian mapping to generate a unified one-dimensional vector for click, box, and scribble prompts, which captures the user's intention while providing a denser representation of the prompt. In addition, we present a Prompt-to-Pixel Contrastive Loss (P2CL) that leverages user feedback to gradually refine candidate semantic features: it pulls image semantic features that are similar to the user prompt closer to it and pushes dissimilar features away, thereby correcting results that deviate from the user's expectations. On this basis, our approach injects the prompt representations as queries into Dual-cross Merging Attention (DMA) blocks to perform a deeper interaction between image and query inputs. Extensive experiments on seven challenging datasets demonstrate that VPUFormer with PuE, DMA, and P2CL achieves consistent improvements and state-of-the-art segmentation performance. Our code will be made publicly available at https://github.com/XuZhang1211/VPUFormer.
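The abstract names two mechanisms that are easier to follow with a concrete example: mapping different prompt types to one vector through Gaussian mapping, and a prompt-to-pixel contrastive objective. The sketch below is only an illustration under stated assumptions, not the released VPUFormer implementation. The function names (`gaussian_map`, `box_to_points`, `encode_prompt`, `prompt_to_pixel_loss`), the `sigma` and `temperature` values, the choice to sample a box along its outline, and the InfoNCE-style form of the loss with cosine similarity are all hypothetical choices made here for clarity. It is written in plain Python so that it runs without any dependencies.

```python
import math


def gaussian_map(points, height, width, sigma=2.0):
    """Render (row, col) points as a heatmap; each pixel keeps its largest Gaussian response."""
    heat = [[0.0] * width for _ in range(height)]
    for py, px in points:
        for y in range(height):
            for x in range(width):
                d2 = (y - py) ** 2 + (x - px) ** 2
                heat[y][x] = max(heat[y][x], math.exp(-d2 / (2.0 * sigma ** 2)))
    return heat


def box_to_points(y0, x0, y1, x1):
    """Assumption: treat a box as a dense scribble sampled along its outline."""
    top_bottom = [(y, x) for y in (y0, y1) for x in range(x0, x1 + 1)]
    sides = [(y, x) for x in (x0, x1) for y in range(y0, y1 + 1)]
    return top_bottom + sides


def encode_prompt(kind, data, height, width, sigma=2.0):
    """Map a click, scribble, or box to one flat vector of length height * width."""
    if kind == "click":
        points = [data]
    elif kind == "scribble":
        points = list(data)
    elif kind == "box":
        points = box_to_points(*data)
    else:
        raise ValueError(f"unknown prompt kind: {kind}")
    heat = gaussian_map(points, height, width, sigma)
    return [value for row in heat for value in row]


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def prompt_to_pixel_loss(prompt_feat, pixel_feats, is_positive, temperature=0.1):
    """InfoNCE-style sketch: pixels flagged positive are pulled toward the prompt feature,
    all other pixels act as negatives and are pushed away."""
    logits = [cosine(prompt_feat, feat) / temperature for feat in pixel_feats]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    positives = [l for l, pos in zip(logits, is_positive) if pos]
    if not positives:
        raise ValueError("at least one positive pixel is required")
    return sum(log_denominator - l for l in positives) / len(positives)


# Usage: three prompt types share one representation of the same length.
click_vec = encode_prompt("click", (2, 3), height=8, width=8)
scribble_vec = encode_prompt("scribble", [(1, 1), (2, 2), (3, 3)], height=8, width=8)
box_vec = encode_prompt("box", (1, 1, 4, 5), height=8, width=8)
print(len(click_vec), len(scribble_vec), len(box_vec))
```

In this sketch every prompt type ends up as a vector of the same length, so a downstream module can consume clicks, scribbles, and boxes without type-specific branches. The loss is small when the pixels marked positive already align with the prompt feature and grows when they do not, which is the pull and push behaviour the abstract attributes to P2CL.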
