Interactive image segmentation aims to segment a target from the background under manual guidance, taking as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to the interactive segmentation task. However, previous works neglect the relations between the two modalities and directly mimic the way purely visual information is processed with self-attention. In this paper, we propose a simple yet effective network for click-based interactive segmentation with cross-modality vision transformers. Cross-modality transformers exploit mutual information to better guide the learning process. Experiments on several benchmarks show that the proposed method achieves superior performance in comparison to previous state-of-the-art models. The stability of our method in terms of avoiding failure cases shows its potential as a practical annotation tool. The code and pretrained models will be released at https://github.com/lik1996/iCMFormer.
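To make the contrast between self-attention and cross-modality attention concrete, the following is a minimal sketch of a single cross-attention step in which image tokens act as queries and click tokens supply keys and values. It is an illustration of the general technique only, not the architecture proposed in the paper: the token counts, embedding size, random projection matrices, and all function names (`cross_attention`, `random_matrix`, and so on) are assumptions made for this example.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Scaled dot-product attention with queries from one modality and
    keys/values from the other (hypothetical helper, not the paper's code)."""
    q = matmul(query_tokens, w_q)
    k = matmul(context_tokens, w_k)
    v = matmul(context_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    # Each query token gets a distribution over the context tokens.
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 4                                     # assumed embedding size
image_tokens = random_matrix(6, dim, rng)   # assumed: 6 image patch embeddings
click_tokens = random_matrix(2, dim, rng)   # assumed: 2 click embeddings
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

# Image tokens attend to click tokens; passing image_tokens as both
# arguments instead would give plain self-attention.
fused, weights = cross_attention(image_tokens, click_tokens, w_q, w_k, w_v)
```

In this sketch, `fused` has one row per image token, and each row of `weights` is a distribution over the click tokens, which is the sense in which one modality guides the representation of the other.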