Interactive image segmentation enables users to interact minimally with a machine, gradually refining the segmentation mask of a target of interest. Previous studies have demonstrated impressive performance in extracting a single target mask through interactive segmentation. However, existing methods overlook the information cues of previously interacted objects, which can be exploited to speed up interactive segmentation of multiple targets in the same category. To this end, we introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category. Specifically, our model leverages transformer backbones to extract interaction-focused visual features from the image and the interactions, producing a satisfactory mask of a target as an exemplar. For multiple objects, we propose an exemplar-informed module to enhance the learning of similarities among objects of the target category. To combine attended features from different modules, we incorporate cross-attention blocks followed by a feature fusion module. Experiments on mainstream benchmarks demonstrate that our models outperform previous methods. In particular, our model reduces users' labor by around 15\%, requiring two fewer clicks on average to reach target IoUs of 85\% and 90\%. These results highlight our models' potential as a flexible and practical annotation tool. The source code will be released after publication.