Click-point-based interactive segmentation has received widespread attention due to its efficiency. However, existing algorithms struggle to produce precise and robust responses after multiple clicks: the segmentation result often changes little or even becomes worse than before. To improve the robustness of the response, we propose a structured click intent model based on graph neural networks, which adaptively obtains graph nodes via the global similarity of user-clicked Transformer tokens. The graph nodes are then aggregated to obtain structured interaction features. Finally, dual cross-attention injects the structured interaction features into the vision Transformer features, thereby enhancing the control of clicks over the segmentation results. Extensive experiments demonstrate that the proposed algorithm can serve as a general structure for improving the performance of Transformer-based interactive segmentation. The code and data will be released at https://github.com/hahamyt/scc.
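To make the three stages named in the abstract concrete (node selection by global similarity, node aggregation, and dual cross-attention injection), the following is a minimal NumPy sketch, not the released implementation. Everything specific in it is an assumption made for illustration: the function names, the cosine-similarity threshold used to pick nodes, the single round of softmax-weighted message passing, and the reading of "dual" as two attention directions (nodes attending to image tokens, then image tokens attending to nodes). Learned projections, multiple heads, and positive/negative click labels are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def select_nodes(tokens, click_idx, threshold=0.5):
    """Assumed node selection: keep every token whose cosine similarity
    to the clicked token is at least `threshold` (hypothetical parameter)."""
    unit = l2_normalize(tokens)
    sim = unit @ unit[click_idx]              # (N,) similarity to all tokens
    return np.flatnonzero(sim >= threshold)   # indices of graph nodes


def aggregate_nodes(nodes):
    """Assumed aggregation: one round of similarity-weighted message passing
    over a fully connected graph of the selected nodes."""
    unit = l2_normalize(nodes)
    adj = softmax(unit @ unit.T, axis=-1)     # (K, K), each row sums to 1
    return adj @ nodes                        # (K, D) interaction features


def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention without learned projections.
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)
    return attn @ keys_values


def inject_clicks(tokens, click_idx, threshold=0.5):
    """Return click-conditioned tokens and the indices of the chosen nodes."""
    node_idx = select_nodes(tokens, click_idx, threshold)
    nodes = aggregate_nodes(tokens[node_idx])
    # Direction 1 (assumed): interaction features gather image context.
    nodes = nodes + cross_attention(nodes, tokens)
    # Direction 2 (assumed): image tokens are updated by interaction features.
    return tokens + cross_attention(tokens, nodes), node_idx
```

Calling `inject_clicks(tokens, click_idx)` on an `(N, D)` array of vision Transformer tokens returns an `(N, D)` array of updated tokens together with the indices of the tokens chosen as graph nodes. The clicked token is always among those nodes, since its similarity to itself is approximately 1.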