Referring image segmentation (RIS) aims to locate the image region that a language expression refers to. Existing methods fuse features from different modalities in a \emph{bottom-up} manner. This design can introduce irrelevant image-text pairs, which leads to inaccurate segmentation masks. In this paper, we propose HARIS, a referring image segmentation method that introduces a Human-Like Attention mechanism and adopts a parameter-efficient fine-tuning (PEFT) framework. Specifically, the Human-Like Attention mechanism obtains a \emph{feedback} signal from the multi-modal features, which lets the network focus on the referred objects and discard irrelevant image-text pairs. In addition, we adopt the PEFT framework to preserve the zero-shot ability of the pre-trained encoders. Extensive experiments on three widely used RIS benchmarks and the PhraseCut dataset demonstrate that our method achieves state-of-the-art performance and strong zero-shot ability.