Referring image segmentation (RIS) aims to segment objects in an image conditioned on free-form text descriptions. Despite substantial progress, current approaches still struggle on cases with varied text expressions or unseen visual entities, which limits their further application. In this paper, we present a novel RIS approach that substantially improves generalization ability by addressing these two challenges. Specifically, to deal with unconstrained text, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context and facilitates target capturing in the presence of linguistic style changes. Furthermore, we introduce a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model; it leverages spatial relations and pixel coherence to handle the incomplete target masks and irregular false-positive clumps that often appear on unseen visual entities. Extensive experiments in zero-shot cross-dataset settings show that the proposed approach achieves consistent gains over the state of the art, e.g., mIoU increases of 4.15\%, 5.45\%, and 4.64\% on RefCOCO, RefCOCO+, and ReferIt, respectively, demonstrating its effectiveness. Additionally, results on GraspNet-RIS show that our approach also generalizes well to new scenarios with large domain shifts.