Referring Image Segmentation (RIS) requires language semantics and appearance semantics to be closely aligned with each other, and the need becomes especially acute in hard cases. To this end, existing works tend to resort to various trans-representing mechanisms that directly feed language semantics forward along the main RGB branch. This, however, leaves the spatial distribution of the referent weakly mined and contaminates the channels with non-referent semantics. In this paper, we propose Spatial Semantic Recurrent Mining (S\textsuperscript{2}RM) to achieve high-quality cross-modality fusion. It follows a three-step working strategy: language feature distribution, spatial semantic recurrent coparsing, and parsed-semantic balancing. During fusion, S\textsuperscript{2}RM first generates a weakly constrained yet distribution-aware language feature, then bundles the features of each row and column of the rotated features from one modality context to recurrently correlate the relevant semantics contained in the features from the other modality context, and finally uses self-distilled weights to weigh the contributions of the different parsed semantics. Via coparsing, S\textsuperscript{2}RM transports information from the near and remote slice layers of the generator context to the current slice layer of the parsed context, and can thus better model global relationships in a bidirectional and structured manner. In addition, we propose a Cross-scale Abstract Semantic Guided Decoder (CASG) to emphasize the foreground of the referent, which finally integrates features of different granularities at a comparatively low cost. Extensive experimental results on four challenging datasets show that our proposed method performs favorably against other state-of-the-art algorithms.
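The three-step strategy can be illustrated with a toy sketch in plain Python. This is not the S\textsuperscript{2}RM implementation: the abstract does not specify the recurrent unit, the rotation, or how the self-distilled weights are computed, so everything below is an assumption made for illustration. The language feature is assumed to be already distributed to an H x W x C map, a transpose stands in for the rotation, an exponential running average stands in for the recurrent unit, and a softmax over mean activations stands in for the self-distilled weights. All function names (`transpose`, `bundle`, `scan`, `coparse`, `balance`) are hypothetical.

```python
import math

def transpose(fmap):
    """Swap rows and columns of an H x W x C nested-list feature map."""
    return [list(col) for col in zip(*fmap)]

def bundle(line):
    """Average the C-dim vectors of one row (one 'slice') into a single vector."""
    n = len(line)
    return [sum(vec[c] for vec in line) / n for c in range(len(line[0]))]

def scan(generator, parsed, decay=0.5, reverse=False):
    """Carry a running state over generator rows and gate the parsed rows with it."""
    order = list(range(len(generator)))
    if reverse:
        order.reverse()
    state = [0.0] * len(generator[0][0])
    out = [None] * len(parsed)
    for i in order:
        g = bundle(generator[i])
        # Assumed recurrence: the state mixes the current row with all rows seen so far.
        state = [decay * s + (1.0 - decay) * x for s, x in zip(state, g)]
        gated_row = []
        for vec in parsed[i]:
            score = sum(v * s for v, s in zip(vec, state))
            gate = 0.5 * (1.0 + math.tanh(score / 2.0))  # overflow-safe sigmoid
            gated_row.append([gate * v for v in vec])
        out[i] = gated_row
    return out

def coparse(generator, parsed):
    """Scan rows and columns in both directions; return four parsed maps."""
    gen_t, par_t = transpose(generator), transpose(parsed)
    return [
        scan(generator, parsed),
        scan(generator, parsed, reverse=True),
        transpose(scan(gen_t, par_t)),
        transpose(scan(gen_t, par_t, reverse=True)),
    ]

def balance(maps):
    """Fuse parsed maps with softmax weights (a stand-in for self-distilled weights)."""
    h, w, c = len(maps[0]), len(maps[0][0]), len(maps[0][0][0])
    means = [sum(v for row in m for vec in row for v in vec) / (h * w * c) for m in maps]
    top = max(means)
    exps = [math.exp(m - top) for m in means]
    weights = [e / sum(exps) for e in exps]
    fused = [[[sum(wk * m[i][j][k] for wk, m in zip(weights, maps))
               for k in range(c)] for j in range(w)] for i in range(h)]
    return fused, weights

# Toy 2 x 3 x 2 maps: the language map acts as generator, the visual map is parsed.
visual = [[[1.0, 0.5], [0.2, -0.3], [0.0, 1.0]],
          [[-1.0, 0.4], [0.7, 0.7], [0.3, -0.2]]]
language = [[[0.5, 0.5], [0.1, 0.0], [0.2, 0.9]],
            [[0.0, 0.3], [0.6, 0.4], [-0.1, 0.2]]]
fused, weights = balance(coparse(language, visual))
```

Because every gate lies between 0 and 1 and the weights sum to one, the fused map in this sketch only attenuates the parsed features; a learned model would not be limited in this way. Swapping the two arguments of `coparse` gives the opposite parsing direction, with the visual map as generator.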