The recently proposed Generalized Referring Expression Segmentation (GRES) extends classic RES to multiple-target and non-target scenarios. Recent approaches focus on optimizing the final modality-fused feature, which is used directly for both segmentation and object-existence identification. However, integrating information at all granularities into a single joint representation is impractical in GRES, because the spatial relationships among instances are more complex and the text descriptions can be deceptive. Furthermore, applying the same binary target-existence judgment across all referent scenarios fails to capture their inherent differences, leading to ambiguity in object understanding. To address these weaknesses, we propose a $\textbf{H}$ierarchical Semantic $\textbf{D}$ecoding with $\textbf{C}$ounting Assistance framework (HDC). It hierarchically transfers complementary modality information across granularities and then aggregates each well-aligned semantic correspondence for multi-level decoding. Moreover, with complete semantic context modeling, we endow HDC with an explicit counting capability to facilitate comprehensive object perception in multiple-, single-, and non-target settings. Experimental results on the gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO benchmarks demonstrate the effectiveness and rationality of HDC, which outperforms state-of-the-art GRES methods by a remarkable margin. Code will be available $\href{https://github.com/RobertLuo1/HDC}{here}$.
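The abstract does not specify how the counting output is used, so the following is only an illustrative sketch of why an explicit count can carry more information than a binary existence flag. It assumes a hypothetical model output, `predicted_counts`, holding one non-negative count estimate per object category; the function names and the rounding rule are assumptions made for illustration and are not taken from HDC.

```python
from typing import Sequence


def referent_setting(predicted_counts: Sequence[float]) -> str:
    """Map hypothetical per-category count estimates to a referent setting.

    Assumption: each estimate is rounded to the nearest integer and
    clipped at zero before summing. This is not the paper's procedure.
    """
    total = sum(max(0, round(c)) for c in predicted_counts)
    if total == 0:
        return "non-target"
    if total == 1:
        return "single-target"
    return "multiple-target"


def binary_existence(predicted_counts: Sequence[float]) -> bool:
    """Binary judgment for comparison: it only says whether any target exists."""
    return referent_setting(predicted_counts) != "non-target"


if __name__ == "__main__":
    examples = ([0.1, 0.2], [0.9, 0.1], [2.2, 0.8])
    for counts in examples:
        print(counts, binary_existence(counts), referent_setting(counts))
```

In this sketch, the single-target and multiple-target inputs both give `True` under the binary judgment, while the count-based label separates them.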