Visual grounding (VG) aims to locate a specific target in an image based on a given language query. Discriminative information from the context is important for distinguishing the target from other objects, particularly objects of the same category. However, most previous methods underestimate the value of such information. Moreover, they are usually designed for the standard scene (without any novel objects), which limits their generalization to the open-vocabulary scene. In this paper, we propose a novel framework with context disentangling and prototype inheriting for robust visual grounding that handles both scenes. Specifically, context disentangling separates the referent and context features, which improves discrimination between them. Prototype inheriting uses a prototype bank to inherit the prototypes discovered from the disentangled visual features, making full use of the seen data, especially in the open-vocabulary scene. The fused features are obtained by applying the Hadamard product to the disentangled linguistic features and the visual features of prototypes, which avoids sharply adjusting the relative importance of the two types of features; they are then attached with a special token and fed to a vision Transformer encoder for bounding box regression. Extensive experiments are conducted on both standard and open-vocabulary scenes. The performance comparisons indicate that our method outperforms state-of-the-art methods in both scenarios. The code is available at https://github.com/WayneTomas/TransCP.
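To make the fusion step concrete, the following is a minimal sketch of Hadamard-product fusion followed by attaching a special token, written in plain Python. It is an illustration under stated assumptions, not the released TransCP implementation: the function names `hadamard_fuse` and `prepend_token`, the toy feature values, the assumption that linguistic and visual features already share the same shape, and the choice to place the token at the front of the sequence are all hypothetical.

```python
from typing import List

Vector = List[float]


def hadamard_fuse(linguistic: List[Vector], visual: List[Vector]) -> List[Vector]:
    """Element-wise (Hadamard) product of two equally shaped feature sequences.

    Assumption: both inputs are already aligned to the same sequence length
    and feature dimension; how the paper aligns them is not specified here.
    """
    if len(linguistic) != len(visual):
        raise ValueError("sequence lengths differ")
    fused = []
    for l_vec, v_vec in zip(linguistic, visual):
        if len(l_vec) != len(v_vec):
            raise ValueError("feature dimensions differ")
        fused.append([l * v for l, v in zip(l_vec, v_vec)])
    return fused


def prepend_token(token: Vector, sequence: List[Vector]) -> List[Vector]:
    """Attach a special token to the front of the sequence.

    Assumption: the token is prepended; the abstract only says "attached".
    """
    return [list(token)] + sequence


# Toy features: two positions, two channels each (hypothetical values).
linguistic = [[1.0, 2.0], [0.5, 0.0]]
visual = [[3.0, 4.0], [2.0, 5.0]]
reg_token = [0.0, 0.0]  # stand-in for a learnable regression token

encoder_input = prepend_token(reg_token, hadamard_fuse(linguistic, visual))
print(encoder_input)  # [[0.0, 0.0], [3.0, 8.0], [1.0, 0.0]]
```

In this sketch, `encoder_input` plays the role of the sequence that would be passed to the Transformer encoder, whose output at the special token position would then drive bounding box regression.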