Visual Grounding (VG), which aims to locate the specific image region referred to by a natural language expression, is a fundamental yet challenging task in multimodal understanding. While recent grounding transfer works have advanced the field through one-tower architectures, they still suffer from two primary limitations: (1) over-entangled multimodal representations that exacerbate deceptive modality biases, and (2) insufficient semantic reasoning that hinders the comprehension of referential cues. In this paper, we propose BARE, a bias-aware and reasoning-enhanced framework for one-tower visual grounding. BARE preserves modality-specific features and constructs referential semantics through three novel modules: (i) a language salience modulator, (ii) a visual bias corrector, and (iii) a referential relationship enhancer, which jointly mitigate multimodal distractions and enhance referential comprehension. Extensive experiments on five benchmarks demonstrate that BARE not only achieves state-of-the-art performance but also delivers superior computational efficiency compared with existing approaches. The code is publicly available at https://github.com/Marloweeee/BARE.