Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions and is a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean, well-controlled conditions, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination and weather, which are critical for evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Building on this framework, we propose RGBT-VGNet, a simple yet effective baseline that fuses complementary visual modalities to achieve robust grounding. We extensively adapt existing visual grounding methods to RGBT-Ground and compare against them. Experimental results show that RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.
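To make the described annotation structure concrete, the following is a minimal sketch of how a single RGBT-Ground sample could be represented, assuming one referring expression per box and string-valued scene/environment/object labels; all field names (rgb_path, tir_path, expression, bbox, scene, environment, object_attrs) and example values are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RGBTGroundSample:
    """Hypothetical record for one RGBT-Ground sample (field names are assumptions)."""
    rgb_path: str                 # spatially aligned RGB image
    tir_path: str                 # spatially aligned thermal infrared (TIR) image
    expression: str               # natural language referring expression
    bbox: List[float]             # target box, e.g. [x, y, w, h] in pixels
    scene: str                    # scene-level annotation, e.g. "urban road"
    environment: Dict[str, str]   # environment-level annotation (illumination, weather, ...)
    object_attrs: Dict[str, str]  # object-level annotation (category, distance, ...)


# Example instance (values are illustrative only).
sample = RGBTGroundSample(
    rgb_path="images/rgb/000123.jpg",
    tir_path="images/tir/000123.jpg",
    expression="the cyclist on the left side of the road",
    bbox=[412.0, 188.0, 56.0, 120.0],
    scene="urban road",
    environment={"illumination": "night", "weather": "clear"},
    object_attrs={"category": "cyclist", "distance": "long"},
)
```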
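The unified framework and RGBT-VGNet are only characterized at a high level here. As a rough illustration of the underlying idea of grounding from uni-modal or fused RGB-TIR inputs, the minimal PyTorch sketch below accepts either modality alone or both and blends them with a simple learned gate; the module names, the gated fusion, and the tiny encoders are assumptions for illustration and do not reflect the authors' actual architecture.

```python
import torch
import torch.nn as nn


class SimpleRGBTGrounder(nn.Module):
    """Minimal sketch of grounding over uni-modal or RGB-TIR inputs (not RGBT-VGNet)."""

    def __init__(self, vis_dim=256, txt_dim=256):
        super().__init__()
        # Tiny per-modality encoders; a real model would use pretrained backbones.
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, vis_dim, 3, stride=2, padding=1),
                                     nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.tir_enc = nn.Sequential(nn.Conv2d(1, vis_dim, 3, stride=2, padding=1),
                                     nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        # Gated fusion of the two modality features (an assumed design choice).
        self.gate = nn.Sequential(nn.Linear(2 * vis_dim, vis_dim), nn.Sigmoid())
        # Box regressor conditioned on the fused visual feature and a text embedding.
        self.head = nn.Sequential(nn.Linear(vis_dim + txt_dim, 256), nn.ReLU(),
                                  nn.Linear(256, 4))  # (cx, cy, w, h), normalized

    def forward(self, text_feat, rgb=None, tir=None):
        feats = []
        if rgb is not None:
            feats.append(self.rgb_enc(rgb).flatten(1))
        if tir is not None:
            feats.append(self.tir_enc(tir).flatten(1))
        if len(feats) == 2:                       # multi-modal: gated blend of RGB and TIR
            g = self.gate(torch.cat(feats, dim=1))
            vis = g * feats[0] + (1 - g) * feats[1]
        else:                                     # uni-modal: use whichever input is given
            vis = feats[0]
        return self.head(torch.cat([vis, text_feat], dim=1)).sigmoid()


# Usage: a batch of 2 with both modalities and a pre-computed text embedding.
model = SimpleRGBTGrounder()
boxes = model(text_feat=torch.randn(2, 256),
              rgb=torch.randn(2, 3, 224, 224),
              tir=torch.randn(2, 1, 224, 224))
print(boxes.shape)  # torch.Size([2, 4])
```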