Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend is to train MLLMs with pixel-level grounding supervision, in the form of masks on large-scale labelled data, using specialized decoders for the segmentation task. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit weak visual question answering (VQA) ability. Surprisingly, some of these methods even degrade the grounding ability of MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel challenging benchmarks with paired evaluation of both VQA and grounding. We demonstrate that simple, non-unified baselines match or surpass the performance of some of the pixel-level MLLMs. Our paired benchmarks and evaluation enable additional analysis of the reasons for failure with respect to VQA and/or grounding. Furthermore, we propose a prompt sensitivity analysis of both the language and visual prompts tailored for the grounding task. More importantly, we study the research question, ``When does grounding emerge in MLLMs with respect to the output tokens?'' We propose an interpretability tool that can be plugged into any MLLM to study this question. We show that grounding does not necessarily coincide with the exact referring expression in the output, but can instead coincide with the object's parts, location, appearance, context, or state. Code and datasets are publicly available at https://msiam.github.io/PixFoundationSeries/.