Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities, primarily due to the strong in-context understanding and multi-task learning abilities of large language models (LLMs). The advent of visual instruction tuning has further improved MLLMs' performance in vision-language understanding. However, while existing MLLMs adeptly recognize \textit{what} objects are in an image, they still struggle to discern \textit{where} these objects are, particularly along the distance (scene depth) axis. To address this limitation, we introduce Proximity Question Answering (Proximity QA), a novel framework designed to enable MLLMs to infer the proximity relationships between objects in images. The framework operates in two phases: the first phase guides the model to understand the relative depth of objects, and the second phase encourages the model to infer the proximity relationships between objects based on its depth perception. We also propose a VQA dataset called Proximity-110K, containing additional instructions that incorporate depth information and the proximity relationships of objects. Extensive experiments show that Proximity QA outperforms other state-of-the-art MLLMs in depth perception and proximity analysis. Code and dataset will be released at \textcolor{magenta}{https://github.com/NorthSummer/ProximityQA.git}.