Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches use vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, which improves the understanding capacity of MLLMs. However, this paradigm often fragments objects across crops, causing semantic bias and incomplete retrieval, and it also introduces false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion, which mitigates single-resolution bias and alleviates object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as a localization prior within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks demonstrate that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD.
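The two ideas named in the abstract, fusing relevance across resolutions and using detections as a localization prior, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`upsample`, `fuse_scores`, `apply_detection_prior`, `select_cells`), the plain averaging used for fusion, the `background_scale` down-weighting, and all scores and boxes are assumptions made up for illustration. A real system would obtain the relevance maps from a retriever and the boxes from an open-vocabulary detector.

```python
from typing import List, Tuple

Grid = List[List[float]]
# Hypothetical box format: (row0, col0, row1, col1), end-exclusive, in fine-grid cells.
Box = Tuple[int, int, int, int]


def upsample(grid: Grid, size: int) -> Grid:
    """Nearest-neighbour upsample of a square grid to size x size."""
    n = len(grid)
    return [[grid[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


def fuse_scores(grids: List[Grid], size: int) -> Grid:
    """Average query-relevance maps from several resolutions on a common grid.

    Averaging is an assumed stand-in for the paper's semantic fusion.
    """
    ups = [upsample(g, size) for g in grids]
    return [[sum(u[r][c] for u in ups) / len(ups) for c in range(size)]
            for r in range(size)]


def apply_detection_prior(scores: Grid, boxes: List[Box],
                          background_scale: float = 0.5) -> Grid:
    """Keep scores inside any detection box; scale down scores outside all boxes."""
    size = len(scores)
    inside = {(r, c)
              for r0, c0, r1, c1 in boxes
              for r in range(max(r0, 0), min(r1, size))
              for c in range(max(c0, 0), min(c1, size))}
    return [[scores[r][c] if (r, c) in inside else scores[r][c] * background_scale
             for c in range(size)] for r in range(size)]


def select_cells(scores: Grid, threshold: float) -> List[Tuple[int, int]]:
    """Return (row, col) cells whose score reaches the threshold, in row-major order."""
    return [(r, c) for r, row in enumerate(scores)
            for c, s in enumerate(row) if s >= threshold]


# Made-up relevance maps: the coarse map sees the whole object in the top-left,
# the fine map scores only parts of it highly and fires on a background cell (2, 3).
coarse = [[0.9, 0.1],
          [0.1, 0.1]]
fine = [[0.8, 0.2, 0.0, 0.0],
        [0.2, 0.6, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.7],
        [0.0, 0.0, 0.0, 0.0]]

fused = fuse_scores([coarse, fine], size=4)
prior = apply_detection_prior(fused, boxes=[(0, 0, 2, 2)])
selected = select_cells(prior, threshold=0.5)
print(selected)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

In this toy case, thresholding the fine map alone at 0.5 would keep cells (0, 0), (1, 1) and the background cell (2, 3), which mirrors the fragmentation and false-positive problems the abstract describes. After fusion and the detection prior, the four cells covering the object are kept and the background cell is dropped.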