Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks, but their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints: they argue that MLLMs struggle to recognize small objects and therefore adopt "zoom in" strategies to recover detail. Our analysis reveals a different cause: the main issue is not object size but interference from complex backgrounds. We systematically analyze the "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework. HiDe first uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to align precisely with the target visual regions. It then employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstruct a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new state of the art (SOTA) on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to 92.1% and 91.6% on V*Bench, respectively, and even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://tennine2077.github.io/HiDe.github.io/.
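To make the idea of a layout-preserving compact representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a per-patch attention grid is already available (how TAD derives it is not shown), and the relative threshold `keep_ratio` and the function names `select_patches`, `compact_layout`, and `compact_grid` are hypothetical. Selected patches are kept, empty rows and columns are dropped, and the surviving patches retain their relative positions.

```python
def select_patches(attn, keep_ratio=0.5):
    """Mark patches whose attention is at least keep_ratio * peak (assumed rule)."""
    peak = max(max(row) for row in attn)
    return [[value >= keep_ratio * peak for value in row] for row in attn]


def compact_layout(mask):
    """Return indices of rows and columns that contain at least one selected patch."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows, cols


def compact_grid(grid, mask):
    """Drop empty rows/columns; unselected cells in kept rows/columns become None."""
    rows, cols = compact_layout(mask)
    return [[grid[r][c] if mask[r][c] else None for c in cols] for r in rows]


# Toy 4x4 example: two salient patches in opposite corners of the image.
attn = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.8],
]
grid = [[f"p{r}{c}" for c in range(4)] for r in range(4)]

mask = select_patches(attn)
compact = compact_grid(grid, mask)
print(compact)
```

In this toy case the 4x4 grid shrinks to 2x2, and the top-left patch stays above and to the left of the bottom-right patch, so the background is discarded while the relative layout survives.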