Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information because their parametric knowledge is static. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval, which introduces substantial visual redundancy and noise, and they lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts the model from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at the global context or gaze into high-value regions, filtering out irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further strengthens the model's ability to handle complex queries through iterative reasoning. Experiments across six benchmarks show that GoG achieves state-of-the-art performance, and ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search. We will release our data and models soon to support further exploration.