High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26-CVSearch.
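The Assess-then-Search control flow described in the abstract can be illustrated with a minimal sketch. This is not the released CVSearch implementation: every name below (`assess_then_search`, `SearchResult`, and the six callables) is hypothetical, the MLLM, the visual expert, and the patching step are stood in for by plain Python callables, and the scan stage simply visits patches in descending order of a complexity score. That ordering is an assumption made for illustration, since the abstract states only that the search is driven by a Visual Complexity prior; it does not reproduce the paper's Dynamic Bottom-Up Search.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical region format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class SearchResult:
    answer: Optional[str]   # None if no stage produced an answer
    stage: str              # "global", "expert", "scan", or "none"
    regions_checked: int    # number of local regions inspected


def assess_then_search(
    is_global_sufficient: Callable[[], bool],
    answer_global: Callable[[], str],
    expert_proposals: Callable[[], Sequence[Box]],
    semantic_patches: Callable[[], Sequence[Box]],
    complexity: Callable[[Box], float],
    try_answer: Callable[[Box], Optional[str]],
) -> SearchResult:
    """Schedule search stages lazily: cheaper stages run first."""
    # Assess: answer from the global view when it already suffices.
    if is_global_sufficient():
        return SearchResult(answer_global(), "global", 0)

    checked = 0

    # Stage 1: expert-assisted search over proposed regions.
    for box in expert_proposals():
        checked += 1
        answer = try_answer(box)
        if answer is not None:
            return SearchResult(answer, "expert", checked)

    # Stage 2: scanning fallback, triggered only when stage 1 fails.
    # Assumption: visit patches with the highest complexity score first.
    for box in sorted(semantic_patches(), key=complexity, reverse=True):
        checked += 1
        answer = try_answer(box)
        if answer is not None:
            return SearchResult(answer, "scan", checked)

    return SearchResult(None, "none", checked)


if __name__ == "__main__":
    # Toy run: the expert proposes nothing, so the scan stage is used.
    patches = [(0, 0, 50, 50), (50, 0, 100, 50), (0, 50, 100, 100)]
    target = (50, 0, 100, 50)
    result = assess_then_search(
        is_global_sufficient=lambda: False,
        answer_global=lambda: "unused",
        expert_proposals=lambda: [],
        semantic_patches=lambda: patches,
        complexity=lambda b: 1.0 if b == target else 0.1,
        try_answer=lambda b: "found" if b == target else None,
    )
    print(result)
```

Passing the stages as callables keeps the patching step from running unless the expert stage has actually failed, which mirrors the efficiency argument in the abstract: the costly scan is a fallback, not the default path.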