Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling because they rely on perception-driven reasoning, which requires clear visual information to reason effectively. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation), which performs reasoning-driven perception, leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas when answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
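The two-stage pipeline above can be sketched as follows. This is a minimal illustration, not the released ERGO implementation: `locate_regions` and `answer` are hypothetical placeholders standing in for the model's first-stage region proposal and second-stage reasoning, and images are represented as plain nested lists for self-containment.

```python
# Minimal sketch of the coarse-to-fine pipeline. The stage-1 and stage-2
# model calls (`locate_regions`, `answer`) are hypothetical placeholders.

def downsample(image, factor):
    """Nearest-neighbor downsampling: keep every `factor`-th pixel."""
    return [row[::factor] for row in image[::factor]]

def expand_box(box, margin, width, height):
    """Grow an (x0, y0, x1, y1) box by `margin` to cover visually
    ambiguous borders, clamped to the image bounds (handles the
    perceptual-uncertainty expansion described in the abstract)."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))

def crop(image, box):
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def coarse_to_fine(image, locate_regions, answer, factor=4, margin=2):
    """Stage 1: reason over a downsampled view to pick relevant regions.
    Stage 2: re-crop those regions at full resolution and answer."""
    h, w = len(image), len(image[0])
    coarse = downsample(image, factor)
    # Region boxes come back in coarse coordinates; rescale and pad them.
    boxes = [expand_box((x0 * factor, y0 * factor, x1 * factor, y1 * factor),
                        margin, w, h)
             for (x0, y0, x1, y1) in locate_regions(coarse)]
    crops = [crop(image, b) for b in boxes]
    return answer(crops)
```

Only the expanded crops, rather than the full-resolution image, reach the second stage, which is where the vision-token savings come from.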