Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where the answer often depends on small but decisive evidence within the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on the relevant evidence rather than from insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes the token-level divergence between the teacher's and the student's next-token distributions along these rollouts. This lets the model internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models perform competitively with, or better than, much larger open-source models, closed-source models, and "Thinking-with-Images" agentic models.
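The distillation objective described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The abstract says only "token-level divergence", so the reverse KL, KL(student || teacher), is assumed here, as is treating the crop-conditioned teacher as a fixed target. The function names (`log_softmax`, `token_kl`, `rollout_divergence`) and the toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at one rollout position (direction is an assumption)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def rollout_divergence(
    student_steps: Sequence[Sequence[float]],
    teacher_steps: Sequence[Sequence[float]],
) -> float:
    """Mean token-level divergence along one student-generated rollout.

    student_steps[i]: logits of the full-image-conditioned policy at step i.
    teacher_steps[i]: logits of the crop-conditioned policy scored on the
    same student-sampled prefix (assumed here to be a fixed target).
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must be scored on the same rollout")
    if not student_steps:
        return 0.0
    per_token = [token_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Toy rollout: 2 tokens over a 3-word vocabulary (made-up logits).
student = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2]]
teacher = [[0.1, 2.5, 0.1], [0.2, 0.2, 0.2]]
loss = rollout_divergence(student, teacher)
```

In this toy case the two policies disagree only at the first token, so all of the loss comes from that position. In a real training loop both sets of logits would come from the same MLLM under the two visual conditions, and only the student's side would receive gradients under the assumption above.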