Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where the answer often depends on small but decisive evidence within the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on the relevant evidence rather than from insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes the token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models perform competitively with, or better than, much larger open-source, closed-source, and "Thinking-with-Images" agentic models.
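The token-level objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which divergence is used or its direction, so the choice of KL(student || teacher) here is an assumption, as are the function names `softmax`, `token_kl`, and `rollout_distillation_loss`. The inputs are assumed to be per-position next-token logits for one student-generated rollout, scored once by the full-image-conditioned student and once by the crop-conditioned teacher.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def rollout_distillation_loss(student_logits, teacher_logits):
    """Mean per-token divergence along one student-generated rollout.

    student_logits[t]: next-token logits at position t given the full image.
    teacher_logits[t]: next-token logits at the same position given the crop.
    Assumption: the divergence is KL(student || teacher); the abstract only
    says "token-level divergence". In actual training the teacher side would
    typically be treated as a fixed target (no gradient).
    """
    if not student_logits or len(student_logits) != len(teacher_logits):
        raise ValueError("rollouts must be non-empty and of equal length")
    per_token = [
        token_kl(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token)
```

In this sketch the loss is zero when the student's full-image distribution already matches the teacher's crop-conditioned distribution at every position, and grows as the two diverge. Because both sets of logits are evaluated on tokens the student itself sampled, the signal stays on-policy.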