Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where the answer often depends on small but decisive evidence within the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on the relevant evidence rather than from insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes the token-level divergence between the teacher's and the student's next-token distributions along these rollouts. This lets the model internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models match or outperform much larger open-source models, closed-source models, and "Thinking-with-Images" agentic models.
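The token-level objective described above can be illustrated with a minimal sketch. The abstract does not say which divergence is used, so this example assumes a KL divergence from the student to the teacher, KL(student || teacher), averaged over the tokens of one student-sampled rollout. The function names (`log_softmax`, `token_kl`, `rollout_divergence`) and the toy logits are hypothetical. In a real setup both sets of logits would come from the same MLLM, scored on the same rollout tokens, once conditioned on the crop (teacher) and once on the full image (student).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(p_logits, q_logits):
    """KL(p || q) at a single token position, computed from unnormalized logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def rollout_divergence(student_logits, teacher_logits):
    """Mean per-token KL(student || teacher) along one rollout.

    Assumption: the divergence direction and the mean reduction are
    illustrative choices, not taken from the source.
    Each argument is a list of per-position logit lists (length T, each of size V).
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must score the same rollout tokens")
    per_token = [token_kl(s, t) for s, t in zip(student_logits, teacher_logits)]
    return sum(per_token) / len(per_token)


# Toy rollout of two tokens over a three-word vocabulary (made-up numbers).
student = [[2.0, 0.5, 0.1], [0.3, 0.2, 1.5]]  # full-image-conditioned logits
teacher = [[2.0, 0.5, 0.1], [0.1, 0.1, 3.0]]  # crop-conditioned logits
loss = rollout_divergence(student, teacher)
```

Here the two policies agree at the first position and disagree at the second, so only the second token contributes to `loss`. Minimizing such a quantity with respect to the student's parameters, while treating the teacher's distribution as a fixed target, is one plausible reading of the self-distillation step. The gradient computation and the rollout sampling are omitted.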