Multimodal large language models (MLLMs) combine strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, which indicates that visual evidence is used ineffectively during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects and neglect fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that uses Group Relative Policy Optimization (GRPO) with a novel sufficiency-driven objective to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that VEPA consistently improves performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gains stem from strengthened, transferable visual grounding rather than from additional task-specific training.
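As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes rewards within one group of sampled evidence descriptions for the same image and question. The abstract does not define the sufficiency reward, so the binary rewards here are purely hypothetical placeholders (for example, 1.0 if a description alone were enough to answer the question). The function name, the use of the population standard deviation, and the `eps` term are likewise assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar per sampled response to the same prompt.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical sufficiency rewards for four sampled evidence descriptions.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Descriptions that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.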