Vision-Language Models (VLMs) have made remarkable progress in single-image understanding, yet reasoning effectively across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices (``Look at Image 3 and...''), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels that demand progressively greater capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach uses prompt-driven complexity to create chosen/rejected pairs that apply across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data significantly improves multi-image reasoning performance over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while strengthening multi-image understanding, thus advancing the state of the art in holistic visual preference alignment.