Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models

Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near chance level, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning framework for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the $2 \times 2$ setting) but also generalizes well across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets are available at https://github.com/yuzeng0-0/AGILE.
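The observe-act-feedback cycle described in the abstract can be illustrated with a toy sketch. This is not the AGILE implementation: the names `JigsawEnv`, `swap`, `observe`, `scripted_policy`, and `run_episode` are hypothetical, the puzzle state is a symbolic permutation rather than an image, the feedback is a plain dictionary rather than rendered visual feedback, and a hand-written policy stands in for the VLM. It only shows the shape of a loop in which the agent emits a line of executable code at each step and the environment returns the resulting state.

```python
import random
from dataclasses import dataclass, field


@dataclass
class JigsawEnv:
    """Toy jigsaw environment (hypothetical; not the AGILE implementation).

    state[pos] is the index of the tile currently placed at grid position
    pos. The puzzle is solved when state[pos] == pos for every position.
    """
    rows: int = 2
    cols: int = 2
    seed: int = 0
    state: list = field(default_factory=list)

    def reset(self):
        rng = random.Random(self.seed)
        n = self.rows * self.cols
        self.state = list(range(n))
        # Reshuffle until at least one tile is misplaced (skipped for 1 tile).
        while n > 1 and self.state == list(range(n)):
            rng.shuffle(self.state)
        return self.observe()

    def swap(self, a, b):
        """Action: exchange the tiles at grid positions a and b."""
        self.state[a], self.state[b] = self.state[b], self.state[a]
        return self.observe()

    def observe(self):
        """Feedback: the current layout plus a solved flag."""
        return {"layout": list(self.state),
                "solved": self.state == sorted(self.state)}


def scripted_policy(obs):
    """Stand-in for the model: emit one line of code that fixes one tile."""
    layout = obs["layout"]
    for pos, tile in enumerate(layout):
        if tile != pos:
            # Swap the misplaced position with wherever its tile currently is.
            return f"env.swap({pos}, {layout.index(pos)})"
    return None


def run_episode(env, policy, max_steps=10):
    """Iterate: observe, generate action code, execute it, read feedback."""
    obs = env.reset()
    steps = 0
    while not obs["solved"] and steps < max_steps:
        action_code = policy(obs)
        # The emitted code can only see the environment object.
        obs = eval(action_code, {"__builtins__": {}}, {"env": env})
        steps += 1
    return obs, steps


if __name__ == "__main__":
    final_obs, n_steps = run_episode(JigsawEnv(), scripted_policy)
    print(final_obs, n_steps)
```

In this sketch the action is passed as a string and evaluated, mirroring the idea that the model acts by generating code; a real system would replace `scripted_policy` with model inference and would need proper sandboxing rather than a restricted `eval`.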
