We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging because reasoning data that involves multiple steps of visual and language processing is scarce. To overcome this challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools to resolve them. Based on this paradigm, we further propose a novel data synthesis approach that automatically creates questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks and relies (almost entirely) on open-source models to accomplish them. The entire synthesis process is therefore reproducible and cost-efficient, and the quality of the synthesized data is guaranteed. With this approach, we construct $50$k visual reasoning examples. We then develop a visual reasoner through supervised fine-tuning, which can generally enhance the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments show that the visual reasoner consistently and significantly improves four VLMs on four VQA benchmarks. Our code and dataset are available at https://github.com/steven-ccq/VisualReasoner.
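The interleaved decompose-then-solve loop of the least-to-most paradigm can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the function names (`decompose`, `invoke_tool`) and the toy decomposition rule are assumptions introduced here for clarity.

```python
def decompose(question, history):
    """Toy decomposer: peel off the next sub-question, or signal completion.

    Here we simply split on " and "; the paper's reasoner would instead
    generate sub-questions with a language model.
    """
    parts = question.split(" and ")
    if len(history) >= len(parts):
        return None  # no further decomposition needed
    return parts[len(history)].strip() + "?"

def invoke_tool(sub_question, image):
    """Stub for an external tool call (e.g. grounding, OCR, counting)."""
    return f"answer to '{sub_question}'"

def least_to_most_reason(question, image):
    """Interleave decomposition and tool invocation until the question is resolved."""
    history = []
    while True:
        sub_q = decompose(question, history)
        if sub_q is None:
            break
        history.append((sub_q, invoke_tool(sub_q, image)))
    return history
```

The key design point is that each iteration conditions on the accumulated sub-question/answer history, so later sub-questions can build on earlier tool results.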