Visual language reasoning requires a system to extract text or numbers from information-dense images such as charts or plots and to perform logical or arithmetic reasoning to arrive at an answer. To tackle this task, existing work relies on either (1) an end-to-end vision-language model trained on a large amount of data, or (2) a two-stage pipeline in which a captioning model converts the image into text that a separate large language model then reads to deduce the answer. However, the former approach forces the model to answer a complex question in a single step, and the latter is vulnerable to inaccurate or distracting information in the converted text, which can confuse the language model. In this work, we propose a dual-system framework for multi-step multimodal reasoning, which consists of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning. Given an input, System-2 breaks down the question into atomic sub-steps, each guiding System-1 to extract the information required for reasoning from the image. Experiments on chart and plot datasets show that our method with a pre-trained System-2 module performs competitively with prior work on in- and out-of-distribution data. Fine-tuning the System-2 module (LLaMA-2 70B) on only a small amount of multi-step reasoning data further improves accuracy: our method surpasses the best fully-supervised end-to-end approach by 5.7% and a pipeline approach with FlanPaLM (540B) by 7.5% on a challenging dataset with human-authored questions.
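The control flow described above, in which System-2 decomposes a question into atomic sub-steps and each sub-step asks System-1 for one piece of information, can be illustrated with a minimal sketch. This is not the implementation from this work: both modules are replaced by toy stubs, the "image" is assumed to be already reduced to a dictionary, and every name (`system1_extract`, `system2_next_step`, `answer`, `Chart`) is hypothetical. In the actual method, System-2 is a language model and System-1 operates on the image itself.

```python
from typing import Dict, Tuple, Union

# Hypothetical stand-in for a chart image: label -> plotted value.
Chart = Dict[str, float]
# A step is either ("ask", label) or ("answer", value).
Step = Tuple[str, Union[str, float]]


def system1_extract(chart: Chart, label: str) -> float:
    """Toy System-1: return the single value requested by one sub-step."""
    return chart[label]


def system2_next_step(question: str, facts: Dict[str, float]) -> Step:
    """Toy System-2: request one missing fact, or answer once all are known.

    Only handles questions of the form 'difference between A and B'.
    """
    prefix = "difference between "
    if not question.startswith(prefix):
        raise ValueError("toy planner only handles 'difference between A and B'")
    a, b = question[len(prefix):].split(" and ")
    for label in (a, b):
        if label not in facts:
            return ("ask", label)  # atomic sub-step for System-1
    return ("answer", facts[a] - facts[b])  # arithmetic reasoning step


def answer(question: str, chart: Chart, max_steps: int = 8) -> float:
    """Alternate between System-2 planning and System-1 extraction."""
    facts: Dict[str, float] = {}
    for _ in range(max_steps):
        kind, payload = system2_next_step(question, facts)
        if kind == "answer":
            return payload
        facts[payload] = system1_extract(chart, payload)
    raise RuntimeError("no answer within the step budget")


chart = {"2020": 12.0, "2021": 19.5}
result = answer("difference between 2021 and 2020", chart)
print(result)
```

The point of the sketch is the loop in `answer`: the reasoning module never sees the whole chart, only the facts it explicitly requested, which is one way to read the contrast with a captioning pipeline that passes all converted text to the language model at once.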