The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, these models still fall short on zero-shot reasoning tasks that require multi-step inference. To handle such tasks, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that these efforts have two inherent shortcomings: 1) they rely on domain-specific sub-question decomposition models; 2) they force models to predict the final answer even when the sub-questions or sub-answers provide insufficient information. We address these limitations with IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT uses an LLM to generate sub-questions, a VLM to provide the corresponding sub-answers, and another LLM to reason toward the final answer. These three modules repeat the divide-and-conquer procedure until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks in a zero-shot setting. In particular, IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
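The iterative procedure described above can be illustrated with a minimal Python sketch. It is not the released implementation: every name in it (`iterative_decompose`, `LoopResult`, the three callables) is hypothetical, the cap `max_rounds` is an assumption the abstract does not specify, and signalling "not yet confident" by returning `None` is only one possible convention. The toy stand-ins at the bottom replace real LLM and VLM calls so that the loop can be run as is.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical type alias: one (sub-question, sub-answer) pair.
QA = Tuple[str, str]


@dataclass
class LoopResult:
    answer: Optional[str]  # None if the reasoner never became confident
    rounds: int            # number of rounds actually executed
    evidence: List[QA] = field(default_factory=list)


def iterative_decompose(
    main_question: str,
    image: object,
    questioner: Callable[[str, List[QA]], List[str]],    # stands in for the sub-question LLM
    answerer: Callable[[object, str], str],              # stands in for the VLM
    reasoner: Callable[[str, List[QA]], Optional[str]],  # stands in for the reasoning LLM
    max_rounds: int = 4,                                 # assumed cap, not from the abstract
) -> LoopResult:
    """Repeat decompose -> answer -> reason until the reasoner is confident."""
    evidence: List[QA] = []
    for round_idx in range(1, max_rounds + 1):
        # 1) Generate new sub-questions, conditioned on the evidence so far.
        for sub_q in questioner(main_question, evidence):
            # 2) Answer each sub-question from the image.
            evidence.append((sub_q, answerer(image, sub_q)))
        # 3) Try to answer the main question; None means "not confident yet".
        answer = reasoner(main_question, evidence)
        if answer is not None:
            return LoopResult(answer, round_idx, evidence)
    return LoopResult(None, max_rounds, evidence)


# Toy stand-ins so the sketch runs without any model.
def toy_questioner(main_q: str, evidence: List[QA]) -> List[str]:
    return [f"sub-question {len(evidence) + 1}"]


def toy_answerer(image: object, sub_q: str) -> str:
    return f"answer to {sub_q}"


def toy_reasoner(main_q: str, evidence: List[QA]) -> Optional[str]:
    # Pretend two pieces of evidence are enough to be confident.
    return "yes" if len(evidence) >= 2 else None


result = iterative_decompose(
    "Is the person leaving?", None, toy_questioner, toy_answerer, toy_reasoner
)
```

With these stand-ins the loop stops in the second round, once two sub-answers have been collected. Passing the modules as plain callables keeps the control flow separate from any particular model or API.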