IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, these models still fall short in zero-shot reasoning tasks that require multi-step inference. To tackle such tasks, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposition models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason toward the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
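The iterative loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the code from the IdealGPT repository: the class name `IterativeDecomposer`, the three callables `ask`, `answer`, and `reason`, the `max_rounds` cap, and the convention that the reasoner returns `None` when it is not yet confident are all assumptions made for this example. The toy functions stand in for the two LLMs and the VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical names throughout; none of this is the IdealGPT repository's API.
QA = Tuple[str, str]  # (sub-question, sub-answer)


@dataclass
class IterativeDecomposer:
    ask: Callable[[str, List[QA]], List[str]]          # LLM that proposes sub-questions
    answer: Callable[[str, str], str]                  # VLM: (image, sub-question) -> sub-answer
    reason: Callable[[str, List[QA]], Optional[str]]   # LLM that returns None if not confident
    max_rounds: int = 4                                # assumed cap on iterations

    def solve(self, image: str, question: str) -> Tuple[Optional[str], List[QA]]:
        evidence: List[QA] = []
        for _ in range(self.max_rounds):
            # Divide: generate new sub-questions given what is known so far.
            for sub_q in self.ask(question, evidence):
                # Conquer: let the vision-language model answer each one.
                evidence.append((sub_q, self.answer(image, sub_q)))
            # Try to conclude; stop as soon as the reasoner is confident.
            final = self.reason(question, evidence)
            if final is not None:
                return final, evidence
        # Assumption of this sketch: give up with None when the cap is reached.
        return None, evidence


# Toy stand-ins for the three models, used only to exercise the loop.
def toy_ask(question: str, evidence: List[QA]) -> List[str]:
    return [f"sub-question {len(evidence) + 1}"]


def toy_answer(image: str, sub_q: str) -> str:
    return f"answer to {sub_q}"


def toy_reason(question: str, evidence: List[QA]) -> Optional[str]:
    return "yes" if len(evidence) >= 2 else None


solver = IterativeDecomposer(toy_ask, toy_answer, toy_reason)
final, evidence = solver.solve("image.jpg", "Is the person about to leave?")
```

With these toy stand-ins, the reasoner declines to answer after the first round and commits after the second, which is the behavior the abstract contrasts with pipelines that force a final answer from insufficient sub-answers.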

