Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers

Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving the subtasks by orchestrating a set of (visual) tools. Recently, these models have achieved strong performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models rely heavily on human engineering of the in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples.
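To make the idea of automatically generated in-context examples concrete, here is a minimal sketch of one way it could work: candidate programs are proposed for each labeled question, executed, and kept only if they reproduce the label. This is an illustration under stated assumptions, not the authors' implementation. The names count_objects, run_program, generate_ices, and toy_propose are hypothetical, the "image" is a plain dictionary of object counts, and toy_propose stands in for an LLM call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical visual tool: a real system would wrap a detector or VQA model.
# Here an "image" is just a dict mapping object names to counts.
def count_objects(image: Dict[str, int], name: str) -> int:
    return image.get(name, 0)

TOOLS = {"count_objects": count_objects}

def run_program(program: str, image: Dict[str, int]) -> object:
    """Execute a generated program that must define `answer(image)`."""
    scope = dict(TOOLS)          # expose the tools to the generated code
    exec(program, scope)         # assumption: programs come from a trusted source
    return scope["answer"](image)

def generate_ices(
    propose: Callable[[str], List[str]],
    labeled: List[Tuple[str, Dict[str, int], object]],
    max_ices: int = 2,
) -> List[Tuple[str, str]]:
    """Keep (question, program) pairs whose program reproduces the label."""
    ices: List[Tuple[str, str]] = []
    for question, image, label in labeled:
        for program in propose(question):
            try:
                result = run_program(program, image)
            except Exception:
                continue         # discard programs that fail to run
            if result == label:
                ices.append((question, program))
                break            # one verified program per question is enough
        if len(ices) >= max_ices:
            break
    return ices

def toy_propose(question: str) -> List[str]:
    """Stand-in for an LLM: returns one buggy and one correct candidate."""
    noun = question.rstrip("?").split()[-1]
    return [
        f"def answer(image):\n    return count_objects(image, '{noun}') + 1\n",
        f"def answer(image):\n    return count_objects(image, '{noun}')\n",
    ]

if __name__ == "__main__":
    labeled = [("How many cats?", {"cats": 2, "dogs": 1}, 2)]
    for question, program in generate_ices(toy_propose, labeled):
        print(question)
        print(program)
```

In this sketch the labels act only as a filter on generated programs, so no human writes the examples that end up in the prompt. Whether the actual framework selects programs this way is an assumption.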

