Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task into subtasks and solving them by orchestrating a set of (visual) tools. Recently, these models have achieved strong performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor from highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLM-as-controller setup more robust, and removes the need for human engineering of in-context examples.