Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing a task into subtasks and solving them by orchestrating a set of (visual) tools. Recently, these models have achieved strong performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, they rely heavily on human-engineered in-context examples in the prompt, which are often dataset- and task-specific and require significant labor by highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created ones. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLMs-as-controllers setup more robust, and removes the need for human engineering of in-context examples.
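The abstract does not spell out how in-context examples are generated from labeled data. One plausible reading, sketched below purely as an assumption, is a generate-and-filter loop: a controller proposes candidate programs for each labeled question, each candidate is executed, and only programs whose output matches the label are kept as in-context examples. All names here (`run_program`, `build_in_context_examples`, `toy_propose`) are hypothetical, the LLM is replaced by a fixed lookup table, and the visual tools are replaced by plain list arithmetic.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types for this sketch only.
LabeledExample = Tuple[str, str]    # (question, ground-truth answer)
InContextExample = Tuple[str, str]  # (question, program source)


def run_program(source: str) -> Optional[str]:
    """Execute a candidate program; return its `answer` as a string, or None on failure."""
    scope: dict = {}
    try:
        exec(source, scope)  # acceptable in a toy sketch; unsafe for untrusted code
    except Exception:
        return None
    if "answer" not in scope:
        return None
    return str(scope["answer"])


def build_in_context_examples(
    labeled: List[LabeledExample],
    propose: Callable[[str], List[str]],
) -> List[InContextExample]:
    """Keep, per question, the first candidate program whose output matches the label."""
    kept: List[InContextExample] = []
    for question, truth in labeled:
        for source in propose(question):
            if run_program(source) == truth:
                kept.append((question, source))
                break  # one verified program per question is enough here
    return kept


def toy_propose(question: str) -> List[str]:
    """Stand-in for an LLM controller: returns fixed candidate programs."""
    candidates = {
        "How many of [3, 8, 5] exceed 4?": [
            "answer = len([3, 8, 5])",                      # wrong: counts every item
            "answer = sum(1 for x in [3, 8, 5] if x > 4)",  # right: counts matches only
        ],
        "What is 2 + 2?": ["answer = 2 * 3"],               # no candidate is correct
    }
    return candidates.get(question, [])


labeled = [("How many of [3, 8, 5] exceed 4?", "2"), ("What is 2 + 2?", "4")]
examples = build_in_context_examples(labeled, toy_propose)
```

In this toy run only the first question yields a verified program, so `examples` holds a single entry; the second question is dropped because none of its candidates reproduces the label. In a real setup the kept pairs would be placed in the controller's prompt in place of hand-written examples.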