Visual reasoning is dominated by end-to-end neural networks scaled to billions of model parameters and training examples. However, even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting. Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task into subtasks and solving them by orchestrating a set of (visual) tools. Recently, these models have achieved strong performance on tasks such as compositional visual question answering, visual grounding, and video temporal reasoning. Nevertheless, in their current form, these models heavily rely on human engineering of in-context examples in the prompt, which are often dataset- and task-specific and require significant labor from highly skilled programmers. In this work, we present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples, thereby avoiding human-created in-context examples. On a number of visual reasoning tasks, we show that our framework leads to consistent gains in performance, makes the LLM-as-controller setup more robust, and removes the need for human engineering of in-context examples.