Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
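The pipeline described above (an API exposing the available modules, a code-generation model that writes a Python program against that API, and a later execution step) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not taken from the text: the `ImagePatch` class and its `find` and `exists` methods, the `API_SPEC` string, the `execute_command` entry point, and the example query. The code-generation model is replaced by a stub, `generate_program`, that returns a fixed program, and the vision module is replaced by a list of object labels.

```python
from dataclasses import dataclass, field


@dataclass
class ImagePatch:
    """Hypothetical stand-in for an image region.

    A real system would wrap pixels and call a vision model; here a list
    of object labels plays the role of the detector output.
    """
    objects: list = field(default_factory=list)

    def find(self, name):
        # Return one patch per object whose label matches `name`.
        return [ImagePatch([o]) for o in self.objects if o == name]

    def exists(self, name):
        # True if at least one matching object is present.
        return len(self.find(name)) > 0


# Hypothetical API description that would be placed in the model's prompt.
API_SPEC = '''
class ImagePatch:
    def find(self, name: str) -> list: ...
    def exists(self, name: str) -> bool: ...
'''


def generate_program(query, api_spec):
    """Stub for the code-generation model (an assumption of this sketch).

    A real system would send `api_spec` and `query` to a model and get
    source code back; this stub ignores both and returns a fixed program.
    """
    return (
        "def execute_command(image):\n"
        "    muffins = image.find('muffin')\n"
        "    kids = image.find('kid')\n"
        "    return len(muffins) // len(kids)\n"
    )


def answer(query, image):
    # 1. Generate a program for the query against the provided API.
    source = generate_program(query, API_SPEC)
    # 2. Execute the generated source to define `execute_command`.
    namespace = {"ImagePatch": ImagePatch}
    exec(source, namespace)
    # 3. Run the generated function on the input image.
    return namespace["execute_command"](image)


image = ImagePatch(["muffin"] * 8 + ["kid"] * 2)
result = answer("How many muffins can each kid have for it to be fair?", image)
print(result)
```

In this sketch no component is trained: the modules and the generator are fixed, and only the generated program changes from query to query. Because `exec` runs whatever source it is given, a real deployment would need to run model-generated code in a restricted environment.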