We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models (LMs), visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. The generated Python programs invoke and compose the outputs of the visual models using arithmetic and conditional logic. Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.
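As an illustration of what a generated program might look like, the following minimal sketch composes the outputs of visual primitives with arithmetic and conditional logic. The primitive names (`count_objects`, `exists`), the `Image` type, and the example question are hypothetical and are not taken from the framework itself; the primitives are toy stubs that read from a dictionary, where a real system would call pre-trained visual models.

```python
from typing import Dict, List

# Hypothetical stand-in for an image: maps an object name to its count.
# A real system would pass pixel data to pre-trained visual models instead.
Image = Dict[str, int]


def count_objects(image: Image, name: str) -> int:
    """Stub visual primitive: how many `name` objects appear in `image`."""
    return image.get(name, 0)


def exists(image: Image, name: str) -> bool:
    """Stub visual primitive: does `image` contain at least one `name`?"""
    return count_objects(image, name) > 0


def answer_question(images: List[Image]) -> str:
    """Program of the kind an LM might generate for the (assumed) question:
    'Are there more images with a dog than images with a cat?'"""
    dog_images = 0
    cat_images = 0
    for image in images:
        # Conditional logic over the outputs of the visual primitives.
        if exists(image, "dog"):
            dog_images += 1
        if exists(image, "cat"):
            cat_images += 1
    # Arithmetic comparison composes the per-image results into an answer.
    return "yes" if dog_images > cat_images else "no"


images = [{"dog": 2}, {"cat": 1, "dog": 1}, {"sofa": 1}]
print(answer_question(images))
```

In this sketch the language model's only job is to write `answer_question`; the visual reasoning is delegated to the primitives, which is why no additional training would be needed under these assumptions.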