Visual programming prompts large language models (LLMs) to generate executable code for visual tasks such as visual question answering (VQA). Prompt-based methods are difficult to improve, and they are also unreliable and costly in both time and money. Our goal is to develop an efficient visual programming system that requires neither 1) prompt-based LLMs at inference time nor 2) a large set of program and answer annotations. We develop a synthetic data augmentation approach and an alternative program generation method based on decoupling programs into higher-level skills, called templates, and their corresponding arguments. Our results show that with data augmentation, prompt-free smaller LLMs ($\approx$ 1B parameters) are competitive with state-of-the-art models, with the added benefit of much faster inference.
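To make the template/argument decoupling concrete, the following is a minimal sketch, not the authors' implementation. It assumes a template is a program skeleton with named slots and that arguments are the strings that fill those slots. The names `ProgramTemplate`, `EXISTS`, `detect`, and `run` are hypothetical, and `detect` is a toy stand-in that treats an "image" as a list of object labels rather than calling a real vision model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgramTemplate:
    """A reusable program skeleton (hypothetical representation of a 'template')."""
    name: str
    body: str  # program text with {named} slots for the arguments

    def fill(self, **arguments: str) -> str:
        # Combine the template with its arguments to obtain a concrete program.
        return self.body.format(**arguments)


# Hypothetical template for "is there a <object> in the image?" questions.
EXISTS = ProgramTemplate(
    name="exists",
    body=(
        "boxes = detect(image, {object!r})\n"
        "answer = 'yes' if len(boxes) > 0 else 'no'"
    ),
)


def detect(image, label):
    # Toy stand-in for an object detector: the "image" is a list of labels.
    return [obj for obj in image if obj == label]


def run(program: str, image) -> str:
    # Execute the generated program in a scope that exposes the toy detector.
    scope = {"detect": detect, "image": image}
    exec(program, scope)
    return scope["answer"]


if __name__ == "__main__":
    toy_image = ["cat", "dog"]
    for obj in ("dog", "bird"):
        program = EXISTS.fill(object=obj)
        print(obj, "->", run(program, toy_image))
```

Under this assumption, one template paired with different arguments yields many concrete programs, which is one way to picture how a small set of templates could cover a large number of questions.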