Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However, generated programs are error-prone: they omit necessary steps, include spurious ones, and are unable to recover when the specialized models give incorrect outputs. Moreover, they require loading multiple models, incurring high latency and computation costs. We propose Visual Program Distillation (VPD), an instruction tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs, which are then executed and verified to identify a correct one. It translates each correct program into a language description of the reasoning steps, which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count, understand spatial relations, and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally, experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data.
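The data-generation loop described above (sample candidate programs, execute them, keep one that verifies, turn it into reasoning steps) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_example`, `sample_program`, `run_program`, and `DistillationExample` are hypothetical names, verification is assumed to be a simple comparison against a known answer, and the rationale is formed by naively joining trace steps. The final step, fine-tuning the VLM on the resulting examples, is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical type: the ordered list of steps recorded while a program runs.
Trace = List[str]


@dataclass
class DistillationExample:
    """One training example for the VLM (hypothetical structure)."""
    question: str
    rationale: str  # natural-language description of the reasoning steps
    answer: str


def build_example(
    question: str,
    label: str,
    sample_program: Callable[[str], str],              # assumed LLM wrapper
    run_program: Callable[[str], Tuple[str, Trace]],   # assumed program executor
    k: int = 5,
) -> Optional[DistillationExample]:
    """Try up to k sampled programs and keep the first whose answer matches the label."""
    for _ in range(k):
        program = sample_program(question)
        try:
            answer, trace = run_program(program)
        except Exception:
            continue  # a candidate that fails to execute is discarded
        if answer == label:  # assumed verification: exact match with the label
            rationale = " ".join(
                f"Step {i}: {step}." for i, step in enumerate(trace, 1)
            )
            return DistillationExample(question, rationale, answer)
    return None  # no candidate verified; the question yields no training example


# Toy stand-ins for the LLM and the vision tools, for illustration only.
_candidates = iter(["wrong", "right"])


def toy_sample(question: str) -> str:
    return next(_candidates)


def toy_run(program: str) -> Tuple[str, Trace]:
    if program == "wrong":
        return "piano", ["located the object on the left"]
    return "guitar", ["located the object on the right", "recognized it as a guitar"]


example = build_example(
    "Which instrument is on the right?", "guitar", toy_sample, toy_run, k=2
)
```

In this toy run the first candidate produces the wrong answer and is dropped, and the second verifies, so `example.rationale` holds the two steps of the second program's trace. Questions for which no candidate verifies return `None` and would simply contribute no rationale.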