Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in terms of both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, provides more efficient problem solving, and can manage more complex data structures. RVP approaches VQA tasks with iterative, recursive code generation, allowing complicated problems to be decomposed into smaller parts. Notably, RVP is capable of dynamic type assignment: as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to produce that output. We demonstrate RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.
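The recursive decomposition and dynamic return types described above can be illustrated with a minimal sketch. This is not the RVP implementation: the names `generate_code`, `recursive_query`, `execute_command`, and the `CANNED_PROGRAMS` table are hypothetical, the code generator is a lookup table standing in for a language model, and the "image" is a toy dictionary rather than real pixel data. The sketch only shows how a generated routine might delegate a sub-question to a freshly generated routine whose return type (integer, list, or string) is decided by that routine itself.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for an LLM code generator (an assumption for this
# sketch): each question maps to Python source that defines
# `execute_command(image, recursive_query)`.
CANNED_PROGRAMS: Dict[str, str] = {
    "Are there more cats than dogs?": (
        "def execute_command(image, recursive_query):\n"
        "    cats = recursive_query(image, 'How many cats are there?')\n"
        "    dogs = recursive_query(image, 'How many dogs are there?')\n"
        "    return 'yes' if cats > dogs else 'no'\n"
    ),
    "How many cats are there?": (
        "def execute_command(image, recursive_query):\n"
        "    return len([o for o in image['objects'] if o == 'cat'])\n"
    ),
    "How many dogs are there?": (
        "def execute_command(image, recursive_query):\n"
        "    return len([o for o in image['objects'] if o == 'dog'])\n"
    ),
    "Which animals are in the image?": (
        "def execute_command(image, recursive_query):\n"
        "    return sorted(set(image['objects']))\n"
    ),
}


def generate_code(question: str) -> str:
    """Return source code for a question (toy lookup instead of a model)."""
    return CANNED_PROGRAMS[question]


def recursive_query(image: Any, question: str,
                    depth: int = 0, max_depth: int = 3) -> Any:
    """Generate and run code for `question`; sub-questions recurse.

    The return type is not fixed in advance: it is whatever the generated
    routine for this question returns.
    """
    if depth > max_depth:
        raise RecursionError("sub-question nesting exceeded max_depth")

    # Execute the generated source in a fresh namespace and fetch the routine.
    namespace: Dict[str, Any] = {}
    exec(generate_code(question), namespace)
    routine: Callable[..., Any] = namespace["execute_command"]

    # Generated code calls this helper to hand off a sub-question.
    def sub_query(img: Any, sub_question: str) -> Any:
        return recursive_query(img, sub_question, depth + 1, max_depth)

    return routine(image, sub_query)


if __name__ == "__main__":
    toy_image = {"objects": ["cat", "dog", "cat"]}
    print(recursive_query(toy_image, "Are there more cats than dogs?"))
    print(recursive_query(toy_image, "Which animals are in the image?"))
```

In this sketch the top-level routine stays short because counting is delegated to sub-routines, and each sub-routine chooses its own return type. The depth limit is an added safeguard for the toy example, not something the abstract specifies.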