Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in few-shot and zero-shot scenarios. However, existing VP methods generate all code in a single function, resulting in code that is suboptimal in both accuracy and interpretability. Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which simplifies generated routines, solves problems more efficiently, and can manage more complex data structures. RVP approaches VQA tasks with an iterative, recursive code generation approach that decomposes complicated problems into smaller parts. Notably, RVP is capable of dynamic type assignment: as the system recursively generates a new piece of code, it autonomously determines the appropriate return type and crafts the requisite code to generate that output. We show RVP's efficacy through extensive experiments on benchmarks including VSR, COVR, GQA, and NextQA, underscoring the value of adopting human-like recursive and modular programming techniques for solving VQA tasks through coding.
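The recursive decomposition and dynamic return typing described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the lookup table `MOCK_GENERATOR` is a hypothetical stand-in for an LLM code generator, the "image" is assumed to be a plain list of labelled objects, and the names `generate_code`, `solve`, `run`, and `recursive_query` are invented for illustration. The point is only that each sub-question gets its own generated routine, and that each routine chooses its own return type (a `bool`, an `int`, or a `list` here).

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for an LLM: maps a question to the source code of a
# function `run(image, recursive_query)`. A real system would generate this.
MOCK_GENERATOR: Dict[str, str] = {
    "Are there more cats than dogs?": (
        "def run(image, recursive_query):\n"
        "    cats = recursive_query(image, 'How many cats are there?')\n"
        "    dogs = recursive_query(image, 'How many dogs are there?')\n"
        "    return cats > dogs  # bool\n"
    ),
    "How many cats are there?": (
        "def run(image, recursive_query):\n"
        "    return len(recursive_query(image, 'Which objects are cats?'))  # int\n"
    ),
    "How many dogs are there?": (
        "def run(image, recursive_query):\n"
        "    return len([o for o in image if o['label'] == 'dog'])  # int\n"
    ),
    "Which objects are cats?": (
        "def run(image, recursive_query):\n"
        "    return [o for o in image if o['label'] == 'cat']  # list\n"
    ),
}


def generate_code(question: str) -> str:
    """Return generated source for a question (assumed: table lookup only)."""
    if question not in MOCK_GENERATOR:
        raise KeyError(f"no generated code for question: {question!r}")
    return MOCK_GENERATOR[question]


def solve(question: str, image: List[dict], depth: int = 0, max_depth: int = 5) -> Any:
    """Generate code for `question`, run it, and recurse on sub-questions."""
    if depth > max_depth:
        raise RecursionError("sub-question nesting exceeded max_depth")

    namespace: Dict[str, Any] = {}
    exec(generate_code(question), namespace)  # defines `run` in namespace
    run: Callable[..., Any] = namespace["run"]

    # Each sub-question triggers a fresh round of code generation, and the
    # generated routine decides for itself what type to return.
    def recursive_query(sub_image: List[dict], sub_question: str) -> Any:
        return solve(sub_question, sub_image, depth + 1, max_depth)

    return run(image, recursive_query)


if __name__ == "__main__":
    toy_image = [{"label": "cat"}, {"label": "cat"}, {"label": "dog"}]
    print(solve("Are there more cats than dogs?", toy_image))
```

In this sketch the top-level routine stays short because counting and filtering are delegated to separately generated routines, which is the modularity the abstract argues for. Executing generated code with `exec` is acceptable only in a toy setting; running real model output would need sandboxing.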