Recent advances in visual reasoning (VR), particularly those built on Large Vision-Language Models (VLMs), show promise but require large-scale datasets and suffer from high computational cost and limited generalization. Compositional visual reasoning approaches have emerged as effective strategies; however, they rely heavily on the commonsense knowledge encoded in Large Language Models (LLMs) to perform planning, reasoning, or both, without considering how those decisions affect the visual reasoning process, which can lead to errors or failed procedures. To address these challenges, we introduce HYDRA, a multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning. HYDRA integrates three essential modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive controller, and a reasoner. The planner uses an LLM to generate instruction samples, and the reasoner uses an LLM to generate executable code from the selected instruction. The RL agent interacts dynamically with both modules, making the high-level decision of which instruction sample to select based on the historical state stored through a feedback loop. This adaptable design lets HYDRA adjust its actions according to feedback received earlier in the reasoning process, yielding more reliable reasoning outputs. Our framework demonstrates state-of-the-art performance on various VR tasks across four widely used datasets.
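The interaction among the three modules can be pictured as a simple control loop. The following Python sketch is illustrative only and is not HYDRA's implementation: the names `State`, `run_loop`, `toy_planner`, `toy_controller`, and `toy_reasoner` are hypothetical, the LLM calls are replaced by hard-coded stubs, and the RL agent's learned policy is replaced by a hand-written rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class State:
    """Historical state carried through the feedback loop (hypothetical structure)."""
    query: str
    history: List[Tuple[str, str]] = field(default_factory=list)  # (instruction, feedback)


Planner = Callable[[State], List[str]]                        # proposes instruction samples
Controller = Callable[[State, List[str]], int]                # picks the index of one sample
Reasoner = Callable[[State, str], Tuple[str, Optional[str]]]  # returns (feedback, answer or None)


def run_loop(query: str, planner: Planner, controller: Controller,
             reasoner: Reasoner, max_steps: int = 5) -> Tuple[Optional[str], State]:
    """Plan, select, and reason repeatedly, feeding each result back into the state."""
    state = State(query)
    for _ in range(max_steps):
        samples = planner(state)                    # planner: candidate instructions
        instruction = samples[controller(state, samples)]  # controller: choose one
        feedback, answer = reasoner(state, instruction)    # reasoner: run the chosen step
        state.history.append((instruction, feedback))      # feedback loop
        if answer is not None:
            return answer, state
    return None, state


def toy_planner(state: State) -> List[str]:
    # Stand-in for an LLM call that proposes candidate instructions.
    return ["count the objects", "detect the objects"]


def toy_controller(state: State, samples: List[str]) -> int:
    # Stand-in for the RL agent's policy: detect first, then count
    # once a detection step is already recorded in the history.
    detected = any(instr == "detect the objects" for instr, _ in state.history)
    wanted = "count the objects" if detected else "detect the objects"
    return samples.index(wanted)


def toy_reasoner(state: State, instruction: str) -> Tuple[str, Optional[str]]:
    # Stand-in for LLM code generation plus execution on the image.
    if instruction == "detect the objects":
        return "found 3 objects", None
    return "counted objects", "3"


answer, final_state = run_loop("How many objects are there?",
                               toy_planner, toy_controller, toy_reasoner)
print(answer)  # prints: 3
```

In this toy run the controller's choice depends on the stored history, so the second step differs from the first even though the planner proposes the same samples each time. That dependence on earlier feedback is the only aspect of the framework the sketch is meant to convey.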