Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

Vision-Language Models (VLMs) are AI systems that process both images and text, yet they often struggle with compositional visual reasoning questions, which require chaining multiple steps such as identifying objects, counting them, and comparing the results. Existing approaches improve this reasoning by training models on human-written step-by-step explanations, but such annotations are expensive to create and difficult to scale. We propose a self-questioning framework that uses a reinforcement learning (RL) algorithm, Group Relative Policy Optimization (GRPO), to train a VLM to break a visual question into smaller sub-questions and answer each one before producing a final response. The model is never shown examples of how to decompose questions; it discovers this behavior on its own, guided by a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply this framework to a 3-billion-parameter model, training on both synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard RL substantially improve accuracy over the untrained model (52.2% and 51.6%, respectively, vs. 46.8%). We introduce the first self-questioning VLM, which is rewarded not only for a correct final answer, as in standard RL, but also for generating intermediate sub-questions, enabling it to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, particularly when the difficulty of a question warrants explicit step-by-step decomposition.
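The reward described above, together with the group-relative normalization that gives GRPO its name, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the output format (sub-questions on lines beginning with `Q1:`, a final line beginning with `Answer:`), the reward weights `w_subq` and `w_ans`, exact string matching for correctness, and the function names are all hypothetical. Only the advantage step is shown; the policy-gradient update itself is omitted.

```python
import re
from statistics import mean, pstdev

# Assumed (hypothetical) output format: sub-questions on lines such as
# "Q1: ...", and the final answer on a line such as "Answer: ...".
SUBQ_PATTERN = re.compile(r"^\s*Q\d*\s*:", re.MULTILINE)
ANSWER_PATTERN = re.compile(r"^\s*Answer\s*:\s*(.+?)\s*$",
                            re.MULTILINE | re.IGNORECASE)


def reward(output: str, gold: str,
           w_subq: float = 0.5, w_ans: float = 1.0) -> float:
    """Score one sampled output: sub-question presence plus answer correctness.

    The weights are illustrative assumptions, not values from the paper.
    """
    has_subq = bool(SUBQ_PATTERN.search(output))
    answers = ANSWER_PATTERN.findall(output)
    # Use the last "Answer:" line; compare case-insensitively to the gold label.
    correct = bool(answers) and answers[-1].strip().lower() == gold.strip().lower()
    return w_subq * has_subq + w_ans * correct


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same question.

    Each advantage is (reward - group mean) / (group std + eps). Population
    standard deviation is used here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical samples for one question whose gold answer is "3".
    group = [
        "Q1: How many cubes are there?\nA1: 3\nAnswer: 3",
        "Answer: 3",
        "Q1: What color is the sphere?\nA1: red\nAnswer: 4",
        "I am not sure.",
    ]
    rewards = [reward(o, "3") for o in group]
    print(rewards)  # [1.5, 1.0, 0.5, 0.0]
    print(group_relative_advantages(rewards))
```

In this sketch, an output that both decomposes the question and answers correctly receives the largest advantage within its group, while a standard RL baseline would correspond to setting `w_subq` to zero.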
