Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored. The performance of existing visual questioners is bottlenecked by the scarcity of high-quality training data or by the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses the VLM itself as both proposer and filter to produce harder, more informative, and more visual-centric questions, while maintaining exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially improves the quality of autonomously generated questions and expands their difficulty boundary. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive, or even better, answerer.
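The propose-then-filter step with a diversity constraint can be illustrated by a minimal sketch. It is not the method's actual implementation: the abstract does not specify the scoring or diversity criteria, so the function names (`jaccard`, `select_questions`, `self_evolve_round`), the token-overlap similarity, and both thresholds are hypothetical placeholders. In the framework described, the VLM itself would play the roles of `propose` and `score`.

```python
from typing import Any, Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (a stand-in diversity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def select_questions(
    candidates: List[str],
    score: Callable[[str], float],
    min_score: float = 0.5,       # hypothetical quality threshold
    max_similarity: float = 0.6,  # hypothetical near-duplicate threshold
) -> List[str]:
    """Keep high-scoring candidates, skipping near-duplicates of kept ones."""
    kept: List[str] = []
    # Visit candidates from highest to lowest filter score.
    for question in sorted(candidates, key=score, reverse=True):
        if score(question) < min_score:
            break  # all remaining candidates score lower still
        if all(jaccard(question, k) <= max_similarity for k in kept):
            kept.append(question)
    return kept


def self_evolve_round(
    image: Any,
    propose: Callable[[Any, int], List[str]],
    score: Callable[[str], float],
    n_candidates: int = 8,
) -> List[str]:
    """One round: the proposer drafts questions, the filter selects them.

    The returned questions would then serve as training data; the training
    step itself is outside the scope of this sketch.
    """
    candidates = propose(image, n_candidates)
    return select_questions(candidates, score)
```

With stub callables in place of the VLM, a round keeps the hard question and one of two near-identical easy ones, and drops a low-scoring trivial one. Filtering greedily in score order is a design choice of this sketch; it favors difficulty first and enforces diversity only among the questions already kept.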