A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition but also explicit set-based reasoning, such as filtering, comparison, and aggregation. Standard end-to-end multimodal large language models (MLLMs) often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis: the model first generates a symbolic program, which is then executed by a separate engine grounded in the visual scene. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box vision-language inference.
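To make the idea of a symbolic program executed by a separate engine concrete, the following is a minimal illustrative sketch for the soda query. It is not the system described here: the `Item` record, the hard-coded `SCENE` (standing in for the output of a perception module), the operation names `filter`, `argmin`, and `count`, and the `execute` function are all hypothetical, chosen only to show how filtering, comparison, and aggregation compose as explicit steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one detected product on the shelf."""
    name: str
    category: str
    sugar_g: float


# Assumption: a perception module has already grounded the scene into records.
SCENE = [
    Item("Cola Classic", "soda", 39.0),
    Item("Lemon Fizz", "soda", 26.0),
    Item("Diet Cola", "soda", 0.0),
    Item("Orange Juice", "juice", 22.0),
]

# Hypothetical program for "Which soda has the least sugar?".
# Each step is (operation name, function argument or None).
PROGRAM = [
    ("filter", lambda item: item.category == "soda"),
    ("argmin", lambda item: item.sugar_g),
]


def execute(program, scene):
    """Run each step in order, passing the intermediate result along."""
    state = list(scene)
    for op, fn in program:
        if op == "filter":
            state = [item for item in state if fn(item)]
        elif op == "argmin":
            state = min(state, key=fn)
        elif op == "count":
            state = len(state)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


if __name__ == "__main__":
    print(execute(PROGRAM, SCENE).name)
```

Because every intermediate result is an explicit set or value, each step can be inspected on its own, which is the kind of transparency that an end-to-end model does not expose.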