Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies

Visual reasoning matters for many computer vision tasks that go beyond surface-level object detection and classification. Despite progress in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys typically cover only one part of the problem, such as visual question answering, scene-graph generation, neuro-symbolic AI, or multimodal chain-of-thought, and rarely analyze reasoning types, methodologies, and evaluation protocols together. This survey addresses that gap. Following a structured literature review, we group visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and examine how each is implemented across methods that range from graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems to reasoning with vision-language models (VLMs) and multimodal large language models (MLLMs), including visual chain-of-thought, visual programming, and tool-augmented and test-time reasoning. We then review evaluation protocols for functional correctness, structural consistency, and causal validity, and we analyze their limits in generalizability, reproducibility, faithfulness, and explanatory power. We also identify open challenges: scaling to complex scenes, integrating symbolic and neural paradigms more deeply, the shortage of comprehensive benchmarks, language-prior shortcuts and hallucination in foundation models, and reasoning under weak supervision. Finally, we set out a research agenda for vision systems and argue that connecting perception and reasoning is necessary for transparent, trustworthy, and cross-domain models, especially in high-stakes settings such as autonomous driving and medical diagnostics.
