Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and their reliance on images alone. We propose a novel neuro-symbolic framework that integrates panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module that detects entities and extracts their attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human-built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
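To make the described architecture concrete, below is a minimal illustrative sketch, not the paper's implementation, of how a structured scene graph built from 3D entity positions can support explicit, interpretable spatial queries. All names (Entity, SceneGraph, near, left_of) and the distance threshold are hypothetical assumptions introduced here for illustration.

```python
# Minimal sketch (hypothetical, not the paper's code): a scene graph over
# detected entities whose spatial relations are computed explicitly from
# 3D geometry rather than inferred implicitly by a VLM.
from dataclasses import dataclass, field


@dataclass
class Entity:
    """An entity as a perception module might produce it: a label,
    extracted attributes, and a 3D centroid from the point cloud."""
    id: int
    label: str
    position: tuple[float, float, float]  # (x, y, z) in metres
    attributes: dict[str, str] = field(default_factory=dict)


class SceneGraph:
    """Nodes are entities; edges are symbolic spatial relations."""

    def __init__(self, entities: list[Entity]):
        self.entities = entities
        self.edges: list[tuple[int, str, int]] = []  # (src_id, relation, dst_id)

    def build(self, near_threshold: float = 1.5) -> None:
        """Derive pairwise relations from 3D entity positions."""
        for a in self.entities:
            for b in self.entities:
                if a.id == b.id:
                    continue
                dx = a.position[0] - b.position[0]
                dy = a.position[1] - b.position[1]
                dz = a.position[2] - b.position[2]
                if (dx * dx + dy * dy + dz * dz) ** 0.5 < near_threshold:
                    self.edges.append((a.id, "near", b.id))
                if a.position[0] < b.position[0]:
                    self.edges.append((a.id, "left_of", b.id))

    def query(self, label: str, relation: str, other_label: str) -> list[Entity]:
        """Answer queries such as 'which persons are near a table?'."""
        by_id = {e.id: e for e in self.entities}
        return [
            by_id[src]
            for src, rel, dst in self.edges
            if rel == relation
            and by_id[src].label == label
            and by_id[dst].label == other_label
        ]


if __name__ == "__main__":
    scene = SceneGraph([
        Entity(0, "person", (0.2, 1.0, 0.0), {"pose": "standing"}),
        Entity(1, "table", (1.0, 1.2, 0.0)),
        Entity(2, "person", (5.0, 0.0, 0.0), {"pose": "walking"}),
    ])
    scene.build()
    for person in scene.query("person", "near", "table"):
        print(person.id, person.attributes)  # -> 0 {'pose': 'standing'}
```

The point of the sketch is the design choice the abstract describes: relations are computed explicitly from geometry and stored symbolically, so a query is answered by graph lookup rather than by implicit, correlation-driven inference.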