CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments

From arXiv. 17 pages, 10 images. Accepted at LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

The integration of learning and reasoning is high on the research agenda in AI. Nevertheless, little attention has been paid to using existing background knowledge to reason about partially observed scenes and answer questions about them. Yet humans frequently use such knowledge to infer plausible answers to visual questions by eliminating all inconsistent ones. Such knowledge often comes in the form of constraints about objects, and it tends to be highly domain- or environment-specific. We contribute a novel benchmark called CLEVR-POC for reasoning-intensive visual question answering (VQA) in partially observable environments under constraints. In CLEVR-POC, knowledge in the form of logical constraints must be leveraged to generate plausible answers to questions about a hidden object in a given partial scene. For instance, if one knows that all cups are colored either red, green or blue and that there is only one green cup, one can deduce that an occluded cup is either red or blue, provided that all other cups, including the green one, are observed. In our experiments, pre-trained vision-language models like CLIP (~22%) and a large language model (LLM) like GPT-4 (~46%) perform poorly on CLEVR-POC, which establishes the need for frameworks that can handle reasoning-intensive tasks where environment-specific background knowledge is available and crucial. Furthermore, we demonstrate that a neuro-symbolic model, which integrates an LLM like GPT-4 with a visual perception network and a formal logical reasoner, performs exceptionally well on CLEVR-POC.
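The cup example in the abstract can be sketched as a small elimination procedure. This is only an illustration and not the CLEVR-POC pipeline: the function names and the list-of-colors scene encoding are hypothetical, "only one green cup" is read here as "exactly one green cup", and brute-force enumeration stands in for the formal logical reasoner mentioned in the abstract.

```python
# Illustrative sketch only: names and scene encoding are assumptions,
# not taken from the CLEVR-POC paper or its code.
ALLOWED_COLORS = ("red", "green", "blue")


def satisfies_constraints(cup_colors):
    """Check the two constraints from the abstract's example.

    1. Every cup is red, green or blue.
    2. There is exactly one green cup (our reading of "only one").
    """
    all_allowed = all(color in ALLOWED_COLORS for color in cup_colors)
    return all_allowed and cup_colors.count("green") == 1


def plausible_hidden_colors(observed_cup_colors):
    """Return every color the hidden cup could have without violating
    the constraints, given the colors of all observed cups."""
    return sorted(
        candidate
        for candidate in ALLOWED_COLORS
        if satisfies_constraints(list(observed_cup_colors) + [candidate])
    )


if __name__ == "__main__":
    # The green cup is visible, so the hidden cup cannot be green.
    print(plausible_hidden_colors(["green", "red", "blue"]))
```

When the green cup is among the observed ones, the sketch returns `['blue', 'red']`, matching the deduction in the text. If no green cup were observed, the same constraints would leave green as the only consistent answer under the "exactly one" reading.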
