Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in visual perception and understanding. However, these models also suffer from hallucinations, which limit their reliability as AI systems. We believe that these hallucinations stem partly from the models' difficulty in recognizing what they can and cannot perceive from images, a capability we refer to as self-awareness in perception. Despite its importance, this aspect of MLLMs has been overlooked in prior studies. In this paper, we aim to define and evaluate the self-awareness of MLLMs in perception. To do this, we first introduce the knowledge quadrant in perception, which helps define what MLLMs know and do not know about images. Using this framework, we propose a novel benchmark, Self-Awareness in Perception for MLLMs (MM-SAP), specifically designed to assess this capability. We apply MM-SAP to a variety of popular MLLMs and provide a comprehensive analysis of their self-awareness. The experimental results reveal that current MLLMs possess limited self-awareness capabilities, pointing to a crucial area for future work on trustworthy MLLMs. Code and data are available at https://github.com/YHWmz/MM-SAP.