Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand in recent years. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it "sees" and what it "understands." Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 68.6% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. This method first ensures task-specific consistency and then connects the cognitive and perceptual knowledge. Our method significantly reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks in most scenarios.