Advances in Large Language Models (LLMs) have inspired a surge of research exploring their expansion into the visual domain. While recent models show promise in generating abstract image captions and conducting natural conversations, their performance on text-rich images leaves room for improvement. In this paper, we propose the Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details typically overlooked by existing methods. Cream integrates vision and auxiliary encoders, complemented by a contrastive feature alignment technique, resulting in a more effective understanding of textual information within document images. Our approach thus seeks to bridge the gap between vision and language understanding, paving the way for more sophisticated Document Intelligence Assistants. Rigorous evaluations across diverse tasks, such as visual question answering on document images, demonstrate the efficacy of Cream as a state-of-the-art model in the field of visual document understanding. We provide our codebase and newly generated datasets at https://github.com/naver-ai/cream.
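The abstract names a contrastive feature alignment technique between the vision and auxiliary encoders but does not spell out the objective. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss, a common choice for aligning two sets of paired features. It is an assumption for exposition, not the Cream implementation: the function names, the temperature value, and the toy feature vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i]
    against every other entry of `positives` (generic InfoNCE)."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(a)


def symmetric_alignment_loss(vision_feats, aux_feats, temperature=0.07):
    """Average the loss in both directions (vision->aux and aux->vision)."""
    return 0.5 * (
        info_nce(vision_feats, aux_feats, temperature)
        + info_nce(aux_feats, vision_feats, temperature)
    )


# Hypothetical toy features: row i of each list describes the same region.
vision = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # pairs match row by row
swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = symmetric_alignment_loss(vision, aligned)
loss_swapped = symmetric_alignment_loss(vision, swapped)
print(loss_aligned < loss_swapped)
```

In this toy setup the correctly paired features yield a lower loss than the mismatched ones, which is the behavior a contrastive alignment objective is meant to encourage. The actual objective and feature granularity used by Cream should be taken from the paper and the linked repository.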