MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Source: arXiv. Project website: https://minigpt-4.github.io/. Code, pretrained model, and dataset: https://github.com/Vision-CAIR/MiniGPT-4. Deyao Zhu and Jun Chen contributed equally to this work.

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities lies in its use of a more advanced large language model (LLM). To examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, such as detailed image description generation and website creation from handwritten drafts. We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, and teaching users how to cook based on food photos. In our experiments, we found that pretraining on raw image-text pairs alone could produce unnatural language outputs that lack coherence, with repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in the second stage and use it to finetune our model with a conversational template. This step proved crucial for improving the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we train only a projection layer, using approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
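
The alignment idea in the abstract (a frozen visual encoder, a frozen LLM, and a single trainable projection layer between them) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the helper names (`make_linear`, `project`, `count_params`), the toy dimensions, and the stand-in feature values are all assumptions chosen for illustration, and the real model's sizes and training loop are not given in the abstract.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Create a toy linear layer; in this sketch it is the only trainable part."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return {"weight": weight, "bias": bias}

def project(layer, features):
    """Map each visual feature vector into the LLM embedding space."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b
         for row, b in zip(layer["weight"], layer["bias"])]
        for vec in features
    ]

def count_params(layer):
    """Number of scalar parameters in the projection layer."""
    return sum(len(row) for row in layer["weight"]) + len(layer["bias"])

# Toy sizes, hypothetical and far smaller than any real model.
VISION_DIM, LLM_DIM, NUM_VISUAL_TOKENS = 4, 6, 3

projection = make_linear(VISION_DIM, LLM_DIM)

# Stand-in for the output of the frozen visual encoder (never updated).
visual_features = [[0.1 * (i + j) for j in range(VISION_DIM)]
                   for i in range(NUM_VISUAL_TOKENS)]

# Projected visual tokens now have the same width as LLM embeddings.
visual_tokens = project(projection, visual_features)

# Stand-in for the frozen LLM's text token embeddings (never updated).
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

# The LLM would consume the projected visual tokens alongside text tokens.
llm_input = visual_tokens + text_embeddings

# Only the projection layer's parameters would receive gradient updates.
trainable = count_params(projection)
```

Because only `projection` holds trainable parameters in this sketch, the number of values updated during training is `VISION_DIM * LLM_DIM + LLM_DIM`, regardless of how large the frozen encoder and LLM are. This is the intuition behind the efficiency claim in the abstract.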

