The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous vision-language models. We believe the primary reason for GPT-4's advanced multi-modal generation capabilities is its use of a more advanced large language model (LLM). To examine this hypothesis, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer. Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, such as generating detailed image descriptions and creating websites from handwritten drafts. We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images, providing solutions to problems shown in images, and teaching users how to cook based on food photos. In our experiments, we found that pretraining on raw image-text pairs alone could produce unnatural language outputs that lack coherence, including repetition and fragmented sentences. To address this problem, we curate a high-quality, well-aligned dataset in a second stage and use it to finetune the model with a conversational template. This step proved crucial for improving the model's generation reliability and overall usability. Notably, our model is highly computationally efficient, as we train only a projection layer, using approximately 5 million aligned image-text pairs. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
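The alignment described in the abstract, a frozen visual encoder connected to a frozen LLM through one trainable projection layer, can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the dimensions `VIS_DIM` and `LLM_DIM`, the token counts, the stand-in `frozen_visual_encoder`, and the helper `build_llm_input` are hypothetical and are not taken from the MiniGPT-4 code. It uses plain Python lists rather than a deep learning framework, and shows only how projected visual tokens could be spliced into the LLM's input sequence and how few parameters the projection holds.

```python
import random

VIS_DIM = 8    # hypothetical width of the frozen visual encoder's output tokens
LLM_DIM = 16   # hypothetical width of the frozen LLM's token embeddings


class Projection:
    """The only trainable module in this sketch: a linear map y = W x + b."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Weight matrix of shape (out_dim, in_dim), small random initial values.
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, token):
        # Map one visual token (length in_dim) to the LLM width (length out_dim).
        return [sum(w * x for w, x in zip(row, token)) + b
                for row, b in zip(self.weight, self.bias)]

    def num_params(self):
        return sum(len(row) for row in self.weight) + len(self.bias)


def frozen_visual_encoder(image_id, num_tokens=4):
    """Stand-in for a frozen encoder: deterministic pseudo-features, never trained."""
    rng = random.Random(image_id)  # string seed gives the same features every run
    return [[rng.uniform(-1.0, 1.0) for _ in range(VIS_DIM)]
            for _ in range(num_tokens)]


def build_llm_input(text_before, image_tokens, text_after, proj):
    """Project visual tokens to the LLM width and splice them between text embeddings."""
    projected = [proj(tok) for tok in image_tokens]
    return text_before + projected + text_after


proj = Projection(VIS_DIM, LLM_DIM)
image_tokens = frozen_visual_encoder("example.jpg")

# Placeholder text embeddings, standing in for lookups in the frozen LLM's embedding table.
before = [[0.0] * LLM_DIM for _ in range(2)]
after = [[0.0] * LLM_DIM for _ in range(3)]

sequence = build_llm_input(before, image_tokens, after, proj)
print(len(sequence), len(sequence[0]), proj.num_params())  # prints: 9 16 144
```

In this toy setting the projection holds 16 × 8 weights plus 16 biases, 144 parameters in total, and it is the only component that would receive gradient updates. That is the sense in which training a single projection layer keeps the approach computationally cheap, although the real model's dimensions and parameter counts are far larger.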