The recently released GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These abilities are rarely observed in previous vision-language models. However, the technical details behind GPT-4 remain undisclosed. We believe that the enhanced multi-modal generation capabilities of GPT-4 stem from the use of sophisticated large language models (LLMs). To examine this hypothesis, we present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced LLM, Vicuna, using a single projection layer. Our work shows, for the first time, that properly aligning visual features with an advanced LLM can yield many of the advanced multi-modal abilities demonstrated by GPT-4, such as detailed image description generation and website creation from hand-drawn drafts. We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images and teaching users how to cook based on food photos. In our experiments, we found that a model trained only on short image-caption pairs could produce unnatural language outputs (e.g., repetition and fragmentation). To address this problem, we curate a detailed image description dataset and use it to finetune the model in a second stage, which improves the model's generation reliability and overall usability. Our code, pre-trained model, and collected dataset are available at https://minigpt-4.github.io/.
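To make the alignment idea concrete, the following is a minimal sketch, not the released implementation, of a single linear projection that maps visual encoder outputs into an LLM's embedding space. The function names, the toy dimensions, and the choice to place the projected visual tokens before the text embeddings are all assumptions made for illustration. The frozen encoder and the frozen LLM are represented only by stand-in lists of numbers, so the weight and bias are the only parameters that would be trained.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Create the only trainable parameters: a weight matrix and a bias vector."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(visual_tokens, weight, bias):
    """Map each visual token (length in_dim) to the LLM embedding size (out_dim)."""
    return [
        [sum(w * x for w, x in zip(row, token)) + b
         for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]


def build_llm_input(visual_tokens, text_embeddings, weight, bias):
    """Join projected visual tokens and text embeddings into one sequence.

    Placing the visual tokens first is an assumption of this sketch.
    """
    return project(visual_tokens, weight, bias) + text_embeddings


# Toy dimensions (hypothetical; real models are far larger).
VISION_DIM, LLM_DIM = 4, 6
weight, bias = make_projection(VISION_DIM, LLM_DIM)

# Stand-ins for the frozen encoder output (3 visual tokens) and the
# frozen LLM's embeddings of a 2-token text prompt.
visual_tokens = [[0.1 * (i + j) for j in range(VISION_DIM)] for i in range(3)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

sequence = build_llm_input(visual_tokens, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 6
```

The design point this illustrates is that both large components stay fixed. Only the small projection has to learn how to express visual features in a form the language model can consume.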