Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based instructions. However, visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Moreover, we prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multi-modal instruction-following data, our model, LLaVAR, substantially improves the LLaVA model's capability on text-based VQA datasets (up to 20% accuracy improvement) while achieving an accuracy of 91.42% on ScienceQA. The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction (e.g., reasoning, writing, and elaboration) skills with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available at https://llavar.github.io/.
翻译:指令微调释放了大语言模型(LLM)与人类交互的卓越能力。此外,近期遵循指令的数据集将图像作为视觉输入,收集基于图像的指令响应。然而,经过视觉指令微调的模型难以较好地理解图像中的文本细节。本研究通过引入富含文本的图像(如电影海报、书籍封面等)来增强现有的视觉指令微调流程。具体而言,我们首先利用公开可用的OCR工具对来自LAION数据集的422K张富含文本图像进行文本提取。随后,我们利用纯文本GPT-4基于识别文本和图像描述生成16K段对话,每段对话包含针对富含文本图像的问答对。通过将我们收集的数据与先前的多模态指令遵循数据结合,我们的模型LLaVAR在文本型VQA数据集上显著提升了LLaVA模型的能力(准确率提升高达20%),同时在ScienceQA上达到91.42%的准确率。基于GPT-4的指令遵循评估也表明我们的模型在自然图像和富含文本图像上均有改进。定性分析显示,LLaVAR基于结合文本与图像的最新真实世界在线内容,展现出与人类交互(如推理、写作和阐述)的出色能力。我们在https://llavar.github.io/公开了代码、数据及模型。