Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response-formatting prompts, we establish stronger baselines that achieve state-of-the-art results across 11 benchmarks. Our final 13B checkpoint uses only 1.2M publicly available data samples and finishes full training in about 1 day on a single 8-A100 node. We hope this makes state-of-the-art LMM research more accessible. Code and models will be made publicly available.
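To make the idea of an MLP projection concrete, the following is a minimal sketch of a fully-connected connector that maps per-patch vision features into the embedding space of a language model. It is an illustration only, not the released implementation: the two-layer depth, the GELU activation, the toy dimensions, and all function names (`gelu`, `linear`, `init_linear`, `mlp_project`) are assumptions made for this example, and the random inputs merely stand in for encoder outputs.

```python
import math
import random


def gelu(x: float) -> float:
    """GELU activation computed with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight has shape out_dim x in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Randomly initialise one linear layer (hypothetical init scheme for the demo)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def mlp_project(patch_features, params):
    """Project each patch feature independently: linear -> GELU -> linear."""
    (w1, b1), (w2, b2) = params
    tokens = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, w1, b1)]
        tokens.append(linear(hidden, w2, b2))
    return tokens


# Toy sizes chosen for readability; real encoders and language models are far wider.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4

rng = random.Random(0)
params = (init_linear(VISION_DIM, LLM_DIM, rng), init_linear(LLM_DIM, LLM_DIM, rng))

# Random stand-ins for the patch features a vision encoder would produce.
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]

visual_tokens = mlp_project(patches, params)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 16
```

Each patch is projected independently, so the number of visual tokens equals the number of patches and only the feature width changes. The connector has no cross-patch interaction and few parameters, which is one plausible reading of why such a design can be trained with comparatively little data.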