Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently pretrained vision encoders through model grafting. These multimodal variants undergo instruction tuning, similar to LLMs, enabling effective zero-shot generalization for multimodal tasks. This study conducts a comparative analysis of different multimodal instruction tuning approaches and evaluates their performance across a range of tasks, including complex reasoning, conversation, image captioning, multiple-choice questions (MCQs), and binary classification. Through rigorous benchmarking and ablation experiments, we reveal key insights for guiding architectural choices when incorporating multimodal capabilities into LLMs. However, current approaches have limitations; they do not sufficiently address the need for a diverse multimodal instruction dataset, which is crucial for enhancing task generalization. Additionally, they overlook issues related to truthfulness and factuality when generating responses. These findings illuminate current methodological constraints in adapting language models for image comprehension and provide valuable guidance for researchers and practitioners seeking to harness multimodal versions of LLMs.