Instruction tuning is a crucial supervised training phase for Large Language Models (LLMs), aiming to improve their ability to generalize across instructions and adapt to user preferences. With the increasing integration of multi-modal data into LLMs, there is growing interest in Vision-Language Instruction Tuning (VLIT), which is more complex than pure text instruction tuning. In this paper, we systematically review the latest VLIT settings and corresponding datasets in multi-modal LLMs and provide insights into the intrinsic motivations behind their design. For the first time, we offer a detailed multi-perspective categorization of existing VLIT datasets and identify the characteristics that high-quality VLIT data should possess. By incorporating these characteristics as guiding principles into the existing VLIT data construction process, we conduct extensive experiments and verify their positive impact on the performance of tuned multi-modal LLMs. Furthermore, we discuss the current challenges and future research directions of VLIT, providing insights for the continued development of this field. The code and dataset related to this paper are open-sourced at https://github.com/palchenli/VL-Instruction-Tuning.