Large vision-language models (LVLMs) such as GPT-4V and LLaVA have advanced substantially in recent years. LLaVA's modular architecture, in particular, combines simplicity with efficiency. Recent work has focused mainly on adding more pre-training and instruction tuning data to improve model performance. This paper examines two often-neglected aspects: data efficiency during pre-training and the selection of instruction tuning datasets. Our results indicate that merely increasing the size of the pre-training data does not guarantee improved performance and may in fact degrade it. Furthermore, we establish a pipeline for identifying the most efficient instruction tuning (SFT) dataset, which implies that not all SFT data used in existing studies are necessary. The primary objective of this paper is not to introduce a state-of-the-art model, but to serve as a roadmap for future research on optimizing data usage during pre-training and fine-tuning to enhance the performance of vision-language models.