In large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce ShareGPT4V, a large-scale dataset of 1.2 million highly descriptive captions that surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. ShareGPT4V originates from a curated set of 100K high-quality captions collected from GPT4-Vision and has been expanded to 1.2M captions with a caption model trained on this subset. We first demonstrate its effectiveness in the Supervised Fine-Tuning (SFT) phase: replacing an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions significantly improves LMMs such as LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B, with respective gains of 222.8/22.0/22.3 on the MME benchmark and 2.7/1.3/1.5 on the MMBench benchmark. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, an LMM built on a simple architecture that performs strongly on a majority of multi-modal benchmarks. The project is available at https://ShareGPT4V.github.io as a resource for the LMM community.