The multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in multimodal Large Language Models (LLMs). A primary research objective for such models is to align the visual and textual modalities effectively while comprehending human instructions. Current methods often rely on annotations from benchmark datasets to construct image-dialogue datasets for training, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, which can constrain the models' generative capabilities. To mitigate these limitations, we propose a novel data collection methodology that synthesizes images and dialogues synchronously for visual instruction tuning. The approach combines ChatGPT with text-to-image generative models to yield a diverse and controllable dataset with varied image content, and the dataset can be scaled arbitrarily. This provides greater flexibility than existing methodologies and also significantly enhances several model capabilities. We conduct comprehensive experiments on various datasets. The results show substantial improvements in more than ten commonly assessed capabilities, and our model achieves state-of-the-art results across multiple widely recognized multimodal benchmarks.