Large language models (LLMs) have developed rapidly in recent years. Building on powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from text to a broader spectrum of domains and have attracted widespread attention because of their wider range of application scenarios. Because LLMs and MLLMs rely on vast numbers of model parameters and vast amounts of data to achieve emergent capabilities, the importance of data is receiving increasing attention and recognition. Tracing and analyzing recent data-oriented works for MLLMs, we find that the development of models and the development of data are not two separate paths but are interconnected. On the one hand, larger and higher-quality datasets contribute to better MLLM performance; on the other hand, MLLMs can facilitate the development of data. The co-development of multi-modal data and MLLMs requires a clear view of 1) at which development stage of MLLMs specific data-centric approaches can be employed, and which capabilities they enhance, and 2) which capabilities models can draw on, and which roles they can play, to contribute to multi-modal data. To promote data-model co-development for the MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective. A regularly maintained project associated with this survey is accessible at https://github.com/modelscope/data-juicer/blob/main/docs/awesome_llm_data.md.