The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

From arXiv. Ongoing work, 31 pages. Related materials are continually maintained and available at https://github.com/modelscope/data-juicer/blob/main/docs/awesome_llm_data.md

Large language models (LLMs) have developed rapidly in recent years. Building on powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from text to a broader spectrum of domains and have attracted widespread attention for their wider range of application scenarios. Because LLMs and MLLMs rely on vast numbers of model parameters and vast amounts of data to achieve emergent capabilities, the importance of data is receiving increasingly widespread attention and recognition. Tracing and analyzing recent data-oriented works for MLLMs, we find that the development of models and the development of data are not two separate paths but are interconnected. On the one hand, larger and higher-quality data contributes to better MLLM performance; on the other hand, MLLMs can facilitate the development of data. The co-development of multi-modal data and MLLMs requires a clear view of 1) at which development stage of MLLMs specific data-centric approaches can be employed to enhance which capabilities, and 2) which capabilities models can utilize, and which roles they can act in, to contribute to multi-modal data. To promote data-model co-development for the MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective. A regularly maintained project associated with this survey is accessible at https://github.com/modelscope/data-juicer/blob/main/docs/awesome_llm_data.md.

