The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.