The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used by leading models are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of both Chinese and English data collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used to train InternLM, a model that demonstrated significant advantages over models of a similar scale in multi-dimensional evaluations. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.