The popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used by leading models are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents "Wan Juan", a large-scale multimodal dataset composed of Chinese and English data collected from a wide range of web sources. The dataset covers text, image-text, and video modalities, with a total volume exceeding 2 TB. It was used to train InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations compared with models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.