Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pre-training. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled the widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, Chinese vision-language pre-training has lagged substantially behind due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, a dataset of 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding higher data quality. Moreover, DanQing is built primarily from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continually pre-training the SigLIP2 model on each. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC BY 4.0 license.
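As a rough illustration of the zero-shot classification protocol referenced above, the sketch below scores a SigLIP2-style checkpoint against Chinese class prompts via the Hugging Face transformers interface; the checkpoint name, prompt template, and class labels are illustrative assumptions rather than the paper's actual evaluation setup.

    # Minimal zero-shot classification sketch with a SigLIP2-style model.
    # Assumptions: the checkpoint name and the Chinese prompt template below
    # are illustrative, not the paper's evaluation configuration.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    ckpt = "google/siglip2-base-patch16-224"  # assumed public checkpoint
    model = AutoModel.from_pretrained(ckpt)
    processor = AutoProcessor.from_pretrained(ckpt)

    image = Image.open("example.jpg")                    # any RGB image
    labels = ["猫", "狗", "汽车"]                         # hypothetical Chinese classes
    texts = [f"一张{label}的照片" for label in labels]    # "a photo of a {label}"

    inputs = processor(text=texts, images=image,
                       padding="max_length", max_length=64, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # SigLIP models are trained with a pairwise sigmoid loss, so scores are
    # independent per label; argmax still yields the predicted class.
    scores = torch.sigmoid(outputs.logits_per_image)[0]
    print(labels[scores.argmax().item()], scores.tolist())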