Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pre-training. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled the widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, Chinese vision-language pre-training has lagged substantially behind due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, a dataset of 100 million image-text pairs collected from Common Crawl. Unlike existing datasets, DanQing is curated through a more rigorous selection process, yielding higher data quality. Moreover, DanQing is built primarily from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continually pre-training the SigLIP2 model on each. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC BY 4.0 license.
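As a rough illustration of the zero-shot classification protocol referenced above, the sketch below scores a SigLIP2-style checkpoint against Chinese class prompts via the Hugging Face transformers interface; the checkpoint name, prompt template, and class labels are illustrative assumptions rather than the paper's actual evaluation setup.

    # Minimal zero-shot classification sketch with a SigLIP2-style model.
    # Assumptions: the checkpoint name and the Chinese prompt template below
    # are illustrative, not the paper's evaluation configuration.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    ckpt = "google/siglip2-base-patch16-224"  # assumed public checkpoint
    model = AutoModel.from_pretrained(ckpt)
    processor = AutoProcessor.from_pretrained(ckpt)

    image = Image.open("example.jpg")                    # any RGB image
    labels = ["猫", "狗", "汽车"]                         # hypothetical Chinese classes
    texts = [f"一张{label}的照片" for label in labels]    # "a photo of a {label}"

    inputs = processor(text=texts, images=image,
                       padding="max_length", max_length=64, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # SigLIP models are trained with a pairwise sigmoid loss, so scores are
    # independent per label; argmax still yields the predicted class.
    scores = torch.sigmoid(outputs.logits_per_image)[0]
    print(labels[scores.argmax().item()], scores.tolist())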