Vision-Language Pre-training (VLP) models have achieved remarkable success by leveraging large-scale image-text pairs. While English-centric models like CLIP and SigLIP benefit from massive datasets (e.g., LAION-400M), the development of Chinese VLP remains bottlenecked by the lack of high-quality, large-scale open-source data. In this paper, we present DanQing, a large-scale Chinese cross-modal dataset containing 100 million high-quality image-text pairs curated from Common Crawl. To ensure data quality, we develop a systematic curation pipeline comprising data source selection, text refinement, visual diversification, and cross-modal cross-batch filtering, thereby effectively mitigating the intrinsic noise prevalent in web data. Notably, DanQing incorporates data from 2024-2025, enabling models to capture contemporary semantic trends and emerging concepts. Extensive experiments via continued pretraining of SigLIP2 models demonstrate that DanQing consistently outperforms existing Chinese datasets across diverse downstream tasks, including zero-shot classification, cross-modal retrieval, and Chinese-centric large multimodal model tasks. Furthermore, in-depth analysis reveals that DanQing exhibits a more balanced semantic distribution and superior scaling capability compared to existing datasets. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
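The cross-modal filtering stage mentioned above can be illustrated with a minimal sketch: pairs whose image and text embeddings disagree (low cosine similarity under a CLIP-style encoder) are discarded as likely noise. This is an assumption about how such filtering typically works, not the authors' actual implementation; the function names and the threshold value are hypothetical, and the cross-batch aspect (comparing against negatives from other batches) is omitted for brevity.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def crossmodal_filter(pairs, threshold=0.25):
    """Keep indices of image-text pairs whose embedding similarity
    meets the threshold (threshold is illustrative, not from the paper).

    pairs: list of (image_embedding, text_embedding) tuples.
    """
    return [i for i, (img, txt) in enumerate(pairs)
            if cosine(img, txt) >= threshold]

# Toy example: the first pair's embeddings agree, the second's do not.
pairs = [([1.0, 0.0], [1.0, 0.0]),   # aligned pair, similarity 1.0
         ([0.0, 1.0], [1.0, 0.0])]   # orthogonal pair, similarity 0.0
kept = crossmodal_filter(pairs)
```

In practice the embeddings would come from a pretrained vision-language encoder, and the threshold would be tuned against a held-out labeled sample of the crawl.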