We present Public Domain 12M (PD12M), a dataset of 12.4 million high-quality public domain and CC0-licensed images with synthetic captions, designed for training text-to-image models. PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.