The QUILT-1M dataset is the first openly available dataset containing images harvested from various online sources. While it provides great data variety, the image quality and composition are highly heterogeneous, which impacts its utility for text-conditional image synthesis. We propose an automatic pipeline that predicts the most common impurities within the images, e.g., visible narrators, desktop environments and pathology software, or text overlaid on the image. Additionally, we propose semantic alignment filtering of the image-text pairs. Our findings demonstrate that rigorously filtering the dataset substantially enhances image fidelity in text-to-image tasks.
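The semantic alignment filtering mentioned above can be pictured as thresholding an image-text similarity score. Below is a minimal sketch under the assumption of precomputed CLIP-style embeddings; the function names, the toy embeddings, and the threshold value are illustrative choices, not the paper's exact method:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Row-wise cosine similarity between paired embedding matrices.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def filter_pairs(image_emb: np.ndarray, text_emb: np.ndarray,
                 threshold: float = 0.5) -> np.ndarray:
    # Keep indices of image-text pairs whose alignment meets the
    # (hypothetical) threshold; poorly aligned pairs are dropped.
    sims = cosine_similarity(image_emb, text_emb)
    return np.where(sims >= threshold)[0]

# Toy example: two well-aligned pairs and one misaligned pair.
img = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
txt = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 1.0]])
keep = filter_pairs(img, txt, threshold=0.5)  # indices [0, 1] survive
```

In practice the embeddings would come from a vision-language model and the threshold would be tuned on the dataset; the sketch only shows the filtering logic itself.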