Vision and vision-language applications of neural networks, such as image classification and captioning, rely on large-scale annotated datasets that require non-trivial data collection processes. This time-consuming endeavor hinders the emergence of large-scale datasets, limiting researchers and practitioners to a small number of choices. Therefore, we seek more efficient ways to collect and annotate images. Previous initiatives have gathered captions from HTML alt-texts and crawled social media postings, but these data sources suffer from noise, sparsity, or subjectivity. For this reason, we turn to commercial shopping websites, whose data meet three criteria: cleanliness, informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset, a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites. Compared with existing general-domain datasets, LGS images focus on the foreground object and have less complex backgrounds. Our experiments on LGS show that classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data, while specific self-supervised visual feature extractors generalize better. Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature make it advantageous for vision-language bimodal tasks: LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.