Most recent self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy, web-sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretic view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator (MUG), that learns from scalable web-sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties. Pre-trained models and code will be made public upon acceptance.